1 Introduction

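This practical focuses on reading external data and using aesthetics in ggplot2. You will learn how to:

  • read CSV files with readr and Excel files with readxl
  • compute variables directly within aes()
  • use discrete, continuous, and binned colour scales
  • map variables to shape, size, and line type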

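The following datasets are used in this practical:

  • moneydemand.csv: US money demand data, 1879 to 1974 (Canvas)
  • census-2021-ms-a14.xlsx: Northern Ireland Census 2021 population density (NISRA, also on Canvas)
  • mtcars: built into R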

2 Part A: Reading external data

Before we can visualise data, we often need to read it from external files. Two key packages help with this:

  • readr: reads CSV and other delimited text files
  • readxl: reads Excel files (.xls and .xlsx)

As usual, load these two packages together with the usual suspects, ggplot2 and dplyr:

library(readr)
library(readxl)
library(ggplot2)
library(dplyr)

If R reports that one of these packages does not exist, install it once using install.packages("readr") (and likewise for the others), then run the library() calls again.

2.1 Reading CSV files with readr

Download moneydemand.csv from Canvas to your working directory, and run the following, where read_csv() reads comma-separated value files:

moneydemand <- read_csv("moneydemand.csv")
str(moneydemand) # to have a look at the data

This dataset contains US money demand data from 1879 to 1974, with variables:

  • year: Year of observation
  • logM: Log of real money stock
  • logYp: Log of permanent income
  • Rs: Short-term interest rate
  • Rl: Long-term interest rate
  • Rm: Interest rate on money
  • logSpp: Log of stock prices

2.1.1 Common read_csv() arguments

Argument    Purpose
file        Path to the file
col_names   Use first row as names (TRUE) or provide names
col_types   Specify column types explicitly
skip        Number of lines to skip before reading
na          Character vector of strings to treat as NA
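
The arguments above can be tried without any file on disk. The sketch below is only an illustration: the CSV text and the object names csv_text and demo are made up, and it assumes readr 2.0 or later, where literal text wrapped in I() can be passed in place of a file path.

```r
library(readr)

# Hypothetical CSV content: one junk line, a header row, then two data rows
csv_text <- I("exported by some tool\nyear,rate\n1879,5.067\n1880,missing\n")

demo <- read_csv(
  csv_text,
  skip = 1,                                   # skip the junk first line
  col_types = cols(year = col_integer(),      # set column types explicitly
                   rate = col_double()),
  na = c("", "NA", "missing")                 # also treat "missing" as NA
)
demo
```

Setting col_types explicitly also silences the column specification message that read_csv() otherwise prints.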

2.2 Reading Excel files with readxl

The read_excel() function reads both .xls and .xlsx files. Unlike CSV files, Excel spreadsheets can contain multiple sheets and often have header rows that need to be skipped.

As an example, we will read population data from the Northern Ireland Census 2021, available from NISRA (https://www.nisra.gov.uk/publications/census-2021-person-and-household-estimates-data-zones-northern-ireland). We will use this dataset later in this module for geospatial visualisation.

Download the file census-2021-ms-a14.xlsx, displayed as “MS-A14 Population density” from the URL above (also available on Canvas), to your working directory, and run the following:

census_ni <- read_excel("census-2021-ms-a14.xlsx",
                        sheet = "SDZ",
                        skip = 5)
str(census_ni) # notice the clunky column names

The column names from this file are quite clunky (e.g., All usual residents, Population density (number of usual residents per hectare)). We can use dplyr::rename() to give them cleaner names:

census_ni <- census_ni |>
  rename(
    name = `Geography`,
    code = `Geography code`,
    residents = `All usual residents`,
    area = `Area (hectares)\r\n[note 1, 2]`,
    density = `Population density (number of usual residents per hectare)`,
    profile = `Access census area explorer`
  )
str(census_ni) # much cleaner
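
Column names in Excel headers can contain hidden line breaks, which is why one of the names above includes \r\n. The sketch below uses a made-up two-column tibble (toy, with invented values) to show how names() reveals such characters, and that rename() also accepts a column position, which may be easier than typing an awkward name.

```r
library(dplyr)

# Toy stand-in for the census sheet; the values are made up
toy <- tibble(a = c(10, 20), b = c(100, 400))
names(toy) <- c("Area (hectares)\r\n[note 1, 2]", "All usual residents")

names(toy)  # hidden line breaks are printed as \r\n escapes

toy_clean <- toy |>
  rename(
    area = 1,                          # rename by column position
    residents = `All usual residents`  # rename by backtick-quoted name
  )
names(toy_clean)
```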

2.2.1 Common read_excel() arguments

Argument    Purpose
path        Path to the file
sheet       Sheet name or number to read
range       Cell range to read (e.g., "A1:D10")
col_names   Use first row as names or provide names
skip        Number of rows to skip before reading
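
To experiment with sheet and range without downloading anything, readxl ships a small demo workbook. The sketch below is an illustration only (the object names path and demo_xl are arbitrary): it lists the sheets and then reads a fixed cell range from one of them.

```r
library(readxl)

# readxl_example() returns the path to a demo workbook shipped with readxl
path <- readxl_example("datasets.xlsx")

excel_sheets(path)  # list the sheet names before choosing one

# Read only cells A1:D6 of the "mtcars" sheet:
# one header row plus five data rows, four columns
demo_xl <- read_excel(path, sheet = "mtcars", range = "A1:D6")
demo_xl
```

When range is supplied it takes precedence over skip, so the two are not normally combined.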

3 Part B: Computed variables in aesthetics

Before we explore colours and other aesthetics, let’s consider an important feature of ggplot2: you can compute variables directly within aes().

3.1 A mathematical relationship

  1. Using the census_ni data, think about what would happen if you plot residents (on the \(y\)-axis) against the product of area and density (on the \(x\)-axis). What shape would you expect the plot to have? Why?

    Answer: The plot would be a straight line with slope 1. This is because density is defined as residents/area, so area \(\times\) density = area \(\times\) (residents/area) = residents. Therefore, we are plotting residents against residents, which gives the straight line \(y = x\), up to any rounding in the published figures.

  2. Create this plot using aes(x = area * density, y = residents). Verify that the relationship is indeed a straight line.

    ggplot(census_ni, aes(x = area * density, y = residents)) +
      geom_point() +
      labs(x = "Area x Density", y = "Usual Residents")

    Note that we computed area * density directly within aes(). This is a powerful feature that allows you to transform or combine variables without creating new columns in your data frame.
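
As a minimal sketch of the same idea on a built-in dataset, the two plots below should be equivalent: one computes the variable inside aes(), the other creates a column first with mutate(). The names p1, p2 and wt_lbs are arbitrary; wt in mtcars is recorded in units of 1000 lbs.

```r
library(ggplot2)
library(dplyr)

# Computed inside aes(): no new column is added to mtcars
p1 <- ggplot(mtcars, aes(x = wt * 1000, y = mpg)) +
  geom_point()

# Equivalent: create the column first, then map it
p2 <- mtcars |>
  mutate(wt_lbs = wt * 1000) |>
  ggplot(aes(x = wt_lbs, y = mpg)) +
  geom_point()

# Both plots place the points at the same x positions
identical(layer_data(p1)$x, layer_data(p2)$x)
```

One visible difference is the default axis label: p1 is labelled with the expression wt * 1000, so labs() is usually worth adding when computing inside aes().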

4 Part C: Colour aesthetics

The colour aesthetic (see Chapter 4 of the notes) maps a variable to colours. The scale functions control how values are translated into colours.

4.1 Discrete colour scales

  3. Using the mtcars dataset, create a scatterplot of mpg (\(y\)-axis) vs wt (\(x\)-axis), with points coloured by cyl (number of cylinders). Remember to convert cyl to a factor using factor(cyl).

    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point(size = 3) +
      labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", colour = "Cylinders")

  4. Using your plot from Q3, try three different colour scales:

    1. scale_colour_brewer() with palette “Set1”
    2. scale_colour_viridis_d()
    3. scale_colour_grey()

    p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point(size = 3) +
      labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", colour = "Cylinders")
    
    p + scale_colour_brewer(palette = "Set1")

    p + scale_colour_viridis_d()

    p + scale_colour_grey()

  5. Create a bar chart of cyl from mtcars, with bars filled by cyl. Use scale_fill_viridis_d(option = "E") for a colour-blind-friendly palette.

    ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
      geom_bar() +
      scale_fill_viridis_d(option = "E") +
      labs(x = "Cylinders", y = "Count", fill = "Cylinders")

    Is this plot truly informative? Explain why or why not, and suggest a fix.

    Answer: This plot is not truly informative because the fill colour is redundant with the \(x\)-axis — both encode the same variable (cyl). The colour adds no new information. A better approach would be to fill by a different variable, such as am (transmission type), which would show how many cars of each cylinder count have automatic vs manual transmission:

    ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
      geom_bar(position = "dodge") +
      scale_fill_viridis_d(option = "E") +
      labs(x = "Cylinders", y = "Count", fill = "Transmission")
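
In the plot above the legend shows the raw codes 0 and 1. A possible refinement, sketched below, is to label the levels before plotting. In mtcars, am is coded 0 = automatic and 1 = manual (see ?mtcars); the names mtcars_labelled, transmission and p_am are made up for this illustration.

```r
library(ggplot2)
library(dplyr)

# Turn the 0/1 codes into a labelled factor
mtcars_labelled <- mtcars |>
  mutate(transmission = factor(am, levels = c(0, 1),
                               labels = c("Automatic", "Manual")))

p_am <- ggplot(mtcars_labelled, aes(x = factor(cyl), fill = transmission)) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d(option = "E") +
  labs(x = "Cylinders", y = "Count", fill = "Transmission")
p_am
```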

4.2 Continuous colour scales

  6. Using moneydemand, create a scatterplot of Rl (long-term rate) vs Rs (short-term rate), with colour mapped to year. Apply scale_colour_viridis_c().

    ggplot(moneydemand, aes(x = Rs, y = Rl, colour = year)) +
      geom_point(size = 2) +
      scale_colour_viridis_c() +
      labs(x = "Short-term Interest Rate (%)",
           y = "Long-term Interest Rate (%)",
           colour = "Year")

  7. Create a line plot of logM (log money stock) over year, with colour mapped to logYp (log permanent income). Try both:

    1. scale_colour_gradient(low = "yellow", high = "red")
    2. scale_colour_distiller(palette = "Spectral", direction = 1)

    p <- ggplot(moneydemand, aes(x = year, y = logM, colour = logYp)) +
      geom_line(linewidth = 1) +
      geom_point() +
      labs(x = "Year", y = "Log Money Stock",
           colour = "Log Permanent\nIncome")
    
    p + scale_colour_gradient(low = "yellow", high = "red")

    p + scale_colour_distiller(palette = "Spectral", direction = 1)

4.3 Binned colour scales

  8. Create the same plot as Q7 (logM over year, coloured by logYp), but use scale_colour_viridis_b() instead of a continuous scale. Compare the two versions: which do you find more informative for this data?

    ggplot(moneydemand, aes(x = year, y = logM, colour = logYp)) +
      geom_line(linewidth = 1) +
      geom_point() +
      scale_colour_viridis_b() +
      labs(x = "Year", y = "Log Money Stock",
           colour = "Log Permanent\nIncome")

    The binned scale may be more useful here as it clearly shows distinct periods (e.g., values below 6.5, 6.5–7.0, above 7.0). The continuous scale shows subtle gradations but the discrete boundaries can be easier to interpret.

  9. What error do you get if you try to use scale_colour_viridis_d() with a continuous variable like year? Why does this happen?

    ggplot(moneydemand, aes(x = Rs, y = Rl, colour = year)) +
      geom_point() +
      scale_colour_viridis_d()

    Answer: The error is ‘Continuous value supplied to discrete scale’ (or similar; the exact wording depends on the ggplot2 version). This occurs because year is a continuous numeric variable, but scale_colour_viridis_d() is designed for discrete (categorical) variables. Use scale_colour_viridis_c() for continuous data or scale_colour_viridis_b() for binned continuous data.
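
The mismatch can be reproduced on a built-in dataset. The sketch below is an illustration only: it uses hp from mtcars as the continuous variable, the object names (base, bad, msg, ok_continuous, ok_binned) are arbitrary, and the break values 100 and 200 are picked by hand. Note that the error is raised when the plot is built (for example when it is printed), not when it is defined.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg, colour = hp)) +
  geom_point(size = 3)

# A discrete scale on a continuous variable: defining the plot succeeds ...
bad <- base + scale_colour_viridis_d()

# ... but building it fails; capture the message instead of stopping
msg <- tryCatch(
  {
    ggplot_build(bad)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# Two scales that do accept a continuous variable
ok_continuous <- base + scale_colour_viridis_c()
ok_binned     <- base + scale_colour_viridis_b(breaks = c(100, 200))  # three bins
```

Supplying breaks to the binned scale sets the bin boundaries by hand, which can help when the default bins do not match meaningful thresholds.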

5 Part D: Other aesthetics

These aesthetics are covered in Chapter 5 of the notes.

5.1 Shape

  10. Using mtcars, create a scatterplot of mpg vs wt with points shaped by gear (number of gears). Use scale_shape_manual() to set shapes 16 (filled circle), 17 (filled triangle), and 15 (filled square) for the three gear values.

    ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(gear))) +
      geom_point(size = 3) +
      scale_shape_manual(values = c("3" = 16, "4" = 17, "5" = 15)) +
      labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", shape = "Gears")

  11. Create the same plot as Q10, but map both shape and colour to gear. Encoding one variable with two aesthetics is called “redundant encoding”. Apply scale_colour_brewer(palette = "Set1") for the colour.

    ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(gear),
                       colour = factor(gear))) +
      geom_point(size = 3) +
      scale_shape_manual(values = c("3" = 16, "4" = 17, "5" = 15)) +
      scale_colour_brewer(palette = "Set1") +
      labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
           shape = "Gears", colour = "Gears")

5.2 Size

  12. Using moneydemand, create a scatterplot of Rl vs Rs with point size mapped to logM. Use scale_size_continuous(range = c(1, 8)) and set alpha = 0.6 for transparency.

    ggplot(moneydemand, aes(x = Rs, y = Rl, size = logM)) +
      geom_point(alpha = 0.6) +
      scale_size_continuous(range = c(1, 8)) +
      labs(x = "Short-term Interest Rate (%)",
           y = "Long-term Interest Rate (%)",
           size = "Log Money\nStock")

  13. Create a scatterplot of mpg vs wt from mtcars that uses three aesthetics: colour for cyl, shape for gear, and size for hp. Comment on whether this plot is easy to interpret.

    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl),
                       shape = factor(gear), size = hp)) +
      geom_point(alpha = 0.7) +
      scale_size_continuous(range = c(2, 8)) +
      labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
           colour = "Cylinders", shape = "Gears", size = "Horsepower")

    This plot is close to the limit of what a reader can decode, because it encodes five variables at once: two by position, plus colour, shape, and size. It is technically possible, but the cognitive load is high. Generally, limit plots to 2–3 encoded variables for clarity.

5.3 Line type

In the notes (Chapter 5), we used economics and economics_long — both built into ggplot2 — as examples of the same data in wide and long format respectively. Here, we create a similar long-format dataset manually from moneydemand.

To compare multiple variables on the same plot using different line types, we sometimes need to reshape the data from “wide” format (one column per variable) to “long” format (one column for the variable name, one for the value). While reshaping data is out of scope for this module, the tidyr package provides useful functions like pivot_longer() for this purpose.
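
For reference only, a minimal sketch of what such a reshape could look like with pivot_longer(). The data frame toy_wide and its numbers are made up for illustration and are not taken from moneydemand.

```r
library(tidyr)

# Toy wide-format data; the numbers are made up for illustration
toy_wide <- data.frame(
  year = c(2001, 2002, 2003),
  Rs   = c(1.0, 1.5, 2.0),
  Rl   = c(3.0, 3.5, 4.0)
)

# One row per (year, rate_type) pair
toy_long <- pivot_longer(
  toy_wide,
  cols = c(Rs, Rl),        # columns to stack
  names_to = "rate_type",  # new column holding the old column names
  values_to = "rate"       # new column holding the values
)
toy_long
```

The rows come out grouped by year rather than by rate type, which makes no difference to ggplot2.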

For now, we will create the reshaped data manually:

# Create long-format data manually
moneydemand_long <- data.frame(
  year = rep(moneydemand$year, 2),
  rate_type = rep(c("Rs", "Rl"), each = nrow(moneydemand)),
  rate = c(moneydemand$Rs, moneydemand$Rl)
)
head(moneydemand_long)
##   year rate_type  rate
## 1 1879        Rs 5.067
## 2 1880        Rs 5.230
## 3 1881        Rs 5.200
## 4 1882        Rs 5.640
## 5 1883        Rs 5.620
## 6 1884        Rs 5.200

  14. Using moneydemand_long, create a plot with two lines: one for Rs and one for Rl, both plotted against year. Map linetype to rate_type.

    ggplot(moneydemand_long, aes(x = year, y = rate, linetype = rate_type)) +
      geom_line() +
      labs(x = "Year", y = "Interest Rate (%)", linetype = "Rate Type")

  15. Enhance the plot from Q14 by also mapping colour to rate_type, creating redundant encoding. Use scale_colour_viridis_d().

    ggplot(moneydemand_long, aes(x = year, y = rate, linetype = rate_type,
                                  colour = rate_type)) +
      geom_line(linewidth = 1) +
      scale_colour_viridis_d() +
      labs(x = "Year", y = "Interest Rate (%)",
           linetype = "Rate Type", colour = "Rate Type")

6 Summary

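Reading data:

  • read_csv() from readr reads CSV files; read_excel() from readxl reads .xls and .xlsx files, with sheet and skip to select a sheet and skip header rows.

Computed variables:

  • Expressions such as area * density can be computed directly within aes(), without adding columns to the data frame.

Colour aesthetics:

  • Discrete variables: scale_colour_brewer(), scale_colour_viridis_d(), scale_colour_grey().
  • Continuous variables: scale_colour_viridis_c(), scale_colour_gradient(), scale_colour_distiller().
  • Binned continuous variables: scale_colour_viridis_b().

Other aesthetics:

  • Shape, size, and line type can each encode an additional variable; mapping two aesthetics to the same variable gives redundant encoding.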