1 Introduction

This practical focuses on reading external data and using aesthetics in ggplot2. You will learn how to:

Read CSV files with readr and Excel files with readxl
Understand computed variables within aes()
Apply colour scales for discrete and continuous variables (Chapter 4 in notes)
Use shape, line type, and size aesthetics (Chapter 5 in notes)

The following datasets are used in this practical:

mtcars: Motor Trend car road tests data (1974), available in base R
moneydemand: US money demand data (1879–1974), external — available on Canvas
census_ni: Northern Ireland Census 2021 population data, external — available on Canvas
moneydemand_long: Long-format version of moneydemand, derived in this practical from moneydemand

2 Part A: Reading external data

Before we can visualise data, we often need to read it from external files. Two key packages help with this:

readr: For reading CSV and other flat files (plain text with values separated by commas, tabs, or other delimiters)
readxl: For reading Excel spreadsheets (.xls and .xlsx files), which are not flat files as they can contain multiple sheets, formatting, and formulas

As usual, load these two packages together with the usual suspects, ggplot2 and dplyr:

library(readr)
library(readxl)
library(ggplot2)
library(dplyr)

If R reports that a package cannot be found, install it once with install.packages("readr"), and likewise for the others.

2.1 Reading CSV files with `readr`

Download moneydemand.csv from Canvas to your working directory, and run the following, where read_csv() reads comma-separated value files:

moneydemand <- read_csv("moneydemand.csv")
str(moneydemand) # to have a look at the data

This dataset contains US money demand data from 1879 to 1974, with variables:

year: Year of observation
logM: Log of real money stock
logYp: Log of permanent income
Rs: Short-term interest rate
Rl: Long-term interest rate
Rm: Interest rate on money
logSpp: Log of stock prices

2.1.1 Common `read_csv()` arguments

Argument	Purpose
`file`	Path to the file
`col_names`	`TRUE` to use the first row as column names, `FALSE` to generate names, or a character vector of names
`col_types`	Specify column types explicitly
`skip`	Number of lines to skip before reading
`na`	Character vector of strings to treat as NA
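
As a minimal sketch of how these arguments combine, the example below writes a small made-up CSV to a temporary file and reads it back. The file contents, the `-999` and `missing` placeholders, and the object name `demo` are all hypothetical; they are not taken from moneydemand.csv.

```r
library(readr)

# Write a small, made-up CSV to a temporary file (hypothetical contents)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "exported from a made-up source",   # junk line that we want to skip
  "year,rate,flag",
  "1879,5.067,a",
  "1880,-999,b",
  "1881,5.2,missing"
), tmp)

demo <- read_csv(
  tmp,
  skip = 1,                             # ignore the junk first line
  na = c("", "NA", "-999", "missing"),  # strings to treat as NA
  col_types = cols(                     # set the column types explicitly
    year = col_integer(),
    rate = col_double(),
    flag = col_character()
  )
)
demo
```

Supplying col_types also silences the column specification message that read_csv() would otherwise print after guessing the types.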

2.2 Reading Excel files with `readxl`

The read_excel() function reads both .xls and .xlsx files. Unlike CSV files, Excel spreadsheets can contain multiple sheets and often have header rows that need to be skipped.

As an example, we will read population data from the Northern Ireland Census 2021, available from NISRA (https://www.nisra.gov.uk/publications/census-2021-person-and-household-estimates-data-zones-northern-ireland). We will use this dataset later in this module for geospatial visualisation.

Download the file census-2021-ms-a14.xlsx, displayed as “MS-A14 Population density” from the URL above (also available on Canvas), to your working directory, and run the following:

census_ni <- read_excel("census-2021-ms-a14.xlsx",
                        sheet = "SDZ",
                        skip = 5)
str(census_ni) # notice the clunky column names

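The column names from this file are quite clunky (e.g., `All usual residents`, `Population density (number of usual residents per hectare)`). We can use dplyr::rename(), with the syntax new_name = old_name, to give them cleaner names; names containing spaces or punctuation must be wrapped in backticks: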

census_ni <- census_ni |>
  rename(
    name = `Geography`,
    code = `Geography code`,
    residents = `All usual residents`,
    area = `Area (hectares)\r\n[note 1, 2]`,
    density = `Population density (number of usual residents per hectare)`,
    profile = `Access census area explorer`
  )
str(census_ni) # much cleaner
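
The same renaming pattern can be tried without downloading the census file. The sketch below uses a tiny hypothetical data frame (`clunky`, with made-up codes and counts) whose column names contain spaces, so they need backticks on the right-hand side of rename().

```r
library(dplyr)

# Hypothetical data frame with awkward column names (values are made up)
clunky <- data.frame(
  `Geography code` = c("X1", "X2"),
  `All usual residents` = c(100, 250),
  check.names = FALSE   # keep the spaces in the names
)

# rename() uses new_name = old_name; old names with spaces need backticks
clean <- clunky |>
  rename(
    code = `Geography code`,
    residents = `All usual residents`
  )
names(clean)
```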

2.2.1 Common `read_excel()` arguments

Argument	Purpose
`path`	Path to the file
`sheet`	Sheet name or number to read
`range`	Cell range to read (e.g., `"A1:D10"`)
`col_names`	`TRUE` to use the first row as column names, or a character vector of names
`skip`	Number of rows to skip before reading
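
As a hedged illustration of sheet and range, the sketch below reads from datasets.xlsx, an example workbook that ships with readxl, rather than from the census file. The object names `path` and `cars_small` are just placeholders.

```r
library(readxl)

# Path to an example workbook bundled with readxl (not the census file)
path <- readxl_example("datasets.xlsx")

# List the sheet names before deciding which one to read
excel_sheets(path)

# Read only cells A1:C4 of the "mtcars" sheet:
# one header row plus three data rows, across three columns
cars_small <- read_excel(path, sheet = "mtcars", range = "A1:C4")
cars_small
```

When range is supplied it takes precedence over skip, so the two arguments are normally used as alternatives rather than together.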

3 Part B: Computed variables in aesthetics

Before we explore colours and other aesthetics, let’s consider an important feature of ggplot2: you can compute variables directly within aes().

3.1 A mathematical relationship

Using the census_ni data, think about what would happen if you plot residents (on the \(y\)-axis) against the product of area and density (on the \(x\)-axis). What shape would you expect the plot to have? Why?

Answer: The plot would be a straight line with slope 1. This is because density is defined as residents/area, so area \(\times\) density = area \(\times\) (residents/area) = residents. Therefore, we are plotting residents against residents, which gives a straight line \(y = x\).

Create this plot using aes(x = area * density, y = residents). Verify that the relationship is indeed a straight line.
```
ggplot(census_ni, aes(x = area * density, y = residents)) +
  geom_point() +
  labs(x = "Area x Density", y = "Usual Residents")
```
Note that we computed area * density directly within aes(). This is a powerful feature that allows you to transform or combine variables without creating new columns in your data frame.
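
To see that computing inside aes() is equivalent to creating the column first, here is a small sketch using mtcars, so that it runs without the census file. The ratio hp / wt and the column name `power_to_weight` are illustrative choices, not part of the practical.

```r
library(ggplot2)
library(dplyr)

# Option 1: compute the ratio directly inside aes()
p_inline <- ggplot(mtcars, aes(x = hp / wt, y = mpg)) +
  geom_point()

# Option 2: create the column first, then map it
p_column <- mtcars |>
  mutate(power_to_weight = hp / wt) |>
  ggplot(aes(x = power_to_weight, y = mpg)) +
  geom_point()

# Both plots place the points at the same x positions
same_x <- isTRUE(all.equal(layer_data(p_inline)$x, layer_data(p_column)$x))
same_x
```

The two versions differ only in the default axis label, which is taken from the expression or column name, so setting it with labs() is advisable in either case.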

4 Part C: Colour aesthetics

Colour is an aesthetic (see Chapter 4 in notes) that maps variables to visual colours. The scale functions control how this mapping works.

4.1 Discrete colour scales

Using the mtcars dataset, create a scatterplot of mpg (\(y\)-axis) vs wt (\(x\)-axis), with points coloured by cyl (number of cylinders). Remember to convert cyl to a factor using factor(cyl).
```
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", colour = "Cylinders")
```

Using your plot from Q3 (the scatterplot of mpg vs wt coloured by cyl), try three different colour scales:

scale_colour_brewer() with palette "Set1"
scale_colour_viridis_d()
scale_colour_grey()

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", colour = "Cylinders")

p + scale_colour_brewer(palette = "Set1")

p + scale_colour_viridis_d()

p + scale_colour_grey()

Create a bar chart of cyl from mtcars, with bars filled by cyl. Use scale_fill_viridis_d(option = "E") for a colour-blind-friendly palette.
```
ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  scale_fill_viridis_d(option = "E") +
  labs(x = "Cylinders", y = "Count", fill = "Cylinders")
```
Is this plot truly informative? Explain why or why not, and suggest a fix.

Answer: This plot is not truly informative because the fill colour is redundant with the \(x\)-axis: both encode the same variable (cyl), so the colour adds no new information. A better approach would be to fill by a different variable, such as am (transmission type, coded 0 = automatic and 1 = manual), which would show how many cars of each cylinder count have automatic vs manual transmission:
```
ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d(option = "E") +
  labs(x = "Cylinders", y = "Count", fill = "Transmission")
```
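
In the plot above the legend shows the raw codes 0 and 1. One possible refinement, sketched here under the assumption that readable labels are wanted, is to recode am as a labelled factor first. The names `cars`, `transmission`, and `p_am` are hypothetical.

```r
library(ggplot2)

# Recode am as a labelled factor; mtcars documents 0 = automatic, 1 = manual
cars <- transform(
  mtcars,
  transmission = factor(am, levels = c(0, 1),
                        labels = c("Automatic", "Manual"))
)

p_am <- ggplot(cars, aes(x = factor(cyl), fill = transmission)) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d(option = "E") +
  labs(x = "Cylinders", y = "Count", fill = "Transmission")
p_am

# The counts behind the bars
table(cars$cyl, cars$transmission)
```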

4.2 Continuous colour scales

Using moneydemand, create a scatterplot of Rl (long-term rate) vs Rs (short-term rate), with colour mapped to year. Apply scale_colour_viridis_c().

ggplot(moneydemand, aes(x = Rs, y = Rl, colour = year)) +
  geom_point(size = 2) +
  scale_colour_viridis_c() +
  labs(x = "Short-term Interest Rate (%)",
       y = "Long-term Interest Rate (%)",
       colour = "Year")

Create a line plot of logM (log money stock) over year, with colour mapped to logYp (log permanent income). Try both:

scale_colour_gradient(low = "yellow", high = "red")
scale_colour_distiller(palette = "Spectral", direction = 1)

p <- ggplot(moneydemand, aes(x = year, y = logM, colour = logYp)) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(x = "Year", y = "Log Money Stock",
       colour = "Log Permanent\nIncome")

p + scale_colour_gradient(low = "yellow", high = "red")

p + scale_colour_distiller(palette = "Spectral", direction = 1)

4.3 Binned colour scales

Create the same plot as Q7 (logM over year, coloured by logYp), but use scale_colour_viridis_b() instead of a continuous scale. Compare the two versions — which do you find more informative for this data?
```
ggplot(moneydemand, aes(x = year, y = logM, colour = logYp)) +
  geom_line(linewidth = 1) +
  geom_point() +
  scale_colour_viridis_b() +
  labs(x = "Year", y = "Log Money Stock",
       colour = "Log Permanent\nIncome")
```
The binned scale may be more useful here because it clearly separates distinct periods (e.g., values below 6.5, 6.5–7.0, above 7.0). The continuous scale shows subtle gradations, but the discrete boundaries of the binned scale can be easier to interpret.

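
If the default bin boundaries are not the ones you want, the binned scales accept a breaks argument. The sketch below uses mtcars so that it is self-contained; the break values 100, 150 and 200 are arbitrary illustrative choices.

```r
library(ggplot2)

# Binned colour scale with explicitly chosen bin boundaries
p_binned <- ggplot(mtcars, aes(x = wt, y = mpg, colour = hp)) +
  geom_point(size = 3) +
  scale_colour_viridis_b(breaks = c(100, 150, 200)) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
       colour = "Horsepower")
p_binned

# Three breaks define at most four bins, hence at most four colours
n_colours <- length(unique(layer_data(p_binned)$colour))
n_colours
```
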
What error do you get if you try to use scale_colour_viridis_d() with a continuous variable like year? Why does this happen?
```
ggplot(moneydemand, aes(x = Rs, y = Rl, colour = year)) +
  geom_point() +
  scale_colour_viridis_d()
```
Answer: The error is ‘Continuous value supplied to discrete scale’ (or similar, depending on the ggplot2 version). This occurs because year is a continuous numeric variable, but scale_colour_viridis_d() is designed for discrete (categorical) variables. Use scale_colour_viridis_c() for continuous data or scale_colour_viridis_b() for binned continuous data.
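
The mismatch can be reproduced without the moneydemand file. The following sketch uses hp from mtcars as the continuous variable and captures the error message rather than letting it stop the script; the object names are hypothetical, and the exact wording of the message may vary between ggplot2 versions.

```r
library(ggplot2)

# hp is numeric, so a discrete scale cannot handle it
p_wrong <- ggplot(mtcars, aes(x = wt, y = mpg, colour = hp)) +
  geom_point() +
  scale_colour_viridis_d()

# Creating the object succeeds; the error appears only when the plot
# is built, which is also what happens when it is printed
err_msg <- tryCatch(
  {
    ggplot_build(p_wrong)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
err_msg

# Fix 1: use the continuous version of the scale
p_cont <- ggplot(mtcars, aes(x = wt, y = mpg, colour = hp)) +
  geom_point() +
  scale_colour_viridis_c()

# Fix 2: map a variable that has been converted to a factor
p_disc <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  scale_colour_viridis_d()
```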

5 Part D: Other aesthetics

These aesthetics are covered in Chapter 5 in notes.

5.1 Shape

Using mtcars, create a scatterplot of mpg vs wt with points shaped by gear (number of gears). Use scale_shape_manual() to set shapes 16 (filled circle), 17 (filled triangle), and 15 (filled square) for the three gear values.

ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(gear))) +
  geom_point(size = 3) +
  scale_shape_manual(values = c("3" = 16, "4" = 17, "5" = 15)) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", shape = "Gears")

Create the same plot as Q10 (mpg vs wt with points shaped by gear), but map both shape and colour to the number of gears. This is called “redundant encoding”. Apply scale_colour_brewer(palette = "Set1") for the colour.

ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(gear),
                   colour = factor(gear))) +
  geom_point(size = 3) +
  scale_shape_manual(values = c("3" = 16, "4" = 17, "5" = 15)) +
  scale_colour_brewer(palette = "Set1") +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
       shape = "Gears", colour = "Gears")
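
Giving both aesthetics the same legend title matters: ggplot2 combines legends for the same variable when their titles and labels match, and draws them separately otherwise. A minimal sketch of the difference, with the hypothetical title "Number of gears" used only to force a mismatch:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg,
                           shape = factor(gear),
                           colour = factor(gear))) +
  geom_point(size = 3)

# Same title for both aesthetics: one combined legend
p_merged <- base + labs(shape = "Gears", colour = "Gears")

# Different titles: two separate legends
p_split <- base + labs(shape = "Gears", colour = "Number of gears")

p_merged
p_split
```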

5.2 Size

Using moneydemand, create a scatterplot of Rl vs Rs with point size mapped to logM. Use scale_size_continuous(range = c(1, 8)) and set alpha = 0.6 for transparency.

ggplot(moneydemand, aes(x = Rs, y = Rl, size = logM)) +
  geom_point(alpha = 0.6) +
  scale_size_continuous(range = c(1, 8)) +
  labs(x = "Short-term Interest Rate (%)",
       y = "Long-term Interest Rate (%)",
       size = "Log Money\nStock")

Create a scatterplot of mpg vs wt from mtcars that uses three aesthetics: colour for cyl, shape for gear, and size for hp. Comment on whether this plot is easy to interpret.
```
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl),
                   shape = factor(gear), size = hp)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(2, 8)) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
       colour = "Cylinders", shape = "Gears", size = "Horsepower")
```
This plot is getting close to the limit of perception and interpretation because it encodes quite a few variables at once. While it’s technically possible, the cognitive load is high. Generally, limit plots to 2–3 encoded variables for clarity.

5.3 Line type

In the notes (Chapter 5), we used economics and economics_long — both built into ggplot2 — as examples of the same data in wide and long format respectively. Here, we create a similar long-format dataset manually from moneydemand.

To compare multiple variables on the same plot using different line types, we sometimes need to reshape the data from “wide” format (one column per variable) to “long” format (one column for the variable name, one for the value). While reshaping data is out of scope for this module, the tidyr package provides useful functions like pivot_longer() for this purpose.

For now, we will create the reshaped data manually:

# Create long-format data manually
moneydemand_long <- data.frame(
  year = rep(moneydemand$year, 2),
  rate_type = rep(c("Rs", "Rl"), each = nrow(moneydemand)),
  rate = c(moneydemand$Rs, moneydemand$Rl)
)
head(moneydemand_long)

##   year rate_type  rate
## 1 1879        Rs 5.067
## 2 1880        Rs 5.230
## 3 1881        Rs 5.200
## 4 1882        Rs 5.640
## 5 1883        Rs 5.620
## 6 1884        Rs 5.200
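
For comparison only, the same reshaping could be sketched with tidyr::pivot_longer(). The data frame `toy` below is a three-row stand-in with illustrative values, not the real moneydemand data.

```r
library(tidyr)

# Toy stand-in for moneydemand: three years, illustrative values only
toy <- data.frame(
  year = c(1879, 1880, 1881),
  Rs = c(5.1, 5.2, 5.2),
  Rl = c(4.2, 4.1, 4.0)
)

# Stack the Rs and Rl columns: names go to rate_type, values to rate
toy_long <- pivot_longer(
  toy,
  cols = c(Rs, Rl),
  names_to = "rate_type",
  values_to = "rate"
)
toy_long
```

The rows come out ordered by year rather than by rate type, unlike the manual version above; this makes no difference to a line plot, since geom_line() connects the observations within each group in order of the x variable.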

Using moneydemand_long, create a plot with two lines: one for Rs and one for Rl, both plotted against year. Map linetype to rate_type.

ggplot(moneydemand_long, aes(x = year, y = rate, linetype = rate_type)) +
  geom_line() +
  labs(x = "Year", y = "Interest Rate (%)", linetype = "Rate Type")

Enhance the plot from Q14 (the two lines for Rs and Rl against year) by also mapping colour to rate_type, creating redundant encoding. Use scale_colour_viridis_d().

ggplot(moneydemand_long, aes(x = year, y = rate, linetype = rate_type,
                              colour = rate_type)) +
  geom_line(linewidth = 1) +
  scale_colour_viridis_d() +
  labs(x = "Year", y = "Interest Rate (%)",
       linetype = "Rate Type", colour = "Rate Type")

6 Summary

Reading data:

read_csv() from readr: for CSV and other flat files
read_excel() from readxl: for Excel spreadsheets (not flat)
Common arguments: skip, col_names, col_types, sheet, range
Use dplyr::rename() to fix clunky column names

Computed variables:

You can compute expressions directly within aes(), e.g., aes(x = a * b)
factor(cyl) used throughout this practical is another example: it converts the numeric cyl to a categorical factor, directly inside aes()
Understanding mathematical relationships helps predict plot shapes

Colour aesthetics:

Use colour for points/lines/outlines; use fill for bar/box interiors
Match scale type to variable type:
- Discrete (categorical): scale_*_viridis_d(), scale_*_brewer()
- Continuous: scale_*_viridis_c(), scale_*_distiller(), scale_*_gradient()
- Binned continuous: scale_*_viridis_b(), scale_*_fermenter()
Using the wrong scale type causes errors, e.g., a discrete scale applied to a continuous variable

Other aesthetics:

Shape: scale_shape_manual() to specify exact shapes (categorical variables only)
Line type: for distinguishing lines, especially in black-and-white
Size: scale_size_continuous(range = c(min, max)) to control size range
Redundant encoding (colour + shape) helps for accessibility but uses up aesthetics
Avoid encoding too many variables at once — 2–3 is usually the limit

MAS2908 - Practical 04 (Solutions)

Clement Lee

Semester 2, 2025/2026
