This practical focuses on reading external data and using aesthetics in
ggplot2. You will learn how to:
- Read CSV files with readr and Excel files with readxl
- Compute variables directly within aes()

The following datasets are used in this practical:
- mtcars: Motor Trend car road tests data (1974), available in base R
- moneydemand: US money demand data (1879–1974), external, available on Canvas
- census_ni: Northern Ireland Census 2021 population data, external,
  available on Canvas
- moneydemand_long: Long-format version of moneydemand, derived in this
  practical from moneydemand

Before we can visualise data, we often need to read it from external files. Two key packages help with this:
- readr: For reading CSV and other flat files (plain text with values
  separated by commas, tabs, or other delimiters)
- readxl: For reading Excel spreadsheets (.xls and .xlsx files),
  which are not flat files as they can contain multiple sheets, formatting,
  and formulas

As usual, load these two packages together with the usual suspects, ggplot2 and dplyr:
library(readr)
library(readxl)
library(ggplot2)
library(dplyr)
If R reports that one of these packages is not installed, install it once using install.packages("readr"), and likewise for the others.

Reading CSV files with readr
Download moneydemand.csv from Canvas to your working directory, and run the
following, where read_csv() reads comma-separated value files:
moneydemand <- read_csv("moneydemand.csv")
str(moneydemand) # to have a look at the data
This dataset contains US money demand data from 1879 to 1974, with variables:
- year: Year of observation
- logM: Log of real money stock
- logYp: Log of permanent income
- Rs: Short-term interest rate
- Rl: Long-term interest rate
- Rm: Interest rate on money
- logSpp: Log of stock prices

read_csv() arguments

| Argument | Purpose |
|---|---|
| file | Path to the file |
| col_names | Use first row as names (TRUE) or provide names |
| col_types | Specify column types explicitly |
| skip | Number of lines to skip before reading |
| na | Character vector of strings to treat as NA |
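
To see several of these arguments working together, here is a minimal sketch. It does not use moneydemand.csv: the file, its contents, and the "missing" marker are all made up for illustration and written to a temporary location first.

```r
library(readr)

# Hypothetical toy file (made-up values): a note line, a header, then data
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("note: made-up values for illustration only",
    "year,rate",
    "1879,5.1",
    "1880,missing",
    "1881,5.2"),
  tmp
)

toy <- read_csv(
  tmp,
  skip = 1,                               # drop the note line above the header
  col_types = cols(year = col_integer(),  # set column types explicitly
                   rate = col_double()),
  na = c("", "NA", "missing")             # strings to read as NA
)
str(toy)
```

Setting col_types explicitly also stops read_csv() from printing its column-specification message, which can be handy once you know what the file contains.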

Reading Excel files with readxl
The read_excel() function reads both .xls and .xlsx files. Unlike CSV
files, Excel spreadsheets can contain multiple sheets and often have header
rows that need to be skipped.
As an example, we will read population data from the Northern Ireland Census 2021, available from NISRA (https://www.nisra.gov.uk/publications/census-2021-person-and-household-estimates-data-zones-northern-ireland). We will use this dataset later in this module for geospatial visualisation.
Download the file census-2021-ms-a14.xlsx, displayed as “MS-A14 Population density” from the URL above (also available on Canvas), to your
working directory, and run the following:
census_ni <- read_excel("census-2021-ms-a14.xlsx",
                        sheet = "SDZ",
                        skip = 5)
str(census_ni) # notice the clunky column names
The column names from this file are quite clunky (e.g., All usual residents,
Population density (number of usual residents per hectare)). We can use dplyr::rename() to give them cleaner names:
census_ni <- census_ni |>
  rename(
    name = `Geography`,
    code = `Geography code`,
    residents = `All usual residents`,
    area = `Area (hectares)\r\n[note 1, 2]`,
    density = `Population density (number of usual residents per hectare)`,
    profile = `Access census area explorer`
  )
str(census_ni) # much cleaner

read_excel() arguments

| Argument | Purpose |
|---|---|
| path | Path to the file |
| sheet | Sheet name or number to read |
| range | Cell range to read (e.g., "A1:D10") |
| col_names | Use first row as names or provide names |
| skip | Number of rows to skip before reading |
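
The sheet and range arguments can be tried without downloading anything. The sketch below uses the example workbook that ships with readxl (located via readxl_example()), not the census file, so the sheet name and cell range here are specific to that bundled workbook.

```r
library(readxl)

# readxl_example() returns the path to a workbook bundled with readxl
path <- readxl_example("datasets.xlsx")

excel_sheets(path)   # list the sheet names before choosing one

# Read only cells A1:D10 of the "mtcars" sheet; row 1 supplies the names
mt_part <- read_excel(path, sheet = "mtcars", range = "A1:D10")
str(mt_part)
```

Calling excel_sheets() first is a useful habit with unfamiliar workbooks, since it shows which names can be passed to sheet.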
Before we explore colours and other aesthetics, let’s consider an important
feature of ggplot2: you can compute variables directly within aes().
Using the census_ni data, think about what would happen if you plot
residents (on the \(y\)-axis) against the product of area and density
(on the \(x\)-axis). What shape would you expect the plot to have? Why?
Answer: The plot would be a straight line with slope 1. This is because density is defined as residents/area, so area \(\times\) density = area \(\times\) (residents/area) = residents. Therefore, we are plotting residents against residents, which gives a straight line \(y = x\).
Create this plot using aes(x = area * density, y = residents). Verify
that the relationship is indeed a straight line.
ggplot(census_ni, aes(x = area * density, y = residents)) +
  geom_point() +
  labs(x = "Area x Density", y = "Usual Residents")
Note that we computed area * density directly within aes(). This is a
powerful feature that allows you to transform or combine variables without
creating new columns in your data frame.
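
The identity behind the straight line can also be checked numerically. This is a small sketch with made-up numbers (not values from the census file), where density is computed exactly as residents divided by area:

```r
# Hypothetical toy values, not taken from the census file
toy <- data.frame(
  residents = c(1200, 3400, 560),
  area      = c(40, 85, 7)                # hectares
)
toy$density <- toy$residents / toy$area   # residents per hectare

# area * density should reproduce residents, up to floating-point error
all.equal(toy$area * toy$density, toy$residents)
```

If the density column in a published file has been rounded, the same check on real data may show small deviations, and the plotted points would then sit very close to, rather than exactly on, the line \(y = x\).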
Colour is an aesthetic (see Chapter 4 in notes) that maps variables to visual colours. The scale functions control how this mapping works.
Using the mtcars dataset, create a scatterplot of mpg (\(y\)-axis) vs
wt (\(x\)-axis), with points coloured by cyl (number of cylinders).
Remember to convert cyl to a factor using factor(cyl).
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", colour = "Cylinders")
Using your plot from Q3, try three different colour scales:
- scale_colour_brewer() with palette "Set1"
- scale_colour_viridis_d()
- scale_colour_grey()

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", colour = "Cylinders")
p + scale_colour_brewer(palette = "Set1")
p + scale_colour_viridis_d()
p + scale_colour_grey()
Create a bar chart of cyl from mtcars, with bars filled by cyl.
Use scale_fill_viridis_d(option = "E") for a colour-blind-friendly palette.
ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  scale_fill_viridis_d(option = "E") +
  labs(x = "Cylinders", y = "Count", fill = "Cylinders")
Is this plot truly informative? Explain why or why not, and suggest a fix.
Answer: This plot is not truly informative because the fill colour is redundant with the \(x\)-axis — both encode the same variable (cyl). The colour adds no new information. A better approach would be to fill by a different variable, such as am (transmission type), which would show how many cars of each cylinder count have automatic vs manual transmission:
ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d(option = "E") +
  labs(x = "Cylinders", y = "Count", fill = "Transmission")
Using moneydemand, create a scatterplot of Rl (long-term rate) vs Rs
(short-term rate), with colour mapped to year. Apply
scale_colour_viridis_c().
ggplot(moneydemand, aes(x = Rs, y = Rl, colour = year)) +
  geom_point(size = 2) +
  scale_colour_viridis_c() +
  labs(x = "Short-term Interest Rate (%)",
       y = "Long-term Interest Rate (%)",
       colour = "Year")
Create a line plot of logM (log money stock) over year, with colour
mapped to logYp (log permanent income). Try both:
- scale_colour_gradient(low = "yellow", high = "red")
- scale_colour_distiller(palette = "Spectral", direction = 1)

p <- ggplot(moneydemand, aes(x = year, y = logM, colour = logYp)) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(x = "Year", y = "Log Money Stock",
       colour = "Log Permanent\nIncome")
p + scale_colour_gradient(low = "yellow", high = "red")
p + scale_colour_distiller(palette = "Spectral", direction = 1)
Create the same plot as Q7 (logM over year, coloured by logYp), but use
scale_colour_viridis_b() instead of a continuous scale. Compare the two
versions — which do you find more informative for this data?
ggplot(moneydemand, aes(x = year, y = logM, colour = logYp)) +
  geom_line(linewidth = 1) +
  geom_point() +
  scale_colour_viridis_b() +
  labs(x = "Year", y = "Log Money Stock",
       colour = "Log Permanent\nIncome")
The binned scale may be more useful here as it clearly shows distinct periods (e.g., values below 6.5, 6.5–7.0, above 7.0). The continuous scale shows subtle gradations but the discrete boundaries can be easier to interpret.
What error do you get if you try to use scale_colour_viridis_d() with
a continuous variable like year? Why does this happen?
ggplot(moneydemand, aes(x = Rs, y = Rl, colour = year)) +
  geom_point() +
  scale_colour_viridis_d()
Answer: The error is ‘Continuous value supplied to discrete scale’ (or similar). This occurs because year is a continuous numeric variable, but scale_colour_viridis_d() is designed for discrete (categorical) variables. Use scale_colour_viridis_c() for continuous data or scale_colour_viridis_b() for binned continuous data.
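
The sketch below reproduces the mismatch on a tiny made-up data frame (not moneydemand) and shows two possible ways round it. The error is captured with tryCatch() so the script keeps running; the exact wording of the message may differ between ggplot2 versions.

```r
library(ggplot2)

# Hypothetical toy data (made-up values)
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 6), year = 1900:1904)
base <- ggplot(toy, aes(x = x, y = y, colour = year)) +
  geom_point()

# The mismatch only surfaces when the plot is built (or printed)
msg <- tryCatch(
  {
    ggplot_build(base + scale_colour_viridis_d())
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# Fix 1: match the scale to the variable type
p_cont <- base + scale_colour_viridis_c()

# Fix 2: make the variable discrete before using a discrete scale
p_disc <- ggplot(toy, aes(x = x, y = y, colour = factor(year))) +
  geom_point() +
  scale_colour_viridis_d()
```

The second fix treats every year as its own category, so it is only sensible when there are a handful of distinct values.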
The shape, size, and linetype aesthetics used in the following questions are covered in Chapter 5 in notes.
Using mtcars, create a scatterplot of mpg vs wt with points shaped
by gear (number of gears). Use scale_shape_manual() to set shapes 16
(filled circle), 17 (filled triangle), and 15 (filled square) for the
three gear values.
ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(gear))) +
  geom_point(size = 3) +
  scale_shape_manual(values = c("3" = 16, "4" = 17, "5" = 15)) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", shape = "Gears")
Create the same plot as Q10, but add both shape and colour to represent
the number of gears. This is called “redundant encoding”. Apply
scale_colour_brewer(palette = "Set1") for the colour.
ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(gear),
                   colour = factor(gear))) +
  geom_point(size = 3) +
  scale_shape_manual(values = c("3" = 16, "4" = 17, "5" = 15)) +
  scale_colour_brewer(palette = "Set1") +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
       shape = "Gears", colour = "Gears")
Using moneydemand, create a scatterplot of Rl vs Rs with point
size mapped to logM. Use scale_size_continuous(range = c(1, 8))
and set alpha = 0.6 for transparency.
ggplot(moneydemand, aes(x = Rs, y = Rl, size = logM)) +
  geom_point(alpha = 0.6) +
  scale_size_continuous(range = c(1, 8)) +
  labs(x = "Short-term Interest Rate (%)",
       y = "Long-term Interest Rate (%)",
       size = "Log Money\nStock")
Create a scatterplot of mpg vs wt from mtcars that uses three
aesthetics: colour for cyl, shape for gear, and size for hp. Comment
on whether this plot is easy to interpret.
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl),
                   shape = factor(gear), size = hp)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(2, 8)) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
       colour = "Cylinders", shape = "Gears", size = "Horsepower")
This plot is getting close to the limit of perception and interpretation because it encodes five variables at once: two as positions, plus one each as colour, shape, and size. While it’s technically possible, the cognitive load is high. Generally, limit plots to 2–3 encoded variables for clarity.
In the notes (Chapter 5), we used economics and economics_long — both
built into ggplot2 — as examples of the same data in wide and long format
respectively. Here, we create a similar long-format dataset manually from
moneydemand.
To compare multiple variables on the same plot using different line types, we
sometimes need to reshape the data from “wide” format (one column per variable)
to “long” format (one column for the variable name, one for the value). While
reshaping data is out of scope for this module, the tidyr package provides
useful functions like pivot_longer() for this purpose.
For now, we will create the reshaped data manually:
# Create long-format data manually
moneydemand_long <- data.frame(
  year = rep(moneydemand$year, 2),
  rate_type = rep(c("Rs", "Rl"), each = nrow(moneydemand)),
  rate = c(moneydemand$Rs, moneydemand$Rl)
)
head(moneydemand_long)
## year rate_type rate
## 1 1879 Rs 5.067
## 2 1880 Rs 5.230
## 3 1881 Rs 5.200
## 4 1882 Rs 5.640
## 5 1883 Rs 5.620
## 6 1884 Rs 5.200
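
For reference only, the same kind of reshaping could be sketched with tidyr::pivot_longer(). The wide data frame below is a made-up stand-in with the same column layout as the part of moneydemand used above; its values are purely illustrative.

```r
library(tidyr)

# Hypothetical wide-format toy data (made-up values)
wide <- data.frame(
  year = c(1900, 1901, 1902),
  Rs   = c(4.1, 4.3, 4.0),
  Rl   = c(3.2, 3.3, 3.1)
)

long <- pivot_longer(
  wide,
  cols      = c(Rs, Rl),     # columns to stack
  names_to  = "rate_type",   # old column names go here
  values_to = "rate"         # old cell values go here
)
long
```

Unlike the manual version, pivot_longer() keeps the rows for each year together rather than listing all Rs rows followed by all Rl rows. This makes no difference to a plot that maps linetype or colour to rate_type.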
Using moneydemand_long, create a plot with two lines: one for Rs and
one for Rl, both plotted against year. Map linetype to rate_type.
ggplot(moneydemand_long, aes(x = year, y = rate, linetype = rate_type)) +
  geom_line() +
  labs(x = "Year", y = "Interest Rate (%)", linetype = "Rate Type")
Enhance the plot from Q14 by also mapping colour to rate_type, creating
redundant encoding. Use scale_colour_viridis_d().
ggplot(moneydemand_long, aes(x = year, y = rate, linetype = rate_type,
                             colour = rate_type)) +
  geom_line(linewidth = 1) +
  scale_colour_viridis_d() +
  labs(x = "Year", y = "Interest Rate (%)",
       linetype = "Rate Type", colour = "Rate Type")

Reading data:
- read_csv() from readr: for CSV and other flat files
- read_excel() from readxl: for Excel spreadsheets (not flat)
- Key arguments: skip, col_names, col_types, sheet, range
- Use dplyr::rename() to fix clunky column names

Computed variables:
- Variables can be computed directly within aes(), e.g., aes(x = a * b)
- factor(cyl) used throughout this practical is another example: it converts
  the numeric cyl to a categorical factor, directly inside aes()

Colour aesthetics:
- Use colour for points/lines/outlines; use fill for bar/box interiors
- Discrete: scale_*_viridis_d(), scale_*_brewer()
- Continuous: scale_*_viridis_c(), scale_*_distiller(), scale_*_gradient()
- Binned: scale_*_viridis_b(), scale_*_fermenter()

Other aesthetics:
- Use scale_shape_manual() to specify exact shapes (only for categorical)
- Use scale_size_continuous(range = c(min, max)) to control size range