This practical reinforces Chapter 7 of the notes, exploring two complementary
approaches to the same visualisation: manipulating data explicitly with dplyr,
or using built-in functions in ggplot2. You will practice:
- dplyr::count() vs letting geom_bar() count automatically
- filter(), scale_*_continuous(), and coord_cartesian()
- stat_summary() vs a separate data frame
- after_stat(density)

Throughout this practical we use the mpg dataset (shipped with ggplot2) and the built-in mtcars dataset.
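library(ggplot2)  # plotting functions and the mpg dataset
library(dplyr)    # count(), filter(), group_by(), summarise()
# geom_bar() automatically counts from raw data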
ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar() +
labs(x = "Cylinders", y = "Count")
Using mtcars, create a bar chart of gear (number of gears) using
geom_bar(). Add appropriate axis labels.
ggplot(mtcars, aes(x = factor(gear))) +
geom_bar() +
labs(x = "Number of Gears", y = "Count")
Use count() to compute the number of cars with each value of gear from
mtcars. Store this as gear_counts, then recreate the same bar chart
using geom_col(). Verify the two charts look identical.
gear_counts <- mtcars |>
count(gear) |>
rename(gears = gear, count = n)
ggplot(gear_counts, aes(x = factor(gears), y = count)) +
geom_col() +
labs(x = "Number of Gears", y = "Count")
What happens if you supply gear_counts to geom_bar() without
stat = "identity"? Demonstrate this and explain why the bars all have
height 1.
ggplot(gear_counts, aes(x = factor(gears))) +
geom_bar() +
labs(x = "Number of Gears", y = "Count (wrong!)")
geom_bar() counts the rows of the data frame passed to it. gear_counts has one row per gear value (3 rows), so each bar has height 1. To use pre-computed counts, switch to geom_col() and map the count column to y, or add stat = "identity" to geom_bar() with the same y mapping.
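The bar heights in the three charts above can also be checked numerically rather than by eye. The following is a minimal sketch, assuming ggplot2 and dplyr are installed; it uses layer_data() to pull the computed heights out of each plot, and the object names (p_raw, p_col, p_wrong) are chosen here for illustration.

```r
library(ggplot2)
library(dplyr)

# Pre-computed counts: one row per gear value, with columns gear and n
gear_counts <- count(mtcars, gear)

# (1) geom_bar() on raw data, (2) geom_col() on counts,
# (3) geom_bar() on counts without stat = "identity"
p_raw   <- ggplot(mtcars, aes(x = factor(gear))) + geom_bar()
p_col   <- ggplot(gear_counts, aes(x = factor(gear), y = n)) + geom_col()
p_wrong <- ggplot(gear_counts, aes(x = factor(gear))) + geom_bar()

# layer_data() returns the data ggplot2 actually draws; y is the bar height
heights_raw   <- layer_data(p_raw)$y
heights_col   <- layer_data(p_col)$y
heights_wrong <- layer_data(p_wrong)$y

heights_raw    # 15 12 5
heights_col    # 15 12 5
heights_wrong  # 1 1 1
```

The first two vectors agree, which is why the charts look identical; the third shows geom_bar() counting the three rows of gear_counts instead of using n.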
Using mpg, create a horizontal bar chart showing the count of cars for
each manufacturer. Which manufacturer has the most cars in the dataset?
Hint: Map manufacturer to the y aesthetic (not x) for a
horizontal chart.
ggplot(mpg, aes(y = manufacturer)) +
geom_bar() +
labs(x = "Count", y = "Manufacturer")
# Dodge has the most cars (37), followed by Toyota (34) and Volkswagen (27)
The following code creates a pre-counted summary. Use geom_col() to
create a bar chart, ordered from most to fewest cars per class, displayed
horizontally.
Hint: Use reorder(class, n) inside aes() to order the bars.
class_summary <- mpg |>
count(class) |>
arrange(desc(n))
class_summary
## # A tibble: 7 × 2
##   class          n
##   <chr>      <int>
## 1 suv           62
## 2 compact       47
## 3 midsize       41
## 4 subcompact    35
## 5 pickup        33
## 6 minivan       11
## 7 2seater        5
ggplot(class_summary, aes(x = reorder(class, n), y = n)) +
geom_col() +
coord_flip() +
labs(x = "Class", y = "Count")
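To see what reorder(class, n) does to the bar order, it can help to inspect the factor levels it produces. This is a small sketch, assuming ggplot2 and dplyr are installed; ordered_class and p_horizontal are names introduced here for illustration. It also shows an alternative to coord_flip(): mapping the count to x and the class to y directly.

```r
library(ggplot2)
library(dplyr)

class_summary <- count(mpg, class)

# reorder() returns a factor whose levels are sorted by n, smallest first
ordered_class <- reorder(class_summary$class, class_summary$n)
levels(ordered_class)

# Alternative horizontal chart without coord_flip():
# put the count on x and the reordered class on y
p_horizontal <- ggplot(class_summary, aes(x = n, y = reorder(class, n))) +
  geom_col() +
  labs(x = "Count", y = "Class")
p_horizontal
```

Because the first level is drawn at the bottom of a vertical axis, the class with the most cars (suv) ends up at the top of the horizontal chart.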
All three approaches below produce visually similar plots of hwy against displ,
restricted to displacement values between 2 and 6.
Create the three versions of the plot using: (a) filter(), (b)
scale_x_continuous(limits = c(2, 6)), (c) coord_cartesian(xlim = c(2, 6)).
mpg |>
filter(displ >= 2, displ <= 6) |>
ggplot(aes(x = displ, y = hwy)) +
geom_point() +
labs(x = "Displacement (L)", y = "Highway MPG",
title = "(a) filter()")
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_x_continuous(limits = c(2, 6)) +
labs(x = "Displacement (L)", y = "Highway MPG",
title = "(b) scale_x_continuous()")
## Warning: Removed 27 rows containing missing values or values outside
## the scale range (`geom_point()`).
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
coord_cartesian(xlim = c(2, 6)) +
labs(x = "Displacement (L)", y = "Highway MPG",
title = "(c) coord_cartesian()")
Add geom_smooth() to each of the three plots from Q6. Do the smooth lines
differ between (a) and (c)? Why?
mpg |>
filter(displ >= 2, displ <= 6) |>
ggplot(aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth() +
labs(x = "Displacement (L)", y = "Highway MPG",
title = "(a) filter() + smooth")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth() +
coord_cartesian(xlim = c(2, 6)) +
labs(x = "Displacement (L)", y = "Highway MPG",
title = "(c) coord_cartesian() + smooth")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The smooth lines can differ. With filter(), geom_smooth() is fitted using only the subset (displacement between 2 and 6). With coord_cartesian(), the smooth is fitted to all observations and the plot is then zoomed in, so the curve is informed by the full data range, which can change its shape near the edges of the visible region. Version (b) behaves like (a): scale limits drop out-of-range observations (hence the warning) before the smooth is computed.
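One way to confirm which observations each smooth is fitted to is to look at the x range over which ggplot2 evaluates the fitted curve. The sketch below assumes ggplot2 and dplyr are installed; the method and formula are stated explicitly only to silence the informational message, and the object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# (a) filter first: the smooth only ever sees the subset
p_filter <- mpg |>
  filter(displ >= 2, displ <= 6) |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x)

# (c) zoom only: the smooth is fitted to every row of mpg
p_zoom <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x) +
  coord_cartesian(xlim = c(2, 6))

# The fitted curve is evaluated across the range of the data it was given
range_filter <- range(layer_data(p_filter)$x)
range_zoom   <- range(layer_data(p_zoom)$x)

range_filter  # stays inside [2, 6]
range_zoom    # extends beyond 2 and 6, even though the view is cropped
```

The curve in (c) exists outside the visible window; coord_cartesian() simply hides that part of it.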
In the notes (Section on stat_summary()), we saw that the same plot can be
produced by either computing group means with dplyr first, or letting
stat_summary() compute them during plotting.
Using stat_summary(), create a bar chart showing the mean highway MPG
(hwy) for each vehicle class in mpg. Do not pre-compute the means
with dplyr.
ggplot(mpg, aes(x = class, y = hwy)) +
stat_summary(fun = mean, geom = "bar") +
labs(x = "Class", y = "Mean Highway MPG")
Using mpg, create a jitter plot of displ (\(y\)-axis) by drv
(\(x\)-axis). Overlay the group means as large red points using
stat_summary(fun = mean, geom = "point").
ggplot(mpg, aes(x = drv, y = displ)) +
geom_jitter(alpha = 0.3, width = 0.2) +
stat_summary(fun = mean, geom = "point",
colour = "red", size = 4) +
labs(x = "Drive Type", y = "Engine Displacement (L)")
Achieve the same result as Q9 using group_by() and summarise() to
compute the group means first, then adding a second geom_point() layer
with the summary data frame. Do the red points appear in the same
positions?
drv_means <- mpg |>
group_by(drv) |>
summarise(mean_displ = mean(displ))
ggplot() +
geom_jitter(data = mpg,
aes(x = drv, y = displ),
alpha = 0.3, width = 0.2) +
geom_point(data = drv_means,
aes(x = drv, y = mean_displ),
colour = "red", size = 4) +
labs(x = "Drive Type", y = "Engine Displacement (L)")
Yes — the red points appear in exactly the same positions. Both approaches compute the same group means; they differ only in when: stat_summary() does it during plotting, while group_by() + summarise() does it beforehand.
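The claim that both approaches give the same means can be checked directly. This is a minimal sketch, assuming ggplot2 (version 3.3.0 or later, for the fun argument) and dplyr are installed; it compares the values stat_summary() computes with those from summarise().

```r
library(ggplot2)
library(dplyr)

# Approach A: compute the group means beforehand
drv_means <- mpg |>
  group_by(drv) |>
  summarise(mean_displ = mean(displ))

# Approach B: let stat_summary() compute them during plotting
p <- ggplot(mpg, aes(x = drv, y = displ)) +
  stat_summary(fun = mean, geom = "point")

# Extract the computed points, ordered by position on the x-axis
computed <- layer_data(p)
means_from_plot <- computed$y[order(computed$x)]

all.equal(means_from_plot, drv_means$mean_displ)
```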
Which approach would you prefer if you also needed the group means in a separate summary table? Which would you prefer for a quick exploratory plot?
If you need the group means elsewhere (e.g., in a table or for further analysis), the dplyr approach is better: you compute the summary once and can reuse it. For a quick exploratory plot where the summary is only needed for the visualisation, stat_summary() is cleaner and more concise.
Create a histogram of displ from mpg using geom_histogram(binwidth = 0.5).
Overlay geom_density() without any modification. Describe what you observe.
ggplot(mpg, aes(x = displ)) +
geom_histogram(binwidth = 0.5) +
geom_density() +
labs(x = "Engine Displacement (L)", y = "Count")
The density curve appears as a nearly flat line near zero. The histogram \(y\)-axis is in count units (up to around 40), while density values are much smaller (below 0.5). The two geoms are on completely different scales.
Fix the overlay from Q12 using after_stat(density). Does the density
curve now align with the histogram?
ggplot(mpg, aes(x = displ)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 0.5) +
geom_density() +
labs(x = "Engine Displacement (L)", y = "Density")
Yes — with after_stat(density), the histogram bars are rescaled to density units and the density curve now overlays them on the same scale.
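It may help to see how the density values relate to the counts. The sketch below, assuming ggplot2 is installed, reproduces the density computed by the histogram stat by hand: each count is divided by the total number of observations times the bin width, so the bar areas sum to 1. The names bins and by_hand are illustrative.

```r
library(ggplot2)

binwidth <- 0.5
p <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = binwidth)

# One row per bin, including the computed variables count and density
bins <- layer_data(p)

# density = count / (n * binwidth)
by_hand <- bins$count / (sum(bins$count) * binwidth)
all.equal(bins$density, by_hand)

# Total area of the bars on the density scale
total_area <- sum(bins$density * binwidth)
total_area
```

Since geom_density() also integrates to 1, the two layers share a scale once the histogram uses after_stat(density).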
Looking at the histogram + density from Q13, does displ appear
approximately normally distributed? What features suggest it might not be?
displ does not appear normally distributed. The distribution is right-skewed with a long tail towards larger values, and possibly bimodal (two peaks around 2 and 3.5 litres). A symmetric bell shape would be expected for normality.
Create a Q-Q plot of hwy (highway miles per gallon) from mpg. Does
highway MPG appear normally distributed? Describe any departures from the
reference line.
ggplot(mpg, aes(sample = hwy)) +
geom_qq() +
geom_qq_line(linewidth = 1) +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles",
title = "Q-Q plot of highway MPG")
The points deviate from the reference line, particularly in the upper tail (curving upward), suggesting heavier right tails than a normal distribution. The distribution of hwy is right-skewed.
Simulate 200 values from the standard normal distribution using
set.seed(42); rnorm(200). First, plot a histogram with density overlay
(as in Q13). Then, create a Q-Q plot. Which diagnostic is more informative
about the tails?
set.seed(42)
sim_data <- data.frame(x = rnorm(200))
ggplot(sim_data, aes(x = x)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 0.4) +
geom_density(linewidth = 1) +
labs(x = "x", y = "Density", title = "Histogram + density")
ggplot(sim_data, aes(sample = x)) +
geom_qq() +
geom_qq_line(linewidth = 1) +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles",
title = "Q-Q plot")
Both look reasonable for normally distributed data. However, the Q-Q plot is more informative about the tails: even for genuinely normal data, the tail bars in a histogram are short and hard to evaluate. The Q-Q plot makes tail behaviour explicit — points hugging the reference line throughout (including the extremes) confirm that the tails conform to normality.
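To make explicit what the Q-Q plot is showing, the points can be rebuilt by hand: the sorted sample values are plotted against normal quantiles at evenly spread probabilities. This is a sketch under the default settings of geom_qq() (normal distribution, probabilities from ppoints()), assuming ggplot2 is installed; by_hand is a name introduced here.

```r
library(ggplot2)

set.seed(42)
sim_data <- data.frame(x = rnorm(200))

p <- ggplot(sim_data, aes(sample = x)) + geom_qq()
qq <- layer_data(p)

# Manual construction: sorted data against theoretical normal quantiles
by_hand <- data.frame(
  theoretical = qnorm(ppoints(nrow(sim_data))),
  sample      = sort(sim_data$x)
)

all.equal(qq$theoretical, by_hand$theoretical)
all.equal(qq$sample, by_hand$sample)
```

Every observation gets its own point, including the most extreme ones, which is why the tails are easier to judge here than in a histogram.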
| Visualisation | Approach A: dplyr first | Approach B: ggplot2 built-in |
|---|---|---|
| Bar chart of counts | count() \(\to\) geom_col() | geom_bar() |
| Subset a scatterplot | filter() | scale_*_continuous(limits=) or coord_cartesian() |
| Raw data \(+\) group means | group_by() \(+\) summarise() | stat_summary(fun = mean, ...) |
| Histogram \(+\) density | (not straightforward) | geom_histogram(aes(y = after_stat(density))) \(+\) geom_density() |
| Q-Q plot | (not available) | geom_qq() \(+\) geom_qq_line() |
By completing Practicals 1–6, you should be able to create polished and informative visualisations and understand the data manipulation steps behind them. Use this checklist to review your understanding.
- geom_point() for two continuous variables
- geom_line() for trends over time
- geom_bar() for counts, geom_col() for values
- geom_histogram() for distributions
- geom_boxplot() for comparing distributions
- colour (outlines/points) and fill (interiors)
- aes(colour = ...) or aes(fill = ...)
- scale_*_brewer(), scale_*_viridis_d()
- scale_*_gradient(), scale_*_viridis_c()
- scale_*_fermenter(), scale_*_viridis_b()
- scale_x_log10(), scale_y_sqrt(), etc.
- scale_*_continuous(limits = ...)

(readr and readxl) (Practical 4)

- readr::read_csv()
- readxl::read_excel()
- skip, col_names, sheet, range

| Function | Package | File type |
|---|---|---|
| read_csv() | readr | Comma-separated values (.csv) |
| read_excel() | readxl | Excel workbook (.xlsx / .xls) |
library(readr)
data <- read_csv("myfile.csv")
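The skip and col_names arguments from the checklist are easiest to understand on a small file. The following sketch writes a temporary CSV with a notes line above the header (a hypothetical layout, invented here for illustration) and reads it back two ways. It assumes readr 2.0 or later, where show_col_types is available.

```r
library(readr)

# Hypothetical file: one line of notes above the real header row
path <- tempfile(fileext = ".csv")
writeLines(
  c("Exported from a hypothetical system", "id,score", "1,10", "2,20"),
  path
)

# skip = 1 ignores the notes line, so "id,score" is used as the header
scores <- read_csv(path, skip = 1, show_col_types = FALSE)
scores

# skip = 2 also drops the header; col_names supplies replacement names
renamed <- read_csv(path, skip = 2, col_names = c("id", "value"),
                    show_col_types = FALSE)
renamed
```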
- labs(colour = "Nice Title")
- scales::label_comma()
- scale_*_continuous(breaks = ...)
- theme(legend.position = ...)
- theme_bw(), theme_minimal(), theme_classic()
- p <- ggplot(...) + ...
- p + theme_bw()
- last_plot() for interactive exploration (console only, not in R Markdown)

dplyr (Chapter 7, Practical 6)

- dplyr functions: filter(), mutate(), count(), arrange(), group_by(), summarise()
- geom_bar() (counts automatically) vs geom_col() (uses values as-is)
- filter(), scale_*_continuous(limits=), and coord_cartesian() for subsetting data
- stat_summary(fun = ...) or a separate data frame
- after_stat(density)
- geom_qq() and geom_qq_line()

| Function | What it does | Closest Excel operation |
|---|---|---|
| filter() | Keep rows matching a condition | AutoFilter |
| mutate() | Create or modify columns | Adding a formula to a new column |
| count() | Count occurrences of each value | COUNTIF, or a pivot table |
| arrange() | Sort rows | Data \(\to\) Sort |
| group_by() | Group data for subsequent operations | Pivot table grouping |
| summarise() | Compute summary statistics per group | Pivot table values |
Important: Always load dplyr before using these functions. If you don’t,
R may silently use a different filter() from the stats package, which
behaves completely differently and causes confusing errors.
library(dplyr) # Load first!
library(ggplot2)
# `data` stands for your own data frame with year, count, and total columns
data |>
filter(year >= 2000) |>
mutate(rate = count / total * 100) |>
ggplot(aes(x = year, y = rate)) +
geom_line()
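The masking issue described above can be seen by calling each filter() through its namespace, which removes any ambiguity about which one runs. This is a small illustrative sketch, assuming dplyr is installed; the object names are introduced here.

```r
# dplyr::filter() keeps the rows of a data frame that match a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() is unrelated: it applies a linear filter to a series,
# here a three-point moving average of 1, 2, 3, 4, 5
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Writing dplyr::filter() explicitly is a reasonable safeguard in scripts where the order of library() calls is not under your control.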
Data visualisation is one step in a larger workflow. A polished, informative plot requires choosing the right geom(s) for the data and message, and working through the pipeline:

- importing the data (readr, readxl)
- preparing the data (dplyr, tidyr)
- plotting with ggplot2

For each visualisation, also ensure:

- categorical variables are treated as discrete where needed (factor(x))
- a clean, consistent theme is applied (theme_minimal())

Mastering data visualisation means mastering the entire pipeline, not just the plotting step.