1 Introduction

This practical reinforces Chapter 7 of the notes, exploring two complementary approaches to the same visualisation: manipulating data explicitly with dplyr, or using built-in functions in ggplot2. You will practise:

Counting with dplyr::count() vs letting geom_bar() count automatically
Subsetting data with filter(), scale_*_continuous(), and coord_cartesian()
Overlaying group summaries using stat_summary() vs a separate data frame
Overlaying a density curve on a histogram using after_stat(density)
Diagnosing normality with Q-Q plots

Throughout this practical we use the mpg dataset (shipped with ggplot2) and the mtcars dataset (built into R).

2 Bar charts and counting

```
library(ggplot2)  # provides ggplot(), the geoms, and the mpg dataset

# geom_bar() automatically counts from raw data
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Count")
```

Using mtcars, create a bar chart of gear (number of gears) using geom_bar(). Add appropriate axis labels.
```
ggplot(mtcars, aes(x = factor(gear))) +
  geom_bar() +
  labs(x = "Number of Gears", y = "Count")
```

Use count() to compute the number of cars with each value of gear from mtcars. Store this as gear_counts, then recreate the same bar chart using geom_col(). Verify the two charts look identical.

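```
library(dplyr)  # provides count() and rename()

# One row per gear value, with the number of cars in each
gear_counts <- mtcars |>
  count(gear) |>
  rename(gears = gear, count = n)

# geom_col() uses the pre-computed counts as bar heights
ggplot(gear_counts, aes(x = factor(gears), y = count)) +
  geom_col() +
  labs(x = "Number of Gears", y = "Count")
```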

What happens if you supply gear_counts to geom_bar() without stat = "identity"? Demonstrate this and explain why the bars all have height 1.
```
ggplot(gear_counts, aes(x = factor(gears))) +
  geom_bar() +
  labs(x = "Number of Gears", y = "Count (wrong!)")
```
geom_bar() counts the rows of the data frame passed to it. gear_counts has one row per gear value (3 rows), so each bar has height 1. To use pre-computed counts, switch to geom_col(), or add stat = "identity" to geom_bar().

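
The difference can also be checked numerically. The sketch below is one possible way to do so: it uses layer_data(), a ggplot2 helper that returns the data a layer would draw, to read off the bar heights. The object names p_raw and p_precounted are illustrative only.

```r
library(ggplot2)
library(dplyr)

gear_counts <- mtcars |>
  count(gear) |>
  rename(gears = gear, count = n)

# geom_bar() on the raw data: stat_count() tallies the rows per gear value
p_raw <- ggplot(mtcars, aes(x = factor(gear))) + geom_bar()

# geom_bar() on the pre-counted table: each gear value occupies a single row
p_precounted <- ggplot(gear_counts, aes(x = factor(gears))) + geom_bar()

raw_heights <- layer_data(p_raw)$count
precounted_heights <- layer_data(p_precounted)$count

raw_heights         # agrees with gear_counts$count
precounted_heights  # one row per gear value, so every height is 1
```
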
Using mpg, create a horizontal bar chart showing the count of cars for each manufacturer. Which manufacturer has the most cars in the dataset?

Hint: Map manufacturer to the y aesthetic (not x) for a horizontal chart.
```
ggplot(mpg, aes(y = manufacturer)) +
  geom_bar() +
  labs(x = "Count", y = "Manufacturer")
```
```
# Dodge has the most cars (37), followed by Toyota (34) and Volkswagen (27)
```
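
Reading the longest bar off a chart can be error-prone, so it may help to confirm the ranking with a table. A minimal sketch, assuming dplyr is available; manufacturer_counts is an illustrative name:

```r
library(ggplot2)  # for the mpg dataset
library(dplyr)

# sort = TRUE orders the result so that the largest count comes first
manufacturer_counts <- mpg |>
  count(manufacturer, sort = TRUE)

head(manufacturer_counts, 3)
```

The first row of the result is the manufacturer with the most cars in the dataset.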

The following code creates a pre-counted summary. Use geom_col() to create a bar chart, ordered from most to fewest cars per class, displayed horizontally.

Hint: Use reorder(class, n) inside aes() to order the bars.

```
class_summary <- mpg |>
  count(class) |>
  arrange(desc(n))
class_summary
```

```
## # A tibble: 7 × 2
##   class          n
##   <chr>      <int>
## 1 suv           62
## 2 compact       47
## 3 midsize       41
## 4 subcompact    35
## 5 pickup        33
## 6 minivan       11
## 7 2seater        5
```

```
ggplot(class_summary, aes(x = reorder(class, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Class", y = "Count")
```
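
As an alternative to coord_flip(), the count can be mapped to x and the reordered class to y, in the same spirit as the hint for the manufacturer chart. This is a sketch rather than the intended solution, and it assumes ggplot2 3.3.0 or later, where geoms infer their orientation from the aesthetics; class_order and p_horizontal are illustrative names.

```r
library(ggplot2)
library(dplyr)

class_summary <- mpg |>
  count(class) |>
  arrange(desc(n))

# reorder() sorts the factor levels by n, smallest first, so the
# largest class is drawn at the top of the y-axis
class_order <- levels(reorder(class_summary$class, class_summary$n))
class_order

p_horizontal <- ggplot(class_summary, aes(x = n, y = reorder(class, n))) +
  geom_col() +
  labs(x = "Count", y = "Class")
p_horizontal
```
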

3 Subsetting data

All three approaches below produce visually similar plots of displ vs hwy restricted to displacement values between 2 and 6.

Create the three versions of the plot using: (a) filter(), (b) scale_x_continuous(limits = c(2, 6)), (c) coord_cartesian(xlim = c(2, 6)).

```
mpg |>
  filter(displ >= 2, displ <= 6) |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Displacement (L)", y = "Highway MPG",
       title = "(a) filter()")
```

```
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_continuous(limits = c(2, 6)) +
  labs(x = "Displacement (L)", y = "Highway MPG",
       title = "(b) scale_x_continuous()")
```

```
## Warning: Removed 27 rows containing missing values or values outside
## the scale range (`geom_point()`).
```

```
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(2, 6)) +
  labs(x = "Displacement (L)", y = "Highway MPG",
       title = "(c) coord_cartesian()")
```

Add geom_smooth() to each of the three plots from Q6. Do the smooth lines differ between (a) and (c)? Why?

```
mpg |>
  filter(displ >= 2, displ <= 6) |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Displacement (L)", y = "Highway MPG",
       title = "(a) filter() + smooth")
```

```
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
```

```
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  coord_cartesian(xlim = c(2, 6)) +
  labs(x = "Displacement (L)", y = "Highway MPG",
       title = "(c) coord_cartesian() + smooth")
```

```
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
```

The smooth lines can differ. With filter(), geom_smooth() is fitted using only the subset (displacement between 2 and 6). With coord_cartesian(), the smooth is fitted using all observations and the plot is then zoomed in, so the curve is informed by the full data range, which can change its shape near the edges of the visible region. Version (b) behaves like (a): scale_x_continuous(limits = c(2, 6)) discards the out-of-range points before any statistics are computed, as the warning in Q6 indicates.
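
To see which observations each smooth is based on, the fitted curves can be extracted with layer_data(). The sketch below is illustrative and leaves out the points layer; the names base_smooth, p_a, p_b, p_c and fit_* are not from the notes, and the loess method is specified explicitly to match the default reported in the messages above.

```r
library(ggplot2)
library(dplyr)

base_smooth <- geom_smooth(method = "loess", formula = y ~ x)

p_a <- mpg |>
  filter(displ >= 2, displ <= 6) |>
  ggplot(aes(x = displ, y = hwy)) +
  base_smooth

p_b <- ggplot(mpg, aes(x = displ, y = hwy)) +
  base_smooth +
  scale_x_continuous(limits = c(2, 6))

p_c <- ggplot(mpg, aes(x = displ, y = hwy)) +
  base_smooth +
  coord_cartesian(xlim = c(2, 6))

# layer_data() returns the fitted curve that each plot would draw
fit_a <- layer_data(p_a)
fit_b <- suppressWarnings(layer_data(p_b))  # silences the "Removed rows" warning
fit_c <- layer_data(p_c)

range(fit_a$x)  # curve evaluated over the filtered data only
range(fit_c$x)  # curve evaluated over the full range of displ
isTRUE(all.equal(fit_a$y, fit_b$y))  # do (a) and (b) give the same curve?
```
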

4 Overlaying group summaries

In the notes (Section on stat_summary()), we saw that the same plot can be produced by either computing group means with dplyr first, or letting stat_summary() compute them during plotting.

Using stat_summary(), create a bar chart showing the mean highway MPG (hwy) for each vehicle class in mpg. Do not pre-compute the means with dplyr.
```
ggplot(mpg, aes(x = class, y = hwy)) +
  stat_summary(fun = mean, geom = "bar") +
  labs(x = "Class", y = "Mean Highway MPG")
```

Using mpg, create a jitter plot of displ (\(y\)-axis) by drv (\(x\)-axis). Overlay the group means as large red points using stat_summary(fun = mean, geom = "point").

```
ggplot(mpg, aes(x = drv, y = displ)) +
  geom_jitter(alpha = 0.3, width = 0.2) +
  stat_summary(fun = mean, geom = "point",
               colour = "red", size = 4) +
  labs(x = "Drive Type", y = "Engine Displacement (L)")
```

Achieve the same result as Q9 using group_by() and summarise() to compute the group means first, then adding a second geom_point() layer with the summary data frame. Do the red points appear in the same positions?
```
drv_means <- mpg |>
  group_by(drv) |>
  summarise(mean_displ = mean(displ))

ggplot() +
  geom_jitter(data = mpg,
              aes(x = drv, y = displ),
              alpha = 0.3, width = 0.2) +
  geom_point(data = drv_means,
             aes(x = drv, y = mean_displ),
             colour = "red", size = 4) +
  labs(x = "Drive Type", y = "Engine Displacement (L)")
```
Yes, the red points appear in exactly the same positions. Both approaches compute the same group means; they differ only in when the computation happens: stat_summary() does it during plotting, while group_by() + summarise() does it beforehand.

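
The agreement can be confirmed without comparing the plots by eye. A minimal sketch, using layer_data() to extract the values that stat_summary() computes; p_means and plot_means are illustrative names:

```r
library(ggplot2)
library(dplyr)

drv_means <- mpg |>
  group_by(drv) |>
  summarise(mean_displ = mean(displ))

p_means <- ggplot(mpg, aes(x = drv, y = displ)) +
  stat_summary(fun = mean, geom = "point")

# One value per drive type, in the order the categories appear on the axis
plot_means <- layer_data(p_means)$y

all.equal(plot_means, drv_means$mean_displ)
```
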
Which approach would you prefer if you also needed the group means in a separate summary table? Which would you prefer for a quick exploratory plot?

If you need the group means elsewhere (e.g., in a table or for further analysis), the dplyr approach is better: you compute the summary once and can reuse it. For a quick exploratory plot where the summary is only needed for the visualisation, stat_summary() is cleaner and more concise.

5 Histogram with density overlay

Create a histogram of displ from mpg using geom_histogram(binwidth = 0.5). Overlay geom_density() without any modification. Describe what you observe.
```
ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = 0.5) +
  geom_density() +
  labs(x = "Engine Displacement (L)", y = "Count")
```
The density curve appears as a nearly flat line near zero. The histogram \(y\)-axis is in count units (up to around 40), while density values are much smaller (below 0.5). The two geoms are on completely different scales.

Fix the overlay from Q12 using after_stat(density). Does the density curve now align with the histogram?
```
ggplot(mpg, aes(x = displ)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5) +
  geom_density() +
  labs(x = "Engine Displacement (L)", y = "Density")
```
Yes. With after_stat(density), the histogram bars are rescaled to density units, so that the total area of the bars is 1, and the density curve now overlays them on the same scale.

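
What "density units" means can be made concrete by summing the bar areas. The following sketch reads the binned values back from the plot with layer_data(); the names p_hist, bins and total_area are illustrative.

```r
library(ggplot2)

p_hist <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5)

bins <- layer_data(p_hist)

# Each bar's area is its height (density) times its width
total_area <- sum(bins$density * (bins$xmax - bins$xmin))
total_area

# The raw counts are still computed alongside the densities
sum(bins$count)
```

Because the bar areas sum to 1, the histogram is on the same scale as a density curve, which also integrates to 1.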
Looking at the histogram + density from Q13, does displ appear approximately normally distributed? What features suggest it might not be?

displ does not appear normally distributed. The distribution is right-skewed with a long tail towards larger values, and possibly bimodal (two peaks around 2 and 3.5 litres). A symmetric bell shape would be expected for normality.

6 Q-Q plots

Create a Q-Q plot of hwy (highway miles per gallon) from mpg. Does highway MPG appear normally distributed? Describe any departures from the reference line.
```
ggplot(mpg, aes(sample = hwy)) +
  geom_qq() +
  geom_qq_line(linewidth = 1) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles",
       title = "Q-Q plot of highway MPG")
```
The points deviate from the reference line, particularly in the upper tail (curving upward), suggesting a heavier right tail than a normal distribution would have. The distribution of hwy is right-skewed.

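
To see what a Q-Q plot actually displays, the points can be reconstructed by hand: the sorted sample is paired with the standard normal quantiles at the probabilities given by ppoints(). This is only a sketch of the default behaviour of geom_qq(), not a replacement for it; qq_manual, p_qq and qq_ggplot are illustrative names.

```r
library(ggplot2)
library(dplyr)

qq_manual <- mpg |>
  arrange(hwy) |>                               # sample quantiles: the sorted data
  mutate(theoretical = qnorm(ppoints(n()))) |>  # matching standard normal quantiles
  select(theoretical, hwy)

p_qq <- ggplot(mpg, aes(sample = hwy)) + geom_qq()
qq_ggplot <- layer_data(p_qq)

# Both comparisons should report TRUE if the reconstruction matches geom_qq()
all.equal(qq_manual$theoretical, qq_ggplot$x)
all.equal(as.numeric(qq_manual$hwy), as.numeric(qq_ggplot$y))
```
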
Simulate 200 values from the standard normal distribution using set.seed(42); rnorm(200). First, plot a histogram with density overlay (as in Q13). Then, create a Q-Q plot. Which diagnostic is more informative about the tails?
```
set.seed(42)
sim_data <- data.frame(x = rnorm(200))

ggplot(sim_data, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.4) +
  geom_density(linewidth = 1) +
  labs(x = "x", y = "Density", title = "Histogram + density")
```
```
ggplot(sim_data, aes(sample = x)) +
  geom_qq() +
  geom_qq_line(linewidth = 1) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles",
       title = "Q-Q plot")
```
Both look reasonable for normally distributed data. However, the Q-Q plot is more informative about the tails: even for genuinely normal data, the tail bars in a histogram are short and hard to evaluate. The Q-Q plot makes tail behaviour explicit: points hugging the reference line throughout, including at the extremes, indicate that the tails are consistent with normality.

7 Summary: This practical

| Visualisation | Approach A: dplyr first | Approach B: ggplot2 built-in |
|---|---|---|
| Bar chart of counts | `count()` \(\to\) `geom_col()` | `geom_bar()` |
| Subset a scatterplot | `filter()` | `scale_*_continuous(limits=)` or `coord_cartesian()` |
| Raw data \(+\) group means | `group_by()` \(+\) `summarise()` | `stat_summary(fun = mean, ...)` |
| Histogram \(+\) density | (not straightforward) | `geom_histogram(aes(y = after_stat(density)))` \(+\) `geom_density()` |
| Q-Q plot | (not available) | `geom_qq()` \(+\) `geom_qq_line()` |

8 Checklist

By completing Practicals 1–6, you should be able to create polished and informative visualisations and understand the data manipulation steps behind them. Use this checklist to review your understanding.

8.1 Basic geoms (Chapter 3, Practical 3)

Choose appropriate geoms for your data type:
- Scatterplots with geom_point() for two continuous variables
- Line plots with geom_line() for trends over time
- Bar charts with geom_bar() for counts, geom_col() for values
- Histograms with geom_histogram() for distributions
- Boxplots with geom_boxplot() for comparing distributions
Combine multiple geoms in one plot (e.g., points + smooth line)
Understand the difference between colour (outlines/points) and fill (interiors)

8.2 Colours (Chapter 4, Practical 4)

Map variables to colour using aes(colour = ...) or aes(fill = ...)
Choose appropriate colour scales:
- Discrete: scale_*_brewer(), scale_*_viridis_d()
- Continuous: scale_*_gradient(), scale_*_viridis_c()
- Binned: scale_*_fermenter(), scale_*_viridis_b()
Use colourblind-friendly palettes (viridis, Okabe-Ito)
Recognise errors from mismatched scale types
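
As an illustration of a mismatched scale type, the sketch below maps a continuous variable (cty) to colour but supplies a discrete Brewer scale. The names p_mismatch, result and p_fixed are illustrative, and the fix shown is only one of several possibilities.

```r
library(ggplot2)

# cty is continuous, but scale_colour_brewer() is a discrete scale
p_mismatch <- ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point() +
  scale_colour_brewer(palette = "Blues")

# The error is raised only when the plot is built, e.g. when it is printed
result <- tryCatch(ggplot_build(p_mismatch), error = function(e) e)
inherits(result, "error")
conditionMessage(result)

# One possible fix: a binned Brewer scale, which accepts continuous values
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point() +
  scale_colour_fermenter(palette = "Blues")
p_fixed
```
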

8.3 Other scales (Chapter 5, Practical 4)

Map variables to shape, size, alpha, and linetype
Use redundant encoding (e.g., colour + shape) for accessibility
Transform axes with scale_x_log10(), scale_y_sqrt(), etc.
Control axis limits with scale_*_continuous(limits = ...)

8.4 Reading external data (`readr` and `readxl`) (Practical 4)

Read CSV files with readr::read_csv()
Read Excel files with readxl::read_excel()
Handle common arguments: skip, col_names, sheet, range

| Function | Package | File type |
|---|---|---|
| `read_csv()` | `readr` | Comma-separated values (.csv) |
| `read_excel()` | `readxl` | Excel workbook (.xlsx / .xls) |

```
library(readr)
data <- read_csv("myfile.csv")
```
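
The skip argument is easiest to understand on a file with a stray first line. The sketch below writes a small, entirely hypothetical file to a temporary location and reads it back; the file contents and the names csv_path and dat are made up for illustration.

```r
library(readr)

# Hypothetical file contents, written to a temporary file for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Exported from an imaginary system",  # a non-data first line to skip
    "year,count",
    "2000,5",
    "2001,8"),
  csv_path
)

# skip = 1 ignores the first line, so the header is read from the second
dat <- read_csv(csv_path, skip = 1, show_col_types = FALSE)
dat
```
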

8.5 Cosmetics (Chapter 6, Practical 5)

Fix legend titles with labs(colour = "Nice Title")
Avoid scientific notation with scales::label_comma()
Control axis breaks with scale_*_continuous(breaks = ...)
Position legends with theme(legend.position = ...)
Apply built-in themes: theme_bw(), theme_minimal(), theme_classic()
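
The cosmetic steps above can be combined in a single plot. This is a sketch on made-up data: the sales data frame and its values are hypothetical, chosen only so that the y-axis would otherwise show large numbers.

```r
library(ggplot2)

# Hypothetical data with large values, for illustration only
sales <- data.frame(
  year = c(2021, 2022, 2023, 2021, 2022, 2023),
  revenue = c(1200000, 1500000, 1900000, 800000, 950000, 1100000),
  region = factor(c("a", "a", "a", "b", "b", "b"))
)

p_polished <- ggplot(sales, aes(x = year, y = revenue, colour = region)) +
  geom_line() +
  scale_x_continuous(breaks = 2021:2023) +              # control axis breaks
  scale_y_continuous(labels = scales::label_comma()) +  # avoid scientific notation
  labs(x = "Year", y = "Revenue", colour = "Sales region") +
  theme_minimal() +
  theme(legend.position = "bottom")
p_polished

# label_comma() returns a function that formats numbers with commas
scales::label_comma()(1200000)
```
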

8.6 Building plots incrementally (Practical 5)

Save plots as objects: p <- ggplot(...) + ...
Add layers or scales later: p + theme_bw()
Use last_plot() for interactive exploration (console only, not in R Markdown)
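
A short sketch of the incremental style, with illustrative object names: each + returns a new plot object, so the saved plot p is left unchanged and can be reused.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Adding to a saved plot creates a new object; p itself is not modified
p_smooth <- p + geom_smooth(method = "lm", formula = y ~ x)
p_themed <- p_smooth + theme_bw()

length(p$layers)         # the original plot still has only the points layer
length(p_smooth$layers)  # points plus the smooth
p_themed
```
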

8.7 Data manipulation with `dplyr` (Chapter 7, Practical 6)

Know the six key dplyr functions: filter(), mutate(), count(), arrange(), group_by(), summarise()
Understand geom_bar() (counts automatically) vs geom_col() (uses values as-is)
Compare filter(), scale_*_continuous(limits=), and coord_cartesian() for subsetting data
Overlay group summaries on raw data using stat_summary(fun = ...) or a separate data frame
Overlay a density curve on a histogram using after_stat(density)
Create Q-Q plots using geom_qq() and geom_qq_line()

| Function | What it does | Closest Excel operation |
|---|---|---|
| `filter()` | Keep rows matching a condition | AutoFilter |
| `mutate()` | Create or modify columns | Adding a formula to a new column |
| `count()` | Count occurrences of each value | COUNTIF, or a pivot table |
| `arrange()` | Sort rows | Data \(\to\) Sort |
| `group_by()` | Group data for subsequent operations | Pivot table grouping |
| `summarise()` | Compute summary statistics per group | Pivot table values |

Important: Always load dplyr before using these functions. If you don’t, R may silently use a different filter() from the stats package, which behaves completely differently and causes confusing errors.

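```
library(dplyr)    # Load first!
library(ggplot2)  # needed for ggplot() and geom_line()

# `data` is a placeholder for a data frame with columns year, count and total
data |>
  filter(year >= 2000) |>
  mutate(rate = count / total * 100) |>
  ggplot(aes(x = year, y = rate)) +
  geom_line()
```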
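
To illustrate how different the two filter() functions are, the sketch below calls each one with an explicit package prefix, which is one way to avoid the ambiguity altogether. The names moving_avg and six_cyl are illustrative.

```r
# stats::filter() applies a linear filter to a series; here, a 3-point moving average
moving_avg <- as.numeric(stats::filter(1:5, rep(1 / 3, 3)))
moving_avg

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
six_cyl <- dplyr::filter(mtcars, cyl == 6)
nrow(six_cyl)
```

Writing dplyr::filter() works whether or not dplyr has been attached with library(), at the cost of slightly more typing.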

8.8 The data science pipeline

Data visualisation is one step in a larger workflow. A polished, informative plot requires:

Choosing the right geom(s) for the data and message, and working through the pipeline:

Read data from files (readr, readxl)
Clean and transform data (dplyr, tidyr)
Visualise with ggplot2

For each visualisation, also ensure:

Clear axis labels with units where applicable
Informative legend title (not factor(x))
Colourblind-friendly palette (e.g., viridis)
Readable scales (no unnecessary scientific notation)
Clean theme (e.g., theme_minimal())
Title or caption if needed for context

Mastering data visualisation means mastering the entire pipeline, not just the plotting step.

MAS2908 - Practical 06 (Solutions)

Clement Lee

Semester 2, 2025/2026
