1 Introduction

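This practical reinforces Chapter 7 of the notes, exploring two complementary approaches to the same visualisation: manipulating data explicitly with dplyr, or using built-in functions in ggplot2. You will practise bar charts and counting, subsetting data, overlaying group summaries, histograms with density overlays, and Q-Q plots.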

Throughout this practical we use the mtcars dataset, which is built into R, and the mpg dataset, which ships with ggplot2.
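A minimal setup sketch, assuming ggplot2 and dplyr are installed. Loading ggplot2 makes mpg available, and loading dplyr provides count(), filter(), group_by(), and summarise() for the later questions.

```r
library(ggplot2)  # plotting functions and the mpg dataset
library(dplyr)    # count(), filter(), group_by(), summarise()

# Quick look at the two datasets used throughout
dim(mtcars)  # 32 rows, 11 columns
dim(mpg)     # 234 rows, 11 columns
head(mpg, 3)
```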

2 Bar charts and counting

# geom_bar() automatically counts from raw data
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Count")

  1. Using mtcars, create a bar chart of gear (number of gears) using geom_bar(). Add appropriate axis labels.

    ggplot(mtcars, aes(x = factor(gear))) +
      geom_bar() +
      labs(x = "Number of Gears", y = "Count")

  2. Use count() to compute the number of cars with each value of gear from mtcars. Store this as gear_counts, then recreate the same bar chart using geom_col(). Verify the two charts look identical.

    gear_counts <- mtcars |>
      count(gear) |>
      rename(gears = gear, count = n)
    
    ggplot(gear_counts, aes(x = factor(gears), y = count)) +
      geom_col() +
      labs(x = "Number of Gears", y = "Count")

  3. What happens if you supply gear_counts to geom_bar() without stat = "identity"? Demonstrate this and explain why the bars all have height 1.

    ggplot(gear_counts, aes(x = factor(gears))) +
      geom_bar() +
      labs(x = "Number of Gears", y = "Count (wrong!)")

    geom_bar() counts the rows of the data frame passed to it. gear_counts has one row per gear value (3 rows), so each bar has height 1. To use pre-computed counts, switch to geom_col(), or add stat = 'identity' to geom_bar().

  4. Using mpg, create a horizontal bar chart showing the count of cars for each manufacturer. Which manufacturer has the most cars in the dataset?

    Hint: Map manufacturer to the y aesthetic (not x) for a horizontal chart.

    ggplot(mpg, aes(y = manufacturer)) +
      geom_bar() +
      labs(x = "Count", y = "Manufacturer")

    # Dodge has the most cars (37), followed by Toyota (34) and Volkswagen (27)

  5. The following code creates a pre-counted summary. Use geom_col() to create a bar chart, ordered from most to fewest cars per class, displayed horizontally.

    Hint: Use reorder(class, n) inside aes() to order the bars.

    class_summary <- mpg |>
      count(class) |>
      arrange(desc(n))
    class_summary
    ## # A tibble: 7 × 2
    ##   class          n
    ##   <chr>      <int>
    ## 1 suv           62
    ## 2 compact       47
    ## 3 midsize       41
    ## 4 subcompact    35
    ## 5 pickup        33
    ## 6 minivan       11
    ## 7 2seater        5

    ggplot(class_summary, aes(x = reorder(class, n), y = n)) +
      geom_col() +
      coord_flip() +
      labs(x = "Class", y = "Count")
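The explanation in Q3 can be checked numerically. The sketch below uses ggplot2's layer_data() to pull out the bar heights that each layer actually computes. The plot object names (p_bar, p_col, p_wrong) are illustrative and not part of the practical.

```r
library(ggplot2)
library(dplyr)

gear_counts <- mtcars |>
  count(gear) |>
  rename(gears = gear, count = n)

# geom_bar() on raw data: counts the rows for each gear value
p_bar <- ggplot(mtcars, aes(x = factor(gear))) + geom_bar()

# geom_col() on pre-counted data: uses the count column as-is
p_col <- ggplot(gear_counts, aes(x = factor(gears), y = count)) + geom_col()

# geom_bar() on pre-counted data: counts the single row per gear value
p_wrong <- ggplot(gear_counts, aes(x = factor(gears))) + geom_bar()

layer_data(p_bar)$count    # 15 12 5
layer_data(p_col)$y        # 15 12 5
layer_data(p_wrong)$count  # 1 1 1
```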

3 Subsetting data

All three approaches below produce visually similar plots of displ vs hwy restricted to displacement values between 2 and 6.

  6. Create the three versions of the plot using: (a) filter(), (b) scale_x_continuous(limits = c(2, 6)), (c) coord_cartesian(xlim = c(2, 6)).

    mpg |>
      filter(displ >= 2, displ <= 6) |>
      ggplot(aes(x = displ, y = hwy)) +
      geom_point() +
      labs(x = "Displacement (L)", y = "Highway MPG",
           title = "(a) filter()")

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point() +
      scale_x_continuous(limits = c(2, 6)) +
      labs(x = "Displacement (L)", y = "Highway MPG",
           title = "(b) scale_x_continuous()")
    ## Warning: Removed 27 rows containing missing values or values outside
    ## the scale range (`geom_point()`).

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point() +
      coord_cartesian(xlim = c(2, 6)) +
      labs(x = "Displacement (L)", y = "Highway MPG",
           title = "(c) coord_cartesian()")

  7. Add geom_smooth() to each of the three plots from Q6. Do the smooth lines differ between (a) and (c)? Why?

    mpg |>
      filter(displ >= 2, displ <= 6) |>
      ggplot(aes(x = displ, y = hwy)) +
      geom_point() +
      geom_smooth() +
      labs(x = "Displacement (L)", y = "Highway MPG",
           title = "(a) filter() + smooth")
    ## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point() +
      geom_smooth() +
      coord_cartesian(xlim = c(2, 6)) +
      labs(x = "Displacement (L)", y = "Highway MPG",
           title = "(c) coord_cartesian() + smooth")
    ## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

    The smooth lines can differ. With filter(), geom_smooth() is fitted only to the subset (displacement between 2 and 6). With coord_cartesian(), the smooth is fitted to all observations and the plot is then zoomed in, so the curve is informed by the full data range, which can change its shape near the edges of the window. Version (b) behaves like (a): scale limits discard out-of-range points before the smooth is computed.
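To see why (a) and (c) can disagree, the smoother can be fitted by hand. This sketch assumes the default reported in the messages above (method = 'loess') and calls stats::loess() directly with its default settings, once on the filtered rows and once on all rows. The object names are illustrative.

```r
library(ggplot2)  # for the mpg dataset
library(dplyr)

mpg_sub <- mpg |> filter(displ >= 2, displ <= 6)
nrow(mpg)      # 234 rows are smoothed under coord_cartesian()
nrow(mpg_sub)  # 207 rows remain after filter()

fit_all <- loess(hwy ~ displ, data = mpg)      # data behind version (c)
fit_sub <- loess(hwy ~ displ, data = mpg_sub)  # data behind version (a)

# Compare fitted values at the edges and the middle of the visible window
grid <- data.frame(displ = c(2, 4, 6))
pred_all <- predict(fit_all, newdata = grid)
pred_sub <- predict(fit_sub, newdata = grid)

data.frame(displ = grid$displ, all_rows = pred_all, filtered = pred_sub)
```

Any gap between the two columns comes purely from which rows the smoother saw, since the model and its settings are identical.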

4 Overlaying group summaries

In the notes (Section on stat_summary()), we saw that the same plot can be produced by either computing group means with dplyr first, or letting stat_summary() compute them during plotting.

  8. Using stat_summary(), create a bar chart showing the mean highway MPG (hwy) for each vehicle class in mpg. Do not pre-compute the means with dplyr.

    ggplot(mpg, aes(x = class, y = hwy)) +
      stat_summary(fun = mean, geom = "bar") +
      labs(x = "Class", y = "Mean Highway MPG")

  9. Using mpg, create a jitter plot of displ (\(y\)-axis) by drv (\(x\)-axis). Overlay the group means as large red points using stat_summary(fun = mean, geom = "point").

    ggplot(mpg, aes(x = drv, y = displ)) +
      geom_jitter(alpha = 0.3, width = 0.2) +
      stat_summary(fun = mean, geom = "point",
                   colour = "red", size = 4) +
      labs(x = "Drive Type", y = "Engine Displacement (L)")

  10. Achieve the same result as Q9 using group_by() and summarise() to compute the group means first, then adding a second geom_point() layer with the summary data frame. Do the red points appear in the same positions?

    drv_means <- mpg |>
      group_by(drv) |>
      summarise(mean_displ = mean(displ))
    
    ggplot() +
      geom_jitter(data = mpg,
                  aes(x = drv, y = displ),
                  alpha = 0.3, width = 0.2) +
      geom_point(data = drv_means,
                 aes(x = drv, y = mean_displ),
                 colour = "red", size = 4) +
      labs(x = "Drive Type", y = "Engine Displacement (L)")

    Yes — the red points appear in exactly the same positions. Both approaches compute the same group means; they differ only in when: stat_summary() does it during plotting, while group_by() + summarise() does it beforehand.

  11. Which approach would you prefer if you also needed the group means in a separate summary table? Which would you prefer for a quick exploratory plot?

    If you need the group means elsewhere (e.g., in a table or for further analysis), the dplyr approach is better: you compute the summary once and can reuse it. For a quick exploratory plot where the summary is only needed for the visualisation, stat_summary() is cleaner and more concise.
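As a check on the answer to Q10, the means computed inside stat_summary() can be extracted with layer_data() and compared with the dplyr summary. This is a sketch only, and p and ss_means are illustrative names.

```r
library(ggplot2)
library(dplyr)

drv_means <- mpg |>
  group_by(drv) |>
  summarise(mean_displ = mean(displ))

p <- ggplot(mpg, aes(x = drv, y = displ)) +
  geom_jitter(alpha = 0.3, width = 0.2) +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 4)

# Layer 2 is the stat_summary() layer; its y column holds the group means
ss_means <- layer_data(p, 2)
ss_means <- ss_means[order(ss_means$x), ]

# TRUE when both approaches give the same three means
all.equal(as.numeric(ss_means$y), drv_means$mean_displ)
```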

5 Histogram with density overlay

  12. Create a histogram of displ from mpg using geom_histogram(binwidth = 0.5). Overlay geom_density() without any modification. Describe what you observe.

    ggplot(mpg, aes(x = displ)) +
      geom_histogram(binwidth = 0.5) +
      geom_density() +
      labs(x = "Engine Displacement (L)", y = "Count")

    The density curve appears as a nearly flat line near zero. The histogram \(y\)-axis is in count units (up to around 40), while density values are much smaller (below 0.5). The two geoms are on completely different scales.

  13. Fix the overlay from Q12 using after_stat(density). Does the density curve now align with the histogram?

    ggplot(mpg, aes(x = displ)) +
      geom_histogram(aes(y = after_stat(density)), binwidth = 0.5) +
      geom_density() +
      labs(x = "Engine Displacement (L)", y = "Density")

    Yes — with after_stat(density), the histogram bars are rescaled to density units and the density curve now overlays them on the same scale.

  14. Looking at the histogram + density from Q13, does displ appear approximately normally distributed? What features suggest it might not be?

    displ does not appear normally distributed. The distribution is right-skewed with a long tail towards larger values, and possibly bimodal (two peaks around 2 and 3.5 litres). A symmetric bell shape would be expected for normality.
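The rescaling in Q13 amounts to dividing each bar's count by the total count times the bin width, so that the bar areas sum to 1. The sketch below checks this with layer_data(). The names p, h, and bin_width are illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5)

h <- layer_data(p)            # one row per bin
bin_width <- h$xmax - h$xmin  # 0.5 for every bin

# density = count / (total count * bin width)
all.equal(h$density, h$count / (sum(h$count) * bin_width))

# Bar areas sum to 1, which puts the bars on the scale geom_density() uses
sum(h$density * bin_width)
```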

6 Q-Q plots

  15. Create a Q-Q plot of hwy (highway miles per gallon) from mpg. Does highway MPG appear normally distributed? Describe any departures from the reference line.

    ggplot(mpg, aes(sample = hwy)) +
      geom_qq() +
      geom_qq_line(linewidth = 1) +
      labs(x = "Theoretical Quantiles", y = "Sample Quantiles",
           title = "Q-Q plot of highway MPG")

    The points deviate from the reference line, particularly in the upper tail (curving upward), suggesting heavier right tails than a normal distribution. The distribution of hwy is right-skewed.

  16. Simulate 200 values from the standard normal distribution using set.seed(42); rnorm(200). First, plot a histogram with density overlay (as in Q13). Then, create a Q-Q plot. Which diagnostic is more informative about the tails?

    set.seed(42)
    sim_data <- data.frame(x = rnorm(200))
    
    ggplot(sim_data, aes(x = x)) +
      geom_histogram(aes(y = after_stat(density)), binwidth = 0.4) +
      geom_density(linewidth = 1) +
      labs(x = "x", y = "Density", title = "Histogram + density")

    ggplot(sim_data, aes(sample = x)) +
      geom_qq() +
      geom_qq_line(linewidth = 1) +
      labs(x = "Theoretical Quantiles", y = "Sample Quantiles",
           title = "Q-Q plot")

    Both look reasonable for normally distributed data. However, the Q-Q plot is more informative about the tails: even for genuinely normal data, the tail bars in a histogram are short and hard to evaluate. The Q-Q plot makes tail behaviour explicit: points that hug the reference line throughout, including at the extremes, indicate tails consistent with normality.
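For intuition about what a normal Q-Q plot draws, the points can be reconstructed by hand: sorted sample values against normal quantiles at evenly spaced probabilities. The sketch assumes the same probability points as base R's qqnorm(), namely ppoints(n), and a reference line through the first and third quartiles as in base R's qqline(). The variable names are illustrative.

```r
set.seed(42)
x <- rnorm(200)
n <- length(x)

# Points: theoretical normal quantiles vs sorted sample values
qq <- data.frame(
  theoretical = qnorm(ppoints(n)),
  sample      = sort(x)
)

# Reference line through the first and third quartiles
probs     <- c(0.25, 0.75)
slope     <- unname(diff(quantile(x, probs)) / diff(qnorm(probs)))
intercept <- unname(quantile(x, probs[1])) - slope * qnorm(probs[1])

# Largest vertical gap between the points and the line;
# for normal data this is usually largest in the tails
max(abs(qq$sample - (intercept + slope * qq$theoretical)))
```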

7 Summary: This practical

| Visualisation | Approach A: dplyr first | Approach B: ggplot2 built-in |
|---|---|---|
| Bar chart of counts | count() \(\to\) geom_col() | geom_bar() |
| Subset a scatterplot | filter() | scale_*_continuous(limits=) or coord_cartesian() |
| Raw data \(+\) group means | group_by() \(+\) summarise() | stat_summary(fun = mean, ...) |
| Histogram \(+\) density | (not straightforward) | geom_histogram(aes(y = after_stat(density))) \(+\) geom_density() |
| Q-Q plot | (not available) | geom_qq() \(+\) geom_qq_line() |

8 Checklist

By completing Practicals 1–6, you should be able to create polished and informative visualisations and understand the data manipulation steps behind them. Use this checklist to review your understanding.

8.1 Basic geoms (Chapter 3, Practical 3)

  • Choose appropriate geoms for your data type:
    • Scatterplots with geom_point() for two continuous variables
    • Line plots with geom_line() for trends over time
    • Bar charts with geom_bar() for counts, geom_col() for values
    • Histograms with geom_histogram() for distributions
    • Boxplots with geom_boxplot() for comparing distributions
  • Combine multiple geoms in one plot (e.g., points + smooth line)
  • Understand the difference between colour (outlines/points) and fill (interiors)

8.2 Colours (Chapter 4, Practical 4)

  • Map variables to colour using aes(colour = ...) or aes(fill = ...)
  • Choose appropriate colour scales:
    • Discrete: scale_*_brewer(), scale_*_viridis_d()
    • Continuous: scale_*_gradient(), scale_*_viridis_c()
    • Binned: scale_*_fermenter(), scale_*_viridis_b()
  • Use colourblind-friendly palettes (viridis, Okabe-Ito)
  • Recognise errors from mismatched scale types
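The last point is easiest to recognise once you have seen it. In this sketch, a continuous viridis scale is attached to a discrete variable. The error only surfaces when the plot is built, so ggplot_build() is wrapped in tryCatch() to capture it. The object names are illustrative.

```r
library(ggplot2)

p_bad <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_viridis_c()   # continuous scale, but drv is discrete

err <- tryCatch(ggplot_build(p_bad), error = function(e) e)
conditionMessage(err)  # reports a discrete value on a continuous scale

# Matching scale type: discrete variable, discrete scale
p_ok <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_viridis_d()
```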

8.3 Other scales (Chapter 5, Practical 4)

  • Map variables to shape, size, alpha, and linetype
  • Use redundant encoding (e.g., colour + shape) for accessibility
  • Transform axes with scale_x_log10(), scale_y_sqrt(), etc.
  • Control axis limits with scale_*_continuous(limits = ...)

8.4 Reading external data (readr and readxl) (Practical 4)

  • Read CSV files with readr::read_csv()
  • Read Excel files with readxl::read_excel()
  • Handle common arguments: skip, col_names, sheet, range

| Function | Package | File type |
|---|---|---|
| read_csv() | readr | Comma-separated values (.csv) |
| read_excel() | readxl | Excel workbook (.xlsx / .xls) |

library(readr)
data <- read_csv("myfile.csv")
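
A self-contained sketch of the common arguments, assuming readr and readxl are installed. The CSV is a throwaway temporary file written just for this demonstration, and the Excel workbook is the example file bundled with readxl, so no external files are needed.

```r
library(readr)
library(readxl)

# A small CSV whose first line is a note that should be skipped
csv_path <- tempfile(fileext = ".csv")
writeLines(c("exported from a spreadsheet", "year,count", "2000,5", "2001,7"),
           csv_path)
csv_data <- read_csv(csv_path, skip = 1, show_col_types = FALSE)
csv_data

# Example workbook shipped with readxl; pick a sheet and a cell range
xlsx_path <- readxl_example("datasets.xlsx")
excel_sheets(xlsx_path)
xl_data <- read_excel(xlsx_path, sheet = "mtcars", range = "A1:C5")
xl_data
```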

8.5 Cosmetics (Chapter 6, Practical 5)

  • Fix legend titles with labs(colour = "Nice Title")
  • Avoid scientific notation with scales::label_comma()
  • Control axis breaks with scale_*_continuous(breaks = ...)
  • Position legends with theme(legend.position = ...)
  • Apply built-in themes: theme_bw(), theme_minimal(), theme_classic()
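
The sketch below combines these cosmetic steps in one plot. It assumes the txhousing dataset that ships with ggplot2, chosen only because its volume column is large enough to trigger scientific notation by default. The three cities and the axis wording are arbitrary illustrative choices.

```r
library(ggplot2)

tx <- subset(txhousing, city %in% c("Austin", "Dallas", "Houston"))

p <- ggplot(tx, aes(x = date, y = volume, colour = city)) +
  geom_line() +
  labs(x = "Date", y = "Total value of sales", colour = "City") +  # legend title
  scale_y_continuous(labels = scales::label_comma()) +  # 1,000,000 not 1e+06
  scale_x_continuous(breaks = seq(2000, 2015, by = 5)) +  # explicit breaks
  theme_minimal() +                                      # built-in theme
  theme(legend.position = "bottom")                      # legend placement
p
```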

8.6 Building plots incrementally (Practical 5)

  • Save plots as objects: p <- ggplot(...) + ...
  • Add layers or scales later: p + theme_bw()
  • Use last_plot() for interactive exploration (console only, not in R Markdown)
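
A short sketch of the incremental workflow, with illustrative object names. Adding to a saved plot returns a new object and leaves the original untouched, which the layer counts at the end confirm.

```r
library(ggplot2)

# Save a base plot as an object
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Add layers or themes later without retyping the base plot
p_smooth <- p + geom_smooth(method = "lm", formula = y ~ x)
p_styled <- p_smooth + theme_bw()

length(p$layers)         # 1: the original object is unchanged
length(p_smooth$layers)  # 2: points plus the fitted line
```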

8.7 Data manipulation with dplyr (Chapter 7, Practical 6)

  • Know the six key dplyr functions: filter(), mutate(), count(), arrange(), group_by(), summarise()
  • Understand geom_bar() (counts automatically) vs geom_col() (uses values as-is)
  • Compare filter(), scale_*_continuous(limits=), and coord_cartesian() for subsetting data
  • Overlay group summaries on raw data using stat_summary(fun = ...) or a separate data frame
  • Overlay a density curve on a histogram using after_stat(density)
  • Create Q-Q plots using geom_qq() and geom_qq_line()

| Function | What it does | Closest Excel operation |
|---|---|---|
| filter() | Keep rows matching a condition | AutoFilter |
| mutate() | Create or modify columns | Adding a formula to a new column |
| count() | Count occurrences of each value | COUNTIF, or a pivot table |
| arrange() | Sort rows | Data \(\to\) Sort |
| group_by() | Group data for subsequent operations | Pivot table grouping |
| summarise() | Compute summary statistics per group | Pivot table values |

Important: Always load dplyr before using these functions. If you don’t, R may silently use a different filter() from the stats package, which behaves completely differently and causes confusing errors.

library(dplyr)    # Load first!
library(ggplot2)

# `data` stands for any data frame with year, count, and total columns
data |>
  filter(year >= 2000) |>
  mutate(rate = count / total * 100) |>
  ggplot(aes(x = year, y = rate)) +
  geom_line()
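
To make the masking issue concrete, the sketch below calls both functions with explicit package prefixes, so the result does not depend on what is loaded. stats::filter() applies a linear filter to a numeric series (here a 3-point moving average), whereas dplyr::filter() keeps rows of a data frame. It assumes dplyr is installed, and the object names are illustrative.

```r
# stats::filter(): moving average over a numeric vector
ma <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(ma)  # NA at both ends, averages 2, 3, 4 in between

# dplyr::filter(): keep rows matching a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)  # 11
```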

8.8 The data science pipeline

Data visualisation is one step in a larger workflow. A polished, informative plot requires choosing the right geom(s) for the data and message, and working through the pipeline:

  1. Read data from files (readr, readxl)
  2. Clean and transform data (dplyr, tidyr)
  3. Visualise with ggplot2

For each visualisation, also ensure:

  • Clear axis labels with units where applicable
  • Informative legend title (not factor(x))
  • Colourblind-friendly palette (e.g., viridis)
  • Readable scales (no unnecessary scientific notation)
  • Clean theme (e.g., theme_minimal())
  • Title or caption if needed for context

Mastering data visualisation means mastering the entire pipeline, not just the plotting step.