7 Data Manipulation and Statistical Layers

7.1 Introduction

In previous chapters, we built visualisations directly from tidy data frames. In practice, however, data rarely arrives in exactly the form a plot needs: we may need to filter to a relevant subset, create new variables, or count occurrences before plotting. This chapter ties together several loose ends from earlier chapters and practicals, and introduces a few new concepts — Q-Q plots and density overlays — by examining two complementary approaches:

  1. Approach A — Manipulate the data first using dplyr, then pass the result to ggplot2;
  2. Approach B — Let ggplot2 do the work using built-in functions and parameters.

Neither approach is always better. Knowing both gives you flexibility.

7.2 Data manipulation with dplyr

Because data visualisation cannot be done in isolation, we have already used several functions from dplyr throughout the notes and practicals. dplyr is the standard R package for data manipulation and the natural companion to ggplot2 — both are part of the tidyverse family of packages.

While the majority of dplyr’s functionality is beyond the scope of this module, there are six key functions you should know. For some, you only need to be able to read and understand code that uses them; for others, you will be expected to write them yourself.

| Function | What it does | Closest Excel operation |
|---|---|---|
| filter() | Keep rows matching a condition | AutoFilter |
| mutate() | Create or modify columns | Adding a formula to a new column |
| count() | Count occurrences of each value | COUNTIF, or a pivot table |
| arrange() | Sort rows by one or more columns | Data \(\to\) Sort |
| group_by() | Group data for subsequent operations | Pivot table grouping |
| summarise() | Compute summary statistics per group | Pivot table values |

Note: group_by() and summarise() are typically used together. You only need to be able to read code using these two functions — you will not be asked to write them from scratch in this module.

These functions work naturally in |> pipelines. For example:

mpg |>
  filter(cyl == 4) |>        # keep only 4-cylinder cars
  mutate(ratio = hwy / cty)  # create a new column
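
Here is a group_by()/summarise() pipeline of the kind you should be able to read. This is a small sketch using the built-in mtcars data; n() is a dplyr helper that returns the size of each group.

```r
library(dplyr)

# For each number of cylinders, compute the mean mpg and the group size
mtcars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg),
            n_cars = n())
```

group_by() changes nothing visible by itself; it records a grouping that summarise() then uses, producing one row per group.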

7.2.1 Two ways to do things

A recurring theme in this chapter is that the same visualisation can often be achieved in two different ways:

  • Approach A: Manipulate the data explicitly with dplyr, then plot the result.
  • Approach B: Use functions and parameters built into ggplot2 to perform the computation during plotting.

We compare these approaches across several examples below.

7.3 Bar charts and counting

7.3.1 Approach A: count first, then plot

Approach A is to count explicitly using dplyr::count(), then use geom_col() to plot the pre-computed values:

cyl_counts <- mtcars |>
  count(cyl) |>
  rename(cylinders = cyl, count = n)
cyl_counts
##   cylinders count
## 1         4    11
## 2         6     7
## 3         8    14
ggplot(cyl_counts, aes(x = factor(cylinders), y = count)) +
  geom_col() +
  labs(x = "Cylinders", y = "Count")

Figure 7.1: Bar chart from pre-counted data using geom_col().

geom_bar(stat = "identity") is equivalent to geom_col(). Both produce the same result, and both require a y aesthetic (the pre-computed count).
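
The equivalence is easy to check directly. This sketch rebuilds cyl_counts and draws the same bars both ways; either plot reproduces Figure 7.1.

```r
library(dplyr)
library(ggplot2)

cyl_counts <- mtcars |>
  count(cyl) |>
  rename(cylinders = cyl, count = n)

# Two interchangeable layers for pre-counted data
p_col <- ggplot(cyl_counts, aes(x = factor(cylinders), y = count)) +
  geom_col()
p_bar <- ggplot(cyl_counts, aes(x = factor(cylinders), y = count)) +
  geom_bar(stat = "identity")
```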

7.3.2 A common mistake

When working with pre-counted data, a common mistake is to forget to set stat = "identity" and use plain geom_bar() instead:

ggplot(cyl_counts, aes(x = factor(cylinders))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Count")

Figure 7.2: Attempting geom_bar() with pre-counted data — each bar has height 1, not the actual count.

Each bar has height 1 because geom_bar() counts the rows in our summary table (there is one row per cylinder value), not the original observations.

7.3.3 Approach B: let geom_bar() count automatically

The simplest way to plot the count of each category is to pass the raw data directly to geom_bar(), which counts observations automatically:

ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Count")

Figure 7.3: Bar chart from raw data — geom_bar() counts automatically.

7.3.4 The log-scale connection

The automatic counting of geom_bar() is particularly useful when combined with axis transformations. In Section 5.5.3 (Chapter 5), we plotted counts from simulated geometric data, but we had to count the data manually with count() first and then plot the result with geom_point(). With geom_bar(), we can skip the counting step and work directly from the raw data:

set.seed(42)
raw_geom <- data.frame(x = rgeom(10000, prob = 0.5))
ggplot(raw_geom, aes(x = x)) +
  geom_bar() +
  scale_y_log10() +
  labs(x = "x", y = "Count (log scale)")

Figure 7.4: Bar chart of geometric data on a log scale, using geom_bar() without pre-counting.

geom_bar() counts the occurrences of each value of x automatically, so there is no need to call count() first. Adding scale_y_log10() produces the same linearising effect as in Chapter 5: the geometric counts decay exponentially, so they fall on an approximately straight line on the log scale.

7.3.5 What geoms compute for you

Most geoms in ggplot2 compute something before displaying the data. The table below shows what each geom computes. Notice that geom_col() and geom_point() are the exceptions: they use values as-is and require both x and y aesthetics.

| Geom | What it computes |
|---|---|
| geom_bar() | Counts observations per category |
| geom_histogram() | Bins continuous data and counts |
| geom_density() | Computes kernel density estimate |
| geom_smooth() | Fits a smoother (loess or lm) |
| geom_boxplot() | Computes five-number summary |
| geom_col() | Uses values as-is (no computation) |
| geom_point() | Uses values as-is (no computation) |

This is why geom_bar() and geom_histogram() only need an x aesthetic — they compute y (the counts) themselves.

7.4 Subsetting data

Suppose you want to focus on a subset of the data in a plot. There are three common approaches.

7.4.1 Approach A

Use dplyr::filter() to remove unwanted observations before plotting:

mpg |>
  filter(displ >= 2, displ <= 6, hwy >= 15, hwy <= 40) |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine Displacement (L)", y = "Highway MPG")

Figure 7.5: Filtering to a subset with dplyr::filter().

7.4.2 Approach B(i)

Use scale_x_continuous(limits = ) to set axis limits (covered in Section 5.5 of Chapter 5):

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_continuous(limits = c(2, 6)) +
  scale_y_continuous(limits = c(15, 40)) +
  labs(x = "Engine Displacement (L)", y = "Highway MPG")
## Warning: Removed 33 rows containing missing values or values outside
## the scale range (`geom_point()`).

Figure 7.6: Setting axis limits with scale_*_continuous() removes data outside the range.

7.4.3 Approach B(ii)

Use coord_cartesian() to zoom visually without removing data (covered in Section 6.3.1 of Chapter 6):

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(2, 6), ylim = c(15, 40)) +
  labs(x = "Engine Displacement (L)", y = "Highway MPG")

Figure 7.7: Zooming with coord_cartesian() preserves all data.

The three plots look visually similar, but there is an important difference:

  • filter() and scale_*_continuous(limits = ) both remove data. Any statistical layers added later (e.g., geom_smooth()) will be computed from only the remaining data.
  • coord_cartesian() zooms without removing data. Statistical layers are still computed from all observations.

When to use each:

  • Use filter() when you genuinely only want to analyse a subset (the removed data is truly irrelevant).
  • Use scale_*_continuous(limits = ) for a similar purpose, though filter() is often clearer.
  • Use coord_cartesian() when you want to zoom in for visual clarity while keeping all data for calculations — for example when adding geom_smooth().
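
The difference is easiest to see with geom_smooth(). In the sketch below, both plots add a linear smoother over the same range: the first fits the line to all observations and merely zooms the view, while the second removes out-of-range points before the line is fitted, so the two fitted lines can differ.

```r
library(ggplot2)

# Zoom only: the lm line is fitted to every observation
p_zoom <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_cartesian(xlim = c(2, 6), ylim = c(15, 40))

# Limits: points outside the range are dropped before the lm line is fitted
p_limits <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(limits = c(2, 6)) +
  scale_y_continuous(limits = c(15, 40))
```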

7.5 Overlaying summaries with stat_summary()

7.5.1 Approach A

In Section 3.9.3 of Chapter 3, we overlaid group means on raw data by first computing the summary with dplyr:

drv_summary <- mpg |>
  group_by(drv) |>
  summarise(mean_displ = mean(displ))

ggplot() +
  geom_jitter(data = mpg,
              aes(x = drv, y = displ),
              alpha = 0.3, width = 0.2) +
  geom_point(data = drv_summary,
             aes(x = drv, y = mean_displ),
             colour = "red", size = 4) +
  labs(x = "Drive Type", y = "Engine Displacement (L)")

Figure 7.8: Overlaying group means computed with dplyr — from Section 3.9.3.

7.5.2 Approach B

The same result is achieved using stat_summary(), without needing a separate summary data frame:

ggplot(mpg, aes(x = drv, y = displ)) +
  geom_jitter(alpha = 0.3, width = 0.2) +
  stat_summary(fun = mean, geom = "point",
               colour = "red", size = 4) +
  labs(x = "Drive Type", y = "Engine Displacement (L)")

Figure 7.9: Overlaying group means using stat_summary() — no separate data frame needed.

The result is identical. We pass the raw mpg data to ggplot() and let stat_summary() compute the group means on the fly. stat_summary() takes:

  • fun: the function to compute per group (e.g., mean, median, max)
  • geom: how to display the result (e.g., "point", "bar", "line")

This approach keeps the workflow cleaner when the summary is only needed for the plot. Use dplyr when you need the summary for other purposes (tables, further analysis) or want to inspect it before plotting.
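
Swapping in a different summary is then a one-argument change. A sketch overlaying group medians instead of means:

```r
library(ggplot2)

# Group medians instead of means: only the fun argument changes
ggplot(mpg, aes(x = drv, y = displ)) +
  geom_jitter(alpha = 0.3, width = 0.2) +
  stat_summary(fun = median, geom = "point",
               colour = "blue", size = 4) +
  labs(x = "Drive Type", y = "Engine Displacement (L)")
```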

7.6 Histogram with density overlay

In Chapter 3 (Section 3.4.1), we introduced geom_histogram() using the mpg dataset:

ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Engine Displacement (L)", y = "Count")

Figure 7.10: Histogram of engine displacement (from Chapter 3).

A natural next step is to overlay a density curve on this histogram. However, the default histogram uses counts on the \(y\)-axis, while geom_density() uses density (the curve encloses a total area of 1). These are entirely different scales, so a naive overlay does not work:

ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = 0.5) +
  geom_density() +
  labs(x = "Engine Displacement (L)", y = "Count")

Figure 7.11: A naive overlay: the density curve appears as a flat line near zero because the two geoms are on different scales.

The density curve is barely visible near the bottom of the plot because the histogram \(y\)-axis reaches around 40 (counts), while density values are much smaller. The solution is to rescale the histogram to density units using after_stat(density):

ggplot(mpg, aes(x = displ)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5) +
  geom_density() +
  labs(x = "Engine Displacement (L)", y = "Density")

Figure 7.12: Histogram with density overlay, using after_stat(density) to put both on the same scale.

after_stat(density) accesses the density variable computed internally by geom_histogram(), which rescales the bar heights to density units and makes them comparable to the density curve. Note that after_stat() is only needed for the histogram, not for geom_density().

7.7 Q-Q plots for assessing normality

Quantile-Quantile (Q-Q) plots are a diagnostic tool for assessing whether data follow a particular theoretical distribution, most commonly the normal distribution. They are particularly useful before applying statistical methods that assume normality, such as t-tests or ANOVA.

7.7.1 Why not just use a histogram or density plot?

To illustrate the issue, let us simulate data from the standard normal distribution and look at its histogram and density together:

set.seed(42)
normal_data <- data.frame(x = rnorm(200))

ggplot(normal_data, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.4) +
  geom_density(linewidth = 1) +
  labs(x = "x", y = "Density")

Figure 7.13: Histogram and density of 200 simulated standard normal values.

The shape looks roughly bell-shaped and symmetric — as expected from normal data. But now suppose we are looking at real data and ask: do the tails follow the normal distribution? From a histogram or density plot, it is genuinely difficult to judge. The bars at the extremes are short and it is hard to tell whether they are where they should be under normality.

Q-Q plots are a much better diagnostic for the tails. They plot the sample quantiles against the theoretical quantiles of the normal distribution: if the data follow the normal distribution, the points should fall approximately on a straight reference line.
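
Concretely, the theoretical quantiles are normal quantiles evaluated at evenly spaced probabilities, one per observation, and the sample quantiles are simply the ordered data. The base-R sketch below shows the pairs a normal Q-Q plot displays; ggplot2's stat_qq() generates these probabilities with ppoints() by default.

```r
x <- c(-1.2, 0.3, 0.5, 1.9, -0.7)   # a tiny sample

probs <- ppoints(length(x))  # evenly spaced probabilities in (0, 1)
theoretical <- qnorm(probs)  # normal quantiles at those probabilities
sample_q <- sort(x)          # sample quantiles: just the ordered data

cbind(theoretical, sample_q) # the points a normal Q-Q plot draws
```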

7.7.2 Q-Q plots in ggplot2

Use geom_qq() for the points and geom_qq_line() for the reference line:

ggplot(normal_data, aes(sample = x)) +
  geom_qq() +
  geom_qq_line(linewidth = 1) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")

Figure 7.14: Normal Q-Q plot using ggplot2.

Note that the aesthetic is sample rather than x or y — this tells geom_qq() which variable to compare against the theoretical distribution.

7.7.3 Interpreting Q-Q plots

If the points fall approximately along the reference line, the data are approximately normally distributed. Deviations indicate departures from normality:

  • Points curving upward on the right & downward on the left: Heavy tails (more extreme values than expected under normality)
  • Points curving downward on the right & upward on the left: Light tails (fewer extreme values than expected under normality)
  • Points following a consistently curved, asymmetric pattern: Skewness in the data (a concave-up curve suggests right skew; a concave-down curve suggests left skew)
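
To see the first pattern in practice, the sketch below draws a Q-Q plot of simulated t-distributed data, whose tails are heavier than the normal's; the extreme points drift above the reference line on the right and below it on the left.

```r
library(ggplot2)

set.seed(1)
heavy <- data.frame(x = rt(200, df = 3))  # t with 3 df: heavy tails

ggplot(heavy, aes(sample = x)) +
  geom_qq() +
  geom_qq_line(linewidth = 1) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
```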

7.8 Summary

The following table summarises the five examples in this chapter where the same visualisation can be achieved in two ways. These examples only scratch the surface — data manipulation is a vast field — but they are the ones in scope for this module.

| Visualisation | Approach A: dplyr first | Approach B: ggplot2 built-in |
|---|---|---|
| Bar chart of counts | count() \(\to\) geom_col() | geom_bar() (counts automatically) |
| Subset a scatterplot | filter() \(\to\) geom_point() | scale_*_continuous(limits = ...) or coord_cartesian() |
| Raw data \(+\) group summaries | group_by() \(+\) summarise() \(\to\) two data frames | stat_summary(fun = ...) |
| Histogram \(+\) density overlay | (not straightforward) | geom_histogram(aes(y = after_stat(density))) \(+\) geom_density() |
| Q-Q plot | (manual quantile computation; not covered) | geom_qq() \(+\) geom_qq_line() |