11 Miscellaneous Topics

This chapter covers two broad areas. The first half introduces a selection of ggplot2 geoms that have not appeared in earlier chapters, expanding your visualisation toolkit. The second half presents classic case studies in data visualisation: phenomena that plots reveal far better than numbers alone, and enduring principles for creating effective graphics.

11.1 Additional geoms

Earlier chapters focused on the most frequently used geoms. The following geoms complete the picture and are well worth knowing, even if they appear less often in everyday work.

11.1.1 Reference lines: geom_hline(), geom_vline(), geom_abline()

These three geoms add straight reference lines to a plot. They need no data argument when the position is supplied directly: the lines are defined by their intercept and slope, not by observations in a data frame.

Geom            Argument(s)         Draws
geom_hline()    yintercept          Horizontal line at \(y = c\)
geom_vline()    xintercept          Vertical line at \(x = c\)
geom_abline()   intercept, slope    Line \(y = a + bx\)

A common use is to mark a threshold or reference value. The following plot of highway fuel economy by engine displacement adds a dashed horizontal line at the mean highway economy, making it easy to see which cars fall above and below average:

mean_hwy <- mean(mpg$hwy)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = mean_hwy, linetype = "dashed", colour = "firebrick") +
  labs(x = "Engine displacement (litres)",
       y = "Highway fuel economy (mpg)")

Figure 11.1: Highway fuel economy vs engine displacement, with mean marked.
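geom_vline() follows the same pattern. The sketch below is only an illustration: it assumes ggplot2 is loaded and uses the median displacement as an arbitrary reference value. The names median_displ and p_vline are introduced here for the example and do not appear elsewhere in the chapter.

```r
library(ggplot2)

# Illustrative reference value: the median engine displacement
median_displ <- median(mpg$displ)

p_vline <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.4) +
  # Vertical dashed line at the median displacement
  geom_vline(xintercept = median_displ, linetype = "dashed",
             colour = "firebrick") +
  labs(x = "Engine displacement (litres)",
       y = "Highway fuel economy (mpg)")

p_vline
```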

geom_abline() is particularly useful for a line of equality (slope 1, intercept 0) when comparing two measurements on the same scale — for example, city versus highway fuel economy:

ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 0.4) +
  geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
  labs(x = "City fuel economy (mpg)", y = "Highway fuel economy (mpg)")

Figure 11.2: City vs highway fuel economy with the line of equality.

All points above the line have better highway economy than city economy — which is true of nearly every vehicle in the dataset.
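The visual impression can be checked with a quick count. This is a minimal sketch, assuming the mpg data from ggplot2; n_above and n_total are names chosen for the example.

```r
library(ggplot2)  # provides the mpg dataset

# Cars strictly above the line of equality have hwy > cty
n_above <- sum(mpg$hwy > mpg$cty)
n_total <- nrow(mpg)

cat(n_above, "of", n_total,
    "cars have higher highway than city fuel economy\n")
```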

11.1.2 geom_violin()

A violin plot is a hybrid of a boxplot and a kernel density plot. Like a boxplot, it shows the distribution of a continuous variable within groups; unlike a boxplot, it reveals the full shape of the distribution via a smoothed density estimate mirrored on both sides. Violins are widest where data are most dense, so a bimodal group shows two bulges — information that a boxplot’s quartile box cannot convey.

ggplot(mpg, aes(class, hwy)) +
  geom_violin(fill = "steelblue", alpha = 0.6) +
  labs(x = "Vehicle class", y = "Highway fuel economy (mpg)")

Figure 11.3: Distribution of highway fuel economy by vehicle class (violin).

It is common to overlay a boxplot on the violin to show the summary statistics alongside the distributional shape:

ggplot(mpg, aes(class, hwy)) +
  geom_violin(fill = "steelblue", alpha = 0.4) +
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  labs(x = "Vehicle class", y = "Highway fuel economy (mpg)")

Figure 11.4: Violin plot with overlaid boxplot.

Setting outlier.shape = NA suppresses the boxplot’s outlier points, since the violin already shows the full distribution.

11.1.3 geom_text() and geom_label()

These geoms annotate individual observations with text. Both require an additional label aesthetic. geom_label() draws a filled rectangle behind the text, making it easier to read against a busy background.

# A few cars to highlight
highlight <- mpg |>
  filter(class == "2seater" |
         (manufacturer == "honda" & model == "civic")) |>
  distinct(manufacturer, model, .keep_all = TRUE)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.3) +
  geom_point(data = highlight, colour = "firebrick", size = 2) +
  geom_label(
    data    = highlight,
    aes(label = model),
    size    = 3,
    nudge_y = 1.8
  ) +
  labs(x = "Engine displacement (litres)", y = "Highway economy (mpg)")

Figure 11.5: Fuel economy vs engine size, with selected models labelled.

The nudge_x and nudge_y arguments shift labels away from their points to reduce overlap. For datasets where many labels would still clash, the ggrepel package provides geom_text_repel() and geom_label_repel(), which automatically reposition labels to avoid collision.
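As a sketch of the ggrepel approach, the example below labels one car per subcompact model. It assumes the ggrepel package is installed; the choice of subcompacts and the names labelled and p_repel are illustrative only.

```r
library(ggplot2)
library(ggrepel)

# One row per subcompact model (base R subsetting keeps this self-contained)
labelled <- mpg[mpg$class == "subcompact", ]
labelled <- labelled[!duplicated(labelled$model), ]

p_repel <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.3) +
  geom_point(data = labelled, colour = "firebrick", size = 2) +
  # Labels are moved automatically so that they do not overlap
  geom_label_repel(data = labelled, aes(label = model), size = 3) +
  labs(x = "Engine displacement (litres)", y = "Highway economy (mpg)")

p_repel
```

No nudge_x or nudge_y values are needed here, because the repel geoms choose the label positions themselves.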

11.1.4 geom_raster()

geom_raster() fills a rectangular grid of cells, mapping a variable to the fill colour of each cell. It is the ggplot2 equivalent of a heatmap: the x and y aesthetics give the cell centres, and fill gives the value.

The built-in faithfuld dataset is a 2D kernel density estimate of the Old Faithful geyser data, stored on a regular grid of eruption duration and waiting time values:

ggplot(faithfuld, aes(waiting, eruptions)) +
  geom_raster(aes(fill = density)) +
  scale_fill_viridis_c() +
  labs(x = "Waiting time (minutes)", y = "Eruption duration (minutes)",
       fill = "Density")

Figure 11.6: 2D kernel density of Old Faithful eruptions displayed with geom_raster().

The two clusters correspond to Old Faithful’s two eruption modes: short eruptions followed by a short wait, and long eruptions followed by a longer wait. (In the underlying faithful data, waiting is the time until the next eruption.)
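To see how a gridded density like this can be produced, the sketch below estimates one from the raw faithful data with MASS::kde2d(). It is not claimed to reproduce faithfuld exactly: the 50 x 50 grid and the default bandwidth are assumptions made for the example, and dens and dens_grid are illustrative names.

```r
# MASS::kde2d() is called with its namespace so that MASS is not attached
# (attaching it would mask dplyr::select())
dens <- MASS::kde2d(faithful$waiting, faithful$eruptions, n = 50)

# kde2d() returns grid coordinates x and y plus a matrix z whose rows follow x.
# expand.grid() varies its first argument fastest, matching as.vector(z).
dens_grid <- expand.grid(waiting = dens$x, eruptions = dens$y)
dens_grid$density <- as.vector(dens$z)

head(dens_grid)
```

The resulting data frame has one row per grid cell, which is the layout geom_raster() expects.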

A very common application of geom_raster() is displaying a correlation matrix:

cor_long <- mtcars |>
  select(mpg, cyl, disp, hp, wt) |>
  cor() |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "correlation")

ggplot(cor_long, aes(var1, var2, fill = correlation)) +
  geom_raster() +
  geom_text(aes(label = round(correlation, 2)), size = 3) +
  scale_fill_gradient2(low = "steelblue", mid = "white", high = "firebrick",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "Correlation") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 11.7: Correlation matrix of selected mtcars variables.
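The reshaping step is the part that usually needs the most care. As an alternative sketch that uses only base R (the names cor_mat and cor_long_base are introduced here for illustration), the matrix can be converted to one row per cell via as.table():

```r
cor_mat <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])

# as.table() followed by as.data.frame() gives one row per matrix cell
cor_long_base <- as.data.frame(as.table(cor_mat),
                               responseName = "correlation")
names(cor_long_base)[1:2] <- c("var1", "var2")

head(cor_long_base)
```

Here var1 and var2 are factors in the original matrix order, so the axes keep that order when plotted.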

11.1.5 geom_density_2d()

geom_density_2d() draws contour lines of a 2D kernel density estimate over a scatterplot. It shows where observations cluster without relying on point overplotting or opacity:

library(palmerpenguins)
## 
## Attaching package: 'palmerpenguins'
## The following objects are masked from 'package:datasets':
## 
##     penguins, penguins_raw
penguins |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  ggplot(aes(bill_length_mm, bill_depth_mm)) +
  geom_point(alpha = 0.3) +
  geom_density_2d(colour = "steelblue") +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")

Figure 11.8: Bill length vs depth in Palmer penguins, with 2D density contours.

geom_density_2d_filled() shades the regions between contours, which can be easier to read:

penguins |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  ggplot(aes(bill_length_mm, bill_depth_mm)) +
  geom_density_2d_filled(alpha = 0.7) +
  geom_point(size = 0.6) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)", fill = "Density level")

Figure 11.9: Filled 2D density contours for Palmer penguin bill measurements.

11.1.6 geom_smooth(method = "glm")

In Chapter 7, geom_smooth() was introduced with method = "lm" for linear fits. Setting method = "glm" fits a generalised linear model instead. Combined with method.args = list(family = binomial), this produces a logistic regression curve — appropriate when the outcome is binary (0/1).

In MAS2903 Regression, you will learn that logistic regression models the probability of a binary outcome as a sigmoid (S-shaped) function of predictors. geom_smooth() can overlay this fitted curve directly on a plot:

ggplot(mtcars, aes(hp, am)) +
  geom_jitter(height = 0.05, width = 0, alpha = 0.6) +
  geom_smooth(
    method      = "glm",
    method.args = list(family = binomial),
    se          = FALSE,
    colour      = "steelblue"
  ) +
  labs(x = "Horsepower", y = "P(manual transmission)")
## `geom_smooth()` using formula = 'y ~ x'

Figure 11.10: Probability of manual transmission by horsepower, with logistic regression curve.

geom_jitter() adds a small vertical offset to prevent the binary points from all lying on \(y = 0\) or \(y = 1\). The sigmoid curve shows that higher-powered cars are substantially less likely to have a manual gearbox.
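The curve can also be obtained outside the plot by fitting the corresponding model with glm(). This is a sketch of the equivalent fit (am as response, hp as the single predictor, binomial family); the horsepower values in new_cars are arbitrary illustrative choices.

```r
# Logistic regression of transmission type on horsepower
fit <- glm(am ~ hp, data = mtcars, family = binomial)

# Coefficients are on the log-odds scale
coef(fit)

# Predicted probability of a manual gearbox at a few horsepower values
new_cars <- data.frame(hp = c(100, 200, 300))
new_cars$p_manual <- predict(fit, newdata = new_cars, type = "response")
new_cars
```

Using type = "response" returns probabilities rather than log-odds, which is the scale shown on the plot's y-axis.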

11.2 Case studies and principles

11.2.1 Simpson’s paradox

Simpson’s paradox occurs when a trend present within separate groups disappears or reverses when those groups are combined. It is a powerful reminder that aggregated data can tell a systematically misleading story.

The Palmer penguin dataset provides a clean illustration. When bill length and bill depth are plotted for all penguins together, the relationship appears negative:

penguins |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  ggplot(aes(bill_length_mm, bill_depth_mm)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")
## `geom_smooth()` using formula = 'y ~ x'

Figure 11.11: Bill length vs bill depth across all penguin species — a negative trend.

Yet within each species, the relationship is clearly positive:

penguins |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  ggplot(aes(bill_length_mm, bill_depth_mm, colour = species)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)", colour = "Species")
## `geom_smooth()` using formula = 'y ~ x'

Figure 11.12: Bill length vs bill depth within each penguin species — a positive trend in every group.

The reversal is driven by species acting as a confounding variable. Gentoo penguins are large-billed but have relatively shallow bills for their size, while Adélie penguins are the opposite. When species groups are merged, Gentoo’s long, shallow bills pull the overall slope negative, masking the within-group positive trend.

Whenever you see a trend in aggregated data, ask: “Is there a grouping variable that could be driving this?” Disaggregating by that variable is often the fastest way to find out, and a well-chosen colour aesthetic is all it takes.
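The reversal can also be confirmed numerically. The sketch below compares the pooled correlation with the correlation inside each species, assuming the palmerpenguins package is installed; complete, r_overall and r_by_species are names chosen for the example.

```r
library(palmerpenguins)

# Keep rows with both bill measurements present
complete <- penguins[!is.na(penguins$bill_length_mm) &
                       !is.na(penguins$bill_depth_mm), ]

# Correlation with all species pooled
r_overall <- cor(complete$bill_length_mm, complete$bill_depth_mm)

# Correlation within each species
r_by_species <- sapply(
  split(complete, complete$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm)
)

round(r_overall, 2)
round(r_by_species, 2)
```

A negative pooled value alongside positive within-species values is the numerical signature of the pattern in the two figures.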

11.2.2 Correlation does not imply causation

Two variables can be strongly correlated for reasons that have nothing to do with one causing the other. Tyler Vigen catalogued hundreds of such spurious correlations at tylervigen.com/spurious-correlations. The most famous example finds that US per capita cheese consumption and the number of people who died by becoming tangled in their bedsheets tracked each other almost perfectly over a ten-year period:

# Data from tylervigen.com/spurious-correlations
cheese_bed <- tibble(
  year            = 2000:2009,
  cheese_lbs      = c(29.8, 30.1, 30.5, 30.6, 31.3,
                      31.7, 32.6, 33.1, 32.7, 32.8),
  bedsheet_deaths = c(327, 456, 509, 497, 596,
                      573, 661, 741, 809, 717)
)

ggplot(cheese_bed, aes(cheese_lbs, bedsheet_deaths)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_text(aes(label = year), vjust = -0.8, size = 3) +
  labs(
    x       = "US per capita cheese consumption (lbs)",
    y       = "Deaths by bedsheet tangling",
    title   = "Correlation = 0.95 (to 2 d.p.)",
    caption = "Source: tylervigen.com/spurious-correlations"
  )
## `geom_smooth()` using formula = 'y ~ x'

Figure 11.13: US cheese consumption vs deaths by bedsheet tangling: a spurious correlation.

Both series share the same broad upward trend over the decade. The most plausible explanation is a common cause: a third factor (population growth, changing dietary habits, ageing demographics) may drive both series independently. Eating more cheese does not make bedsheets more dangerous.

The lesson: a convincing scatterplot with a tight linear fit is no substitute for a causal argument grounded in subject-matter knowledge. Always ask whether the relationship makes mechanistic sense. The geoms involved here (geom_point(), geom_smooth(), and geom_text()) are all familiar, but the relationship they depict carries no causal meaning.
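One rough way to probe a shared trend is to correlate year-on-year changes instead of raw levels. The sketch below does this in base R with the same ten values as the tibble above; differencing is just one possible detrending choice, and r_levels and r_changes are illustrative names.

```r
# Same values as in the cheese_bed tibble
cheese_lbs      <- c(29.8, 30.1, 30.5, 30.6, 31.3,
                     31.7, 32.6, 33.1, 32.7, 32.8)
bedsheet_deaths <- c(327, 456, 509, 497, 596,
                     573, 661, 741, 809, 717)

# Correlation of the raw levels: dominated by the shared upward trend
r_levels <- cor(cheese_lbs, bedsheet_deaths)

# Correlation of year-on-year changes: most of the shared trend is removed
r_changes <- cor(diff(cheese_lbs), diff(bedsheet_deaths))

round(c(levels = r_levels, changes = r_changes), 2)
```

If the changes are much less strongly correlated than the levels, that is consistent with the two series merely drifting upward together.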

11.2.3 Numerical summaries alone are insufficient

Summary statistics (means, standard deviations, correlations) collapse a dataset into a few numbers and inevitably lose information. The canonical demonstration is Anscombe’s quartet (1973): four datasets with nearly identical means, standard deviations, and correlations, but radically different visual structures. The datasauRus package extends this idea dramatically with the datasaurus_dozen: thirteen datasets that share almost exactly the same summary statistics but look completely different when plotted.

library(datasauRus)

datasaurus_dozen |>
  group_by(dataset) |>
  summarise(
    mean_x = round(mean(x), 1),
    mean_y = round(mean(y), 1),
    sd_x   = round(sd(x), 1),
    sd_y   = round(sd(y), 1),
    cor    = round(cor(x, y), 2),
    .groups = "drop"
  ) |>
  head(6)
## # A tibble: 6 × 6
##   dataset  mean_x mean_y  sd_x  sd_y   cor
##   <chr>     <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 away       54.3   47.8  16.8  26.9 -0.06
## 2 bullseye   54.3   47.8  16.8  26.9 -0.07
## 3 circle     54.3   47.8  16.8  26.9 -0.07
## 4 dino       54.3   47.8  16.8  26.9 -0.06
## 5 dots       54.3   47.8  16.8  26.9 -0.06
## 6 h_lines    54.3   47.8  16.8  26.9 -0.06

Despite the near-identical statistics (the correlations differ only in the second decimal place), the datasets look nothing alike:

ggplot(datasaurus_dozen, aes(x, y)) +
  geom_point(size = 0.5, alpha = 0.6) +
  facet_wrap(~dataset, ncol = 3) +
  theme_void(base_size = 10) +
  theme(strip.text = element_text(size = 8))

Figure 11.14: The datasaurus_dozen: thirteen datasets with near-identical summary statistics but very different shapes.

One of the thirteen datasets is literally a dinosaur. The message is clear: always plot your data. A single number like the correlation coefficient can make fundamentally different relationships look identical. Visualisation is not an optional extra after the numbers; it is an essential part of understanding data.
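Anscombe’s quartet itself ships with base R as the anscombe data frame (columns x1 to x4 and y1 to y4), so the same check needs no extra packages. The sketch below computes the summaries; stat_for and quartet_stats are names introduced for the example.

```r
# Helper: apply a summary function to one column family (x or y) of each set
stat_for <- function(f, prefix) {
  sapply(1:4, function(i) f(anscombe[[paste0(prefix, i)]]))
}

quartet_stats <- data.frame(
  set    = 1:4,
  mean_x = stat_for(mean, "x"),
  mean_y = stat_for(mean, "y"),
  sd_x   = stat_for(sd, "x"),
  sd_y   = stat_for(sd, "y"),
  cor    = sapply(1:4, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)

round(quartet_stats, 2)
```

Plotting each pair, for example with plot(anscombe$x1, anscombe$y1), shows how differently the four sets behave despite these matching summaries.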

11.2.4 John Snow and the cholera map

In the summer of 1854, a cholera outbreak killed over 500 people in ten days in the Soho district of London. At the time, the prevailing theory was that cholera spread through bad air (the “miasma” theory). The physician John Snow was sceptical.

Snow plotted the deaths on a street map of Soho and marked the locations of the neighbourhood’s public water pumps. The spatial pattern was unmistakable: deaths clustered tightly around a single pump on Broad Street.

library(HistData)

ggplot() +
  geom_point(
    data  = Snow.deaths,
    aes(x, y),
    size  = 1,
    alpha = 0.5
  ) +
  geom_point(
    data   = Snow.pumps,
    aes(x, y),
    shape  = 3,
    size   = 4,
    stroke = 1.5,
    colour = "firebrick"
  ) +
  coord_equal() +
  theme_void() +
  labs(
    title   = "John Snow's cholera map, Soho 1854",
    caption = "Dots: cholera deaths  |  Red crosses: water pumps"
  )

Figure 11.15: John Snow’s 1854 cholera data: deaths (dots) and water pump locations (red crosses).

Snow persuaded local authorities to remove the handle from the Broad Street pump, and the outbreak, which was already waning as residents fled the area, came to an end. His map is widely regarded as one of the founding moments of both epidemiology and data visualisation: the spatial pattern made the case more immediately than any table of death counts could have.

Today, geospatial tools like sf and leaflet (Chapters 9 and 10) allow us to reproduce and extend this kind of analysis in minutes. The 2003 SARS outbreak, the 2014–2016 Ebola crisis, and the 2020 COVID-19 pandemic were all tracked with real-time geospatial dashboards that owe a conceptual debt to Snow’s hand-drawn map. The crucial ingredient is the same in every case: cases (or deaths, or test results) paired with a location, in a format that can be plotted.

11.2.5 What makes a good visualisation?

The statistician and information designer Edward Tufte codified many of the principles underlying good data visualisation in his 1983 book The Visual Display of Quantitative Information. His ideas remain the most widely cited framework in the field and form a useful checklist when evaluating your own plots.

Data-ink ratio

Tufte’s central concept is the data-ink ratio: the proportion of the ink (or pixels) on a chart that actually conveys data information, as opposed to decorative or redundant elements. A high data-ink ratio is desirable.

\[\text{Data-ink ratio} = \frac{\text{data ink}}{\text{total ink used to print the graphic}}\]

In ggplot2 terms, theme_minimal() and theme_classic() move towards higher data-ink ratios by removing background fills and redundant grid lines. As a rule, removing unnecessary elements is more effective than adding decoration.

Chart junk

Chart junk is Tufte’s term for visual elements that clutter a graphic without contributing to understanding: three-dimensional effects on 2D data, decorative hatching, unnecessary grid lines, and ornamental borders. These elements consume ink, distract the eye, and often distort the data.

The ggplot2 defaults are already fairly clean, but watch for:

  • 3D bar or pie charts (never use these for data that is not inherently three-dimensional)
  • Gradient fills on bars or backgrounds
  • Grid lines so dense they compete with the data

Lie factor

The lie factor is the ratio of the visual effect size to the true data effect size:

\[\text{Lie factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}\]

A lie factor of 1 is ideal. Values greater than 1 exaggerate differences; values less than 1 understate them. The most common source of inflated lie factors in practice is a truncated \(y\)-axis: starting a bar chart at a non-zero baseline can make a 10% change look like a 300% change, because bar heights are then no longer proportional to the values they represent. For example, values of 30 and 33 drawn from a baseline of 29 produce bars of height 1 and 4.
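The arithmetic behind a truncated axis can be sketched in a few lines. The helper lie_factor() below is hypothetical (it is not part of ggplot2 or any package) and assumes the "size of effect" is measured as a relative change, in bar height for the graphic and in value for the data.

```r
# Hypothetical helper: lie factor for two bars drawn from a given baseline
lie_factor <- function(v1, v2, baseline = 0) {
  data_effect  <- (v2 - v1) / v1                # relative change in the data
  h1 <- v1 - baseline                           # drawn height of first bar
  h2 <- v2 - baseline                           # drawn height of second bar
  shown_effect <- (h2 - h1) / h1                # relative change in bar height
  shown_effect / data_effect
}

# Bars starting at zero: heights are proportional to values
lf_zero <- lie_factor(30, 33, baseline = 0)

# Bars starting at 29: heights 1 and 4, a 300% visual change for a 10% data change
lf_truncated <- lie_factor(30, 33, baseline = 29)

c(zero_baseline = lf_zero, truncated = lf_truncated)
```

Under these assumptions the zero-baseline chart has a lie factor of 1, while the truncated chart overstates the change thirtyfold.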

Small multiples

Small multiples (Tufte’s term for what ggplot2 calls facets) display the same graphical form repeated for different subsets of data. Because the reader’s eye can compare across panels without re-learning the visual encoding, small multiples are often more effective than a single busy chart with many overlapping series. Chapters 5 and 8 used facet_wrap() extensively for exactly this reason.

Data density

Related to the data-ink ratio is data density: the amount of information per unit of display area. Tufte’s own “sparkline” is an extreme example — a tiny inline time-series the size of a word, embedded directly in text. In practice, the principle encourages:

  • choosing compact geoms (e.g., a boxplot rather than a bar chart with error bars) when the data support them;
  • avoiding excessive whitespace or padding; and
  • using facets to show more comparisons in the same area.

Summary

Principle         Practical rule
Data-ink ratio    Remove anything that does not convey data
Chart junk        Avoid 3D effects, decorative fills, dense grids
Lie factor        \(\approx 1\): don’t exaggerate or understate; start axes at zero for bar charts
Small multiples   Facet rather than overcrowd a single panel
Data density      Pack information efficiently without clutter

Tufte’s principles are guidelines rather than absolute rules. There are contexts, such as public-facing infographics where visual engagement matters, in which some decoration is justified. But as a default starting point, these rules consistently produce clearer, more honest, and more informative graphics.