11 Miscellaneous Topics

This chapter covers three broad areas. The first expands your ggplot2 toolkit with geoms that have not appeared in earlier chapters. The second presents classic case studies in data visualisation — phenomena that plots reveal far better than numbers alone — and enduring principles for creating effective graphics. The third demonstrates that ggplot2 is not confined to statistical data: any deterministic function can be visualised by constructing a suitable data frame.

11.1 Additional geoms

Earlier chapters focused on the most frequently used geoms. The following geoms complete the picture and are well worth knowing, even if they appear less often in everyday work.

11.1.1 Reference lines: `geom_hline()`, `geom_vline()`, `geom_abline()`

These three geoms add straight reference lines to a plot. They need no data argument: each line is defined by the intercept (and, for geom_abline(), the slope) that you supply, not by observations in a data frame.

Geom	Argument(s)	Draws
`geom_hline()`	`yintercept`	Horizontal line at $y = c$
`geom_vline()`	`xintercept`	Vertical line at $x = c$
`geom_abline()`	`intercept`, `slope`	Line $y = a + bx$

A common use is to mark a threshold or reference value. The following plot of highway fuel economy by engine displacement adds a dashed horizontal line at the mean, making it easy to see which cars fall above and below the average:

mean_hwy <- mean(mpg$hwy)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = mean_hwy, linetype = "dashed", colour = "firebrick") +
  labs(x = "Engine displacement (litres)",
       y = "Highway fuel economy (mpg)")

Figure 11.1: Highway fuel economy vs engine displacement, with mean marked.

geom_abline() was introduced briefly in Section 3.9.4 as a manual alternative to geom_smooth(method = "lm"): instead of asking ggplot2 to fit the line, you supply the intercept and slope directly from a pre-computed lm() object. Its more general use is adding a line of equality (slope 1, intercept 0) when comparing two measurements on the same scale — for example, city versus highway fuel economy:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.4) +
  geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
  labs(x = "City fuel economy (mpg)", y = "Highway fuel economy (mpg)")

Figure 11.2: City vs highway fuel economy with the line of equality.

All points above the line have better highway economy than city economy — which is true of nearly every vehicle in the dataset.
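The other use of geom_abline(), supplying coefficients from a pre-computed lm() object, can be sketched as follows. The object names fit and b and the choice of model (hwy on cty) are illustrative assumptions rather than the example from Section 3.9.4:

```r
library(ggplot2)

# Fit the regression outside ggplot2 (illustrative model: hwy on cty)
fit <- lm(hwy ~ cty, data = mpg)
b <- unname(coef(fit))  # b[1] is the intercept, b[2] is the slope

p_fit <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.4) +
  # Fitted regression line, supplied manually
  geom_abline(intercept = b[1], slope = b[2], colour = "firebrick") +
  # Line of equality, kept for comparison
  geom_abline(intercept = 0, slope = 1, colour = "steelblue",
              linetype = "dashed") +
  labs(x = "City fuel economy (mpg)", y = "Highway fuel economy (mpg)")

p_fit
```

Drawing both lines on the same axes makes the gap between the fitted relationship and the line of equality directly visible.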

geom_vline() marks a specific $x$ value, which is useful for annotating historical events on a time series. The ggplot2::economics dataset records monthly US economic indicators. The following plot overlays a dashed vertical line at the date of the Nixon shock (15 August 1971), when President Nixon suspended the convertibility of the US dollar to gold, ending the Bretton Woods system:

library(lubridate)

ggplot(economics, aes(x = date, y = psavert)) +
  geom_line() +
  geom_vline(
    xintercept = as.Date("1971-08-15"),
    linetype   = "dashed",
    colour     = "firebrick"
  ) +
  annotate("text", x = as.Date("1971-08-15"), y = 17,
           label = "Nixon shock", hjust = -0.1, size = 3.5) +
  labs(x = "Date", y = "Personal savings rate (%)")

Figure 11.3: US personal savings rate with a vertical line marking the Nixon shock (August 1971).

The dashed line makes it easy to read off whether the savings rate was rising or falling around the shock, and to compare behaviour in the years before and after.

11.1.2 `geom_violin()`

A violin plot is a hybrid of a boxplot and a kernel density plot. Like a boxplot, it shows the distribution of a continuous variable within groups; unlike a boxplot, it reveals the full shape of the distribution via a smoothed density estimate mirrored on both sides. Violins are widest where data are most dense, so a bimodal group shows two bulges — information that a boxplot’s quartile box cannot convey.

ggplot(mpg, aes(x = class, y = hwy)) +
  geom_violin(fill = "steelblue", alpha = 0.6) +
  labs(x = "Vehicle class", y = "Highway fuel economy (mpg)")

Figure 11.4: Distribution of highway fuel economy by vehicle class (violin).

It is common to overlay a boxplot on the violin to show summary statistics alongside the distributional shape:

ggplot(mpg, aes(x = class, y = hwy)) +
  geom_violin(fill = "steelblue", alpha = 0.4) +
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  labs(x = "Vehicle class", y = "Highway fuel economy (mpg)")

Figure 11.5: Violin plot with overlaid boxplot.

Setting outlier.shape = NA suppresses the boxplot’s outlier points, since the violin already shows the full distribution.

11.1.3 `geom_text()` and `geom_label()`

These geoms annotate individual observations with text. Both require an additional label aesthetic. geom_label() draws a filled rectangle behind the text, making it easier to read against a busy background, though the background rectangle can obscure nearby points.

The mtcars dataset has car names stored as row names rather than a column. Converting them into a proper column with rownames_to_column() makes them available as an aesthetic:

mtcars_new <- mtcars |> rownames_to_column("name")

Starting with geom_text(), each car is labelled at its (mpg, disp) coordinates. With 32 cars in a small space the labels inevitably overlap:

ggplot(mtcars_new, aes(x = mpg, y = disp)) +
  geom_point() +
  geom_text(aes(label = name), size = 2.5) +
  labs(x = "Fuel economy (mpg)", y = "Displacement (cu. in.)")

Figure 11.6: geom_text() labelling all cars by name: overlapping labels are hard to read.

geom_label() draws a white-filled box behind each label, which improves legibility against a busy background. The trade-off is that the boxes can cover nearby data points:

ggplot(mtcars_new, aes(x = mpg, y = disp)) +
  geom_point() +
  geom_label(aes(label = name), size = 2.5) +
  labs(x = "Fuel economy (mpg)", y = "Displacement (cu. in.)")

Figure 11.7: geom_label() provides clearer text but the boxes obscure some data points.

For datasets where labels still clash after using nudge_x / nudge_y, the ggrepel package provides geom_text_repel() and geom_label_repel(), which automatically reposition labels to avoid collision.
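A minimal sketch of the ggrepel approach is shown below. It assumes the ggrepel package is installed; the data frame is rebuilt with base R so that the block runs on its own, and max.overlaps = Inf is an illustrative setting that asks ggrepel to keep every label rather than drop crowded ones:

```r
library(ggplot2)
library(ggrepel)  # assumed to be installed; not part of ggplot2

# Base R equivalent of rownames_to_column("name")
mtcars_new <- data.frame(name = rownames(mtcars), mtcars, row.names = NULL)

p_repel <- ggplot(mtcars_new, aes(x = mpg, y = disp)) +
  geom_point() +
  # Labels are pushed away from each other and from the points
  geom_text_repel(aes(label = name), size = 2.5, max.overlaps = Inf) +
  labs(x = "Fuel economy (mpg)", y = "Displacement (cu. in.)")

p_repel
```

Because the label positions are found by an iterative layout, they can differ slightly between runs unless a seed is fixed.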

11.1.4 `geom_raster()`

geom_raster() fills a rectangular grid of cells, mapping a variable to the fill colour of each cell. It is the ggplot2 equivalent of a heatmap: the x and y aesthetics give the cell centres, and fill gives the value.

A very common application is displaying a correlation matrix. The tidy format required by geom_raster() is obtained by reshaping the output of cor() with pivot_longer():

cor_long <- mtcars |>
  select(mpg, cyl, disp, hp, wt) |>
  cor() |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "correlation")

ggplot(cor_long, aes(var1, var2, fill = correlation)) +
  geom_raster() +
  geom_text(aes(label = round(correlation, 2)), size = 3) +
  scale_fill_gradient2(low = "steelblue", mid = "white", high = "firebrick",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "Correlation") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 11.8: Correlation matrix of selected mtcars variables.

The faithfuld dataset, which ships with ggplot2, is a 2D kernel density estimate of the Old Faithful geyser data, stored on a regular grid of eruption duration and waiting time values. This provides a ready-made example of geom_raster() applied to a continuous density:

ggplot(faithfuld, aes(x = waiting, y = eruptions)) +
  geom_raster(aes(fill = density)) +
  scale_fill_viridis_c() +
  labs(x = "Waiting time (minutes)", y = "Eruption duration (minutes)",
       fill = "Density")

Figure 11.9: 2D kernel density of Old Faithful eruptions displayed with geom_raster().

The two clusters correspond to Old Faithful’s two eruption modes: short eruptions paired with short waits, and long eruptions paired with longer waits.
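A grid of this kind can also be built from raw observations. The sketch below uses MASS::kde2d() on the faithful data and reshapes the result into the long format that geom_raster() expects. The grid size n = 50 and the object names are assumptions, and this is not necessarily how faithfuld itself was produced:

```r
library(ggplot2)

# 2D kernel density estimate on a 50 x 50 grid (grid size is an arbitrary choice)
dens <- MASS::kde2d(faithful$waiting, faithful$eruptions, n = 50)

# expand.grid() varies its first argument fastest, which matches the
# column-major order of the matrix dens$z (rows index x, columns index y)
grid_df <- expand.grid(waiting = dens$x, eruptions = dens$y)
grid_df$density <- as.vector(dens$z)

p_grid <- ggplot(grid_df, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster() +
  scale_fill_viridis_c() +
  labs(x = "Waiting time (minutes)", y = "Eruption duration (minutes)",
       fill = "Density")

p_grid
```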

11.1.5 Two views of the same bivariate distribution: `geom_density_2d()` and `geom_density_2d_filled()`

faithfuld is a pre-computed density grid. Working from the raw faithful data instead, ggplot2 can compute and render the same 2D density on the fly in two complementary styles.

The filled density representation treats density as a continuous surface and shades regions according to their estimated density — directly analogous to geom_raster() with a pre-computed grid:

ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_density_2d_filled(alpha = 0.8) +
  labs(x = "Waiting time (minutes)", y = "Eruption duration (minutes)",
       fill = "Density level")

Figure 11.10: Old Faithful eruption data: filled 2D density estimate.

The contour representation draws isolines at fixed density levels. Overlaying these on a scatterplot shows both the raw data and the estimated density structure simultaneously:

ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(alpha = 0.3) +
  geom_density_2d(colour = "steelblue") +
  labs(x = "Waiting time (minutes)", y = "Eruption duration (minutes)")

Figure 11.11: Old Faithful eruption data: 2D density contours overlaid on points.

Both plots convey the same two-cluster structure. The filled version is more immediately readable; the contour version preserves the individual data points and lets the reader see exactly where observations lie relative to the density peaks.

11.1.6 `geom_smooth(method = "glm")`

In Chapter 7, geom_smooth() was introduced with method = "lm" for linear fits. Setting method = "glm" fits a generalised linear model instead. Combined with method.args = list(family = binomial), this produces a logistic regression curve — appropriate when the outcome is binary (0/1).

In MAS2910 Regression, you will learn that logistic regression models the probability of a binary outcome as a sigmoid (S-shaped) function of predictors. geom_smooth() can overlay this fitted curve directly on a plot of the raw binary responses.

The Palmer penguins dataset provides a clean example: we aim to predict a penguin’s sex from a single numeric measurement, bill length. Rows with missing values are removed with drop_na():

penguins |>
  drop_na() |>
  ggplot(aes(x = bill_len, y = as.numeric(sex) - 1)) +
  geom_jitter(height = 0.05, width = 0, alpha = 0.4) +
  geom_smooth(
    method      = "glm",
    method.args = list(family = binomial),
    se          = FALSE,
    colour      = "steelblue"
  ) +
  labs(x = "Bill length (mm)", y = "P(sex = male)")

## `geom_smooth()` using formula = 'y ~ x'

Figure 11.12: Probability of being male as a function of bill length, with logistic regression curve.

as.numeric(sex) - 1 converts the factor sex (levels: "female" = 1, "male" = 2) to a 0/1 numeric outcome. geom_jitter() adds a small vertical offset so the binary points do not lie exactly on $y = 0$ or $y = 1$. The sigmoid curve shows that longer bills are associated with a substantially higher probability of being male — consistent with sexual dimorphism in beak size across penguin species.
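The curve drawn by geom_smooth() corresponds to an ordinary glm() fit, which can be reproduced directly to obtain predicted probabilities. This is a sketch assuming R 4.5 or later, where datasets::penguins provides the bill_len and sex columns; the object names and the three bill lengths are illustrative:

```r
# Assumes R >= 4.5, where datasets::penguins has columns bill_len and sex
peng <- na.omit(penguins)
peng$male <- as.numeric(peng$sex) - 1  # 0 = female, 1 = male

# Logistic regression of sex on bill length
fit_sex <- glm(male ~ bill_len, data = peng, family = binomial)

# Predicted probability of being male at three illustrative bill lengths
new_bills <- data.frame(bill_len = c(35, 45, 55))
p_male <- predict(fit_sex, newdata = new_bills, type = "response")

cbind(new_bills, p_male)
```

Using type = "response" returns probabilities on the 0 to 1 scale; the default would return values on the log-odds scale instead.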

11.2 Simpson’s paradox

Simpson’s paradox occurs when a trend present within separate groups disappears or reverses when those groups are combined. It is a powerful reminder that aggregated data can tell a systematically misleading story.

The Palmer penguin dataset provides a clean illustration. When bill length and bill depth are plotted for all penguins together, the relationship appears negative:

penguins |>
  filter(!is.na(bill_len), !is.na(bill_dep)) |>
  ggplot(aes(x = bill_len, y = bill_dep)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")

## `geom_smooth()` using formula = 'y ~ x'

Figure 11.13: Bill length vs bill depth across all penguin species — a negative trend.

Yet within each species, the relationship is clearly positive:

penguins |>
  filter(!is.na(bill_len), !is.na(bill_dep)) |>
  ggplot(aes(x = bill_len, y = bill_dep, colour = species)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)", colour = "Species")

## `geom_smooth()` using formula = 'y ~ x'

Figure 11.14: Bill length vs bill depth within each penguin species — a positive trend in every group.

The reversal is driven by species acting as a confounding variable. Gentoo penguins have long but relatively shallow bills, while Adélie penguins have shorter but deeper ones. When the species are pooled, these differences between species dominate and pull the overall slope negative, masking the positive trend within each group.

Whenever you see a trend in aggregated data, ask: “Is there a grouping variable that could be driving this?” Disaggregating by that variable is often the fastest way to find out, and a well-chosen colour aesthetic is all it takes.
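The same reversal shows up numerically when the correlation is computed once for the pooled data and once per species. The sketch below assumes R 4.5 or later (datasets::penguins with bill_len and bill_dep); the object names are illustrative:

```r
# Assumes R >= 4.5 (datasets::penguins with bill_len and bill_dep)
peng <- na.omit(penguins[, c("species", "bill_len", "bill_dep")])

# Correlation across all species pooled together
cor_pooled <- cor(peng$bill_len, peng$bill_dep)

# Correlation within each species separately
cor_by_species <- sapply(
  split(peng, peng$species),
  function(d) cor(d$bill_len, d$bill_dep)
)

cor_pooled      # negative
cor_by_species  # positive for every species
```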

11.3 Correlation does not imply causation

Two variables can be strongly correlated for reasons that have nothing to do with one causing the other. Tyler Vigen has catalogued hundreds of such spurious correlations at https://tylervigen.com/spurious-correlations. One of the best-known examples finds that US per capita cheese consumption and the number of people who died by becoming tangled in their bedsheets tracked each other almost perfectly over a ten-year period:

# Data from tylervigen.com/spurious-correlations
cheese_bed <- tibble(
  year            = 2000:2009,
  cheese_lbs      = c(29.8, 30.1, 30.5, 30.6, 31.3,
                      31.7, 32.6, 33.1, 32.7, 32.8),
  bedsheet_deaths = c(327, 456, 509, 497, 596,
                      573, 661, 741, 809, 717)
)

ggplot(cheese_bed, aes(x = cheese_lbs, y = bedsheet_deaths)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_text(aes(label = year), vjust = -0.8, size = 3) +
  labs(
    x       = "US per capita cheese consumption (lbs)",
    y       = "Deaths by bedsheet tangling",
    title   = "Correlation = 0.95 (to 2 d.p.)",
    caption = "Source: tylervigen.com/spurious-correlations"
  )

## `geom_smooth()` using formula = 'y ~ x'

Figure 11.15: US cheese consumption vs deaths by bedsheet tangling: a spurious correlation.

Both series share the same broad upward trend over the decade, and any two series that trend in the same direction will be strongly correlated whether or not they are related. Some third factor (population growth, changing dietary habits, ageing demographics) may lie behind both trends, or the match may simply be coincidence. Either way, eating more cheese does not make bedsheets more dangerous.

The lesson: a convincing scatterplot with a tight linear fit is no substitute for a causal argument grounded in subject-matter knowledge. Always ask whether the relationship makes mechanistic sense. The geoms involved here (geom_point(), geom_smooth(), and geom_text()) are all familiar, but the relationship they depict carries no causal meaning.
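One informal way to probe the shared-trend explanation is to correlate year-on-year changes rather than levels, since differencing removes most of a steady trend. This is a sketch of a common diagnostic and is not part of Vigen’s own analysis; the object names are illustrative:

```r
cheese_lbs      <- c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
bedsheet_deaths <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)

# Correlation of the raw levels (dominated by the shared upward trend)
cor_levels <- cor(cheese_lbs, bedsheet_deaths)

# Correlation of year-on-year changes (trend largely removed)
cor_changes <- cor(diff(cheese_lbs), diff(bedsheet_deaths))

round(c(levels = cor_levels, changes = cor_changes), 2)
```

If the association were more than a shared trend, it should largely survive differencing. Here the correlation of changes is considerably weaker than the correlation of levels.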

11.4 Numerical summaries alone are insufficient

Summary statistics (means, standard deviations, correlations) collapse a dataset into a few numbers and inevitably lose information. The canonical demonstration is Anscombe’s quartet (1973): four datasets with nearly identical means, standard deviations, and correlations, but radically different visual structures. The datasauRus package extends this idea dramatically with the datasaurus_dozen: thirteen datasets that share almost the same summary statistics but look completely different when plotted.

library(datasauRus)

datasaurus_dozen |>
  group_by(dataset) |>
  summarise(
    mean_x = round(mean(x), 1),
    mean_y = round(mean(y), 1),
    sd_x   = round(sd(x), 1),
    sd_y   = round(sd(y), 1),
    cor    = round(cor(x, y), 2),
    .groups = "drop"
  ) |>
  head(6)

## # A tibble: 6 × 6
##   dataset  mean_x mean_y  sd_x  sd_y   cor
##   <chr>     <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 away       54.3   47.8  16.8  26.9 -0.06
## 2 bullseye   54.3   47.8  16.8  26.9 -0.07
## 3 circle     54.3   47.8  16.8  26.9 -0.07
## 4 dino       54.3   47.8  16.8  26.9 -0.06
## 5 dots       54.3   47.8  16.8  26.9 -0.06
## 6 h_lines    54.3   47.8  16.8  26.9 -0.06

Despite the almost identical statistics, the datasets look nothing alike:

ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point(size = 0.5, alpha = 0.6) +
  facet_wrap(~dataset, ncol = 3) +
  theme_void(base_size = 10) +
  theme(strip.text = element_text(size = 8))

Figure 11.16: The datasaurus_dozen: thirteen datasets with identical summary statistics but very different shapes.

One of the thirteen datasets is literally a dinosaur. The message is clear: always plot your data. A single number like the correlation coefficient can make fundamentally different relationships look identical. Visualisation is not an optional extra after the numbers; it is an essential part of understanding data.
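Anscombe’s quartet, mentioned at the start of this section, can be checked in the same way. The anscombe data frame ships with base R; the summary layout below is an illustrative sketch:

```r
# anscombe ships with base R: columns x1..x4 and y1..y4
quartet_stats <- data.frame(
  dataset = 1:4,
  mean_x  = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y  = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  cor     = sapply(1:4, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)

# The four rows agree to two decimal places
round(quartet_stats, 2)
```

Plotting each (x, y) pair then reveals four quite different shapes, just as with the datasaurus_dozen.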

11.5 John Snow and the cholera map

In the summer of 1854, a cholera outbreak killed over 500 people in ten days in the Soho district of London. At the time, the prevailing theory was that cholera spread through bad air (the “miasma” theory). The physician John Snow was sceptical.

Snow plotted the deaths on a street map of Soho and marked the locations of the neighbourhood’s public water pumps. The spatial pattern was unmistakable: deaths clustered tightly around a single pump on Broad Street.

library(HistData)

ggplot() +
  geom_point(
    data  = Snow.deaths,
    aes(x = x, y = y),
    size  = 1,
    alpha = 0.5
  ) +
  geom_point(
    data   = Snow.pumps,
    aes(x = x, y = y),
    shape  = 3,
    size   = 4,
    stroke = 1.5,
    colour = "firebrick"
  ) +
  coord_equal() +
  theme_void() +
  labs(
    title   = "John Snow's cholera map, Soho 1854",
    caption = "Dots: cholera deaths  |  Red crosses: water pumps"
  )

Figure 11.17: John Snow’s 1854 cholera data: deaths (dots) and water pump locations (red crosses).

Snow persuaded local authorities to remove the handle from the Broad Street pump, and the outbreak subsided. His map is widely regarded as one of the founding moments of both epidemiology and data visualisation: the spatial pattern in the data made the case that no table of death counts could have made so immediately.
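The visual impression can be backed up with a rough count of which pump each death was closest to. The sketch below uses straight-line distance, which ignores the street layout, so it is a simplification rather than a reconstruction of Snow’s own reasoning; the object names are illustrative:

```r
library(HistData)

# Squared Euclidean distance from every death (rows) to every pump (columns)
d2 <- outer(Snow.deaths$x, Snow.pumps$x, "-")^2 +
      outer(Snow.deaths$y, Snow.pumps$y, "-")^2

# Label of the closest pump for each death (max.col on -d2 finds the row minimum)
nearest_pump <- as.character(Snow.pumps$label)[max.col(-d2, ties.method = "first")]

# Number of deaths closest to each pump, largest first
pump_counts <- sort(table(nearest_pump), decreasing = TRUE)
pump_counts
```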

The data used above come from the HistData package. Robin Wilson has collected and converted Snow’s original data into several additional formats (CSV, GeoJSON, shapefile, and others), available at https://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/. These formats are directly readable with the tools from Chapter 9, making it straightforward to overlay the death and pump locations on a modern interactive base map with leaflet.

Today, geospatial tools like sf and leaflet (Chapters 9 and 10) allow us to reproduce and extend this kind of analysis in minutes. The 2003 SARS outbreak, the 2014–2016 Ebola crisis, and the 2020 COVID-19 pandemic were all tracked with real-time geospatial dashboards that owe a conceptual debt to Snow’s hand-drawn map. The crucial ingredient is the same in every case: cases (or deaths, or test results) paired with a location, in a format that can be plotted.

11.6 Connecting with next year’s modules

This module may turn out to be most useful as a foundation for the Statistics modules you will encounter in Year 3 (those with code MAS392x). The connections below are necessarily speculative — the visualisation techniques that matter most depend on the data and the question — but they give a sense of where the skills from this module are likely to be applied.

MAS3921 Extreme Value Theory: Some datasets contain observations that are far larger than the bulk of the data: record-breaking floods, stock-market crashes, extreme wind speeds. Identifying such outliers visually is often the first step: a boxplot, density plot, or histogram (Section 3.4) will reveal a heavy upper tail or isolated extreme values. When the range of the data spans several orders of magnitude, a log scale (Section 5.5.3) is usually needed to make the body of the distribution legible alongside the extremes. Once identified, these extreme observations are best handled using the methods of extreme value theory rather than standard distributional assumptions.
MAS3923 Time Series and MAS3924 Survival Analysis: Both modules deal with data indexed by time, so Chapter 8 and geom_line() in general are the most directly relevant tools. Time series analysis studies how a variable evolves and how to forecast it; survival analysis focuses on the time until an event (failure, death, recovery). In both cases, plotting the raw series or survival curves before modelling is standard practice.
MAS3928 Statistical Modelling: Building regression models requires understanding the relationships between variables. The natural starting point is a pairwise scatterplot matrix, such as the one produced by GGally::ggpairs() (introduced in Section 3.6.3), which shows every pair of variables simultaneously. This makes it easy to spot strong correlations, non-linear relationships, or outliers before fitting a model. That said, keep in mind the message from Section 11.3: a strong visual association between two variables is not evidence that one causes the other.

A broader remark is worth making here. Everything covered in this module has been about using ggplot2 to visualise the data itself. The modules above mostly focus on fitting statistical models to data, and their outputs (fitted values, residuals, survival curves, spectral densities) are a different kind of object from raw observations. It is perfectly possible to visualise modelling results with ggplot2, but doing so usually requires a little more scaffolding: extracting quantities from model objects, reshaping them into tidy data frames, and then applying the grammar of graphics in the usual way. The grammar still applies; the challenge is knowing what you want to visualise and how to get it into the right shape. The tools in this module apply most directly during exploratory data analysis, before any model is fitted, where visualisation plays its most indispensable role.
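As a small sketch of that scaffolding, the following block fits a simple linear model, extracts fitted values and residuals into a data frame, and plots one against the other. The model (hwy on displ) and the object names are illustrative assumptions:

```r
library(ggplot2)

# Fit a simple model (illustrative choice of variables)
fit <- lm(hwy ~ displ, data = mpg)

# Extract model quantities into a tidy data frame
diag_df <- data.frame(
  fitted   = fitted(fit),
  residual = resid(fit)
)

# Residuals against fitted values, built with the usual grammar
p_resid <- ggplot(diag_df, aes(x = fitted, y = residual)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "firebrick") +
  labs(x = "Fitted value", y = "Residual")

p_resid
```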

Beyond statistics, R is a capable platform for data science and machine learning more broadly. MAS3919 Foundations of Machine Learning covers this territory directly. The caret package (Classification And REgression Training) provides a unified interface for training, tuning, and evaluating a wide range of machine learning models in R, and is a natural next step for data science work. More generally, the tidyverse is a collection of R packages designed around the same tidy-data philosophy underpinning ggplot2; it provides consistent, composable tools for every stage of a data analysis workflow, from importing and reshaping data to modelling and reporting. Several packages from this ecosystem have already appeared in this module: dplyr for data manipulation, tidyr for reshaping, and lubridate for working with dates and times.

11.7 What makes a good visualisation?

The statistician and information designer Edward Tufte codified many of the principles underlying good data visualisation in his 1983 book The Visual Display of Quantitative Information. His ideas remain the most widely cited framework in the field and form a useful checklist when evaluating your own plots.

Data-ink ratio

Tufte’s central concept is the data-ink ratio: the proportion of the ink (or pixels) on a chart that actually conveys data information, as opposed to decorative or redundant elements. A high data-ink ratio is desirable.

\[\text{Data-ink ratio} = \frac{\text{data ink}}{\text{total ink used to print the graphic}}\]

In ggplot2 terms, theme_minimal() and theme_classic() move towards higher data-ink ratios by removing background fills and redundant grid lines. As a rule, removing unnecessary elements is more effective than adding decoration.
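As a brief illustration, the sketch below takes a default scatterplot and raises its data-ink ratio by switching theme and dropping the minor grid lines. The object names p_base and p_lean are illustrative:

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.4) +
  labs(x = "Engine displacement (litres)", y = "Highway fuel economy (mpg)")

# Drop the grey background, then remove the minor grid lines as well
p_lean <- p_base +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

p_lean
```

Printing p_base and p_lean side by side shows the same data with noticeably less non-data ink in the second version.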

Chart junk

Chart junk is Tufte’s term for visual elements that clutter a graphic without contributing to understanding: three-dimensional effects on 2D data, decorative hatching, unnecessary grid lines, and ornamental borders. These elements consume ink, distract the eye, and often distort the data.

The ggplot2 defaults are already fairly clean, but watch for:

3D bar or pie charts (never use these for data that is not inherently three-dimensional)
Gradient fills on bars or backgrounds
Grid lines so dense they compete with the data

Lie factor

The lie factor is the ratio of the visual effect size to the true data effect size:

\[\text{Lie factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}\]

A lie factor of 1 is ideal. Values greater than 1 exaggerate differences; values less than 1 understate them. The most common source of inflated lie factors in practice is a truncated $y$-axis: starting a bar chart at a non-zero baseline can make a 10% change look like a 300% change, because the bar lengths are then no longer proportional to the values they represent.
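As a worked illustration with made-up numbers, suppose two bars represent the values 30 and 33, a 10% increase. If the axis starts at 29 rather than 0, the drawn bar heights are 1 and 4:

\[\text{effect in data} = \frac{33 - 30}{30} = 10\%, \qquad \text{effect shown} = \frac{(33 - 29) - (30 - 29)}{30 - 29} = \frac{4 - 1}{1} = 300\%\]

\[\text{Lie factor} = \frac{300\%}{10\%} = 30\]

With the axis starting at zero, the bar heights are 30 and 33, the effect shown is 10%, and the lie factor is 1.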

Small multiples

Small multiples (Tufte’s term for what ggplot2 calls facets) display the same graphical form repeated for different subsets of data. Because the reader’s eye can compare across panels without re-learning the visual encoding, small multiples are often more effective than a single busy chart with many overlapping series. Chapters 5 and 8 used facet_wrap() extensively for exactly this reason.

Data density

Related to the data-ink ratio is data density: the amount of information per unit of display area. Tufte’s own “sparkline” is an extreme example — a tiny inline time-series the size of a word, embedded directly in text. In practice, the principle encourages:

choosing compact geoms (e.g., a boxplot rather than a bar chart with error bars) when the data support them;
avoiding excessive whitespace or padding; and
using facets to show more comparisons in the same area.

Summary

Principle	Practical rule
Data-ink ratio	Remove anything that does not convey data
Chart junk	Avoid 3D effects, decorative fills, dense grids
Lie factor	$\approx 1$: don’t exaggerate or understate; start axes at zero for bar charts
Small multiples	Facet rather than overcrowd a single panel
Data density	Pack information efficiently without clutter

Tufte’s principles are guidelines rather than absolute rules. In some contexts, such as public-facing infographics where visual engagement matters, some decoration is justified. But as a default starting point, these rules consistently produce clearer, more honest, and more informative graphics.

11.8 Visualising a mathematical function: Riemann’s zeta function

Data does not have to come from a statistical dataset. Any deterministic function can be visualised in ggplot2 by constructing a data frame of inputs and outputs and treating the columns as aesthetic mappings in the usual way. This section illustrates the idea with the Riemann zeta function, the subject of the Riemann hypothesis, which is arguably the most famous and difficult unsolved problem in mathematics.

The Riemann zeta function $\zeta(s)$ is a complex-valued function of a complex variable $s$. The Riemann hypothesis conjectures that every non-trivial zero of $\zeta$ lies on the critical line $\text{Re}(s) = \frac{1}{2}$. On that line, $s = \frac{1}{2} + it$ for real $t$, so $\zeta$ reduces to a complex-valued function of a single real parameter $t$, whose real and imaginary parts can each be plotted against $t$.

The pracma package provides a numerical implementation of $\zeta$. We construct the data frame in one pipeline, using complex arithmetic directly in R:

library(pracma)

df_riemann <- data.frame(t = seq(-30, 30, by = 0.01)) |>
  mutate(
    z  = zeta(0.5 + t * 1i),
    Re = Re(z),
    Im = Im(z)
  )

Plotting Re and Im against t gives the critical line plot. Each crossing of zero in either component is a candidate zero of $\zeta$:

df_riemann |>
  ggplot() +
  geom_line(aes(x = t, y = Re), colour = "red") +
  geom_line(aes(x = t, y = Im), colour = "blue") +
  labs(x = expression(t),
       y = expression(zeta(over(1, 2) + it)))

Figure 11.18: Real (red) and imaginary (blue) parts of $\zeta(1/2 + it)$ for $t \in [-30, 30]$.

The two curves oscillate and cross zero repeatedly. A non-trivial zero of $\zeta$ occurs precisely where both curves cross zero simultaneously: the first such pair is near $t \approx 14.1$.
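The location of that first zero can be refined numerically by minimising the modulus of $\zeta$, which vanishes exactly where both parts are zero. This is a sketch using base R’s optimize() together with pracma’s zeta(); the helper name abs_zeta and the search bracket are assumptions chosen by eye from the plot:

```r
library(pracma)

# |zeta(1/2 + it)| is zero exactly where the real and imaginary parts both vanish
abs_zeta <- function(t) Mod(zeta(0.5 + t * 1i))

# Search a narrow bracket around the first simultaneous crossing
first_zero <- optimize(abs_zeta, interval = c(14, 14.3))

first_zero$minimum    # location t of the minimum
first_zero$objective  # value of |zeta| there, close to 0
```

The result is only as accurate as the tolerance of optimize() and the numerical precision of zeta(), so it should be read as an approximation.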

The same data can be rendered as a parametric curve by plotting Im against Re, tracing the path of $\zeta(\frac{1}{2} + it)$ in the complex plane as $t$ varies. This is sometimes called the polar graph of $\zeta$ on the critical line:

df_riemann |>
  ggplot() +
  geom_path(aes(x = Re, y = Im), colour = "blue") +
  labs(x = expression(Re(zeta(over(1, 2) + it))),
       y = expression(Im(zeta(over(1, 2) + it)))) +
  coord_equal()

Figure 11.19: Parametric plot of $\zeta(1/2 + it)$ in the complex plane: the curve passes through the origin at each non-trivial zero.

Every passage of the curve through the origin $(0, 0)$ corresponds to a non-trivial zero of $\zeta$ on the critical line, and hence is consistent with the Riemann hypothesis. The Riemann hypothesis has been verified computationally for the first $10^{13}$ zeros, but remains unproved in general.

The key observation from a visualisation standpoint is that nothing here required special treatment in ggplot2: the workflow was identical to that of any other plot. Define the input grid, compute the function values, store both in a data frame, then apply the appropriate geom. The same approach works for Fourier series, differential equations solved numerically, or any other deterministic mathematical object.
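To illustrate the same recipe on a Fourier series, the sketch below plots partial sums of the standard square-wave expansion $\frac{4}{\pi}\sum_{k=1}^{n} \frac{\sin((2k-1)x)}{2k-1}$. The function name square_partial, the grid, and the choice of 1, 5 and 25 terms are illustrative assumptions:

```r
library(ggplot2)

# Partial Fourier sum of a square wave at a single point x
square_partial <- function(x, n_terms) {
  k <- seq_len(n_terms)
  (4 / pi) * sum(sin((2 * k - 1) * x) / (2 * k - 1))
}

# Input grid crossed with the number of terms, then the function values
df_fourier <- expand.grid(
  x       = seq(-pi, pi, length.out = 801),
  n_terms = c(1, 5, 25)
)
df_fourier$y <- mapply(square_partial, df_fourier$x, df_fourier$n_terms)

p_fourier <- ggplot(df_fourier, aes(x = x, y = y, colour = factor(n_terms))) +
  geom_line() +
  labs(x = "x", y = "Partial sum", colour = "Terms")

p_fourier
```

The steps are the same as for the zeta function: an input grid, computed values, a data frame, and a geom.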

11.9 Summary

This chapter has extended the ggplot2 toolkit and placed it in a broader context:

Additional geoms: geom_hline(), geom_vline(), and geom_abline() add reference lines defined by coordinates rather than data; geom_violin() reveals distributional shape; geom_text() and geom_label() annotate individual points; geom_raster() encodes a third variable on a grid via fill colour; geom_density_2d() shows where bivariate data cluster via contour lines; and geom_smooth(method = "glm") overlays a logistic regression sigmoid on binary outcome data.
Case studies:
1. Simpson’s paradox shows that a trend in aggregated data can reverse once a confounding variable is revealed by colour or faceting.
2. Spurious correlations demonstrate that visual fit is no substitute for causal reasoning.
3. The datasaurus_dozen and Anscombe’s quartet show that near-identical summary statistics can hide radically different structures, so always plot your data.
4. John Snow’s cholera map illustrates that the right visualisation can make a scientific argument more immediately than any numerical summary.
Design principles: Tufte’s principles (data-ink ratio, chart junk, lie factor, small multiples, data density) provide a practical framework for producing honest, effective graphics.
Mathematical visualisation: ggplot2 is not confined to statistical data. Any function can be visualised by constructing a data frame of inputs and outputs and treating the result like any other tidy dataset.

There is, of course, far more to data visualisation than this module has covered. Animation (gganimate), network graphs (ggraph), interactive web graphics beyond leaflet (plotly, Observable), high-dimensional projection methods, perception research, and accessibility considerations (colour-blindness, screen readers) are all active areas with extensive literatures of their own. What this module has tried to establish is a firm foundation — the grammar of graphics, tidy data, and a consistent set of principles — from which any of those directions can be pursued.
