11 Miscellaneous Topics
This chapter covers three broad areas. The first expands your ggplot2
toolkit with geoms that have not appeared in earlier chapters. The second
presents classic case studies in data visualisation — phenomena that plots
reveal far better than numbers alone — and enduring principles for creating
effective graphics. The third demonstrates that ggplot2 is not confined to
statistical data: any deterministic function can be visualised by constructing
a suitable data frame.
11.1 Additional geoms
Earlier chapters focused on the most frequently used geoms. The following geoms complete the picture and are well worth knowing, even if they appear less often in everyday work.
11.1.1 Reference lines: geom_hline(), geom_vline(), geom_abline()
These three geoms add straight reference lines to a plot. They take no data
argument — the lines are defined by their intercept and slope, not by
observations in a data frame.
| Geom | Argument(s) | Draws |
|---|---|---|
| geom_hline() | yintercept | Horizontal line at \(y = c\) |
| geom_vline() | xintercept | Vertical line at \(x = c\) |
| geom_abline() | intercept, slope | Line \(y = a + bx\) |
A common use is to mark a threshold or reference value. The following plot of highway fuel economy by engine displacement adds a dashed horizontal line at the mean, making it easy to see which engine sizes are above and below average:
```r
mean_hwy <- mean(mpg$hwy)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = mean_hwy, linetype = "dashed", colour = "firebrick") +
  labs(x = "Engine displacement (litres)",
       y = "Highway fuel economy (mpg)")
```
Figure 11.1: Highway fuel economy vs engine displacement, with mean marked.
geom_abline() was introduced briefly in Section 3.9.4 as a manual alternative to
geom_smooth(method = "lm"): instead of asking ggplot2 to fit the line, you
supply the intercept and slope directly from a pre-computed lm() object. Its
more general use is adding a line of equality (slope 1, intercept 0) when
comparing two measurements on the same scale — for example, city versus
highway fuel economy:
```r
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.4) +
  geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
  labs(x = "City fuel economy (mpg)", y = "Highway fuel economy (mpg)")
```
Figure 11.2: City vs highway fuel economy with the line of equality.
All points above the line have better highway economy than city economy — which is true of nearly every vehicle in the dataset.
geom_vline() marks a specific \(x\) value, which is useful for annotating
historical events on a time series. The ggplot2::economics dataset records
monthly US economic indicators. The following plot overlays a dashed vertical
line at the date of the Nixon shock (15 August 1971), when President Nixon
suspended the convertibility of the US dollar to gold, ending the Bretton Woods
system:
```r
library(lubridate)

ggplot(economics, aes(x = date, y = psavert)) +
  geom_line() +
  geom_vline(
    xintercept = as.Date("1971-08-15"),
    linetype = "dashed",
    colour = "firebrick"
  ) +
  annotate("text", x = as.Date("1971-08-15"), y = 17,
           label = "Nixon shock", hjust = -0.1, size = 3.5) +
  labs(x = "Date", y = "Personal savings rate (%)")
```
Figure 11.3: US personal savings rate with a vertical line marking the Nixon shock (August 1971).
The dashed line makes it easy to read off whether the savings rate was rising or falling around the shock, and to compare behaviour in the years before and after.
11.1.2 geom_violin()
A violin plot is a hybrid of a boxplot and a kernel density plot. Like a boxplot, it shows the distribution of a continuous variable within groups; unlike a boxplot, it reveals the full shape of the distribution via a smoothed density estimate mirrored on both sides. Violins are widest where data are most dense, so a bimodal group shows two bulges — information that a boxplot’s quartile box cannot convey.
```r
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_violin(fill = "steelblue", alpha = 0.6) +
  labs(x = "Vehicle class", y = "Highway fuel economy (mpg)")
```
Figure 11.4: Distribution of highway fuel economy by vehicle class (violin).
It is common to overlay a boxplot on the violin to show summary statistics alongside the distributional shape:
```r
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_violin(fill = "steelblue", alpha = 0.4) +
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  labs(x = "Vehicle class", y = "Highway fuel economy (mpg)")
```
Figure 11.5: Violin plot with overlaid boxplot.
Setting outlier.shape = NA suppresses the boxplot’s outlier points, since
the violin already shows the full distribution.
11.1.3 geom_text() and geom_label()
These geoms annotate individual observations with text. Both require an
additional label aesthetic. geom_label() draws a filled rectangle behind
the text, making it easier to read against a busy background, though the
background rectangle can obscure nearby points.
The mtcars dataset has car names stored as row names rather than a column.
Converting them into a proper column with rownames_to_column() makes them
available as an aesthetic:
```r
# rownames_to_column() comes from the tibble package (part of the tidyverse)
mtcars_new <- mtcars |> rownames_to_column("name")
```
Starting with geom_text(), each car is labelled at its (mpg, disp) coordinates. With over 30 cars in a small space the labels inevitably overlap:
```r
ggplot(mtcars_new, aes(x = mpg, y = disp)) +
  geom_point() +
  geom_text(aes(label = name), size = 2.5) +
  labs(x = "Fuel economy (mpg)", y = "Displacement (cu. in.)")
```
Figure 11.6: geom_text() labelling all cars by name: overlapping labels are hard to read.
geom_label() draws a white-filled box behind each label, which improves
legibility against a busy background. The trade-off is that the boxes can cover
nearby data points:
```r
ggplot(mtcars_new, aes(x = mpg, y = disp)) +
  geom_point() +
  geom_label(aes(label = name), size = 2.5) +
  labs(x = "Fuel economy (mpg)", y = "Displacement (cu. in.)")
```
Figure 11.7: geom_label() provides clearer text but the boxes obscure some data points.
For datasets where labels still clash after using nudge_x / nudge_y, the
ggrepel package provides geom_text_repel() and geom_label_repel(), which
automatically reposition labels to avoid collision.
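A minimal sketch of the repelled version, assuming ggrepel is installed (max.overlaps raises the number of labels ggrepel will attempt to place before discarding any):
```r
library(ggrepel)

ggplot(mtcars_new, aes(x = mpg, y = disp)) +
  geom_point() +
  # geom_text_repel() iteratively nudges labels away from points and from
  # each other, drawing a short segment when a label moves far from its point
  geom_text_repel(aes(label = name), size = 2.5, max.overlaps = 20) +
  labs(x = "Fuel economy (mpg)", y = "Displacement (cu. in.)")
```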
11.1.4 geom_raster()
geom_raster() fills a rectangular grid of cells, mapping a variable to the
fill colour of each cell. It is the ggplot2 equivalent of a heatmap:
the x and y aesthetics give the cell centres, and fill gives the value.
A very common application is displaying a correlation matrix. The tidy
format required by geom_raster() is obtained by reshaping the output of
cor() with pivot_longer():
```r
cor_long <- mtcars |>
  select(mpg, cyl, disp, hp, wt) |>
  cor() |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "correlation")

ggplot(cor_long, aes(var1, var2, fill = correlation)) +
  geom_raster() +
  geom_text(aes(label = round(correlation, 2)), size = 3) +
  scale_fill_gradient2(low = "steelblue", mid = "white", high = "firebrick",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "Correlation") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Figure 11.8: Correlation matrix of selected mtcars variables.
The built-in faithfuld dataset is a 2D kernel density estimate of the Old
Faithful geyser data, stored on a regular grid of eruption duration and waiting
time values. This provides a ready-made example of geom_raster() applied to
a continuous density:
```r
ggplot(faithfuld, aes(x = waiting, y = eruptions)) +
  geom_raster(aes(fill = density)) +
  scale_fill_viridis_c() +
  labs(x = "Waiting time (minutes)", y = "Eruption duration (minutes)",
       fill = "Density")
```
Figure 11.9: 2D kernel density of Old Faithful eruptions displayed with geom_raster().
The two clusters correspond to Old Faithful’s two eruption modes: short eruptions after a short wait, and long eruptions after a longer wait.
11.1.5 Two views of the same bivariate distribution, and geom_density_2d()*
faithfuld is a pre-computed density grid. Working from the raw faithful
data instead, ggplot2 can compute and render the same 2D density on the fly
in two complementary styles.
The filled density representation treats density as a continuous surface and
shades regions according to their estimated density — directly analogous to
geom_raster() with a pre-computed grid:
```r
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_density_2d_filled(alpha = 0.8) +
  labs(x = "Waiting time (minutes)", y = "Eruption duration (minutes)",
       fill = "Density level")
```
Figure 11.10: Old Faithful eruption data: filled 2D density estimate.
The contour representation draws isolines at fixed density levels. Overlaying these on a scatterplot shows both the raw data and the estimated density structure simultaneously:
```r
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(alpha = 0.3) +
  geom_density_2d(colour = "steelblue") +
  labs(x = "Waiting time (minutes)", y = "Eruption duration (minutes)")
```
Figure 11.11: Old Faithful eruption data: 2D density contours overlaid on points.
Both plots convey the same two-cluster structure. The filled version is more immediately readable; the contour version preserves the individual data points and lets the reader see exactly where observations lie relative to the density peaks.
11.1.6 geom_smooth(method = "glm")
In Chapter 7, geom_smooth() was introduced with method = "lm"
for linear fits. Setting method = "glm" fits a generalised linear model
instead. Combined with method.args = list(family = binomial), this produces a
logistic regression curve — appropriate when the outcome is binary (0/1).
In MAS2910 Regression, you will learn that logistic regression models the
probability of a binary outcome as a sigmoid (S-shaped) function of predictors.
geom_smooth() can overlay this fitted curve directly on a plot of the raw
binary responses.
The Palmer penguins dataset provides a clean example: we aim to predict a
penguin’s sex from a single numeric measurement, bill length. Rows with missing
values are removed with drop_na():
```r
penguins |>
  drop_na() |>
  ggplot(aes(x = bill_len, y = as.numeric(sex) - 1)) +
  geom_jitter(height = 0.05, width = 0, alpha = 0.4) +
  geom_smooth(
    method = "glm",
    method.args = list(family = binomial),
    se = FALSE,
    colour = "steelblue"
  ) +
  labs(x = "Bill length (mm)", y = "P(sex = male)")
```
Figure 11.12: Probability of being male as a function of bill length, with logistic regression curve.
as.numeric(sex) - 1 converts the factor sex (levels: "female" = 1,
"male" = 2) to a 0/1 numeric outcome. geom_jitter() adds a small vertical
offset so the binary points do not lie exactly on \(y = 0\) or \(y = 1\). The
sigmoid curve shows that longer bills are associated with a substantially higher
probability of being male — consistent with sexual dimorphism in beak size
across penguin species.
11.2 Simpson’s paradox
Simpson’s paradox occurs when a trend present within separate groups disappears or reverses when those groups are combined. It is a powerful reminder that aggregated data can tell a systematically misleading story.
The Palmer penguin dataset provides a clean illustration. When bill length and bill depth are plotted for all penguins together, the relationship appears negative:
```r
penguins |>
  filter(!is.na(bill_len), !is.na(bill_dep)) |>
  ggplot(aes(x = bill_len, y = bill_dep)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")
```
Figure 11.13: Bill length vs bill depth across all penguin species — a negative trend.
Yet within each species, the relationship is clearly positive:
```r
penguins |>
  filter(!is.na(bill_len), !is.na(bill_dep)) |>
  ggplot(aes(x = bill_len, y = bill_dep, colour = species)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)", colour = "Species")
```
Figure 11.14: Bill length vs bill depth within each penguin species — a positive trend in every group.
The reversal is driven by species acting as a confounding variable. Gentoo penguins are large-billed but have relatively shallow bills for their size, while Adélie penguins are the opposite. When species groups are merged, Gentoo’s long, shallow bills pull the overall slope negative, masking the within-group positive trend.
Whenever you see a trend in aggregated data, ask: “Is there a grouping variable that could be driving this?” Disaggregating by that variable is often the fastest way to find out, and a well-chosen colour aesthetic is all it takes.
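Faceting achieves the same disaggregation as colour; a minimal variant of the plot above using facet_wrap():
```r
penguins |>
  filter(!is.na(bill_len), !is.na(bill_dep)) |>
  ggplot(aes(x = bill_len, y = bill_dep)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  # one panel per species: each within-group trend is fitted separately
  facet_wrap(~species) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")
```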
11.3 Correlation does not imply causation
Two variables can be strongly correlated for reasons that have nothing to do with one causing the other. Tyler Vigen has catalogued hundreds of such spurious correlations at https://tylervigen.com/spurious-correlations. The most famous example finds that US per capita cheese consumption and the number of people who died by becoming tangled in their bedsheets tracked each other almost perfectly over a ten-year period:
```r
# Data from tylervigen.com/spurious-correlations
cheese_bed <- tibble(
  year = 2000:2009,
  cheese_lbs = c(29.8, 30.1, 30.5, 30.6, 31.3,
                 31.7, 32.6, 33.1, 32.7, 32.8),
  bedsheet_deaths = c(327, 456, 509, 497, 596,
                      573, 661, 741, 809, 717)
)

ggplot(cheese_bed, aes(x = cheese_lbs, y = bedsheet_deaths)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_text(aes(label = year), vjust = -0.8, size = 3) +
  labs(
    x = "US per capita cheese consumption (lbs)",
    y = "Deaths by bedsheet tangling",
    title = "Correlation = 0.95 (to 2 d.p.)",
    caption = "Source: tylervigen.com/spurious-correlations"
  )
```
Figure 11.15: US cheese consumption vs deaths by bedsheet tangling: a spurious correlation.
Both series share the same broad upward trend over the decade. This is common-cause correlation: a third factor (population growth, changing dietary habits, ageing demographics) drives both series independently. Eating more cheese does not make bedsheets more dangerous.
The lesson: a convincing scatterplot with a tight linear fit is no substitute
for a causal argument grounded in subject-matter knowledge. Always ask whether
the relationship makes mechanistic sense. The geoms involved here —
geom_point(), geom_smooth(), and geom_text() — are all familiar, but
the relationship they depict is pure coincidence.
11.4 Numerical summaries alone are insufficient
Summary statistics — means, standard deviations, correlations — collapse a
dataset into a few numbers and inevitably lose information. The canonical
demonstration is Anscombe’s quartet (1973): four datasets with identical
means, standard deviations, and correlations, but radically different visual
structures. The datasauRus package extends this idea dramatically with the
datasaurus_dozen: thirteen datasets that share the same summary statistics but
look completely different when plotted.
```r
library(datasauRus)

datasaurus_dozen |>
  group_by(dataset) |>
  summarise(
    mean_x = round(mean(x), 1),
    mean_y = round(mean(y), 1),
    sd_x = round(sd(x), 1),
    sd_y = round(sd(y), 1),
    cor = round(cor(x, y), 2),
    .groups = "drop"
  ) |>
  head(6)
```
```
## # A tibble: 6 × 6
##   dataset  mean_x mean_y  sd_x  sd_y   cor
##   <chr>     <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 away       54.3   47.8  16.8  26.9 -0.06
## 2 bullseye   54.3   47.8  16.8  26.9 -0.07
## 3 circle     54.3   47.8  16.8  26.9 -0.07
## 4 dino       54.3   47.8  16.8  26.9 -0.06
## 5 dots       54.3   47.8  16.8  26.9 -0.06
## 6 h_lines    54.3   47.8  16.8  26.9 -0.06
```
Despite the almost identical statistics, the datasets look nothing alike:
```r
ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point(size = 0.5, alpha = 0.6) +
  facet_wrap(~dataset, ncol = 3) +
  theme_void(base_size = 10) +
  theme(strip.text = element_text(size = 8))
```
Figure 11.16: The datasaurus_dozen: thirteen datasets with identical summary statistics but very different shapes.
One of the thirteen datasets is literally a dinosaur. The message is clear: always plot your data. A single number like the correlation coefficient can make fundamentally different relationships look identical. Visualisation is not an optional extra after the numbers; it is an essential part of understanding data.
11.5 John Snow and the cholera map
In the summer of 1854, a cholera outbreak killed over 500 people in ten days in the Soho district of London. At the time, the prevailing theory was that cholera spread through bad air (the “miasma” theory). The physician John Snow was sceptical.
Snow plotted the deaths on a street map of Soho and marked the locations of the neighbourhood’s public water pumps. The spatial pattern was unmistakable: deaths clustered tightly around a single pump on Broad Street.
```r
library(HistData)

ggplot() +
  geom_point(
    data = Snow.deaths,
    aes(x = x, y = y),
    size = 1,
    alpha = 0.5
  ) +
  geom_point(
    data = Snow.pumps,
    aes(x = x, y = y),
    shape = 3,
    size = 4,
    stroke = 1.5,
    colour = "firebrick"
  ) +
  coord_equal() +
  theme_void() +
  labs(
    title = "John Snow's cholera map, Soho 1854",
    caption = "Dots: cholera deaths | Red crosses: water pumps"
  )
```
Figure 11.17: John Snow’s 1854 cholera data: deaths (dots) and water pump locations (red crosses).
Snow persuaded local authorities to remove the handle from the Broad Street pump, and the outbreak subsided. His map is widely regarded as one of the founding moments of both epidemiology and data visualisation: the spatial pattern in the data made the case that no table of death counts could have made so immediately.
The data used above come from the HistData package. Robin Wilson has collected and converted Snow’s original data into several additional formats (CSV, GeoJSON, shapefile, and others), available at https://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/. These formats are directly readable with the tools from Chapter 9, making it straightforward to overlay the death and pump locations on a modern interactive base map with leaflet.
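As a sketch of that workflow (the file name below is hypothetical; substitute whichever GeoJSON you download from the page above), the deaths could be read with sf and displayed on an interactive map:
```r
library(sf)
library(leaflet)

# hypothetical local copy of one of Robin Wilson's GeoJSON exports
deaths_sf <- st_read("cholera_deaths.geojson")

leaflet(deaths_sf) |>
  addTiles() |>  # OpenStreetMap base layer
  addCircleMarkers(radius = 2, stroke = FALSE, fillOpacity = 0.6)
```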
Today, geospatial tools like sf and leaflet (Chapters 9 and
10) allow us to reproduce and extend this kind of analysis in
minutes. The 2003 SARS outbreak, the 2014–2016 Ebola crisis, and the 2020
COVID-19 pandemic were all tracked with real-time geospatial dashboards that
owe a conceptual debt to Snow’s hand-drawn map. The crucial ingredient is the
same in every case: cases (or deaths, or test results) paired with a location,
in a format that can be plotted.
11.6 Connecting with next year’s modules
This module may turn out to be most useful as a foundation for the Statistics modules you will encounter in Year 3 (those with code MAS392x). The connections below are necessarily speculative — the visualisation techniques that matter most depend on the data and the question — but they give a sense of where the skills from this module are likely to be applied.
- MAS3921 Extreme Value Theory: Some datasets contain observations that are far larger than the bulk of the data: record-breaking floods, stock-market crashes, extreme wind speeds. Identifying such outliers visually is often the first step: a boxplot, density plot, or histogram (Section 3.4) will reveal a heavy upper tail or isolated extreme values. When the range of the data spans several orders of magnitude, a log scale (Section 5.5.3) is usually needed to make the body of the distribution legible alongside the extremes. Once identified, these extreme observations are best handled using the methods of extreme value theory rather than standard distributional assumptions.
- MAS3923 Time Series and MAS3924 Survival Analysis: Both modules deal with data indexed by time, so Chapter 8 and geom_line() in general are the most directly relevant tools. Time series analysis studies how a variable evolves and how to forecast it; survival analysis focuses on the time until an event (failure, death, recovery). In both cases, plotting the raw series or survival curves before modelling is standard practice.
- MAS3928 Statistical Modelling: Building regression models requires understanding the relationships between variables. The natural starting point is a pairwise scatterplot matrix, such as the one produced by GGally::ggpairs() (introduced in Section 3.6.3; see the sketch after this list), which shows every pair of variables simultaneously. This makes it easy to spot strong correlations, non-linear relationships, or outliers before fitting a model. That said, keep in mind the message from Section 11.3: a strong visual association between two variables is not evidence that one causes the other.
- A broader remark is worth making here. Everything covered in this module has been about using ggplot2 to visualise the data itself. The modules above mostly focus on fitting statistical models to data, and their outputs — fitted values, residuals, survival curves, spectral densities — are a different kind of object from raw observations. It is perfectly possible to visualise modelling results with ggplot2, but doing so usually requires a little more scaffolding: extracting quantities from model objects, reshaping them into tidy data frames, and then applying the grammar of graphics in the usual way. The grammar still applies; the challenge is knowing what you want to visualise and how to get it into the right shape. If you find yourself reaching directly for the tools in this module, the most natural moment is during exploratory data analysis — before any model is fitted — where visualisation plays its most direct and indispensable role.
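As an illustration of the scatterplot-matrix idea, a minimal ggpairs() sketch on the penguin measurements (assuming GGally is installed):
```r
library(GGally)

penguins |>
  drop_na() |>
  # the four numeric bill, flipper, and body-mass measurements
  select(bill_len, bill_dep, flipper_len, body_mass) |>
  ggpairs()
```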
Beyond statistics, R is a capable platform for data science and machine
learning more broadly. MAS3919 Foundations of Machine
Learning
covers this territory directly. The caret package (Classification And
REgression Training) provides a unified interface for training, tuning, and
evaluating a wide range of machine learning models in R, and is a natural next
step for data science work. More generally, the tidyverse is a collection
of R packages designed around the same tidy-data philosophy underpinning
ggplot2; it provides consistent, composable tools for every stage of a data
analysis workflow, from importing and reshaping data to modelling and
reporting. Several packages from this ecosystem have already appeared in this
module: dplyr for data manipulation, tidyr for reshaping, and lubridate
for working with dates and times.
11.7 What makes a good visualisation?
The statistician and information designer Edward Tufte codified many of the principles underlying good data visualisation in his 1983 book The Visual Display of Quantitative Information. His ideas remain the most widely cited framework in the field and form a useful checklist when evaluating your own plots.
Data-ink ratio
Tufte’s central concept is the data-ink ratio: the proportion of the ink (or pixels) on a chart that actually conveys data information, as opposed to decorative or redundant elements. A high data-ink ratio is desirable.
\[\text{Data-ink ratio} = \frac{\text{data ink}}{\text{total ink used to print the graphic}}\]
In ggplot2 terms, theme_minimal() and theme_classic() move towards higher
data-ink ratios by removing background fills and redundant grid lines. As a
rule, removing unnecessary elements is more effective than adding decoration.
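For instance, the same scatterplot can be compared under the default and higher data-ink themes:
```r
p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

p                    # default theme_grey(): grey panel, white grid lines
p + theme_minimal()  # drops the grey background
p + theme_classic()  # drops the grid lines too, leaving only the axes
```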
Chart junk
Chart junk is Tufte’s term for visual elements that clutter a graphic without contributing to understanding: three-dimensional effects on 2D data, decorative hatching, unnecessary grid lines, and ornamental borders. These elements consume ink, distract the eye, and often distort the data.
The ggplot2 defaults are already fairly clean, but watch for:
- 3D bar or pie charts (never use these for data that is not inherently three-dimensional)
- Gradient fills on bars or backgrounds
- Grid lines so dense they compete with the data
Lie factor
The lie factor is the ratio of the visual effect size to the true data effect size:
\[\text{Lie factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}\]
A lie factor of 1 is ideal. Values greater than 1 exaggerate differences; values less than 1 understate them. The most common source of inflated lie factors in practice is a truncated \(y\)-axis: starting a bar chart at a non-zero baseline can make a 10% change look like a 300% change, because the bars can then only be compared by their tops, not their lengths.
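A minimal sketch of the effect, using made-up values (a 10% difference between two groups):
```r
bars <- data.frame(group = c("A", "B"), value = c(100, 110))

# Honest version: bars start at zero, so their heights are comparable
ggplot(bars, aes(x = group, y = value)) +
  geom_col()

# Truncated version: zooming the y-axis with coord_cartesian() clips the
# bars at the bottom, so B now looks several times taller than A
ggplot(bars, aes(x = group, y = value)) +
  geom_col() +
  coord_cartesian(ylim = c(95, 112))
```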
Small multiples
Small multiples (Tufte’s term for what ggplot2 calls facets) display
the same graphical form repeated for different subsets of data. Because the
reader’s eye can compare across panels without re-learning the visual encoding,
small multiples are often more effective than a single busy chart with many
overlapping series. Chapters 5 and 8 used
facet_wrap() extensively for exactly this reason.
Data density
Related to the data-ink ratio is data density: the amount of information per unit of display area. Tufte’s own “sparkline” is an extreme example — a tiny inline time-series the size of a word, embedded directly in text. In practice, the principle encourages:
- choosing compact geoms (e.g., a boxplot rather than a bar chart with error bars) when the data support them;
- avoiding excessive whitespace or padding; and
- using facets to show more comparisons in the same area.
Summary
| Principle | Practical rule |
|---|---|
| Data-ink ratio | Remove anything that does not convey data |
| Chart junk | Avoid 3D effects, decorative fills, dense grids |
| Lie factor | \(\approx 1\): don’t exaggerate or understate; start axes at zero for bar charts |
| Small multiples | Facet rather than overcrowd a single panel |
| Data density | Pack information efficiently without clutter |
Tufte’s principles are prescriptive rather than absolute. There are contexts — public-facing infographics where visual engagement matters — where some decoration is justified. But as a default starting point, these rules consistently produce clearer, more honest, and more informative graphics.
11.8 Visualising a mathematical function: Riemann’s zeta function
Data does not have to come from a statistical dataset. Any deterministic
function can be visualised in ggplot2 by constructing a data frame of inputs
and outputs and treating the columns as aesthetic mappings in the usual way.
This section illustrates the idea with the Riemann zeta function, the object at the heart of the Riemann Hypothesis, arguably the most famous and difficult unsolved problem in mathematics.
The Riemann zeta function \(\zeta(s)\) is a complex-valued function of a complex variable \(s\). The Riemann hypothesis conjectures that every non-trivial zero of \(\zeta\) lies on the critical line \(\text{Re}(s) = \frac{1}{2}\). On that line, \(s = \frac{1}{2} + it\) for real \(t\), so \(\zeta\) reduces to a complex-valued function of a single real parameter \(t\), whose real and imaginary parts can each be plotted against \(t\).
The pracma package provides a numerical implementation of \(\zeta\). We
construct the data frame in one pipeline, using complex arithmetic directly
in R:
```r
library(pracma)

df_riemann <- data.frame(t = seq(-30, 30, by = 0.01)) |>
  mutate(
    z = zeta(0.5 + t * 1i),
    Re = Re(z),
    Im = Im(z)
  )
```
Plotting Re and Im against t gives the critical line plot. Each crossing of zero in either component is a candidate zero of \(\zeta\):
```r
df_riemann |>
  ggplot() +
  geom_line(aes(x = t, y = Re), colour = "red") +
  geom_line(aes(x = t, y = Im), colour = "blue") +
  labs(x = expression(t),
       y = expression(zeta(over(1, 2) + it)))
```
Figure 11.18: Real (red) and imaginary (blue) parts of \(\zeta(1/2 + it)\) for \(t \in [-30, 30]\).
The two curves oscillate and cross zero repeatedly. A non-trivial zero of \(\zeta\) occurs precisely where both curves cross zero simultaneously: the first such pair is near \(t \approx 14.1\).
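Those candidate zeros can be located numerically from the same data frame. A rough check (not a proof) picks out grid points where the modulus has a local minimum close to zero; the 0.05 threshold is an arbitrary choice for this grid spacing:
```r
df_riemann |>
  mutate(modulus = Mod(z)) |>
  # keep local minima of |zeta(1/2 + it)| that are close to zero
  filter(t > 0,
         modulus < 0.05,
         modulus < lag(modulus),
         modulus < lead(modulus)) |>
  select(t, modulus)
```
On this grid the first local minima land near \(t \approx 14.13\), \(21.02\), and \(25.01\), matching the first three known non-trivial zeros.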
The same data can be rendered as a parametric curve by plotting Im
against Re, tracing the path of \(\zeta(\frac{1}{2} + it)\) in the complex
plane as \(t\) varies. This is the polar-style graph of the critical line:
```r
df_riemann |>
  ggplot() +
  geom_path(aes(x = Re, y = Im), colour = "blue") +
  labs(x = expression(Re(zeta(over(1, 2) + it))),
       y = expression(Im(zeta(over(1, 2) + it)))) +
  coord_equal()
```
Figure 11.19: Parametric plot of \(\zeta(1/2 + it)\) in the complex plane: the curve passes through the origin at each non-trivial zero.
Every passage of the curve through the origin \((0, 0)\) corresponds to a non-trivial zero of \(\zeta\) on the critical line, exactly as the Riemann hypothesis predicts. The hypothesis has been verified computationally for the first \(10^{13}\) zeros, but remains unproved in general.
The key observation from a visualisation standpoint is that nothing here was
special about ggplot2: the workflow was identical to any other plot. Define
the input grid, compute the function values, store both in a data frame, then
apply the appropriate geom. The same approach works for Fourier series,
differential equations solved numerically, or any other deterministic
mathematical object.
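As a final sketch of that claim, the partial sums of the Fourier series of a square wave, \(\frac{4}{\pi} \sum_{k \text{ odd}} \frac{\sin(kx)}{k}\), can be plotted with exactly the same recipe (the helper square_partial() is ours, not from any package):
```r
# Partial sum of the square-wave Fourier series at a single point x,
# using the first n odd harmonics
square_partial <- function(x, n) {
  k <- seq(1, 2 * n - 1, by = 2)
  (4 / pi) * sum(sin(k * x) / k)
}

df_fourier <- expand.grid(x = seq(-pi, pi, by = 0.01), n = c(1, 3, 10)) |>
  mutate(y = mapply(square_partial, x, n))

ggplot(df_fourier, aes(x = x, y = y, colour = factor(n))) +
  geom_line() +
  labs(x = expression(x), y = "Partial sum", colour = "Terms")
```
With more terms the partial sums approach the square wave, and the persistent overshoot near the jumps (the Gibbs phenomenon) is immediately visible.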
11.9 Summary
This chapter has extended the ggplot2 toolkit and placed it in a broader
context:
Additional geoms:
geom_hline(), geom_vline(), and geom_abline() add reference lines defined by coordinates rather than data; geom_violin() reveals distributional shape; geom_text() and geom_label() annotate individual points; geom_raster() encodes a third variable on a grid via fill colour; geom_density_2d() shows where bivariate data cluster via contour lines; and geom_smooth(method = "glm") overlays a logistic regression sigmoid on binary outcome data.
Case studies:
- Simpson's paradox shows that aggregated trends can reverse when a confounding variable is revealed by colour or faceting.
- Spurious correlations demonstrate that visual fit is no substitute for causal reasoning.
- The datasaurus_dozen and Anscombe's quartet prove that identical summary statistics can hide radically different structures — always plot your data.
- John Snow's cholera map illustrates that the right visualisation can settle a scientific argument more decisively than any numerical summary.
- Tufte's principles — data-ink ratio, chart junk, lie factor, small multiples, data density — provide a practical framework for producing honest, effective graphics.
Mathematical visualisation:
ggplot2 is not confined to statistical data. Any function can be visualised by constructing a data frame of inputs and outputs and treating the result like any other tidy dataset.
There is, of course, far more to data visualisation than this module has
covered. Animation (gganimate), network graphs (ggraph), interactive
web graphics beyond leaflet (plotly, Observable), high-dimensional
projection methods, perception research, and accessibility considerations
(colour-blindness, screen readers) are all active areas with extensive
literatures of their own. What this module has tried to establish is a firm
foundation — the grammar of graphics, tidy data, and a consistent set of
principles — from which any of those directions can be pursued.