This practical focuses on plot cosmetics, statistical transformations,
incremental workflow, and logarithmic scales in ggplot2. You will learn how
to:
- use last_plot() for interactive exploration
- fix legend titles with labs()
- zoom into a plot with coord_cartesian()

Note: The notes chapter on cosmetics covers many topics, but not all are equally important. This practical focuses on the higher-priority items. See the summary table in Chapter 6 of the notes for a full priority guide.
A powerful feature of ggplot2 is that you can build plots incrementally. This
is covered in Chapter 6.2 of the notes, but we
summarise the key ideas here.
You can save a ggplot as an object and add layers or scales later:
library(ggplot2)  # provides ggplot(), last_plot(), and the mpg dataset
# Create the base plot and save it
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point(size = 2) +
  labs(x = "Engine Displacement (L)", y = "Highway MPG", colour = "Class")
# Display with default colours
p
# Add a different colour scale
p + scale_colour_brewer(palette = "Set1")
# Or try viridis
p + scale_colour_viridis_d()
This approach makes it easy to experiment with different palettes while keeping your base plot consistent.
last_plot() for interactive exploration

When working interactively in RStudio (e.g., in the console or by running
individual chunks), the last_plot() function returns the most recently
created or modified ggplot. This lets you refine a plot iteratively without
rewriting the entire code.
How last_plot() works

# Create a basic plot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
# Add to the last plot without rewriting everything
last_plot() + labs(title = "Engine Size vs Fuel Economy")
# Continue refining
last_plot() + theme_minimal()
# Add more layers
last_plot() + geom_smooth(method = "lm")
When to use last_plot()

last_plot() is most useful for interactive exploration in RStudio.

Important limitations:
- last_plot() is intended for interactive use: it won’t work reliably in R Markdown documents because the “last plot” depends on execution order

Create a base scatterplot of hwy vs cty from the mpg dataset. Save
it as an object p. Then add:
- a title using labs()
- the theme_bw() theme
- colour by drv (you’ll need to add colour = drv to the original aes())

# Base plot
p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  labs(x = "City MPG", y = "Highway MPG")
p
# Add title
p + labs(title = "City vs Highway Fuel Economy")
# Add theme
p + labs(title = "City vs Highway Fuel Economy") + theme_bw()
# With colour (need to modify the base plot)
p2 <- ggplot(mpg, aes(x = cty, y = hwy, colour = drv)) +
  geom_point() +
  labs(x = "City MPG", y = "Highway MPG", colour = "Drive")
p2 + scale_colour_viridis_d()
In the RStudio console (not in a chunk), create a basic scatterplot
of hwy vs displ from mpg. Then use last_plot() to:
- add a title
- apply theme_classic()
- add a regression line with geom_smooth()

This exercise should be done interactively in the console. The sequence would be:
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
last_plot() + labs(title = 'Engine Size vs Fuel Economy')
last_plot() + theme_classic()
last_plot() + geom_smooth(method = 'lm')

Explain why last_plot() would not be appropriate for a final R Markdown
report, and what approach you should use instead.
last_plot() depends on execution order and the state of the R session, making it unreliable for reproducible documents. In R Markdown, chunks may be executed in different orders during development, and the ‘last plot’ could be different each time. For reproducible reports, save plots as objects (e.g., p <- ggplot(...) + ...) and add layers explicitly (e.g., p + theme_minimal()).
When you map an aesthetic to a transformed variable like factor(cyl), the
legend title inherits the expression literally. Use labs() to fix this.
# Ugly legend title
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3)
# Fixed with labs()
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
       colour = "Cylinders")
All built-in themes accept a base_size argument to control font size:
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine Displacement (L)", y = "Highway MPG")
# Larger text for readability
p + theme_minimal(base_size = 14)
Zooming with coord_cartesian()

Use coord_cartesian() to zoom into a plot without removing data. This is
important when you have fitted lines or smoothers that should be computed
from all data:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_cartesian(xlim = c(2, 5), ylim = c(20, 40)) +
  labs(x = "Engine Displacement (L)", y = "Highway MPG")
## `geom_smooth()` using formula = 'y ~ x'
Compare this to using scale_x_continuous(limits = c(2, 5)), which would
drop points outside the range before the smoother is fitted and so change
the fitted line.
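To see the difference numerically, the sketch below mimics what each approach passes to the smoother by calling lm() directly: once on all rows of mpg (what the smoother sees under coord_cartesian()) and once on only the rows with displ between 2 and 5 (what it sees after scale limits are applied). The names fit_all, fit_clipped, slope_all, and slope_clipped are illustrative and not part of the practical.

```r
library(ggplot2)  # only needed here for the mpg dataset

# coord_cartesian(): the smoother still sees every observation
fit_all <- lm(hwy ~ displ, data = mpg)

# scale_x_continuous(limits = c(2, 5)): out-of-range points are dropped first
fit_clipped <- lm(hwy ~ displ, data = subset(mpg, displ >= 2 & displ <= 5))

slope_all     <- unname(coef(fit_all)["displ"])
slope_clipped <- unname(coef(fit_clipped)["displ"])

# The two slopes differ because the fits use different subsets of the data
c(all_data = slope_all, clipped = slope_clipped)
```

If the zoomed view is only meant to change what is visible, coord_cartesian() is usually the safer choice, since the fitted line stays the same.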
The following plot has an ugly legend title (“factor(cyl)”). Fix it using
labs():
ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
  geom_point(size = 3)
ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Engine Displacement (L)", y = "Highway MPG",
       colour = "Cylinders")
Using the mpg dataset, create a scatterplot of hwy vs displ and apply
three different built-in themes, each with base_size = 14:
- theme_bw(base_size = 14)
- theme_minimal(base_size = 14)
- theme_classic(base_size = 14)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine Displacement (L)", y = "Highway MPG")
p + theme_bw(base_size = 14)
p + theme_minimal(base_size = 14)
p + theme_classic(base_size = 14)
Create a scatterplot of hwy vs displ from mpg, coloured by class.
Experiment with:
- theme(legend.position = "bottom")
- theme(legend.position = "none")

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(x = "Displacement (L)", y = "Highway MPG", colour = "Class")
# Legend at bottom
p + theme(legend.position = "bottom")
# No legend
p + theme(legend.position = "none")
The economics dataset shows US unemployment (unemploy) over time.
Create a line plot and use coord_cartesian() to zoom into the period
from 2005 to 2015 on the \(x\)-axis and 5000 to 15000 on the \(y\)-axis.
Hint: Use as.Date() to create date limits, e.g.,
xlim = as.Date(c("2005-01-01", "2015-01-01")).
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  coord_cartesian(
    xlim = as.Date(c("2005-01-01", "2015-01-01")),
    ylim = c(5000, 15000)
  ) +
  labs(x = "Date", y = "Unemployment (thousands)")
Logarithmic scales are useful when data spans several orders of magnitude or
follows a multiplicative relationship. This section explores when and how to
use scale_x_log10() and scale_y_log10().
Many real-world phenomena follow a power law distribution, where the probability of observing a value \(x\) is proportional to \(x^{-\alpha}\) for some exponent \(\alpha > 1\): \[ p(x) \propto x^{-\alpha} \]
Taking logarithms of both sides: \[ \log p(x) = -\alpha \log x + \text{constant} \]
This means that on a log-log plot (both axes on log scale), power law data should appear as a straight line with slope \(-\alpha\).
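As a quick check of this claim, here is a minimal sketch that uses exact synthetic values rather than real data. The exponent alpha = 2, the constant 5, and the names x, p_x, fit, and slope are arbitrary choices made for illustration.

```r
alpha <- 2
x <- 1:100
p_x <- 5 * x^(-alpha)   # proportional to x^(-alpha); not normalised

# Regress log p(x) on log x; the slope should recover -alpha
fit <- lm(log10(p_x) ~ log10(x))
slope <- unname(coef(fit)[2])
slope
```

The base of the logarithm does not affect the slope, so log10() here matches what scale_x_log10() and scale_y_log10() do on a plot.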
The poweRlaw package contains the dataset moby, which records the
frequency of unique words in Herman Melville’s novel Moby Dick. Load and
examine the data:
data(moby, package = "poweRlaw")
head(moby, 20)
## [1] 14086 6414 6260 4573 4484 4040 2917 2483 2374 1942 1792 1744
## [13] 1711 1683 1674 1604 1581 1493 1372 1297
length(moby)
## [1] 18855
This is a vector of word frequencies. Create a counted data frame showing how many words appear exactly once, twice, three times, and so on.
library(dplyr)  # provides count()
moby_counts <- data.frame(frequency = moby) |>
  count(frequency, name = "n_words")
head(moby_counts, 10)
##    frequency n_words
## 1          1    9161
## 2          2    3085
## 3          3    1629
## 4          4     926
## 5          5     627
## 6          6     469
## 7          7     361
## 8          8     300
## 9          9     232
## 10        10     179

Create a scatterplot of n_words (number of words with that frequency)
vs frequency on the original (linear) scale. What do you observe?
ggplot(moby_counts, aes(x = frequency, y = n_words)) +
  geom_point() +
  labs(x = "Word Frequency", y = "Number of Words")
The plot shows extreme skewness: most words appear only a few times, while a few words appear very frequently. The points are heavily compressed against the axes, making it difficult to see any pattern.
Apply scale_y_log10() only. Does this straighten the relationship?
ggplot(moby_counts, aes(x = frequency, y = n_words)) +
  geom_point() +
  scale_y_log10() +
  labs(x = "Word Frequency", y = "Number of Words (log scale)")
The log \(y\)-axis helps spread out the points vertically, but the relationship is still curved. This suggests we also need a log scale on the \(x\)-axis.
Now apply both scale_x_log10() and scale_y_log10(). What shape do
you see? Why does this happen for power law data?
ggplot(moby_counts, aes(x = frequency, y = n_words)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Word Frequency (log scale)", y = "Number of Words (log scale)")
On the log-log scale, the points fall approximately on a straight line with negative slope. This is the signature of a power law distribution: \(\log(\text{count}) = -\alpha \log(\text{frequency}) + \text{constant}\), which is linear in log-log space.
Add a linear regression line using geom_smooth(method = "lm") to your
log-log plot. The slope of this line estimates \(-\alpha\).
ggplot(moby_counts, aes(x = frequency, y = n_words)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Word Frequency (log scale)", y = "Number of Words (log scale)")
## `geom_smooth()` using formula = 'y ~ x'
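Because scale_x_log10() and scale_y_log10() transform the data before geom_smooth() fits the model, the plotted line corresponds to a regression of log10(n_words) on log10(frequency). The sketch below reads off that slope directly. It rebuilds the counts with base R table() instead of count(), so it needs only the poweRlaw package, and it skips the calculation if that package is not installed. The names counts, fit, and slope are illustrative.

```r
if (requireNamespace("poweRlaw", quietly = TRUE)) {
  data(moby, package = "poweRlaw")

  # Count how many distinct words occur with each frequency
  counts <- as.data.frame(table(frequency = moby),
                          responseName = "n_words",
                          stringsAsFactors = FALSE)
  counts$frequency <- as.numeric(counts$frequency)

  # Same model that geom_smooth(method = "lm") fits on the log-log plot
  fit <- lm(log10(n_words) ~ log10(frequency), data = counts)
  slope <- unname(coef(fit)[2])   # rough estimate of -alpha
  print(slope)
}
```

A least-squares fit on binned counts is only a rough estimate of the exponent, so treat the result as a visual aid rather than a formal fit.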
Incremental workflow:
- Save plots as objects: p <- ggplot(...) + ...
- Add themes or scales later: p + theme_bw(), p + scale_colour_viridis_d()
- Use last_plot() for interactive exploration in the console
- Avoid last_plot() in R Markdown reports

Cosmetics (high/medium priority):
- Legend titles with labs(): labs(colour = "Nice Title")
- Font size with base_size: theme_minimal(base_size = 14)
- Legend position: theme(legend.position = "bottom") or "none"
- Zooming without removing data: coord_cartesian(xlim = ..., ylim = ...)

Logarithmic scales:
- scale_x_log10() and scale_y_log10() for data spanning orders of
  magnitude