1 Introduction

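This practical focuses on plot cosmetics, statistical transformations, incremental workflow, and logarithmic scales in ggplot2. You will learn how to build plots incrementally from saved objects, fix legend titles, apply built-in themes, zoom with coord_cartesian(), and use logarithmic scales to recognise power law relationships.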

Note: The notes chapter on cosmetics covers many topics, but not all are equally important. This practical focuses on the higher-priority items. See the summary table in Chapter 6 of the notes for a full priority guide.

2 Incremental workflow

A powerful feature of ggplot2 is that you can build plots incrementally. This is covered in Chapter 6.2 of the notes, but we summarise the key ideas here.

2.1 Saving plots as objects

You can save a ggplot as an object and add layers or scales later:

library(ggplot2)

# Create the base plot and save it
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point(size = 2) +
  labs(x = "Engine Displacement (L)", y = "Highway MPG", colour = "Class")

# Display with default colours
p

# Add a different colour scale
p + scale_colour_brewer(palette = "Set1")

# Or try viridis
p + scale_colour_viridis_d()

This approach makes it easy to experiment with different palettes while keeping your base plot consistent.
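One detail worth making explicit is that adding to a saved plot returns a new object and leaves the original untouched, which is why the same base plot can be reused with several palettes. The sketch below illustrates this by counting layers. The names p_base and p_smooth are introduced here for illustration only, and inspecting the layers component is just a convenient check, not part of the normal workflow.

```r
library(ggplot2)

# Base plot with a single point layer
p_base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Adding a layer creates a new plot object; p_base itself is not modified
p_smooth <- p_base + geom_smooth(method = "lm", se = FALSE)

length(p_base$layers)    # number of layers in the original plot
length(p_smooth$layers)  # number of layers in the extended plot
```

The original object should still report one layer and the extended one two, so any number of variants can be derived from the same base.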

2.2 Using last_plot() for interactive exploration

When working interactively in RStudio (e.g., in the console or by running individual chunks), the last_plot() function returns the most recently created ggplot. This allows you to iteratively refine a plot without rewriting the entire code.

2.2.1 How last_plot() works

# Create a basic plot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Add to the last plot without rewriting everything
last_plot() + labs(title = "Engine Size vs Fuel Economy")

# Continue refining
last_plot() + theme_minimal()

# Add more layers
last_plot() + geom_smooth(method = "lm")

2.2.2 When to use last_plot()

last_plot() is most useful for interactive exploration in RStudio:

  • When you’re exploring a new dataset and want to quickly try different options
  • When you’re refining aesthetics (colours, themes, labels) and want to see changes immediately
  • When you’re not sure what the final plot should look like and are experimenting

Important limitations:

  • last_plot() is best kept to interactive use: its result depends on whichever plot was created or modified most recently, so in an R Markdown document, reordering, inserting, or re-running chunks can silently change what it returns
  • For reproducible reports, always use the approach of saving plots as objects
  • The “last plot” is replaced each time you create or modify a ggplot, even if the new plot is assigned to an object and never displayed
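The last limitation can be checked directly. In this minimal sketch (the object names p1 and p2 are illustrative), the second plot is assigned to an object and never displayed, yet it is the one last_plot() refers to afterwards.

```r
library(ggplot2)

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p2 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# mpg has no column called wt, so this is TRUE only if last_plot()
# refers to the mtcars plot created second
uses_mtcars <- "wt" %in% names(last_plot()$data)
uses_mtcars
```

Code that relies on last_plot() can therefore change its behaviour as soon as another plot is created in between, which is why saved objects are the safer choice in reports.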

2.3 Exercises: Incremental workflow

  1. Create a base scatterplot of hwy vs cty from the mpg dataset. Save it as an object p. Then add:

    1. A title using labs()
    2. A theme_bw() theme
    3. A colour scale for drv (you’ll need to add colour = drv to the original aes())

    # Base plot
    p <- ggplot(mpg, aes(x = cty, y = hwy)) +
      geom_point() +
      labs(x = "City MPG", y = "Highway MPG")
    p

    # Add title
    p + labs(title = "City vs Highway Fuel Economy")

    # Add theme
    p + labs(title = "City vs Highway Fuel Economy") + theme_bw()

    # With colour (need to modify base)
    p2 <- ggplot(mpg, aes(x = cty, y = hwy, colour = drv)) +
      geom_point() +
      labs(x = "City MPG", y = "Highway MPG", colour = "Drive")
    p2 + scale_colour_viridis_d()

  2. In the RStudio console (not in a chunk), create a basic scatterplot of hwy vs displ from mpg. Then use last_plot() to:

    • Add a title
    • Change the theme to theme_classic()
    • Add a smooth line with geom_smooth()

    This exercise should be done interactively in the console. The sequence would be:

    ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
    last_plot() + labs(title = "Engine Size vs Fuel Economy")
    last_plot() + theme_classic()
    last_plot() + geom_smooth(method = "lm")

  3. Explain why last_plot() would not be appropriate for a final R Markdown report, and what approach you should use instead.

    last_plot() depends on execution order and the state of the R session, making it unreliable for reproducible documents. In R Markdown, chunks may be executed in different orders during development, and the ‘last plot’ could be different each time. For reproducible reports, save plots as objects (e.g., p <- ggplot(...) + ...) and add layers explicitly (e.g., p + theme_minimal()).

3 Plot cosmetics

3.1 Fixing legend titles

When you map an aesthetic to a transformed variable like factor(cyl), the legend title inherits the expression literally. Use labs() to fix this.

# Ugly legend title
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3)

# Fixed with labs()
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
       colour = "Cylinders")

3.2 Built-in themes and font size

All built-in themes accept a base_size argument to control font size:

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine Displacement (L)", y = "Highway MPG")

# Larger text for readability
p + theme_minimal(base_size = 14)

3.3 Zooming with coord_cartesian()

Use coord_cartesian() to zoom into a plot without removing data. This is important when you have fitted lines or smoothers that should be computed from all data:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_cartesian(xlim = c(2, 5), ylim = c(20, 40)) +
  labs(x = "Engine Displacement (L)", y = "Highway MPG")
## `geom_smooth()` using formula = 'y ~ x'

Compare this to using scale_x_continuous(limits = c(2, 5)), which would remove points outside the range and change the fitted line.
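The difference can be made concrete by fitting the regression by hand. This is a rough sketch rather than a description of what ggplot2 does internally: fit_all, fit_zoom, and mpg_zoom are illustrative names, and subsetting on displ stands in for scale_x_continuous(limits = c(2, 5)), which turns out-of-range values into missing values before the smoother is fitted.

```r
library(ggplot2)  # provides the mpg dataset

# Fit on all rows: the line drawn when zooming with coord_cartesian()
fit_all <- lm(hwy ~ displ, data = mpg)

# Fit only on rows inside the x limits: what remains once
# out-of-range points have been discarded by the scale limits
mpg_zoom <- subset(mpg, displ >= 2 & displ <= 5)
fit_zoom <- lm(hwy ~ displ, data = mpg_zoom)

# Compare the number of rows used and the estimated slopes
c(all = nrow(mpg), zoomed = nrow(mpg_zoom))
c(all = coef(fit_all)[["displ"]], zoomed = coef(fit_zoom)[["displ"]])
```

The two fits use different numbers of rows and give different slopes, which is the reason to prefer coord_cartesian() whenever the fitted line should reflect the full dataset.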

3.4 Exercises

  1. The following plot has an ugly legend title (“factor(cyl)”). Fix it using labs():

    ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
      geom_point(size = 3)

    ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
      geom_point(size = 3) +
      labs(x = "Engine Displacement (L)", y = "Highway MPG",
           colour = "Cylinders")

  2. Using the mpg dataset, create a scatterplot of hwy vs displ and apply three different built-in themes, each with base_size = 14:

    1. theme_bw(base_size = 14)
    2. theme_minimal(base_size = 14)
    3. theme_classic(base_size = 14)

    p <- ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point() +
      labs(x = "Engine Displacement (L)", y = "Highway MPG")
    
    p + theme_bw(base_size = 14)

    p + theme_minimal(base_size = 14)

    p + theme_classic(base_size = 14)

  3. Create a scatterplot of hwy vs displ from mpg, coloured by class. Experiment with:

    1. Moving the legend to the bottom using theme(legend.position = "bottom")
    2. Removing the legend entirely using theme(legend.position = "none")

    p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
      geom_point() +
      labs(x = "Displacement (L)", y = "Highway MPG", colour = "Class")
    
    # Legend at bottom
    p + theme(legend.position = "bottom")

    # No legend
    p + theme(legend.position = "none")

  4. The economics dataset shows US unemployment (unemploy) over time. Create a line plot and use coord_cartesian() to zoom into the period from 2005 to 2015 on the \(x\)-axis and 5000 to 15000 on the \(y\)-axis.

    Hint: Use as.Date() to create date limits, e.g., xlim = as.Date(c("2005-01-01", "2015-01-01")).

    ggplot(economics, aes(x = date, y = unemploy)) +
      geom_line() +
      coord_cartesian(
        xlim = as.Date(c("2005-01-01", "2015-01-01")),
        ylim = c(5000, 15000)
      ) +
      labs(x = "Date", y = "Unemployment (thousands)")

4 Logarithmic scales

Logarithmic scales are useful when data spans several orders of magnitude or follows a multiplicative relationship. This section explores when and how to use scale_x_log10() and scale_y_log10().

4.1 Power law distributions

Many real-world phenomena follow a power law distribution, where the probability of observing a value \(x\) is proportional to \(x^{-\alpha}\) for some exponent \(\alpha > 1\): \[ p(x) \propto x^{-\alpha} \]

Taking logarithms of both sides: \[ \log p(x) = -\alpha \log x + \text{constant} \]

This means that on a log-log plot (both axes on log scale), power law data should appear as a straight line with slope \(-\alpha\).
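As a quick numerical check of this claim, the sketch below builds an exact power law with a made-up exponent and constant (alpha = 2 and 1000 are arbitrary choices for illustration) and fits a straight line on the log-log scale. The fitted slope should recover -alpha.

```r
# Exact power law y = C * x^(-alpha); alpha and the constant are arbitrary
alpha <- 2
x <- 1:100
y <- 1000 * x^(-alpha)

# Straight-line fit in log-log space
fit <- lm(log10(y) ~ log10(x))

slope <- coef(fit)[["log10(x)"]]
slope  # approximately -alpha, up to floating-point error
```

Real data such as word frequencies will scatter around the line, so a slope fitted this way is only a rough estimate of the exponent.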

4.2 Exercises

  1. The poweRlaw package contains the dataset moby, which records the frequency of unique words in Herman Melville’s novel Moby Dick. Load and examine the data:

    data(moby, package = "poweRlaw")
    head(moby, 20)
    ##  [1] 14086  6414  6260  4573  4484  4040  2917  2483  2374  1942  1792  1744
    ## [13]  1711  1683  1674  1604  1581  1493  1372  1297
    length(moby)
    ## [1] 18855

    This is a vector of word frequencies. Build a data frame of counts showing how many words appear exactly 1 time, 2 times, 3 times, etc.

    library(dplyr)  # provides count()

    moby_counts <- data.frame(frequency = moby) |>
      count(frequency, name = "n_words")
    head(moby_counts, 10)
    ##    frequency n_words
    ## 1          1    9161
    ## 2          2    3085
    ## 3          3    1629
    ## 4          4     926
    ## 5          5     627
    ## 6          6     469
    ## 7          7     361
    ## 8          8     300
    ## 9          9     232
    ## 10        10     179

  2. Create a scatterplot of n_words (number of words with that frequency) vs frequency on the original (linear) scale. What do you observe?

    ggplot(moby_counts, aes(x = frequency, y = n_words)) +
      geom_point() +
      labs(x = "Word Frequency", y = "Number of Words")

    The plot shows extreme skewness: most words appear only a few times, while a few words appear very frequently. The points are heavily compressed against the axes, making it difficult to see any pattern.

  3. Apply scale_y_log10() only. Does this straighten the relationship?

    ggplot(moby_counts, aes(x = frequency, y = n_words)) +
      geom_point() +
      scale_y_log10() +
      labs(x = "Word Frequency", y = "Number of Words (log scale)")

    The log \(y\)-axis helps spread out the points vertically, but the relationship is still curved. This suggests we also need a log scale on the \(x\)-axis.

  4. Now apply both scale_x_log10() and scale_y_log10(). What shape do you see? Why does this happen for power law data?

    ggplot(moby_counts, aes(x = frequency, y = n_words)) +
      geom_point() +
      scale_x_log10() +
      scale_y_log10() +
      labs(x = "Word Frequency (log scale)", y = "Number of Words (log scale)")

    On the log-log scale, the points fall approximately on a straight line with negative slope. This is the signature of a power law distribution: \(\log(\text{count}) = -\alpha \log(\text{frequency}) + \text{constant}\), which is linear in log-log space.

  5. Add a linear regression line using geom_smooth(method = "lm") to your log-log plot. The slope of this line estimates \(-\alpha\).

    ggplot(moby_counts, aes(x = frequency, y = n_words)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +
      scale_x_log10() +
      scale_y_log10() +
      labs(x = "Word Frequency (log scale)", y = "Number of Words (log scale)")
    ## `geom_smooth()` using formula = 'y ~ x'

5 Summary

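Incremental workflow:

  • Save a plot as an object (p <- ggplot(...) + ...) and add layers, scales, or themes to it later
  • Use last_plot() for quick interactive exploration, and saved objects for reproducible reports

Cosmetics (high/medium priority):

  • Use labs() to fix axis labels and legend titles
  • All built-in themes accept base_size to control font size
  • Use theme(legend.position = ...) to move or remove the legend
  • Use coord_cartesian() to zoom without removing data

Logarithmic scales:

  • Use scale_x_log10() and scale_y_log10() when data spans several orders of magnitude or follows a multiplicative relationship
  • Power law data appears as a straight line with slope \(-\alpha\) on a log-log plot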