2 Graphical Presentation of Data with Base R

2.1 Introduction

Graphical displays of data can be very useful in showing the main features of a data set. The appropriate form of graph depends on the nature of the variables being displayed and on which aspects are to be shown. In every case, the aim is to provide a clear and truthful representation of the data, not to distort it or to impress with unnecessary “fancy” features.

2.1.1 Data frames

In R, we almost exclusively work with data frames for data visualisation. A data frame is a table where each column represents a variable and each row represents an observation.

Important: Data frames are not the same as matrices. While both are rectangular structures, they differ in key ways:

  • A matrix contains elements of a single type (e.g., all numeric)
  • A data frame can have different types in each column (e.g., numeric, character, factor)
  • Data frames have column names that we use to access variables
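
The difference is easiest to see in a small example. The following is a minimal sketch using made-up values; the object names m and df are purely illustrative, and the column types shown for the data frame assume R 4.0 or later (earlier versions convert text columns to factors by default).

```r
# Illustrative values only; m and df are hypothetical names

# A matrix holds a single type, so mixing numbers and text
# coerces every element to character
m <- cbind(displ = c(1.8, 2.0), drv = c("f", "4"))
typeof(m)

# A data frame stores each column with its own type
df <- data.frame(displ = c(1.8, 2.0), drv = c("f", "4"))
sapply(df, class)

# Columns are accessed by name, and keep their numeric type
df$displ
mean(df$displ)
```

Because the matrix has been coerced to character, a call such as mean(m[, "displ"]) would no longer return a numeric mean, whereas the data frame column can be used directly.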

While R has some special object types for specific kinds of data (such as ts for time series), the unified framework introduced from Chapter 3 onwards always works with data frames. This means you do not need to convert the class of your data for the sake of visualisation: if your data are in a data frame, you are ready to plot.

If you are unfamiliar with data frames or need a refresher on basic operations like creating, subsetting, and manipulating data frames, please refer to Practical 1.

Base R provides a rich set of plotting functions that can be used without loading any additional packages. These functions are fast, flexible, and form the foundation upon which many other plotting systems are built.

2.1.2 Datasets used in this chapter

We primarily use the mpg dataset from the ggplot2 package, which contains fuel economy data for 234 vehicles from 1999 to 2008:

library(ggplot2)
head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans     drv     cty
##   <chr>        <chr> <dbl> <int> <int> <chr>     <chr> <int>
## 1 audi         a4      1.8  1999     4 auto(l5)  f        18
## 2 audi         a4      1.8  1999     4 manual(m… f        21
## 3 audi         a4      2    2008     4 manual(m… f        20
## 4 audi         a4      2    2008     4 auto(av)  f        21
## 5 audi         a4      2.8  1999     6 auto(l5)  f        16
## 6 audi         a4      2.8  1999     6 manual(m… f        18
## # ℹ 3 more variables: hwy <int>, fl <chr>, class <chr>

The variables are:

  • manufacturer: Car manufacturer
  • model: Model name
  • displ: Engine displacement in litres
  • year: Year of manufacture (1999 or 2008)
  • cyl: Number of cylinders
  • trans: Type of transmission
  • drv: Drive type (f = front-wheel, r = rear-wheel, 4 = 4-wheel)
  • cty: City miles per gallon
  • hwy: Highway miles per gallon
  • fl: Fuel type (e = E85 ethanol, d = diesel, r = regular, p = premium, c = CNG)
  • class: Type of vehicle (e.g., compact, SUV)

For time series data, we use the economics dataset:

head(economics)
## # A tibble: 6 × 6
##   date         pce    pop psavert uempmed unemploy
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 1967-07-01  507. 198712    12.6     4.5     2944
## 2 1967-08-01  510. 198911    12.6     4.7     2945
## 3 1967-09-01  516. 199113    11.9     4.6     2958
## 4 1967-10-01  512. 199311    12.9     4.9     3143
## 5 1967-11-01  517. 199498    12.8     4.7     3066
## 6 1967-12-01  525. 199657    11.8     4.8     3018

This contains US economic time series data from 1967 to 2015. The variables are:

  • date: Month of data collection
  • pce: Personal consumption expenditures (billions of dollars)
  • pop: Total population (thousands)
  • psavert: Personal savings rate (percent)
  • uempmed: Median duration of unemployment (weeks)
  • unemploy: Number of unemployed (thousands)

2.2 One variable, continuous

When we have a single continuous variable, we want to understand its distribution: where are the values concentrated, how spread out are they, and is the distribution symmetric or skewed?

2.2.1 Histograms

Histograms represent the distribution of a sample of values of a continuous variable. The range of values is divided into intervals (bins or classes), and the frequencies in each class are represented by columns. As the variable is continuous, there are no gaps between neighbouring columns.

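The \(y\)-axis can show either the count in each class (absolute frequency) or the density, which is the relative frequency divided by the bin width, so that the total area of the columns is 1. When all bins have the same width, the two versions have the same shape and differ only in the scale of the \(y\)-axis. Bin widths should be chosen to show the shape of the distribution without it being swamped by random variation.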

par(mfrow = c(1, 2))
hist(
  mpg$displ,
  col = "grey",
  main = "Absolute Frequency",
  freq = TRUE,
  xlab = "Engine Displacement (L)",
  ylab = "Counts"
)
hist(
  mpg$displ,
  col = "grey",
  main = "Relative Frequency",
  freq = FALSE,
  xlab = "Engine Displacement (L)",
  ylab = "Density"
)
par(mfrow = c(1, 1))

Figure 2.1: Histograms of engine displacement: absolute frequency (left) and relative frequency (right).
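
The link between the two vertical scales in Figure 2.1 can be checked numerically. The sketch below asks hist() for its computed values without drawing anything (plot = FALSE) and then tries a different set of bins through the breaks argument. The helper names widths and rel_freq, and the choice of 0.25-litre bins, are assumptions made purely for illustration.

```r
library(ggplot2)  # only needed for the mpg data

# Compute the histogram without drawing it
h <- hist(mpg$displ, plot = FALSE)
widths <- diff(h$breaks)              # width of each bin
rel_freq <- h$counts / sum(h$counts)  # relative frequency of each bin

# Density is relative frequency divided by bin width ...
all.equal(h$density, rel_freq / widths)
# ... so the areas of the columns add up to 1
sum(h$density * widths)

# Bin edges can be set explicitly through the breaks argument;
# these 0.25-litre bins are an arbitrary choice for illustration
hist(
  mpg$displ,
  breaks = seq(1.5, 7, by = 0.25),
  col = "grey",
  main = "Narrower bins",
  xlab = "Engine Displacement (L)"
)
```

Narrower bins reveal more detail but also more random variation, so it is usually worth trying a few settings before settling on one.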

The ggplot2 equivalent uses geom_histogram() — see Section 3.4.1.

Exercise: Create a histogram of cty (city miles per gallon) from the mpg dataset. Use col = "lightblue" and add appropriate axis labels.

2.2.2 Density plots

Density plots provide a smoothed version of a histogram, estimating the probability density function of the underlying variable. They are created using the density() function combined with plot():

plot(
  density(mpg$displ),
  main = "Density Plot of Engine Displacement",
  xlab = "Engine Displacement (L)",
  col = "darkblue",
  lwd = 2
)
rug(mpg$displ, col = "red")

Figure 2.2: Density plot of engine displacement.

The rug() function adds tick marks showing individual data values along the axis.
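
How smooth the curve looks depends on the bandwidth used by density(). As a rough sketch, the adjust argument rescales the default bandwidth; the value 0.5 and the object names d_default and d_rough below are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)  # only needed for the mpg data

d_default <- density(mpg$displ)              # default bandwidth
d_rough <- density(mpg$displ, adjust = 0.5)  # half the default bandwidth

plot(
  d_default,
  main = "Effect of Bandwidth",
  xlab = "Engine Displacement (L)",
  ylim = range(0, d_default$y, d_rough$y),  # make room for both curves
  col = "darkblue",
  lwd = 2
)
lines(d_rough, col = "red", lty = 2, lwd = 2)
legend(
  "topright",
  legend = c("adjust = 1 (default)", "adjust = 0.5"),
  col = c("darkblue", "red"),
  lty = c(1, 2),
  lwd = 2
)
```

A smaller bandwidth follows the data more closely and shows more bumps; a larger one gives a smoother curve that may hide genuine features.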

The ggplot2 equivalent uses geom_density() — see Section 3.4.2.

Exercise: Create a density plot of cty from the mpg dataset. Add a rug plot and use a different colour for the density line.

2.2.3 Stem-and-leaf plots

Stem-and-leaf plots are similar to histograms but show the actual data values:

stem(mpg$displ)
## 
##   The decimal point is at the |
## 
##   1 | 6666688888888888888999
##   2 | 0000000000000000000002222224444444444444
##   2 | 55555555555555555555777777778888888888
##   3 | 000000001111113333333334444
##   3 | 555556677788888888999
##   4 | 00000000000000022224
##   4 | 6666666666677777777777777777
##   5 | 002222233333344444444
##   5 | 67777777799
##   6 | 0122
##   6 | 5
##   7 | 0

Values on the left (the “stems”) give the first digit(s), and those on the right (the “leaves”) the subsequent digit for each observation. This type of plot is useful for small datasets but becomes unwieldy for larger ones.

Note: There is no direct equivalent for stem-and-leaf plots in the unified framework in Chapter 3.

2.2.4 Boxplots

Boxplots (also known as box-and-whisker plots) provide a visual summary of the distribution. The central bar is the median, the box spans the interquartile range (IQR), and each whisker extends to the most extreme point that lies within 1.5 times the IQR of the edge of the box. Points beyond the whiskers are shown individually as potential outliers.

boxplot(
  mpg$displ,
  main = "Engine Displacement",
  ylab = "Engine Displacement (L)"
)

Figure 2.3: Boxplot of engine displacement.
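
The numbers behind a boxplot can be inspected with boxplot.stats(), which returns the five values that are drawn together with any points plotted individually. The sketch below is illustrative only; the names s, box_height, lower_limit and upper_limit are hypothetical. Note that base R uses Tukey's hinges for the box edges, which can differ slightly from the quartiles returned by quantile().

```r
library(ggplot2)  # only needed for the mpg data

s <- boxplot.stats(mpg$displ)
s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # values drawn as individual points (possibly none)

# Whiskers may not extend beyond 1.5 box heights from the box edges
box_height <- s$stats[4] - s$stats[2]
lower_limit <- s$stats[2] - 1.5 * box_height
upper_limit <- s$stats[4] + 1.5 * box_height
c(lower_limit, upper_limit)
```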

While boxplots are most commonly used to compare distributions across groups (see Section 2.5.3), they are also useful for a single variable to quickly identify the median, spread, and any outliers.

The ggplot2 equivalent uses geom_boxplot() — see Section 3.4.3.

2.3 One variable, discrete

For discrete or categorical variables, we want to see the frequency of each category.

2.3.1 Bar charts

Bar charts display the frequency with which each distinct value of a variable occurred. The widths of the bars should be equal, and a small gap is drawn between each bar to indicate separate categories.

barplot(
  table(mpg$drv),
  xlab = "Drive Type",
  ylab = "Frequency",
  col = "steelblue",
  names.arg = c("4-wheel", "Front", "Rear")
)

Figure 2.4: Bar chart of drive type in the mpg dataset.
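
The labels passed to names.arg above are matched to the bars purely by position, so they rely on the order in which table() returns the categories. One possible safeguard, sketched here with a hypothetical lookup vector drv_labels, is to attach the labels by category name instead:

```r
library(ggplot2)  # only needed for the mpg data

counts <- table(mpg$drv)
names(counts)  # categories in sorted order: "4", "f", "r"

# Look the labels up by category name instead of relying on position
drv_labels <- c("4" = "4-wheel", f = "Front", r = "Rear")

barplot(
  counts,
  names.arg = drv_labels[names(counts)],
  xlab = "Drive Type",
  ylab = "Frequency",
  col = "steelblue"
)
```

With this approach the labels stay correct even if the order of the categories changes, for example after sorting the bars by frequency.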

Grouped bar charts, which display two discrete variables at once, are covered in Section 2.6.1.

The ggplot2 equivalent uses geom_bar() — see Section 3.5.1.

Exercise: Create a bar chart of class (vehicle type) from the mpg dataset. Add appropriate axis labels.

2.4 Two variables, both continuous

When both variables are continuous, we want to understand their relationship: are they correlated, is the relationship linear, are there clusters or outliers?

2.4.1 Scatterplots

Scatterplots display the relationship between two continuous variables. Each observation is represented as a point in two-dimensional space.

plot(
  mpg$displ,
  mpg$hwy,
  xlab = "Engine Displacement (L)",
  ylab = "Highway MPG",
  main = "Fuel Efficiency vs Engine Size",
  pch = 19
)

Figure 2.5: Scatterplot of engine displacement vs highway MPG.

We can add a regression line to show the trend:

plot(
  mpg$displ,
  mpg$hwy,
  xlab = "Engine Displacement (L)",
  ylab = "Highway MPG",
  main = "Fuel Efficiency vs Engine Size",
  pch = 19
)
abline(lm(hwy ~ displ, data = mpg), col = "red", lwd = 2)

Figure 2.6: Scatterplot with regression line.
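
It can be convenient to store the fitted model first, so that the same object provides both the numbers and the line. The following is a small sketch of that pattern; the object name fit is an arbitrary choice.

```r
library(ggplot2)  # only needed for the mpg data

# Fit the straight line once and keep the result
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)                # intercept and slope of the fitted line
cor(mpg$displ, mpg$hwy)  # strength of the linear association

plot(
  mpg$displ,
  mpg$hwy,
  xlab = "Engine Displacement (L)",
  ylab = "Highway MPG",
  pch = 19
)
abline(fit, col = "red", lwd = 2)  # abline() accepts a fitted lm object
```

The negative slope and negative correlation agree with the downward trend visible in the scatterplot.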

The ggplot2 equivalent uses geom_point() — see Section 3.6.1.

Exercise: Create a scatterplot of cty (\(y\)-axis) versus hwy (\(x\)-axis) from the mpg dataset. Add a regression line. What relationship do you observe?

2.4.2 Line plots

Line plots connect observations in sequence and are used for time series or ordered data. For time series data like economics, a line plot shows the trend over time:

plot(
  economics$date,
  economics$unemploy,
  type = "l",
  xlab = "Date",
  ylab = "Unemployment (thousands)",
  main = "US Unemployment Over Time",
  col = "darkblue"
)

Figure 2.7: Unemployment over time in the US.
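
Because type = "l" joins the points in the order in which they appear in the data, a line plot is only meaningful when the rows are sorted by the \(x\) variable. The economics data are already in date order; the sketch below deliberately shuffles the rows (with an arbitrary seed) to show what goes wrong and how order() fixes it. The names shuffled and sorted are hypothetical.

```r
library(ggplot2)  # only needed for the economics data

set.seed(1)  # arbitrary seed so that the shuffle is reproducible
shuffled <- economics[sample(nrow(economics)), ]

# Sort by date before drawing the line
sorted <- shuffled[order(shuffled$date), ]

par(mfrow = c(1, 2))
plot(
  shuffled$date, shuffled$unemploy,
  type = "l", main = "Rows shuffled",
  xlab = "Date", ylab = "Unemployment (thousands)"
)
plot(
  sorted$date, sorted$unemploy,
  type = "l", main = "Rows sorted by date",
  xlab = "Date", ylab = "Unemployment (thousands)"
)
par(mfrow = c(1, 1))
```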

Note that a scatterplot (type = "p" or default) would not be appropriate here — the sequential nature of time series data is best shown with connected lines:

plot(
  economics$date,
  economics$unemploy,
  xlab = "Date",
  ylab = "Unemployment (thousands)",
  main = "Same Data as Scatterplot",
  pch = 19,
  cex = 0.5
)

Figure 2.8: Scatterplot of time series data — not as informative as a line plot.

The ggplot2 equivalent uses geom_line() — see Section 3.6.2.

Exercise: Create a line plot of psavert (personal savings rate) over time from the economics dataset. Add appropriate axis labels.

2.4.3 Pairs plots (scatterplot matrix)

Returning to scatterplots: what if we want to look at several pairs of variables at once? When exploring relationships among multiple continuous variables, a pairs plot (scatterplot matrix) shows all pairwise scatterplots in a single display:

pairs(
  mpg[, c("displ", "cty", "hwy")],
  main = "Scatterplot Matrix",
  pch = 19
)

Figure 2.9: Pairs plot of selected mpg variables.

Alternatively, calling plot() directly on a data frame with three or more columns calls pairs() automatically; this works best when all the columns are numeric.
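
A scatterplot matrix pairs naturally with a correlation matrix of the same variables. As an illustrative sketch, the code below computes the correlations and adds a smooth curve to each panel using panel.smooth(), a panel function supplied with base R graphics. The object name vars is an arbitrary choice.

```r
library(ggplot2)  # only needed for the mpg data

vars <- mpg[, c("displ", "cty", "hwy")]

# Numerical companion to the scatterplot matrix
round(cor(vars), 2)

# Add a smooth curve to every panel
pairs(
  vars,
  panel = panel.smooth,
  main = "Scatterplot Matrix with Smooth Curves",
  pch = 19
)
```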

The ggplot2 equivalent uses ggpairs() from the GGally package — see Section 3.6.3.

2.5 Two variables, one continuous and one discrete

When we have one continuous variable and one categorical variable, we typically want to compare distributions across groups.

2.5.1 The problem with scatterplots

A naive approach might be to use a scatterplot, but this doesn’t work well because the discrete variable creates vertical strips of points:

plot(
  as.numeric(factor(mpg$drv)),
  mpg$displ,
  xlab = "Drive Type",
  ylab = "Engine Displacement (L)",
  main = "Engine Displacement by Drive Type",
  pch = 19,
  xaxt = "n"
)
axis(1, at = 1:3, labels = c("4-wheel", "Front", "Rear"))

Figure 2.10: Scatterplot for continuous vs discrete — not ideal due to overplotting.

Many points overlap, making it hard to see the distribution within each group.

2.5.2 Stripcharts

A stripchart (also called a one-dimensional scatterplot) addresses the overplotting problem by adding jitter — small random horizontal offsets:

stripchart(
  displ ~ drv,
  data = mpg,
  method = "jitter",
  pch = 19,
  vertical = TRUE,
  xlab = "Drive Type",
  ylab = "Engine Displacement (L)",
  group.names = c("4-wheel", "Front", "Rear")
)

Figure 2.11: Stripchart of engine displacement by drive type.

This shows all individual data points, which is useful for smaller datasets.
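
Since the jitter is random, the picture changes slightly each time the code is run. The sketch below shows one way to make it reproducible and to reduce the remaining overlap: set.seed() fixes the random offsets, the jitter argument controls their size, and adjustcolor() makes the points semi-transparent. The seed, jitter amount and transparency level are all arbitrary choices for illustration.

```r
library(ggplot2)  # only needed for the mpg data

set.seed(42)  # arbitrary seed; fixes the random jitter
point_col <- adjustcolor("black", alpha.f = 0.4)  # semi-transparent points

stripchart(
  displ ~ drv,
  data = mpg,
  method = "jitter",
  jitter = 0.2,  # amount of jitter (the default is 0.1)
  pch = 19,
  col = point_col,
  vertical = TRUE,
  xlab = "Drive Type",
  ylab = "Engine Displacement (L)",
  group.names = c("4-wheel", "Front", "Rear")
)
```

Transparency relies on the graphics device supporting it, which is the case for most modern devices such as PDF and PNG.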

The ggplot2 equivalent uses geom_jitter() — see Section 3.7.2.

Exercise: Create a stripchart of cty by class from the mpg dataset. Use jittering and add appropriate axis labels.

2.5.3 Boxplots

For larger datasets or when summary statistics are more important than individual points, boxplots are the standard choice. As in Section 2.2.4, the central bar is the median, the box spans the interquartile range (IQR), and each whisker extends to the most extreme point that lies within 1.5 times the IQR of the edge of the box. Points beyond the whiskers are shown individually as potential outliers.

boxplot(
  displ ~ drv,
  data = mpg,
  xlab = "Drive Type",
  ylab = "Engine Displacement (L)",
  names = c("4-wheel", "Front", "Rear")
)

Figure 2.12: Boxplot of engine displacement by drive type.
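
A grouped boxplot hides how many observations lie behind each box. As a small sketch, the group sizes and medians can be checked with table() and tapply(), and the varwidth argument of boxplot() can be used to draw wider boxes for larger groups. The object name group_medians is hypothetical.

```r
library(ggplot2)  # only needed for the mpg data

# Group sizes and medians behind the boxes
table(mpg$drv)
group_medians <- tapply(mpg$displ, mpg$drv, median)
group_medians

# Box widths proportional to the square root of the group size
boxplot(
  displ ~ drv,
  data = mpg,
  varwidth = TRUE,
  xlab = "Drive Type",
  ylab = "Engine Displacement (L)",
  names = c("4-wheel", "Front", "Rear")
)
```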

The ggplot2 equivalent uses geom_boxplot() — see Section 3.7.3.

Exercise: Create a boxplot of cty by class from the mpg dataset. Which vehicle class has the best city fuel efficiency? Which has the most variability?

2.6 Two variables, both discrete

When both variables are categorical, we are looking at the joint frequency distribution: how often each combination of categories occurs.

2.6.1 Grouped bar charts

Grouped bar charts show frequencies for combinations of two categorical variables. Here we show vehicle class by drive type:

# Create contingency table
class_drv <- table(mpg$class, mpg$drv)

barplot(
  class_drv,
  beside = TRUE,
  xlab = "Drive Type",
  ylab = "Frequency",
  col = rainbow(7),
  legend.text = rownames(class_drv),
  args.legend = list(x = "topright", cex = 0.7),
  names.arg = c("4-wheel", "Front", "Rear")
)

Figure 2.13: Grouped bar chart showing drive type by vehicle class.
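
Which variable forms the groups is determined by the orientation of the contingency table: barplot() draws one group of bars per column and one bar per row. The sketch below transposes the table with t() to group by vehicle class instead; the name drv_class and the grey palette are arbitrary choices for illustration.

```r
library(ggplot2)  # only needed for the mpg data

class_drv <- table(mpg$class, mpg$drv)
dim(class_drv)  # 7 classes (rows) by 3 drive types (columns)

# Transpose so that each group of bars is one vehicle class
drv_class <- t(class_drv)

barplot(
  drv_class,
  beside = TRUE,
  main = "Vehicle Class by Drive Type",
  ylab = "Frequency",
  col = gray.colors(3),
  legend.text = c("4-wheel", "Front", "Rear"),
  args.legend = list(x = "topright", cex = 0.7),
  las = 2  # rotate the class labels so that they all fit
)
```

Grouping by class needs only three colours rather than seven, which can make the legend easier to read.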

The ggplot2 equivalent uses geom_bar() with the fill aesthetic — see Section 3.8.1. Colour customisation is covered in Chapter 4.

2.6.2 Mosaic plots

Mosaic plots show the relative frequencies of combinations. The area of each rectangle is proportional to the frequency of that combination:

mosaicplot(
  table(mpg$drv, mpg$cyl),
  main = "Drive Type by Cylinders",
  xlab = "Drive Type",
  ylab = "Cylinders",
  color = TRUE
)

Figure 2.14: Mosaic plot of drive type by number of cylinders.

2.6.3 When a table suffices

For two discrete variables, a simple contingency table often communicates the information more clearly than a plot:

table(mpg$drv, mpg$cyl)
##    
##      4  5  6  8
##   4 23  0 32 48
##   f 58  4 43  1
##   r  0  0  4 21

Don’t create visualisations for the sake of visualisations — think about whether a plot actually adds value over a numerical summary.
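
When a table is the better choice, marginal totals and row proportions often make it easier to read. The following is a minimal sketch using addmargins() and prop.table(); the object name tab and the rounding to two decimal places are arbitrary choices.

```r
library(ggplot2)  # only needed for the mpg data

tab <- table(drv = mpg$drv, cyl = mpg$cyl)

# Add row and column totals
addmargins(tab)

# Proportion of each cylinder count within each drive type (rows sum to 1)
round(prop.table(tab, margin = 1), 2)
```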

2.7 Summary

2.7.1 Choosing the right plot

The appropriate plot type depends on the nature of your variables:

Variables                    | What it shows             | Base R function(s)
-----------------------------|---------------------------|-----------------------------
One continuous               | Distribution              | hist(), plot(density())
One continuous               | Summary                   | boxplot()
One continuous               | Values                    | stem()
One discrete                 | Frequency                 | barplot(table())
Two continuous               | Relationship              | plot()
Two continuous               | Time series               | plot(type = "l")
Multiple pairs of continuous | Pairwise                  | pairs()
One cont. + one discrete     | Comparison                | stripchart(), boxplot()
Two discrete                 | Joint frequency           | barplot(table()), mosaicplot()
Two continuous               | With fitted straight line | plot() + abline(lm())

2.7.2 What to look for

Using these plots, we can understand:

  • Distribution shape: Is it symmetric or skewed? Unimodal or multimodal?
  • Centre and spread: Where are the values concentrated? How variable?
  • Outliers: Are there unusual observations?
  • Relationships: Are two variables correlated? Linear or nonlinear?
  • Group differences: How do distributions differ across categories?

2.7.3 Limitations of base R graphics

While base R graphics are powerful and flexible, they have several limitations:

  1. Inconsistent syntax: Different plot types use different functions with different argument names and behaviours.

  2. Clunky customisation: Adding colours, legends, labels, and titles often requires multiple lines of code and careful coordination.

  3. No colour-blind-friendly defaults: Creating accessible visualisations requires manual colour selection.

  4. Redundant colour use: It’s common to add colour for aesthetic reasons even when it encodes no information.

  5. No unified framework: Each plot type is a standalone function, making it difficult to build a mental model of how graphics work.
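
The first limitation can be seen in this chapter's own examples: the same set of group labels is supplied through three differently named arguments, depending on the function. The sketch below simply places those calls side by side; the vector name drv_names is an arbitrary choice.

```r
library(ggplot2)  # only needed for the mpg data

drv_names <- c("4-wheel", "Front", "Rear")

par(mfrow = c(1, 3))
# barplot() takes the labels as names.arg ...
barplot(table(mpg$drv), names.arg = drv_names, main = "barplot()")
# ... stripchart() takes them as group.names ...
stripchart(
  displ ~ drv, data = mpg, method = "jitter", vertical = TRUE,
  group.names = drv_names, main = "stripchart()"
)
# ... and boxplot() takes them as names
boxplot(displ ~ drv, data = mpg, names = drv_names, main = "boxplot()")
par(mfrow = c(1, 1))
```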

These limitations are addressed by ggplot2, which provides a unified framework based on the grammar of graphics. In the following chapters, we will see how ggplot2 offers consistent syntax, automatic legends, thoughtful defaults, and a layered approach to building graphics.