9 Plotting (geo)spatial data

9.1 Introduction

Geographic Information Systems (GIS) are a cornerstone of modern data analysis across many disciplines. Geospatial engineers use GIS to plan infrastructure, geographers use it to study human and physical landscapes, and archaeologists use it to map excavation sites and analyse spatial patterns in artefacts. Many universities offer dedicated modules on GIS, and you may encounter such courses in Geography, Environmental Science, or Engineering programmes.

As data scientists, we should not lag behind these specialists. The good news is that the grammar of graphics aligns remarkably well with how GIS works. Just as we build statistical graphics layer by layer in ggplot2, GIS practitioners build maps by layering geographic features on top of one another. This conceptual alignment means that once you understand ggplot2, working with spatial data is a natural extension rather than a completely new skill.

In this chapter, we focus on creating static maps using ggplot2. In Chapter 10, we will explore interactive versions that allow users to zoom, pan, and click on map features.

9.2 Why not just use geom_point()?

If you have a dataset with coordinates — say, longitude and latitude, or easting and northing — you might think: “Can’t I just use geom_point() with aes(x = longitude, y = latitude)?” The answer is: yes, but only for simple cases.

# This works for a single dataset
ggplot(my_data, aes(x = longitude, y = latitude)) +
  geom_point()

However, this approach breaks down when you want to:

  1. Overlay multiple datasets that use different coordinate systems (e.g., one uses longitude/latitude, another uses British National Grid easting/northing)
  2. Add a background map from a different source
  3. Ensure geographic accuracy across different regions of the world

The problem is that geom_point() treats coordinates as simple numbers on Cartesian axes. It has no understanding that these numbers represent locations on Earth, or that different coordinate systems exist.

9.2.1 The sf solution

The sf package solves this by attaching coordinate reference system (CRS) metadata to your data. When you use geom_sf() instead of geom_point(), ggplot2 knows:

  • What coordinate system each dataset uses
  • How to transform coordinates between systems

This means that the same point on Earth will appear in the same position on your plot, even if different datasets represent that point using different coordinate systems. You can layer a dataset in longitude/latitude on top of a map in British National Grid, and they will align correctly.

# Two datasets with different coordinate systems
# will still align correctly when using geom_sf()
ggplot() +
  geom_sf(data = uk_map) +           # In WGS84 (longitude/latitude)
  geom_sf(data = scottish_data)      # In British National Grid (easting/northing)

We will look at how the above code works in practice in Section 9.4, and detailed explanations of CRS in Section 9.5.

Without sf, you would need to manually convert all coordinates to a common system before plotting — a tedious and error-prone process.

9.2.2 How geom_sf() differs from other geoms

Another key difference between geom_sf() and other geoms is that you do not need to specify x and y aesthetics. With most geoms, you must tell ggplot2 which variables map to the x- and y-axes:

# Regular geoms require x and y aesthetics
ggplot(data, aes(x = variable1, y = variable2)) +
  geom_point()

With geom_sf(), the geometry column already contains all the positional information. The geom simply plots whatever geometry is in the data:

  • If your data has POINT geometry, points are plotted
  • If your data has POLYGON or MULTIPOLYGON geometry, filled polygons are plotted
  • If your data has LINESTRING geometry, lines are plotted
# geom_sf() reads position from the geometry column automatically
ggplot(spatial_data) +
  geom_sf()  # No aes(x = ..., y = ...) needed!

All other aesthetic mappings work the same as with regular geoms. You can still map variables to colour, fill, size, alpha, linetype, and so on:

# Other aesthetics work as usual
ggplot(spatial_data) +
  geom_sf(aes(fill = population, colour = region), size = 0.5)

This automatic handling of geometry makes geom_sf() both simpler (no need to specify x/y) and more powerful (works with any geometry type) than using geom_point(), geom_polygon(), or geom_path() manually.

9.3 The key ingredient: geometry

The main principle of working with spatial data remains the same as with any other data: we read in a file or load a dataset from a package, then visualise it. The crucial difference is that spatial data requires geometry — a special type of data that describes the shape and location of geographic features.

9.3.1 Simple feature collections

If a dataset already has geometry information in the correct format, it is called a simple feature collection. This is essentially a tidy data frame with an additional geometry column. The geometry column contains the spatial information (points, lines, or polygons) for each observation.

# A simple feature collection looks like this:
# Simple feature collection with 100 features and 3 fields
# Geometry type: POINT
# Bounding box:  xmin: -5.7 ymin: 50.0 xmax: 1.8 ymax: 58.6
# CRS:           EPSG:4326
# # A tibble: 100 x 4
#    name        population area_km2           geometry
#    <chr>            <dbl>    <dbl>        <POINT [°]>
#  1 Location A        1234     12.5 (-2.345 54.123)
#  2 Location B        5678     34.2 (-1.567 55.456)
#  ...

9.3.2 File formats for spatial data

The file format determines what kind of geometry information is available:

Flat files (CSV, XLS, XLSX):

  • Easy to read using read_csv() or read_excel()
  • Most likely contain point geometry only (e.g., latitude and longitude in separate columns)
  • The geometry is not in a readily available format, so we need to manipulate the data before plotting (see Section 9.5.3)

Shapefiles (.shp):

  • The traditional GIS file format
  • Can contain point, line, or polygon geometry
  • A “shapefile” is actually a collection of files with the same name but different extensions (.shp, .shx, .dbf, .prj, etc.)
  • Tend to be larger in size than flat files
  • Geometry is already in the correct format

9.3.3 Reading shapefiles

We use read_sf() from the sf package to read shapefiles. You only need to point to the .shp file, but do not delete the companion files that come with it (e.g., .shx, .dbf, .prj): they contain essential metadata such as the coordinate reference system information.

library(sf)

# Read a shapefile (point to the .shp file; keep the companion files!)
spatial_data <- read_sf("path/to/file.shp")

st_read() achieves the same result; read_sf() is a convenient wrapper that returns a tibble and suppresses messages by default.

9.4 Example: Scottish Census output areas

To illustrate spatial visualisation, we use data from the 2022 Scottish Census. The National Records of Scotland provides geographic data for census output areas at https://www.nrscotland.gov.uk/publications/2022-census-geography-products/. Download the zip file that says “… Population Weighted Centroids” and decompress / extract the files to your working directory. These files should be in a folder OutputArea2022_PWC/ and all with the same name but different file extensions.

9.4.1 Point data: population-weighted centroids

Let’s start by reading the population-weighted centroids (PWC) shapefile. Each output area is represented by a single point — the population-weighted centre of that area.

pwc <-
  read_sf("OutputArea2022_PWC/OutputArea2022_PWC.shp")
head(pwc)
## Simple feature collection with 6 features and 7 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 392813 ymin: 805762 xmax: 394238 ymax: 806594
## Projected CRS: OSGB36 / British National Grid
## # A tibble: 6 × 8
##   code    HHcount Popcount council masterpc easting northing
##   <chr>     <dbl>    <dbl> <chr>   <chr>    <chr>   <chr>   
## 1 S00136…      61       64 S12000… AB10 1BT 393865  806312  
## 2 S00136…      64       68 S12000… AB10 1DA 393877  806296  
## 3 S00136…      62      105 S12000… AB10 1DB 392813  805762  
## 4 S00136…      49       87 S12000… AB10 1DH 393708  806213  
## 5 S00136…      47       75 S12000… AB10 1FL 394238  806594  
## 6 S00136…      49       64 S12000… AB10 1HE 394157  806265  
## # ℹ 1 more variable: geometry <POINT [m]>

Notice that pwc is a simple feature collection with a geometry column containing POINT data. The dataset includes HHcount (household count) and Popcount (population count) for each output area.

9.4.2 First attempt: plotting points

The simplest call just plots the points as they are:

ggplot(pwc) +
  geom_sf(size = 0.3)
Scottish census output areas as points.

Figure 9.1: Scottish census output areas as points.

This already gives a recognisable shape of Scotland. We can also colour the points by population count:

ggplot(pwc) +
  geom_sf(aes(colour = Popcount), size = 0.3) +
  scale_colour_viridis_c() +
  labs(colour = "Population")
Scottish census output areas as points, coloured by population count.

Figure 9.2: Scottish census output areas as points, coloured by population count.

This gives us a recognisable shape of Scotland, but there are two problems:

  1. We want polygons, not points: The dots are hard to see and don’t show the actual boundaries of output areas
  2. No background map: We have no geographic context — no coastline, no country outline

Let’s fix these issues.

9.4.3 Fix (a): Using polygon data

For filled areas instead of points, we need the Extent of the Realm (EoR) shapefile, which contains polygon geometry. From the same data source, download the zip file that says “… – Extent of the Realm” and extract the files to the working directory. These files should be in a folder OutputArea2022_EoR/ and all with the same name but different file extensions.

eor <-
  read_sf("OutputArea2022_EoR/OutputArea2022_EoR.shp")
head(eor)
## Simple feature collection with 6 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 382978.4 ymin: 809487 xmax: 396528 ymax: 816096.1
## Projected CRS: OSGB36 / British National Grid
## # A tibble: 6 × 12
##   code      HHcount Popcount council    sqkm   hect masterpc
##   <chr>       <dbl>    <dbl> <chr>     <dbl>  <dbl> <chr>   
## 1 S00135307      62      144 S12000… 10.9    1095.  AB21 0EQ
## 2 S00135308      33       65 S12000…  5.41    541   AB21 0T…
## 3 S00135309      71      143 S12000… 10.2    1016.  AB23 8NT
## 4 S00135310      65      145 S12000…  0.774    77.4 AB21 7ER
## 5 S00135311      64      151 S12000…  7.81    781.  AB22 8AR
## 6 S00135312      64      187 S12000…  0.0820    8.2 AB21 7GJ
## # ℹ 5 more variables: easting <chr>, northing <chr>,
## #   Shape_Leng <dbl>, Shape_Area <dbl>,
## #   geometry <MULTIPOLYGON [m]>

Now we have polygons! The simplest call shows just the boundaries of each output area:

ggplot(eor) +
  geom_sf()
Scottish census output areas as polygons (boundaries only).

Figure 9.3: Scottish census output areas as polygons (boundaries only).

To show population density, we fill by Popcount:

ggplot(eor, aes(fill = Popcount)) +
  geom_sf(colour = NA) +
  scale_fill_viridis_c() +
  labs(fill = "Population")
Scottish census output areas as polygons, coloured by population count.

Figure 9.4: Scottish census output areas as polygons, coloured by population count.

Note the key differences:

  • We use fill (not colour) to fill the polygon interiors
  • We set colour = NA to remove polygon borders (they would clutter the map at this scale)

9.4.4 Fix (b): Adding a background map

To provide geographic context, we can add a map of the United Kingdom using the rnaturalearth package:

library(rnaturalearth)

# Get UK country outline
uk <- ne_countries(scale = "medium", country = "United Kingdom",
                   returnclass = "sf")

# Check what we have
class(uk)
## [1] "sf"         "data.frame"

Now let’s create a map with the UK outline as a background, then add our census points on top:

ggplot() +
  geom_sf(data = uk) +
  geom_sf(data = pwc, aes(colour = Popcount), size = 0.3) +
  scale_colour_viridis_c() +
  coord_sf(xlim = c(-8, 0), ylim = c(54, 61), crs = 4326) +
  labs(colour = "Population")
Scottish census output areas with UK background map.

Figure 9.5: Scottish census output areas with UK background map.

Much better! We can now see Scotland in its geographic context. The coord_sf() function allows us to zoom into the region of interest by setting xlim and ylim. The crs argument specifies which coordinate reference system those limit values are expressed in; see Section 9.5. It is always a good idea to set it explicitly: when a map combines layers that use different CRS (as ours does — the UK outline is in WGS84 while the census data is in British National Grid), ggplot2 needs to know which system your xlim/ylim values refer to. Using crs = 4326 (WGS84 longitude/latitude) is the safest default because longitude and latitude are universally understood and work regardless of which CRS the underlying data layers use.

9.4.5 Combining: polygons with background

For the complete picture, let’s use polygons with the background map:

ggplot() +
  geom_sf(data = uk) +
  geom_sf(data = eor, aes(fill = Popcount), colour = NA) +
  scale_fill_viridis_c() +
  coord_sf(xlim = c(-8, 0), ylim = c(54, 61), crs = 4326) +
  labs(fill = "Population")
Scottish census output areas (polygons) with UK background map.

Figure 9.6: Scottish census output areas (polygons) with UK background map.

9.5 Coordinate reference systems

An important concept in spatial data is the coordinate reference system (CRS). Different datasets may use different systems to represent locations on Earth.

9.5.1 Common coordinate systems

WGS84 (EPSG:4326):

  • The most common global system
  • Uses longitude and latitude in degrees
  • What you see on Google Maps

British National Grid (EPSG:27700):

  • Used for Great Britain
  • Uses easting and northing in metres
  • What you see on Ordnance Survey maps

Pay attention to the number after “EPSG:”; this is the number you use when there is a crs parameter in the function, such as coord_sf() above in Section 9.4.4 and st_as_sf() below.

9.5.2 How the sf package handles CRS

The sf package handles coordinate system translations automatically. This means:

  • Points represented in different systems will be plotted correctly on the same map
  • Datasets with different CRS can be layered on top of each other

Let’s check the CRS of our datasets:

# Check CRS of the census data
st_crs(pwc)$input
## [1] "OSGB36 / British National Grid"
# Check CRS of the UK map
st_crs(uk)$input
## [1] "WGS 84"

Even though these use different coordinate systems, ggplot2 with geom_sf() handles the transformation automatically.

9.5.3 Creating geometry from coordinates

The point data in OutputArea2022_PWC is particularly useful for learning data manipulation because it contains both:

  • The geometry column (ready-to-use spatial data)
  • Separate coordinate columns (easting, northing)

This allows us to practise converting coordinate columns into geometry. If you have a CSV file with latitude and longitude columns but no geometry, you can create the geometry using st_as_sf():

# First, let's see the coordinate columns in our data
pwc |>
  st_drop_geometry() |>
  select(code, easting, northing) |>
  head()
## # A tibble: 6 × 3
##   code      easting northing
##   <chr>     <chr>   <chr>   
## 1 S00136472 393865  806312  
## 2 S00136478 393877  806296  
## 3 S00136558 392813  805762  
## 4 S00136477 393708  806213  
## 5 S00136331 394238  806594  
## 6 S00136454 394157  806265
# If we only had coordinates (no geometry), we could create it like this:
pwc_coords <- pwc |>
  st_drop_geometry() |>
  select(code, easting, northing, HHcount, Popcount)

# Convert to sf object using the coordinate columns
pwc_recreated <- st_as_sf(pwc_coords,
                          coords = c("easting", "northing"),
                          crs = 27700)  # British National Grid

# Verify it worked
class(pwc_recreated)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"
head(pwc_recreated)
## Simple feature collection with 6 features and 3 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 392813 ymin: 805762 xmax: 394238 ymax: 806594
## Projected CRS: OSGB36 / British National Grid
## # A tibble: 6 × 4
##   code      HHcount Popcount        geometry
##   <chr>       <dbl>    <dbl>     <POINT [m]>
## 1 S00136472      61       64 (393865 806312)
## 2 S00136478      64       68 (393877 806296)
## 3 S00136558      62      105 (392813 805762)
## 4 S00136477      49       87 (393708 806213)
## 5 S00136331      47       75 (394238 806594)
## 6 S00136454      49       64 (394157 806265)

The coords argument specifies which columns contain the x and y coordinates, and crs specifies the coordinate reference system those coordinates use.

Important: The coords argument expects the \(x\)-coordinate first, then the \(y\)-coordinate. For geographic coordinates:

  • Longitude is the \(x\)-axis (east-west position), so it comes first
  • Latitude is the \(y\)-axis (north-south position), so it comes second

This order might feel counterintuitive if you’re used to saying “latitude and longitude”, but remember: longitude measures position along the horizontal (east-west) direction, just like the \(x\)-axis.

# Correct: longitude (x) first, then latitude (y)
st_as_sf(data, coords = c("longitude", "latitude"), crs = 4326)

# Correct: easting (x) first, then northing (y)
st_as_sf(data, coords = c("easting", "northing"), crs = 27700)

# WRONG: latitude first would place points in the wrong locations!
# st_as_sf(data, coords = c("latitude", "longitude"), crs = 4326)

If your points appear in unexpected locations (e.g., in the ocean instead of on land), the first thing to check is whether you have the coordinate order correct.

9.6 Polishing spatial plots

Everything you learned in Chapters 4, 5, and 6 about colours, legends, and axes applies directly to spatial plots:

ggplot() +
  geom_sf(data = uk) +
  geom_sf(data = eor, aes(fill = Popcount), colour = NA) +
  scale_fill_viridis_c(option = "plasma") +
  coord_sf(xlim = c(-8, 0), ylim = c(54, 61), crs = 4326) +
  labs(
    title = "Population by Census Output Area",
    subtitle = "Scotland, 2022 Census",
    fill = "Population",
    caption = "Source: National Records of Scotland"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Polished map of Scottish census output areas.

Figure 9.7: Polished map of Scottish census output areas.

9.7 Summary

9.7.1 Packages needed

  • sf: Reading, manipulating, and writing spatial data
  • rnaturalearth: Downloading country and region outlines for background maps

9.7.2 Data requirements

Spatial data must be a simple feature collection: a tidy data frame plus a geometry column.

Source Geometry types Ease of use
Flat files (CSV, XLS, XLSX) Points only (coordinates in columns) Easy to read, but requires st_as_sf()
Shapefiles (.shp) Points, lines, and/or polygons Geometry ready, but larger file size

9.7.3 Coordinate reference systems

  • Different datasets may use different CRS (e.g., WGS84 vs British National Grid)
  • The sf package handles transformations automatically
  • Datasets with different CRS can be plotted together seamlessly

9.7.4 The spatial geom

  • geom_sf() is the key function for plotting spatial data
  • Use colour for points and lines
  • Use fill for polygon interiors
  • Layer multiple geom_sf() calls to build up your map
  • Use coord_sf(xlim, ylim, crs = 4326) to zoom to a region of interest; always specify crs = 4326 so the limits are interpreted as longitude and latitude, which is unambiguous regardless of the CRS used by the data layers

9.7.5 Building maps

The workflow for creating a map is:

  1. Load background map data (e.g., from rnaturalearth)
  2. Load your spatial data (shapefile or convert from coordinates)
  3. Create a ggplot() with geom_sf() layers
  4. Add scales, labels, and themes as usual