1 Introduction

This practical focuses on plotting spatial (geographic) data using ggplot2 and the sf package. You will learn how to:

We will use the North Carolina SIDS (Sudden Infant Death Syndrome) dataset, which is included with the sf package. This classic dataset contains county-level data on SIDS deaths in North Carolina for two time periods (1974-78 and 1979-84).

2 Part A: Reading and examining spatial data

2.1 Exercise 1: Load and inspect the data

  1. Load the North Carolina dataset using the following code:

    nc <- read_sf(system.file("gpkg/nc.gpkg", package = "sf"))

    Confirm that nc is a simple feature collection by checking its class. What does this tell you about the data structure?

    class(nc)
    ## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

    Answer: The class shows ‘sf’ and ‘tbl_df’ (or ‘data.frame’), confirming it is a simple feature collection. This means it is a tidy data frame with an additional geometry column containing spatial information.

  2. Use str() or head() to examine the structure of the data. How many observations (counties) and variables are there? What is the geometry column called?

    str(nc)
    ## sf [100 × 15] (S3: sf/tbl_df/tbl/data.frame)
    ##  $ AREA     : num [1:100] 0.114 0.061 0.143 0.07 0.153 0.097 0.062 0.091 0.118 0.124 ...
    ##  $ PERIMETER: num [1:100] 1.44 1.23 1.63 2.97 2.21 ...
    ##  $ CNTY_    : num [1:100] 1825 1827 1828 1831 1832 ...
    ##  $ CNTY_ID  : num [1:100] 1825 1827 1828 1831 1832 ...
    ##  $ NAME     : chr [1:100] "Ashe" "Alleghany" "Surry" "Currituck" ...
    ##  $ FIPS     : chr [1:100] "37009" "37005" "37171" "37053" ...
    ##  $ FIPSNO   : num [1:100] 37009 37005 37171 37053 37131 ...
    ##  $ CRESS_ID : int [1:100] 5 3 86 27 66 46 15 37 93 85 ...
    ##  $ BIR74    : num [1:100] 1091 487 3188 508 1421 ...
    ##  $ SID74    : num [1:100] 1 0 5 1 9 7 0 0 4 1 ...
    ##  $ NWBIR74  : num [1:100] 10 10 208 123 1066 ...
    ##  $ BIR79    : num [1:100] 1364 542 3616 830 1606 ...
    ##  $ SID79    : num [1:100] 0 3 6 2 3 5 2 2 2 5 ...
    ##  $ NWBIR79  : num [1:100] 19 12 260 145 1197 ...
    ##  $ geom     :sfc_MULTIPOLYGON of length 100; first list element: List of 1
    ##   ..$ :List of 1
    ##   .. ..$ : num [1:27, 1:2] -81.5 -81.5 -81.6 -81.6 -81.7 ...
    ##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
    ##  - attr(*, "sf_column")= chr "geom"
    ##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
    ##   ..- attr(*, "names")= chr [1:14] "AREA" "PERIMETER" "CNTY_" "CNTY_ID" ...

    Answer: There are 100 counties and 14 variables (15 columns including the geometry column geom). The geometry column is called geom and contains MULTIPOLYGON data — each county is represented by one or more polygons.

  3. Verify that the data is “tidy” in the sense that each row represents one county. Pick a few county names and confirm they appear only once.

    # Check that each county appears once
    nc |>
      count(NAME, sort = TRUE) |>
      head()
    ## Simple feature collection with 6 features and 2 fields
    ## Geometry type: POLYGON
    ## Dimension:     XY
    ## Bounding box:  xmin: -82.07776 ymin: 34.80792 xmax: -79.23799 ymax: 36.58965
    ## Geodetic CRS:  NAD27
    ## # A tibble: 6 × 3
    ##   NAME          n                                                           geom
    ##   <chr>     <int>                                                  <POLYGON [°]>
    ## 1 Alamance      1 ((-79.27082 35.9046, -79.25977 36.04789, -79.2585 36.23569, -…
    ## 2 Alexander     1 ((-81.0491 35.83597, -80.99535 35.97708, -81.02057 36.03493, …
    ## 3 Alleghany     1 ((-81.17667 36.41544, -81.15337 36.42474, -81.1384 36.41763, …
    ## 4 Anson         1 ((-79.90142 34.85241, -79.85371 34.90458, -79.86705 34.96772,…
    ## 5 Ashe          1 ((-81.45289 36.23959, -81.43104 36.26072, -81.41233 36.26729,…
    ## 6 Avery         1 ((-81.92215 35.98251, -81.90085 35.99465, -81.88081 35.98952,…

    Answer: Each county name appears exactly once, confirming one row per county.

2.2 Exercise 2: Examine the coordinate reference system

  1. Use st_crs() to examine the coordinate reference system (CRS) of the data. What CRS is used? Is it a geographic (lat/lon) or projected system?

    st_crs(nc)
    ## Coordinate Reference System:
    ##   User input: NAD27 
    ##   wkt:
    ## GEOGCRS["NAD27",
    ##     DATUM["North American Datum 1927",
    ##         ELLIPSOID["Clarke 1866",6378206.4,294.978698213898,
    ##             LENGTHUNIT["metre",1]]],
    ##     PRIMEM["Greenwich",0,
    ##         ANGLEUNIT["degree",0.0174532925199433]],
    ##     CS[ellipsoidal,2],
    ##         AXIS["geodetic latitude (Lat)",north,
    ##             ORDER[1],
    ##             ANGLEUNIT["degree",0.0174532925199433]],
    ##         AXIS["geodetic longitude (Lon)",east,
    ##             ORDER[2],
    ##             ANGLEUNIT["degree",0.0174532925199433]],
    ##     USAGE[
    ##         SCOPE["Geodesy."],
    ##         AREA["North and central America: Antigua and Barbuda - onshore. Bahamas - onshore plus offshore over internal continental shelf only. Belize - onshore. British Virgin Islands - onshore. Canada onshore - Alberta, British Columbia, Manitoba, New Brunswick, Newfoundland and Labrador, Northwest Territories, Nova Scotia, Nunavut, Ontario, Prince Edward Island, Quebec, Saskatchewan and Yukon - plus offshore east coast. Cuba - onshore and offshore. El Salvador - onshore. Guatemala - onshore. Honduras - onshore. Panama - onshore. Puerto Rico - onshore. Mexico - onshore plus offshore east coast. Nicaragua - onshore. United States (USA) onshore and offshore - Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin and Wyoming - plus offshore . US Virgin Islands - onshore."],
    ##         BBOX[7.15,167.65,83.17,-47.74]],
    ##     ID["EPSG",4267]]

    Answer: The CRS is NAD27 (EPSG:4267), which is a geographic coordinate system using latitude and longitude in degrees. This is similar to WGS84 but uses an older datum specific to North America.

3 Part B: Creating maps with geom_sf()

3.1 Exercise 3: A misleading first plot

  1. Create a map of North Carolina counties with fill = AREA:

    ggplot(nc) +
      geom_sf(aes(fill = AREA)) +
      labs(fill = "Area")
    North Carolina counties coloured by area.

    Figure 3.1: North Carolina counties coloured by area.

    Why is this visualisation not particularly useful or informative?

    Answer: This map is circular and uninformative: larger polygons appear larger on the map AND have darker fill colours. The colour directly corresponds to the visual size of the polygon, so it adds no new information. We already see that some counties are bigger — we don’t need colour to tell us the same thing. A useful choropleth map should show a variable that is NOT directly visible from the polygon shapes.

3.2 Exercise 4: Meaningful indicators

The dataset contains SIDS death counts for two time periods:

  • SID74: SIDS deaths 1974-78
  • SID79: SIDS deaths 1979-84

And population data:

  • BIR74: Live births 1974-78
  • BIR79: Live births 1979-84
  1. Create a map showing the number of SIDS deaths in 1974-78 (SID74). Add an appropriate title and legend label.

    ggplot(nc) +
      geom_sf(aes(fill = SID74)) +
      labs(fill = "SIDS Deaths",
           title = "SIDS Deaths by County, 1974-78")
    SIDS deaths 1974-78 by county.

    Figure 3.2: SIDS deaths 1974-78 by county.

  2. The map from the previous question shows absolute counts. Why might this be misleading when comparing counties? Create a new variable for the SIDS rate (deaths per 1000 live births) and plot it instead.

    Answer: Absolute counts are misleading because larger counties naturally have more births and therefore more SIDS deaths, regardless of the underlying risk. A county with 10 deaths out of 1000 births (rate = 10 per 1000) has a much higher SIDS rate than a county with 10 deaths out of 10000 births (rate = 1 per 1000).

    nc <- nc |>
      mutate(SID74_rate = SID74 / BIR74 * 1000)
    
    ggplot(nc) +
      geom_sf(aes(fill = SID74_rate)) +
      labs(fill = "SIDS Rate\n(per 1000 births)",
           title = "SIDS Rate by County, 1974-78")
    SIDS rate (per 1000 births) 1974-78 by county.

    Figure 3.3: SIDS rate (per 1000 births) 1974-78 by county.

  3. Compare the two maps (absolute counts vs rate). Which counties stand out in each? What story does the rate map tell that the count map does not?

    Answer: The count map highlights the most populous counties (often urban areas with more births). The rate map shows a different pattern — some smaller, more rural counties have high SIDS rates despite having few total deaths. This reveals geographic variation in SIDS risk that is hidden when looking only at counts.

  4. Create the same rate calculation and map for the 1979-84 period. Has the geographic pattern changed?

    nc <- nc |>
      mutate(SID79_rate = SID79 / BIR79 * 1000)
    
    ggplot(nc) +
      geom_sf(aes(fill = SID79_rate)) +
      labs(fill = "SIDS Rate\n(per 1000 births)",
           title = "SIDS Rate by County, 1979-84")
    SIDS rate (per 1000 births) 1979-84 by county.

    Figure 3.4: SIDS rate (per 1000 births) 1979-84 by county.

4 Part C: Improving the visualisation

4.1 Exercise 5: Colour scales

  1. The default colour scale uses a gradient from dark to light blue. Apply a viridis colour scale to make the map more accessible:

    ggplot(nc) +
      geom_sf(aes(fill = SID74_rate)) +
      scale_fill_viridis_c() +
      labs(fill = "SIDS Rate\n(per 1000 births)",
           title = "SIDS Rate by County, 1974-78")
    SIDS rate with viridis colour scale.

    Figure 4.1: SIDS rate with viridis colour scale.

  2. Try a different viridis palette option (e.g., option = "plasma" or option = "magma"). Which do you find most effective for this data?

    ggplot(nc) +
      geom_sf(aes(fill = SID74_rate)) +
      scale_fill_viridis_c(option = "plasma") +
      labs(fill = "SIDS Rate\n(per 1000 births)",
           title = "SIDS Rate by County, 1974-78")
    SIDS rate with plasma colour scale.

    Figure 4.2: SIDS rate with plasma colour scale.

  3. Use scale_fill_distiller() with a sequential ColorBrewer palette (e.g., palette = "YlOrRd") to create a map where higher values are shown in more intense colours.

    ggplot(nc) +
      geom_sf(aes(fill = SID74_rate)) +
      scale_fill_distiller(palette = "YlOrRd", direction = 1) +
      labs(fill = "SIDS Rate\n(per 1000 births)",
           title = "SIDS Rate by County, 1974-78")
    SIDS rate with YlOrRd colour scale.

    Figure 4.3: SIDS rate with YlOrRd colour scale.

4.2 Exercise 6: Legend and theme adjustments

  1. Move the legend to the bottom of the plot using theme(). Adjust the legend width to span more of the plot area.

    ggplot(nc) +
      geom_sf(aes(fill = SID74_rate)) +
      scale_fill_viridis_c() +
      labs(fill = "SIDS Rate (per 1000 births)",
           title = "SIDS Rate by County, 1974-78") +
      theme(
        legend.position = "bottom",
        legend.key.width = unit(2, "cm")
      )
    SIDS rate with bottom legend.

    Figure 4.4: SIDS rate with bottom legend.

  2. Apply theme_minimal() or theme_void() to remove the axis labels and grid lines, which are often unnecessary for maps.

    ggplot(nc) +
      geom_sf(aes(fill = SID74_rate)) +
      scale_fill_viridis_c() +
      labs(fill = "SIDS Rate (per 1000 births)",
           title = "SIDS Rate by County, 1974-78",
           caption = "Source: North Carolina SIDS data") +
      theme_void() +
      theme(
        legend.position = "bottom",
        legend.key.width = unit(2, "cm")
      )
    SIDS rate with minimal theme.

    Figure 4.5: SIDS rate with minimal theme.

5 Part D: Overlaying spatial layers

5.1 Exercise 7: Scotland — centroids on boundaries

For this exercise use the same Scottish Census shapefiles as the notes (Chapter 9). Load them as follows:

eor <- read_sf("OutputArea2022_EoR/OutputArea2022_EoR.shp")
pwc <- read_sf("OutputArea2022_PWC/OutputArea2022_PWC.shp")
  1. Overlay the population-weighted centroids (PWC) as red points on top of the EoR polygon boundaries. Zoom in using coord_sf() to a small region so the individual polygons and their centroids are visible. Does each centroid appear inside its corresponding output area polygon?

    Hint: Use two geom_sf() calls — the first for the EoR polygons (use fill = NA and a light grey colour to show just the outlines), the second for the PWC points (use colour = "red" and a small size). In coord_sf(), set crs = 4326 so that your xlim/ylim limits are interpreted as longitude and latitude (WGS84); this is always a good idea when data layers may use different coordinate reference systems.

    ggplot() +
      geom_sf(data = eor, fill = NA, colour = "grey50", linewidth = 0.2) +
      geom_sf(data = pwc, colour = "red", size = 0.8) +
      coord_sf(xlim = c(-3.3, -3.0), ylim = c(55.9, 56.1), crs = 4326) +
      theme_void()
    Scotland EoR boundaries with PWC centroids overlaid in red (Edinburgh area).

    Figure 5.1: Scotland EoR boundaries with PWC centroids overlaid in red (Edinburgh area).

    Answer: Yes — each red centroid falls inside (or very close to the boundary of) its corresponding output area polygon. This is precisely the definition of a population-weighted centroid: the mean location of all residents within that area, weighted by where people actually live.

5.2 Exercise 8: Scotland and Northern Ireland on a UK map

Download the Northern Ireland Super Data Zone (SDZ) boundaries from NISRA at https://www.nisra.gov.uk/publications/super-data-zone-boundaries-gis-format (choose the ESRI Shapefile option). Extract the files into a folder geography-sdz2021-esri-shapefile/ in your working directory.

Recall from Practical 04 that census_ni was loaded from the NISRA Census 2021 file (https://www.nisra.gov.uk/publications/census-2021-person-and-household-estimates-data-zones-northern-ireland), with columns name, code, residents, area, and density. Reload it now if it is not already in your environment.

ni_sdz <- read_sf("geography-sdz2021-esri-shapefile/SDZ2021.shp")
  1. Using a UK background map from rnaturalearth, overlay the Scotland EoR boundaries and the NI SDZ boundaries on the same map, each in a different fill colour. Use coord_sf() to zoom to show both regions. What do you notice about how geom_sf() handles the two layers, which come from different coordinate reference systems?

    Hint: Call ne_countries() to get the UK outline, then add one geom_sf() layer for eor and another for ni_sdz, each with a fixed fill colour (e.g., "steelblue" and "darkorange") and colour = NA to suppress borders. Set alpha for transparency so both layers are visible. Use crs = 4326 in coord_sf() to express the zoom limits in longitude/latitude — especially important here since the two boundary datasets use different coordinate reference systems.

    library(rnaturalearth)
    uk <- ne_countries(scale = "medium", country = "United Kingdom",
                       returnclass = "sf")
    
    ggplot() +
      geom_sf(data = uk, fill = "grey95", colour = "grey50") +
      geom_sf(data = eor, fill = "steelblue", colour = NA, alpha = 0.6) +
      geom_sf(data = ni_sdz, fill = "darkorange", colour = NA, alpha = 0.6) +
      coord_sf(xlim = c(-9, 0), ylim = c(54, 61.5), crs = 4326) +
      labs(title = "Scotland (blue) and Northern Ireland (orange) census areas") +
      theme_void()
    Scotland EoR and Northern Ireland SDZ boundaries overlaid on a UK map.

    Figure 5.2: Scotland EoR and Northern Ireland SDZ boundaries overlaid on a UK map.

    Answer: The two datasets use different coordinate reference systems (British National Grid for Scotland, Irish Transverse Mercator for Northern Ireland), yet geom_sf() aligns them correctly on the same map without any manual conversion. This is the key advantage of using sf and geom_sf() over plain geom_point(): CRS differences are handled automatically.

6 Summary

Key concepts:

Functions introduced:

Function Purpose
read_sf() Read spatial data files
st_crs() Check coordinate reference system
geom_sf() Plot spatial data
coord_sf(crs = 4326) Zoom / set extent; crs = 4326 keeps limits in lon/lat
scale_fill_viridis_c() Continuous viridis colour scale
scale_fill_distiller() Continuous ColorBrewer scale
theme_void() Remove all background elements