Clement Lee (Newcastle University)
2025-02-04 (Tue)
Introduction
Practical 1
Data pipeline via the tidyverse
Practical 2
Write documents & presentations
Practical 3
Q&A, summary
Today’s slides: https://bit.ly/insights-north-east
Toolkits on https://insightsnortheast.co.uk/data/
Lecturer in Statistics at Newcastle University since August 2022
Teach statistical analysis using R
Research on statistical models for (social) networks
Write in the chat
Experience in working with data, and what are your usual tools?
Have you coded in R (or other programming languages) before?
Future-proof your analysis
for others to replicate / reproduce
including your future self
ChatGPT & others making it easier to start
Also the stepping stone for machine learning and AI
but don’t start with something completely unfamiliar
garbage in, garbage out
Version control, CI/CD, and beyond
“But I need to make presentations, reports, visualisations, etc.”
Most can be done nicely via R
We will see how this can be done
Python great for big data pipelines
R is better in some other aspects
tidyverse
Understand which one is better / more suited for your tasks
Free & open source
Download from https://cran.r-project.org/
Anyone can contribute:
https://cran.r-project.org/web/packages/available_packages_by_name.html
Some packages about “business”, “politics”, etc.
The best interface for using R
Download from https://posit.co/download/rstudio-desktop/
Let’s have a look
## # A tibble: 11,714 × 19
## Year start end property code street address
## <dbl> <dttm> <lgl> <chr> <chr> <chr> <chr>
## 1 2024 2017-02-10 00:00:00 NA N000080 CW 582 Brunswick Park Industr…
## 2 2024 2020-08-01 00:00:00 NA N000082 IM3 582 Brunswick Park Industr…
## 3 2024 2014-11-28 00:00:00 NA N000085 CW 582 Antique Pine Imports, …
## 4 2024 2016-06-10 00:00:00 NA N000087 IF 582 Brunswick Park Industr…
## 5 2024 1995-04-01 00:00:00 NA N000088 IF 582 9, Brunswick Park Indu…
## 6 2024 2015-08-05 00:00:00 NA N000090 CW 582 10, Brunswick Park Ind…
## 7 2024 1995-04-01 00:00:00 NA N000093 IF3 582 Brunswick Park Industr…
## 8 2024 1995-04-01 00:00:00 NA N000094 IF3 583 13, Brunswick Village,…
## 9 2024 1995-04-01 00:00:00 NA N000095 IF3 582 Brunswick Park Industr…
## 10 2024 2019-01-01 00:00:00 NA N000096 CW 12979 Unit 15, Brunswick Par…
## # ℹ 11,704 more rows
## # ℹ 12 more variables: `NEW foi_liable party` <chr>, `NEW c/o address` <chr>,
## # `rateable value` <dbl>, LIA <dbl>, TRL <dbl>, SBR <dbl>, Chr <dbl>,
## # s44a <lgl>, EXM <dbl>, DRR1 <lgl>, DRR2 <dbl>, `net charges` <dbl>
A table / spreadsheet in Excel (.xls, .xlsx, .csv, …) \(\approx\) a data frame / tibble in R
## [1] 28322.39
## [1] 78594.26
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 419 10380 3898 3598140
hist(table1$`net charges`,
breaks = 100, # approximately 100 bars
main = NA, xlab = NA) # don't print the title and the x-axis label yet
title(main = "Histogram of net charges for November 2024",
xlab = "Net charges (£)")
plot(table1$start, table1$`net charges`, log = "y", # on log scale
main = NA, xlab = NA, ylab = NA, # don't print title & axis labels yet
cex = 0.1) # size of the points
title(main = "Net charges over the starting date of businesses",
xlab = "Starting date", ylab = "Net charges (£)")
What if we have an xls or csv file instead?
read_xls()
from the same package readxl
read.csv()
that comes with R
What if we want to create new columns (by manipulating some existing columns) and plot them?
Tedious if we need to
create them first,
re-save the spreadsheet,
and then rerun the analysis
It is usually the best not to modify the raw / original data
Plots still not perfect
Huge skew of net charges \(\rightarrow\) uninformative histogram; perhaps log scale help?
Faffy modifications; instead, we will use the package ggplot2
We can build a data pipeline within R to
read the data,
manipulate them, and
make very nice plots
Tidyverse (https://www.tidyverse.org/), arguably the biggest development in R ever
ggplot2
is part of itGo through the practical in groups
Make summaries and plots using your own data or the business rates data
Note any challenges / difficulties, or anything you find useful
Report your findings
Continue with the previous data frame table1
Let’s say we want to calculate the mean of the square root of the net charges
## [1] 48.93841
The way the functions are typed is not that the way we think about applying them
The code below is more linear
## [1] 48.93841
Collection of R packages designed for data science
All packages share an underlying design philosophy, grammar, and data structures
Installation: only need to do it once (theoretically)
The above might require a lot more packages to be installed
After all, it’s just a collection of packages
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("readr")
install.packages("stringr")
install.packages("tibble")
install.packages("purrr")
install.packages("forcats")
Type above commands once and first
Type below commands every time you need them
library(tidyverse)
dplyr
table2 <- table1 |>
mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
select(date, ratio, `rateable value`:`net charges`, "address") # select / reorder columns
table2
## # A tibble: 11,714 × 13
## date ratio `rateable value` LIA TRL SBR Chr s44a EXM
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 2017-02-10 0.250 45750 22829. NA NA NA NA NA
## 2 2020-08-01 1 35500 17714. NA NA NA NA NA
## 3 2014-11-28 0.2 11250 5614. NA NA -4491 NA NA
## 4 2016-06-10 0.250 16500 8234. NA NA NA NA NA
## 5 1995-04-01 1 21000 10479 NA NA NA NA NA
## 6 2015-08-05 1 52000 28392 NA NA NA NA NA
## 7 1995-04-01 0.0417 12500 6238. NA -5198. NA NA NA
## 8 1995-04-01 0.339 13250 6612. NA -3857. NA NA NA
## 9 1995-04-01 0 11000 5489 NA -5489 NA NA NA
## 10 2019-01-01 1 34250 17091. NA NA NA NA NA
## # ℹ 11,704 more rows
## # ℹ 4 more variables: DRR1 <lgl>, DRR2 <dbl>, `net charges` <dbl>,
## # address <chr>
dplyr
Filter rows where DRR1
is not empty (NA
in R)
Other uses (not shown):
Join other tables
Summarise columns
## # A tibble: 3 × 13
## date ratio `rateable value` LIA TRL SBR Chr s44a EXM DRR1
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <lgl>
## 1 1995-04-01 0.814 272500 148785 NA NA NA NA NA TRUE
## 2 2021-03-09 0.911 47500 23702. NA NA NA NA NA TRUE
## 3 2005-08-21 1 3060000 1670760 0 NA NA NA NA FALSE
## # ℹ 3 more variables: DRR2 <dbl>, `net charges` <dbl>, address <chr>
tidyr
(+ dplyr
)Pivot tables
Combine columns between “rateable value” & “net charges”
New columns “name” and “value”
table3 <- table2 |>
pivot_longer(`rateable value`:`net charges`) |>
select(date, ratio, name, value, everything()) |> # a dplyr function
filter(!is.na(value)) # another dplyr function
table3
## # A tibble: 46,557 × 5
## date ratio name value address
## <date> <dbl> <chr> <dbl> <chr>
## 1 2017-02-10 0.250 rateable value 45750 Brunswick Park Industrial Estate, Br…
## 2 2017-02-10 0.250 LIA 22829. Brunswick Park Industrial Estate, Br…
## 3 2017-02-10 0.250 DRR2 -17122. Brunswick Park Industrial Estate, Br…
## 4 2017-02-10 0.250 net charges 5707. Brunswick Park Industrial Estate, Br…
## 5 2020-08-01 1 rateable value 35500 Brunswick Park Industrial Estate, Br…
## 6 2020-08-01 1 LIA 17714. Brunswick Park Industrial Estate, Br…
## 7 2020-08-01 1 net charges 17714. Brunswick Park Industrial Estate, Br…
## 8 2014-11-28 0.2 rateable value 11250 Antique Pine Imports, Brunswick Park…
## 9 2014-11-28 0.2 LIA 5614. Antique Pine Imports, Brunswick Park…
## 10 2014-11-28 0.2 Chr -4491 Antique Pine Imports, Brunswick Park…
## # ℹ 46,547 more rows
stringr
(+ dplyr
)Extract postcode
Useful if we want to combine with coordinates for plotting on a map
table3 |>
mutate(postcode = str_extract(address, "NE\\d+ \\d\\w{2}")) |> # within a dplyr function
select(-address) # postcode suffices for finding the coordinates
## # A tibble: 46,557 × 5
## date ratio name value postcode
## <date> <dbl> <chr> <dbl> <chr>
## 1 2017-02-10 0.250 rateable value 45750 NE13 7BA
## 2 2017-02-10 0.250 LIA 22829. NE13 7BA
## 3 2017-02-10 0.250 DRR2 -17122. NE13 7BA
## 4 2017-02-10 0.250 net charges 5707. NE13 7BA
## 5 2020-08-01 1 rateable value 35500 NE13 7BA
## 6 2020-08-01 1 LIA 17714. NE13 7BA
## 7 2020-08-01 1 net charges 17714. NE13 7BA
## 8 2014-11-28 0.2 rateable value 11250 NE13 7BA
## 9 2014-11-28 0.2 LIA 5614. NE13 7BA
## 10 2014-11-28 0.2 Chr -4491 NE13 7BA
## # ℹ 46,547 more rows
ggplot2
Histogram of new column ratio
we have created
Always the same syntax, as we will see
ggplot2
Density of new column ratio
we have created
Change the background colour and label size
table1 |>
mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
select(date, ratio, `rateable value`:`net charges`, "address") |>
ggplot() +
geom_density(aes(ratio)) +
theme_bw(18)
table1 |>
mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
select(date, ratio, `rateable value`:`net charges`, "address") |>
ggplot() +
geom_point(aes(date, ratio)) +
theme_bw(18)
table1 |>
ggplot() +
geom_histogram(aes(`net charges`)) +
scale_y_log10() +
scale_x_continuous(labels = scales::comma) +
theme_bw(18) +
labs(title = "Histogram of net charges for November 2024", x = "Net charges (£)")
table1 <- read_xlsx("Business Rates properties - November 2024.xlsx")
table1 |>
ggplot() +
geom_histogram(aes(`net charges`)) +
scale_y_log10() +
scale_x_continuous(labels = scales::comma) +
theme_bw(18) +
labs(title = "Histogram of net charges for November 2024", x = "Net charges (£)")
month_year <- "November 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
ggplot() +
geom_histogram(aes(`net charges`)) +
scale_y_log10() +
scale_x_continuous(labels = scales::comma) +
theme_bw(18) +
labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")
month_year <- "August 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
ggplot() +
geom_histogram(aes(`net charges`)) +
scale_y_log10() +
scale_x_continuous(labels = scales::comma) +
theme_bw(18) +
labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")
month_year <- "May 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
ggplot() +
geom_histogram(aes(`net charges`)) +
scale_y_log10() +
scale_x_continuous(labels = scales::comma) +
theme_bw(18) +
labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")
Install the packages first if you haven’t already
Look at the examples in the slides, and try to do the same for your data
Look at the package manuals for inspiration: https://www.tidyverse.org/packages/
If you know what plots you want but not the code, ask
Note anything useful / challenging, and report your findings
Write reports or make presentation slides from time to time
An imagined typical workflow
Do the analysis, create and save the plots
Copy the numbers and plots to the word / powerpoint document
Discover errors / necessary changes, go back and forth
Eventually, some results don’t add up
A script that contains text and code interweave
Upon a button click:
R goes through the code and generates results on the fly, then
creates the document / slides automatically, and
in a format of your choice: pdf / word / HTML / powerpoint
HTML highly useful
Hosting on website
Checking the accessibility box
Testament: I create these materials in this approach
In RStudio, below File in the top left corner, click the small black triangle, then click R Markdown
Enter the title, author, date, and choose the output format (which can always be changed later)
Click OK to open a new file
Edit between the two lines of triple dashes if necessary
Grey code chunks ringfenced by the two lines of triple backticks
Write your R code within
Create new one by clicking green square (near top right of the file) and then R
White space is for text that describes your analysis
Any blue text will be formatted nicely e.g. section headings, URLs, bold/italics
Leave the template text for reference
Remove the template text when you feel confident
When ready:
Click Knit (and save the script file for the first time) and wait;
The Render panel (bottom left) will show the progress;
Once finished, the file may be opened automatically; if not, in the Files panel (bottom right), click the file.
If you want to change the format:
Click the small black triangle next to Knit and choose the format
Same content, different formats at (almost) no cost
Complete list here: https://rmarkdown.rstudio.com/lesson-9.html
Caution: pdf formats (pdf_document
, beamer_presentation
) might encounter more issues
Towards the top of the file
The global options
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, # print the code or not?
message = FALSE, # print messages or not?
warning = FALSE, # print warnings or not?
include = FALSE) # print results or not?
```
Individual chunk options override the global ones
The second chunk (with #|
) looks neater & easier to read
```{r setup2, echo = FALSE, message = FALSE, warning = FALSE, include = FALSE}
table1 |> ggplot() + geom_histogram(aes(`net charges`))
```
```{r setup3}
#| echo: FALSE
#| message: FALSE
#| warning: FALSE
#| include: FALSE
table1 |> ggplot() + geom_histogram(aes(`net charges`))
```
Other options: fig.cap
for captions of plots
Full list here: https://yihui.org/knitr/options/
Whatever objects you have created or packages loaded in console is independent of the R Markdown file
The code in the Rmd file has to load all necessary packages & create objects within
Reproducibility
Can your colleagues replicate the results?
Can you generate the document if the data changes? Related to semi-automation
Some commands that should not be in an R Markdown file
install.packages(<package name>)
– library(<package name>)
suffices
setwd("C:~/home/clement/paths/that/only/exist/on/my/laptop/")
Refer to previous slides (Workflow part 1, Workflow part 2, Chunk options, Self-contained) if needed
Put the code you have written in Practical 2 in an R Markdown file
Knit to different formats and see the results
If in doubt, knit a new file without removing the template, before making changes
Introduction to R & basic plots
Data pipeline via the tidyverse (& pipe operator)
read the data into R,
create / manipulate columns,
make plots
Make documents / presentations using R Markdown
Code & text together in one script file
Document created upon a button click
Cons
Needs more effort to refine layout
Needs to set things up & install packages
Needs all code working before getting the finished product
Up-front cost of learning the coding before achieving efficiency
Pros
Focus on content, which is separate from format
Semi-automation
Reproducible
Easy to go on to version control, CI/CD etc. in your workflow
Aggregation of marginal gains (Kaizen)
Start from the templates, make sure they work, then slowly add your analysis
Make one plot or calculate one summary at a time
The manuals (https://www.tidyverse.org/packages/)
ChatGPT also useful to generate the initial code
Email: clement.lee@newcastle.ac.uk
University pages:
Questions?