Today’s outline

Introduction
- The case for coding
- Quick introduction to R
- Basic plots
Practical 1
Data pipeline via the tidyverse
Practical 2
Write documents & presentations
Practical 3
Q&A, summary

Today’s slides: https://bit.ly/insights-north-east
- I shall put this in the chat
Toolkits on https://insightsnortheast.co.uk/data/

About me

Lecturer in Statistics at Newcastle University since August 2022
Teach statistical analysis using R
Research on statistical models for (social) networks

Experience with outside academia

Case study: https://bit.ly/s4s-freegle

About you?

Write in the chat
Experience in working with data, and what are your usual tools?
Have you coded in R (or other programming languages) before?

Why coding in the first place?

Future-proof your analysis
- for others to replicate / reproduce
- including your future self
ChatGPT & others making it easier to start
Also the stepping stone for machine learning and AI
- but don’t start with something completely unfamiliar
- garbage in, garbage out
Version control, CI/CD, and beyond

Other kinds of output

“But I need to make presentations, reports, visualisations, etc.”
Most can be done nicely via R
We will see how this can be done

Why not Python?

Python great for big data pipelines
- Interfacing with the Internet
- Machine learning at scale
R is better in some other aspects
- Standalone analysis
- The whole data pipeline with tidyverse
- The notebook / literate programming approach
Understand which one is better / more suited for your tasks

R

Free & open source
Download from https://cran.r-project.org/
Anyone can contribute:
- https://cran.r-project.org/web/packages/available_packages_by_name.html
- Some packages about “business”, “politics”, etc.

RStudio

The best interface for using R
Download from https://posit.co/download/rstudio-desktop/
Let’s have a look

Data

library(readxl)
table1 <- read_xlsx("Business Rates properties - November 2024.xlsx")
table1

## # A tibble: 11,714 × 19
##     Year start               end   property code  street address                
##    <dbl> <dttm>              <lgl> <chr>    <chr> <chr>  <chr>                  
##  1  2024 2017-02-10 00:00:00 NA    N000080  CW    582    Brunswick Park Industr…
##  2  2024 2020-08-01 00:00:00 NA    N000082  IM3   582    Brunswick Park Industr…
##  3  2024 2014-11-28 00:00:00 NA    N000085  CW    582    Antique Pine Imports, …
##  4  2024 2016-06-10 00:00:00 NA    N000087  IF    582    Brunswick Park Industr…
##  5  2024 1995-04-01 00:00:00 NA    N000088  IF    582    9, Brunswick Park Indu…
##  6  2024 2015-08-05 00:00:00 NA    N000090  CW    582    10, Brunswick Park Ind…
##  7  2024 1995-04-01 00:00:00 NA    N000093  IF3   582    Brunswick Park Industr…
##  8  2024 1995-04-01 00:00:00 NA    N000094  IF3   583    13, Brunswick Village,…
##  9  2024 1995-04-01 00:00:00 NA    N000095  IF3   582    Brunswick Park Industr…
## 10  2024 2019-01-01 00:00:00 NA    N000096  CW    12979  Unit 15, Brunswick Par…
## # ℹ 11,704 more rows
## # ℹ 12 more variables: `NEW foi_liable party` <chr>, `NEW c/o address` <chr>,
## #   `rateable value` <dbl>, LIA <dbl>, TRL <dbl>, SBR <dbl>, Chr <dbl>,
## #   s44a <lgl>, EXM <dbl>, DRR1 <lgl>, DRR2 <dbl>, `net charges` <dbl>

Source: https://new.newcastle.gov.uk/budget-policies-performance-data/open-data/business-economy/business-rates-data/all-current-business-rates-properties
A table / spreadsheet in Excel (.xls, .xlsx, .csv, …) \(\approx\) a data frame / tibble in R

Some summary statistics

mean(table1$`rateable value`) # the average of rateable value

## [1] 28322.39

sd(table1$`LIA`) # standard deviation of variable LIA

## [1] 78594.26

summary(table1$`net charges`) # a range of summaries

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0     419   10380    3898 3598140

A histogram

hist(table1$`net charges`) # histogram of net charges

Better

hist(table1$`net charges`,
     breaks = 100, # approximately 100 bars
     main = NA, xlab = NA) # don't print the title and the x-axis label yet
title(main = "Histogram of net charges for November 2024",
      xlab = "Net charges (£)")

Dates / times

plot(table1$start, table1$`net charges`)

Better

plot(table1$start, table1$`net charges`, log = "y", # on log scale
     main = NA, xlab = NA, ylab = NA, # don't print title & axis labels yet
     cex = 0.1) # size of the points
title(main = "Net charges over the starting date of businesses",
      xlab = "Starting date", ylab = "Net charges (£)")

Preempting the questions

What if we have an xls or csv file instead?
read_xls() from the same package readxl
read.csv() that comes with R

Preempting the questions

What if we want to create new columns (by manipulating some existing columns) and plot them?
Tedious if we need to
- create them first,
- re-save the spreadsheet,
- and then rerun the analysis
It is usually the best not to modify the raw / original data

Preempting the questions

Plots still not perfect
Huge skew of net charges \(\rightarrow\) uninformative histogram; perhaps log scale help?
Faffy modifications; instead, we will use the package ggplot2

Preparing for the next section

We can build a data pipeline within R to
- read the data,
- manipulate them, and
- make very nice plots
Tidyverse (https://www.tidyverse.org/), arguably the biggest development in R ever
- ggplot2 is part of it

Questions so far?

Also, too fast or too slow?

2. Practical 1

Go to https://bit.ly/insights-practical1
Go through the practical in groups
Make summaries and plots using your own data or the business rates data
- Think about what you want to highlight
Note any challenges / difficulties, or anything you find useful
Report your findings

The pipe operator

Continue with the previous data frame table1
Let’s say we want to calculate the mean of the square root of the net charges

mean(sqrt(table1$`net charges`))

## [1] 48.93841

The way the functions are typed is not that the way we think about applying them
The code below is more linear

table1$`net charges` |> sqrt() |> mean()

## [1] 48.93841

The tidyverse

Collection of R packages designed for data science
All packages share an underlying design philosophy, grammar, and data structures
Installation: only need to do it once (theoretically)

install.packages("tidyverse")

The above might require a lot more packages to be installed
After all, it’s just a collection of packages

Alternatively

install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("readr")
install.packages("stringr")
install.packages("tibble")
install.packages("purrr")
install.packages("forcats")

Type above commands once and first
Type below commands every time you need them
- Only downside: can’t type library(tidyverse)

library(dplyr) # manipulates columns
library(ggplot2) # for plots
library(tidyr) # pivot tables & more
library(readr) # reads tables nicely
library(stringr) # works with text data
library(tibble) # nice tables
library(purrr) # for functional programming
library(forcats) # for "factors"

`dplyr`

Create / manipulate columns

table2 <- table1 |>
  mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
  select(date, ratio, `rateable value`:`net charges`, "address") # select / reorder columns
table2

## # A tibble: 11,714 × 13
##    date        ratio `rateable value`    LIA   TRL    SBR   Chr s44a    EXM
##    <date>      <dbl>            <dbl>  <dbl> <dbl>  <dbl> <dbl> <lgl> <dbl>
##  1 2017-02-10 0.250             45750 22829.    NA    NA     NA NA       NA
##  2 2020-08-01 1                 35500 17714.    NA    NA     NA NA       NA
##  3 2014-11-28 0.2               11250  5614.    NA    NA  -4491 NA       NA
##  4 2016-06-10 0.250             16500  8234.    NA    NA     NA NA       NA
##  5 1995-04-01 1                 21000 10479     NA    NA     NA NA       NA
##  6 2015-08-05 1                 52000 28392     NA    NA     NA NA       NA
##  7 1995-04-01 0.0417            12500  6238.    NA -5198.    NA NA       NA
##  8 1995-04-01 0.339             13250  6612.    NA -3857.    NA NA       NA
##  9 1995-04-01 0                 11000  5489     NA -5489     NA NA       NA
## 10 2019-01-01 1                 34250 17091.    NA    NA     NA NA       NA
## # ℹ 11,704 more rows
## # ℹ 4 more variables: DRR1 <lgl>, DRR2 <dbl>, `net charges` <dbl>,
## #   address <chr>

`dplyr`

Filter rows where DRR1 is not empty (NA in R)
Other uses (not shown):
- Join other tables
- Summarise columns

table2 |> filter(!is.na(DRR1))

## # A tibble: 3 × 13
##   date       ratio `rateable value`      LIA   TRL   SBR   Chr s44a    EXM DRR1 
##   <date>     <dbl>            <dbl>    <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <lgl>
## 1 1995-04-01 0.814           272500  148785     NA    NA    NA NA       NA TRUE 
## 2 2021-03-09 0.911            47500   23702.    NA    NA    NA NA       NA TRUE 
## 3 2005-08-21 1              3060000 1670760      0    NA    NA NA       NA FALSE
## # ℹ 3 more variables: DRR2 <dbl>, `net charges` <dbl>, address <chr>

`tidyr` (+ `dplyr`)

Pivot tables
- Combine columns between “rateable value” & “net charges”
- New columns “name” and “value”

table3 <- table2 |>
  pivot_longer(`rateable value`:`net charges`) |>
  select(date, ratio, name, value, everything()) |> # a dplyr function
  filter(!is.na(value)) # another dplyr function
table3

## # A tibble: 46,557 × 5
##    date       ratio name             value address                              
##    <date>     <dbl> <chr>            <dbl> <chr>                                
##  1 2017-02-10 0.250 rateable value  45750  Brunswick Park Industrial Estate, Br…
##  2 2017-02-10 0.250 LIA             22829. Brunswick Park Industrial Estate, Br…
##  3 2017-02-10 0.250 DRR2           -17122. Brunswick Park Industrial Estate, Br…
##  4 2017-02-10 0.250 net charges      5707. Brunswick Park Industrial Estate, Br…
##  5 2020-08-01 1     rateable value  35500  Brunswick Park Industrial Estate, Br…
##  6 2020-08-01 1     LIA             17714. Brunswick Park Industrial Estate, Br…
##  7 2020-08-01 1     net charges     17714. Brunswick Park Industrial Estate, Br…
##  8 2014-11-28 0.2   rateable value  11250  Antique Pine Imports, Brunswick Park…
##  9 2014-11-28 0.2   LIA              5614. Antique Pine Imports, Brunswick Park…
## 10 2014-11-28 0.2   Chr             -4491  Antique Pine Imports, Brunswick Park…
## # ℹ 46,547 more rows

`stringr` (+ `dplyr`)

Extract postcode
Useful if we want to combine with coordinates for plotting on a map

table3 |>
  mutate(postcode = str_extract(address, "NE\\d+ \\d\\w{2}")) |> # within a dplyr function
  select(-address) # postcode suffices for finding the coordinates

## # A tibble: 46,557 × 5
##    date       ratio name             value postcode
##    <date>     <dbl> <chr>            <dbl> <chr>   
##  1 2017-02-10 0.250 rateable value  45750  NE13 7BA
##  2 2017-02-10 0.250 LIA             22829. NE13 7BA
##  3 2017-02-10 0.250 DRR2           -17122. NE13 7BA
##  4 2017-02-10 0.250 net charges      5707. NE13 7BA
##  5 2020-08-01 1     rateable value  35500  NE13 7BA
##  6 2020-08-01 1     LIA             17714. NE13 7BA
##  7 2020-08-01 1     net charges     17714. NE13 7BA
##  8 2014-11-28 0.2   rateable value  11250  NE13 7BA
##  9 2014-11-28 0.2   LIA              5614. NE13 7BA
## 10 2014-11-28 0.2   Chr             -4491  NE13 7BA
## # ℹ 46,547 more rows

`ggplot2`

Histogram of new column ratio we have created
Always the same syntax, as we will see

table2 |> ggplot() +
  geom_histogram(aes(ratio))

`ggplot2`

Density of new column ratio we have created
Change the background colour and label size

table2 |> ggplot() +
  geom_density(aes(ratio)) +
  theme_bw(18)

Chaining every together

table1 |>
  mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
  select(date, ratio, `rateable value`:`net charges`, "address") |>
  ggplot() +
  geom_density(aes(ratio)) +
  theme_bw(18)

Chaining every together

table1 |>
  mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
  select(date, ratio, `rateable value`:`net charges`, "address") |>
  ggplot() +
  geom_point(aes(date, ratio)) +
  theme_bw(18)

What about the problematic net charges?

table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`))

Log scale on the \(y\)-axis

table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10()

Polishing the plot

table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10() +
  scale_x_continuous(labels = scales::comma) +
  theme_bw(18) +
  labs(title = "Histogram of net charges for November 2024", x = "Net charges (£)")

Taking it all the way

table1 <- read_xlsx("Business Rates properties - November 2024.xlsx")
table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10() +
  scale_x_continuous(labels = scales::comma) +
  theme_bw(18) +
  labs(title = "Histogram of net charges for November 2024", x = "Net charges (£)")

Semi-automation

month_year <- "November 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10() +
  scale_x_continuous(labels = scales::comma) +
  theme_bw(18) +
  labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")

Changes in content & title

month_year <- "August 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10() +
  scale_x_continuous(labels = scales::comma) +
  theme_bw(18) +
  labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")

As long as format of file name stays the same

month_year <- "May 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10() +
  scale_x_continuous(labels = scales::comma) +
  theme_bw(18) +
  labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")

4. Practical 2

Install the packages first if you haven’t already
Look at the examples in the slides, and try to do the same for your data
Look at the package manuals for inspiration: https://www.tidyverse.org/packages/
If you know what plots you want but not the code, ask
Note anything useful / challenging, and report your findings

5. Write documents & presentations

What good is it for me?

Write reports or make presentation slides from time to time
An imagined typical workflow
- Do the analysis, create and save the plots
- Copy the numbers and plots to the word / powerpoint document
- Discover errors / necessary changes, go back and forth
- Eventually, some results don’t add up

R Markdown to the rescue

A script that contains text and code interweave
Upon a button click:
- R goes through the code and generates results on the fly, then
- creates the document / slides automatically, and
- in a format of your choice: pdf / word / HTML / powerpoint
HTML highly useful
- Hosting on website
- Checking the accessibility box
Testament: I create these materials in this approach

Workflow part 1

In RStudio, below File in the top left corner, click the small black triangle, then click R Markdown
Enter the title, author, date, and choose the output format (which can always be changed later)
Click OK to open a new file
Edit between the two lines of triple dashes if necessary
Grey code chunks ringfenced by the two lines of triple backticks
- Write your R code within
- Create new one by clicking green square (near top right of the file) and then R
White space is for text that describes your analysis
- Any blue text will be formatted nicely e.g. section headings, URLs, bold/italics
- Leave the template text for reference

Workflow part 2

Remove the template text when you feel confident
When ready:
- Click Knit (and save the script file for the first time) and wait;
- The Render panel (bottom left) will show the progress;
- Once finished, the file may be opened automatically; if not, in the Files panel (bottom right), click the file.
If you want to change the format:
- Click the small black triangle next to Knit and choose the format
- Same content, different formats at (almost) no cost
- Complete list here: https://rmarkdown.rstudio.com/lesson-9.html
- Caution: pdf formats (pdf_document, beamer_presentation) might encounter more issues

Chunk options

Towards the top of the file
The global options

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, # print the code or not?
                      message = FALSE, # print messages or not?
                      warning = FALSE, # print warnings or not?
                      include = FALSE) # print results or not?
```

Individual chunk options override the global ones
The second chunk (with #|) looks neater & easier to read

```{r setup2, echo = FALSE, message = FALSE, warning = FALSE, include = FALSE}
table1 |> ggplot() + geom_histogram(aes(`net charges`))
```  

```{r setup3}
#| echo: FALSE
#| message: FALSE
#| warning: FALSE
#| include: FALSE
table1 |> ggplot() + geom_histogram(aes(`net charges`))
```

Other options: fig.cap for captions of plots
Full list here: https://yihui.org/knitr/options/

Self-contained

Whatever objects you have created or packages loaded in console is independent of the R Markdown file
The code in the Rmd file has to load all necessary packages & create objects within
Reproducibility
- Can your colleagues replicate the results?
- Can you generate the document if the data changes? Related to semi-automation
Some commands that should not be in an R Markdown file
- install.packages(<package name>) – library(<package name>) suffices
- setwd("C:~/home/clement/paths/that/only/exist/on/my/laptop/")

6. Practical 3

Refer to previous slides (Workflow part 1, Workflow part 2, Chunk options, Self-contained) if needed
Put the code you have written in Practical 2 in an R Markdown file
Knit to different formats and see the results
If in doubt, knit a new file without removing the template, before making changes

Taking stock

Introduction to R & basic plots
Data pipeline via the tidyverse (& pipe operator)
- read the data into R,
- create / manipulate columns,
- make plots
Make documents / presentations using R Markdown
- Code & text together in one script file
- Document created upon a button click

Is this way of making documents useful for me?

Cons

Needs more effort to refine layout
Needs to set things up & install packages
Needs all code working before getting the finished product
Up-front cost of learning the coding before achieving efficiency

Pros

Focus on content, which is separate from format
Semi-automation
Reproducible
Easy to go on to version control, CI/CD etc. in your workflow

“Learning all this is very costly”

Aggregation of marginal gains (Kaizen)
Start from the templates, make sure they work, then slowly add your analysis
Make one plot or calculate one summary at a time
The manuals (https://www.tidyverse.org/packages/)
ChatGPT also useful to generate the initial code

Good coding practices

Insights North East Workshop

Today’s outline

1. Introduction

About me

Experience with outside academia

About you?

Why coding in the first place?

Other kinds of output

Why not Python?

R

RStudio

Data

Some summary statistics

A histogram

Better

Dates / times

Better

Preempting the questions

Preempting the questions

Preempting the questions

Preparing for the next section

Questions so far?

2. Practical 1

3. Data pipeline

The pipe operator

The tidyverse

Alternatively

dplyr

dplyr

tidyr (+ dplyr)

stringr (+ dplyr)

ggplot2

ggplot2

Chaining every together

Chaining every together

What about the problematic net charges?

Log scale on the \(y\)-axis

Polishing the plot

Taking it all the way

Semi-automation

Changes in content & title

As long as format of file name stays the same

Questions?

4. Practical 2

5. Write documents & presentations

What good is it for me?

R Markdown to the rescue

Workflow part 1

Workflow part 2

Chunk options

Self-contained

Questions?

6. Practical 3

7. Summary

Taking stock

Is this way of making documents useful for me?

“Learning all this is very costly”

Thank you for listening

`dplyr`

`dplyr`

`tidyr` (+ `dplyr`)

`stringr` (+ `dplyr`)

`ggplot2`

`ggplot2`