Good coding practices

Insights North East Workshop

Clement Lee (Newcastle University)

2025-02-04 (Tue)

Today’s outline

  1. Introduction

    • The case for coding
    • Quick introduction to R
    • Basic plots
  2. Practical 1

  3. Data pipeline via the tidyverse

  4. Practical 2

  5. Write documents & presentations

  6. Practical 3

  7. Q&A, summary

 

1. Introduction

About me

  • Lecturer in Statistics at Newcastle University since August 2022

  • Teach statistical analysis using R

  • Research on statistical models for (social) networks

 

Experience with outside academia

 

About you?

Why coding in the first place?

Other kinds of output

Why not Python?

R

RStudio

Data

library(readxl)
table1 <- read_xlsx("Business Rates properties - November 2024.xlsx")
table1
## # A tibble: 11,714 × 19
##     Year start               end   property code  street address                
##    <dbl> <dttm>              <lgl> <chr>    <chr> <chr>  <chr>                  
##  1  2024 2017-02-10 00:00:00 NA    N000080  CW    582    Brunswick Park Industr…
##  2  2024 2020-08-01 00:00:00 NA    N000082  IM3   582    Brunswick Park Industr…
##  3  2024 2014-11-28 00:00:00 NA    N000085  CW    582    Antique Pine Imports, …
##  4  2024 2016-06-10 00:00:00 NA    N000087  IF    582    Brunswick Park Industr…
##  5  2024 1995-04-01 00:00:00 NA    N000088  IF    582    9, Brunswick Park Indu…
##  6  2024 2015-08-05 00:00:00 NA    N000090  CW    582    10, Brunswick Park Ind…
##  7  2024 1995-04-01 00:00:00 NA    N000093  IF3   582    Brunswick Park Industr…
##  8  2024 1995-04-01 00:00:00 NA    N000094  IF3   583    13, Brunswick Village,…
##  9  2024 1995-04-01 00:00:00 NA    N000095  IF3   582    Brunswick Park Industr…
## 10  2024 2019-01-01 00:00:00 NA    N000096  CW    12979  Unit 15, Brunswick Par…
## # ℹ 11,704 more rows
## # ℹ 12 more variables: `NEW foi_liable party` <chr>, `NEW c/o address` <chr>,
## #   `rateable value` <dbl>, LIA <dbl>, TRL <dbl>, SBR <dbl>, Chr <dbl>,
## #   s44a <lgl>, EXM <dbl>, DRR1 <lgl>, DRR2 <dbl>, `net charges` <dbl>

Some summary statistics

mean(table1$`rateable value`) # the average of rateable value
## [1] 28322.39
sd(table1$`LIA`) # standard deviation of variable LIA
## [1] 78594.26
summary(table1$`net charges`) # a range of summaries
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0     419   10380    3898 3598140

A histogram

hist(table1$`net charges`) # histogram of net charges

Better

hist(table1$`net charges`,
     breaks = 100, # approximately 100 bars
     main = NA, xlab = NA) # don't print the title and the x-axis label yet
title(main = "Histogram of net charges for November 2024",
      xlab = "Net charges (£)")

Dates / times

plot(table1$start, table1$`net charges`)

Better

plot(table1$start, table1$`net charges`, log = "y", # on log scale
     main = NA, xlab = NA, ylab = NA, # don't print title & axis labels yet
     cex = 0.1) # size of the points
title(main = "Net charges over the starting date of businesses",
      xlab = "Starting date", ylab = "Net charges (£)")

Preempting the questions

Preempting the questions

Preempting the questions

Preparing for the next section

Questions so far?

2. Practical 1

3. Data pipeline

The pipe operator

mean(sqrt(table1$`net charges`))
## [1] 48.93841
table1$`net charges` |> sqrt() |> mean()
## [1] 48.93841

The tidyverse

install.packages("tidyverse")

Alternatively

install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("readr")
install.packages("stringr")
install.packages("tibble")
install.packages("purrr")
install.packages("forcats")
library(dplyr) # manipulates columns
library(ggplot2) # for plots
library(tidyr) # pivot tables & more
library(readr) # reads tables nicely
library(stringr) # works with text data
library(tibble) # nice tables
library(purrr) # for functional programming
library(forcats) # for "factors"

dplyr

table2 <- table1 |>
  mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
  select(date, ratio, `rateable value`:`net charges`, "address") # select / reorder columns
table2
## # A tibble: 11,714 × 13
##    date        ratio `rateable value`    LIA   TRL    SBR   Chr s44a    EXM
##    <date>      <dbl>            <dbl>  <dbl> <dbl>  <dbl> <dbl> <lgl> <dbl>
##  1 2017-02-10 0.250             45750 22829.    NA    NA     NA NA       NA
##  2 2020-08-01 1                 35500 17714.    NA    NA     NA NA       NA
##  3 2014-11-28 0.2               11250  5614.    NA    NA  -4491 NA       NA
##  4 2016-06-10 0.250             16500  8234.    NA    NA     NA NA       NA
##  5 1995-04-01 1                 21000 10479     NA    NA     NA NA       NA
##  6 2015-08-05 1                 52000 28392     NA    NA     NA NA       NA
##  7 1995-04-01 0.0417            12500  6238.    NA -5198.    NA NA       NA
##  8 1995-04-01 0.339             13250  6612.    NA -3857.    NA NA       NA
##  9 1995-04-01 0                 11000  5489     NA -5489     NA NA       NA
## 10 2019-01-01 1                 34250 17091.    NA    NA     NA NA       NA
## # ℹ 11,704 more rows
## # ℹ 4 more variables: DRR1 <lgl>, DRR2 <dbl>, `net charges` <dbl>,
## #   address <chr>

dplyr

table2 |> filter(!is.na(DRR1))
## # A tibble: 3 × 13
##   date       ratio `rateable value`      LIA   TRL   SBR   Chr s44a    EXM DRR1 
##   <date>     <dbl>            <dbl>    <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <lgl>
## 1 1995-04-01 0.814           272500  148785     NA    NA    NA NA       NA TRUE 
## 2 2021-03-09 0.911            47500   23702.    NA    NA    NA NA       NA TRUE 
## 3 2005-08-21 1              3060000 1670760      0    NA    NA NA       NA FALSE
## # ℹ 3 more variables: DRR2 <dbl>, `net charges` <dbl>, address <chr>

tidyr (+ dplyr)

table3 <- table2 |>
  pivot_longer(`rateable value`:`net charges`) |>
  select(date, ratio, name, value, everything()) |> # a dplyr function
  filter(!is.na(value)) # another dplyr function
table3
## # A tibble: 46,557 × 5
##    date       ratio name             value address                              
##    <date>     <dbl> <chr>            <dbl> <chr>                                
##  1 2017-02-10 0.250 rateable value  45750  Brunswick Park Industrial Estate, Br…
##  2 2017-02-10 0.250 LIA             22829. Brunswick Park Industrial Estate, Br…
##  3 2017-02-10 0.250 DRR2           -17122. Brunswick Park Industrial Estate, Br…
##  4 2017-02-10 0.250 net charges      5707. Brunswick Park Industrial Estate, Br…
##  5 2020-08-01 1     rateable value  35500  Brunswick Park Industrial Estate, Br…
##  6 2020-08-01 1     LIA             17714. Brunswick Park Industrial Estate, Br…
##  7 2020-08-01 1     net charges     17714. Brunswick Park Industrial Estate, Br…
##  8 2014-11-28 0.2   rateable value  11250  Antique Pine Imports, Brunswick Park…
##  9 2014-11-28 0.2   LIA              5614. Antique Pine Imports, Brunswick Park…
## 10 2014-11-28 0.2   Chr             -4491  Antique Pine Imports, Brunswick Park…
## # ℹ 46,547 more rows

stringr (+ dplyr)

table3 |>
  mutate(postcode = str_extract(address, "NE\\d+ \\d\\w{2}")) |> # within a dplyr function
  select(-address) # postcode suffices for finding the coordinates
## # A tibble: 46,557 × 5
##    date       ratio name             value postcode
##    <date>     <dbl> <chr>            <dbl> <chr>   
##  1 2017-02-10 0.250 rateable value  45750  NE13 7BA
##  2 2017-02-10 0.250 LIA             22829. NE13 7BA
##  3 2017-02-10 0.250 DRR2           -17122. NE13 7BA
##  4 2017-02-10 0.250 net charges      5707. NE13 7BA
##  5 2020-08-01 1     rateable value  35500  NE13 7BA
##  6 2020-08-01 1     LIA             17714. NE13 7BA
##  7 2020-08-01 1     net charges     17714. NE13 7BA
##  8 2014-11-28 0.2   rateable value  11250  NE13 7BA
##  9 2014-11-28 0.2   LIA              5614. NE13 7BA
## 10 2014-11-28 0.2   Chr             -4491  NE13 7BA
## # ℹ 46,547 more rows

ggplot2

table2 |> ggplot() +
  geom_histogram(aes(ratio))

ggplot2

table2 |> ggplot() +
  geom_density(aes(ratio)) +
  theme_bw(18)

Chaining every together

table1 |>
  mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
  select(date, ratio, `rateable value`:`net charges`, "address") |>
  ggplot() +
  geom_density(aes(ratio)) +
  theme_bw(18)

Chaining every together

table1 |>
  mutate(date = as.Date(start), ratio = `net charges` / LIA) |>
  select(date, ratio, `rateable value`:`net charges`, "address") |>
  ggplot() +
  geom_point(aes(date, ratio)) +
  theme_bw(18)

What about the problematic net charges?

table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`))

Log scale on the \(y\)-axis

table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10()

Polishing the plot

table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10() +
  scale_x_continuous(labels = scales::comma) +
  theme_bw(18) +
  labs(title = "Histogram of net charges for November 2024", x = "Net charges (£)")

Taking it all the way

table1 <- read_xlsx("Business Rates properties - November 2024.xlsx")
table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10() +
  scale_x_continuous(labels = scales::comma) +
  theme_bw(18) +
  labs(title = "Histogram of net charges for November 2024", x = "Net charges (£)")

Semi-automation

month_year <- "November 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10() +
  scale_x_continuous(labels = scales::comma) +
  theme_bw(18) +
  labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")

Changes in content & title

month_year <- "August 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10() +
  scale_x_continuous(labels = scales::comma) +
  theme_bw(18) +
  labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")

As long as format of file name stays the same

month_year <- "May 2024"
table1 <- read_xlsx(paste0("Business Rates properties - ", month_year, ".xlsx"))
table1 |>
  ggplot() +
  geom_histogram(aes(`net charges`)) +
  scale_y_log10() +
  scale_x_continuous(labels = scales::comma) +
  theme_bw(18) +
  labs(title = paste0("Histogram of net charges for ", month_year), x = "Net charges (£)")

Questions?

4. Practical 2

5. Write documents & presentations

What good is it for me?

R Markdown to the rescue

Workflow part 1

Workflow part 2

Chunk options

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, # print the code or not?
                      message = FALSE, # print messages or not?
                      warning = FALSE, # print warnings or not?
                      include = FALSE) # print results or not?
```
```{r setup2, echo = FALSE, message = FALSE, warning = FALSE, include = FALSE}
table1 |> ggplot() + geom_histogram(aes(`net charges`))
```  

```{r setup3}
#| echo: FALSE
#| message: FALSE
#| warning: FALSE
#| include: FALSE
table1 |> ggplot() + geom_histogram(aes(`net charges`))
```  

Self-contained

Questions?

6. Practical 3

7. Summary

Taking stock

Is this way of making documents useful for me?

Cons

  • Needs more effort to refine layout

  • Needs to set things up & install packages

  • Needs all code working before getting the finished product

  • Up-front cost of learning the coding before achieving efficiency

 

Pros

  • Focus on content, which is separate from format

  • Semi-automation

  • Reproducible

  • Easy to go on to version control, CI/CD etc. in your workflow

“Learning all this is very costly”

Thank you for listening