Computational Reproducibility

UKRN Training Workshop

Clement Lee (Mathematics, Statistics & Physics)

2025-03-13 (Thu)

House rules

Outline

  1. Introduction, RStudio

  2. Reproducible documents with R Markdown

  3. Practical

  4. Advanced stuff

 

Why coding?

What tools do you use?

  • Statistical analysis: Python, MS Excel?

  • Writing presentations: MS PowerPoint?

  • Writing papers: MS Word, Overleaf?

 

We can do all of this with one unified approach (in R)

  • Statistical analysis: instead of Python, MS Excel

  • Writing presentations: instead of MS PowerPoint

  • Writing papers: instead of MS Word, Overleaf

 

  • There are some caveats

  • Showing you what’s possible

  • Particularly useful if you have a lot of numbers / tables / figures from your analysis

Scope

  • Can others run your code and get identical results?

  • Can reproducibility be achieved efficiently?

 

Not covered

  • Can others perform the same (non-computational) experiment themselves, and reach the same conclusion with broadly the same results?

  • Is my hypothesis valid? Are my methods / models useful?

  • Python

Why not Python?

R

RStudio

So far so good?

The case for computational reproducibility

  • Coding is only the first step

  • Reproducibility is not guaranteed across the whole process

 

The old-school way

What if your analysis or data changes?

Another scenario

  • Wanting to revert to a previous version of the analysis, but the scripts have already been updated

  • Save a new set of scripts every time

  • Eventually it becomes difficult to find the right version again

  • Autosave & track changes help, but still not perfect

 

Side track: Literate programming

Enter R Markdown

Installation

Chunk options

Originally

 

The output (.pdf / .html)

The figure above shows the distance against speed in the cars data set.
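
The script for this slide is not reproduced here. A minimal sketch of a chunk that would give such a figure, assuming base R's plot() and the built-in cars data set (the chunk header is shown as a comment, since the chunk delimiters themselves are not R code):

```r
# Chunk header in the .Rmd file: {r}
# No options are set, so both the code and the plot appear in the output.
plot(cars)  # cars has columns speed and dist; dist goes on the y-axis
```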

Hide the code

 

The output (.pdf / .html)

The figure above shows the distance against speed in the cars data set.
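
A sketch of how the code could be hidden, assuming the knitr chunk option echo is the one used on this slide. Only the plot would appear in the output:

```r
# Chunk header in the .Rmd file: {r, echo=FALSE}
# echo=FALSE keeps the source code out of the output; the code still runs.
plot(cars)
```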

Make the plot smaller

 

The output (.pdf / .html)

The figure above shows the distance against speed in the cars data set.
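
One possible way to shrink the figure, assuming the knitr option out.width; the value of 50% is an arbitrary choice for illustration, and the slide may have used a different option or size:

```r
# Chunk header in the .Rmd file: {r, echo=FALSE, out.width="50%"}
# out.width scales the figure as displayed in the output document.
# An alternative is to set the drawing size in inches, e.g. fig.width=4, fig.height=3.
plot(cars)
```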

Proper figure caption

 

The output (.pdf / .html)
Distance against speed in cars data set.
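
A sketch of a chunk that would attach this caption to the figure, assuming the knitr option fig.cap. The sentence describing the figure then no longer needs to be typed into the text by hand:

```r
# Chunk header in the .Rmd file:
#   {r, echo=FALSE, fig.cap="Distance against speed in cars data set."}
# fig.cap places the caption under the figure in the output.
plot(cars)
```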

Don’t evaluate the code

The script (.Rmd)

Also useful when showing code that doesn’t work

 

The output (.pdf / .html)
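
A sketch of a chunk that is shown but not run, assuming the knitr option eval. The two lines of code are placeholders, not the ones from the slide:

```r
# Chunk header in the .Rmd file: {r, eval=FALSE}
# knitr prints these lines in the output but does not run them,
# so no summary and no plot are produced.
summary(cars)
plot(cars)
```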

Hide the plots (but still evaluate)

 

The output (.pdf / .html)
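
A sketch of a chunk whose plots are suppressed, assuming the knitr option fig.show. The histogram and the name speed_hist are illustrative only:

```r
# Chunk header in the .Rmd file: {r, fig.show="hide"}
# The code runs and printed results appear, but the figure is left out.
speed_hist <- hist(cars$speed)  # the histogram object is still created
sum(speed_hist$counts)          # number of observations that were binned
```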

Hide the code & results (but still evaluate)

 

The output (.pdf / .html)
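
A sketch of a silent chunk, assuming the knitr option include. The model and the names fit and slope are hypothetical, chosen only to show that results remain available to later chunks and to the text:

```r
# Chunk header in the .Rmd file: {r, include=FALSE}
# Runs silently: neither the code nor its results appear in the output.
fit <- lm(dist ~ speed, data = cars)   # illustrative model, not from the slides
slope <- unname(coef(fit)["speed"])    # can be reported later in the document
```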

More chunk options

“But I use LaTeX …”

$$ \frac{\pi}{4} = 1-\frac{1}{3}+\frac{1}{5}-\frac{1}{7} + \cdots = \sum_{k=0}^{\infty}\frac{(-1)^k}{2k+1} $$

\[ \frac{\pi}{4} = 1-\frac{1}{3}+\frac{1}{5}-\frac{1}{7} + \cdots = \sum_{k=0}^{\infty}\frac{(-1)^k}{2k+1} \]
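
Because the maths and the code live in the same document, a formula like this can be checked numerically next to where it is typeset. A minimal sketch, where the number of terms (n_terms) is an arbitrary choice for illustration:

```r
# Partial sum of the series above, using k = 0, ..., n_terms - 1.
n_terms <- 1e6                       # assumed truncation point
k <- 0:(n_terms - 1)
partial_sum <- sum((-1)^k / (2 * k + 1))
approx_pi <- 4 * partial_sum

# For an alternating series the error is at most the first omitted term,
# so the error in approx_pi is below 4 / (2 * n_terms + 1).
abs(approx_pi - pi)
```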

“But I use other languages …”

 

The output (.pdf / .html)

## 6.2832

Other features (if there’s time)

Your turn – Practical

Goals

  • Think about which part of your analysis can be made reproducible

  • Convert an existing script, or create a new analysis

  • Generate the output in multiple formats

 

Troubleshooting

Advanced 1: Quarto

Advanced 2: Bookdown – R Markdown upgrade

A collection of .Rmd files

  • index.Rmd
  • 01-multivariate-data.Rmd
  • 02-pca.Rmd
  • 03-cluster.Rmd
  • 04-regression.Rmd
  • 05-regularisation.Rmd
  • 06-classification.Rmd
  • 07-matrix.Rmd
  • 08-factorisation.Rmd
  • 09-references.Rmd
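
By default, bookdown merges index.Rmd first and then the remaining .Rmd files in order of file name, which is why the chapters carry numeric prefixes. A minimal sketch of that ordering, using a hypothetical folder (my-book) in a temporary directory and only the first few of the files listed above:

```r
# Hypothetical project folder, created in a temporary directory for illustration.
book_dir <- file.path(tempdir(), "my-book")
dir.create(book_dir, showWarnings = FALSE)
chapters <- c("index.Rmd", "01-multivariate-data.Rmd",
              "02-pca.Rmd", "03-cluster.Rmd")
invisible(file.create(file.path(book_dir, chapters)))  # empty placeholder files

# Mimic the default merge order: index.Rmd first, then the rest by name.
rmd_files <- list.files(book_dir, pattern = "[.]Rmd$")
merge_order <- c("index.Rmd", sort(setdiff(rmd_files, "index.Rmd")))
print(merge_order)

# To build a real book (assumes bookdown is installed and the files have content),
# run this from inside the project folder:
# bookdown::render_book("index.Rmd")
```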

 

Advanced 2.5: Writing a whole thesis!

Some blog posts / guides

An ongoing project

 

Advanced 3: Writing journal articles!

# Run this once
install.packages("rticles")
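
Once the package is installed, a new article can be started from one of its journal templates. A sketch, assuming the jss template and a hypothetical file name (my-article.Rmd) in a temporary directory; the code is skipped if the packages are not available:

```r
# Assumes the rticles and rmarkdown packages are installed; skipped otherwise.
if (requireNamespace("rticles", quietly = TRUE) &&
    requireNamespace("rmarkdown", quietly = TRUE)) {
  article_path <- rmarkdown::draft(
    file = file.path(tempdir(), "my-article.Rmd"),  # hypothetical file name
    template = "jss",       # one of the journal templates shipped with rticles
    package = "rticles",
    create_dir = TRUE,      # put the .Rmd and its supporting files in a folder
    edit = FALSE            # do not open the file in an editor
  )
  print(article_path)
}
```

In RStudio the same templates should also be reachable from the new R Markdown file dialog, under the option to create a document from a template.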

Reproducibility check

  • What if a journal has reproducibility requirements?

    • Just give them the .Rmd file (and the data)!
  • No need to give multiple scripts with convoluted instructions

  • Related question: Can somebody reproduce your results on their computer?

    • Absolute paths vs relative paths
    • Package dependencies, environments
  • Scope for pre-submission checks within the university?
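
The two points about paths and package dependencies can be sketched as follows. The project folder (my-project) and file name (cars.csv) are hypothetical, and the folder is created in a temporary directory so that the example is self-contained:

```r
# Hypothetical project layout: my-project/data/cars.csv
project_dir <- file.path(tempdir(), "my-project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Work from the project folder. When knitting, the working directory is by
# default the folder that contains the .Rmd, so setwd() is not needed there.
old_wd <- setwd(project_dir)
write.csv(cars, file.path("data", "cars.csv"), row.names = FALSE)

# Relative path: works wherever the project folder is copied to.
dat <- read.csv(file.path("data", "cars.csv"))
# An absolute path such as "C:/Users/me/my-project/data/cars.csv" would only
# work on the author's own machine.
setwd(old_wd)

# Record the R version and loaded packages alongside the results.
info <- sessionInfo()
print(info$R.version$version.string)
```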

 

Summary 1 – scaffolding

  • Standalone outputs via R Markdown

    • R Markdown comes with RStudio
  • Thesis / lecture notes

  • Journal article templates

  • Python content within R Markdown

  • Standalone output via Quarto

 

Summary 2 – practical tips

  • First generate the output without altering the template

  • Can the output be generated and results updated if the data / analysis changes?

  • Can the output be generated on someone else’s computer?

    • Don’t include commands like install.packages()
    • Don’t use absolute paths
  • Don’t include interactive commands e.g. View()

  • PDF issues can usually be solved with the {tinytex} package
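
Instead of calling install.packages() inside the document, one option is to check for the required packages and stop with a clear message. A minimal sketch, where the package list (needed) is a placeholder to be replaced with your own:

```r
# Placeholder list; replace with the packages your document actually uses.
needed <- c("stats", "graphics")

# Check which packages are available without installing anything.
is_available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_available]

if (length(missing_pkgs) > 0) {
  stop("Please install first: ", paste(missing_pkgs, collapse = ", "))
}

# For PDF output, a LaTeX distribution can be installed once from the console
# (not from inside the document), assuming the tinytex package is installed:
# tinytex::install_tinytex()
```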

 

Summary 3 – bigger picture

Cons

  • Initial cost (time) to convert

  • Not as flexible as LaTeX

  • Collaboration trickier

 

Pros

Lastly

  • Coding is the first step of computational reproducibility

  • Computational reproducibility is the first step of:

    • Open research
    • Useful analysis / method / model
  • If you find it useful, spread the word & let us know!

 

https://xkcd.com/2054/