The purpose of this practical is to allow you to:
Before we begin working with data, let’s make sure you are set up correctly in RStudio. Follow these steps at the start of each practical session:
On university computers, find RStudio in the Start menu or Applications folder. If you’re using your own computer, launch the RStudio application you installed.
When RStudio opens, you’ll see several panels:
The working directory is the folder R uses by default when reading or writing files. Set it to the folder where your script and data are located.
Option A: Using the menu
Option B: Using code
Add this line at the top of your script, replacing the path with your folder:
setwd("H:/MAS2908/practicals") # Windows example
setwd("~/Documents/MAS2908/practicals") # Mac/Linux example
Although Option B is common, I strongly recommend against it: hard-coded paths like those above may not exist on other computers, so the script is not reproducible.
To check your current working directory:
getwd()
Go to File > New File > R Script (or press Ctrl+Shift+N on Windows,
Cmd+Shift+N on Mac). This opens a blank script in the Source panel where
you can write and save your code.
Save your script immediately: File > Save (or Ctrl+S / Cmd+S).
Choose a sensible name like practical01.R and save it in your working
folder.
There are several ways to run code from your script:
Run the current line: Ctrl+Enter (Cmd+Enter on Mac)
Run the selected code: Ctrl+Enter
Run the entire script: Ctrl+Shift+Enter or click “Source”
R packages extend the functionality of base R. In this module, we will primarily use two packages:
dplyr: For data manipulation (filtering, sorting, summarising)
ggplot2: For creating visualisations
If you are using your own computer, you need to install these packages once:
install.packages("dplyr")
install.packages("ggplot2")
Note: Only run install.packages() in the console, not in your R script.
Installation only needs to happen once, and including it in a script can cause
unnecessary re-installation every time the script runs.
On university computers, these packages are already installed.
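If you are unsure whether the packages are available on the computer in front of you, one way to check without installing anything is sketched below. It relies on base R's requireNamespace(); the variable names pkgs and is_installed are purely illustrative.

```r
# Packages used in this module
pkgs <- c("dplyr", "ggplot2")

# requireNamespace() returns TRUE if a package can be loaded and FALSE otherwise;
# quietly = TRUE suppresses the usual start-up messages
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
is_installed
```

Any package reported as FALSE needs install.packages() run once in the console.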
At the start of each session, load the packages you need:
library(dplyr)
library(ggplot2)
A data set that has been loaded into R, and is ready for analysis, is normally stored in something called a data frame. A data frame can be thought of as a table of data where the columns are named vectors, with each vector containing a particular type of data (numeric, string, date, time). These columns are often called variables in statistics, with the rows of the data frame corresponding to individuals or single observations.
We will use the mpg dataset from the ggplot2 package to illustrate basic
operations. This dataset contains fuel economy data for 234 vehicles:
library(ggplot2) # mpg is part of ggplot2
First, let’s check that mpg really is a data frame:
class(mpg)
In R, the class of an object is very important. The same function, for example
summary(), behaves differently depending on the class of the object it is given.
R calls this “method dispatch”; it is similar to what other programming languages call “function overloading”.
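To see this class-dependent behaviour in action, the short sketch below calls summary() on two small made-up vectors (x_num and x_fct are illustrative names, not part of mpg).

```r
x_num <- c(2, 4, 6, 8)             # a numeric vector
x_fct <- factor(c("a", "b", "a"))  # a factor with levels "a" and "b"

class(x_num)
class(x_fct)

summary(x_num)  # numeric input: minimum, quartiles, mean and maximum
summary(x_fct)  # factor input: a count for each level
```

The function name is the same in both calls; only the class of the argument changes what is reported.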
We can use several functions to inspect the data frame:
head(mpg) # First 6 rows
dim(mpg) # Number of rows and columns
names(mpg) # Column names
str(mpg) # Structure summary
Another way to see the contents in a spreadsheet-type view is:
View(mpg)
Note: Don’t include View() commands in an R script. Only execute View()
and other “interactive” commands in the console.
To access individual variables (columns), we use the dollar syntax:
mpg$manufacturer # Returns the 'manufacturer' column as a vector
mpg$cty[1:5] # Returns the first 5 elements of 'cty'
We can also use square brackets to select individual elements:
mpg[1, 1] # First row, first column
mpg[1, ] # Entire first row
mpg[, 1] # Entire first column
mpg[1:3, ] # First 3 rows
Note: Absolute referencing like this is generally a bad idea. If the order of your columns changes, you may be referring to different columns than you thought. By using the dollar syntax, we use the name of the column rather than its location in the data frame, which makes the code easier to understand.
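The risk can be illustrated with a small hypothetical data frame (cars_a and cars_b are made up for this sketch): the same data with the columns in a different order gives different answers by position, but the same answer by name.

```r
# Two versions of the same (made-up) data; only the column order differs
cars_a <- data.frame(model = c("a4", "civic"), cty = c(18, 24))
cars_b <- cars_a[, c("cty", "model")]

cars_a[, 2]  # second column of cars_a: the cty values
cars_b[, 2]  # second column of cars_b: now the model names

cars_a$cty   # by name: the cty values
cars_b$cty   # by name: still the cty values
```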
Using the dollar syntax, we can create new variables:
mpg$efficiency_ratio <- mpg$hwy / mpg$cty
head(mpg$efficiency_ratio)
This adds a new column called efficiency_ratio to the data frame. However,
there are more elegant ways of creating new variables, as you will see in the
next section.
The dplyr package provides powerful tools for manipulating data frames.
Make sure you have loaded it:
library(dplyr)
The package gives us access to four key functions:
filter() - Focus on a subset of the rows of a data frame
arrange() - Reorder the rows in the data frame
select() - Select specific columns
mutate() - Add new columns that are functions of existing ones
The first argument to each of these functions is always the data frame we wish to work on. The subsequent arguments specify the variables we want to work with.
To focus on specific rows based on a condition, use filter(). For example,
to select only compact cars:
compact_cars <- filter(mpg, class == "compact")
head(compact_cars)
We can combine multiple conditions:
efficient_compact <- filter(mpg, class == "compact", cty > 20)
head(efficient_compact)
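Conditions separated by commas in filter() must all hold for a row to be kept. The sketch below, on a small hypothetical data frame called toy (not part of mpg), checks that this gives the same rows as joining the conditions with &.

```r
library(dplyr)

# Hypothetical toy data standing in for a few rows of mpg
toy <- data.frame(
  model = c("a", "b", "c", "d"),
  class = c("compact", "compact", "suv", "compact"),
  cty   = c(21, 18, 22, 25)
)

with_comma <- filter(toy, class == "compact", cty > 20)
with_and   <- filter(toy, class == "compact" & cty > 20)

with_comma                       # keeps models "a" and "d"
all.equal(with_comma, with_and)  # the two versions match
```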
To sort rows by a variable, use arrange():
# Sort by city MPG (ascending by default)
arrange(mpg, cty) |> head()
# Sort by city MPG (descending)
arrange(mpg, desc(cty)) |> head()
To choose specific columns, use select():
# Select only manufacturer, model, and city MPG
mpg_subset <- select(mpg, manufacturer, model, cty)
head(mpg_subset)
To create new variables based on existing ones, use mutate():
mpg_new <- mutate(mpg,
  avg_mpg = (cty + hwy) / 2,
  efficiency = hwy / displ
)
head(select(mpg_new, model, cty, hwy, avg_mpg, efficiency))
When using mutate(), you refer to columns by name directly without the
dollar syntax. This makes the code cleaner and easier to read.
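For comparison, here is the same new column created both ways on a small hypothetical data frame (toy is made up for this sketch). Both approaches give the same values; mutate() simply avoids repeating the data frame's name.

```r
library(dplyr)

# Hypothetical toy data with the same column names as mpg
toy <- data.frame(cty = c(18, 24), hwy = c(29, 36))

# Dollar syntax: the data frame name is repeated for every column
toy_dollar <- toy
toy_dollar$avg_mpg <- (toy_dollar$cty + toy_dollar$hwy) / 2

# mutate(): columns are referred to by name only
toy_mutate <- mutate(toy, avg_mpg = (cty + hwy) / 2)

toy_dollar$avg_mpg
toy_mutate$avg_mpg
```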
When performing multiple operations on data, we often need to chain several
functions together. R provides the pipe operator |> (introduced in
R 4.1) to make this easier to read.
The pipe takes the result of the expression on its left and passes it as the first argument to the function on its right:
# Without pipe: nested functions (read inside-out)
head(arrange(filter(mpg, class == "suv"), desc(cty)))
# With pipe: left to right (read naturally)
mpg |>
  filter(class == "suv") |>
  arrange(desc(cty)) |>
  head()
Both produce the same result, but the piped version is easier to read: “Take mpg, then filter for SUVs, then arrange by city MPG, then show the head.”
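The pipe is part of base R, so it works with any function, not only dplyr ones. A minimal sketch with a made-up numeric vector (requires R 4.1 or later):

```r
x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: x becomes the first argument of sqrt(),
# and that result becomes the first argument of sum()
piped <- x |> sqrt() |> sum()

nested
piped

# Extra arguments go after the piped-in first argument:
# this is the same as round(3.14159, 2)
3.14159 |> round(2)
```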
You can combine all four dplyr functions with the pipe:
mpg |>
  filter(year == 2008) |>
  mutate(efficiency = hwy / displ) |>
  arrange(desc(efficiency)) |>
  select(manufacturer, model, efficiency) |>
  head()
Note: You may also see %>% in older code or online examples. This is
the pipe operator from the magrittr package, which dplyr makes available when it is loaded. The
native pipe |> works the same way for most purposes and doesn’t require
loading any packages.
Using the mpg dataset:
mpg |>
  filter(year == 1999) |>
  arrange(desc(hwy)) |>
  select(manufacturer, model, hwy)
## # A tibble: 117 × 3
## manufacturer model hwy
## <chr> <chr> <int>
## 1 volkswagen jetta 44
## 2 volkswagen new beetle 44
## 3 volkswagen new beetle 41
## 4 toyota corolla 35
## 5 honda civic 33
## 6 toyota corolla 33
## 7 honda civic 32
## 8 honda civic 32
## 9 honda civic 32
## 10 toyota corolla 30
## # ℹ 107 more rows
Create a new variable called mpg_diff that is the difference between
highway and city MPG. Which car has the largest difference?
mpg |>
  mutate(mpg_diff = hwy - cty) |>
  arrange(desc(mpg_diff)) |>
  select(manufacturer, model, cty, hwy, mpg_diff) |>
  head()
## # A tibble: 6 × 5
## manufacturer model cty hwy mpg_diff
## <chr> <chr> <int> <int> <int>
## 1 honda civic 24 36 12
## 2 volkswagen new beetle 29 41 12
## 3 audi a4 18 29 11
## 4 audi a4 20 31 11
## 5 chevrolet malibu 18 29 11
## 6 honda civic 25 36 11
# The honda civic and the volkswagen new beetle tie for the largest difference (12 MPG)
Rewrite the following code using the pipe operator |>:
head(select(arrange(filter(mpg, cty > 20), desc(hwy)), manufacturer, model, hwy))
mpg |>
  filter(cty > 20) |>
  arrange(desc(hwy)) |>
  select(manufacturer, model, hwy) |>
  head()
## # A tibble: 6 × 3
## manufacturer model hwy
## <chr> <chr> <int>
## 1 volkswagen jetta 44
## 2 volkswagen new beetle 44
## 3 volkswagen new beetle 41
## 4 toyota corolla 37
## 5 honda civic 36
## 6 honda civic 36
While we have been working with datasets that already exist in R packages (like
mpg from ggplot2), in practice you will often need to work with data
downloaded from the Internet, or save your data to share with collaborators.
Data frames only exist in your R session. When you quit R, they disappear unless you save them. A common way to export a data frame is to write it as a CSV (Comma Separated Values) file:
write.csv(mpg, file = "mpg_data.csv", row.names = FALSE)
The row.names = FALSE argument prevents R from adding an extra column of
row numbers.
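The effect of row.names can be checked on a small hypothetical data frame (scores is made up for this sketch). Temporary files are used so that nothing is written to your working folder.

```r
# A small made-up data frame
scores <- data.frame(id = 1:3, grade = c("a", "b", "c"))

file_with    <- tempfile(fileext = ".csv")
file_without <- tempfile(fileext = ".csv")

write.csv(scores, file = file_with)                        # default: row names are written
write.csv(scores, file = file_without, row.names = FALSE)  # row names are suppressed

readLines(file_with)     # each line starts with an extra row-number field
readLines(file_without)  # only the id and grade fields

ncol(read.csv(file_with))     # 3: the row numbers come back as an extra column
ncol(read.csv(file_without))  # 2
```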
Other formats are also available:
| Name | Extension | Write function | Read function |
|---|---|---|---|
| CSV | .csv | write.csv() | read.csv() |
| Tab-delimited | .tab | write.table() | read.table() |
To import data, use the corresponding read.*() functions:
dat <- read.csv(file = "mpg_data.csv")
Before importing a dataset, it’s good practice to inspect the file using a
text editor to confirm its structure. The read.table() function is very
flexible and can handle many different formats by adjusting its arguments.
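As a sketch of how those arguments are used, the example below writes a small made-up data frame (results is hypothetical) as a tab-delimited file and reads it back. The sep and header arguments tell read.table() how the file is laid out.

```r
# A small made-up data frame
results <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

tab_file <- tempfile(fileext = ".tab")

# sep = "\t" writes tab-separated values; quote = FALSE leaves strings unquoted
write.table(results, file = tab_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

# header = TRUE treats the first line as column names
results_back <- read.table(tab_file, header = TRUE, sep = "\t")
results_back
```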
Use write.csv() to save the first 10 rows of mpg to a CSV file, then read the file back into R and inspect it.
write.csv(head(mpg, 10), file = "mpg_sample.csv", row.names = FALSE)
mpg_sample <- read.csv("mpg_sample.csv")
head(mpg_sample)
This section introduces data visualisation using base R functions. We will
continue using the mpg dataset. We focus on the cty variable (city miles
per gallon) for single-variable plots, and the class variable (vehicle type)
for group comparisons.
Histogram: Create a histogram of cty using hist().
Use col = "lightblue" to fill the bars.
Label the axes with xlab and ylab.
Change the number of bins with the breaks argument.
hist(mpg$cty, col = "lightblue",
     xlab = "City MPG", ylab = "Frequency",
     main = "Distribution of City Fuel Efficiency")
Density plot: Create a density plot of cty.
Estimate the density with density(mpg$cty).
Draw it with plot().
Mark the individual observations along the axis with rug().
plot(density(mpg$cty), main = "Density of City MPG",
     xlab = "City MPG", col = "darkblue", lwd = 2)
rug(mpg$cty, col = "red")
Bar chart: Create a bar chart of class (vehicle type).
Use barplot(table(mpg$class)).
barplot(table(mpg$class), xlab = "Vehicle Class", ylab = "Frequency",
        col = "steelblue", main = "Vehicle Classes in mpg Dataset")
# SUV is the most common vehicle class
Scatterplot: Explore the relationship between two continuous variables.
Plot cty (\(y\)-axis) vs hwy (\(x\)-axis) using plot().
Add a fitted regression line with abline(lm(cty ~ hwy, data = mpg)).
plot(mpg$hwy, mpg$cty, xlab = "Highway MPG", ylab = "City MPG",
     main = "City vs Highway Fuel Efficiency", pch = 19)
abline(lm(cty ~ hwy, data = mpg), col = "red", lwd = 2)
# Strong positive linear relationship: higher highway MPG goes with higher city MPG
Line plot: Create a line plot using the economics dataset.
Plot psavert (personal savings rate) over time using
plot(economics$date, economics$psavert, type = "l").
plot(economics$date, economics$psavert, type = "l",
     xlab = "Date", ylab = "Personal Savings Rate (%)",
     main = "US Personal Savings Rate Over Time", col = "darkblue")
# General declining trend from 1970s to 2005, then increase after 2008 crisis
Stripchart: Create a stripchart of cty by class.
Use stripchart(cty ~ class, data = mpg, method = "jitter").
Add vertical = TRUE to show values on the \(y\)-axis.
stripchart(cty ~ class, data = mpg, method = "jitter",
           vertical = TRUE, pch = 19,
           xlab = "Vehicle Class", ylab = "City MPG")
Boxplot: Compare the distribution of cty across vehicle classes.
Use boxplot(cty ~ class, data = mpg).
boxplot(cty ~ class, data = mpg,
        xlab = "Vehicle Class", ylab = "City MPG",
        main = "City Fuel Efficiency by Vehicle Class")
# Subcompact has the best median city fuel efficiency
# Compact and midsize show the most variability
This practical covered the following key concepts:
Data frames are the standard object for storing tabular data in R. Each column is a variable that can contain different types of data.
Inspecting data frames: Use head(), dim(), names(), and str()
to understand the structure of a data frame.
Accessing columns: Use the dollar syntax (df$column) to access
columns by name.
dplyr functions for data manipulation:
| Function | Action |
|---|---|
| filter() | Select rows based on conditions |
| arrange() | Sort rows by variable values |
| select() | Choose specific columns |
| mutate() | Create new columns from existing ones |
The pipe operator |> chains operations together, making code more
readable by allowing left-to-right reading instead of nested function calls.
Exporting and importing data: Use write.csv() and read.csv() to
save and load data frames as CSV files.
Base R graphics functions for visualisation:
| Function | Plot type |
|---|---|
| hist() | Histogram |
| plot() | Scatterplot, line plot, density plot |
| barplot() | Bar chart |
| boxplot() | Boxplot |
| stripchart() | Stripchart |
Things to avoid in your R scripts:
setwd() with absolute paths (not reproducible on other computers)
View() in scripts (only use interactively in the console)
Absolute referencing with [row, col] (use column names instead)
install.packages() in scripts (only run once in the console)
You might ask: it’s all very well creating these plots in R, but what if I want to put them in a report? Do I have to save them to the computer and then copy them into a Word document or similar? Is there an easier way? We will learn about this next week.
There is more to working with data in R than what we have covered here. The following extra sections on data types and logical comparisons may become useful as you progress through the module and encounter more complex data manipulation tasks.
To determine the data type of an object, use the class() function:
class(mpg$cty) # integer
class(mpg$manufacturer) # character
class(c("peanut", "seed"))
class(TRUE)
More generally, class() can be applied to any object:
class(mpg)
class(plot)
When debugging errors, class() can help you check whether objects are
what you expect them to be.
Decimal values are stored as numeric data in R, which is the default computational data type:
x <- c(10.5, 19.2, 1)
class(x)
Even if we assign integers to a variable, R will store it as numeric:
k <- c(1, 2, 10, 33)
class(k)
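If you do want whole numbers stored as integers, one option is the L suffix or as.integer(), as in the sketch below (k_int is an illustrative name). Integer storage is also what you get from the colon operator.

```r
k <- c(1, 2, 10, 33)
class(k)              # "numeric": whole numbers typed this way are stored as doubles

k_int <- c(1L, 2L, 10L, 33L)
class(k_int)          # "integer": the L suffix requests integer storage

class(as.integer(k))  # "integer": converting an existing numeric vector
class(1:4)            # "integer": sequences made with : are integers
```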
Categorical variables (such as social class, diagnosis, gender, species) are often stored as factors in R. A factor stores categorical data as numerical codes along with labels that make the codes meaningful:
pain <- c(0, 1, 3, 2, 2, 1, 1, 3)
pain_f <- factor(pain, levels = 0:3,
labels = c("none", "mild", "medium", "severe"))
pain_f
To work with the actual string labels:
pain_c <- as.character(pain_f)
pain_c
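The numerical codes behind a factor can be inspected with as.integer(). The sketch below rebuilds the pain example so that it runs on its own; note that the codes are positions in the list of levels (starting at 1), not the original 0 to 3 values.

```r
pain <- c(0, 1, 3, 2, 2, 1, 1, 3)
pain_f <- factor(pain, levels = 0:3,
                 labels = c("none", "mild", "medium", "severe"))

as.integer(pain_f)  # 1 2 4 3 3 2 2 4: position of each value's level
levels(pain_f)      # the labels that the codes refer to
table(pain_f)       # how many observations fall in each level
```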
You can modify factor levels after creation:
# What are the levels?
levels(pain_f)
# Change all levels
levels(pain_f) <- c("none", "uncomfortable", "unpleasant", "agonising")
levels(pain_f)
Dates and timestamps typically look like: 2014-10-09 01:45:00
This follows the format YYYY-MM-DD hh:mm:ss, but there are many variations.
By default, R reads dates as strings (versions of R before 4.0 also converted strings to factors by default when importing data). We will
cover proper date handling with lubridate later in the module.
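In the meantime, here is a minimal base R sketch of the difference between a timestamp stored as a string and one converted to a date-time object. The time zone "UTC" is an assumption made for this example, and this is not the lubridate approach used later in the module.

```r
stamp <- "2014-10-09 01:45:00"
class(stamp)                 # "character": just text as far as R is concerned

# Convert to a date-time object; tz = "UTC" is an assumed time zone
stamp_time <- as.POSIXct(stamp, tz = "UTC")
class(stamp_time)            # "POSIXct" "POSIXt"

# Once converted, individual components can be extracted
format(stamp_time, "%Y")     # "2014"
format(stamp_time, "%H:%M")  # "01:45"
```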
Check the class of various columns in the mpg dataset. Which are numeric?
Which are character? Which are factors?
sapply(mpg, class)
## manufacturer model displ year cyl trans
## "character" "character" "numeric" "integer" "integer" "character"
## drv cty hwy fl class
## "character" "integer" "integer" "character" "character"
# displ is numeric; year, cyl, cty and hwy are integer; all other columns are character; none are factors
Create a factor variable for vehicle class with custom labels.
class_f <- factor(mpg$class)
levels(class_f)
## [1] "2seater" "compact" "midsize" "minivan" "pickup"
## [6] "subcompact" "suv"
In practice, we often need to extract data that satisfy certain criteria, such as all data for a specific group or all observations above a threshold. This is when we use logical comparisons.
We can select parts of a vector using indexing:
mpg$manufacturer[1:5]
But if we want to filter based on a condition, we use logical comparisons
with filter():
filter(mpg, class == "suv") |> head()
The == means “equal to”. Here are other comparison operators:
| Symbol | Comparison |
|---|---|
| < | Less than |
| > | Greater than |
| <= | Less than or equal to |
| >= | Greater than or equal to |
| == | Equal to |
| != | Not equal to |
When you use a logical comparison like class == "suv", R returns a vector
of TRUE and FALSE values:
(mpg$class == "suv")[1:10]
This logical vector is then used to filter the data frame, returning only
rows where the condition is TRUE.
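The same idea can be seen with plain vectors. In the sketch below, veh_class and veh_cty are small made-up vectors standing in for the class and cty columns of mpg.

```r
# Made-up values standing in for two columns of mpg
veh_class <- c("suv", "compact", "suv", "pickup")
veh_cty   <- c(14, 21, 15, 13)

is_suv <- veh_class == "suv"
is_suv            # TRUE FALSE TRUE FALSE

veh_cty[is_suv]   # 14 15: only the values where is_suv is TRUE
sum(is_suv)       # 2: TRUE counts as 1, so this is the number of SUVs
```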
You can combine logical comparisons using Boolean operators:
| Symbol | Phrase |
|---|---|
| & | And (ampersand) |
| \| | Or (vertical bar) |
| ! | Not or negation (exclamation) |
The & operator returns TRUE only when both comparisons are true.
The | operator returns TRUE if at least one comparison is true:
# Cars with city MPG > 20 AND highway MPG > 30
filter(mpg, (cty > 20) & (hwy > 30)) |> head()
# Cars that are either SUVs OR have city MPG > 25
filter(mpg, (class == "suv") | (cty > 25)) |> head()
Using brackets helps visually separate the different conditions and makes the code easier to read.
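The three Boolean operators can be checked directly on logical vectors. The sketch below uses two made-up vectors, a and b, that between them cover every combination of TRUE and FALSE.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b   # TRUE FALSE FALSE FALSE: both must be TRUE
a | b   # TRUE TRUE TRUE FALSE: at least one must be TRUE
!a      # FALSE FALSE TRUE TRUE: negation flips each value
```

Inside filter(), the ! operator can be used in the same way, for example !(class == "suv") keeps every row that is not an SUV.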