The purpose of this practical is to allow you to:

  1. (re-)familiarise yourself with using RStudio to carry out basic operations;
  2. work with and understand data frames;
  3. visualise data frames with base R graphics.

1 Getting started with RStudio

Before we begin working with data, let’s make sure you are set up correctly in RStudio. Follow these steps at the start of each practical session:

1.1 Step 1: Open RStudio

On university computers, find RStudio in the Start menu or Applications folder. If you’re using your own computer, launch the RStudio application you installed.

When RStudio opens, you’ll see several panels:

  • Console (bottom-left): Where R code is executed and output appears
  • Source (top-left): Where you write and edit scripts (may be hidden initially)
  • Environment (top-right): Shows variables and data you’ve created
  • Files/Plots/Help (bottom-right): File browser, plots, and documentation

1.2 Step 2: Set your working directory

The working directory is the folder R uses by default when reading or writing files. Set it to the folder where your script and data are located.

Option A: Using the menu

  1. Go to Session > Set Working Directory > Choose Directory…
  2. Navigate to your folder and click “Open”

Option B: Using code

Add this line at the top of your script, replacing the path with your folder:

setwd("H:/MAS2908/practicals")  # Windows example
setwd("~/Documents/MAS2908/practicals")  # Mac/Linux example

Some people recommend Option B, but I strongly advise against it: hard-coded paths make a script non-reproducible, because the paths shown above are unlikely to exist on other computers.

To check your current working directory:

getwd()
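
To see how the working directory affects file names, the following sketch builds the full path that R would use for a relative file name and checks whether that file exists. The file name "mpg_data.csv" is only a placeholder here; nothing is read or written.

```r
# Sketch only: "mpg_data.csv" is a placeholder file name
wd <- getwd()                               # current working directory
data_path <- file.path(wd, "mpg_data.csv")  # full path behind the relative name

data_path               # where R would look for "mpg_data.csv"
file.exists(data_path)  # TRUE only if the file is really in that folder
```

If file.exists() returns FALSE for a file you expected to find, the working directory is probably not the folder you think it is.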

1.3 Step 3: Create a new R script

Go to File > New File > R Script (or press Ctrl+Shift+N on Windows, Cmd+Shift+N on Mac). This opens a blank script in the Source panel where you can write and save your code.

Save your script immediately: File > Save (or Ctrl+S / Cmd+S). Choose a sensible name like practical01.R and save it in your working folder.

1.4 Step 4: Running code

There are several ways to run code from your script:

  • Run current line: Place cursor on a line and press Ctrl+Enter (Cmd+Enter on Mac)
  • Run selected code: Highlight code and press Ctrl+Enter
  • Run entire script: Press Ctrl+Shift+Enter or click “Source”

1.5 Step 5: Install and load packages

R packages extend the functionality of base R. In this module, we will primarily use two packages:

  • dplyr: For data manipulation (filtering, sorting, summarising)
  • ggplot2: For creating visualisations

If you are using your own computer, you need to install these packages once:

install.packages("dplyr")
install.packages("ggplot2")

Note: Only run install.packages() in the console, not in your R script. Installation only needs to happen once, and including it in a script can cause unnecessary re-installation every time the script runs.

On university computers, these packages are already installed.
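
If you are unsure whether the packages are available on your machine, a check along the following lines may help. This is a sketch that only reports what is missing; it does not install anything. The variable names needed and is_installed are made up for this example.

```r
# Packages used in this module
needed <- c("dplyr", "ggplot2")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
is_installed

# Names of any packages that still need install.packages() in the console
needed[!is_installed]
```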

At the start of each session, load the packages you need:

library(dplyr)
library(ggplot2)

2 Data frame basics

A data set that has been loaded into R, and is ready for analysis, is normally stored in something called a data frame. A data frame can be thought of as a table of data where the columns are named vectors, with each vector containing a particular type of data (numeric, string, date, time). These columns are often called variables in statistics, with the rows of the data frame corresponding to individuals or single observations.

We will use the mpg dataset from the ggplot2 package to illustrate basic operations. This dataset contains fuel economy data for 234 vehicles:

library(ggplot2)  # mpg is part of ggplot2

2.1 Inspecting a data frame

First, let’s check that mpg really is a data frame:

class(mpg)

In R, the class of an object is very important. The same function, for example summary(), behaves differently depending on the class of the object it is given. R achieves this through generic functions and method dispatch, which is similar to what other programming languages call “function overloading”.
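
The class-dependent behaviour of summary() can be seen with two small vectors. The values below are made up for illustration and are not taken from mpg.

```r
# Toy data (not from mpg): one numeric vector, one factor
measurements <- c(2.5, 3.1, 4.8, 10.2)
groups <- factor(c("a", "b", "a", "c"))

class(measurements)  # "numeric"
class(groups)        # "factor"

# Same function, different behaviour depending on the class
num_summary <- summary(measurements)  # minimum, quartiles, mean, maximum
fac_summary <- summary(groups)        # number of observations per level

num_summary
fac_summary
```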

We can use several functions to inspect the data frame:

head(mpg)       # First 6 rows
dim(mpg)        # Number of rows and columns
names(mpg)      # Column names
str(mpg)        # Structure summary

Another way to see the contents in a spreadsheet-type view is:

View(mpg)

Note: Don’t include View() commands in an R script. Only execute View() and other “interactive” commands in the console.

2.2 Accessing columns

To access individual variables (columns), we use the dollar syntax:

mpg$manufacturer    # Returns the 'manufacturer' column as a vector
mpg$cty[1:5]        # Returns the first 5 elements of 'cty'

We can also use square brackets to select individual elements:

mpg[1, 1]      # First row, first column
mpg[1, ]       # Entire first row
mpg[, 1]       # Entire first column
mpg[1:3, ]     # First 3 rows

Note: Absolute referencing like this is generally a bad idea. If the order of your columns changes, you may be referring to different columns than you thought. By using the dollar syntax, we use the name of the column rather than its location in the data frame, which makes the code easier to understand.
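
A small sketch of why position-based indexing is fragile. It uses a made-up base R data frame called cars rather than mpg, so the values are purely illustrative.

```r
# Toy data frame (values invented for this example)
cars <- data.frame(model = c("a4", "civic"),
                   cty = c(18L, 24L),
                   hwy = c(29L, 36L),
                   stringsAsFactors = FALSE)

cars[, 2]  # second column: currently cty
cars$cty   # cty, selected by name

# Reorder the columns, as might happen after re-exporting the data
reordered <- cars[, c("hwy", "model", "cty")]

reordered[, 2]  # second column is now model, not cty
reordered$cty   # still cty
```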

2.3 Creating new variables

Using the dollar syntax, we can create new variables:

mpg$efficiency_ratio <- mpg$hwy / mpg$cty
head(mpg$efficiency_ratio)

This adds a new column called efficiency_ratio to the data frame. However, there are more elegant ways of creating new variables, as you will see in the next section.

3 Data manipulation with dplyr

The dplyr package provides powerful tools for manipulating data frames. Make sure you have loaded it:

library(dplyr)

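The package gives us access to four key functions:

  • filter(): Keep rows that satisfy a condition
  • arrange(): Sort rows by one or more variables
  • select(): Choose columns by name
  • mutate(): Create new variables from existing ones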

The first argument to each of these functions is always the data frame we wish to work on. The subsequent arguments specify the variables we want to work with.

3.1 filter()

To focus on specific rows based on a condition, use filter(). For example, to select only compact cars:

compact_cars <- filter(mpg, class == "compact")
head(compact_cars)

We can combine multiple conditions:

efficient_compact <- filter(mpg, class == "compact", cty > 20)
head(efficient_compact)

3.2 arrange()

To sort rows by a variable, use arrange():

# Sort by city MPG (ascending by default)
arrange(mpg, cty) |> head()

# Sort by city MPG (descending)
arrange(mpg, desc(cty)) |> head()

3.3 select()

To choose specific columns, use select():

# Select only manufacturer, model, and city MPG
mpg_subset <- select(mpg, manufacturer, model, cty)
head(mpg_subset)

3.4 mutate()

To create new variables based on existing ones, use mutate():

mpg_new <- mutate(mpg,
    avg_mpg = (cty + hwy) / 2,
    efficiency = hwy / displ
)
head(select(mpg_new, model, cty, hwy, avg_mpg, efficiency))

When using mutate(), you refer to columns by name directly without the dollar syntax. This makes the code cleaner and easier to read.

3.5 The pipe operator

When performing multiple operations on data, we often need to chain several functions together. R provides the pipe operator |> (introduced in R 4.1) to make this easier to read.

The pipe takes the result of the expression on its left and passes it as the first argument to the function on its right:

# Without pipe: nested functions (read inside-out)
head(arrange(filter(mpg, class == "suv"), desc(cty)))

# With pipe: left to right (read naturally)
mpg |>
  filter(class == "suv") |>
  arrange(desc(cty)) |>
  head()

Both produce the same result, but the piped version is easier to read: “Take mpg, then filter for SUVs, then arrange by city MPG, then show the head.”
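
The same idea works with any function, not only dplyr functions. The following minimal sketch uses base R only (it needs R 4.1 or later for |>) and made-up numbers to show that the piped and nested forms are equivalent.

```r
values <- c(1, 4, 9)

# Nested call: read inside-out
nested <- sum(sqrt(values))

# Piped call: values becomes the first argument of sqrt(),
# and that result becomes the first argument of sum()
piped <- values |> sqrt() |> sum()

identical(nested, piped)  # TRUE

# Further arguments go inside the brackets as usual
rounded <- 3.14159 |> round(digits = 2)  # same as round(3.14159, digits = 2)
rounded
```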

You can combine all four dplyr functions with the pipe:

mpg |>
  filter(year == 2008) |>
  mutate(efficiency = hwy / displ) |>
  arrange(desc(efficiency)) |>
  select(manufacturer, model, efficiency) |>
  head()

Note: You may also see %>% in older code or online examples. This is the pipe operator from the magrittr package (loaded with dplyr). The native pipe |> works the same way for most purposes and doesn’t require loading any packages.

3.6 Exercises: Data manipulation

  1. Using the mpg dataset:

    1. Filter to show only cars from the year 1999.
    2. Arrange the result by highway MPG in descending order.
    3. Select only the manufacturer, model, and hwy columns.
    mpg |>
      filter(year == 1999) |>
      arrange(desc(hwy)) |>
      select(manufacturer, model, hwy)
    ## # A tibble: 117 × 3
    ##    manufacturer model        hwy
    ##    <chr>        <chr>      <int>
    ##  1 volkswagen   jetta         44
    ##  2 volkswagen   new beetle    44
    ##  3 volkswagen   new beetle    41
    ##  4 toyota       corolla       35
    ##  5 honda        civic         33
    ##  6 toyota       corolla       33
    ##  7 honda        civic         32
    ##  8 honda        civic         32
    ##  9 honda        civic         32
    ## 10 toyota       corolla       30
    ## # ℹ 107 more rows
  2. Create a new variable called mpg_diff that is the difference between highway and city MPG. Which car has the largest difference?

    mpg |>
      mutate(mpg_diff = hwy - cty) |>
      arrange(desc(mpg_diff)) |>
      select(manufacturer, model, cty, hwy, mpg_diff) |>
      head()
    ## # A tibble: 6 × 5
    ##   manufacturer model        cty   hwy mpg_diff
    ##   <chr>        <chr>      <int> <int>    <int>
    ## 1 honda        civic         24    36       12
    ## 2 volkswagen   new beetle    29    41       12
    ## 3 audi         a4            18    29       11
    ## 4 audi         a4            20    31       11
    ## 5 chevrolet    malibu        18    29       11
    ## 6 honda        civic         25    36       11
    # The honda civic and the volkswagen new beetle tie for the largest difference (12)
  3. Rewrite the following code using the pipe operator |>:

    head(select(arrange(filter(mpg, cty > 20), desc(hwy)), manufacturer, model, hwy))
    mpg |>
      filter(cty > 20) |>
      arrange(desc(hwy)) |>
      select(manufacturer, model, hwy) |>
      head()
    ## # A tibble: 6 × 3
    ##   manufacturer model        hwy
    ##   <chr>        <chr>      <int>
    ## 1 volkswagen   jetta         44
    ## 2 volkswagen   new beetle    44
    ## 3 volkswagen   new beetle    41
    ## 4 toyota       corolla       37
    ## 5 honda        civic         36
    ## 6 honda        civic         36

4 Exporting & importing data

While we have been working with datasets that already exist in R packages (like mpg from ggplot2), in practice you will often need to work with data downloaded from the Internet, or save your data to share with collaborators.

4.1 Exporting data

Data frames only exist in your R session. When you quit R, they disappear unless you save them. A common way to export a data frame is to write it as a CSV (Comma Separated Values) file:

write.csv(mpg, file = "mpg_data.csv", row.names = FALSE)

The row.names = FALSE argument prevents R from adding an extra column of row numbers.
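
To see what the extra column looks like, the sketch below writes a tiny made-up data frame twice, once with and once without row names, and reads both files back. It uses temporary files created with tempfile() so that nothing is left in your working folder.

```r
# Toy data frame (values invented for this example)
scores <- data.frame(name = c("ann", "bob"),
                     score = c(71, 64),
                     stringsAsFactors = FALSE)

with_rn <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(scores, file = with_rn)                        # default: row names written
write.csv(scores, file = without_rn, row.names = FALSE)  # row names suppressed

names(read.csv(with_rn))     # "X" "name" "score": X holds the row numbers
names(read.csv(without_rn))  # "name" "score"
```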

Other formats are also available:

General formats and functions for importing and exporting data.

Name            Extension   Write function   Read function
CSV             .csv        write.csv()      read.csv()
Tab-delimited   .tab        write.table()    read.table()

4.2 Importing data

To import data, use the corresponding read.*() functions:

dat <- read.csv(file = "mpg_data.csv")

Before importing a dataset, it’s good practice to inspect the file using a text editor to confirm its structure. The read.table() function is very flexible and can handle many different formats by adjusting its arguments.
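
As a sketch of both points, the code below creates a small tab-delimited file, peeks at its first lines from within R (an alternative to opening it in a text editor), and then reads it with read.table(). The file contents are made up for this example.

```r
# Create a small tab-delimited file to practise on
tab_file <- tempfile(fileext = ".tab")
writeLines(c("model\tcty\thwy",
             "a4\t18\t29",
             "civic\t24\t36"),
           tab_file)

# Inspect the raw text: is there a header? what separates the columns?
readLines(tab_file, n = 2)

# Tell read.table() what we found: a header row and tab separators
dat <- read.table(tab_file, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)
str(dat)
```

Here header = TRUE and sep = "\t" are the arguments that adapt read.table() to this particular layout; other files may need different values.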

4.3 Exercises: Exporting & importing data

  1. Make sure you are familiar with writing and reading CSV files:
    1. Look up the help page for write.csv().
    2. Export the first 10 rows of mpg to a CSV file.
    3. Read it back into R and verify the contents.
    write.csv(head(mpg, 10), file = "mpg_sample.csv", row.names = FALSE)
    mpg_sample <- read.csv("mpg_sample.csv")
    head(mpg_sample)

5 Visualisation with base R graphics

This section introduces data visualisation using base R functions. We will continue using the mpg dataset. We focus on the cty variable (city miles per gallon) for single-variable plots, and the class variable (vehicle type) for group comparisons.

5.1 Exercises: Base R visualisation

  1. Histogram: Create a histogram of cty using hist().

    1. Use the argument col = "lightblue" to fill the bars.
    2. Add appropriate axis labels using xlab and ylab.
    3. Experiment with different numbers of bins using the breaks argument.
    hist(mpg$cty, col = "lightblue",
         xlab = "City MPG", ylab = "Frequency",
         main = "Distribution of City Fuel Efficiency")

  2. Density plot: Create a density plot of cty.

    1. First compute the density using density(mpg$cty).
    2. Then plot it using plot().
    3. Add a rug plot underneath using rug().
    plot(density(mpg$cty), main = "Density of City MPG",
         xlab = "City MPG", col = "darkblue", lwd = 2)
    rug(mpg$cty, col = "red")

  3. Bar chart: Create a bar chart of class (vehicle type).

    1. Use barplot(table(mpg$class)).
    2. Add appropriate axis labels.
    3. Which vehicle class is most common in the dataset?
    barplot(table(mpg$class), xlab = "Vehicle Class", ylab = "Frequency",
            col = "steelblue", main = "Vehicle Classes in mpg Dataset")

    # SUV is the most common vehicle class
  4. Scatterplot: Explore the relationship between two continuous variables.

    1. Create a scatterplot of cty (\(y\)-axis) vs hwy (\(x\)-axis) using plot().
    2. Add a regression line using abline(lm(cty ~ hwy, data = mpg)).
    3. Describe the relationship you observe.
    plot(mpg$hwy, mpg$cty, xlab = "Highway MPG", ylab = "City MPG",
         main = "City vs Highway Fuel Efficiency", pch = 19)
    abline(lm(cty ~ hwy, data = mpg), col = "red", lwd = 2)

    # Strong positive linear relationship: higher highway MPG = higher city MPG
  5. Line plot: Create a line plot using the economics dataset (also part of ggplot2).

    1. Plot psavert (personal savings rate) over time using plot(economics$date, economics$psavert, type = "l").
    2. Add appropriate axis labels.
    3. Describe the trend you observe.
    plot(economics$date, economics$psavert, type = "l",
         xlab = "Date", ylab = "Personal Savings Rate (%)",
         main = "US Personal Savings Rate Over Time", col = "darkblue")

    # General declining trend from 1970s to 2005, then increase after 2008 crisis
  6. Stripchart: Create a stripchart of cty by class.

    1. Use stripchart(cty ~ class, data = mpg, method = "jitter").
    2. Set vertical = TRUE to show values on the \(y\)-axis.
    3. Add appropriate axis labels.
    stripchart(cty ~ class, data = mpg, method = "jitter",
               vertical = TRUE, pch = 19,
               xlab = "Vehicle Class", ylab = "City MPG")

  7. Boxplot: Compare the distribution of cty across vehicle classes.

    1. Create a boxplot using boxplot(cty ~ class, data = mpg).
    2. Add appropriate axis labels.
    3. Which vehicle class has the best city fuel efficiency? Which has the most variability?
    boxplot(cty ~ class, data = mpg,
            xlab = "Vehicle Class", ylab = "City MPG",
            main = "City Fuel Efficiency by Vehicle Class")

    # Subcompact has the best median city fuel efficiency
    # Compact and midsize show the most variability

6 Summary

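This practical covered the following key concepts:

  • setting up RStudio: working directory, scripts, running code, and packages;
  • inspecting data frames and accessing their columns;
  • manipulating data with dplyr: filter(), arrange(), select(), mutate(), and the pipe |>;
  • exporting and importing data with write.csv() and read.csv();
  • visualising data with base R graphics.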

You might ask: it is all very well creating these plots in R, but what if I want to include them in a report? Do I have to save them to disk and then copy them into a Word document or similar, or is there an easier way? We will learn about this next week.

There is more to working with data in R than what we have covered here. The following extra sections on data types and logical comparisons may become useful as you progress through the module and encounter more complex data manipulation tasks.

7 Data types

To determine the data type of an object, use the class() function:

class(mpg$cty)           # integer
class(mpg$manufacturer)  # character
class(c("peanut", "seed"))
class(TRUE)

More generally, class() can be applied to any object:

class(mpg)
class(plot)

When debugging errors, class() can help you check whether objects are what you expect them to be.

7.1 Numeric

Decimal values are stored as numeric data in R, which is the default computational data type:

x <- c(10.5, 19.2, 1)
class(x)

Even if we assign whole numbers to a variable, R stores them as numeric unless told otherwise:

k <- c(1, 2, 10, 33)
class(k)
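
If you do want true integers, one option is to add the suffix L to each number. The sketch below compares the two; the variable names are made up for this example.

```r
k_double <- c(1, 2, 10, 33)       # stored as numeric (double precision)
k_integer <- c(1L, 2L, 10L, 33L)  # the L suffix requests integers

class(k_double)   # "numeric"
class(k_integer)  # "integer"

is.numeric(k_integer)   # TRUE: integers still count as numbers
k_double == k_integer   # TRUE TRUE TRUE TRUE: the values are equal
```

This is why a column such as mpg$cty can report the class integer yet still be used in arithmetic like any other number.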

7.2 Factors

Categorical variables (such as social class, diagnosis, gender, species) are often stored as factors in R. A factor stores categorical data as numerical codes along with labels that make the codes meaningful:

pain <- c(0, 1, 3, 2, 2, 1, 1, 3)
pain_f <- factor(pain, levels = 0:3,
                 labels = c("none", "mild", "medium", "severe"))
pain_f

To work with the actual string labels:

pain_c <- as.character(pain_f)
pain_c
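
One thing to watch for: the numerical codes stored inside a factor are the positions of the levels (1, 2, 3, ...), not the original values. The sketch below repeats the pain example to show the difference.

```r
pain <- c(0, 1, 3, 2, 2, 1, 1, 3)
pain_f <- factor(pain, levels = 0:3,
                 labels = c("none", "mild", "medium", "severe"))

as.integer(pain_f)    # 1 2 4 3 3 2 2 4: internal codes, not the original 0 to 3
as.character(pain_f)  # the labels as ordinary strings
table(pain_f)         # number of observations per level
```

So as.integer() on a factor does not recover the original data; here every code is one larger than the value it came from.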

You can modify factor levels after creation:

# What are the levels?
levels(pain_f)

# Change all levels
levels(pain_f) <- c("none", "uncomfortable", "unpleasant", "agonising")
levels(pain_f)

7.3 Date and time

Dates and timestamps typically look like: 2014-10-09 01:45:00

This follows the format YYYY-MM-DD hh:mm:ss, but there are many variations. By default, R reads dates as strings (versions of R before 4.0 also converted them to factors when importing data). We will cover proper date handling with lubridate later in the module.
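
Until then, base R can already convert such a string into a proper date-time object. The following is a minimal sketch using as.POSIXct() and as.Date(); the time zone "UTC" is an assumption made for this example.

```r
stamp_text <- "2014-10-09 01:45:00"
class(stamp_text)  # "character": just a string so far

# Convert to a date-time, stating the format and (assumed) time zone
stamp <- as.POSIXct(stamp_text, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(stamp)       # "POSIXct" "POSIXt"

# Keep only the date part
day <- as.Date(stamp, tz = "UTC")
day

# Date-times support arithmetic in seconds
stamp + 3600       # one hour later
```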

7.4 Exercises: Data types

  1. Check the class of various columns in the mpg dataset. Which are numeric? Which are character? Which are factors?

    sapply(mpg, class)
    ## manufacturer        model        displ         year          cyl        trans 
    ##  "character"  "character"    "numeric"    "integer"    "integer"  "character" 
    ##          drv          cty          hwy           fl        class 
    ##  "character"    "integer"    "integer"  "character"  "character"
    # displ is numeric; year, cyl, cty and hwy are integer; all other columns are character. None are factors.
  2. Create a factor variable for vehicle class with custom labels.

    class_f <- factor(mpg$class)
    levels(class_f)
    ## [1] "2seater"    "compact"    "midsize"    "minivan"    "pickup"    
    ## [6] "subcompact" "suv"
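
The solution above creates the factor with its default labels. To supply custom labels, pass matching levels and labels arguments to factor(). The sketch below uses a short made-up vector as a stand-in for mpg$class, and the label text is only a suggestion.

```r
# Small stand-in for mpg$class (only three of the seven classes)
vehicle_class <- c("suv", "compact", "suv", "2seater")

class_f <- factor(vehicle_class,
                  levels = c("2seater", "compact", "suv"),
                  labels = c("Two-seater", "Compact", "SUV"))

levels(class_f)  # the custom labels, in the order given
table(class_f)   # counts per relabelled class
```

For the full mpg data you would list all seven levels and give one label for each, in the same order.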

8 Logical comparisons and Boolean operations

In practice, we often need to extract data that satisfy certain criteria, such as all data for a specific group or all observations above a threshold. This is when we use logical comparisons.

8.1 Logical comparisons

We can select parts of a vector using indexing:

mpg$manufacturer[1:5]

But if we want to filter based on a condition, we use logical comparisons with filter():

filter(mpg, class == "suv") |> head()

The == means “equal to”. Here are other comparison operators:

Logical comparison operators in R.
Symbol   Comparison
<        Less than
>        Greater than
<=       Less than or equal to
>=       Greater than or equal to
==       Equal to
!=       Not equal to

When you use a logical comparison like class == "suv", R returns a vector of TRUE and FALSE values:

(mpg$class == "suv")[1:10]

This logical vector is then used to filter the data frame, returning only rows where the condition is TRUE.
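
The same mechanism can be seen without dplyr. The sketch below uses a made-up base R data frame and square brackets to keep only the rows where the logical vector is TRUE; the names cars and is_suv are invented for this example.

```r
# Toy data frame (values invented for this example)
cars <- data.frame(model = c("a4", "explorer", "civic", "durango"),
                   class = c("compact", "suv", "subcompact", "suv"),
                   cty = c(18L, 14L, 24L, 13L),
                   stringsAsFactors = FALSE)

is_suv <- cars$class == "suv"
is_suv        # FALSE TRUE FALSE TRUE
sum(is_suv)   # 2: TRUE counts as 1, so this is the number of SUVs

cars[is_suv, ]  # only the rows where is_suv is TRUE
```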

8.2 Boolean operations

You can combine logical comparisons using Boolean operators:

Boolean operators in R.
Symbol   Phrase
&        And (ampersand)
|        Or (vertical bar)
!        Not or negation (exclamation)

The & operator returns TRUE only when both comparisons are true. The | operator returns TRUE if at least one comparison is true:

# Cars with city MPG > 20 AND highway MPG > 30
filter(mpg, (cty > 20) & (hwy > 30)) |> head()

# Cars that are either SUVs OR have city MPG > 25
filter(mpg, (class == "suv") | (cty > 25)) |> head()

Using brackets helps visually separate the different conditions and makes the code easier to read.
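
To see what each operator returns, including ! which is not used above, the following sketch applies them to short made-up vectors. Each position corresponds to one imaginary car.

```r
# Toy values for four imaginary cars
cty <- c(18, 24, 28, 14)
hwy <- c(29, 36, 33, 17)
is_suv <- c(FALSE, FALSE, FALSE, TRUE)

both <- (cty > 20) & (hwy > 30)  # FALSE TRUE TRUE FALSE
either <- is_suv | (cty > 25)    # FALSE FALSE TRUE TRUE
not_suv <- !is_suv               # TRUE TRUE TRUE FALSE

both
either
not_suv
```

Inside filter(), a condition such as !(class == "suv") would therefore keep every row that is not an SUV, which is equivalent to class != "suv".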