ggplot2 to answer the questions.
To generate PDF documents from R Markdown, you need a LaTeX distribution
installed on your computer. The easiest way to do this is to use the
tinytex package, which installs a lightweight TeX distribution:
```r
# Run these commands ONCE in your R console (not in an Rmd file)
install.packages("tinytex")
tinytex::install_tinytex()
```
You only need to do this once per computer. After installation, R Markdown will automatically use this distribution to compile PDF documents. If you only need HTML output, you can skip this step.
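To confirm that the installation worked, a check along the following lines can be run in the console. This is a minimal sketch, assuming the tinytex package is available; `tinytex::is_tinytex()` reports whether the LaTeX distribution in use is TinyTeX, and the variable name `has_tinytex` is only illustrative.

```r
# Run in the R console, not in the Rmd file.
# `has_tinytex` is an illustrative name, not part of the assignment.
has_tinytex <- requireNamespace("tinytex", quietly = TRUE) &&
  tinytex::is_tinytex()

if (has_tinytex) {
  message("TinyTeX found; PDF output should compile.")
} else {
  message("TinyTeX not found; run tinytex::install_tinytex() or knit to HTML.")
}
```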
You will use the built-in
iris dataset, which contains measurements of 150 iris flowers from three
species: setosa, versicolor, and virginica.
The variables are:
Sepal.Length: Length of the sepal, in cm
Sepal.Width: Width of the sepal, in cm
Petal.Length: Length of the petal, in cm
Petal.Width: Width of the petal, in cm
Species: The species of iris (setosa, versicolor, or virginica)
First, familiarise yourself with the dataset using the following example commands in the RStudio console. You do not need to (and should not) include them in your submission PDF.
```r
head(iris)
str(iris)
?iris
```
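A few more console-only checks can confirm the description of the dataset given above (150 flowers, three species). This is a small sketch using base R only; the names `n_flowers` and `species_counts` are illustrative.

```r
# Console-only checks; not needed in the submission PDF.
n_flowers <- nrow(iris)                # number of observations
species_counts <- table(iris$Species)  # observations per species

n_flowers
species_counts
summary(iris)  # numeric summaries of the four measurements, counts for Species
```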
You might want to include the following code chunk towards the top of the R Markdown (Rmd) file, just below the lines ringfenced by the triple dashes (---):
```{r setup}
#| echo: false
library(ggplot2)
knitr::opts_chunk$set(
  echo = FALSE,
  out.width = "60%",
  fig.align = "center",
  fig.asp = 0.7
)
```
Setting echo = FALSE saves space by hiding the code in the PDF; I will still be able to see the code in the Rmd file.
(4 marks) Figure 1 attempts to visualise the distribution of Sepal.Width by Species.
geom_*() and making necessary changes to the chunk options.
Figure 1: Scatterplot of the species and width of the sepal.
Answer:
```r
ggplot(iris, aes(Species, Sepal.Width)) +
  geom_boxplot() +
  labs(y = "Sepal width (cm)")
```
Figure 2: Boxplot of the species and width of the sepal.
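The figure caption is controlled by the chunk options rather than by the plotting code. The chunk for this answer might look something like the sketch below, assuming the caption is set through knitr's fig.cap option; the chunk label fig-sepal-width is hypothetical and any unique label would do.

```{r fig-sepal-width, fig.cap = "Boxplot of the species and width of the sepal."}
ggplot(iris, aes(Species, Sepal.Width)) +
  geom_boxplot() +
  labs(y = "Sepal width (cm)")
```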
(4 marks) The boxplot of Petal.Length is provided in Figure 3.
Petal.Length and not other variables, argue in words why the boxplot does not tell the whole picture of this variable’s distribution.
Petal.Length’s distribution is not unimodal and symmetric, and provide a plot (that is not the same as part a) to illustrate your argument.
Figure 3: Boxplot of the length of the petal.
Answer: a. The histogram (or density plot) reveals that the distribution is bimodal.
```r
ggplot(iris, aes(Petal.Length)) +
  geom_histogram(binwidth = 0.2) +
  labs(x = "Petal length (cm)")
```
Figure 4: Histogram of the length of the petal.
Answer: b. The mean and standard deviation vary across species. A boxplot by species reveals this. A histogram or density plot by species (using e.g. colours) is also acceptable.
```r
ggplot(iris, aes(Species, Petal.Length)) +
  geom_boxplot() +
  labs(y = "Petal length (cm)")
```
Figure 5: Boxplot of the length of the petal by species.
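The alternative mentioned in the answer, a density plot by species distinguished by colour, could be sketched as follows. This is one possible version, not the marking scheme's reference plot; the transparency value and the name `p_density` are arbitrary choices.

```r
library(ggplot2)

# Overlaid density curves of petal length, one per species.
# alpha below 1 keeps overlapping regions visible.
p_density <- ggplot(iris, aes(Petal.Length, fill = Species)) +
  geom_density(alpha = 0.5) +
  labs(x = "Petal length (cm)", y = "Density")

p_density
```

Separate curves per species make it easy to see that the centres and spreads differ between groups.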
(3 marks) Find a pair of variables that are strongly linearly correlated, and do the following in one single plot:
Answer: a. Petal length and petal width are most strongly correlated amongst all pairs. Also acceptable are petal length and sepal length, or petal width and sepal length.
```r
ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(y = "Petal width (cm)", x = "Petal length (cm)")
## `geom_smooth()` using formula = 'y ~ x'
```
Figure 6: Scatterplot of the width against the length of the petal.
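One way to find the most strongly correlated pair is to compute all pairwise Pearson correlations in the console. The sketch below uses base R only; the helper names (`measurements`, `cor_matrix`, `off_diag`, `strongest`) are illustrative and not part of the assignment.

```r
# Pairwise Pearson correlations between the four numeric columns.
measurements <- iris[, c("Sepal.Length", "Sepal.Width",
                         "Petal.Length", "Petal.Width")]
cor_matrix <- cor(measurements)
round(cor_matrix, 2)

# Locate the strongest off-diagonal pair by absolute correlation.
off_diag <- abs(cor_matrix)
diag(off_diag) <- 0
strongest <- which(off_diag == max(off_diag), arr.ind = TRUE)[1, ]
rownames(cor_matrix)[strongest]
```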
Answer: b. geom_smooth() without method = "lm" is not accepted, but geom_abline() with the correct slope and intercept is accepted.
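For the geom_abline() route, the slope and intercept can be taken from a fitted linear model instead of being typed in by hand. The following is a minimal sketch, assuming petal width is regressed on petal length; the names `fit`, `line_coefs`, and `p_abline` are illustrative.

```r
library(ggplot2)

# Fit the simple linear regression of petal width on petal length.
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
line_coefs <- coef(fit)  # named vector: "(Intercept)", "Petal.Length"

# Draw the fitted line from its intercept and slope.
p_abline <- ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_point() +
  geom_abline(
    intercept = line_coefs[["(Intercept)"]],
    slope = line_coefs[["Petal.Length"]]
  ) +
  labs(x = "Petal length (cm)", y = "Petal width (cm)")

p_abline
```

Unlike geom_smooth(method = "lm"), this draws only the line, without a confidence band.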
(9 marks) Overall:
Answer: Full marks are awarded for following all of these rules.