1.4 Computer tasks

If you haven’t done so already, please download and install R and Rstudio. R is the programming language, and Rstudio is an integrated development environment that makes using R much more pleasurable. My advice is to always use Rstudio and never run code in R itself.

  1. For complete beginners: For those who are completely new to R (or those who want a refresher), I recommend working through an online tutorial. This tutorial looks good, but contains more than you’ll need.

  2. Warm-up: The most important aspects of R to focus on for this module are

    • Basic plotting
    • Manipulation of matrices and data frames.

    Let’s look at the iris dataset.

    • Can you plot the sepal length against the sepal width?

    We’ll now do some exercises on data manipulation. Note that there are several ways to do basic data manipulation in R. You can use base R commands or if you prefer, you can use the dplyr commands which are part of the tidyverse packages. For example, to select columns, in base you can do:

iris[,2] # selects column 2
iris$Sepal.Width # selects the same column by name

or using dplyr you can do

library(dplyr)
select(iris, "Sepal.Width")
  • Can you select the column of the iris data that contains just the sepal length and add it to the sepal width?

    To select only certain rows of the data (i.e. to filter it), we can again use either base R or dplyr.

iris[iris[,3]<5,] 
# select all rows that have a petal length less than 5.
filter(iris, Petal.Length<5) # do the same thing using dplyr
  • Can you now select all the rows of the iris data frame that are for species setosa? What is the mean petal length for these flowers?

  • Can you select all the flowers that have a sepal length greater than 5? What is the proportion of each species of iris in this set?

A nice aspect of dplyr is that you can chain commmands together. So for example, to select the versicolour flowers with petal width less than 1.5, we can do

iris |> filter(Species=='versicolor') |> filter(Petal.Width<1.5)
  • Can you select all the flowers that have a sepal length greater than 6, and a petal length less than 5? What is the proportion of each species in this set?

Note that iris is a data frame

is.data.frame(iris)
## [1] TRUE

which is a type of structure used in R. This is convenient for some tasks, but not for others. Let’s first extract the four numerical columns and store them as a matrix \(X\).

is.matrix(iris)
## [1] FALSE
X <- as.matrix(iris[,1:4])
is.matrix(X)
## [1] TRUE
  • Select the 4 numerical columns and multiply the first column by 1, the second by 2, the third by 3, and the 4th by 4. One way to do this is by multiplying \(X\) by the diagonal matrix
diag(1:4)
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    2    0    0
## [3,]    0    0    3    0
## [4,]    0    0    0    4
  1. The table below shows the module marks for 5 students on the modules G11PRB (\(P\)) and G11STA (\(S\)).
Student P S
A 41 63
B 72 82
C 46 38
D 77 57
E 59 85
  • As an exercise, calculate the sample mean, sample covariance, sample correlation and total variation by hand.
  • Now calculate these in R using colMeans, cov, and cor. These commands assume each column is a different variable, and each row a different observation.
library(dplyr)
Ex1 <- data.frame(
  Student=LETTERS[1:5],
  P = c(41,72,46,77,59),
  S = c(63,82,38,57,85)
  )

Ex1 |> select_if(is.numeric) |> colMeans()
##  P  S 
## 59 65
Ex1 |> select_if(is.numeric) |> cov()
##       P     S
## P 246.5 116.0
## S 116.0 371.5

Note that by default R uses \(n-1\) in the denominator for the variance and covariance commands, whereas we used \(n\) in our definition.

We will be using the dplyr R package to perform basic data manipulation in R. If you are unfamiliar with dplyr, you can read about it at https://dplyr.tidyverse.org/. The pipe command |> is particularly useful for chaining together multiple commands.

You could compute the same quantities using more familiar commands by selecting the numerical columns:

colMeans(Ex1[,2:3])
##  P  S 
## 59 65
cov(Ex1[,2:3])
##       P     S
## P 246.5 116.0
## S 116.0 371.5
  • Can you compute the covariance matrix using the definition in Equation (1.2)?
  1. The mtcars dataset is another built-in dataset in R. You can read about it by typing ?mtcars in R. Note that some of the variables are factors. You can ensure R treats them as factors by using the following command to create a dataset where they are listed as factors:
mtcars2 <- within(mtcars, {
   vs <- factor(vs, labels = c("V", "S"))
   am <- factor(am, labels = c("automatic", "manual"))
   cyl  <- ordered(cyl)
   gear <- ordered(gear)
   carb <- ordered(carb)
})

Work with the mtcars2 dataframe when you use ggplot2.

  • Create some plots to explore the structure of this dataset using ggplot2.
  • Try using the pairs command from base R and the ggpairs command from GGally.
  • Try colouring the scatter plots accoring to whether the car is automatic or not. - Create another plot using colour to represent the number of gears.
  • Find another type of plot from one of the plot galleries and try to create a similar plot with these data.
  1. We can 100 generate samples from the multivariate normal distribution with mean vector \[{\boldsymbol{\mu}}= \left(\begin{array}{c}1\\0\end{array}\right)\] and covariance matrix \[\boldsymbol{\Sigma}= \left(\begin{array}{cc}2&1\\1&2\end{array}\right)\] as follows (you may need to install the R package mvtnorm first):
library(mvtnorm)
mu = c(1,0)
Sigma=matrix(c(2,1,1,2), nr=2)
X <- rmvnorm(n=100, mean=mu, sigma=Sigma)
  • Compute the sample mean and covariance matrix of these samples.
  • Generate a new sample dataset, \(X\), and recompute the sample mean and covariance matrix. What do you notice?
  • Try changing \(n\), the number of samples (making it much larger say), and now recomputing the mean and covariance. What do you notice?
  1. Optional Download the MNIST data from Moodle and load it into R.
load('mnist.rda')

This loads a list mnist that splits the data into two parts

mnist$train ## a training set of 60000 images
mnist$test ## a test set of 10000 images

Let’s just look at the training set. This is also a list containing the image intensities and the image labels

mnist$train$x # image intensities
mnist$train$y # image labels

If we select just the first image we can see it is a vector of length 784 containing numbers between 0 and 1.

mnist$train$x[1,]

I’ve created a function to help you plot these images.

library(reshape2)
library(ggplot2)


plot.mnist <- function(im){
  #im[im<0]<-0 # set any negative intensities to zero
  #im[im>1]<-1 # set an intensities bigger than 1 to 1.
  
  
  if(is.vector(im)){ # a single image
    
    A<-matrix(im, nr=28, byrow=F)
    C<- melt(A, varnames = c("x", "y"), 
             value.name = "intensity")
    p<-ggplot(C, aes(x = x, y = y, fill = intensity))+
      geom_tile(aes(fill=intensity))+
      scale_fill_gradient(low='white', high='black')+
      scale_y_reverse()+theme(
        strip.background = element_blank(),
        strip.text.x = element_blank(),
        panel.spacing = unit(0, "lines"),
        axis.text = element_blank(),
        axis.ticks = element_blank()
      ) 
  }
  else{
    if (dim(im)[2]!=784){
      im = t(im)
    } 
    n <- dim(im)[1]
    As <- array(im, dim = c(n, 28, 28))
    
    Cs<- melt(As, varnames = c("image","x", "y"), 
              value.name = "intensity")
    p<-ggplot(Cs, aes(x = x, y = y, fill = intensity))+
      geom_tile(aes(fill=intensity))+
      scale_fill_gradient(low='white', high='black')+
      facet_wrap(~ image, nrow = floor(sqrt(n))+1, 
                 ncol = floor(sqrt(n))+1)+
      scale_y_reverse()+theme(
        strip.background = element_blank(),
        strip.text.x = element_blank(),
        panel.spacing = unit(0, "lines"),
        axis.text = element_blank(),
        axis.ticks = element_blank()
      ) 
    
  }
  return(p)
}
  • Use this command to plot the first 10 images from the MNIST training set.

  • Select all the 5s from the MNIST training set. Plot a selection of these digits.