## 1.4 Computer tasks

If you haven’t done so already, please download and install R and Rstudio. R is the programming language, and Rstudio is an integrated development environment that makes using R much more pleasurable. My advice is to always use Rstudio and never run code in R itself.

**For complete beginners:**For those who are completely new to R (or those who want a refresher), I recommend working through an online tutorial. This tutorial looks good, but contains more than you’ll need.**Warm-up:**The most important aspects of R to focus on for this module are- Basic plotting
- Manipulation of matrices and data frames.

Let’s look at the

`iris`

dataset.- Can you plot the sepal length against the sepal width?

We’ll now do some exercises on data manipulation. Note that there are several ways to do basic data manipulation in R. You can use base R commands or if you prefer, you can use the dplyr commands which are part of the

`tidyverse`

packages. For example, to select columns, in base you can do:

or using dplyr you can do

Can you select the column of the

`iris`

data that contains just the sepal length and add it to the sepal width?To select only certain rows of the data (i.e. to filter it), we can again use either base R or dplyr.

```
iris[iris[,3]<5,]
# select all rows that have a petal length less than 5.
filter(iris, Petal.Length<5) # do the same thing using dplyr
```

Can you now select all the rows of the

`iris`

data frame that are for species*setosa*? What is the mean petal length for these flowers?Can you select all the flowers that have a sepal length greater than 5? What is the proportion of each species of iris in this set?

A nice aspect of dplyr is that you can chain commmands together. So for example, to select the versicolour flowers with petal width less than 1.5, we can do

- Can you select all the flowers that have a sepal length greater than 6, and a petal length less than 5? What is the proportion of each species in this set?

Note that `iris`

is a data frame

`## [1] TRUE`

which is a type of structure used in R. This is convenient for some tasks, but not for others. Let’s first extract the four numerical columns and store them as a matrix \(X\).

`## [1] FALSE`

`## [1] TRUE`

- Select the 4 numerical columns and multiply the first column by 1, the second by 2, the third by 3, and the 4th by 4. One way to do this is by multiplying \(X\) by the diagonal matrix

```
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 2 0 0
## [3,] 0 0 3 0
## [4,] 0 0 0 4
```

- The table below shows the module marks for 5 students on the modules G11PRB (\(P\)) and G11STA (\(S\)).

Student | P | S |
---|---|---|

A | 41 | 63 |

B | 72 | 82 |

C | 46 | 38 |

D | 77 | 57 |

E | 59 | 85 |

- As an exercise, calculate the sample mean, sample covariance, sample correlation and total variation by hand.

- Now calculate these in R using
`colMeans`

,`cov`

, and`cor`

. These commands assume each column is a different variable, and each row a different observation.

```
library(dplyr)
Ex1 <- data.frame(
Student=LETTERS[1:5],
P = c(41,72,46,77,59),
S = c(63,82,38,57,85)
)
Ex1 |> select_if(is.numeric) |> colMeans()
```

```
## P S
## 59 65
```

```
## P S
## P 246.5 116.0
## S 116.0 371.5
```

Note that by default R uses \(n-1\) in the denominator for the variance and covariance commands, whereas we used \(n\) in our definition.

We will be using the `dplyr`

R package to perform basic data manipulation in R. If you are unfamiliar with `dplyr`

, you can read about it at https://dplyr.tidyverse.org/. The pipe command `|>`

is particularly useful for chaining together multiple commands.

You could compute the same quantities using more familiar commands by selecting the numerical columns:

```
## P S
## 59 65
```

```
## P S
## P 246.5 116.0
## S 116.0 371.5
```

- Can you compute the covariance matrix using the definition in Equation (1.2)?

- The
`mtcars`

dataset is another built-in dataset in R. You can read about it by typing`?mtcars`

in R. Note that some of the variables are factors. You can ensure R treats them as factors by using the following command to create a dataset where they are listed as factors:

```
mtcars2 <- within(mtcars, {
vs <- factor(vs, labels = c("V", "S"))
am <- factor(am, labels = c("automatic", "manual"))
cyl <- ordered(cyl)
gear <- ordered(gear)
carb <- ordered(carb)
})
```

Work with the `mtcars2`

dataframe when you use `ggplot2`

.

- Create some plots to explore the structure of this dataset using ggplot2.
- Try using the
`pairs`

command from base R and the`ggpairs`

command from GGally. - Try colouring the scatter plots accoring to whether the car is automatic or not. - Create another plot using colour to represent the number of gears.
- Find another type of plot from one of the plot galleries and try to create a similar plot with these data.

- We can 100 generate samples from the multivariate normal distribution with mean vector
\[{\boldsymbol{\mu}}= \left(\begin{array}{c}1\\0\end{array}\right)\]
and covariance matrix
\[\boldsymbol{\Sigma}= \left(\begin{array}{cc}2&1\\1&2\end{array}\right)\]
as follows (you may need to install the R package
`mvtnorm`

first):

```
library(mvtnorm)
mu = c(1,0)
Sigma=matrix(c(2,1,1,2), nr=2)
X <- rmvnorm(n=100, mean=mu, sigma=Sigma)
```

- Compute the sample mean and covariance matrix of these samples.
- Generate a new sample dataset, \(X\), and recompute the sample mean and covariance matrix. What do you notice?
- Try changing \(n\), the number of samples (making it much larger say), and now recomputing the mean and covariance. What do you notice?

**Optional**Download the MNIST data from Moodle and load it into R.

This loads a list `mnist`

that splits the data into two parts

Let’s just look at the training set. This is also a list containing the image intensities and the image labels

If we select just the first image we can see it is a vector of length 784 containing numbers between 0 and 1.

I’ve created a function to help you plot these images.

```
library(reshape2)
library(ggplot2)
plot.mnist <- function(im){
#im[im<0]<-0 # set any negative intensities to zero
#im[im>1]<-1 # set an intensities bigger than 1 to 1.
if(is.vector(im)){ # a single image
A<-matrix(im, nr=28, byrow=F)
C<- melt(A, varnames = c("x", "y"),
value.name = "intensity")
p<-ggplot(C, aes(x = x, y = y, fill = intensity))+
geom_tile(aes(fill=intensity))+
scale_fill_gradient(low='white', high='black')+
scale_y_reverse()+theme(
strip.background = element_blank(),
strip.text.x = element_blank(),
panel.spacing = unit(0, "lines"),
axis.text = element_blank(),
axis.ticks = element_blank()
)
}
else{
if (dim(im)[2]!=784){
im = t(im)
}
n <- dim(im)[1]
As <- array(im, dim = c(n, 28, 28))
Cs<- melt(As, varnames = c("image","x", "y"),
value.name = "intensity")
p<-ggplot(Cs, aes(x = x, y = y, fill = intensity))+
geom_tile(aes(fill=intensity))+
scale_fill_gradient(low='white', high='black')+
facet_wrap(~ image, nrow = floor(sqrt(n))+1,
ncol = floor(sqrt(n))+1)+
scale_y_reverse()+theme(
strip.background = element_blank(),
strip.text.x = element_blank(),
panel.spacing = unit(0, "lines"),
axis.text = element_blank(),
axis.ticks = element_blank()
)
}
return(p)
}
```

Use this command to plot the first 10 images from the MNIST training set.

Select all the 5s from the MNIST training set. Plot a selection of these digits.