4.4 Computer tasks

Exercise 1

Using the iris dataset, familiarize yourself with the prcomp command and its output.
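As a possible starting point, the sketch below runs prcomp on the four numeric columns of iris and looks at each component of the returned object. Dropping the Species column and using the default (unscaled) analysis are assumptions made here, not something the exercise prescribes.

```r
# Keep only the four numeric measurement columns (drop the Species factor)
X <- as.matrix(iris[, 1:4])

# prcomp centres the columns by default; scale. = FALSE leaves the units alone
pca <- prcomp(X)

names(pca)        # components stored in the returned object
pca$sdev          # standard deviations of the principal components
pca$rotation      # loadings: one eigenvector per column
pca$center        # column means that were subtracted before the analysis
head(pca$x)       # principal component scores for the first few flowers
summary(pca)      # proportion of variance explained by each component
```

The help page `?prcomp` describes each of these components in more detail.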

Now, instead of using prcomp we will do the analysis ourselves using the eigen command.

Start by computing the sample mean vector and sample covariance matrix of the dataset (use \(n-1\) as the denominator when you compute the sample covariance matrix, so that your answer matches prcomp).
Now compute the eigenvalues and eigenvectors of the covariance matrix using eigen. Check that these agree with those computed by prcomp (noting that prcomp returns the standard deviations, which are the square roots of the eigenvalues).
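Now compute the principal component scores by multiplying the column-centred data matrix (that is, \(\mathbf X\) with the sample mean subtracted from each row) by the matrix of eigenvectors \(\mathbf V\). Check your answer agrees with the scores provided by prcomp, up to the sign of each column.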
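The three steps above can be sketched roughly as follows. This is one way of organising the computation, not the only one; the variable names (`Xc`, `S`, `eig`, `scores`) are illustrative choices, and the comparisons use absolute values because eigenvectors are only defined up to sign.

```r
X <- as.matrix(iris[, 1:4])
n <- nrow(X)

xbar <- colMeans(X)               # sample mean vector
Xc   <- sweep(X, 2, xbar)         # subtract the mean from each row (centred data)
S    <- crossprod(Xc) / (n - 1)   # sample covariance matrix, denominator n - 1

eig    <- eigen(S)                # eigenvalues are returned in decreasing order
V      <- eig$vectors             # eigenvectors as columns
scores <- Xc %*% V                # principal component scores

pca <- prcomp(X)

# Each comparison returns TRUE if the two computations agree numerically
all.equal(S, cov(X))
all.equal(sqrt(eig$values), pca$sdev)
all.equal(abs(V), abs(unname(pca$rotation)))
all.equal(abs(scores), abs(unname(pca$x)))
```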

Now we will do the same thing again, but using the svd command.

Compute the scaled, column-centred data matrix \(\frac{1}{\sqrt{n-1}}\mathbf H\mathbf X\), where \(\mathbf H\) is the centering matrix.
Compute the SVD of \(\frac{1}{\sqrt{n-1}}\mathbf H\mathbf X\) and of \(\mathbf H\mathbf X\). How are the two sets of singular values related, and how do they relate to the eigenvalues computed previously? Are the singular vectors of \(\frac{1}{\sqrt{n-1}}\mathbf H\mathbf X\) and \(\mathbf H\mathbf X\) the same?
Compute the SVD scores by doing both \(\mathbf H\mathbf X\mathbf V\) and \(\mathbf U\boldsymbol{\Sigma}\), where \[\mathbf H\mathbf X= \mathbf U\boldsymbol{\Sigma}\mathbf V^\top\] is the SVD of \(\mathbf H\mathbf X\).
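A minimal sketch of the SVD route is given below, assuming the iris measurements are used again. Here `scale(X, center = TRUE, scale = FALSE)` stands in for multiplying by \(\mathbf H\), and the object names are illustrative. The `all.equal` calls are numerical checks you can use to test your own answers to the questions above.

```r
X  <- as.matrix(iris[, 1:4])
n  <- nrow(X)
HX <- scale(X, center = TRUE, scale = FALSE)   # column-centred data, i.e. HX

sv_raw    <- svd(HX)                  # SVD of HX
sv_scaled <- svd(HX / sqrt(n - 1))    # SVD of HX / sqrt(n - 1)

# Compare the two sets of singular values after rescaling
all.equal(sv_raw$d / sqrt(n - 1), sv_scaled$d)

# Compare squared singular values with the eigenvalues of the covariance matrix
all.equal(sv_scaled$d^2, eigen(cov(X))$values)

# Compare right singular vectors (absolute values, since signs are arbitrary)
all.equal(abs(sv_raw$v), abs(sv_scaled$v))

# Two ways of computing the scores: HX V and U Sigma
scores_HXV <- HX %*% sv_raw$v
scores_US  <- sv_raw$u %*% diag(sv_raw$d)
all.equal(unname(scores_HXV), scores_US)
```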

Exercise 2

In this question we will look at the crabs data from the MASS R package. We will focus on the 5 continuous variables, all measured in mm:

FL = frontal lobe size
RW = rear width
CL = carapace length
CW = carapace width
BD = body depth.

The sample size is \(200\).

library(MASS)
?crabs # read the help page to find out about the dataset
X <- crabs[, 4:8]  # construct data matrix X with columns FL, RW, CL, CW, BD

Carry out PCA on the data in \(X\) using the correlation matrix, including obtaining a scree plot and plotting the PC scores.
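One way to set this up is sketched below, using prcomp with `scale. = TRUE` so that the analysis is based on the correlation matrix. Colouring the score plot by the species column `sp` is an extra choice made here for illustration; the exercise does not ask for it.

```r
library(MASS)                       # provides the crabs data

X <- crabs[, 4:8]                   # FL, RW, CL, CW, BD

# scale. = TRUE standardises each variable, giving PCA on the correlation matrix
pca <- prcomp(X, scale. = TRUE)

summary(pca)                        # variance explained by each component
screeplot(pca, type = "lines", main = "Scree plot: crabs (correlation matrix)")

# Scores on the first two PCs, coloured by species (an illustrative choice)
plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(crabs$sp),
     xlab = "PC1 score", ylab = "PC2 score")
```

Plotting other pairs of columns of `pca$x` in the same way may help with the interpretation questions that follow.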

Some questions:

Do you have any suggestions for an interpretation for the 1st PC?
Are you able to come up with an interpretation for the 2nd PC?
Do you think an analysis based on the sample covariance matrix \({\bf S}\) or the correlation matrix \({\bf R}\) is preferable with this dataset? Does it make much difference which is used?
Without doing any computation, think about what you expect the sample mean and sample covariance matrix to be for the PC scores. Check this numerically.
Now check the other properties of the PC scores listed in Proposition 4.2.
Try the following transformations of the data.
- adding a constant to the data \[\mathbf z= \mathbf x+\mathbf c,\]
- scaling the data: \[\mathbf z= \mathbf D\mathbf x\] for some diagonal matrix \(\mathbf D\),
- rotating the data: \[\mathbf z= \mathbf U\mathbf x\] for some \(p\times p\) orthogonal matrix \(\mathbf U\). You can generate a random orthogonal matrix using the following commands:

library(pracma)
U <- randortho(5)

Check the effect of each transformation on the principal components (the loadings/eigenvectors), the principal component scores, and the variance of the principal components (the eigenvalues).
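The sketch below shows one way to build the three transformed datasets and compare eigenvalues, assuming PCA on the covariance matrix. The helper `pca_cov`, the shift `c_vec` and the diagonal entries of `D` are hypothetical choices made for illustration. To keep the example in base R, the orthogonal matrix is taken from a QR decomposition of a random Gaussian matrix rather than from pracma::randortho. Because each row of the data matrix is \(\mathbf x^\top\), the transformation \(\mathbf z= \mathbf U\mathbf x\) becomes right-multiplication by \(\mathbf U^\top\).

```r
library(MASS)
X <- as.matrix(crabs[, 4:8])

# Hypothetical helper: PCA via the eigendecomposition of the covariance matrix
pca_cov <- function(Z) {
  e <- eigen(cov(Z))
  list(values = e$values, vectors = e$vectors)
}

set.seed(1)
# Base-R stand-in for randortho(5): Q factor of a random Gaussian matrix
U     <- qr.Q(qr(matrix(rnorm(25), 5, 5)))
c_vec <- c(10, -5, 0, 3, 7)            # arbitrary constant vector c
D     <- diag(c(1, 2, 0.5, 10, 3))     # arbitrary diagonal scaling matrix

Z_shift  <- sweep(X, 2, c_vec, "+")    # z = x + c, applied to every row
Z_scale  <- X %*% D                    # z = D x (D is symmetric)
Z_rotate <- X %*% t(U)                 # z = U x

base <- pca_cov(X)

# Eigenvalues side by side: original, shifted, scaled, rotated
cbind(original = base$values,
      shifted  = pca_cov(Z_shift)$values,
      scaled   = pca_cov(Z_scale)$values,
      rotated  = pca_cov(Z_rotate)$values)
```

The same pattern can be used to compare the `vectors` components and the resulting scores.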

Exercise 3

Download the final Premier League table for the 2022-23 season from https://www.rotowire.com/soccer/league-table.php?season=2022. There is a button to download the csv (comma-separated values) file in the bottom right-hand corner.

Load the data into R using the command read.csv. Note that you may need to manually delete the first row of the csv file before doing this. This can be done by clicking on the file name in the Files tab in RStudio, or by opening the file in any text editor or spreadsheet.
Repeat the analysis from section 4.2.2. Does the meaning of the principal components change? Was the 2022-23 league season notably different to the 2019-20 season (which is the season analysed in the notes)?
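As an alternative to deleting the first row by hand, read.csv can be told to skip it. The sketch below assumes the extra row is the very first line of the file and that the column names are on the second line. The helper name `read_league_table` and the file name are hypothetical, so check the result with str before using it.

```r
# Hypothetical helper: skip = 1 ignores the first line of the file,
# so the second line is then read as the header row
read_league_table <- function(path) {
  read.csv(path, skip = 1, header = TRUE)
}

# Hypothetical file name: replace with the path of the file you downloaded
path <- "premier-league-table.csv"

if (file.exists(path)) {
  league <- read_league_table(path)
  str(league)   # check that the column names and types look sensible
}
```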

Exercise 4

The decathlon2 data from the factoextra package contains data on 27 different decathletes’ performances during two different competitions. Let’s start by extracting their performance in the 10 different decathlon events at the Olympics.

library("factoextra")

data(decathlon2)
X <- decathlon2 |>
  dplyr::filter(Competition == "OlympicG") |>
  dplyr::select(-Rank, -Points, -Competition)

Conduct PCA on these data, thinking about whether it is appropriate to use the correlation or covariance matrix. Give a scree plot, and determine how much information in the data is retained by the first two PCs. Interpret the leading PCs.

Plot the first two scores for the different athlete performances, labelling each point. From these plots, comment on particular strengths/weaknesses in some performances.
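A rough sketch of the workflow is given below. It assumes a correlation-based analysis (`scale. = TRUE`), on the grounds that the events are recorded in different units; you should decide for yourself whether that is appropriate. The data are subset with base R indexing here so that the athlete names are kept as row names for labelling.

```r
library(factoextra)   # provides the decathlon2 data

# Keep the Olympic performances and the ten event columns
keep   <- decathlon2$Competition == "OlympicG"
events <- setdiff(names(decathlon2), c("Rank", "Points", "Competition"))
X      <- decathlon2[keep, events]

# Assumption: standardise the events, i.e. PCA on the correlation matrix
pca <- prcomp(X, scale. = TRUE)

# Proportion of total variance retained by each PC, and by the first two
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(prop_var)[2]

screeplot(pca, type = "lines", main = "Scree plot: decathlon2, Olympic Games")

# First two scores, labelled by athlete (row names of the data frame)
plot(pca$x[, 1], pca$x[, 2], type = "n",
     xlab = "PC1 score", ylab = "PC2 score")
text(pca$x[, 1], pca$x[, 2], labels = rownames(X), cex = 0.7)
```

Printing `pca$rotation[, 1:2]` alongside the plot may help when interpreting the leading PCs.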