4.4 Computer tasks
Exercise 1
Using the `iris` dataset, familiarize yourself with the `prcomp` command and its output.
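As a starting point, the following minimal sketch runs `prcomp` on the four numeric columns of `iris` and looks at each part of the returned object. Dropping the `Species` column is an assumption about how the exercise is meant to be set up.

```r
# iris ships with base R; columns 1-4 are numeric, column 5 is the Species factor
X <- as.matrix(iris[, 1:4])

pca <- prcomp(X)   # centres each column by default; no rescaling unless scale. = TRUE

names(pca)         # components of the returned object
pca$sdev           # standard deviations of the principal components
pca$rotation       # loadings: one eigenvector per column
pca$center         # column means that were subtracted
head(pca$x)        # principal component scores for the first few observations
summary(pca)       # proportion of variance explained by each component
```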
Now, instead of using `prcomp`, we will do the analysis ourselves using the `eigen` command.
- Start by computing the sample mean and sample covariance matrix of the dataset (use \(n-1\) as the denominator when you compute the sample covariance to get the same answer as provided by `prcomp`).
- Now compute the eigenvalues and eigenvectors of the covariance matrix using `eigen`. Check that these agree with those computed by `prcomp` (noting that `prcomp` returns the standard deviations, which are the square roots of the eigenvalues).
- Now compute the principal component scores by multiplying the column-centred \(\mathbf X\) by the matrix of eigenvectors \(\mathbf V\). Check your answer agrees with the scores provided by `prcomp` (individual columns may differ in sign).
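The three steps above can be sketched as follows. This is one possible route rather than the only one; the variable names (`xbar`, `Xc`, `S`, `scores`) are illustrative choices, and the comparisons use absolute values because eigenvectors are only determined up to sign.

```r
X <- as.matrix(iris[, 1:4])
n <- nrow(X)

# Step 1: sample mean and sample covariance matrix (n - 1 denominator)
xbar <- colMeans(X)
Xc   <- sweep(X, 2, xbar)            # subtract the column means
S    <- crossprod(Xc) / (n - 1)      # t(Xc) %*% Xc / (n - 1)
all.equal(S, cov(X), check.attributes = FALSE)

# Step 2: eigendecomposition of S, compared with prcomp
e   <- eigen(S, symmetric = TRUE)
V   <- e$vectors
pca <- prcomp(X)
all.equal(e$values, pca$sdev^2)      # sdev is the square root of the eigenvalues
all.equal(abs(V), abs(pca$rotation), check.attributes = FALSE)

# Step 3: principal component scores from the centred data
scores <- Xc %*% V
all.equal(abs(scores), abs(pca$x), check.attributes = FALSE)
```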
Now we will do the same thing again, but using the `svd` command.
- Compute the scaled, column-centred data matrix \(\frac{1}{\sqrt{n-1}}\mathbf H\mathbf X\).
- Compute the SVD of both \(\frac{1}{\sqrt{n-1}}\mathbf H\mathbf X\) and \(\mathbf H\mathbf X\). How are the two sets of singular values related, and how do they relate to the eigenvalues computed previously? Are the singular vectors of \(\frac{1}{\sqrt{n-1}}\mathbf H\mathbf X\) and \(\mathbf H\mathbf X\) the same?
- Compute the SVD scores by doing both \(\mathbf H\mathbf X\mathbf V\) and \(\mathbf U\boldsymbol{\Sigma}\), where \[\mathbf H\mathbf X= \mathbf U\boldsymbol{\Sigma}\mathbf V^\top\] is the SVD of \(\mathbf H\mathbf X\).
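A possible way to set this up in R is sketched below, assuming the centring matrix is \(\mathbf H = \mathbf I_n - \frac{1}{n}\mathbf 1\mathbf 1^\top\). The names `HX`, `sv_raw` and `sv_scaled` are illustrative, and the `all.equal` calls are there for you to run and interpret rather than as a statement of the answers.

```r
X <- as.matrix(iris[, 1:4])
n <- nrow(X)

# Centring matrix H = I - (1/n) 1 1^T, so HX has column means zero
H  <- diag(n) - matrix(1 / n, n, n)
HX <- H %*% X

sv_raw    <- svd(HX)                  # SVD of HX
sv_scaled <- svd(HX / sqrt(n - 1))    # SVD of HX / sqrt(n - 1)

# Compare the singular values with each other and with the eigenvalues of S
all.equal(sv_raw$d, sqrt(n - 1) * sv_scaled$d)
all.equal(sv_scaled$d^2, eigen(cov(X), symmetric = TRUE)$values)

# Compare the singular vectors (up to sign)
all.equal(abs(sv_raw$v), abs(sv_scaled$v))
all.equal(abs(sv_raw$u), abs(sv_scaled$u))

# Scores computed two ways from the SVD of HX
scores_V <- HX %*% sv_raw$v
scores_U <- sv_raw$u %*% diag(sv_raw$d)
all.equal(scores_V, scores_U, check.attributes = FALSE)
```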
Exercise 2
In this question we will look at the `crabs` data from the `MASS` R package.
We will focus on the 5 continuous variables, all measured in mm:
- FL = frontal lobe size
- RW = rear width
- CL = carapace length
- CW = carapace width
- BD = body depth.
The sample size is \(200\).
```r
library(MASS)
?crabs              # read the help page to find out about the dataset
X <- crabs[, 4:8]   # construct data matrix X with columns FL, RW, CL, CW, BD
```
- Carry out PCA on the data in \(X\) using the covariance matrix, including obtaining a scree plot and plotting the PC scores.
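One way to get started is sketched below. Colouring the score plot by the species and sex columns of `crabs` (`sp` and `sex`) is an optional extra that is assumed here because it can help with interpretation; it is not something the exercise requires.

```r
library(MASS)

X   <- crabs[, 4:8]          # FL, RW, CL, CW, BD
pca <- prcomp(X)             # PCA on the covariance matrix (no rescaling)

summary(pca)                 # variance explained by each component
screeplot(pca, type = "lines", main = "Scree plot for the crabs data")

# First two PC scores, coloured by the four species/sex groups
grp <- interaction(crabs$sp, crabs$sex)
plot(pca$x[, 1:2], col = as.integer(grp), pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(grp), col = seq_along(levels(grp)), pch = 19)

pca$rotation[, 1:2]          # loadings to inspect when interpreting PC1 and PC2
```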
Some questions:
- Do you have any suggestions for an interpretation of the 1st PC?
- Are you able to come up with an interpretation for the 2nd PC?
- Do you think an analysis based on the sample covariance matrix \({\bf S}\) or the correlation matrix \({\bf R}\) is preferable with this dataset? Note that you can use `scale=TRUE` in `prcomp` to carry out PCA on \({\bf R}\). Does it make much difference which is used?
- Without doing any computation, think about what you expect the sample mean and sample covariance matrix to be for the PC scores. Check this numerically.
- Now check other properties of the PC scores listed in proposition 4.2.
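The numerical checks can be set up along the following lines. This is a sketch only: the names `pca_S` and `pca_R` are illustrative, and the full argument name `scale.` is used where the text writes `scale=TRUE` (R's partial argument matching accepts both).

```r
library(MASS)

X     <- crabs[, 4:8]
pca_S <- prcomp(X)                  # based on the covariance matrix S
pca_R <- prcomp(X, scale. = TRUE)   # based on the correlation matrix R

# Sample mean and sample covariance matrix of the PC scores
colMeans(pca_S$x)
round(cov(pca_S$x), 6)
pca_S$sdev^2                        # compare with the diagonal of the matrix above

# Proportion of variance explained under S and under R
summary(pca_S)$importance["Proportion of Variance", ]
summary(pca_R)$importance["Proportion of Variance", ]
```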
Try the following transformations of the data:
- adding a constant to the data: \[\mathbf z= \mathbf x+\mathbf c,\]
- scaling the data: \[\mathbf z= \mathbf D\mathbf x\] for some diagonal matrix \(\mathbf D\),
- rotating the data: \[\mathbf z= \mathbf U\mathbf x\] for some \(p\times p\) orthogonal matrix \(\mathbf U\). You can generate a random orthogonal matrix using the following commands
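One base-R possibility, offered here as an assumption rather than as the commands the notes had in mind, is to take the orthogonal factor from the QR decomposition of a matrix of standard normal draws. The seed and the name `Z` are arbitrary choices.

```r
library(MASS)

set.seed(1)                                   # arbitrary seed, for reproducibility
p <- 5
U <- qr.Q(qr(matrix(rnorm(p * p), p, p)))     # Q factor of a random Gaussian matrix

all.equal(crossprod(U), diag(p))              # check that t(U) %*% U is the identity

# Apply z = U x to every observation: rows of X are x^T, so Z = X U^T
X <- as.matrix(crabs[, 4:8])
Z <- X %*% t(U)
```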
Check the effect of each transformation on the principal components (the loadings/eigenvectors), the principal component scores, and the variance of the principal components (the eigenvalues).
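For the first two transformations, the comparison might be organised as in the sketch below. The particular vector `cvec` and diagonal matrix `D` are arbitrary illustrative values, and the rotation case can be handled in the same way.

```r
library(MASS)

X   <- as.matrix(crabs[, 4:8])
pca <- prcomp(X)

# Adding a constant vector c to every observation
cvec      <- c(10, -5, 0, 3, 100)             # arbitrary shift
Z_shift   <- sweep(X, 2, cvec, "+")
pca_shift <- prcomp(Z_shift)
all.equal(pca_shift$sdev, pca$sdev)                    # eigenvalues
all.equal(abs(pca_shift$rotation), abs(pca$rotation))  # loadings (up to sign)
all.equal(abs(pca_shift$x), abs(pca$x))                # scores (up to sign)

# Scaling each variable by a diagonal matrix D
D         <- diag(c(1, 2, 0.5, 10, 3))        # arbitrary scale factors
Z_scale   <- X %*% D
pca_scale <- prcomp(Z_scale)
rbind(original = pca$sdev^2, scaled = pca_scale$sdev^2)   # compare eigenvalues
cbind(original = pca$rotation[, 1], scaled = pca_scale$rotation[, 1])  # first loading vector
```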
Exercise 3
Download the final Premier League table for the 2018-19 season from https://www.rotowire.com/soccer/league-table.php?season=2018. There is a button to download the csv (comma-separated values) file in the bottom right-hand corner.
Load the data into R using the command `read.csv`. Note that you may need to manually delete the first row of the csv file before doing this. (I haven’t given you the command here, as learning how to do this yourself is important.)
Repeat the analysis from section 4.2.2. Does the meaning of the principal components change? Was the 2018-19 league season notably different from the 2019-20 season (which is the season analysed in the notes)?