5.4 Computer tasks
Task 1
Consider again the crabs dataset you looked at in the exercises in the chapter on PCA (see 4.4). We now consider a canonical correlation analysis in which one set of variables, the \(\mathbf x\)-set, is given by CL and CW and the other set, the \(\mathbf y\)-set, is given by FL, RW and BD.
library(MASS)
?crabs # read the help page to find out about the dataset
X=as.matrix(crabs[4:8])
n=200 # sample size
H=diag(rep(1,n))-rep(1,n)%*%t(rep(1,n))/n # n times n centering matrix
library(dplyr)
X1 = crabs %>% dplyr::select(CL, CW) %>% as.matrix
# store CL and CW in X1
Y1 = crabs %>% dplyr::select(FL, RW, BD) %>% as.matrix
# store FL, RW and BD in Y1
Sxx=t(X1)%*%H%*%X1/n # find x-variable variance matrix
Syy=t(Y1)%*%H%*%Y1/n # find y-variable variance matrix
Sxy=t(X1)%*%H%*%Y1/n # find cross-covariance matrix
Calculate \(\mathbf S_{\mathbf x\mathbf x}^{-1/2}\) and \(\mathbf S_{\mathbf y\mathbf y}^{-1/2}\) by first computing the spectral decompositions of \(\mathbf S_{\mathbf x\mathbf x}\) and \(\mathbf S_{\mathbf y\mathbf y}\).
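One possible way to carry out this step is sketched below. It writes each covariance matrix as \(\mathbf V \boldsymbol\Lambda \mathbf V^\top\) using eigen, and then forms \(\mathbf V \boldsymbol\Lambda^{-1/2} \mathbf V^\top\). The helper name inv_sqrt and the object names Sxx_inv_half and Syy_inv_half are our own choices for illustration, not part of any package.

```r
library(MASS)  # provides the crabs data

X1 <- as.matrix(crabs[, c("CL", "CW")])        # x-set
Y1 <- as.matrix(crabs[, c("FL", "RW", "BD")])  # y-set
n  <- nrow(crabs)
H  <- diag(n) - matrix(1, n, n) / n            # n x n centering matrix

Sxx <- t(X1) %*% H %*% X1 / n
Syy <- t(Y1) %*% H %*% Y1 / n

# Hypothetical helper: inverse symmetric square root of a
# positive definite matrix S, via its spectral decomposition
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

Sxx_inv_half <- inv_sqrt(Sxx)
Syy_inv_half <- inv_sqrt(Syy)

# Sanity check: this product should be close to the 2 x 2 identity
round(Sxx_inv_half %*% Sxx %*% Sxx_inv_half, 6)
```

The final line is a useful habit: if the product is not close to the identity matrix, something has gone wrong in the decomposition.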
Now calculate the matrix \(\mathbf Q\) and compute its singular value decomposition.
Compute the first pair of CC vectors and CC variables \(\eta_1\) and \(\psi_1\). What is the first canonical correlation?
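The following is a minimal sketch of these steps, under the assumption that \(\mathbf Q = \mathbf S_{\mathbf x\mathbf x}^{-1/2}\mathbf S_{\mathbf x\mathbf y}\mathbf S_{\mathbf y\mathbf y}^{-1/2}\), and that the CC vectors are obtained by premultiplying the first left and right singular vectors of \(\mathbf Q\) by the corresponding inverse square root. Check this against the definitions given in the chapter before relying on it. The names inv_sqrt, a1, b1, eta1 and psi1 are our own.

```r
library(MASS)  # provides the crabs data

X1 <- as.matrix(crabs[, c("CL", "CW")])
Y1 <- as.matrix(crabs[, c("FL", "RW", "BD")])
n  <- nrow(crabs)
H  <- diag(n) - matrix(1, n, n) / n            # centering matrix

Sxx <- t(X1) %*% H %*% X1 / n
Syy <- t(Y1) %*% H %*% Y1 / n
Sxy <- t(X1) %*% H %*% Y1 / n

# Hypothetical helper: inverse symmetric square root via eigen
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

# Assumed definition of Q (a 2 x 3 matrix here)
Q     <- inv_sqrt(Sxx) %*% Sxy %*% inv_sqrt(Syy)
Q_svd <- svd(Q)

# First pair of CC vectors
a1 <- inv_sqrt(Sxx) %*% Q_svd$u[, 1]
b1 <- inv_sqrt(Syy) %*% Q_svd$v[, 1]

# First pair of CC variables (scores), computed from the centered data
eta1 <- as.vector(H %*% X1 %*% a1)
psi1 <- as.vector(H %*% Y1 %*% b1)

Q_svd$d[1]       # largest singular value of Q
cor(eta1, psi1)  # sample correlation of the first CC variables
```

Comparing the last two printed values is a quick check that the scores have been constructed consistently with the decomposition.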
Plot \(\psi_1\) vs \(\eta_1\). What does the plot tell you (if anything)?
Repeat the above to find the second pair of CC vectors, and the second set of CC variables/scores, and plot these against each other and against the first CC scores. Is there any interesting structure in any of the plots? Which plots suggest random scatter?
Finally, repeat the analysis above using the cc command and plt.cc from the package CCA, which you will need to install.
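As a rough sketch of what this might look like, the code below calls cc and plt.cc if the CCA package is available. It also runs cancor, which ships with base R in the stats package, as an independent cross-check; cancor is our addition and is not mentioned in the task. The object names base_fit and cca_fit are our own.

```r
library(MASS)  # provides the crabs data

X1 <- as.matrix(crabs[, c("CL", "CW")])
Y1 <- as.matrix(crabs[, c("FL", "RW", "BD")])

# Cross-check with base R: canonical correlations from stats::cancor
base_fit <- cancor(X1, Y1)
base_fit$cor

# The CCA package is not part of base R, so only use it if installed
if (requireNamespace("CCA", quietly = TRUE)) {
  cca_fit <- CCA::cc(X1, Y1)
  print(cca_fit$cor)                      # canonical correlations
  CCA::plt.cc(cca_fit, var.label = TRUE)  # variable and individual plots
} else {
  message('Install the package first with install.packages("CCA")')
}
```

The CC vectors returned by different functions may differ from your hand calculation by a sign or a scaling constant, so compare the canonical correlations and the directions of the vectors rather than their raw entries.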
Task 2
The full Premier League dataset is available at https://www.rotowire.com/soccer/league-table.php?season=2018. There is a button to download the csv (comma-separated values) file in the bottom right-hand corner.
Read the data into R (hint: try the read.csv command).
If you are not sure of the path to the directory (YOURDIRECTORY) where the file is located, a useful command is file.choose(), which opens a file browser and returns the full path of the file you select.
Check that you can reproduce, and agree with, the calculations done in this chapter.
Consider now doing CCA with \(\mathbf x=(W,D)\) and \(\mathbf y=(G,GA,L)\). Note that if you knew \(W\) and \(D\), you could calculate \(L\). Without doing any computation, what do you expect the first canonical correlation to be? What will the first pair of CC vectors be (up to a multiplicative constant)?
Check your intuition by doing the calculation in R:
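If you do not have the downloaded file to hand, the same idea can be explored on simulated data. The sketch below invents results for 20 teams playing 38 games each, so that \(L\) is determined by \(W\) and \(D\) exactly as in the real table. All numbers are simulated for illustration and are not the Premier League data, and the goal counts G and GA are drawn independently of the results, which is unrealistic. It uses cancor from base R.

```r
set.seed(1)

n_teams <- 20
n_games <- 38

# Simulated win/draw/loss counts: each row sums to n_games,
# so L = n_games - W - D exactly
res <- t(rmultinom(n_teams, size = n_games, prob = c(0.40, 0.25, 0.35)))
W <- res[, 1]
D <- res[, 2]
L <- res[, 3]

# Simulated goals for and against (independent of results, for simplicity)
G  <- rpois(n_teams, 50)
GA <- rpois(n_teams, 50)

x <- cbind(W, D)
y <- cbind(G, GA, L)

fit <- cancor(x, y)

fit$cor[1]                        # first canonical correlation
fit$xcoef[, 1] / fit$xcoef[1, 1]  # first x CC vector, rescaled so the W entry is 1
fit$ycoef[, 1]                    # first y CC vector
```

Compare the printed values with what you predicted, and then repeat the calculation on the real data.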
Task 3
We will now look at data measured from 600 first-year university students. Measurements were made on three psychological variables:
- Locus of Control: the degree to which someone believes that they, as opposed to external forces, have control over the outcome of events in their lives.
- Self Concept: an indication of whether a person tends to hold a generally positive and consistent or negative and variable self-view.
- Motivation: how motivated an individual is.
which will form our \(\mathbf X\) variables. The \(\mathbf Y\) variables are four academic scores (standardized test scores)
- Reading
- Writing
- Math
- Science
and gender (1 = Male, 0 = Female). We are interested in how the set of psychological variables relates to the academic variables and gender.
mm <- read.csv("https://stats.idre.ucla.edu/stat/data/mmreg.csv")
colnames(mm) <- c("Control", "Concept", "Motivation", "Read", "Write", "Math",
                  "Science", "Sex")
summary(mm)
psych <- mm[, 1:3]
acad <- mm[, 4:8] # the four academic scores and Sex, as described above
Conduct CCA on these data. Provide an interpretation of your results.