5.4 Computer tasks

Task 1

Consider again the crabs dataset you looked at in the exercises in the chapter on PCA (see 4.4). We now consider a canonical correlation analysis in which one set of variables, the \(\mathbf x\)-set, is given by CL and CW and the other set, the \(\mathbf y\)-set, is given by FL, RW and BD.

library(MASS)
library(dplyr)   # provides the pipe %>% and select()
?crabs           # read the help page to find out about the dataset
X1 <- crabs %>% dplyr::select(CL, CW) %>% as.matrix()
Y1 <- crabs %>% dplyr::select(FL, RW, BD) %>% as.matrix()

Calculate \({\bf S}_{\bf x x}^{-1/2}\) and \({\bf S}_{\bf yy}^{-1/2}\) by first computing the spectral decompositions of \({\bf S}_{\bf x x}\) and \({\bf S}_{\bf yy}\).
Now calculate the matrix \(\mathbf Q\) and compute its singular value decomposition.
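
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the helper name inv_sqrt is made up here, \(\mathbf Q\) is assumed to be defined as \({\bf S}_{\bf x x}^{-1/2}{\bf S}_{\bf x y}{\bf S}_{\bf yy}^{-1/2}\) (check this against the definition in the notes), and cov() uses the \(n-1\) divisor, which does not affect the canonical correlations.

```r
library(MASS)  # provides the crabs data

X1 <- as.matrix(crabs[, c("CL", "CW")])
Y1 <- as.matrix(crabs[, c("FL", "RW", "BD")])

# Inverse symmetric square root of a positive-definite matrix,
# built from its spectral decomposition S = V diag(lambda) V^T.
# (inv_sqrt is a hypothetical helper name, not part of any package.)
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

Sxx <- cov(X1)
Syy <- cov(Y1)
Sxy <- cov(X1, Y1)

Sxx_inv_half <- inv_sqrt(Sxx)
Syy_inv_half <- inv_sqrt(Syy)

# Assumed definition of Q; see the notes for the exact form used there
Q  <- Sxx_inv_half %*% Sxy %*% Syy_inv_half
sv <- svd(Q)
sv$d   # singular values of Q
```

A quick sanity check is that Sxx_inv_half %*% Sxx %*% Sxx_inv_half returns the identity matrix up to rounding error.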

Compute the first pair of CC vectors and the corresponding CC variables \(\eta_1\) and \(\psi_1\). What is the first canonical correlation?
Plot \(\psi_1\) vs \(\eta_1\). What does the plot tell you (if anything)?
Repeat the above to find the second pair of CC vectors, and the second set of CC variables/scores, and plot these against each other and against the first CC scores. Is there any interesting structure in any of the plots? Which plots suggest random scatter?
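
One possible way to organise the computation of the CC vectors and scores is sketched below. It assumes that the CC vectors are \(\mathbf a_k = {\bf S}_{\bf x x}^{-1/2}\mathbf u_k\) and \(\mathbf b_k = {\bf S}_{\bf yy}^{-1/2}\mathbf v_k\), where \(\mathbf u_k\) and \(\mathbf v_k\) are the left and right singular vectors of \(\mathbf Q\), and that \(\eta_k\) is built from the \(\mathbf x\)-set and \(\psi_k\) from the \(\mathbf y\)-set. The helper inv_sqrt and the object names are illustrative only.

```r
library(MASS)

X1 <- as.matrix(crabs[, c("CL", "CW")])
Y1 <- as.matrix(crabs[, c("FL", "RW", "BD")])

# Hypothetical helper: inverse symmetric square root via eigen()
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

Sxx_ih <- inv_sqrt(cov(X1))
Syy_ih <- inv_sqrt(cov(Y1))
sv <- svd(Sxx_ih %*% cov(X1, Y1) %*% Syy_ih)   # SVD of the assumed Q

A <- Sxx_ih %*% sv$u   # columns are the CC vectors for the x-set
B <- Syy_ih %*% sv$v   # columns are the CC vectors for the y-set

# CC scores: centred data projected onto the CC vectors
eta <- scale(X1, center = TRUE, scale = FALSE) %*% A
psi <- scale(Y1, center = TRUE, scale = FALSE) %*% B

cor(eta[, 1], psi[, 1])   # should agree with sv$d[1]
plot(eta[, 1], psi[, 1], xlab = "eta_1", ylab = "psi_1")
```

The second pair of scores sits in the second columns of eta and psi, so the remaining plots only need a change of column index.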

Finally, repeat the analysis above using the cc and plt.cc commands from the CCA package, which you will need to install.

library(CCA)   # install.packages("CCA") if it is not already installed
cca <- cc(X1, Y1)
plt.cc(cca, var.label = TRUE)
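
As an independent cross-check that needs no extra package, base R's cancor() from the stats package also performs CCA. The sketch below is not part of the original task; the canonical correlations it reports should agree with those from the by-hand calculation and with the cor component of the cc output, although the coefficient vectors may be scaled differently.

```r
library(MASS)

X1 <- as.matrix(crabs[, c("CL", "CW")])
Y1 <- as.matrix(crabs[, c("FL", "RW", "BD")])

fit <- cancor(X1, Y1)   # base R canonical correlation analysis

fit$cor     # canonical correlations
fit$xcoef   # coefficients for the x-set (one column per CC vector)
fit$ycoef   # coefficients for the y-set
```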

Task 2

The data for previous Premier League seasons is available at:

[https://www.rotowire.com/soccer/league-table.php?season=2022](https://www.rotowire.com/soccer/league-table.php?season=2022)

There is a button to download the CSV (comma-separated values) file in the bottom right-hand corner. Read the data into R (hint: try the read.csv command).

# read.csv assumes comma-separated fields by default
table <- read.csv(file = "/YOURDIRECTORY/prem_league_data.txt", header = TRUE)

If you are not sure of the path to the directory containing the file, a useful command is file.choose(), which opens a file browser and returns the full path of the file you select.

Reproduce the analysis from the notes for the 2022-23 Premier League season.
Give an interpretation of the CC scores. One way of doing this is to think about the correlation between the original variables and the scores (the transformed variables). Note that there are four different correlation matrices we can look at to aid interpretation: the correlations between \(X\) and \(\eta\), \(X\) and \(\psi\), \(Y\) and \(\eta\), and \(Y\) and \(\psi\).
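
The four correlation matrices can be computed directly from the data and the scores. Because the league table has to be downloaded by hand, the sketch below uses the crabs data from Task 1 as a stand-in and base R's cancor(); the object names are illustrative. The object returned by cc() should hold the same kind of matrices in its scores component (see ?cc).

```r
library(MASS)

X <- as.matrix(crabs[, c("CL", "CW")])
Y <- as.matrix(crabs[, c("FL", "RW", "BD")])

fit <- cancor(X, Y)
k <- length(fit$cor)   # number of canonical pairs

# CC scores: centred data times the canonical coefficients
eta <- scale(X, center = fit$xcenter, scale = FALSE) %*% fit$xcoef[, 1:k]
psi <- scale(Y, center = fit$ycenter, scale = FALSE) %*% fit$ycoef[, 1:k]

cor_X_eta <- cor(X, eta)   # x-variables against their own scores
cor_X_psi <- cor(X, psi)   # x-variables against the y-set scores
cor_Y_eta <- cor(Y, eta)   # y-variables against the x-set scores
cor_Y_psi <- cor(Y, psi)   # y-variables against their own scores

round(cor_X_eta, 2)
round(cor_Y_psi, 2)
```

Variables with correlations of large magnitude against a given score are the ones that score mostly reflects, which is usually the starting point for an interpretation.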

Circle plots can also help. Look at the help page for plt.cc and try some circle plots.

Consider now doing CCA with \(\mathbf x=(W,D)\) and \(\mathbf y=(G,GA, L)\). Note that if you knew \(W\) and \(D\), you could calculate \(L\). Without doing any computation, what do you expect the first canonical correlation to be? What will the first pair of CC vectors be (up to a multiplicative constant)?
Check your intuition by doing the calculation in R:

X <- table[,c('W','D')] 
Y <- table[,c('G','GA','L')] 
cc(X,Y)
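
If the downloaded table is not to hand, the same effect can be seen on simulated data. The sketch below invents a 20-team, 38-game season (all numbers are simulated and purely illustrative) in which \(L = 38 - W - D\) holds exactly, and uses base R's cancor() in place of cc().

```r
set.seed(1)
n_teams <- 20
n_games <- 38

# Simulated win/draw/loss counts that sum to n_games for each team
res <- t(rmultinom(n_teams, size = n_games, prob = c(0.4, 0.25, 0.35)))
W <- res[, 1]
D <- res[, 2]
L <- res[, 3]            # equals n_games - W - D by construction

# Simulated goals for and against, unrelated to the results
G  <- rpois(n_teams, 50)
GA <- rpois(n_teams, 50)

X <- cbind(W, D)
Y <- cbind(G, GA, L)
fit <- cancor(X, Y)

fit$cor                              # first canonical correlation
fit$xcoef[, 1] / fit$xcoef[1, 1]     # first x-vector, rescaled
fit$ycoef[, 1] / fit$ycoef[3, 1]     # first y-vector, rescaled
```

Because \(W + D\) is an exact linear function of \(L\), the first canonical correlation is 1 up to rounding, with the first \(\mathbf x\)-vector proportional to \((1, 1)\) and the first \(\mathbf y\)-vector proportional to \((0, 0, 1)\).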

Task 3

We will now look at data measured on 600 first-year university students. Measurements were made on three psychological variables:

Locus of Control: the degree to which someone believes that they, as opposed to external forces, have control over the outcome of events in their lives.
Self Concept: an indication of whether a person tends to hold a generally positive and consistent or negative and variable self-view.
Motivation: how motivated an individual is.

which will form our \(\mathbf X\) variables. The \(\mathbf Y\) variables are four academic scores (standardized test scores)

Reading
Writing
Math
Science

and gender (1 = Male, 0 = Female). We are interested in how the set of psychological variables relates to the academic variables and gender.

mm <- read.csv("https://stats.idre.ucla.edu/stat/data/mmreg.csv")
colnames(mm) <- c("Control", "Concept", "Motivation",
                  "Read", "Write", "Math",
                  "Science", "Sex")
summary(mm)
psych <- mm[, 1:3]
acad <- mm[, 4:8]   # the four test scores plus Sex, as described above

Conduct CCA on these data. Provide an interpretation of your results.