3.5 Exercises

  1. A sales company surveyed \(50\) of its employees in order to determine the factors that influence sales performance. Two collections of variables were measured. The first set relates to sales performance:
  • Sales Growth
  • Sales Profitability
  • New Account Sales

The second set consists of test scores measuring intelligence:

  • Creativity
  • Mechanical Reasoning
  • Abstract Reasoning
  • Mathematics

You can download the data set sales.csv from Moodle. The following analysis is carried out in R.

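# Read the survey data (comma-separated, with a header row).
dat <- read.csv(file = 'sales2.csv', sep = ',', header = TRUE)
# X: sales performance variables; Y: all remaining (test score) variables.
X <- dat |> dplyr::select('growth', 'profit', 'new')
Y <- dat |> dplyr::select(-'growth', -'profit', -'new')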

library(CCA)
cc.out <- cc(X,Y)
print(cc.out$cor)
## [1] 0.9944827 0.8781065 0.3836057
print(cc.out$xcoef)
##               [,1]       [,2]       [,3]
## growth -0.06237788 -0.1740703  0.3771529
## profit -0.02092564  0.2421641 -0.1035150
## new    -0.07825817 -0.2382940 -0.3834151
print(cc.out$ycoef)
##               [,1]        [,2]        [,3]
## create -0.06974814 -0.19239132 -0.24655659
## mech   -0.03073830  0.20157438  0.14189528
## abs    -0.08956418 -0.49576326  0.28022405
## math   -0.06282997  0.06831607 -0.01133259
plt.cc(cc.out,var.label = TRUE,
       type='v')

The following output gives the correlations between the original variables and the canonical variables (the scores):

print(cc.out$scores$corr.X.xscores)
##              [,1]          [,2]         [,3]
## growth -0.9798776  0.0006477883  0.199598477
## profit -0.9464085  0.3228847489 -0.007504408
## new    -0.9518620 -0.1863009724 -0.243414776
print(cc.out$scores$corr.Y.yscores)
##              [,1]       [,2]        [,3]
## create -0.6383313 -0.2156981 -0.65140953
## mech   -0.7211626  0.2375644  0.06773775
## abs    -0.6472493 -0.5013329  0.57422365
## math   -0.9440859  0.1975329  0.09422619
print(cc.out$scores$corr.X.yscores)
##              [,1]          [,2]         [,3]
## growth -0.9744713  0.0005688272  0.076567107
## profit -0.9411869  0.2835272081 -0.002878734
## new    -0.9466102 -0.1635921013 -0.093375287
print(cc.out$scores$corr.Y.xscores)
##              [,1]       [,2]        [,3]
## create -0.6348095 -0.1894059 -0.24988439
## mech   -0.7171837  0.2086069  0.02598458
## abs    -0.6436782 -0.4402237  0.22027544
## math   -0.9388771  0.1734549  0.03614570
  • Describe the first pair of canonical variables, give their correlation, and provide an interpretation.

  • Describe the second pair of canonical variables, and provide an interpretation.
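The four correlation tables above are correlations between each original variable and the canonical scores. The following is a minimal sketch of how such tables can be computed. It uses base R's `cancor` rather than `cc`, and the built-in `iris` measurements as stand-in data, since the sales data are not reproduced here. The names `U`, `V`, `rho` and `m` are illustrative. The sign of each canonical vector is arbitrary, so whole columns may be flipped relative to output from `cc`.

```r
# Stand-in data (assumption): two sets of numeric variables from iris.
X <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
Y <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])

fit <- cancor(X, Y)          # base R canonical correlation analysis
rho <- fit$cor               # canonical correlations
m   <- length(rho)           # number of canonical pairs

# Canonical scores: linear combinations of the original variables.
U <- X %*% fit$xcoef[, seq_len(m), drop = FALSE]
V <- Y %*% fit$ycoef[, seq_len(m), drop = FALSE]

corr_X_xscores <- cor(X, U)  # analogue of cc.out$scores$corr.X.xscores
corr_X_yscores <- cor(X, V)  # analogue of cc.out$scores$corr.X.yscores

# Column j of corr_X_yscores is rho[j] times column j of corr_X_xscores.
gap <- sweep(corr_X_xscores, 2, rho, "*") - corr_X_yscores
print(corr_X_xscores)
print(max(abs(gap)))         # numerically zero
```

The same relation can be read off the sales output: for growth, \(-0.9798776 \times 0.9944827 \approx -0.9744713\). The cross-set tables therefore carry no information beyond the within-set tables and the canonical correlations.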

  2. Attempt exam question 1 part (b) from the 2017-18 exam paper.

  3. Suppose that \(\bz = (\bx^\top \by^\top)^\top\) is a random vector, where both \(\bx\) and \(\by\) are sub-vectors of dimension \(p\), so that \(\bz\) is \((2p)\times 1\). Define \[\var(\bz)=\bSigma_{\bz \bz}=\begin{pmatrix} \bSigma_{\bx \bx} & \bSigma_{\bx \by}\\\bSigma_{\by \bx} & \bSigma_{\by \by} \end{pmatrix}.\]

    1. Suppose that \(\by = \bT \bx\) where \(\bT\) is a fixed matrix. Find \(\bSigma_{\bx \by}\) and \(\bSigma_{\by \by}\) in terms of \(\bSigma_{\bx \bx}\) and \(\bT\).
    2. Assuming now that \(\bT\) is an orthogonal matrix and \(\bSigma_{\bx \bx}\) is of full rank, determine the singular values of the matrix \(\bQ=\bSigma_{\bx \bx}^{-1/2}\bSigma_{\bx \by}\bSigma_{\by \by}^{-1/2}\), and hence write down the canonical correlation coefficients.
    3. Suppose now that \(\bT\) is non-singular but not orthogonal. Comment on whether the answer to part 2 changes.
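A pen-and-paper answer to the second part can be sanity-checked numerically. The sketch below simulates data with \(\by = \bT\bx\) for a random orthogonal \(\bT\), builds the sample version of \(\bQ\), and prints its singular values. The helper `inv_sqrt`, the name `T_mat`, and the simulation settings are assumptions made for illustration. Replacing `T_mat` by a non-orthogonal invertible matrix gives a check for the third part.

```r
set.seed(1)
p <- 3
n <- 200

# Simulated x with a full-rank, non-diagonal sample covariance.
x <- matrix(rnorm(n * p), n, p) %*% matrix(rnorm(p * p), p, p)

# A random orthogonal matrix, taken from a QR decomposition.
T_mat <- qr.Q(qr(matrix(rnorm(p * p), p, p)))
y <- x %*% t(T_mat)          # each row satisfies y_i = T x_i

# Symmetric inverse square root of a positive-definite matrix.
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = nrow(S)) %*% t(e$vectors)
}

# Sample analogue of Q = Sigma_xx^{-1/2} Sigma_xy Sigma_yy^{-1/2}.
Q <- inv_sqrt(cov(x)) %*% cov(x, y) %*% inv_sqrt(cov(y))
sing_vals <- svd(Q)$d
print(sing_vals)             # compare with your derived singular values
```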
  4. We will now prove Proposition ?? by induction. The case \(k=1\) was proved in Section 3.1 in Proposition ??. Assume the result is true for \(k\). Consider the objective \[\mathcal{L} = \ba^\top \bQ \bb + \sum_{i=1}^k \gamma_i\ba^\top \ba_i + \sum_{i=1}^k \mu_i\bb^\top \bb_i + \frac{\lambda_1}{2}(1-\ba^\top\ba)+ \frac{\lambda_2}{2}(1-\bb^\top\bb)\] where \(\lambda_1\), \(\lambda_2\), \(\gamma_i\) and \(\mu_i\) are Lagrange multipliers.

    1. By differentiating with respect to \(\ba\) and \(\bb\) and setting the derivatives to zero, show that \[\begin{align} \bQ\bb + \sum_{i=1}^k\gamma_i \ba_i - \lambda_1 \ba &= 0 \tag{3.1}\\ \bQ^\top\ba + \sum_{i=1}^k\mu_i \bb_i - \lambda_2 \bb &= 0. \tag{3.2} \end{align}\]

    2. By left multiplying (3.1) by \(\ba^\top\) and (3.2) by \(\bb^\top\), show that \[\lambda_1=\lambda_2 = \ba^\top \bQ \bb.\]

    3. By left multiplying (3.1) by \(\ba_i^\top\) show that \(\gamma_i=0\) for \(i=1, \ldots, k\). Show similarly that \(\mu_i =0\) for \(i=1, \ldots, k\).

    4. Finally, by copying the proof of Proposition ??, prove Proposition ??.
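The stationarity conditions (3.1) and (3.2) can be illustrated numerically before attempting the proof. The sketch below is an illustration, not a proof: it takes an arbitrary matrix as a stand-in for \(\bQ\), and checks that a pair of its singular vectors satisfies both equations with \(\gamma_i = \mu_i = 0\) and \(\lambda_1 = \lambda_2\) equal to the corresponding singular value. The matrix size and the choice of the second pair (`j <- 2`) are assumptions.

```r
set.seed(2)
Q <- matrix(rnorm(4 * 3), 4, 3)   # arbitrary stand-in for Q
s <- svd(Q)

j <- 2                            # which singular pair to check
a <- s$u[, j]
b <- s$v[, j]
sigma <- s$d[j]

# Residuals of (3.1) and (3.2) with gamma_i = mu_i = 0, lambda_1 = lambda_2 = sigma.
res1 <- Q %*% b - sigma * a
res2 <- t(Q) %*% a - sigma * b
print(c(max(abs(res1)), max(abs(res2))))

# The constraints: unit length, and orthogonality to the first pair.
print(c(sum(a^2), sum(b^2), sum(a * s$u[, 1]), sum(b * s$v[, 1])))

# The multiplier equals the objective value a' Q b.
print(c(drop(t(a) %*% Q %*% b), sigma))
```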

  5. Show that the mean of each of the canonical correlation (CC) variables \(\eta_k\) and \(\psi_k\) is zero. Prove Proposition ??, which gives the variances and covariances of the CC variables.
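The sample analogues of these mean and covariance properties can be inspected numerically. The sketch below uses base R's `cancor` on the built-in `LifeCycleSavings` data as a stand-in, with the variables centred at their sample means. The names `eta` and `psi` follow the exercise, but which set of variables each belongs to is an assumption here. Correlations are printed rather than covariances because `cancor` scales its coefficients differently from a unit-variance convention.

```r
# Stand-in data (assumption): two sets of variables from LifeCycleSavings.
X <- as.matrix(LifeCycleSavings[, c("pop15", "pop75")])
Y <- as.matrix(LifeCycleSavings[, c("sr", "dpi", "ddpi")])

fit <- cancor(X, Y)
m <- length(fit$cor)         # number of canonical pairs

# Centre each variable, then form the canonical variables.
Xc <- scale(X, center = TRUE, scale = FALSE)
Yc <- scale(Y, center = TRUE, scale = FALSE)
eta <- Xc %*% fit$xcoef[, seq_len(m), drop = FALSE]
psi <- Yc %*% fit$ycoef[, seq_len(m), drop = FALSE]

print(colMeans(eta))         # sample means of the eta_k
print(colMeans(psi))         # sample means of the psi_k
print(cor(eta))              # correlations among the eta_k
print(cor(eta, psi))         # compare the diagonal with fit$cor
print(fit$cor)
```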

  6. Attempt exam question 1 part (b) from the 2020-21 exam paper.

  7. The mtcars dataset in R contains data on 32 cars from the 1970s. We will split the data into two datasets. Let \(\bX\) be the matrix that contains variables pertaining to car characteristics:
  • cyl Number of cylinders
  • disp Displacement (cu.in.)
  • hp Gross horsepower
  • drat Rear axle ratio
  • wt Weight (1000 lbs)

and let \(\bY\) be the matrix containing variables pertaining to car performance:

  • mpg Miles/(US) gallon
  • qsec 1/4 mile time
X <- mtcars |> dplyr::select(cyl, disp, hp, drat, wt)
Y <- mtcars |> dplyr::select(mpg, qsec)

We can then carry out CCA on these two datasets:

library(CCA)
cc.out<-cc(X,Y)
print(cc.out$cor)
## [1] 0.9270377 0.8307044
print(cc.out$xcoef)
##              [,1]         [,2]
## cyl  -0.277988315  0.373120734
## disp  0.001628855  0.003177998
## hp   -0.005964078  0.007770195
## drat  0.015934959  0.791032730
## wt   -0.388097090 -1.423595709
print(cc.out$ycoef)
##           [,1]       [,2]
## mpg  0.1451984  0.1109009
## qsec 0.1346424 -0.6013371
print(cc.out$scores$corr.X.xscores)
##            [,1]        [,2]
## cyl  -0.9578700  0.07914027
## disp -0.9126296 -0.12094085
## hp   -0.9164944  0.29160643
## drat  0.6666822  0.43010103
## wt   -0.8643963 -0.47212540
print(cc.out$scores$corr.Y.xscores)
##           [,1]       [,2]
## mpg  0.9046387  0.1815048
## qsec 0.5627028 -0.6601685
print(cc.out$scores$corr.X.yscores)
##            [,1]        [,2]
## cyl  -0.8879817  0.06574217
## disp -0.8460421 -0.10046609
## hp   -0.8496249  0.24223874
## drat  0.6180395  0.35728682
## wt   -0.8013280 -0.39219665
print(cc.out$scores$corr.Y.yscores)
##           [,1]       [,2]
## mpg  0.9758380  0.2184951
## qsec 0.6069902 -0.7947093
plt.cc(cc.out, var.label = TRUE, type='v')

Interpret the output of this canonical correlation analysis.
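As a starting point for the interpretation, it can help to see where the canonical correlations come from. The sketch below recomputes them for the same split of `mtcars` in two ways using base R only: as the singular values of the sample version of \(\bQ=\bSigma_{\bx \bx}^{-1/2}\bSigma_{\bx \by}\bSigma_{\by \by}^{-1/2}\), and with `cancor`. The helper `inv_sqrt` and the names `rho_svd` and `rho_cancor` are illustrative assumptions. Both results should agree with each other and can be compared with `cc.out$cor` above.

```r
X <- as.matrix(mtcars[, c("cyl", "disp", "hp", "drat", "wt")])
Y <- as.matrix(mtcars[, c("mpg", "qsec")])

# Symmetric inverse square root of a positive-definite matrix.
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = nrow(S)) %*% t(e$vectors)
}

# Route 1: singular values of the sample Q matrix (5 x 2 here).
Q <- inv_sqrt(cov(X)) %*% cov(X, Y) %*% inv_sqrt(cov(Y))
rho_svd <- svd(Q)$d

# Route 2: base R canonical correlation analysis.
rho_cancor <- cancor(X, Y)$cor

print(rho_svd)
print(rho_cancor)
```

There are only two canonical pairs because the smaller set, \(\bY\), has two variables.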