10.2 Principal component regression (PCR)

In principal component regression, instead of regressing \(\mathbf y\) onto \(\mathbf X\) (or equivalently projecting \(\mathbf y\) onto the columns of \(\mathbf X\)), we regress \(\mathbf y\) onto the principal component scores of \(\mathbf X\). If the singular value decomposition (SVD) of \(\mathbf X\) is \(\mathbf X= \mathbf U\boldsymbol{\Sigma}\mathbf V^\top\) (recall that \(\mathbf X\) is column centred), then we’ve seen that \[\hat{\boldsymbol \beta}= \mathbf V\boldsymbol{\Sigma}^{-1}\mathbf U^\top \mathbf y.\] Note that this does not rely upon \(\mathbf X^\top\mathbf X\) being invertible. If \(\mathbf X\) has rank \(r\), then it has \(r\) non-zero singular values, \(\mathbf V\) is \(p\times r\), and \(\mathbf U\) is \(n \times r\). In principal component regression we take this idea further.
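As a quick numerical check of this formula, the sketch below computes \(\hat{\boldsymbol \beta}\) directly from the SVD in R, using the iris covariates that appear later in this section; the object names (X, y, s, beta.hat) are purely illustrative.

X <- scale(as.matrix(iris[,2:4]), center=TRUE, scale=FALSE) # column-centred covariates
y <- iris$Sepal.Length
s <- svd(X)                                      # X = U diag(d) V'
beta.hat <- s$v %*% diag(1/s$d) %*% t(s$u) %*% y # V Sigma^{-1} U' y
# as X has full column rank here, beta.hat matches the OLS slopes coef(lm(y ~ X))[-1]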

We start by projecting the covariates matrix \(\mathbf X\) onto the first \(k\) principal components of \(\mathbf X\), giving us the first \(k\) principal component scores of \(\mathbf X\): \[\mathbf Z= \mathbf X\mathbf V_k= \mathbf U_k \boldsymbol{\Sigma}_k\] where \(\mathbf U_k=\begin{pmatrix}\mathbf u_1&\ldots & \mathbf u_k\end{pmatrix} \in \mathbb{R}^{n \times k}\) is the matrix formed from the first \(k\) columns of \(\mathbf U\), \(\mathbf V_k\in \mathbb{R}^{p \times k}\) is the matrix formed from the first \(k\) columns of \(\mathbf V\), and \(\boldsymbol{\Sigma}_k\in \mathbb{R}^{k \times k}\) is the diagonal matrix containing the first \(k\) singular values (cf. Section 3.5).
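A short sketch of this identity, under the same illustrative setup as above with \(k=2\):

X <- scale(as.matrix(iris[,2:4]), center=TRUE, scale=FALSE)
s <- svd(X)
k <- 2
Z.XV <- X %*% s$v[,1:k]               # X V_k
Z.US <- s$u[,1:k] %*% diag(s$d[1:k])  # U_k Sigma_k
# the two versions agree: max(abs(Z.XV - Z.US)) is numerically zero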

In PCR, we then use the linear model \[\mathbf y= \mathbf Z\boldsymbol \alpha+\boldsymbol \epsilon.\] Estimating \(\boldsymbol \alpha\) by least squares gives \[\begin{align*} \hat{\boldsymbol \alpha} &= (\mathbf Z^\top \mathbf Z)^{-1}\mathbf Z^\top \mathbf y\\ &= (\boldsymbol{\Sigma}_k \mathbf U_k^\top \mathbf U_k\boldsymbol{\Sigma}_k)^{-1}\boldsymbol{\Sigma}_k \mathbf U_k^\top \mathbf y\\ &=\boldsymbol{\Sigma}_k^{-1}\mathbf U_k^\top \mathbf y \end{align*}\] as \(\mathbf U_k^\top\mathbf U_k =\mathbf I_k\). The corresponding fitted values will be

\[\begin{align} \hat{\mathbf y}^{PCR} &= \mathbf Z\hat{\boldsymbol \alpha} \\ &= \mathbf U_k \mathbf U_k^\top \mathbf y\\ &=\sum_{j=1}^k \mathbf u_j \mathbf u_j^\top \mathbf y. \end{align}\]
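The sketch below evaluates \(\hat{\boldsymbol \alpha}\) and the fitted values directly from the SVD, under the same illustrative setup as before; here the response is also centred, to match the no-intercept model above.

X <- scale(as.matrix(iris[,2:4]), center=TRUE, scale=FALSE)
y <- iris$Sepal.Length - mean(iris$Sepal.Length) # centred response
s <- svd(X)
k <- 2
Uk <- s$u[,1:k]
alpha.hat <- diag(1/s$d[1:k]) %*% t(Uk) %*% y    # Sigma_k^{-1} U_k' y
yhat.pcr <- Uk %*% t(Uk) %*% y                   # U_k U_k' y, equal to Z %*% alpha.hat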

Comparing with Equation (10.4), we can see this is the same as before, except that we have truncated the sum so that we project only onto the first \(k\) principal components. If we let \(k=r = \operatorname{rank}(\mathbf X)\), then PCR is the same as OLS.
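As a sketch of this last point (same illustrative setup), taking \(k\) equal to the numerical rank of \(\mathbf X\) reproduces the OLS fitted values:

X <- scale(as.matrix(iris[,2:4]), center=TRUE, scale=FALSE)
y <- iris$Sepal.Length - mean(iris$Sepal.Length)
s <- svd(X)
r <- sum(s$d > 1e-10)                            # numerical rank of X (3 here)
yhat.pcr <- s$u[,1:r] %*% t(s$u[,1:r]) %*% y
yhat.ols <- fitted(lm(y ~ X))
# max(abs(yhat.pcr - yhat.ols)) is numerically zero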

To convert from the PCR coefficients \(\boldsymbol \alpha\) back to coefficients of the original \(\mathbf x\) variables, we multiply by the right singular vectors: \[\hat{\boldsymbol \beta}^{pcr} = \mathbf V_k \hat{\boldsymbol \alpha} = \mathbf V_k \boldsymbol{\Sigma}_k^{-1} \mathbf U_k^\top \mathbf y\] as \(\mathbf Z\boldsymbol \alpha= \mathbf X\mathbf V_k \boldsymbol \alpha\).
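A sketch of this back-transformation using the SVD directly (same illustrative setup):

X <- scale(as.matrix(iris[,2:4]), center=TRUE, scale=FALSE)
y <- iris$Sepal.Length - mean(iris$Sepal.Length)
s <- svd(X)
k <- 2
beta.pcr <- s$v[,1:k] %*% diag(1/s$d[1:k]) %*% t(s$u[,1:k]) %*% y # V_k Sigma_k^{-1} U_k' y
# X %*% beta.pcr reproduces the PCR fitted values U_k U_k' y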

We choose the number of principal components to use in the regression, \(k\), by assessing some form of predictive accuracy of the resulting model.

There are several motivations for using PCR. The first is that it can always be used, regardless of whether \(\mathbf X^\top\mathbf X\) is invertible. The second is that in many problems with large \(p\) it often has superior predictive performance to OLS, as it reduces the variance of the predictions.

10.2.1 PCR in R

PCR is easy to implement yourself in R. For example, using the iris regression problem from the previous section, we can do PCR using just the first 2 principal components as follows:

iris.pca <- prcomp(iris[,2:4],scale=TRUE) # PCA of the three covariates (centred and scaled)
Z = iris.pca$x[,1:2] # select the first two PCs
iris.lm <- lm(iris$Sepal.Length~Z) # regress the response on the first two PC scores
iris.lm
## 
## Call:
## lm(formula = iris$Sepal.Length ~ Z)
## 
## Coefficients:
## (Intercept)         ZPC1         ZPC2  
##      5.8433       0.4230      -0.4348

Note that we didn’t centre the response here, and so we have estimated a non-zero intercept as well as the coefficients of \(\mathbf x\).
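As a sketch, if we also centre the response we can drop the intercept and obtain the same PC coefficients (this reuses the Z matrix defined above):

yc <- iris$Sepal.Length - mean(iris$Sepal.Length) # centred response
iris.lm0 <- lm(yc ~ Z - 1)                        # no intercept term
coef(iris.lm0)                                    # same PC coefficients as before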

To convert from PCR coefficients to coefficients of \(\mathbf x\) we can multiply by \(\mathbf V_k\):

iris.pca$rotation[,1:2] %*% coef(iris.lm)[-1]
##                   [,1]
## Sepal.Width  0.2173767
## Petal.Length 0.3853085
## Petal.Width  0.4150270

In practice we do not need to do PCR ourselves in this way, as it has been implemented in the pls R package.

library(pls)
iris.pcr <- pcr(Sepal.Length~ Sepal.Width+Petal.Length+
                  Petal.Width, 2, # number of components,
                scale=TRUE, data=iris)
coef(iris.pcr)
## , , 2 comps
## 
##              Sepal.Length
## Sepal.Width     0.2173767
## Petal.Length    0.3853085
## Petal.Width     0.4150270

Note that the coefficients found match those we computed ourselves.
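As a quick check, reusing the objects created above, the two sets of coefficients can be compared directly; given the printed output they should agree to numerical precision.

all.equal(as.numeric(coef(iris.pcr)),
          as.numeric(iris.pca$rotation[,1:2] %*% coef(iris.lm)[-1]))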

To choose the number of components to retain, we can use cross-validation. This is a generalization of the idea of splitting the data into training and test sets. The pls package reports the cross-validated estimate of the root mean squared error of prediction (RMSEP) for each possible value of \(k\), up to the maximum specified by the user (taken to be the rank of \(\mathbf X\) if we leave it unspecified).

iris.pcr2 <- pcr(Sepal.Length~ Sepal.Width+Petal.Length+
                   Petal.Width, 3, # number of components,
                scale=TRUE, data=iris, validation='CV')
summary(iris.pcr2)
## Data:    X dimension: 150 3 
##  Y dimension: 150 1
## Fit method: svdpc
## Number of components considered: 3
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps
## CV          0.8308   0.5469   0.3906   0.3219
## adjCV       0.8308   0.5461   0.3902   0.3213
## 
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps
## X               74.05    98.84   100.00
## Sepal.Length    57.97    78.47    85.86
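To work with these error estimates programmatically, the RMSEP() accessor from pls returns the cross-validated errors, and a sketch along the following lines could be used to plot them and pick the value of \(k\) minimising the CV error (here 3 components, which with only three covariates is just OLS).

rmsep.cv <- RMSEP(iris.pcr2, estimate='CV')      # cross-validated RMSEP values
plot(rmsep.cv)                                   # error against number of components
which.min(rmsep.cv$val['CV',1,]) - 1             # subtract 1 for the 0-component model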