6.6 Exercises

  1. Read this set of tweets and revisit Exercises 4 and 5 from Chapter 3. Can you see the link between those exercises and their consequences for linear models?

  2. Show that the ridge regression estimator of \(\bbeta\) is \[\hat{\bbeta}^{ridge} = (\bX^\top \bX + \lambda\bI_p)^{-1}\bX^\top \by.\] Prove that the inverse exists if \(\lambda > 0\).

    Finally, prove that the estimator can be rewritten as \[\hat{\bbeta}^{ridge} = \bX^\top(\bX \bX^\top + \lambda\bI_n)^{-1} \by.\] When might this be useful? (A numerical check of this identity is sketched after the exercise list.)

  3. Let \[\hat{\bbeta}_\lambda=\arg \min_{\bbeta} ||\by-\bX\bbeta||_2^2+\lambda ||\bbeta||.\] Prove that \[||\hat{\bbeta}_\lambda|| \leq ||\hat{\bbeta}^{ols}||.\] Note that we can prove this for a general norm \(||\cdot||\), not just the \(L_2\) norm. (A numerical illustration of this shrinkage is sketched after the exercise list.)
  4. For the normal linear model \[\by = \bX \bbeta + N(\bzero, \sigma^2 \bI)\] show that \[\begin{align} \var(\hat{\bbeta}^{ols}) &= \sigma^2 (\bX^\top \bX)^{-1}\\ &= \sigma^2\sum_{i=1}^p \frac{\bv_i \bv_i^\top}{\lambda_i} \end{align}\] where \(\lambda_1 \geq \cdots \geq \lambda_p > 0\) are the eigenvalues of \(\bX^\top \bX\), and \(\bv_i\) the corresponding unit eigenvectors.

    If \(\hat{\bbeta}_k\) is the PCR estimator based on the first \(k\) principal components, show that \[\var(\hat{\bbeta}_k) = \sigma^2 \sum_{i=1}^k \frac{\bv_i \bv_i^\top}{\lambda_i}.\]

    Thus, show that \[\bA_k = \var(\hat{\bbeta}^{ols})-\var(\hat{\bbeta}_k)\] is a positive semi-definite matrix. Consequently, show that any linear combination of \(\hat{\bbeta}_k\), e.g., a prediction \(\bx^\top \hat{\bbeta}_k\), has variance no greater than the same linear combination of the OLS estimator, i.e., \[\var(\bx^\top \hat{\bbeta}_k)\leq \var(\bx^\top \hat{\bbeta}^{ols}).\] (A numerical check of these claims is sketched after the exercise list.)

  5. Attempt Q3c) from the 2019 paper (NB: it is a hard question).
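
The algebra in some of these exercises can be sanity-checked numerically. The sketches below assume Python with NumPy; all variable names (`X`, `y`, `lam`, and so on) are our own illustrative choices, and the checks are of course not a substitute for the proofs. The first sketch, for Exercise 2, compares the \(p \times p\) and \(n \times n\) forms of the ridge estimator on a random design with \(p > n\), which is exactly the setting in which the second form is cheaper to compute.

```python
# Sketch: check that (X^T X + lambda I_p)^{-1} X^T y equals
# X^T (X X^T + lambda I_n)^{-1} y on random data (names are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 50, 0.5                     # p > n: X^T X is singular, but ridge is still defined
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# p x p form of the ridge estimator
beta_p = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# n x n form of the ridge estimator
beta_n = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

print(np.allclose(beta_p, beta_n))          # True: the two forms agree
```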
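
For Exercise 3, the sketch below illustrates the shrinkage claim for the special case of the squared \(L_2\) penalty (the ridge estimator from Exercise 2): as \(\lambda\) grows, \(||\hat{\bbeta}_\lambda||_2\) shrinks and never exceeds \(||\hat{\bbeta}^{ols}||_2\).

```python
# Sketch: for the ridge (squared L2) penalty, the coefficient norm shrinks
# as lambda grows and never exceeds the OLS norm (names are illustrative).
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ np.arange(1.0, p + 1) + rng.standard_normal(n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
for lam in [0.0, 1.0, 10.0, 100.0]:
    beta_lam = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(lam, np.linalg.norm(beta_lam), np.linalg.norm(beta_ols))
    assert np.linalg.norm(beta_lam) <= np.linalg.norm(beta_ols) + 1e-10
```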
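
Finally, for Exercise 4, the sketch below verifies that the eigendecomposition form of \(\var(\hat{\bbeta}^{ols})\) matches \(\sigma^2(\bX^\top \bX)^{-1}\), and that \(\bA_k\) has no negative eigenvalues, i.e., is positive semi-definite.

```python
# Sketch: check the eigendecomposition of Var(beta_ols) and that
# A_k = Var(beta_ols) - Var(beta_k) is positive semi-definite
# (names are illustrative).
import numpy as np

rng = np.random.default_rng(2)
n, p, k, sigma2 = 200, 6, 3, 2.0
X = rng.standard_normal((n, p))

eigval, V = np.linalg.eigh(X.T @ X)            # ascending eigenvalues, unit eigenvectors in columns
order = np.argsort(eigval)[::-1]               # reorder so lambda_1 >= ... >= lambda_p
eigval, V = eigval[order], V[:, order]

var_ols = sigma2 * np.linalg.inv(X.T @ X)
var_ols_eig = sigma2 * sum(np.outer(V[:, i], V[:, i]) / eigval[i] for i in range(p))
print(np.allclose(var_ols, var_ols_eig))       # True: the two expressions for Var(beta_ols) agree

var_k = sigma2 * sum(np.outer(V[:, i], V[:, i]) / eigval[i] for i in range(k))
A_k = var_ols - var_k
print(np.linalg.eigvalsh(A_k).min() >= -1e-8)  # True: A_k is positive semi-definite
```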