10.1 Ordinary least squares (OLS)

This section contains a brief review of the ordinary least squares (OLS) approach to fitting the linear model to data. Most of you will have studied this material in other modules.

Our model is \[ \mathbf y=\mathbf X\boldsymbol \beta+{\pmb \epsilon}. \]

In OLS we make three assumptions about the error term \(\boldsymbol \epsilon\) (a simulated illustration follows the list):

  1. \({\mathbb{E}}[\epsilon_i]=0\) for \(i=1, \ldots , n\).
  2. The \(\epsilon_i\) are uncorrelated, i.e. \({\mathbb{C}\operatorname{ov}}(\epsilon_i, \epsilon_j)=0\) for \(i \neq j\).
  3. The \(\epsilon_i\) have constant variance (i.e., are homoscedastic): \({\mathbb{V}\operatorname{ar}}(\epsilon_i)=\sigma^2\) for all \(i\).
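
For concreteness, here is a minimal simulation from the model that satisfies assumptions 1–3. The sample size, coefficients and error standard deviation are made-up values, and rnorm() is used purely for convenience (normality is stronger than these assumptions require).

set.seed(42)
n <- 50
X <- cbind(1, rnorm(n))                # design matrix: intercept plus one covariate
beta <- c(1, 2)                        # made-up coefficients
eps <- rnorm(n, mean = 0, sd = 1.5)    # mean-zero, uncorrelated, constant-variance errors
y <- X %*% beta + eps                  # responses generated from the linear model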

If these assumptions are reasonable, then a common way to estimate \(\boldsymbol \beta\) is to choose it to minimize the sum of squared errors, i.e.,

\[\begin{align*} {\hat{\boldsymbol \beta}}^{ols} &= \arg \min_{\boldsymbol \beta} || \mathbf y- \mathbf X\boldsymbol \beta||_2^2\\ &= \arg \min_{\boldsymbol \beta} \sum_{i=1}^n (y_i - \mathbf x_i^\top \boldsymbol \beta)^2 \end{align*}\]

If we note that \[|| \mathbf y- \mathbf X\boldsymbol \beta||_2^2 = (\mathbf y-\mathbf X\boldsymbol \beta)^\top(\mathbf y-\mathbf X\boldsymbol \beta) = \mathbf y^\top \mathbf y- 2\boldsymbol \beta^\top \mathbf X^\top \mathbf y+ \boldsymbol \beta^\top \mathbf X^\top \mathbf X\boldsymbol \beta, \] then differentiating with respect to \(\boldsymbol \beta\) gives \[\frac{\mathrm d}{\mathrm d\boldsymbol \beta} || \mathbf y- \mathbf X\boldsymbol \beta||_2^2 = -2\mathbf X^\top \mathbf y+ 2\mathbf X^\top \mathbf X\boldsymbol \beta = -2\mathbf X^\top (\mathbf y- \mathbf X\boldsymbol \beta).\] You may need to look again at 2.1.4 to remind yourself about vector differentiation. Setting the derivative equal to \(\boldsymbol 0\) and solving gives the least squares estimator

\[\begin{equation} \hat{\boldsymbol \beta}^{ols}=\left (\mathbf X^\top \mathbf X\right )^{-1}\mathbf X^\top \mathbf y \tag{10.3} \end{equation}\] when \(\mathbf X^\top \mathbf X\) is invertible (we’ll discuss what to do if it is not in the next section). The fitted values are \[\hat{\mathbf y}= \mathbf X\hat{\boldsymbol \beta}^{ols}=\mathbf X(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y.\]
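
As a quick numerical sketch, equation (10.3) can be evaluated directly in R and compared with the coefficients returned by lm(); here we borrow the iris data that is used in Section 10.1.3 below.

X <- model.matrix(~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
y <- iris$Sepal.Length
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y    # (X'X)^{-1} X'y, as in (10.3)
cbind(beta_hat,
      coef(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)))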

10.1.1 Geometry

Does this make sense intuitively? As discussed in Chapter 2.3.3.2, we know that \(\mathbf X\boldsymbol \beta\) is a linear combination of the columns of \(\mathbf X\) and is thus in \(\mathcal{C}(\mathbf X),\) the column space of \(\mathbf X\). Thus minimizing \(||\mathbf y- \mathbf X\boldsymbol \beta||_2^2\) is equivalent to finding the orthogonal projection of \(\mathbf y\) onto \(\mathcal{C}(\mathbf X)\).

\[\mathbf P_{\mathcal{C}(X)}\mathbf y=\arg \min_{\mathbf y': \mathbf y' \in \mathcal{C}(X)} ||\mathbf y-\mathbf y'||_2,\] and by Proposition 2.5 this is \[\mathbf P_{\mathcal{C}(X)}\mathbf y= \mathbf X(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y=\hat{\mathbf y},\] which again implies
\[\hat{\boldsymbol \beta}=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y\] in agreement with what we found in the previous section. So we can think of linear regression as projecting \(\mathbf y\) onto the columns of \(\mathbf X\).
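
A small numerical illustration of this projection view (again borrowing the iris data from Section 10.1.3): the residual \(\mathbf y- \hat{\mathbf y}\) is orthogonal to every column of \(\mathbf X\).

X <- model.matrix(~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
y <- iris$Sepal.Length
H <- X %*% solve(t(X) %*% X) %*% t(X)   # orthogonal projection onto C(X)
y_hat <- H %*% y                        # fitted values = projection of y onto C(X)
crossprod(X, y - y_hat)                 # numerically zero: the residual is orthogonal to C(X)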

It’s useful here to recall the exercise we looked at in Chapter 3 on solving linear systems of equations. If the singular value decomposition (SVD) of \(\mathbf X\) is \(\mathbf X= \mathbf U\boldsymbol{\Sigma}\mathbf V^\top\), then we can see that

\[\begin{align*} \hat{\boldsymbol \beta}&=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y\\ &=\mathbf V\boldsymbol{\Sigma}^{-1}\mathbf U^\top \mathbf y\\ &= \mathbf X^{+}\mathbf y \end{align*}\] i.e., our estimate of \(\boldsymbol \beta\) is the generalized inverse of \(\mathbf X\) formed by the SVD, \(\mathbf X^{+}=\mathbf V\boldsymbol{\Sigma}^{-1}\mathbf U^\top\), times \(\mathbf y\).

The fitted values are \[\begin{align} \hat{\mathbf y}^{ols} &= \mathbf X\hat{\boldsymbol \beta}\\ &=\mathbf U\mathbf U^\top \mathbf y\\ &=\sum_{j=1}^p \mathbf u_j \mathbf u_j^\top \mathbf y.\tag{10.4} \end{align}\] \(\mathbf U^\top \mathbf y\) are the coordinates of \(\mathbf y\) with respect to the orthonormal basis \(\mathbf U\), and \(\mathbf U\mathbf U^\top \mathbf y\) is the projection of \(\mathbf y\) onto that basis. So yet another way to think about OLS is that it projects \(\mathbf y\) onto the left singular vectors \(\mathbf U\). As we said in Chapter 3, the left singular vectors form an orthonormal basis for the column space of \(\mathbf X\), and so projecting onto \(\mathbf U\) is the same as projecting onto \(\mathcal{C}(\mathbf X)\).
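
As a sketch, both quantities can be computed from the SVD in R (iris data again); svd() returns \(\mathbf U\), the singular values, and \(\mathbf V\).

X <- model.matrix(~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
y <- iris$Sepal.Length
sv <- svd(X)                                   # X = U Sigma V'
beta_hat <- sv$v %*% ((t(sv$u) %*% y) / sv$d)  # V Sigma^{-1} U' y = X^+ y
y_hat <- sv$u %*% (t(sv$u) %*% y)              # U U' y, as in (10.4)
range(y_hat - X %*% beta_hat)                  # the two expressions agree (numerically zero)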

10.1.2 Normal linear model

Often, we make an additional (fourth) assumption about the random errors, namely that they are normally distributed:

  4. The \(\epsilon_i\) are IID \(N(0, \sigma^2)\).

The log-likelihood for models (10.1) and (10.2) under Gaussian assumption 4 is \[\begin{align*} \ell(\boldsymbol \beta, \sigma^2)&=-\frac{n}{2}\log (\sigma^2)-\frac{n}{2}\log(2\pi)-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i-\mathbf x_i^\top \boldsymbol \beta)^2\\ & = -\frac{n}{2}\log (\sigma^2)-\frac{n}{2}\log(2\pi)-\frac{1}{2\sigma^2} (\mathbf y- \mathbf X\boldsymbol \beta)^\top (\mathbf y- \mathbf X\boldsymbol \beta). \end{align*}\] We can see that maximizing the log-likelihood with respect to \(\boldsymbol \beta\) is equivalent to minimizing \(||\mathbf y- \mathbf X\boldsymbol \beta||_2^2=(\mathbf y- \mathbf X\boldsymbol \beta)^\top (\mathbf y- \mathbf X\boldsymbol \beta)\), and thus \(\hat{\boldsymbol \beta}^{ols}\) is also the maximum likelihood estimator of \(\boldsymbol \beta\) when the errors have a Gaussian distribution.
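
As a numerical sketch of this equivalence (again using the iris data from Section 10.1.3), we can minimize the negative log-likelihood with optim() and check that the estimate of \(\boldsymbol \beta\) is essentially the OLS solution.

X <- model.matrix(~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
y <- iris$Sepal.Length
negloglik <- function(par) {
  beta   <- par[1:ncol(X)]
  sigma2 <- exp(par[ncol(X) + 1])   # optimize over log(sigma^2) so the variance stays positive
  0.5 * length(y) * log(2 * pi * sigma2) + 0.5 * sum((y - X %*% beta)^2) / sigma2
}
fit_ml <- optim(rep(0, ncol(X) + 1), negloglik, method = "BFGS")
cbind(mle = fit_ml$par[1:ncol(X)], ols = solve(crossprod(X), crossprod(X, y)))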

10.1.3 Linear models in R

Linear models are easy to fit in R. For example, using the iris data, we can train a model to predict the sepal length from the other measurements as follows:
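
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
fit    # printing the fitted model shows the call and the estimated coefficients (the object name fit is arbitrary)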

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
##     data = iris)
## 
## Coefficients:
##  (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
##       1.8560        0.6508        0.7091       -0.5565

10.1.4 Problems with OLS

Several problems can be encountered when fitting linear models by OLS to large datasets, two of which are

  • \(\mathbf X^\top \mathbf X\) may not be invertible, or its inversion may be numerically unstable, which happens when the columns of \(\mathbf X\) are exactly or nearly linearly dependent (often called multicollinearity). For example, if the number of covariates \(p\) is larger than the number of observations \(n\) (which can easily happen when working with image data), then the \(p \times p\) matrix \(\mathbf X^\top \mathbf X\) has rank at most \(n<p\) and so is not invertible; a small example is sketched after this list.

  • Prediction accuracy may be low. The Gauss-Markov theorem says that the ordinary least squares estimator of \(\boldsymbol \beta\) has minimum variance amongst all linear unbiased estimators of \(\boldsymbol \beta\). The mean square error (MSE) of an estimator \(\tilde{\theta}\) of an unknown quantity \(\theta\), which is a measure of accuracy, is \[\begin{align*} MSE(\tilde{\theta}) &= {\mathbb{E}}((\theta - \tilde{\theta})^2)\\ &={\mathbb{V}\operatorname{ar}}(\tilde{\theta})+ ({\mathbb{E}}(\tilde{\theta})- \theta)^2\\ &={\mathbb{V}\operatorname{ar}}(\tilde{\theta})+ \operatorname{bias}(\tilde{\theta})^2. \end{align*}\] By allowing a small amount of bias, we may be able to find estimators that have a smaller MSE than the least squares estimator of \(\boldsymbol \beta\).
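
Here is a minimal sketch of the first problem (non-invertibility when \(p > n\)); the values of \(n\) and \(p\) and the random design matrix are made up purely for illustration.

set.seed(1)
n <- 5; p <- 10
X <- matrix(rnorm(n * p), n, p)   # more covariates than observations
qr(crossprod(X))$rank             # rank is at most n = 5, but X'X is 10 x 10
try(solve(crossprod(X)))          # so the inversion fails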

In the next two sections we will look at some methods that have been proposed to solve these problems.