Chapter 4 Principal Component Analysis (PCA)
The videos for this chapter are available at the following links:
- 4.1 An informal introduction to PCA
- 4.1.5 Informal Examples
- 4.2 A more formal description of PCA
- 4.2.1 Properties of PC scores
- 4.2.2 Example: PCA of the football data
- 4.2.4 Population PCA and transformations
- 4.3 An alternative view of PCA: minimizing reconstruction error
- 4.3.1 Example: PCA of the MNIST data
With multivariate data, it is common to want to reduce the dimension of the data in a sensible way. For example:
- exam marks across different modules are averaged to produce a single overall mark for each student;
- a football league table converts the numbers of wins, draws and losses to a single measure of points.
Mathematically, these summaries are both linear combinations of the original variables, of the form \[y = \mathbf u^\top \mathbf x\] for some choice of \(\mathbf u\).
For the exam marks example, suppose each student sits \(p=4\) modules with marks \(x_1,x_2,x_3,x_4\). Then, writing \(\mathbf x=(x_1, x_2 , x_3, x_4)^\top\) and choosing \(\mathbf u= \left(\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4} \right)^\top\) gives an overall average, \[ y =\mathbf u^\top \mathbf x= \begin{pmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \frac{x_1}{4} + \frac{x_2}{4} + \frac{x_3}{4} + \frac{x_4}{4}.\]
For the football league table, if \(w\) is the number of wins, \(d\) is the number of draws and \(l\) is the number of losses then, writing \({\mathbf r}=(w,d,l)^\top\), we choose \(\mathbf u= \left(3,1,0 \right)^\top\) to get the points score \[ y = \mathbf u^\top {\mathbf r}=\begin{pmatrix} 3 & 1 & 0 \end{pmatrix} \begin{pmatrix} w \\ d \\ l \end{pmatrix} = 3w + 1d + 0l=3w+d.\]
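These linear combinations are just dot products, so they are straightforward to compute numerically. The NumPy sketch below evaluates both summaries; the marks and the win/draw/loss record are made-up numbers used purely for illustration.

```python
import numpy as np

# Exam example: average of p = 4 module marks (illustrative numbers).
x = np.array([62.0, 71.0, 58.0, 69.0])        # marks for one student
u_exam = np.array([0.25, 0.25, 0.25, 0.25])   # averaging weights
y_exam = u_exam @ x                           # u^T x = 65.0

# Football example: points from (wins, draws, losses).
r = np.array([28, 8, 2])                      # illustrative season record
u_pts = np.array([3, 1, 0])                   # 3 points per win, 1 per draw
y_pts = u_pts @ r                             # 3*28 + 8 = 92

print(y_exam, y_pts)
```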
Geometric interpretation
In the two examples above, we used the vector \(\mathbf u\) to convert our original variables, \(\mathbf x\), to a new variable, \(y\), by projecting \(\mathbf x\) onto \(\mathbf u\). We can think of this as a projection onto the subspace defined by \(\mathbf u\),
\[U = \operatorname{span}\{\mathbf u\} = \{\lambda \mathbf u: \lambda \in \mathbb{R}\}\subset \mathbb{R}^p.\]
For the exam data, each data point \(\mathbf x= \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}\) is a vector in \(\mathbb{R}^4\), and we’ve expressed \(\mathbf x\) in terms of its coordinates with respect to the standard basis, \(\mathbf e_1^\top = (1\; 0\; 0 \; 0)\) etc: \[\mathbf x=x_1 \mathbf e_1 + x_2 \mathbf e_2 +x_3 \mathbf e_3 +x_4 \mathbf e_4.\] The vector subspace \(U\) is a line in \(\mathbb{R}^4\) along the direction \(\mathbf u= \begin{pmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{pmatrix}^\top\).
How do we project onto subspace \(U\)?
- If \(||\mathbf u||_2=1\) then the orthogonal projection of \(\mathbf x\) onto \(U\) is
\[\mathbf u\mathbf u^\top\mathbf x.\] In other words, the projection of \(\mathbf x\) onto subspace \(U\) has coordinate \(\mathbf u^\top \mathbf x\) with respect to the basis \(\{\mathbf u\}\).
If you prefer to think in terms of projection matrices (see Chapter 2.3.3.1), then the matrix for projecting onto \(U\) is \[\mathbf P_U = \mathbf u(\mathbf u^\top \mathbf u)^{-1}\mathbf u^\top,\] which simplifies to \[\mathbf P_U = \mathbf u\mathbf u^\top\] when \(||\mathbf u||=\sqrt{\mathbf u^\top\mathbf u}=1\), so that we again see that the projection of \(\mathbf x\) onto \(U\) is \(\mathbf P_U \mathbf x= \mathbf u\mathbf u^\top\mathbf x\), with coordinate \(y=\mathbf u^\top\mathbf x\).
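The NumPy sketch below checks these facts numerically for an illustrative direction in \(\mathbb{R}^4\): the projection \(\mathbf u\mathbf u^\top\mathbf x\) equals the coordinate \(\mathbf u^\top\mathbf x\) times \(\mathbf u\), and \(\mathbf P_U\) is idempotent. (Note that the averaging vector from the exam example has norm \(1/2\), so it is rescaled to unit length first.)

```python
import numpy as np

# A direction in R^4; rescale so that ||u|| = 1 (the averaging vector
# (1/4, ..., 1/4) has norm 1/2, so normalising gives (1/2, ..., 1/2)).
u = np.full(4, 0.25)
u = u / np.linalg.norm(u)

P = np.outer(u, u)                 # projection matrix u u^T onto U = span{u}

x = np.array([62.0, 71.0, 58.0, 69.0])
coord = u @ x                      # coordinate of the projection w.r.t. basis {u}
proj = P @ x                       # the projected vector itself

# The projection is the coordinate times the basis vector,
# and P is idempotent (P P = P), as a projection matrix should be.
assert np.allclose(proj, coord * u)
assert np.allclose(P @ P, P)
```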
How should we choose \(\mathbf u\)?
The answer to that question depends upon the goal of the analysis. For the exam and football league examples, the choice of \(\mathbf u\) is an arbitrary decision taken in order to reduce a multidimensional dataset to a single variable (average mark, or points).
A single \(\mathbf u\) gives a snapshot or summary of the data. If \(\mathbf u\) is chosen well that snapshot may tell us much of what we want to know about the data, e.g.,
- Liverpool won the league,
- student \(X\)’s exam performance was first class, and so on.
In many cases we will want to use multiple snapshots: instead of using a single \(\mathbf u\), we will use a collection \(\mathbf u_1, \mathbf u_2, \ldots, \mathbf u_r\) and consider the derived variables
\[\mathbf y= \begin{pmatrix} y_1\\y_2 \\ \vdots \\ y_r\end{pmatrix} = \begin{pmatrix} \mathbf u_1^\top \mathbf x\\ \mathbf u_2^\top \mathbf x\\\vdots\\ \mathbf u_r^\top \mathbf x\end{pmatrix}\]
In matrix notation, if we set \[\mathbf U= \begin{pmatrix} |&&|\\ \mathbf u_1 & \ldots & \mathbf u_r\\ |&&|\end{pmatrix}\] then the vector of derived variables is \[\mathbf y= \mathbf U^\top \mathbf x.\]
If \(\dim(\mathbf y)=r<p=\dim(\mathbf x)\) then we have reduced the dimension of the data. If \(\mathbf y\) tells us all we need to know about the data, then we can work (plot, analyse, model) with \(\mathbf y\) instead of \(\mathbf x\). If \(r\ll p\) this can make working with the data significantly easier, as we can more easily visualise and understand low-dimensional problems.
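As a concrete illustration, the sketch below stacks two directions \(\mathbf u_1, \mathbf u_2\) (chosen arbitrarily for the exam example, not derived from any particular method) into a matrix \(\mathbf U\) and computes \(\mathbf y = \mathbf U^\top \mathbf x\), reducing the dimension from \(p=4\) to \(r=2\).

```python
import numpy as np

# Two illustrative directions in R^4: a scaled average of all four marks
# and a contrast between the first two and last two modules.
u1 = np.full(4, 0.5)
u2 = np.array([0.5, 0.5, -0.5, -0.5])
U = np.column_stack([u1, u2])              # p x r matrix, here 4 x 2

x = np.array([62.0, 71.0, 58.0, 69.0])
y = U.T @ x                                # r-dimensional derived variable

print(y, y.shape)                          # shape (2,): reduced from p = 4 to r = 2
```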
We will study a variety of methods for choosing \(\mathbf U\). The methods can all be expressed as constrained optimization problems:
\[\begin{align} \mbox{minimize } & f_{\mathbf X}(\mathbf U) \tag{4.1} \\ \mbox{subject to } & \mathbf U\in \mathcal{U}. \end{align}\]
The objective \(f_{\mathbf X}(\mathbf U)\) varies between methods: principal component analysis (PCA) maximizes variance or minimizes reconstruction error; canonical correlation analysis (CCA) maximizes correlation; multidimensional scaling (MDS) maximizes spread etc.
The constraint on the search space, \(\mathcal{U}\), is usually that \(\mathbf U\) must be (partially) orthogonal, although some methods use other constraints.
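As a preview of what follows, for PCA this optimization problem has a closed-form solution: the first principal component direction is the unit vector maximizing the sample variance of \(\mathbf u^\top \mathbf x\), and it is given by the leading eigenvector of the sample covariance matrix. The sketch below illustrates this on randomly generated data (purely illustrative); the formal development is the subject of Section 4.2.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))              # illustrative n x p data matrix

# First principal component direction: the unit vector u maximising the
# sample variance of u^T x, i.e. the leading eigenvector of the sample
# covariance matrix.
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns eigenvalues in ascending order
u1 = eigvecs[:, -1]                        # leading eigenvector, ||u1|| = 1

scores = X @ u1                            # first PC scores, one per observation
print(u1, scores.var(ddof=1))              # sample variance equals the largest eigenvalue
```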