2.3 Inner product spaces
2.3.1 Distances and angles
Vector spaces are not particularly interesting from a statistical point of view until we equip them with a sense of geometry, i.e., distance and angle. An inner product on a real vector space \(V\) is a map \(\langle\cdot,\cdot\rangle: V\times V\rightarrow \mathbb{R}\) satisfying the following properties:
- \(\langle\cdot,\cdot\rangle\) is linear in each argument; for the first argument, \[\langle \alpha \mathbf v_1+\beta \mathbf v_2, \mathbf u\rangle = \alpha \langle \mathbf v_1, \mathbf u\rangle + \beta \langle \mathbf v_2, \mathbf u\rangle\] for all \(\mathbf v_1, \mathbf v_2, \mathbf u\in V\) and \(\alpha, \beta \in \mathbb{R}\) (linearity in the second argument then follows from symmetry).
- \(\langle\cdot,\cdot\rangle\) is symmetric in its arguments: \(\langle \mathbf v, \mathbf u\rangle = \langle \mathbf u, \mathbf v\rangle\) for all \(\mathbf u,\mathbf v\in V\).
- \(\langle\cdot,\cdot\rangle\) is positive definite: \(\langle \mathbf v, \mathbf v\rangle \geq 0\) for all \(\mathbf v\in V\) with equality if and only if \(\mathbf v={\mathbf 0}\).
An inner product provides a vector space with the concepts of

- distance: for all \(\mathbf v\in V\) define the norm of \(\mathbf v\) to be \[||\mathbf v|| = \langle \mathbf v, \mathbf v\rangle ^{\frac{1}{2}}.\] Thus any inner-product space \((V, \langle\cdot,\cdot\rangle)\) is also a normed space \((V, ||\cdot||)\), and a metric space \((V, d(\mathbf x,\mathbf y)=||\mathbf x-\mathbf y||)\).
- angle: for \(\mathbf u, \mathbf v\in V\) we define the angle between \(\mathbf u\) and \(\mathbf v\) to be \(\theta\) where \[\begin{align*} \langle \mathbf u,\mathbf v\rangle &= ||\mathbf u||\,||\mathbf v||\cos \theta\\ \implies \theta &= \cos^{-1}\left( \frac{\langle \mathbf u, \mathbf v\rangle}{||\mathbf u|| \;||\mathbf v||}\right). \end{align*}\]

We will primarily be interested in the concept of orthogonality. We say \(\mathbf u, \mathbf v\in V\) are orthogonal if \[\langle \mathbf u, \mathbf v\rangle =0,\] i.e., the angle between them is \(\frac{\pi}{2}\).
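As a concrete illustration (not from the notes), here is a minimal numerical sketch of these quantities for the Euclidean inner product on \(\mathbb{R}^n\), using NumPy; the two vectors are arbitrary and chosen purely for demonstration.

```python
import numpy as np

# Two arbitrary vectors in R^3 (chosen purely for illustration)
u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, -1.0, 0.0])

inner = u @ v                       # Euclidean inner product <u, v> = u^T v
norm_u = np.sqrt(u @ u)             # norm ||u|| = <u, u>^{1/2}
norm_v = np.sqrt(v @ v)
theta = np.arccos(inner / (norm_u * norm_v))   # angle between u and v
dist = np.sqrt((u - v) @ (u - v))   # metric d(u, v) = ||u - v||

print(inner)    # 0.0  -> u and v are orthogonal
print(norm_u)   # 3.0
print(theta)    # 1.5707963... = pi/2, as expected for orthogonal vectors
print(dist)     # ||u - v||
```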
If you have done any functional analysis, you may recall that a Hilbert space is a complete inner-product space, and a Banach space is a complete normed space. This is an applied module, so we will skirt much of this technical detail; simply note that some of the proofs formally require us to be working in a Banach or Hilbert space.
Example 2.11 We will mostly be working with the Euclidean vector spaces \(V=\mathbb{R}^n\), in which we use the Euclidean inner product \[\langle \mathbf u, \mathbf v\rangle = \mathbf u^\top \mathbf v,\] sometimes called the scalar or dot product of \(\mathbf u\) and \(\mathbf v\). Sometimes this gets weighted by a symmetric positive-definite matrix \(\mathbf Q\), giving \[\langle \mathbf u, \mathbf v\rangle_Q = \mathbf u^\top \mathbf Q\mathbf v.\]
The norm associated with the dot product is the Euclidean norm (the square root of the sum of squared entries), denoted by \(|| \cdot ||_2\). The length of \(\mathbf u\) is then \[||\mathbf u||_2=\sqrt{\mathbf u^\top \mathbf u} =\left( \sum_{i=1}^n u_i^2\right)^\frac{1}{2}\geq 0.\] Note that \(||\mathbf u||_2=0\) if and only if \(\mathbf u={\mathbf 0}_n\), where \({\mathbf 0}_n=(0,0,\dots ,0)^\top\) is the \(n\times 1\) vector of zeros.
We say \(\mathbf u\) is orthogonal to \(\mathbf v\) if \(\mathbf u^\top \mathbf v=0\). For example, if \[\mathbf u=\left(\begin{array}{c}1\\2\end{array}\right) \mbox{ and } \mathbf v=\left(\begin{array}{c}-2\\1\end{array}\right)\] then \[||\mathbf u||_2 = \sqrt{5}\mbox{ and } \mathbf u^\top \mathbf v=0.\] We will write \(\mathbf u\perp \mathbf v\) if \(\mathbf u\) is orthogonal to \(\mathbf v\).

2.3.2 Orthogonal matrices
Definition 2.10 A unit vector \(\mathbf v\) is a vector satisfying \(||{\mathbf v}||=1\), i.e., it is a vector of length \(1\). Vectors \(\mathbf u\) and \(\mathbf v\) are orthonormal if \[||\mathbf u||=||\mathbf v|| = 1 \mbox{ and } \langle \mathbf u, \mathbf v\rangle =0.\]
An \(n\times n\) matrix \({\mathbf Q}\) is an orthogonal matrix if \[{\mathbf Q}\mathbf Q^\top = {\mathbf Q}^\top {\mathbf Q}={\mathbf I}_n.\]
Equivalently, a matrix \(\mathbf Q\) is orthogonal if \({\mathbf Q}^{-1}={\mathbf Q}^\top.\)
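As a quick numerical sanity check (not part of the notes), the sketch below constructs an orthogonal matrix as the Q-factor of the QR decomposition of a random square matrix, and verifies the defining properties; the dimension and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an orthogonal matrix Q as the Q-factor of a random square matrix
n = 4
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Q^T Q = Q Q^T = I_n (up to floating-point error)
print(np.allclose(Q.T @ Q, np.eye(n)))      # True
print(np.allclose(Q @ Q.T, np.eye(n)))      # True

# Equivalently, Q^{-1} = Q^T
print(np.allclose(np.linalg.inv(Q), Q.T))   # True
```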
If \({\mathbf Q}=[\mathbf q_1,\ldots, \mathbf q_n]\) is an orthogonal matrix, then the columns \(\mathbf q_1, \ldots, \mathbf q_n\) are mutually orthonormal vectors, i.e. \[ \mathbf q_j^\top \mathbf q_k=\begin{cases} 1 &\hbox{ if } j=k\\ 0 &\hbox{ if } j \neq k, \\ \end{cases} \] since \(\mathbf q_j^\top \mathbf q_k\) is the \((j,k)\) entry of \(\mathbf Q^\top \mathbf Q=\mathbf I_n\). In fact, for a square matrix the single condition \(\mathbf Q^\top \mathbf Q=\mathbf I_n\) is enough to make \(\mathbf Q\) orthogonal, as the following proof shows.

Proof. Suppose first that \(\mathbf Q\) is a square \(n\times p\) matrix, i.e. \(n=p\), and think of \(\mathbf Q\) as a linear map \[\begin{align*} \mathbf Q: &\mathbb{R}^n \rightarrow \mathbb{R}^n\\ &\mathbf v\mapsto \mathbf Q\mathbf v. \end{align*}\] By the rank-nullity theorem, \[\dim \operatorname{Ker}(\mathbf Q) + \dim \operatorname{Im}(\mathbf Q) =n,\] and because \(\mathbf Q\) has a left inverse, we must have \(\dim \operatorname{Ker}(\mathbf Q)=0\), as otherwise \(\mathbf Q^\top\) would have to map \(\operatorname{Im}(\mathbf Q)\), a space of dimension less than \(n\), onto the whole of \(\mathbb{R}^n\), which is impossible. So \(\mathbf Q\) is of full rank, and thus must also have a right inverse, \(\mathbf B\) say, with \(\mathbf Q\mathbf B=\mathbf I_n\). Left-multiplying by \(\mathbf Q^\top\) gives \[\begin{align*} \mathbf Q\mathbf B&=\mathbf I_n\\ \mathbf Q^\top\mathbf Q\mathbf B&=\mathbf Q^\top\\ \mathbf I_n \mathbf B&= \mathbf Q^\top\\ \mathbf B&= \mathbf Q^\top, \end{align*}\] and so we have that \(\mathbf Q^{-1}=\mathbf Q^\top\).
Now suppose \(\mathbf Q\) is \(n \times p\) with \(n\not = p\). Then as \(\mathbf Q^\top \mathbf Q=\mathbf I_{p}\), we must have \(\operatorname{tr}(\mathbf Q^\top \mathbf Q)=p\). This implies that \[\operatorname{tr}(\mathbf Q\mathbf Q^\top)=\operatorname{tr}(\mathbf Q^\top \mathbf Q)=p,\] and so we cannot have \(\mathbf Q\mathbf Q^\top=\mathbf I_{n}\), as \(\operatorname{tr}(\mathbf I_{n})=n\neq p\).

2.3.3 Projections
View a projection matrix \(\mathbf P\) (i.e., a matrix satisfying \(\mathbf P^2=\mathbf P\)) as a map from a vector space \(W\) to itself. Let \(U=\operatorname{Im}(\mathbf P)\) and \(V=\operatorname{Ker}(\mathbf P)\) be the image and kernel of \(\mathbf P\).
The kernel and image of \(\mathbf I-\mathbf P\) are the image and kernel (respectively) of \(\mathbf P\): \[\begin{align*} \operatorname{Ker}(\mathbf I-\mathbf P) &= U=\operatorname{Im}(\mathbf P)\\ \operatorname{Im}(\mathbf I-\mathbf P) &= V=\operatorname{Ker}(\mathbf P). \end{align*}\]
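To see this concretely, here is a small numerical sketch (not from the notes) using a simple, deliberately non-symmetric projection matrix on \(\mathbb{R}^2\); the matrix and the test vector are arbitrary choices.

```python
import numpy as np

# A simple (oblique, non-symmetric) projection matrix: P^2 = P
P = np.array([[1.0, 1.0],
              [0.0, 0.0]])
I = np.eye(2)

print(np.allclose(P @ P, P))        # True: P is a projection

# Im(P) is spanned by (1, 0)^T; Ker(P) is spanned by (1, -1)^T
w = np.array([3.0, 2.0])
u = P @ w                           # component in U = Im(P)
v = (I - P) @ w                     # component in V = Ker(P)
print(u, v, u + v)                  # w = u + v

# (I - P) annihilates Im(P) and P annihilates Ker(P),
# illustrating Ker(I - P) = Im(P) and Im(I - P) = Ker(P)
print(np.allclose((I - P) @ u, 0))  # True
print(np.allclose(P @ v, 0))        # True
```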
2.3.3.1 Orthogonal projection
We are mostly interested in orthogonal projections, i.e., projections onto a subspace \(U\) along its orthogonal complement. The orthogonal projection of \(\mathbf w\) onto \(U\) is the point of \(U\) closest to \(\mathbf w\); in other words, it is the best possible approximation of \(\mathbf w\) in \(U\).
As above, we can split \(W\) into \(U\) and its orthogonal complement \[U^\perp = \{\mathbf x\in W: \langle \mathbf x,\mathbf u\rangle = 0 \mbox{ for all } \mathbf u\in U\},\] i.e., \(W=U \oplus U^\perp\), so that any \(\mathbf w\in W\) can be written as \(\mathbf w=\mathbf u+\mathbf v\) with \(\mathbf u\in U\) and \(\mathbf v\in U^\perp\).
Proof. Let \(\{\mathbf u_1, \ldots, \mathbf u_k\}\) be a basis for \(U\), and let \(\mathbf A=[\mathbf u_1,\ldots,\mathbf u_k]\) be the matrix with these basis vectors as its columns. We need to find \(\mathbf u= \sum_i \lambda_i \mathbf u_i = \mathbf A\boldsymbol \lambda\) that minimizes \(||\mathbf w-\mathbf u||\).
\[\begin{align*} ||\mathbf w-\mathbf u||^2 &= \langle \mathbf w-\mathbf u, \mathbf w-\mathbf u\rangle\\ &= \mathbf w^\top \mathbf w- 2\mathbf u^\top \mathbf w+ \mathbf u^\top \mathbf u\\ &= \mathbf w^\top \mathbf w-2\boldsymbol \lambda^\top \mathbf A^\top \mathbf w+ \boldsymbol \lambda^\top \mathbf A^\top \mathbf A\boldsymbol \lambda. \end{align*}\]
Differentiating with respect to \(\boldsymbol \lambda\) and setting equal to zero gives \[\boldsymbol 0=-2 \mathbf A^\top \mathbf w+2 \mathbf A^\top \mathbf A\boldsymbol \lambda\] and hence \[ \boldsymbol \lambda= (\mathbf A^\top \mathbf A)^{-1}\mathbf A^\top \mathbf w,\] where \(\mathbf A^\top \mathbf A\) is invertible because the columns of \(\mathbf A\) are linearly independent. The orthogonal projection of \(\mathbf w\) is hence \[ \mathbf A\boldsymbol \lambda= \mathbf A(\mathbf A^\top \mathbf A)^{-1}\mathbf A^\top \mathbf w\] and the projection matrix is \[\mathbf P_U = \mathbf A(\mathbf A^\top \mathbf A)^{-1}\mathbf A^\top. \]

Notes:
- If \(\{\mathbf u_1, \ldots, \mathbf u_k\}\) is an orthonormal basis for \(U\) then \(\mathbf A^\top \mathbf A= \mathbf I\) and \(\mathbf P_U = \mathbf A\mathbf A^\top\). We can then write \[\mathbf P_U\mathbf w= \sum_i (\mathbf u_i^\top \mathbf w) \mathbf u_i\] and \[\mathbf P_U = \sum_{i=1}^k \mathbf u_i\mathbf u_i^\top.\] Note that if \(U=W\) (so that \(\mathbf P_U\) is a projection from \(W\) onto \(W\), i.e., the identity), then \(\mathbf A\) is a square matrix (\(n\times n\)) and thus \(\mathbf A^\top\mathbf A=\mathbf I_n \implies \mathbf A\mathbf A^\top=\mathbf I_n\), giving \(\mathbf P_U=\mathbf I_n\) as required. The coordinates (with respect to the orthonormal basis \(\{\mathbf u_1, \ldots, \mathbf u_k\}\)) of a point \(\mathbf w\) projected onto \(U\) are \(\mathbf A^\top \mathbf w\).
- \(\mathbf P_U^2=\mathbf P_U\), so \(\mathbf P_U\) is a projection matrix in the sense of definition 2.11.
- \(\mathbf P_U\) is symmetric (\(\mathbf P_U^\top=\mathbf P_U\)). This is true for orthogonal projection matrices, but not in general for projection matrices.
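To make the construction concrete, here is a short sketch (not from the notes) that builds \(\mathbf P_U=\mathbf A(\mathbf A^\top \mathbf A)^{-1}\mathbf A^\top\) for an arbitrary (non-orthonormal) basis and checks the properties listed above; the dimensions and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# A (non-orthonormal) basis for a 2-dimensional subspace U of R^4,
# stored as the columns of A
A = rng.standard_normal((4, 2))

# Orthogonal projection matrix onto U = column space of A
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))     # idempotent: P^2 = P
print(np.allclose(P.T, P))       # symmetric: P^T = P

# Decompose an arbitrary w as w = Pw + (I - P)w, with the two parts orthogonal
w = rng.standard_normal(4)
u = P @ w                        # component in U
v = w - u                        # component in U-perp
print(np.isclose(u @ v, 0.0))    # True: <u, v> = 0

# With an orthonormal basis (Q-factor of A), P_U = Q Q^T = sum_i u_i u_i^T
Q, _ = np.linalg.qr(A)
print(np.allclose(Q @ Q.T, P))   # True: same projection matrix
```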
2.3.3.2 Geometric interpretation of linear regression

Consider the linear regression model \[\mathbf y= \mathbf X\boldsymbol \beta+\mathbf e\] where \(\mathbf y\in\mathbb{R}^n\) is the vector of observations, \(\mathbf X\) is the \(n\times p\) design matrix, \(\boldsymbol \beta\) is the \(p\times 1\) vector of parameters that we wish to estimate, and \(\mathbf e\) is an \(n\times 1\) vector of zero-mean errors.
Least-squares regression finds the value of \(\boldsymbol \beta\in \mathbb{R}^p\) that minimizes the sum of squared errors, i.e., we choose \(\boldsymbol \beta\) to minimize \[||\mathbf y- \mathbf X\boldsymbol \beta||_2^2.\]

Since \(\mathbf X\boldsymbol \beta\) lies in the column space of \(\mathbf X\), minimizing this quantity amounts to finding the orthogonal projection of \(\mathbf y\) onto \(U=\mathcal{C}(\mathbf X)\): \[\mathbf P_U\mathbf y=\arg \min_{\mathbf y' \in \mathcal{C}(\mathbf X)} ||\mathbf y-\mathbf y'||_2.\]
By Proposition 2.5 (assuming \(\mathbf X\) has full column rank, so that \(\mathbf X^\top \mathbf X\) is invertible) this is \[\mathbf P_U\mathbf y= \mathbf X(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y=\hat{\mathbf y},\] which equals the usual prediction obtained in linear regression (\(\hat{\mathbf y}\) are often called the fitted values). We can also see that the choice of \(\boldsymbol \beta\) that specifies this point in \(\mathcal{C}(\mathbf X)\) is \[\hat{\boldsymbol \beta}=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y,\] which is the usual least-squares estimator.
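This correspondence is easy to check numerically. The sketch below (not from the notes, using simulated data with arbitrary true coefficients) compares the projection-matrix calculation with the normal equations and with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: n observations, p predictors (including an intercept column)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])   # arbitrary coefficients for simulation
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Least-squares estimate and fitted values via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

# The same fitted values as the orthogonal projection of y onto C(X)
P = X @ np.linalg.inv(X.T @ X) @ X.T      # projection ("hat") matrix P_U
print(np.allclose(P @ y, y_hat))          # True

# Agreement with NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True

# The residual y - y_hat is orthogonal to every column of X
print(np.allclose(X.T @ (y - y_hat), 0.0))  # True
```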