2.3 Inner product spaces
2.3.1 Distances and angles
Vector spaces are not particularly interesting from a statistical point of view until we equip them with a sense of geometry, i.e. distance and angle.
Definition 2.7 A real inner product space \((V, \langle\cdot,\cdot\rangle)\) is a real vector space \(V\) equipped with a map \[ \langle\cdot,\cdot\rangle : V \times V \rightarrow \mathbb{R}\] such that
- \(\langle\cdot,\cdot\rangle\) is a linear map in both arguments: \[\langle \alpha \mathbf v_1+\beta \mathbf v_2, \mathbf u\rangle = \alpha \langle \mathbf v_1, \mathbf u\rangle + \beta \langle \mathbf v_2, \mathbf u\rangle\] for all \(\mathbf v_1, \mathbf v_2, \mathbf u\in V\) and \(\alpha, \beta \in \mathbb{R}\).
- \(\langle\cdot,\cdot\rangle\) is symmetric in its arguments: \(\langle \mathbf v, \mathbf u\rangle = \langle \mathbf u, \mathbf v\rangle\) for all \(\mathbf u,\mathbf v\in V\)
- \(\langle\cdot,\cdot\rangle\) is positive definite: \(\langle \mathbf v, \mathbf v\rangle \geq 0\) for all \(\mathbf v\in V\) with equality if and only if \(\mathbf v={\mathbf 0}\).
An inner product provides a vector space with the concepts of
- distance: for all \(\mathbf v\in V\) we define the norm of \(\mathbf v\) to be \[||\mathbf v|| = \langle \mathbf v, \mathbf v\rangle ^{\frac{1}{2}}.\] Thus any inner-product space \((V, \langle\cdot,\cdot\rangle)\) is also a normed space \((V, ||\cdot||)\), and a metric space \((V, d(\mathbf x,\mathbf y)=||\mathbf x-\mathbf y||)\).
- angle: for \(\mathbf u, \mathbf v\in V\) we define the angle between \(\mathbf u\) and \(\mathbf v\) to be \(\theta\) where \[\begin{align*} \langle \mathbf u,\mathbf v\rangle &= ||\mathbf u||\,||\mathbf v||\cos \theta\\ \implies \theta &= \cos^{-1}\left( \frac{\langle \mathbf u, \mathbf v\rangle}{||\mathbf u|| \;||\mathbf v||}\right). \end{align*}\]

We will primarily be interested in the concept of orthogonality. We say \(\mathbf u, \mathbf v\in V\) are orthogonal if \[\langle \mathbf u, \mathbf v\rangle =0,\] i.e., the angle between them is \(\frac{\pi}{2}\).
If you have done any functional analysis, you may recall that a Hilbert space is a complete inner-product space, and a Banach space is a complete normed space. This is an applied module, so we will skirt much of the technical detail, but note that some of the proofs formally require us to be working in a Banach or Hilbert space. We will not concern ourselves with such detail.
Example 2.11 We will mostly be working with the Euclidean vector spaces \(V=\mathbb{R}^n\), in which we use the Euclidean inner product \[\langle \mathbf u, \mathbf v\rangle = \mathbf u^\top \mathbf v,\] sometimes called the scalar or dot product of \(\mathbf u\) and \(\mathbf v\). Sometimes this is weighted by a symmetric positive-definite matrix \(\mathbf Q\), giving \[\langle \mathbf u, \mathbf v\rangle_{\mathbf Q} = \mathbf u^\top \mathbf Q\mathbf v.\]
The norm associated with the dot product is the square root of the sum of squares of the elements, and is denoted by \(|| \cdot ||_2\) (the Euclidean norm). The length of \(\mathbf u\) is then \[||\mathbf u||_2=\sqrt{\mathbf u^\top \mathbf u} =\left( \sum_{i=1}^n u_i^2\right)^\frac{1}{2}\geq 0.\] Note that \(||\mathbf u||_2=0\) if and only if \(\mathbf u={\mathbf 0}_n\), where \(\stackrel{n\times 1}{\mathbf 0}_n=(0,0,\dots ,0)^\top\).
We say \(\mathbf u\) is orthogonal to \(\mathbf v\) if \(\mathbf u^\top \mathbf v=0\). For example, if \[\mathbf u=\left(\begin{array}{c}1\\2\end{array}\right) \mbox{ and } \mathbf v=\left(\begin{array}{c}-2\\1\end{array}\right)\] then \[||\mathbf u||_2 = \sqrt{5}\mbox{ and } \mathbf u^\top \mathbf v=0.\] We will write \(\mathbf u\perp \mathbf v\) if \(\mathbf u\) is orthogonal to \(\mathbf v\).
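A quick numerical check of this example, and of the weighted inner product above, using NumPy. The weight matrix `Q` below is an arbitrary symmetric positive-definite choice made purely for illustration; it does not come from the notes.

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([-2.0, 1.0])

print(np.linalg.norm(u))        # sqrt(5), approximately 2.236
print(u @ v)                    # 0.0, so u is orthogonal to v under the Euclidean inner product

# angle via cos(theta) = <u, v> / (||u|| ||v||)
cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(np.arccos(cos_theta))     # pi/2, approximately 1.571

# a weighted inner product <u, v>_Q = u^T Q v, with an illustrative
# symmetric positive-definite Q
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(u @ Q @ v)                # -3.5: u and v are not orthogonal in the Q-weighted sense
```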
Definition 2.8 p-norm: The subscript \(2\) hints at a wider family of norms. For \(p \geq 1\) we define the \(L_p\) norm to be \[|| \mathbf v||_p = \left(\sum_{i=1}^n |v_i|^p\right)^\frac{1}{p}.\]
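As a quick illustration (a NumPy sketch; for vectors, `np.linalg.norm(v, ord=p)` computes exactly this quantity):

```python
import numpy as np

v = np.array([3.0, -4.0])

for p in (1, 2, 3):
    explicit = np.sum(np.abs(v) ** p) ** (1 / p)   # (sum_i |v_i|^p)^(1/p)
    print(p, explicit, np.linalg.norm(v, ord=p))   # the two values agree: 7, 5, 91^(1/3)
```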
2.3.2 Orthogonal matrices
Definition 2.9 A unit vector \(\mathbf v\) is a vector satisfying \(||{\mathbf v}||=1\), i.e., it is a vector of length \(1\). Vectors \(\mathbf u\) and \(\mathbf v\) are orthonormal if \[||\mathbf u||=||\mathbf v|| = 1 \mbox{ and } \langle \mathbf u, \mathbf v\rangle =0.\]
An \(n\times n\) matrix \({\mathbf Q}\) is an orthogonal matrix if \[{\mathbf Q}\mathbf Q^\top = {\mathbf Q}^\top {\mathbf Q}={\mathbf I}_n.\]
Equivalently, a matrix \(\mathbf Q\) is orthogonal if \({\mathbf Q}^{-1}={\mathbf Q}^\top.\)
If \({\mathbf Q}=[\mathbf q_1,\ldots, \mathbf q_n]\) is an orthogonal matrix, then the columns \(\mathbf q_1, \ldots, \mathbf q_n\) are mutually orthonormal vectors, i.e. \[ \mathbf q_j^\top \mathbf q_k=\begin{cases} 1 &\hbox{ if } j=k\\ 0 &\hbox{ if } j \neq k. \\ \end{cases} \]
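As a quick check in NumPy, using a \(2\times 2\) rotation matrix, the standard example of an orthogonal matrix (the angle below is arbitrary):

```python
import numpy as np

theta = 0.3                                  # any angle will do
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))       # True
print(np.allclose(Q @ Q.T, np.eye(2)))       # True
print(np.allclose(np.linalg.inv(Q), Q.T))    # True: Q^{-1} = Q^T
```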
Lemma 2.2 Let \(\mathbf Q\) be an \(n\times p\) matrix and suppose \(\mathbf Q^\top \mathbf Q=\mathbf I_p\), where \(\mathbf I_p\) is the \(p \times p\) identity matrix. If \(\mathbf Q\) is a square matrix (\(n=p\)), then \(\mathbf Q\mathbf Q^\top = \mathbf I_p\). If \(\mathbf Q\) is not square (\(n\neq p\)), then \(\mathbf Q\mathbf Q^\top \neq \mathbf I_n\).
Proof. Suppose \(n=p\), and think of \(\mathbf Q\) as a linear map \[\begin{align*} \mathbf Q: &\mathbb{R}^n \rightarrow \mathbb{R}^n\\ &\mathbf v\mapsto \mathbf Q\mathbf v. \end{align*}\] By the rank-nullity theorem, \[\dim \operatorname{Ker}(\mathbf Q) + \dim \operatorname{Im}(\mathbf Q) =n,\] and because \(\mathbf Q\) has a left inverse we must have \(\dim \operatorname{Ker}(\mathbf Q)=0\): since \(\mathbf Q^\top\mathbf Q\mathbf v=\mathbf v\) for every \(\mathbf v\), \(\mathbf Q^\top\) maps \(\operatorname{Im}(\mathbf Q)\) onto the whole of \(\mathbb{R}^n\), which would be impossible if \(\dim \operatorname{Im}(\mathbf Q)\) were less than \(n\). So \(\mathbf Q\) is of full rank, and thus must also have a right inverse, \(\mathbf B\) say, with \(\mathbf Q\mathbf B=\mathbf I_n\). Left-multiplying by \(\mathbf Q^\top\) gives \[\begin{align*} \mathbf Q\mathbf B&=\mathbf I_n\\ \mathbf Q^\top\mathbf Q\mathbf B&=\mathbf Q^\top\\ \mathbf I_n \mathbf B&= \mathbf Q^\top\\ \mathbf B&= \mathbf Q^\top \end{align*}\] and so we have that \(\mathbf Q^{-1}=\mathbf Q^\top\).
Now suppose \(\mathbf Q\) is \(n \times p\) with \(n\not = p\). Then as \(\mathbf Q^\top \mathbf Q=\mathbf I_{p}\), we must have \(\operatorname{tr}(\mathbf Q^\top \mathbf Q)=p\). This implies that \[\operatorname{tr}(\mathbf Q\mathbf Q^\top)=\operatorname{tr}(\mathbf Q^\top \mathbf Q)=p\] and so we cannot have \(\mathbf Q\mathbf Q^\top=\mathbf I_{n}\) as \(\operatorname{tr}{(\mathbf I_{n})}=n\).
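To see the non-square case concretely, here is a NumPy sketch in which a tall matrix with orthonormal columns is obtained from a reduced QR factorisation of a random matrix (an arbitrary illustrative construction):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # 5x2 matrix with orthonormal columns

print(np.allclose(Q.T @ Q, np.eye(2)))   # True:  Q^T Q = I_p
print(np.allclose(Q @ Q.T, np.eye(5)))   # False: Q Q^T is not I_n
print(np.trace(Q @ Q.T))                 # approximately 2.0 = p, whereas tr(I_n) = 5
```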
Corollary 2.2 If \(\mathbf q_1, \ldots , \mathbf q_n\) are mutually orthogonal \(n \times 1\) unit vectors then \[ \sum_{i=1}^n \mathbf q_i \mathbf q_i^\top = {\mathbf I}_n. \]
Proof. Let \(\mathbf Q\) be the matrix with \(i^{th}\) column \(\mathbf q_i\) \[\mathbf Q=\left( \begin{array}{ccc} | &&|\\ \mathbf q_1& \ldots& \mathbf q_n\\ | &&| \end{array}\right).\] Then \(\mathbf Q^\top \mathbf Q=\mathbf I_n\), and \(\mathbf Q\) is \(n\times n\). Thus by Lemma 2.2, we must also have \(\mathbf Q\mathbf Q^\top=\mathbf I_n\) and if we think about matrix-matrix multiplication as columns times rows (c.f. section 3.1), we get \[\mathbf I_n=\mathbf Q\mathbf Q^\top=\left( \begin{array}{ccc} | &&|\\ \mathbf q_1& \ldots& \mathbf q_n\\ | &&| \end{array}\right) \left( \begin{array}{ccc} - &\mathbf q_1^\top&-\\ & \vdots& \\ - &\mathbf q_n^\top&- \end{array}\right) = \sum_{i=1}^n \mathbf q_i \mathbf q_i^\top\] as required.
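A quick numerical check of the corollary, taking the orthonormal vectors to be the columns of a randomly generated orthogonal matrix (again via QR, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # a 4x4 orthogonal matrix

# sum of the outer products q_i q_i^T over the columns q_i of Q
S = sum(np.outer(q, q) for q in Q.T)
print(np.allclose(S, np.eye(4)))                   # True
```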
2.3.3 Projections
Definition 2.10 \(\stackrel{n \times n}{\mathbf P}\) is a projection matrix if \[\mathbf P^2 =\mathbf P\] i.e., if it is idempotent.
View \(\mathbf P\) as a map from a vector space \(W\) to itself. Let \(U=\operatorname{Im}(\mathbf P)\) and \(V=\operatorname{Ker}(\mathbf P)\) be the image and kernel of \(\mathbf P\).
Proposition 2.3 Any \(\mathbf w\in W\) can be written as the sum of a \(\mathbf u\in U\) and a \(\mathbf v\in V\).
Proof. Let \(\mathbf w\in W\). Then \[\mathbf w= \mathbf I_n \mathbf w=(\mathbf I-\mathbf P)\mathbf w+ \mathbf P\mathbf w\] Now \(\mathbf P\mathbf w\in \operatorname{Im}(\mathbf P)\) and \((\mathbf I-\mathbf P)\mathbf w\in \operatorname{Ker}(\mathbf P)\) as \[\mathbf P(\mathbf I-\mathbf P)\mathbf w= (\mathbf P-\mathbf P^2)\mathbf w={\boldsymbol 0}.\]
Proposition 2.4 If \(\stackrel{n \times n}{\mathbf P}\) is a projection matrix then \({\mathbf I}_n - \mathbf P\) is also a projection matrix, since \((\mathbf I_n-\mathbf P)^2 = \mathbf I_n - 2\mathbf P+ \mathbf P^2 = \mathbf I_n-\mathbf P\).
The kernel and image of \(\mathbf I-\mathbf P\) are the image and kernel (respectively) of \(\mathbf P\): \[\begin{align*} \operatorname{Ker}(\mathbf I-\mathbf P) &= U=\operatorname{Im}(\mathbf P)\\ \operatorname{Im}(\mathbf I-\mathbf P) &= V=\operatorname{Ker}(\mathbf P). \end{align*}\]
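The sketch below (NumPy) illustrates Propositions 2.3 and 2.4 with an idempotent but non-symmetric matrix, chosen purely for illustration:

```python
import numpy as np

# An idempotent but non-symmetric matrix: an (oblique) projection
P = np.array([[1.0, 1.0],
              [0.0, 0.0]])
I = np.eye(2)

print(np.allclose(P @ P, P))                    # True: P^2 = P
print(np.allclose((I - P) @ (I - P), I - P))    # True: I - P is also a projection

# Any w splits as w = Pw + (I - P)w with Pw in Im(P) and (I - P)w in Ker(P)
w = np.array([3.0, 2.0])
u, v = P @ w, (I - P) @ w
print(u + v)                                    # [3. 2.], i.e. w
print(P @ v)                                    # [0. 0.], so v lies in Ker(P)
```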
2.3.3.1 Orthogonal projection
We are mostly interested in orthogonal projections.
Definition 2.11 If \(W\) is an inner product space, and \(U\) is a subspace of \(W\), then the orthogonal projection of \(\mathbf w\in W\) onto \(U\) is the unique vector \(\mathbf u\in U\) that minimizes \[||\mathbf w-\mathbf u||.\]
In other words, the orthogonal projection of \(\mathbf w\) onto \(U\) is the best possible approximation of \(\mathbf w\) in \(U\).
As above, we can split \(W\) into \(U\) and its orthogonal complement \[U^\perp = \{\mathbf x\in W: \langle \mathbf x,\mathbf u\rangle = 0 \mbox{ for all } \mathbf u\in U\},\] i.e., \(W=U \oplus U^\perp\), so that any \(\mathbf w\in W\) can be uniquely written as \(\mathbf w=\mathbf u+\mathbf v\) with \(\mathbf u\in U\) and \(\mathbf v\in U^\perp\). This makes clear why we call \(\mathbf u= \arg \min_{\mathbf u\in U} ||\mathbf w-\mathbf u||\) the orthogonal projection of \(\mathbf w\): it splits \(\mathbf w\) into two orthogonal components, \(\mathbf w= \mathbf u+\mathbf v\) with \(\langle \mathbf u, \mathbf v\rangle=0\).
Proposition 2.5 If \(\{\mathbf u_1, \ldots, \mathbf u_k\}\) is a basis for \(U\), then the orthogonal projection matrix (i.e., the matrix that projects \(\mathbf w\in W\) onto \(U\)) is \[\mathbf P_U = \mathbf A(\mathbf A^\top \mathbf A)^{-1}\mathbf A^\top\] where \(\mathbf A=[\mathbf u_1\; \ldots\; \mathbf u_k]\) is the matrix with columns given by the basis vectors.
Proof. We need to find \(\mathbf u= \sum \lambda_i \mathbf u_i = \mathbf A\boldsymbol \lambda\) that minimizes \(||\mathbf w-\mathbf u||\).
\[\begin{align*} ||\mathbf w-\mathbf u||^2 &= \langle \mathbf w-\mathbf u, \mathbf w-\mathbf u\rangle\\ &= \mathbf w^\top \mathbf w- 2\mathbf u^\top \mathbf w+ \mathbf u^\top \mathbf u\\ &= \mathbf w^\top \mathbf w-2\boldsymbol \lambda^\top \mathbf A^\top \mathbf w+ \boldsymbol \lambda^\top \mathbf A^\top \mathbf A\boldsymbol \lambda. \end{align*}\]
Differentiating with respect to \(\boldsymbol \lambda\) and setting equal to zero gives \[{\boldsymbol 0}=-2 \mathbf A^\top \mathbf w+2 \mathbf A^\top \mathbf A\boldsymbol \lambda\] and hence \[ \boldsymbol \lambda= (\mathbf A^\top \mathbf A)^{-1}\mathbf A^\top \mathbf w.\] The orthogonal projection of \(\mathbf w\) is hence \[ \mathbf A\boldsymbol \lambda= \mathbf A(\mathbf A^\top \mathbf A)^{-1}\mathbf A^\top \mathbf w\] and the projection matrix is \[\mathbf P_U = \mathbf A(\mathbf A^\top \mathbf A)^{-1}\mathbf A^\top. \]
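A numerical sketch of this construction in NumPy; the basis below is an arbitrary illustrative choice of two vectors in \(\mathbb{R}^4\):

```python
import numpy as np

# Columns of A form a basis for U (an arbitrary illustrative choice)
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [2.0, 1.0]])

P_U = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P_U @ P_U, P_U))     # idempotent
print(np.allclose(P_U.T, P_U))         # symmetric

w = np.array([1.0, 2.0, 3.0, 4.0])
residual = w - P_U @ w
print(np.allclose(A.T @ residual, 0))  # the residual is orthogonal to every basis vector of U
```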
Notes:
If \(\{\mathbf u_1, \ldots, \mathbf u_k\}\) is an orthonormal basis for \(U\) then \(\mathbf A^\top \mathbf A= \mathbf I\) and \(\mathbf P_U = \mathbf A\mathbf A^\top\). We can then write \[\mathbf P_U\mathbf w= \sum_i (\mathbf u_i^\top \mathbf w) \mathbf u_i\] and \[\mathbf P_U = \sum_{i=1}^k \mathbf u_i\mathbf u_i^\top.\] Note that if \(U=W\) (so that \(\mathbf P_U\) is a projection from \(W\) onto \(W\), i.e., the identity), then \(\mathbf A\) is a square matrix (\(n\times n\)), and \(\mathbf A^\top\mathbf A=\mathbf I_n\) implies \(\mathbf A\mathbf A^\top=\mathbf I_n\) by Lemma 2.2, so \(\mathbf P_U=\mathbf I_n\) as required. The coordinates (with respect to the orthonormal basis \(\{\mathbf u_1, \ldots, \mathbf u_k\}\)) of a point \(\mathbf w\) projected onto \(U\) are \(\mathbf A^\top \mathbf w\).
\(\mathbf P_U^2=\mathbf P_U\), so \(\mathbf P_U\) is a projection matrix in the sense of Definition 2.10.
\(\mathbf P_U\) is symmetric (\(\mathbf P_U^\top=\mathbf P_U\)). This is true for orthogonal projection matrices, but not in general for projection matrices.
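A sketch of the orthonormal-basis shortcut (NumPy; the basis is the same illustrative one as in the previous sketch, orthonormalised here via a reduced QR factorisation):

```python
import numpy as np

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [2.0, 1.0]])                  # the same illustrative basis as above
A, _ = np.linalg.qr(B)                      # columns of A: an orthonormal basis for the same U

print(np.allclose(A.T @ A, np.eye(2)))      # True, so P_U reduces to A A^T
P_U = A @ A.T
print(np.allclose(P_U, B @ np.linalg.inv(B.T @ B) @ B.T))  # True: the same projection matrix

w = np.array([1.0, 2.0, 3.0, 4.0])
coords = A.T @ w                            # coordinates of the projection w.r.t. the orthonormal basis
print(np.allclose(P_U @ w, A @ coords))     # True
```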
Example 2.12 Consider the vector space \(\mathbb{R}^2\), let \(\mathbf u=\frac{1}{\sqrt{2}}\left(\begin{array}{c}1\\1\end{array}\right)\), and let \(U=\operatorname{span}\{\mathbf u\}\). The projection of \(\mathbf v\in \mathbb{R}^2\) onto \(U\) is given by \((\mathbf v^\top \mathbf u) \mathbf u\). So for example, if \(\mathbf v= (2, \; 1)^\top\), then its projection onto \(U\) is \[\mathbf P_U \mathbf v= (\mathbf v^\top \mathbf u)\,\mathbf u = \frac{3}{\sqrt{2}}\,\mathbf u = \frac{3}{2}\left(\begin{array}{c}1\\1\end{array}\right).\] Alternatively, if we treat \(\{\mathbf u\}\) as a basis for \(U\), then the coordinate of \(\mathbf P_U \mathbf v\) with respect to this basis is \(\mathbf u^\top \mathbf v = \frac{3}{\sqrt{2}}\). To check this, draw a picture!
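Checking this numerically (NumPy):

```python
import numpy as np

u = np.array([1.0, 1.0]) / np.sqrt(2)
v = np.array([2.0, 1.0])

P_U = np.outer(u, u)       # projection onto U = span{u}
print(P_U @ v)             # [1.5 1.5] = (3/2) (1, 1)^T
print(u @ v)               # 2.121... = 3/sqrt(2), the coordinate of P_U v w.r.t. u
```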
2.3.3.2 Geometric interpretation of linear regression
Consider the linear regression model \[\mathbf y= \mathbf X\boldsymbol \beta+\mathbf e\] where \(\mathbf y\in\mathbb{R}^n\) is the vector of observations, \(\mathbf X\) is the \(n\times p\) design matrix, \(\boldsymbol \beta\) is the \(p\times 1\) vector of parameters that we wish to estimate, and \(\mathbf e\) is an \(n\times 1\) vector of zero-mean errors.
Least-squares regression finds the value of \(\boldsymbol \beta\in \mathbb{R}^p\) that minimizes the sum of squared errors, i.e., we choose \(\boldsymbol \beta\) to minimize \[||\mathbf y- \mathbf X\boldsymbol \beta||_2^2.\]
We know that \(\mathbf X\boldsymbol \beta\) lies in the column space of \(\mathbf X\), so least-squares regression is precisely the problem of finding the orthogonal projection of \(\mathbf y\) onto \(U=\mathcal{C}(\mathbf X)\): \[\mathbf P_U\mathbf y=\arg \min_{\mathbf y' \in \mathcal{C}(\mathbf X)} ||\mathbf y-\mathbf y'||_2.\]
By Proposition 2.5 (with \(\mathbf A=\mathbf X\), assuming \(\mathbf X\) has full column rank so that \(\mathbf X^\top \mathbf X\) is invertible) this is \[\mathbf P_U\mathbf y= \mathbf X(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y=\hat{\mathbf y},\] which equals the usual prediction obtained in linear regression (\(\hat{\mathbf y}\) are often called the fitted values). We can also see that the choice of \(\boldsymbol \beta\) that specifies this point in \(\mathcal{C}(\mathbf X)\) is \[\hat{\boldsymbol \beta}=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y,\] which is the usual least-squares estimator.
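A minimal numerical check that the projection view matches least squares (a NumPy sketch with simulated data; the design matrix, coefficients and noise level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])  # design matrix with an intercept
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + 0.1 * rng.standard_normal(n)

# Fitted values via the orthogonal projection onto C(X)
P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat_proj = P @ y

# Fitted values via the least-squares solution
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(y_hat_proj, X @ beta_hat))   # True
```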