2.4 Multi-output Linear Model
In the standard linear model the responses \(y_i\) are univariate. In the multivariate linear model, the data are \(\{\mathbf x_1, \mathbf y_1\}, \ldots, \{\mathbf x_n, \mathbf y_n\}\) with the responses \(\mathbf y_i \in \mathbb{R}^q\). Here, the linear model takes the form \[\begin{equation} \mathbf y_i= \mathbf B^\top \mathbf x_i +{\pmb \epsilon}_i, \qquad i=1, \ldots , n, \tag{2.5} \end{equation}\] where \(\mathbf B\) is now a \(p \times q\) parameter matrix and \({\pmb \epsilon}_i\in \mathbb{R}^q\) is a \(q \times 1\) error vector. The model (2.5) may be written in matrix form as \[\begin{equation} \mathbf Y= \mathbf X\mathbf B+\mathbf E, \tag{2.6} \end{equation}\] where \[\mathbf Y= \begin{pmatrix} - & \mathbf y_1^\top &-\\ &\vdots&\\ -&\mathbf y_n^\top&-\end{pmatrix}\] is the \(n \times q\) data matrix for the \(y\)-variables, \(\mathbf X\) is the \(n \times p\) data matrix as defined before, and \[\stackrel{n \times q}{\mathbf E}=\begin{pmatrix} - & \boldsymbol \epsilon_1^\top &-\\ &\vdots&\\ -&\boldsymbol \epsilon_n^\top&-\end{pmatrix}\] is the \(n \times q\) matrix of errors.
How should we estimate \(\mathbf B\)? One approach is to choose \(\mathbf B\) to minimize the sum of squared errors: \[\begin{align}\sum_{i=1}^n\sum_{j=1}^q \epsilon_{ij}^2 &= \operatorname{tr}(\mathbf E^\top \mathbf E) \\ &= \operatorname{tr}((\mathbf Y-\mathbf X\mathbf B)^\top(\mathbf Y-\mathbf X\mathbf B))\\ &=||\mathbf Y-\mathbf X\mathbf B||_F^2. \end{align}\]
Proposition 2.1 The least squares estimator of \(\mathbf B\) is \[\begin{equation} \hat{\mathbf B}= (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf Y. \tag{2.7} \end{equation}\]

You will see how to fit multivariate linear models in R in the computer tasks.
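As a preview, here is a minimal R sketch (not one of the course's computer tasks; the simulated data and variable names are made up for illustration) showing that lm() accepts a matrix response and that its coefficients agree with the direct formula (2.7).

```r
set.seed(1)
n <- 100; p <- 3; q <- 2
X <- matrix(rnorm(n * p), n, p)                  # n x p inputs
B <- matrix(c(1, 2, -1, 0.5, 0, 3), p, q)        # true p x q coefficient matrix
Y <- X %*% B + matrix(rnorm(n * q, sd = 0.1), n, q)

# lm() accepts a matrix response; each column of Y is regressed on X
fit <- lm(Y ~ X - 1)                             # "- 1" omits the intercept
coef(fit)                                        # p x q matrix of estimates

# Direct computation of (2.7): Bhat = (X'X)^{-1} X'Y
Bhat <- solve(crossprod(X), crossprod(X, Y))
all.equal(unname(coef(fit)), Bhat)               # TRUE up to numerical error
```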
Proof. NON-EXAMINABLE The usual proof of this involves differentiating with respect to a matrix, and is horribly messy. However, there is a geometric proof, as in the univariate case.
The space of matrices \(M_{n,q} = \{\mathbf Y\in \mathbb{R}^{n \times q} \}\) is a vector space, and we can equip it with the inner product \[\langle \mathbf X, \mathbf Y\rangle_F = \operatorname{tr}(\mathbf X^\top \mathbf Y).\] This is the Frobenius inner product, and it induces the Frobenius norm on \(M_{n,q}\): \[||\mathbf X||_F = \langle \mathbf X, \mathbf X\rangle_F^{\frac{1}{2}}.\]
Consider the matrix analogue of the column space of \(\mathbf X\) \[\mathcal{C}_q(\mathbf X) =\{\mathbf X\mathbf B: \mathbf B\in \mathbb{R}^{p\times q}\}.\] This is a vector subspace of \(M_{n,q}\). As before, we can see that
\[\min_{\mathbf B} ||\mathbf Y-\mathbf X\mathbf B||_F^2=\min_{\mathbf Y' \in \mathcal{C}_q(\mathbf X)} ||\mathbf Y- \mathbf Y'||_F^2.\]
In other words, we want to find the orthogonal projection of \(\mathbf Y\) onto \(\mathcal{C}_q(\mathbf X)\). Let \(\hat{\mathbf Y} = \arg \min_{\mathbf Y' \in \mathcal{C}_q(\mathbf X)} ||\mathbf Y- \mathbf Y'||_F^2\) be the orthogonal projection, and let \(\hat{\mathbf B}\in \mathbb{R}^{p \times q}\) be the matrix such that \(\hat{\mathbf Y}=\mathbf X\hat{\mathbf B}\). Then \[\mathbf Y-\hat{\mathbf Y} \in \mathcal{C}_q(\mathbf X)^\perp\] and so \[\langle \mathbf Z, \mathbf Y-\hat{\mathbf Y}\rangle_F = 0 \quad \mbox{ for all}\quad \mathbf Z\in \mathcal{C}_q(\mathbf X).\] Writing \(\mathbf Z= \mathbf X\mathbf A\) for some \(\mathbf A\in M_{p,q}\) we can see we must have
\[\operatorname{tr}(\mathbf A^\top \mathbf X^\top(\mathbf Y-\mathbf X\hat{\mathbf B}))=0 \quad \mbox{ for all}\quad \mathbf A\in M_{p,q}.\]
Setting \(\mathbf A=\mathbf e_{ij}\), the standard basis matrices for \(M_{p,q}\) (i.e., the \(p\times q\) matrix that is all zeros except for the \(ij\) entry, which is \(1\)), for \(i=1,\ldots, p\) and \(j=1, \ldots,q\), we get that \[\mathbf X^\top(\mathbf Y-\mathbf X\hat{\mathbf B})=\boldsymbol 0_{p,q}.\] Rearranging this gives the result.
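A quick numerical illustration of the orthogonality condition in the proof (a sketch with simulated data; all names are made up): the residual \(\mathbf Y-\mathbf X\hat{\mathbf B}\) is Frobenius-orthogonal to every \(\mathbf X\mathbf A\in\mathcal{C}_q(\mathbf X)\), or equivalently \(\mathbf X^\top(\mathbf Y-\mathbf X\hat{\mathbf B})=\boldsymbol 0\).

```r
set.seed(2)
n <- 40; p <- 3; q <- 2
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * q), n, q)
Bhat <- solve(crossprod(X), crossprod(X, Y))       # (X'X)^{-1} X'Y

frob <- function(U, V) sum(diag(crossprod(U, V)))  # <U, V>_F = tr(U'V)
A <- matrix(rnorm(p * q), p, q)                    # an arbitrary direction in M_{p,q}
frob(X %*% A, Y - X %*% Bhat)                      # ~ 0 up to rounding
max(abs(crossprod(X, Y - X %*% Bhat)))             # X'(Y - X Bhat) ~ 0
```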
Remarks:

Note how similar (2.7) is to its univariate counterpart (2.3). The only thing that is different is that \(\mathbf Y\) is now an \(n \times q\) matrix rather than an \(n \times 1\) vector.
The coefficients for the \(k^{th}\) output, i.e., the \(k^{th}\) column of \(\hat{\mathbf B}\), are just the OLS estimates from regressing the \(k^{th}\) column of \(\mathbf Y\) on \(\mathbf X\). That is, the multivariate linear regression estimates are just the univariate estimates applied to each output in turn: the different outputs do not affect each other's least squares estimates.
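This is easy to check numerically in R. The following sketch (simulated data, illustrative names) compares the second column of the multivariate fit with a separate regression of the second output alone.

```r
set.seed(3)
n <- 50; p <- 3; q <- 2
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * q), n, q)

fit_all <- lm(Y ~ X - 1)                      # all q outputs at once
fit_2   <- lm(Y[, 2] ~ X - 1)                 # output 2 on its own
all.equal(coef(fit_all)[, 2], coef(fit_2))    # TRUE: identical coefficients
```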
2.4.1 Normal linear model
If we assume that \[\begin{equation} {\pmb \epsilon}_1, \ldots , {\pmb \epsilon}_n \quad \text{are IID}\quad N_q({\mathbf 0}_q, \boldsymbol{\Sigma}), \tag{2.8} \end{equation}\] where \(\boldsymbol{\Sigma}\) is a \(q \times q\) covariance matrix, then we can also seek the maximum likelihood estimator of \(\mathbf B\).
Proposition 2.2 The maximum likelihood estimator of \(\mathbf B\) is
\[\begin{equation} \hat{\mathbf B}= (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf Y. \end{equation}\]

Proof. NON-EXAMINABLE
Assume \(\boldsymbol{\Sigma}\) is positive definite. Then the log-likelihood of \(\mathbf B\) and \(\boldsymbol{\Sigma}\) is \[\begin{align} \ell(\mathbf B, \boldsymbol{\Sigma})&=-\frac{nq}{2}\log(2\pi) -\frac{n}{2}\log(\vert \boldsymbol{\Sigma}\vert) \nonumber \\ & \qquad \qquad -\frac{1}{2}\text{tr}\left \{ (\mathbf Y-\mathbf X\mathbf B) \boldsymbol{\Sigma}^{-1} (\mathbf Y- \mathbf X\mathbf B)^\top\right \}. \tag{2.9} \end{align}\] We can see that maximizing the log-likelihood with respect to \(\mathbf B\) is equivalent to minimizing \[\text{tr}\left \{ (\mathbf Y-\mathbf X\mathbf B) \boldsymbol{\Sigma}^{-1} (\mathbf Y- \mathbf X\mathbf B)^\top\right \}\]
We can now modify the proof of Proposition 2.1 by using the following inner product on \(M_{n,q}\), \[\langle \mathbf X, \mathbf Y\rangle = \operatorname{tr}(\mathbf X\boldsymbol{\Sigma}^{-1}\mathbf Y^\top),\] and corresponding norm \[||\mathbf X||^2= \operatorname{tr}(\mathbf X\boldsymbol{\Sigma}^{-1}\mathbf X^\top).\] (This is an inner product because \(\boldsymbol{\Sigma}^{-1}\) is positive definite.) Then if we follow the same steps as before, finding \(\mathbf B\) to minimize \(||\mathbf Y-\mathbf X\mathbf B||\) is equivalent to finding \({\mathbf Y}' \in \mathcal{C}_q(\mathbf X)\) to minimize \(||\mathbf Y-{\mathbf Y}'||\), i.e., finding the orthogonal projection of \(\mathbf Y\) onto \(\mathcal{C}_q(\mathbf X)\), which we will denote by \(\hat{\mathbf Y}\) and write as \(\hat{\mathbf Y}=\mathbf X\hat{\mathbf B}\). Then, as before, we must have
\[\langle \mathbf X\mathbf A, \mathbf Y-\mathbf X\hat{\mathbf B}\rangle=0 \quad \mbox{ for all} \quad \mathbf A\in M_{p,q},\] which implies \[\operatorname{tr}(\mathbf X\mathbf A\boldsymbol{\Sigma}^{-1} (\mathbf Y-\mathbf X\hat{\mathbf B})^\top )=0.\] Now as \(\operatorname{tr}(\mathbf A\mathbf B\mathbf C)=\operatorname{tr}(\mathbf B\mathbf C\mathbf A)=\operatorname{tr}(\mathbf C\mathbf A\mathbf B)\), we must also have \[\operatorname{tr}(\mathbf A\boldsymbol{\Sigma}^{-1} (\mathbf Y-\mathbf X\hat{\mathbf B})^\top \mathbf X)=0\quad \mbox{ for all} \quad \mathbf A\in M_{p,q}.\] Setting \(\mathbf A= \mathbf e_{ij}\boldsymbol{\Sigma}\) gives \[(\mathbf Y-\mathbf X\hat{\mathbf B})^\top \mathbf X=\boldsymbol 0\] and the result follows.

Non-examinable extension
To prove this using calculus is tricky, or at least it is if we use the standard approach. However, there is another approach using the Fréchet derivative. Recall that the usual definition of the derivative of a function \(f: \mathbb{R}\rightarrow \mathbb{R}\) satisfies \[f(x+h)=f(x)+f'(x)h +o(h)\] as \(h \rightarrow 0\), where \(o(h)\) indicates a term that goes to zero faster than \(h\), i.e., \[f'(x) = \lim_{h\rightarrow 0} \frac{f(x+h)-f(x)}{h}.\]
If instead \(\mathbf x\in \mathbb{R}^d\), so that \(f: \mathbb{R}^d\rightarrow \mathbb{R}\), then the derivative is a vector and we usually write it as \(\nabla f(\mathbf x)\in \mathbb{R}^d\). The corresponding relationship is \[f(\mathbf x+\mathbf h)=f(\mathbf x)+\mathbf h^\top \nabla f(\mathbf x) +o(||\mathbf h||),\] which we could write as \[f(\mathbf x+\mathbf h)=f(\mathbf x)+\langle\mathbf h, \nabla f(\mathbf x) \rangle+o(||\mathbf h||),\] where \(\langle\cdot, \cdot \rangle\) is the usual Euclidean inner product. We say that \(\langle\mathbf h, \nabla f(\mathbf x)\rangle\) is the derivative of \(f\) at \(\mathbf x\) in the direction \(\mathbf h\).
This way of writing things hints at the definition of the Fréchet derivative. If now \(\mathbf x\) lives in an inner-product space and \(f\) is real-valued, then if, as \(\mathbf h\rightarrow 0\), \[f(\mathbf x+\mathbf h)=f(\mathbf x)+\langle\mathbf h, \mathbf g\rangle+o(||\mathbf h||),\] we say that \(\mathbf g\) is the Fréchet derivative of \(f\) at \(\mathbf x\), and that \(\langle\mathbf h, \mathbf g\rangle\) is the Fréchet derivative of \(f\) at \(\mathbf x\) in the direction \(\mathbf h\). It can be shown that for finite-dimensional vector spaces, the Fréchet derivative always equals the usual derivative.
Example 1 Recall the function from the first chapter… \[f(\mathbf x) = \mathbf a^\top \mathbf x.\] If we consider \(f(\mathbf x+\mathbf h)\) we can see that we have \[\begin{align*} f(\mathbf x+\mathbf h) &=\mathbf a^\top(\mathbf x+\mathbf h)\\ &= \mathbf a^\top \mathbf x+\mathbf a^\top \mathbf h\\ &= f(\mathbf x)+\langle \mathbf a, \mathbf h\rangle +o(||\mathbf h||), \end{align*}\] where we have used the fact that \(0\) is \(o(||\mathbf h||)\). Thus the derivative is \(\nabla_\mathbf xf(\mathbf x) =\mathbf a.\)
Example 2 Now consider the quadratic form \[f(\mathbf x) = \mathbf x^\top \mathbf A\mathbf x\] for a symmetric \(n\times n\) matrix \(\mathbf A\). Then \[\begin{align} f(\mathbf x+\mathbf h)&=(\mathbf x+\mathbf h)^\top \mathbf A(\mathbf x+\mathbf h)\\ &= f(\mathbf x)+\mathbf h^\top \mathbf A\mathbf x+ \mathbf x^\top \mathbf A\mathbf h+ \mathbf h^\top\mathbf A\mathbf h\\ &= f(\mathbf x)+\langle\mathbf h, 2\mathbf A\mathbf x\rangle + o(||\mathbf h||) \end{align}\] and so \(\nabla_\mathbf xf(\mathbf x) =2\mathbf A\mathbf x.\)
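As a sanity check, here is a small R sketch (an arbitrary symmetric A and point x; all names are illustrative) comparing the claimed gradient \(2\mathbf A\mathbf x\) of the quadratic form with a central finite-difference approximation.

```r
set.seed(4)
d <- 4
A <- crossprod(matrix(rnorm(d * d), d, d))    # a symmetric d x d matrix
x <- rnorm(d)
f <- function(x) drop(t(x) %*% A %*% x)       # f(x) = x' A x

grad_analytic <- 2 * A %*% x
grad_numeric  <- sapply(1:d, function(i) {
  h <- 1e-6; e <- rep(0, d); e[i] <- h
  (f(x + e) - f(x - e)) / (2 * h)             # central difference in coordinate i
})
max(abs(grad_analytic - grad_numeric))        # small: the two gradients agree
```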
An alternative proof
Returning to the multi-output linear regression problem, let \[f(\mathbf B)= \text{tr}\left \{ (\mathbf Y-\mathbf X\mathbf B) \boldsymbol{\Sigma}^{-1} (\mathbf Y- \mathbf X\mathbf B)^\top\right \}\]
We can now repeat the trick from above with the Fréchet derivative. The variable we're differentiating with respect to is now the matrix \(\mathbf B\), which lives in the inner product space \(M_{p,q}\) with inner product \(\langle \mathbf X, \mathbf Y\rangle =\operatorname{tr}(\mathbf X^\top \mathbf Y)\).
\[\begin{align} f(\mathbf B+\mathbf H) &=\text{tr}\left \{ (\mathbf Y-\mathbf X(\mathbf B+\mathbf H)) \boldsymbol{\Sigma}^{-1} (\mathbf Y- \mathbf X(\mathbf B+\mathbf H))^\top\right\}\\ &=f(\mathbf B)+\operatorname{tr}\{-\mathbf X\mathbf H\boldsymbol{\Sigma}^{-1}(\mathbf Y-\mathbf X\mathbf B)^\top\}\\ &\quad +\operatorname{tr}\{-(\mathbf Y-\mathbf X\mathbf B)\boldsymbol{\Sigma}^{-1}(\mathbf X\mathbf H)^\top\}+\operatorname{tr}\{\mathbf X\mathbf H\boldsymbol{\Sigma}^{-1}(\mathbf X\mathbf H)^\top\} \end{align}\] The final term here is \(o(||\mathbf H||)\), so let’s worry about the other two terms. Firstly by recalling again that \(\operatorname{tr}(\mathbf A\mathbf B\mathbf C)=\operatorname{tr}(\mathbf B\mathbf C\mathbf A)=\operatorname{tr}(\mathbf C\mathbf A\mathbf B)\)
we can see that \[\begin{align} \operatorname{tr}\{\mathbf X\mathbf H\boldsymbol{\Sigma}^{-1}(\mathbf Y-\mathbf X\mathbf B)^\top\} &= \operatorname{tr}\{\boldsymbol{\Sigma}^{-1}(\mathbf Y-\mathbf X\mathbf B)^\top \mathbf X\mathbf H\}\\ &=\langle (\boldsymbol{\Sigma}^{-1}(\mathbf Y-\mathbf X\mathbf B)^\top \mathbf X)^\top, \mathbf H\rangle. \end{align}\]
The other term works out to be the same, and so collecting terms, we get that
\[f(\mathbf B+\mathbf H) =f(\mathbf B)- \langle 2\mathbf X^\top (\mathbf Y-\mathbf X\mathbf B)\boldsymbol{\Sigma}^{-1}, \mathbf H\rangle +o(||\mathbf H||)\] and thus the derivative of \(f(\mathbf B)\) with respect to the matrix \(\mathbf B\) is \[\nabla_{\mathbf B}f(\mathbf B)= -2\mathbf X^\top (\mathbf Y-\mathbf X\mathbf B)\boldsymbol{\Sigma}^{-1}.\]
Setting this equal to zero, and noting that \(\boldsymbol{\Sigma}\) is full rank (as it is invertible), gives \[\mathbf X^\top (\mathbf Y-\mathbf X\mathbf B)=\boldsymbol 0,\] and the result follows.
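The same kind of finite-difference check can be run for the matrix gradient derived above, \(\nabla_{\mathbf B}f(\mathbf B)= -2\mathbf X^\top (\mathbf Y-\mathbf X\mathbf B)\boldsymbol{\Sigma}^{-1}\). The R sketch below uses simulated data and an arbitrary \(\mathbf B\); all names are illustrative.

```r
set.seed(5)
n <- 30; p <- 3; q <- 2
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * q), n, q)
Sigma <- crossprod(matrix(rnorm(q * q), q, q)) + diag(q)   # positive definite
B <- matrix(rnorm(p * q), p, q)

f <- function(B) sum(diag((Y - X %*% B) %*% solve(Sigma) %*% t(Y - X %*% B)))
grad_analytic <- -2 * t(X) %*% (Y - X %*% B) %*% solve(Sigma)

grad_numeric <- matrix(0, p, q)
h <- 1e-6
for (i in 1:p) for (j in 1:q) {
  E <- matrix(0, p, q); E[i, j] <- h
  grad_numeric[i, j] <- (f(B + E) - f(B - E)) / (2 * h)    # central difference
}
max(abs(grad_analytic - grad_numeric))   # small: agrees up to finite-difference error
```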
2.4.2 Reduced rank regression
Suppose we know that there is redundancy in the data, so that there is an \(r\)-dimensional active subspace of the input space \(\mathbb{R}^p\) that contains all the useful information about \(\mathbf y\). In other words, we believe there is a \(p \times r\) matrix \(\mathbf A\) such that \[\mathbf z=\mathbf A^\top\mathbf x\] is all we need to predict \(\mathbf y\). Then our regression model is \[\mathbf y= \mathbf C^\top \mathbf z+ \mathbf e= \mathbf C^\top \mathbf A^\top \mathbf x+ \mathbf e,\] where \(\mathbf C\) is an \(r\times q\) matrix of coefficients. Rewriting this back in terms of \(\mathbf x\), we can see that this is equivalent to assuming \(\mathbf B= \mathbf A\mathbf C\), or in matrix form \[\mathbf Y= \mathbf X\mathbf A\mathbf C+ \mathbf E.\] This is known as reduced rank regression: restricting \(\mathbf B\) to have rank at most \(r\) is equivalent to assuming \(\mathbf B= \mathbf A\mathbf C\) for \(p\times r\) and \(r \times q\) matrices \(\mathbf A\) and \(\mathbf C\).
Note that this factorization is not unique: for any \(r \times r\) orthogonal matrix \(\mathbf Q\), replacing \(\mathbf A\) and \(\mathbf C\) by \(\mathbf A\mathbf Q\) and \(\mathbf Q^\top \mathbf C\) results in the same model. Only the subspace spanned by the columns of \(\mathbf A\) is identifiable.
The least squares estimator subject to this rank constraint is \(\hat{\mathbf B}^{rr}=\hat{\mathbf B}^{ols}\mathbf V_r\mathbf V_r^\top\), where \(\mathbf V_r\) contains the first \(r\) right singular vectors of the fitted values \(\hat{\mathbf Y}=\mathbf X\hat{\mathbf B}^{ols}\).

Proof. If \(\mathbf B\) is at most rank \(r\) we can write it as \(\mathbf B=\mathbf A\mathbf C\), and then \[\begin{align} \arg \min_{\mathbf A, \mathbf C} ||\mathbf Y-\mathbf X\mathbf A\mathbf C||_F^2 &= \arg \min_{\mathbf A, \mathbf C} \left\{||\mathbf Y-\hat{\mathbf Y}||_F^2 + ||\hat{\mathbf Y}-\mathbf X\mathbf A\mathbf C||_F^2\right\}\\ &= \arg \min_{\mathbf A, \mathbf C} ||\hat{\mathbf Y}-\mathbf X\mathbf A\mathbf C||_F^2, \end{align}\] where \(\hat{\mathbf Y}=\mathbf X\hat{\mathbf B}^{ols}\). The first equality follows as \[\begin{align*} \langle \mathbf Y-\hat{\mathbf Y},\; \hat{\mathbf Y}-\mathbf X\mathbf A\mathbf C\rangle_F &= \langle \mathbf Y-\mathbf X\hat{\mathbf B}^{ols}, \; \mathbf X\hat{\mathbf B}^{ols}-\mathbf X\mathbf A\mathbf C\rangle_F\\ &=\langle \mathbf X^\top\mathbf Y-\mathbf X^\top\mathbf X\hat{\mathbf B}^{ols}, \; \hat{\mathbf B}^{ols}-\mathbf A\mathbf C\rangle_F\\ &=0 \mbox{ as }\mathbf X^\top\mathbf X\hat{\mathbf B}^{ols} = \mathbf X^\top\mathbf Y. \end{align*}\]
Now \(\mathbf X\mathbf A\mathbf C\) is at most rank \(r\). If the SVD of \(\hat{\mathbf Y}\) is \(\hat{\mathbf Y} = \mathbf U\boldsymbol{\Sigma}\mathbf V^\top\), then by the Eckart-Young-Mirsky theorem \[\hat{\mathbf Y}_r =\mathbf U_r\boldsymbol{\Sigma}_r\mathbf V_r^\top\] is the best rank \(r\) approximation to \(\hat{\mathbf Y}\), in the sense that it minimizes \(||\hat{\mathbf Y}-\mathbf W||_F\) over all matrices \(\mathbf W\) of rank at most \(r\).
But \[\mathbf X\hat{\mathbf B}^{ols}\mathbf V_r\mathbf V^\top_r = \hat{\mathbf Y}\mathbf V_r\mathbf V^\top_r = \hat{\mathbf Y}_r.\]
This proves the result. In the notation from above, we have \(\hat{\mathbf A}=\hat{\mathbf B}^{ols}\mathbf V_r\) and \(\hat{\mathbf C}=\mathbf V_r^\top\).
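The estimator is straightforward to compute in R. The sketch below is an illustration, not a packaged implementation; the function name reduced_rank_fit and the simulated data are made up. It forms the OLS fit, takes the SVD of the fitted values, and truncates to the first \(r\) right singular vectors.

```r
reduced_rank_fit <- function(X, Y, r) {
  B_ols <- solve(crossprod(X), crossprod(X, Y))   # (X'X)^{-1} X'Y
  Yhat  <- X %*% B_ols                            # fitted values
  V_r   <- svd(Yhat)$v[, 1:r, drop = FALSE]       # first r right singular vectors
  list(A = B_ols %*% V_r,                         # p x r
       C = t(V_r),                                # r x q
       B = B_ols %*% V_r %*% t(V_r))              # rank-r coefficient estimate
}

# Example with a true rank-1 coefficient matrix
set.seed(6)
n <- 200; p <- 4; q <- 3
X <- matrix(rnorm(n * p), n, p)
B_true <- matrix(rnorm(p), p, 1) %*% matrix(rnorm(q), 1, q)   # rank 1
Y <- X %*% B_true + matrix(rnorm(n * q, sd = 0.1), n, q)
fit <- reduced_rank_fit(X, Y, r = 1)
qr(fit$B)$rank                                    # 1: the estimate has rank 1
```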
2.4.2.1 Brillinger’s theorem
Brillinger’s theorem is a generalization of this result to a weighted error criterion: for a positive definite \(q\times q\) weight matrix \(\boldsymbol \Gamma\), the rank \(r\) coefficient matrix \(\mathbf B=\mathbf A\mathbf C\) minimizing \(||(\mathbf Y-\mathbf X\mathbf A\mathbf C)\boldsymbol \Gamma^{1/2}||_F^2\) is given by \(\hat{\mathbf A}=\hat{\mathbf B}^{ols}\boldsymbol \Gamma^{1/2}\mathbf V_r\) and \(\hat{\mathbf C}=\mathbf V_r^\top\boldsymbol \Gamma^{-1/2}\), where \(\mathbf V_r\) now contains the first \(r\) right singular vectors of \(\hat{\mathbf Y}\boldsymbol \Gamma^{1/2}\).
Notes
Usually this result is expressed in terms of random variables rather than sample covariance matrices. That is, we have zero-mean random variables \(\mathbf x\in \mathbb{R}^p\) and \(\mathbf y\in \mathbb{R}^q\), and seek to minimize \[\operatorname{tr}{\mathbb{E}}\left( \boldsymbol \Gamma^{1/2}(\mathbf y-\mathbf C^\top \mathbf A^\top \mathbf x)(\mathbf y-\mathbf C^\top \mathbf A^\top \mathbf x)^\top \boldsymbol \Gamma^{1/2}\right).\] If we substitute empirical estimates for the covariance matrices (i.e., replace \(\boldsymbol{\Sigma}_{xx}\) by \(\hat{\boldsymbol{\Sigma}}_{xx}=\frac{1}{n}\mathbf X^\top\mathbf X\), and so on), then this quantity is estimated, up to a factor of \(1/n\), by \(||(\mathbf Y- \mathbf X\mathbf A\mathbf C)\boldsymbol \Gamma^{1/2}||_F^2\). Brillinger’s theorem then works with the eigenvectors of \[\boldsymbol \Gamma^{1/2}\boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}\boldsymbol \Gamma^{1/2}.\]
If \(\boldsymbol \Gamma=\mathbf I\) then we recover the reduced rank regression estimator, \(\hat{\mathbf B}^{rr}=\hat{\mathbf A}\hat{\mathbf C}\).
Proof. Essentially the same as before. We have \[ ||(\mathbf Y-\mathbf X\mathbf A\mathbf C)\boldsymbol \Gamma^{1/2}||_F^2 = ||(\mathbf Y-\hat{\mathbf Y})\boldsymbol \Gamma^{1/2}||_F^2+||(\hat{\mathbf Y}-\mathbf X\mathbf A\mathbf C)\boldsymbol \Gamma^{1/2}||_F^2,\] and so we only need to minimize \[||(\hat{\mathbf Y}-\mathbf X\mathbf A\mathbf C)\boldsymbol \Gamma^{1/2}||_F^2.\]
The best rank \(r\) approximation to \(\hat{\mathbf Y}\boldsymbol \Gamma^{1/2} = \mathbf U\boldsymbol{\Sigma}\mathbf V^\top\) is \(\mathbf U_r \boldsymbol{\Sigma}_r \mathbf V_r^\top\), by Eckart-Young-Mirsky. Using the claimed \(\hat{\mathbf C}\) and \(\hat{\mathbf A}\), we find \[\begin{align*} \mathbf X\hat{\mathbf A}\hat{\mathbf C} \boldsymbol \Gamma^{1/2}&= \mathbf X\hat{\mathbf B}^{ols}\boldsymbol \Gamma^{1/2}\mathbf V_r\mathbf V_r^\top\\ &= \hat{\mathbf Y}\boldsymbol \Gamma^{1/2}\mathbf V_r \mathbf V_r^\top\\ &=\mathbf U_r \boldsymbol{\Sigma}_r \mathbf V_r^\top, \end{align*}\] and so we're done.

This result also shows the link to CCA.
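To make the weighted version concrete, here is a self-contained R sketch (the function name weighted_rr_fit and the simulated data are made up for illustration): whiten the OLS fitted values by \(\boldsymbol \Gamma^{1/2}\), truncate the SVD, and map back with \(\boldsymbol \Gamma^{-1/2}\). With \(\boldsymbol \Gamma=\mathbf I\) it reduces to the ordinary reduced rank regression estimator.

```r
weighted_rr_fit <- function(X, Y, r, Gamma) {
  B_ols <- solve(crossprod(X), crossprod(X, Y))
  # symmetric square root of Gamma (and its inverse) via the eigendecomposition
  eg <- eigen(Gamma, symmetric = TRUE)
  G_half     <- eg$vectors %*% diag(sqrt(eg$values)) %*% t(eg$vectors)
  G_half_inv <- eg$vectors %*% diag(1 / sqrt(eg$values)) %*% t(eg$vectors)
  V_r <- svd(X %*% B_ols %*% G_half)$v[, 1:r, drop = FALSE]
  B_ols %*% G_half %*% V_r %*% t(V_r) %*% G_half_inv      # Ahat %*% Chat
}

# Check that Gamma = I recovers the reduced rank regression estimator
set.seed(7)
n <- 200; p <- 4; q <- 3; r <- 2
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% matrix(rnorm(p * q), p, q) + matrix(rnorm(n * q), n, q)
B_ols <- solve(crossprod(X), crossprod(X, Y))
V_r <- svd(X %*% B_ols)$v[, 1:r, drop = FALSE]
all.equal(weighted_rr_fit(X, Y, r, diag(q)),
          B_ols %*% V_r %*% t(V_r))                        # TRUE
```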