7.6 Computer tasks

Task 1

Download the wine dataset from the UCI Machine Learning Repository:

download.file(url="https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", destfile="wine.data")
wine <- read.csv("wine.data", header=FALSE)
colnames(wine) <- c("Type", "Alcohol", "Malic",
"Ash",
"Alcalinity",
"Magnesium",
"Phenols",
"Flavanoids",
"Nonflavanoids",
"Proanthocyanins",
"Color",
"Hue",
"Dilution",
"Proline")

Let’s assume this selection of wines is a random sample from the population of all possible wines. Conduct a multivariate hypothesis test of whether the mean alcohol content and mean malic acid content differ from 13 and 2, respectively.
Test whether the alcohol and malic acid content are significantly different between wines of type 1 and wines of type 2.
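Both tests above can be approached with Hotelling's \(T^2\) statistic. The following is a minimal sketch, not necessarily the method intended in the notes: the function names hotelling_one_sample and hotelling_two_sample are hypothetical helpers, the code uses R's cov (divisor \(n-1\)), so the scaling may differ from the convention in the notes, and the two-sample version assumes a common covariance matrix. The demonstration uses simulated data rather than the wine data.

```r
# One-sample Hotelling T^2 test of H0: mu = mu0.
# Assumes the rows of X are i.i.d. multivariate normal.
hotelling_one_sample <- function(X, mu0) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  d <- colMeans(X) - mu0
  S <- cov(X)                                  # unbiased estimate (divisor n - 1)
  T2 <- n * drop(t(d) %*% solve(S, d))
  Fstat <- (n - p) / (p * (n - 1)) * T2        # ~ F_{p, n-p} under H0
  list(T2 = T2, F = Fstat, df = c(p, n - p),
       p.value = pf(Fstat, p, n - p, lower.tail = FALSE))
}

# Two-sample Hotelling T^2 test of H0: mu1 = mu2.
# Assumes both groups are normal with the same covariance matrix.
hotelling_two_sample <- function(X1, X2) {
  X1 <- as.matrix(X1)
  X2 <- as.matrix(X2)
  n1 <- nrow(X1)
  n2 <- nrow(X2)
  p <- ncol(X1)
  d <- colMeans(X1) - colMeans(X2)
  # Pooled covariance estimate
  Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)
  T2 <- (n1 * n2) / (n1 + n2) * drop(t(d) %*% solve(Sp, d))
  df2 <- n1 + n2 - p - 1
  Fstat <- df2 / (p * (n1 + n2 - 2)) * T2      # ~ F_{p, n1+n2-p-1} under H0
  list(T2 = T2, F = Fstat, df = c(p, df2),
       p.value = pf(Fstat, p, df2, lower.tail = FALSE))
}

# Demonstration on simulated (not real) data
set.seed(1)
X_demo <- cbind(rnorm(50, 13, 0.8), rnorm(50, 2, 1))
Y_demo <- cbind(rnorm(40, 12.5, 0.8), rnorm(40, 2, 1))
res_one <- hotelling_one_sample(X_demo, c(13, 2))
res_two <- hotelling_two_sample(X_demo, Y_demo)
```

With the wine data loaded as above, the same helpers could be called as hotelling_one_sample(wine[, c("Alcohol", "Malic")], c(13, 2)), and hotelling_two_sample on the rows with Type == 1 and Type == 2.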

Task 2

The command rWishart can be used to simulate from a Wishart distribution. Alternatively, to sample from \(W_p(\boldsymbol{\Sigma},n)\), you can sample \(\mathbf x_1, \ldots, \mathbf x_n \sim N_p(\boldsymbol 0, \boldsymbol{\Sigma})\) independently and set \(\mathbf M= \sum_{i=1}^n \mathbf x_i \mathbf x_i^\top\).
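The two sampling approaches just described can be sketched as follows. This is an illustrative sketch only: the helper r_wishart_manual and the variable names are hypothetical, and the comparison at the end uses the standard result \(\mathbb{E}(\mathbf M) = n\boldsymbol{\Sigma}\), which is assumed here to be what the relevant proposition in the notes states.

```r
set.seed(42)
Sigma <- diag(c(2, 1))
n <- 10                      # degrees of freedom
p <- nrow(Sigma)
n_sims <- 10000

# Approach 1: built-in sampler; returns a p x p x n_sims array.
M_builtin <- rWishart(n_sims, df = n, Sigma = Sigma)

# Approach 2: build M = sum_i x_i x_i^T from n draws of N_p(0, Sigma).
R <- chol(Sigma)             # upper triangular with t(R) %*% R = Sigma
r_wishart_manual <- function() {
  Z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  X <- Z %*% R               # each row is a draw from N_p(0, Sigma)
  crossprod(X)               # t(X) %*% X = sum_i x_i x_i^T
}
M_manual <- replicate(n_sims, r_wishart_manual())  # p x p x n_sims array

# Elementwise Monte Carlo means and variances for each approach
mean_builtin <- apply(M_builtin, c(1, 2), mean)
mean_manual  <- apply(M_manual,  c(1, 2), mean)
var_builtin  <- apply(M_builtin, c(1, 2), var)
var_manual   <- apply(M_manual,  c(1, 2), var)

print(mean_builtin)
print(mean_manual)
print(n * Sigma)             # theoretical mean to compare against
```

Both Monte Carlo means should be close to \(n\boldsymbol{\Sigma}\), and the elementwise variances from the two approaches should be close to each other.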

Generate 10,000 samples \(\mathbf M_1, \ldots, \mathbf M_{10000}\) from a \(W_2(\boldsymbol{\Sigma}, 10)\) distribution with \(\boldsymbol{\Sigma}= \operatorname{diag}(2,1)\) using each of these two approaches. Compute the mean and variance of the two sets of samples and check that they agree with Proposition 7.6.
Set \(\mathbf a= (1\; 1)^\top\). Check empirically that \(\mathbf a^\top \mathbf M\mathbf a\sim 3 \chi^2_{10}\). Hint: plot the theoretical density on top of a histogram of the sampled quantities.
Suppose \(\mathbf x_1, \ldots, \mathbf x_{10} \sim N_2(\boldsymbol 0, \operatorname{diag}(2,1))\) independently. Empirically check Proposition 7.10 by simulating a large number of such samples, computing the sample covariance matrix of each, and comparing the mean and variance of these matrices with those of the Wishart distribution specified in the proposition.
Similarly, validate Corollary 7.4 by comparing the distribution of \(\gamma^2\) with an \(F_{p, n-p}\) distribution.
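For the check that \(\mathbf a^\top \mathbf M\mathbf a\sim 3 \chi^2_{10}\), one possible sketch of the histogram-plus-density comparison is given below. The variable names are illustrative. The scale factor is computed as \(\mathbf a^\top \boldsymbol{\Sigma}\mathbf a = 3\), and the density of \(s\,\chi^2_{10}\) at \(q\) is obtained by a change of variables as dchisq(q / s, 10) / s.

```r
set.seed(7)
Sigma <- diag(c(2, 1))
a <- c(1, 1)
n_df <- 10

# Draw Wishart matrices and form the quadratic form a^T M a for each
M <- rWishart(10000, df = n_df, Sigma = Sigma)
q <- apply(M, 3, function(m) drop(t(a) %*% m %*% a))

# Scale factor a^T Sigma a (equals 3 here)
s <- drop(t(a) %*% Sigma %*% a)

# Histogram on the density scale with the theoretical density overlaid
hist(q, breaks = 50, freq = FALSE,
     main = "a^T M a versus scaled chi-squared", xlab = "a^T M a")
curve(dchisq(x / s, df = n_df) / s, add = TRUE, col = "red", lwd = 2)

# Numerical comparison of moments: s * chi^2_k has mean s*k and variance 2*k*s^2
c(sample_mean = mean(q), theory_mean = s * n_df)
c(sample_var = var(q), theory_var = 2 * n_df * s^2)
```

A quantile-quantile plot against qchisq would be an alternative way to make the same comparison.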

Task 3

Download the exam data from Moodle.

library(dplyr)
load(file='exam.rda')
N <- nrow(exam)  # number of observations

We will now work through how to plot the confidence regions shown in the notes.

Firstly, let’s plot a circle with equation \[x^2+y^2=c^2\] or in vector form: \[\mathbf x^\top\mathbf x=c^2.\] Why must \(x\) be in the range \([-c,c]\)? Suppose \(c=10\). In this case you can plot a circle using the following commands:

theta <- seq(0, 2*pi, 0.01)
x <- 10*cos(theta)
y <- 10*sin(theta)
par(pty="s")  # ensures the plot window is square
plot(x,y,type='l')

Explain why this works. Note that if your plotting window is not square, your circle will look like an ellipse!

Let’s now plot the ellipse \[\mathbf x^\top \mathbf S^{-1}\mathbf x=c^2\] where \(\mathbf S\) is the covariance matrix for the exam data.

We can do this by noting that \(\mathbf u= \mathbf S^{-1/2}\mathbf x\) obeys the equation \[\mathbf u^\top\mathbf u=c^2,\] i.e., a circle. Thus you can plot an ellipse by using the code above to generate points \(\mathbf u\) on a circle, and then transforming them to points \(\mathbf x= \mathbf S^{1/2}\mathbf u\) on the ellipse. Plot the ellipse for \(\mathbf S\) given by the sample covariance matrix of the data.
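The circle-to-ellipse transformation can be sketched as follows. The matrix S and the value c_val below are hypothetical placeholders chosen only for illustration; with the exam data you would substitute the sample covariance matrix of the two variables of interest. The symmetric square root is computed from the eigendecomposition \(\mathbf S= \mathbf V\boldsymbol{\Lambda}\mathbf V^\top\) as \(\mathbf S^{1/2} = \mathbf V\boldsymbol{\Lambda}^{1/2}\mathbf V^\top\).

```r
# Hypothetical 2 x 2 covariance matrix, for illustration only
S <- matrix(c(4, 1.5, 1.5, 1), nrow = 2)
c_val <- 2

# Symmetric square root S^{1/2} = V diag(sqrt(lambda)) V^T
eig <- eigen(S, symmetric = TRUE)
S_half <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)

# Points u on the circle u^T u = c^2, stored as columns of a 2 x 500 matrix
theta <- seq(0, 2 * pi, length.out = 500)
circle <- c_val * rbind(cos(theta), sin(theta))

# Map each point to x = S^{1/2} u, which satisfies x^T S^{-1} x = c^2
ellipse <- S_half %*% circle

par(pty = "s")  # square plotting region
plot(ellipse[1, ], ellipse[2, ], type = "l", asp = 1,
     xlab = "x1", ylab = "x2")

# Sanity check: the quadratic form should equal c^2 at every plotted point
quad_form <- colSums(ellipse * solve(S, ellipse))
range(quad_form)
```

The eigenvectors in eig$vectors give the directions of the ellipse's axes, and c_val * sqrt(eig$values) gives the corresponding half-lengths.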
What are the major and minor axes of these ellipses?
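Finally, we can plot the ellipse \[(\mathbf x-{\boldsymbol{\mu}})^\top \mathbf S^{-1}(\mathbf x-{\boldsymbol{\mu}})=c^2\] by shifting the ellipse to be centered around \({\boldsymbol{\mu}}\). Use this to plot the 95% confidence region for the population mean \({\boldsymbol{\mu}}\) for the exam data. For the confidence region the ellipse is centered at the sample mean \(\bar{\mathbf x}\), since the region consists of the values of \({\boldsymbol{\mu}}\) satisfying \((\bar{\mathbf x}-{\boldsymbol{\mu}})^\top \mathbf S^{-1}(\bar{\mathbf x}-{\boldsymbol{\mu}})\leq c^2\). You will need to use Corollary 7.4 to determine the value of \(c\).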
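A sketch of the full confidence-region plot is given below, using simulated bivariate data as a stand-in for the exam data (the sample size, means and standard deviations are made up). It assumes normally distributed data and uses R's cov, which has divisor \(n-1\); under that convention \(c^2 = \frac{p(n-1)}{n(n-p)} F_{p,n-p,0.95}\). If the notes define \(\mathbf S\) with divisor \(n\), the constant obtained from Corollary 7.4 will differ accordingly.

```r
set.seed(3)
n <- 80                                          # hypothetical sample size
X <- cbind(rnorm(n, 50, 10), rnorm(n, 45, 12))   # stand-in for two exam columns
p <- ncol(X)
xbar <- colMeans(X)
S <- cov(X)                                      # divisor n - 1
alpha <- 0.05

# Squared radius of the region {mu : (xbar - mu)^T S^{-1} (xbar - mu) <= c2}
c2 <- p * (n - 1) / (n * (n - p)) * qf(1 - alpha, p, n - p)

# Symmetric square root of S
eig <- eigen(S, symmetric = TRUE)
S_half <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)

# Boundary: xbar + c * S^{1/2} u, with u on the unit circle
theta <- seq(0, 2 * pi, length.out = 500)
unit_circle <- rbind(cos(theta), sin(theta))
boundary <- xbar + sqrt(c2) * (S_half %*% unit_circle)

par(pty = "s")
plot(boundary[1, ], boundary[2, ], type = "l", asp = 1,
     xlab = "mu1", ylab = "mu2", main = "95% confidence region for the mean")
points(xbar[1], xbar[2], pch = 19)               # centre of the region

# Check that boundary points satisfy the defining equation
dev_from_mean <- boundary - xbar
quad_form <- colSums(dev_from_mean * solve(S, dev_from_mean))
range(quad_form)
```

A candidate mean vector mu0 lies inside the region exactly when its quadratic form, computed in the same way, is at most c2.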