6.5 Computer Tasks

Task 1

The eurodist dataset in R gives the road distances between 21 European cities. Note that it is stored as an object of class dist, as returned by the dist command, i.e. only the lower triangle of the distance matrix is stored. cmdscale will take this directly as input.

data(eurodist)
eurodist
?eurodist

Perform multidimensional scaling on this data, and find a two-dimensional set of points which has interpoint distances approximately equal to the data.
Plot these coordinates and label them with the city names. Does your plot look like the map of Europe?
Is the distance matrix eurodist a Euclidean distance matrix, and how do you know? If it is not Euclidean, why do you think that might be?
Create the Euclidean distance matrix from your set of 2-dimensional points. What is the Frobenius norm of the difference between this matrix and the original distance matrix? Use cmdscale to create a set of points in 3-dimensional space and recompute the distance matrix.
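One possible way to approach the steps above is sketched below. It is a minimal sketch rather than a model answer: the object names (mds_euro, coords, frob2, and so on) are made up for illustration, and flipping the second axis is only a cosmetic choice, since an MDS configuration is determined only up to rotation and reflection.

```r
# Classical MDS on the road distances; eig = TRUE also returns the eigenvalues
mds_euro <- cmdscale(eurodist, k = 2, eig = TRUE)
coords <- mds_euro$points

# Plot the configuration with city labels (second axis flipped for display only)
plot(coords[, 1], -coords[, 2], type = "n", asp = 1,
     xlab = "Coordinate 1", ylab = "Coordinate 2")
text(coords[, 1], -coords[, 2], labels = rownames(coords), cex = 0.7)

# Negative eigenvalues of the doubly centred matrix indicate that the
# distances cannot be reproduced exactly by points in Euclidean space
has_negative_eig <- any(mds_euro$eig < 0)

# Frobenius norm of the difference between fitted and original distances
D <- as.matrix(eurodist)
D2 <- as.matrix(dist(coords))
frob2 <- norm(D - D2, type = "F")

# Same computation for a 3-dimensional configuration
coords3 <- cmdscale(eurodist, k = 3)
frob3 <- norm(D - as.matrix(dist(coords3)), type = "F")
c(frob2 = frob2, frob3 = frob3)
```

Comparing frob2 and frob3 gives one way to judge how much the extra dimension helps.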

Task 2

Consider the synthetic data of 9 binary attributes on 11 cases.

df=structure(list(a = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0), b = c(0, 
0, 0, 0, 0, 0, 1, 0, 1, 0, 1), c = c(0, 0, 0, 0, 1, 0, 0, 1, 
0, 1, 0), d = c(1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0), e = c(0, 0, 
1, 0, 0, 0, 0, 0, 0, 0, 1), f = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 
0, 1), g = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0), h = c(1, 0, 0, 
0, 0, 0, 0, 1, 1, 0, 0), i = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 
0)), class = "data.frame", row.names = c(NA, -11L), .Names = c("a", 
"b", "c", "d", "e", "f", "g", "h", "i"))
df

##    a b c d e f g h i
## 1  0 0 0 1 0 0 0 1 0
## 2  0 0 0 0 0 0 1 0 0
## 3  0 0 0 0 1 1 1 0 0
## 4  0 0 0 0 0 0 1 0 0
## 5  1 0 1 0 0 0 0 0 0
## 6  0 0 0 0 0 0 0 0 1
## 7  0 1 0 1 0 0 0 0 0
## 8  1 0 1 0 0 0 0 1 0
## 9  0 1 0 0 0 0 0 1 1
## 10 0 0 1 1 0 0 0 0 1
## 11 0 1 0 0 1 1 0 0 0

Compute the Jaccard index and SMC similarity matrices for these data.
Perform classical MDS for both similarity matrices, producing a plot of the coordinates (in 2d). Are the results similar?
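The two similarity measures can be computed with a few matrix products. The sketch below uses hypothetical helper names (smc_sim, jaccard_sim, sim_to_dist) and a small toy matrix rather than the data above. It also assumes the conversion d_ij = sqrt(s_ii + s_jj - 2 s_ij) from similarities to distances; if your notes use a different convention, substitute it.

```r
# Simple matching coefficient for a binary cases-by-attributes matrix X
smc_sim <- function(X) {
  n11 <- X %*% t(X)              # attributes present in both cases
  n00 <- (1 - X) %*% t(1 - X)    # attributes absent in both cases
  (n11 + n00) / ncol(X)
}

# Jaccard index: joint absences are ignored.
# Assumes every case has at least one attribute present (otherwise 0/0).
jaccard_sim <- function(X) {
  n11 <- X %*% t(X)
  n00 <- (1 - X) %*% t(1 - X)
  n11 / (ncol(X) - n00)
}

# Assumed similarity-to-distance conversion: d_ij = sqrt(s_ii + s_jj - 2 s_ij)
sim_to_dist <- function(S) {
  s <- diag(S)
  as.dist(sqrt(pmax(outer(s, s, "+") - 2 * S, 0)))
}

# Toy example with 3 cases and 4 binary attributes
toy <- rbind(c(1, 1, 0, 0),
             c(1, 0, 0, 0),
             c(0, 0, 1, 1))
S_smc <- smc_sim(toy)
S_jac <- jaccard_sim(toy)

# Classical MDS on the distances derived from the Jaccard similarities
coords_jac <- cmdscale(sim_to_dist(S_jac), k = 2)
```

For the task itself, the same functions could be applied to as.matrix(df), with the two resulting configurations plotted side by side for comparison.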

Task 3

In this question we will look at data from 1888 on the fertility and socio-economic status of 47 French-speaking provinces in Switzerland.

data(swiss)
head(swiss)

##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

?swiss

We will use MDS to find which provinces are similar to each other.

Compute the Euclidean distance matrix for these data.
Use MDS to create a 2-dimensional representation of the data and plot these points, labelling them with the province name.
Use MDS to create a 3-dimensional representation of the data. You can plot this using the plot3d command from the rgl package. See, for example, here.
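A minimal sketch of these three steps follows. The names D_swiss and mds_swiss2 are illustrative, and mds3 is chosen to match the plot3d call used in this task. The variables are left unscaled here, which is an assumption; dist(scale(swiss)) would instead give each variable equal weight.

```r
data(swiss)

# Euclidean distances between provinces (dist uses Euclidean distance by default)
D_swiss <- dist(swiss)

# 2-dimensional classical MDS, labelled by province name
mds_swiss2 <- cmdscale(D_swiss, k = 2)
plot(mds_swiss2, type = "n", asp = 1,
     xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds_swiss2, labels = rownames(swiss), cex = 0.6)

# 3-dimensional classical MDS: a 47 x 3 matrix of coordinates
mds3 <- cmdscale(D_swiss, k = 3)
```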

library(rgl)
# mds3 is the matrix of 3-dimensional MDS coordinates you computed above
plot3d(mds3)

MDS can also be used to reveal hidden patterns in a correlation matrix. Find the correlation matrix, \(\mathbf R\), of the swiss data. Perform MDS using \(1-\mathbf R\) as the distance matrix and plot the results. Positively correlated covariates should appear close together on the same side of the plot.
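As a rough sketch of this step, with the illustrative names R_swiss and mds_cor, the matrix \(1-\mathbf R\) can be converted with as.dist and passed to cmdscale. Note that \(1-\mathbf R\) need not be a Euclidean distance matrix, so the configuration is only an approximation.

```r
data(swiss)

# Correlation matrix of the six variables
R_swiss <- cor(swiss)

# Treat 1 - R as a dissimilarity: perfectly correlated variables are at distance 0
mds_cor <- cmdscale(as.dist(1 - R_swiss), k = 2)

# Plot the variables, labelled by name
plot(mds_cor, type = "n", asp = 1,
     xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds_cor, labels = colnames(swiss))
```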

Task 4 (if you have time…)

Try MDS on the MNIST data, this time looking for a 3d representation. Colour the points by their digit label, and create some interactive 3d plots. Does this find useful structure in the data? Is it more informative than the 2d plots we created in the notes?

Read the description of more advanced methods here. Pick one, find an R package that implements it, and try it on the MNIST data.

Warning: The MNIST dataset is large, and so computations can take a long time if you use the full dataset. Thus I usually work with a selection of just 1000 images, which is enough to find interesting patterns in most cases.