9.5 Computer tasks

Q1

Perform hierarchical clustering with single, complete and average linkage using the iris data. You could also look at other methods, such as Ward's method (see ?hclust for details).

In each case, cut the dendrogram to give three distinct groups, and compute the confusion matrix comparing the clusters found with the Species label. Comment on which linkage method has worked best in this case.
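One possible way to organise this task is sketched below. It assumes unscaled Euclidean distances on the four measurement columns and uses "ward.D2" as the Ward variant; both are choices for illustration rather than requirements of the question.

```r
# Distance matrix on the four measurements (the Species column is left out)
d <- dist(iris[, 1:4])

# Fit each linkage, cut the tree into three groups, and cross-tabulate with Species
methods <- c("single", "complete", "average", "ward.D2")
confusion <- lapply(methods, function(m) {
  groups <- cutree(hclust(d, method = m), k = 3)
  table(cluster = groups, species = iris$Species)
})
names(confusion) <- methods
confusion
```

The cluster numbers returned by cutree are arbitrary, so a good method is one where each row of the table is dominated by a single species, not one where the counts fall on the diagonal.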

In practice we do not usually know the species (cluster) labels when carrying out cluster analysis. If you were expecting to see three distinct groups, can you still say anything about which methods work better without using the labels?
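As a starting point for this question, the sketch below looks at two label-free summaries: the sizes of the three groups and the average silhouette width. The silhouette is one possible internal measure, chosen here as an assumption; it comes from the cluster package, which is normally installed alongside R.

```r
library(cluster)  # provides silhouette()

d <- dist(iris[, 1:4])
methods <- c("single", "complete", "average", "ward.D2")

# For each linkage: group sizes and average silhouette width, no labels used
summaries <- lapply(methods, function(m) {
  groups <- cutree(hclust(d, method = m), k = 3)
  sil <- silhouette(groups, d)
  list(sizes = table(groups), avg_silhouette = mean(sil[, "sil_width"]))
})
names(summaries) <- methods
summaries
```

Very unbalanced group sizes (for example, one large group and a group of only a few points) can suggest that a method has split off outliers rather than found three populations.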

Compare the hierarchical clustering methods with the results of doing K-means clustering and model-based clustering (assuming multivariate normal distributions for each population).
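A minimal sketch of the two comparison methods follows. The seed, the number of random starts, and the use of the mclust package for the Gaussian mixture are assumptions made for illustration; the model-based part is skipped if mclust is not installed.

```r
X <- iris[, 1:4]

# K-means with several random starts, since a single start can give a poor local optimum
set.seed(1)
km <- kmeans(X, centers = 3, nstart = 20)
km_table <- table(cluster = km$cluster, species = iris$Species)
km_table

# Model-based clustering: a three-component multivariate normal mixture
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  mb <- Mclust(X, G = 3)
  print(table(cluster = mb$classification, species = iris$Species))
}
```

These tables can be read in the same way as the confusion matrices from the hierarchical methods.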

Q2

Download the Indian Premier League data from Moodle and load it into R. We will filter the data to keep only players who played at least 10 matches, and then select just the number of runs they scored (Runs), their high score (HS), their batting average (Avg), the number of balls they faced (BF), their strike rate (SR), and the number of 4s and 6s they hit.

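library(dplyr)
# Read the data (assumes IPL.csv is in the working directory)
IPL <- read.csv('IPL.csv')
# Keep players with at least 10 matches, then select the batting columns
IPL10 <- IPL %>%
  dplyr::filter(Mat.x >= 10) %>%
  dplyr::select(PLAYER, Runs.x, HS, Avg.x, BF, SR.x, X4s, X6s)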

Apply agglomerative hierarchical clustering to these data. You will need to first compute a distance matrix for the data, which can be done with the dist command.
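The sketch below shows one way to build the distance matrix. The helper scaled_dist is hypothetical (it is not part of the notes or of any package): it keeps only the numeric columns, so an identifier column such as PLAYER is dropped, and standardises each column, which is a choice you may or may not want to make. So that the code runs without IPL.csv, it is demonstrated on the built-in USArrests data as a stand-in.

```r
# Hypothetical helper: distance matrix from the numeric columns of a data frame,
# with each column standardised to mean 0 and standard deviation 1
scaled_dist <- function(df, method = "euclidean") {
  num <- df[, vapply(df, is.numeric, logical(1)), drop = FALSE]
  dist(scale(num), method = method)
}

# Stand-in data; for the exercise, replace USArrests with IPL10
# (setting rownames(IPL10) <- IPL10$PLAYER first gives labelled leaves)
d_euc <- scaled_dist(USArrests)
d_man <- scaled_dist(USArrests, method = "manhattan")

hc_avg <- hclust(d_euc, method = "average")
hc_avg
```

If a column you expect to be numeric is missing from the result, check how it was read in with str(), since read.csv stores any column containing non-numeric characters as text.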

Using the Euclidean distance, do single linkage, complete linkage, and average linkage give similar dendrograms?
Do your results change much if we use a different distance measure (e.g. Manhattan)?
There are several R packages for creating different types of dendrogram plots. Have a look at the link here and try creating an alternative type of dendrogram. Consider whether this has helped communicate anything of interest about the data.
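One way to make these comparisons less subjective is sketched below, using the cophenetic correlation (how well the dendrogram heights reproduce the original distances) as a summary; this measure is an assumption of the sketch, not something the questions require. The built-in USArrests data is again used as a stand-in for the numeric IPL10 columns, and the final line shows an alternative dendrogram layout available in base R.

```r
X <- scale(USArrests)  # stand-in data; swap in the numeric IPL10 columns

# All combinations of distance measure and linkage method
grid <- expand.grid(distance = c("euclidean", "manhattan"),
                    linkage = c("single", "complete", "average"),
                    stringsAsFactors = FALSE)

# Cophenetic correlation for each combination
grid$cophenetic_cor <- mapply(function(dm, lk) {
  d <- dist(X, method = dm)
  cor(d, cophenetic(hclust(d, method = lk)))
}, grid$distance, grid$linkage, USE.NAMES = FALSE)
grid

# An alternative display: a horizontal dendrogram
hc <- hclust(dist(X), method = "average")
plot(as.dendrogram(hc), horiz = TRUE)
```

A higher cophenetic correlation means the tree distorts the distances less, but it does not by itself say that the resulting clusters are more meaningful.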

Q3

Look at the data stored in the USArrests data frame in R. You can read about the data by typing ?USArrests.

Apply a selection of clustering methods to these data and discuss how many clusters appear to be present.
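A possible starting point is sketched below: the total within-cluster sum of squares from K-means for a range of k (to look for an "elbow"), together with group sizes from a hierarchical clustering cut at a few values of k. Standardising the variables, the range of k, and complete linkage are all assumptions made for illustration.

```r
X <- scale(USArrests)  # the variables are on different scales, so standardise

# Total within-cluster sum of squares for k = 1, ..., 8
set.seed(1)
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
data.frame(k = ks, wss = round(wss, 1))

# Group sizes when a complete-linkage tree is cut into 2, 3 and 4 groups
hc <- hclust(dist(X), method = "complete")
sizes <- lapply(2:4, function(k) table(cutree(hc, k = k)))
sizes
```

Plotting wss against k, and looking at the dendrogram itself with plot(hc), can both help in judging where adding another cluster stops giving a clear improvement.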