Chapter 9 Cluster Analysis

Discriminant analysis, covered in Chapter 8, is a supervised learning method: to train the classifier we had access to both the input \(\mathbf x\) and the label \(y\) for each case (the group it belonged to). This chapter focusses on cluster analysis, which is an unsupervised learning method: we have to train the model without knowledge of the true cluster labels. In many problems there may not be anything we could describe as a ‘true’ cluster label, in which case we are using cluster analysis as a way of finding descriptive statistics about the data.

The aim is to group cases into ‘clusters’ such that cases within each cluster are more closely related to each other than to cases in other clusters. Some methods do this using the attributes/measurements \(\mathbf x\) for each case. These methods include k-means clustering and model-based clustering. In contrast, other methods, such as hierarchical clustering, use only the proximities or distances between cases (as in MDS).
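To make the distinction concrete, here is a minimal Python sketch that is not part of the original notes. The toy data `X`, the function names `kmeans` and `single_linkage`, and the choice of single linkage as the hierarchical rule are all assumptions made purely for illustration. The point is only that the first function needs the measurements \(\mathbf x\) themselves (to compute cluster means), whereas the second sees nothing but the matrix of pairwise distances.

```python
import math

# Hypothetical toy data: six cases with two measurements each.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
     (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]


def euclid(a, b):
    """Euclidean distance between two cases."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kmeans(X, init_centres, n_iter=10):
    """Attribute-based: uses the coordinates x to compute cluster means."""
    centres = list(init_centres)
    labels = []
    for _ in range(n_iter):
        # Assign each case to its nearest centre.
        labels = [min(range(len(centres)), key=lambda k: euclid(x, centres[k]))
                  for x in X]
        # Move each centre to the mean of the cases assigned to it.
        for k in range(len(centres)):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:
                centres[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def single_linkage(D, n_clusters):
    """Distance-based: uses only the pairwise distance matrix D."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = None
        # Find the two clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(D)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# k-means is given the data; the hierarchical method is given distances only.
kmeans_labels = kmeans(X, init_centres=[X[0], X[3]])
D = [[euclid(a, b) for b in X] for a in X]
hier_labels = single_linkage(D, n_clusters=2)

print(kmeans_labels)  # [0, 0, 0, 1, 1, 1]
print(hier_labels)    # [0, 0, 0, 1, 1, 1]
```

On this well-separated toy example both approaches recover the same two groups. In general they need not agree, and the distance-based method could equally be run on a dissimilarity matrix for which no underlying coordinates are available.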

The videos for this chapter are available at the following links: