Part IV: Classification and Clustering

In Part IV, we focus on different methods of classification, i.e. allocating the observations in a sample to different subsets (or groups). We distinguish between supervised and unsupervised learning methods.

In supervised learning, our training data consist of the measurements \(\mathbf x_i\) on each case together with a class label \(y_i\). Our aim is to learn the mapping from \(\mathbf x\) to \(y\). Linear regression is also an example of a supervised learning method, although there the response \(y\) is continuous rather than a class label.

In Chapter 8, we focus on another supervised learning approach, discriminant analysis, which aims to allocate observations to distinct groups. We are given a training set of cases together with their group labels, and we must use this training sample to set up a suitable classification rule (classifying cases to groups). An important situation in which discriminant analysis is used is screening tests. Here, several variables may be measured on each individual, and we want to decide whether each individual is negative for some disease, in which case no further investigations are required, or positive, in which case further tests are required.
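To make the idea of a classification rule concrete, here is a minimal Python sketch of one very simple rule: allocate a new case to the group whose training-sample mean is closest. This nearest-mean rule is used purely for illustration and is not necessarily the rule developed in Chapter 8. The data values, the labels "negative" and "positive", and the function names `fit_class_means` and `classify` are all made up for this example.

```python
from statistics import mean


def fit_class_means(X, y):
    """Compute the mean measurement vector of each group in the training set."""
    groups = {}
    for x_i, y_i in zip(X, y):
        groups.setdefault(y_i, []).append(x_i)
    # zip(*rows) iterates over variables (columns) within each group.
    return {label: [mean(col) for col in zip(*rows)] for label, rows in groups.items()}


def classify(x, class_means):
    """Allocate x to the group whose mean is nearest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((a_j - b_j) ** 2 for a_j, b_j in zip(a, b))
    return min(class_means, key=lambda label: sq_dist(x, class_means[label]))


# Hypothetical training sample: two measurements per individual plus a known label.
X_train = [[1.0, 2.0], [1.5, 1.8], [0.8, 2.2], [4.0, 5.0], [4.2, 4.8], [3.9, 5.3]]
y_train = ["negative", "negative", "negative", "positive", "positive", "positive"]

class_means = fit_class_means(X_train, y_train)

# Apply the learned rule to a new, unlabelled individual.
prediction = classify([3.8, 4.9], class_means)
print(prediction)
```

The training labels are used only when computing the group means; once the rule is set up, a new case is classified from its measurements alone.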

In unsupervised learning, we do not have labelled data (i.e. there is no \(y\)); we only have the measurements \(\mathbf x\). The method has to find or invent its own representation of the data structure. PCA and MDS are both unsupervised learning methods. In Chapter 9, we will consider cluster analysis, in which we group observations into clusters, i.e. subsets of similar observations. Here, data labels (\(y\) variables) are not available, and the number of clusters will not typically be known in advance. The idea is to form clusters in such a way that experimental units within the same cluster are as similar as possible, in a suitable sense, and experimental units in different clusters are as dissimilar as possible.
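As a rough illustration of clustering without labels, the following Python sketch implements a basic version of the k-means procedure (Lloyd's algorithm), which alternates between assigning each point to its nearest centre and moving each centre to the mean of its points. It is one possible clustering method among many, chosen here only as an example. The data, the function name `k_means`, the choice of two clusters, and the deterministic start from the first \(k\) points are all assumptions made for the illustration; in practice the number of clusters usually has to be chosen.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two measurement vectors."""
    return sum((a_j - b_j) ** 2 for a_j, b_j in zip(a, b))


def k_means(X, k, n_iter=20):
    """Basic k-means, started (as an assumption) from the first k points."""
    centres = [list(x) for x in X[:k]]
    labels = [0] * len(X)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centre.
        labels = [min(range(k), key=lambda c: sq_dist(x, centres[c])) for x in X]
        # Update step: each centre moves to the mean of the points assigned to it.
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:  # keep the previous centre if a cluster is empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Hypothetical unlabelled data: only the measurements x are available.
X = [[1.0, 2.0], [4.0, 5.0], [1.5, 1.8], [4.2, 4.8], [0.8, 2.2], [3.9, 5.3]]

labels, centres = k_means(X, k=2)
print(labels)
```

Note that the cluster labels returned are arbitrary identifiers: the method recovers a grouping of the cases, not a meaning for each group.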