Chapter 6 Multidimensional Scaling (MDS)

In PCA, we start with \(n\) data points \(\mathbf x_i \in \mathbb{R}^p\), and then try to find a low-dimensional projection of these points, \(\mathbf y_1, \ldots, \mathbf y_n \in \mathbb{R}^r\) with \(r<p\), chosen to minimize the reconstruction error (or, equivalently, to maximize the retained variance).

The focus in Multidimensional Scaling (MDS) is somewhat different. Instead of being given the data \(\mathbf X\), our starting point is often a matrix of distances or dissimilarities between the data points, \(\mathbf D\). For example, if we have data on \(n\) different experimental units, then we would be given the distances \(d_{ij}\) between any pair of experimental units \(i\) and \(j\). We compile these into an \(n\times n\) distance matrix \(\mathbf D=(d_{ij}: \, i,j=1, \ldots , n)\).

The goal of MDS is to find a set of points in a low-dimensional Euclidean space \(\mathbb{R}^r\), usually \(\mathbb{R}\) or \(\mathbb{R}^2\), whose inter-point distances (or dissimilarities) are as close as possible to the \(d_{ij}\). That is, we want to find \(\mathbf y_1, \ldots, \mathbf y_n \in \mathbb{R}^r\) whose distance matrix is approximately \(\mathbf D\), i.e., for which \[\operatorname{distance}(\mathbf y_i, \mathbf y_j) \approx d_{ij}.\] In other words, we are trying to create a spatial representation of the data, \(\mathbf y_1, \ldots, \mathbf y_n\), from a distance matrix \(\mathbf D\). The vectors \(\mathbf y_i\) have no meaning by themselves, but by visualising their spatial pattern we can hope to learn something about the dataset represented by \(\mathbf D\).

If we measure the error in terms of squared differences between distances, then we can write the goal of MDS as the following optimization problem: \[\begin{align} \mbox{Find} \quad& \mathbf y_1, \ldots, \mathbf y_n \in \mathbb{R}^r\\ \mbox{to minimize} \quad &\sum_{i=1}^n \sum_{j=1}^n (d_{ij} - d(\mathbf y_i, \mathbf y_j))^2.\tag{6.1} \end{align}\]
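To make (6.1) concrete, here is a minimal sketch (not part of the original notes) that minimizes this criterion numerically with a general-purpose optimizer; `D` is assumed to be an \(n \times n\) distance matrix stored as a NumPy array, and the function name is our own.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def mds_by_optimization(D, r=2, seed=0):
    """Minimize the criterion (6.1): sum_{i,j} (d_ij - ||y_i - y_j||)^2."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)

    def objective(y_flat):
        Y = y_flat.reshape(n, r)
        DY = squareform(pdist(Y))      # pairwise Euclidean distances of the y_i
        return np.sum((D - DY) ** 2)   # squared-error mismatch with D

    y0 = rng.standard_normal(n * r)    # random starting configuration
    result = minimize(objective, y0, method="L-BFGS-B")
    return result.x.reshape(n, r)      # n points in R^r
```

Since the objective is non-convex, a general-purpose optimizer may only find a local minimum; as noted at the end of this section, Euclidean distances admit an explicit solution instead. Note also that any solution is only determined up to rotations, reflections, and translations of the \(\mathbf y_i\), since these transformations leave all inter-point distances unchanged.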

As an illustrative example, consider the locations of some of England’s cities.

If we are told their latitudes and longitudes, it is easy to calculate the distances between the cities. That is, given the latitude and longitude of each city \(\mathbf x_i\), we can compute the matrix \(\mathbf D\) of pairwise distances (in km) between the cities; a code sketch of this computation follows the table:

London Birmingham Manchester Leeds Newcastle Liverpool Portsmouth Southampton Nottingham Bristol Sheffield Norwich Plymouth
London 0 163 262 273 403 286 103 112 175 170 228 159 308
Birmingham 163 0 114 149 282 125 195 179 73 124 105 217 281
Manchester 262 114 0 58 174 50 308 293 94 227 53 255 369
Leeds 273 149 58 0 135 105 335 323 98 271 47 230 420
Newcastle 403 282 174 135 0 199 469 458 231 401 181 328 542
Liverpool 286 125 50 105 199 0 317 299 132 219 101 299 346
Portsmouth 103 195 308 335 469 317 0 24 239 127 288 261 221
Southampton 112 179 293 323 458 299 24 0 229 103 276 268 202
Nottingham 175 73 94 98 231 132 239 229 0 193 53 169 353
Bristol 170 124 227 271 401 219 127 103 193 0 228 296 162
Sheffield 228 105 53 47 181 101 288 276 53 228 0 203 382
Norwich 159 217 255 230 328 299 261 268 169 296 203 0 453
Plymouth 308 281 369 420 542 346 221 202 353 162 382 453 0
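To see how such a table can be computed from latitudes and longitudes, here is a minimal sketch assuming great-circle distances in kilometres; the coordinates below are rough, well-known approximations for three of the cities, used purely for illustration, and the helper function is our own rather than the notes’ code.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Approximate coordinates (degrees); illustrative values, not the source data.
cities = {"London": (51.51, -0.13),
          "Birmingham": (52.49, -1.89),
          "Manchester": (53.48, -2.24)}
names = list(cities)
D = np.array([[haversine_km(*cities[a], *cities[b]) for b in names]
              for a in names])
print(np.round(D))   # e.g. London-Birmingham comes out near the 163 above
```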

But can we do the reverse and construct a map from the distance matrix alone? This is the aim of multidimensional scaling: MDS constructs a set of points, \(\mathbf y_1, \ldots, \mathbf y_n\), whose inter-point distances approximately match those given by the distance matrix \(\mathbf D\). In other words, it creates a map with a set of coordinates for which the distances between points are approximately the same as in the real data.
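As a sketch of this reverse construction (again an illustration rather than the notes’ own code), scikit-learn’s metric MDS minimizes a stress criterion of the squared-error form (6.1) and can be run directly on a precomputed distance matrix:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# Assumed inputs: D, the 13 x 13 city distance matrix above (NumPy array),
# and names, the corresponding list of city names.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = mds.fit_transform(D)   # 13 x 2 coordinates whose distances approximate D

# The recovered map is only determined up to rotation, reflection, and
# translation, so it may appear rotated or flipped relative to true geography.
plt.scatter(Y[:, 0], Y[:, 1])
for name, (x, y) in zip(names, Y):
    plt.annotate(name, (x, y))
plt.show()
```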

Of course, this illustrative example is not very interesting, as the original data (the city locations) are only two-dimensional, but in problems with high dimensional data, finding a way to represent the points in a low-dimensional space will make visualization and statistical analysis easier. We shall see, perhaps unsurprisingly, that there is a close connection between MDS and PCA.

Notation

There are many ways to compute distances between points. For example, you may have studied metrics previously, which are distance functions that satisfy certain axioms. In MDS, we do not need to work with distances that are metrics (although sometimes we do), and instead we only require that distances satisfy a weaker set of conditions:

Definition 6.1 The \(n \times n\) matrix \(\mathbf D=(d_{ij})_{i,j=1}^n\) is a distance matrix (sometimes called a dissimilarity matrix) if

  1. \(d_{ij}\geq 0\) for all \(i, j=1,\ldots,n\);
  2. \(d_{ii}=0\) for \(i=1,\ldots, n\); and
  3. \(\mathbf D=\mathbf D^\top\), i.e., \(\mathbf D\) is symmetric (\(d_{ij}=d_{ji}\)).

Note that we do not require distances to satisfy the triangle inequality \[\begin{equation} d_{ik} \leq d_{ij}+d_{jk}. \tag{6.2} \end{equation}\] A distance function that always satisfies the triangle inequality is called a metric distance, or simply a metric; a distance function that does not is called a non-metric distance.
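Whether a given matrix satisfies these conditions is easy to check computationally. Here is a small sketch (illustrative helper functions of our own) that tests the axioms of Definition 6.1 and, separately, the triangle inequality (6.2):

```python
import numpy as np

def is_distance_matrix(D, tol=1e-9):
    """Axioms of Definition 6.1: non-negative, zero diagonal, symmetric."""
    D = np.asarray(D, dtype=float)
    return ((D >= -tol).all()
            and np.allclose(np.diag(D), 0.0)
            and np.allclose(D, D.T))

def satisfies_triangle_inequality(D, tol=1e-9):
    """Check d_ik <= d_ij + d_jk, equation (6.2), over all triples (i, j, k)."""
    D = np.asarray(D, dtype=float)
    # For each intermediate point j, compare D with the matrix of d_ij + d_jk.
    return all((D <= D[:, [j]] + D[[j], :] + tol).all()
               for j in range(D.shape[0]))
```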

There are many ways to measure distance. Two common choices of metric distance, illustrated in the code sketch after this list, are

  • Euclidean distance \(||\mathbf x-\mathbf x'||_2\), also known as the ‘crow flies’ distance.
  • \(L_1\) distance \(||\mathbf x-\mathbf x'||_1\), sometimes called the Manhattan or taxicab metric.
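As a quick numerical illustration of these two metrics (a sketch using SciPy, with two arbitrary points):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0.0, 0.0],
              [3.0, 4.0]])
print(cdist(X, X, metric="euclidean"))   # ||x - x'||_2: off-diagonal entries 5
print(cdist(X, X, metric="cityblock"))   # ||x - x'||_1: off-diagonal entries 7
```

For the points \((0,0)\) and \((3,4)\), the Euclidean distance is \(5\) while the \(L_1\) distance is \(3+4=7\).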

But for the city data, we could also consider distance by road, i.e., the minimum distance by road between each pair of cities. This satisfies the axioms of Definition 6.1, and in fact it also satisfies the triangle inequality (the shortest route from \(i\) to \(k\) can be no longer than the shortest route that goes via \(j\)), so it is a metric distance. It is not, however, a Euclidean distance: in general there is no configuration of points in a Euclidean space whose inter-point distances exactly reproduce road distances.

The choice of which distance is appropriate is problem-specific (see Section 6.2 for an example). The focus in this chapter is on Euclidean distance, as this leads to an explicit mathematical solution to the optimization problem (6.1). However, MDS techniques have also been developed for non-Euclidean distances.