4.4 Similarity measures
So far we have presented classical MDS as starting with a distance (or dissimilarity) matrix \(\mathbf D=(d_{ij})_{i,j=1}^n\). In this setting, the larger \(d_{ij}\) is, the more distant, or dissimilar, object \(i\) is from object \(j\). We then convert \(\mathbf D\) to a centred inner product matrix \(\mathbf B\). We said that we can think of \(\mathbf B\) as being a similarity matrix.
We can also use a more general concept of similarity in MDS.
Definition 4.4 A similarity matrix is defined to be an \(n \times n\) matrix \(\mathbf F=(f_{ij})_{i,j=1}^n\) with the following properties:
1. Symmetry, i.e. \(f_{ij} =f_{ji}\), \(i,j=1, \ldots , n\).
2. For all \(i,j=1, \ldots , n\), \(f_{ij} \leq f_{ii}\).
Note that when working with similarities \(f_{ij}\), the larger \(f_{ij}\) is, the more similar objects \(i\) and \(j\) are.
Condition 1. implies that object \(i\) is as similar to object \(j\) as object \(j\) is to object \(i\) (symmetry).
Condition 2. implies that an object is at least as similar to itself as it is to any other object.
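Before moving on, here is a small R helper (hypothetical, written just for these notes, not part of any package) that checks both conditions of Definition 4.4 for a candidate matrix:

# check the two conditions of Definition 4.4
is_similarity <- function(f) {
  f <- as.matrix(f)
  cond1 <- isSymmetric(unname(f))   # condition 1: f_ij = f_ji
  cond2 <- all(f <= diag(f))        # condition 2: each f_ij <= f_ii (diag recycles row-wise)
  cond1 && cond2
}
is_similarity(matrix(c(1, 0.5, 0.5, 1), 2, 2))   # TRUE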
In this section, we consider the analysis of measures of similarity as opposed to measures of dissimilarity. We begin by showing that we can convert a positive semi-definite similarity matrix \(\mathbf F\) into a distance matrix \(\mathbf D\) and then into a centred inner product matrix \(\mathbf B\), allowing us to use the classical MDS approach from the previous section. Specifically, define distances by \[ d_{ij}=\left(f_{ii}+f_{jj}-2f_{ij}\right)^{1/2}. \] We claim that the resulting \(\mathbf D\) is a Euclidean distance matrix, and that the corresponding centred inner product matrix satisfies \(\mathbf B=\mathbf H\mathbf F\mathbf H\), which is Equation (4.6).
Proof. Firstly, note that as \(\mathbf F\) is a similarity matrix, \(f_{ij}\leq f_{ii}\) and \(f_{ij}=f_{ji}\leq f_{jj}\) by conditions 1. and 2., so \(f_{ii}+f_{jj}-2f_{ij}\geq 0\) and the \(d_{ij}\) are all well-defined (i.e. real, not imaginary).
We will now show that Equation (4.6) holds. Let \(\mathbf A= -\frac{1}{2}\mathbf D\odot \mathbf D\) as in Equation (4.3). Then \[ a_{ij}=-\frac{1}{2}d_{ij}^2 =f_{ij}-\frac{1}{2}(f_{ii}+f_{jj}). \]
Define \[ t=n^{-1}\sum_{i=1}^n f_{ii}. \] Then, averaging over \(j=1, \ldots , n\) for fixed \(i\), \[ \bar{a}_{i+}=n^{-1}\sum_{j=1}^n a_{ij} = \bar{f}_{i+}-\frac{1}{2}(f_{ii}+t); \] similarly, \[ \bar{a}_{+j}=n^{-1}\sum_{i=1}^n a_{ij}=\bar{f}_{+j}-\frac{1}{2}(f_{jj}+t), \] and also \[ \bar{a}_{++}=n^{-2}\sum_{i,j=1}^n a_{ij}=\bar{f}_{++}-t. \] Recall property (vii) from Section 1.4: \[ b_{ij}=a_{ij}-\bar{a}_{i+}-\bar{a}_{+j}+\bar{a}_{++}, \] noting that \(\mathbf A\) is symmetric. Thus \[\begin{align*} b_{ij}&=f_{ij}-\frac{1}{2}(f_{ii}+f_{jj})-\bar{f}_{i+}+\frac{1}{2}(f_{ii}+t)\\ & \qquad -\bar{f}_{+j}+\frac{1}{2}(f_{jj}+t) +\bar{f}_{++}-t\\ &= f_{ij}-\bar{f}_{i+}-\bar{f}_{+j}+\bar{f}_{++}. \end{align*}\] Consequently, \(\mathbf B=\mathbf H\mathbf F\mathbf H\), again using property (vii) from Section 1.4.
So we have shown that \(\mathbf B= \mathbf H\mathbf F\mathbf H\). It only remains to show that \(\mathbf D\) is Euclidean. Since \(\mathbf F\) is positive semi-definite by assumption, and \(\mathbf H^\top =\mathbf H\), it follows that \(\mathbf B=\mathbf H\mathbf F\mathbf H\) is also positive semi-definite. So by Theorem 4.1, \(\mathbf D\) is a Euclidean distance matrix.
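The following short R sketch verifies the result numerically (the similarity matrix F_ex is made up for this illustration): starting from a positive semi-definite similarity matrix, the centred inner product matrix computed from the distances agrees with \(\mathbf H\mathbf F\mathbf H\) and has no negative eigenvalues.

# a small positive semi-definite similarity matrix (made up for this sketch)
F_ex <- crossprod(matrix(c(1, 0, 1,
                           0, 1, 1), nrow = 2, byrow = TRUE))
n_ex <- nrow(F_ex)
H_ex <- diag(n_ex) - matrix(1/n_ex, n_ex, n_ex)              # centring matrix H
D_ex <- sqrt(outer(diag(F_ex), diag(F_ex), "+") - 2 * F_ex)  # d_ij = (f_ii + f_jj - 2 f_ij)^(1/2)
B_ex <- -0.5 * H_ex %*% (D_ex * D_ex) %*% H_ex               # B = HAH with A = -D.D/2, Equation (4.3)
all.equal(B_ex, H_ex %*% F_ex %*% H_ex)                      # TRUE: B = HFH
eigen(B_ex, symmetric = TRUE)$values                         # all >= 0, so D is Euclidean

4.4.1 Binary attributes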
One important class of problems is when the similarity between any two objects is measured by the number of common attributes. The underlying data on each object is a binary vector of 0s and 1s indicating absence or presence of an attribute. These binary vectors are then converted to similarities by comparing which attributes two objects have in common.
We illustrate this through two examples.
Example 4.1 Suppose there are 4 attributes we wish to consider.
- Attribute 1: Carnivore? If yes, put \(x_1=1\); if no, put \(x_1=0\).
- Attribute 2: Mammal? If yes, put \(x_2=1\); if no, put \(x_2=0\).
- Attribute 3: Natural habitat in Africa? If yes, put \(x_3=1\); if no, put \(x_3=0\).
- Attribute 4: Can climb trees? If yes, put \(x_4=1\); if no, put \(x_4=0\).
Consider a lion. Each of the attributes is present so \(x_1=x_2=x_3=x_4=1\). Its attribute vector is \(\begin{pmatrix} 1&1&1&1\end{pmatrix}^\top\).
What about a tiger? In this case, 3 of the attributes are present (attributes 1, 2 and 4) but attribute 3 is absent. So for a tiger, \(x_1=x_2=x_4=1\) and \(x_3=0\); in vector form, its attribute vector is \(\begin{pmatrix} 1&1&0&1\end{pmatrix}^\top\).
How might we measure the similarity of lions and tigers based on the presence or absence of these four attributes?
First form a \(2 \times 2\) table as follows, with rows indexing the lion's attribute values and columns the tiger's. \[ \begin{array}{c|cc} & 1 & 0\\ \hline 1& a & b\\ 0& c & d \end{array} \] Here \(a\) counts the number of attributes common to both lion and tiger; \(b\) counts the number of attributes the lion has but the tiger does not have; \(c\) counts the number of attributes the tiger has that the lion does not have; and \(d\) counts the number of attributes which neither the lion nor the tiger has. In the above, \(a=3\), \(b=1\) and \(c=d=0\).
How might we make use of the information in the \(2 \times 2\) table to construct a measure of similarity? There are two commonly used measures of similarity.
The **Jaccard similarity coefficient** is the proportion of the attributes present in at least one of the two objects that are shared by both: \[ \frac{a}{a+b+c}, \] which gives \(0.75\) in this example.
The **simple matching coefficient (SMC)** is given by \[\begin{equation} \frac{a+d}{a+b+c+d}. \tag{4.7} \end{equation}\] This is also \(0.75\) in this example.
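As a quick arithmetic check, the sketch below computes \(a,b,c,d\) and both coefficients for the lion and tiger of Example 4.1:

# attribute vectors from Example 4.1
lion  <- c(1, 1, 1, 1)
tiger <- c(1, 1, 0, 1)
a <- sum(lion == 1 & tiger == 1)   # both present: 3
b <- sum(lion == 1 & tiger == 0)   # lion only: 1
c <- sum(lion == 0 & tiger == 1)   # tiger only: 0 (note: this shadows base c())
d <- sum(lion == 0 & tiger == 0)   # both absent: 0
a / (a + b + c)                    # Jaccard: 0.75
(a + d) / (a + b + c + d)          # SMC: 0.75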
Although the Jaccard coefficient and the SMC are the same in this case, this is not true in general. The SMC counts matches of both kinds, mutual presence and mutual absence (an attribute absent from both objects still counts as a match), and compares the number of matches to the total number of attributes (\(a+b+c+d\)). The Jaccard similarity only counts mutual presence and compares this to the number of attributes that are present in at least one of the two objects.
This difference between the two similarities can matter. For example, consider a market basket analysis in which we compare shoppers. If a shop sells \(n\) different products, we might record the products purchased by each shopper (their ‘basket’) as a vector of length \(n\), with a \(1\) in position \(i\) if the shopper purchased product \(i\), and \(0\) otherwise. The observation for each shopper will then be a long vector (depending on how many different products are available), consisting mostly of zeros.
There are many other possible ways of measuring similarity. For example, we could consider weighted versions of the above if we wish to weight different attributes differently.
In market basket analysis, for example, the baskets of the two shoppers being compared will typically contain only a small fraction of all the products available in the store. The SMC will therefore return a very high similarity even when the baskets bear very little resemblance to each other, which makes the Jaccard coefficient the more appropriate measure of similarity in this context. For example, consider a supermarket with 1000 products and two customers: the basket of the first customer contains salt and pepper, and the basket of the second contains salt and sugar. In this scenario, the similarity between the two baskets as measured by the Jaccard coefficient is \(1/3\), but it becomes \(0.998\) using the SMC.
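A direct numerical check of this example (the positions of salt, pepper and sugar in the basket vectors are chosen arbitrarily):

n_products <- 1000
basket1 <- basket2 <- numeric(n_products)
basket1[c(1, 2)] <- 1   # salt and pepper
basket2[c(1, 3)] <- 1   # salt and sugar
sum(basket1 == 1 & basket2 == 1) / sum(basket1 == 1 | basket2 == 1)   # Jaccard: 1/3
mean(basket1 == basket2)                                              # SMC: 0.998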
In other contexts, where 0 and 1 carry equivalent information (symmetry), the SMC is the better measure of similarity. For example, demographic characteristics stored as dummy variables, such as gender, are better compared with the SMC than with the Jaccard coefficient, since the impact of gender on similarity should not depend on whether male is coded as 0 and female as 1 or the other way around. With symmetric dummy variables, however, one can replicate the behaviour of the SMC by splitting each dummy into two binary attributes (in this case, male and female), thus transforming them into asymmetric attributes and allowing the use of the Jaccard coefficient without introducing coding bias. The SMC remains more computationally efficient in the case of symmetric dummy variables, since it does not require adding extra dimensions.
# read the binary animal-attribute data: one row per animal, one column per attribute
animal <- read.table('data/animal_similarity', sep='&', header=TRUE, row.names=1)
# simple matching coefficient: proportion of attributes on which x and y agree
SMC <- function(x, y) {
  sum(x == y) / length(x)
}
SMC(animal[1,], animal[2,])
## [1] 0.6666667
n <- nrow(animal)
# SMC similarity matrix: compare every pair of animals
F_SMC <- outer(1:n, 1:n, Vectorize(function(i, j) SMC(animal[i, ], animal[j, ])))
F_SMC
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0000000 0.6666667 0.5000000 0.5000000 0.5000000
## [2,] 0.6666667 1.0000000 0.5000000 0.5000000 0.1666667
## [3,] 0.5000000 0.5000000 1.0000000 1.0000000 0.3333333
## [4,] 0.5000000 0.5000000 1.0000000 1.0000000 0.3333333
## [5,] 0.5000000 0.1666667 0.3333333 0.3333333 1.0000000
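The animal names are stored as the row names of the data, so one way the labelled display below could have been produced (a reconstruction; the original code is not shown) is:

dimnames(F_SMC) <- list(rownames(animal), rownames(animal))
as.dist(F_SMC)   # lower triangle of the similarity matrix, with labels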
## Lion Giraffe Cow Sheep
## Giraffe 0.6666667
## Cow 0.5000000 0.5000000
## Sheep 0.5000000 0.5000000 1.0000000
## Human 0.5000000 0.1666667 0.3333333 0.3333333
# Jaccard coefficient: mutual presences over attributes present in at least one object
jaccard <- function(x, y) {
  x <- unlist(x)
  y <- unlist(y)
  sum(x == 1 & y == 1) / sum(x == 1 | y == 1)
}
jaccard(animal[1,], animal[2,])
## [1] 0.6
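The full Jaccard similarity matrix can be built exactly as F_SMC was (again, a reconstruction of the code behind the output shown below):

F_jac <- outer(1:n, 1:n, Vectorize(function(i, j) jaccard(animal[i, ], animal[j, ])))
F_jac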
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00 0.6 0.4 0.4 0.25
## [2,] 0.60 1.0 0.4 0.4 0.00
## [3,] 0.40 0.4 1.0 1.0 0.00
## [4,] 0.40 0.4 1.0 1.0 0.00
## [5,] 0.25 0.0 0.0 0.0 1.00
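As before, attaching the animal names and printing the lower triangle:

dimnames(F_jac) <- list(rownames(animal), rownames(animal))
as.dist(F_jac)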
## Lion Giraffe Cow Sheep
## Giraffe 0.60
## Cow 0.40 0.40
## Sheep 0.40 0.40 1.00
## Human 0.25 0.00 0.00 0.00
# convert similarities to distances: d_ij = sqrt(f_ii + f_jj - 2 f_ij)
D <- matrix(nrow = 5, ncol = 5)
for (ii in 1:5) {
  for (jj in 1:5) {
    D[ii, jj] <- sqrt(F_SMC[ii, ii] + F_SMC[jj, jj] - 2 * F_SMC[ii, jj])
  }
}
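With the similarities converted to distances, classical MDS can now be applied directly. A minimal sketch using R's built-in cmdscale, which implements the classical solution of the previous section:

# classical MDS on the SMC-based distances, in k = 2 dimensions
mds <- cmdscale(D, k = 2, eig = TRUE)
round(mds$points, 3)   # 2-d configuration of the five animals
round(mds$eig, 3)      # eigenvalues of B; non-negative when D is Euclidean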
An alternative conversion from similarities to distances is provided by the sim2dist function in the RFLPtools package, which simply subtracts each similarity from the largest possible similarity:

library(RFLPtools)
sim2dist
## function (x, maxSim = 1)
## {
##     stopifnot(is.matrix(x))
##     stopifnot(isSymmetric(x))
##     stopifnot(maxSim >= max(x))
##     d <- maxSim - x
##     return(as.dist(d))
## }
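Note that sim2dist converts similarities to distances via \(d_{ij}=\mathrm{maxSim}-f_{ij}\), rather than via \(d_{ij}=(f_{ii}+f_{jj}-2f_{ij})^{1/2}\) as above. When all self-similarities equal 1, as for F_SMC here, the two are related by \(f_{ii}+f_{jj}-2f_{ij}=2(1-f_{ij})\), so sim2dist returns half the squared distances computed above. For example:

sim2dist(F_SMC)   # equals (D * D) / 2 elementwise, since diag(F_SMC) = 1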