Lecture 4b Similarity and Dissimilarity - Fudan...

2007-4-1 Data Mining: Tech. and Appl.

Lecture 4b: Similarity and Dissimilarity

Zhou Shuigeng

April 1, 2007


Measures of Similarity and Dissimilarity

Used by a number of data mining techniques:

Clustering

Anomaly detection

Nearest-neighbor classification


Measures of Similarity and Dissimilarity

Similarity

Numerical measure of how alike two data objects are

Higher when objects are more alike

Often falls in the range [0, 1]

Dissimilarity (sometimes called distance)

Numerical measure of how different two data objects are

Lower when objects are more alike

Minimum dissimilarity is often 0

Upper limit varies

Proximity refers to a similarity or dissimilarity


Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects
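The per-attribute measures can be sketched in Python as follows. This is a minimal illustration under common textbook conventions, not necessarily the exact table shown on the slide: nominal values are compared by equality, ordinal values are assumed to be mapped to integer ranks 0 .. n-1, and interval or ratio values use the absolute difference. All function names and the example scale are hypothetical.

```python
def nominal_dissimilarity(p, q):
    """0 if the two nominal values match, 1 otherwise."""
    return 0 if p == q else 1


def ordinal_dissimilarity(p, q, n_levels):
    """|p - q| / (n - 1), assuming values are integer ranks 0 .. n-1."""
    return abs(p - q) / (n_levels - 1)


def interval_dissimilarity(p, q):
    """Absolute difference for interval or ratio attributes."""
    return abs(p - q)


def similarity_from_dissimilarity(d):
    """One common conversion (an assumption): s = 1 / (1 + d)."""
    return 1 / (1 + d)


# Hypothetical ordinal scale with three levels.
ranks = {"poor": 0, "fair": 1, "good": 2}
d_ordinal = ordinal_dissimilarity(ranks["poor"], ranks["good"], n_levels=3)
s_ordinal = 1 - d_ordinal  # for ordinal attributes, s = 1 - d is a usual choice
```

For a dissimilarity already scaled to [0, 1], s = 1 - d is the simplest conversion; the 1 / (1 + d) form is one option for unbounded dissimilarities.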


Similarity/Dissimilarity for Data Objects

Dissimilarity (mostly for continuous data)

Euclidean distance

Minkowski distance

Mahalanobis distance

Similarity

Binary data (SMC, Jaccard, cosine, Hamming)

Continuous data (Tanimoto, correlation)


Euclidean Distance

$$d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Standardization is necessary if scales differ.
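A minimal Python sketch of Euclidean distance, with z-score standardization as one possible way to handle attributes on different scales. The data (age in years, income in dollars) and the helper names are hypothetical, and using the sample standard deviation is an assumption.

```python
import math
import statistics


def euclidean(p, q):
    """Square root of the summed squared differences over all attributes."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


def standardize_columns(rows):
    """Z-score each attribute (column): subtract its mean, divide by its sample std."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical objects: (age in years, income in dollars).
people = [(25, 30000), (45, 32000), (30, 90000)]

# On the raw values the income attribute dominates the distance.
raw_distance = euclidean(people[0], people[1])

# After standardization both attributes contribute on a comparable scale.
z = standardize_columns(people)
scaled_distance = euclidean(z[0], z[1])
```

On the raw data the 20-year age gap is almost invisible next to the 2000-dollar income gap, which is why some form of rescaling is usually applied first.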


Euclidean Distance


Minkowski Distance

Minkowski distance is a generalization of Euclidean distance:

$$d(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$

where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.


Minkowski Distance: Examples

r = 1: City block (Manhattan, taxicab, L1 norm) distance

A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

r = 2: Euclidean distance

r → ∞: “supremum” (Lmax norm, L∞ norm) distance

This is the maximum difference between any component of the vectors

Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions
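The three cases r = 1, r = 2, and r → ∞ can be sketched in one Python function. The function name and the sample points are hypothetical; the supremum case is handled separately because it is the limit of the formula rather than a value of r that can be plugged in.

```python
def minkowski(p, q, r):
    """Minkowski distance with parameter r; r = float('inf') gives the supremum distance."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


# Hypothetical 2-D points; r changes, the number of dimensions n does not.
p, q = (0, 2), (3, 6)
city_block = minkowski(p, q, 1)            # |3| + |4|
euclid = minkowski(p, q, 2)                # sqrt(3^2 + 4^2)
supremum = minkowski(p, q, float("inf"))   # max(|3|, |4|)

# With r = 1 on binary vectors, the result is the Hamming distance.
hamming = minkowski((1, 0, 1, 1), (1, 1, 0, 1), 1)
```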


Minkowski Distance


Mahalanobis Distance



Mahalanobis Distance


Mahalanobis Distance

Mahalanobis distance is a generalization of Euclidean distance. It reduces to Euclidean distance when the covariance matrix is the identity matrix.

Mahalanobis distance is useful when the attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal).
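A minimal sketch for 2-D points, assuming the usual definition sqrt((p - q) Σ⁻¹ (p - q)ᵀ), where Σ is the covariance matrix of the data. Some presentations omit the square root. The function name, the covariance matrix, and the points are hypothetical; the 2x2 inverse is written out by hand to keep the example free of dependencies.

```python
import math


def mahalanobis_2d(p, q, cov):
    """sqrt((p - q) Sigma^{-1} (p - q)^T) for 2-D points and a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = p[0] - q[0], p[1] - q[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)


# With the identity covariance matrix the result is the Euclidean distance.
identity = [[1, 0], [0, 1]]
d_identity = mahalanobis_2d((0, 0), (3, 4), identity)

# Hypothetical positively correlated attributes.
cov = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
d_ab = mahalanobis_2d(A, B, cov)  # B lies across the direction of correlation
d_ac = mahalanobis_2d(A, C, cov)  # C lies along the direction of correlation
```

In this example C is farther from A than B in Euclidean terms, yet closer in Mahalanobis terms, because the displacement from A to C follows the direction in which the data varies most.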


Common Properties of A Distance

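Distances, such as the Euclidean distance, have some well-known properties:

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q (positive definiteness)

2. d(p, q) = d(q, p) for all p and q (symmetry)

3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r (triangle inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q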

A distance that satisfies these properties is a metric
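The metric properties (positivity, symmetry, and the triangle inequality) can be checked numerically on a finite set of points. The sketch below is only a spot check on sample data, not a proof, and all names are hypothetical. It also shows that squared Euclidean distance fails the triangle inequality, so it is a dissimilarity but not a metric.

```python
import itertools
import math


def euclidean(p, q):
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


def squared_euclidean(p, q):
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))


def is_metric_on(points, d, tol=1e-12):
    """Check the three metric properties of d on every pair and triple of points."""
    for p, q in itertools.product(points, repeat=2):
        # Positivity: d >= 0, and d == 0 exactly when the points coincide.
        if d(p, q) < 0 or (d(p, q) == 0) != (p == q):
            return False
        # Symmetry.
        if abs(d(p, q) - d(q, p)) > tol:
            return False
    for p, q, r in itertools.product(points, repeat=3):
        # Triangle inequality.
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False
    return True


# Hypothetical sample points on a line.
points = [(0,), (1,), (2,)]
euclidean_ok = is_metric_on(points, euclidean)
squared_ok = is_metric_on(points, squared_euclidean)  # fails: 4 > 1 + 1
```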


Common Properties of A Similarity

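Similarities also have some well-known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q

2. s(p, q) = s(q, p) for all p and q (symmetry)

where s(p, q) is the similarity between points (data objects) p and q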


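Similarity between Binary Vectors

A common situation is that objects p and q have only binary attributes.

Compute similarities using the following quantities:

M01 = the number of attributes where p is 0 and q is 1

M10 = the number of attributes where p is 1 and q is 0

M00 = the number of attributes where p is 0 and q is 0

M11 = the number of attributes where p is 1 and q is 1

Simple Matching and Jaccard Coefficients

SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 1-1 matches / number of attributes that are not both zero = M11 / (M01 + M10 + M11)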


SMC vs. Jaccard: Example
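A small Python sketch comparing the two coefficients on a pair of sparse binary vectors. The vectors are illustrative values chosen here, not recovered from the slide, and returning 0.0 for Jaccard when neither vector has a 1 is an assumed convention.

```python
def binary_counts(p, q):
    """Return (M01, M10, M00, M11) for two equal-length binary vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    return m01, m10, m00, m11


def smc(p, q):
    """Simple matching coefficient: matches over all attributes."""
    m01, m10, m00, m11 = binary_counts(p, q)
    return (m11 + m00) / (m01 + m10 + m00 + m11)


def jaccard(p, q):
    """Jaccard coefficient: 1-1 matches over attributes that are not both zero."""
    m01, m10, _, m11 = binary_counts(p, q)
    denominator = m01 + m10 + m11
    return m11 / denominator if denominator else 0.0  # assumed convention


# Illustrative sparse vectors: most attributes are 0 in both objects.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
counts = binary_counts(p, q)
smc_value = smc(p, q)
jaccard_value = jaccard(p, q)
```

Here SMC is high only because the seven shared zeros count as matches, while Jaccard ignores them and reports no overlap. This is why Jaccard is usually preferred for asymmetric binary attributes.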


Cosine Similarity

If d1 and d2 are two document vectors, then

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$

where · indicates the vector dot product and ||d|| is the length of vector d

Example:
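As a minimal sketch, cosine similarity can be computed in Python as the dot product divided by the product of the vector lengths. The two term-count vectors are illustrative values chosen here, not recovered from the slide, and the function name is hypothetical.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


# Illustrative term-count vectors for two documents.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

# d1 . d2 = 3*1 + 2*1 = 5, ||d1|| = sqrt(42), ||d2|| = sqrt(6)
cos = cosine_similarity(d1, d2)  # about 0.315
```

Because the result depends only on the angle between the vectors, doubling every count in one document leaves the similarity unchanged.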


Extended Jaccard Coefficient(Tanimoto)

Variation of Jaccard for continuous or count attributes

Reduces to Jaccard for binary attributes
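A minimal sketch, assuming the usual definition T(p, q) = p · q / (||p||² + ||q||² − p · q). The function name and sample vectors are hypothetical. For binary vectors, p · q counts the 1-1 matches and the denominator counts the attributes that are not both zero, so the value equals the Jaccard coefficient.

```python
def tanimoto(p, q):
    """Extended Jaccard: p.q / (||p||^2 + ||q||^2 - p.q)."""
    dot = sum(a * b for a, b in zip(p, q))
    pp = sum(a * a for a in p)
    qq = sum(b * b for b in q)
    return dot / (pp + qq - dot)


# Binary vectors: M11 = 2, M10 = 1, M01 = 0, so Jaccard = 2 / 3.
binary_value = tanimoto([1, 1, 0, 1], [1, 0, 0, 1])

# Count vectors: 3 / (5 + 3 - 3)
count_value = tanimoto([2, 0, 1], [1, 1, 1])
```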


Correlation

Correlation measures the linear relationship between objects.

To compute correlation, we standardize the data objects p and q and then take their dot product, dividing by n − 1 when the sample standard deviation is used for standardization.
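The standardize-then-dot-product recipe can be sketched as follows. This assumes the sample standard deviation (the n − 1 form), which is why the dot product is divided by n − 1 to obtain the Pearson correlation. Function names and data are hypothetical.

```python
import statistics


def standardize(x):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(x)
    std = statistics.stdev(x)
    return [(v - mean) / std for v in x]


def correlation(p, q):
    """Pearson correlation as a scaled dot product of standardized vectors."""
    ps, qs = standardize(p), standardize(q)
    return sum(a * b for a, b in zip(ps, qs)) / (len(p) - 1)


# Hypothetical objects with a perfect linear relationship.
increasing = correlation([1, 2, 3, 4], [2, 4, 6, 8])   # close to +1
decreasing = correlation([1, 2, 3, 4], [8, 6, 4, 2])   # close to -1
```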


Visually Evaluating Correlation


Drawback of Correlation

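X = (-3, -2, -1, 0, 1, 2, 3)

Y = (9, 4, 1, 0, 1, 4, 9), i.e., Y = X^2

Mean(X) = 0, Mean(Y) = 4

The numerator of the correlation, the sum of products of deviations from the means, is

(-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5) = 0

so the correlation is 0 even though Y is completely determined by X: correlation captures only linear relationships.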
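The arithmetic can be reproduced in a few lines of Python. The sketch computes only the numerator of the correlation (the sum of products of deviations from the means), which is enough to show that the correlation is zero. Variable names are hypothetical.

```python
X = [-3, -2, -1, 0, 1, 2, 3]
Y = [x ** 2 for x in X]  # Y is a deterministic, nonlinear function of X

mean_x = sum(X) / len(X)
mean_y = sum(Y) / len(Y)

# Sum of products of deviations from the means.
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
```

The positive and negative products cancel because the parabola is symmetric around Mean(X), so a zero correlation here says nothing about whether X and Y are related.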


Bregman Divergence

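Bregman divergences are loss or distortion functions.

Given a strictly convex function φ, the Bregman divergence between x and y is

$$D(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle = \phi(x) - L(x)$$

where L(x) = φ(y) + ⟨∇φ(y), x − y⟩ represents a plane that is tangent to the function φ at y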


Bregman Divergence

Example: squared Euclidean distance

φ(t) = t^2 (convex function)

D(x, y) = x^2 - y^2 - 2y(x - y) = (x - y)^2
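A minimal scalar sketch: the divergence is the gap at x between φ and its tangent line at y. The function name `bregman` and the sample values are hypothetical; the derivative of φ is passed in explicitly rather than computed.

```python
def bregman(phi, dphi, x, y):
    """Scalar Bregman divergence: phi(x) - phi(y) - phi'(y) * (x - y)."""
    return phi(x) - phi(y) - dphi(y) * (x - y)


def square(t):
    return t * t


def square_derivative(t):
    return 2 * t


# With phi(t) = t^2 the divergence is the squared difference (x - y)^2.
value = bregman(square, square_derivative, 5.0, 2.0)  # 25 - 4 - 4 * 3
```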


General Approach for Combining Similarities

Sometimes attributes are of many different types, but an overall similarity is needed.
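One common formulation, given here as an assumption since the slide's own steps were not recoverable: compute a similarity s_k in [0, 1] for each attribute, set an indicator δ_k to 0 when attribute k should be skipped (for example a missing value, or an asymmetric binary attribute that is 0 in both objects) and to 1 otherwise, then average the similarities of the attributes that count. The function name and values are hypothetical.

```python
def combined_similarity(similarities, indicators):
    """sum(delta_k * s_k) / sum(delta_k) over all attributes k."""
    counted = sum(indicators)
    if counted == 0:
        raise ValueError("no attribute can be compared")
    return sum(d * s for d, s in zip(indicators, similarities)) / counted


# Hypothetical per-attribute similarities; the third attribute is skipped.
s = [1.0, 0.5, 0.8]
delta = [1, 1, 0]
overall = combined_similarity(s, delta)  # (1.0 + 0.5) / 2
```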


Using Weights to Combine Similarities

We may not want to treat all attributes the same.

Use weights w_k, which are between 0 and 1 and sum to 1.
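A minimal sketch of two ways weights can enter: a weighted sum of per-attribute similarities, and a weighted Minkowski distance. Both forms are assumptions about what the slide showed. The function names, weights, and values are hypothetical.

```python
def check_weights(weights):
    """Weights must lie in [0, 1] and sum to 1."""
    if any(w < 0 or w > 1 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be in [0, 1] and sum to 1")


def weighted_similarity(similarities, weights):
    """Weighted sum of per-attribute similarities."""
    check_weights(weights)
    return sum(w * s for w, s in zip(weights, similarities))


def weighted_minkowski(p, q, weights, r):
    """(sum_k w_k * |p_k - q_k|^r)^(1/r)."""
    check_weights(weights)
    return sum(w * abs(pk - qk) ** r for w, pk, qk in zip(weights, p, q)) ** (1 / r)


# Hypothetical values: the first attribute matters most.
overall = weighted_similarity([1.0, 0.5, 0.8], [0.5, 0.3, 0.2])
distance = weighted_minkowski((0, 0), (3, 4), (0.5, 0.5), 2)  # sqrt(12.5)
```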
