Lecture 4b Similarity and Dissimilarity - Fudan...
2007-4-1 Data Mining:Tech. and Appl. 1
Lecture 4b: Similarity and Dissimilarity
Zhou Shuigeng
April 1, 2007
Measures of Similarity and Dissimilarity
Used by a number of data mining techniques:
  Clustering
  Anomaly detection
  Nearest-neighbor classification
Measures of Similarity and Dissimilarity
Similarity
  Numerical measure of how alike two data objects are.
  Higher when objects are more alike.
  Often falls in the range [0, 1].
Dissimilarity (sometimes called distance)
  Numerical measure of how different two data objects are.
  Lower when objects are more alike.
  Minimum dissimilarity is often 0.
  Upper limit varies.
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects
Similarity/Dissimilarity for Data Objects
Dissimilarity (mostly for continuous data)
Euclidean distance
Minkowski distance
Mahalanobis distance
Similarity
Binary data (SMC, Jaccard, cosine, Hamming)
Continuous data (Tanimoto, correlation)
Euclidean Distance
Euclidean distance:
$d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Standardization is necessary if the scales of the attributes differ.
Euclidean Distance
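A minimal sketch of the Euclidean distance, and of why standardization matters when attribute scales differ. The data (age in years, income in dollars), the helper names, and the choice of z-score standardization with the population standard deviation are illustrative assumptions, not taken from the slides.

```python
import math
from statistics import mean, pstdev

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

def standardize(rows):
    """Z-score every attribute (column): subtract its mean, divide by its std.

    Assumes no attribute is constant (a zero std would divide by zero).
    """
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]

# Hypothetical objects: (age in years, income in dollars).
people = [(25, 40_000), (55, 41_000), (26, 60_000)]

# On the raw data the income attribute dominates the distance.
raw_01 = euclidean(people[0], people[1])  # 30-year age gap, similar income
raw_02 = euclidean(people[0], people[2])  # similar age, large income gap

# After standardization both attributes contribute on a comparable scale.
z = standardize(people)
z_01 = euclidean(z[0], z[1])
z_02 = euclidean(z[0], z[2])

print(f"raw: {raw_01:.1f} vs {raw_02:.1f}")
print(f"standardized: {z_01:.3f} vs {z_02:.3f}")
```

On the raw values the 30-year age gap is almost invisible next to the income gap; after standardization the two pairs end up at a similar distance.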
Minkowski Distance
Minkowski distance is a generalization of Euclidean distance:
$d(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$
where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
r = 1: City block (Manhattan, taxicab, L1 norm) distance.
  A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
r = 2: Euclidean distance.
r → ∞: "supremum" (Lmax norm, L∞ norm) distance.
  This is the maximum absolute difference between corresponding components of the vectors.
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance
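The special cases r = 1, r = 2, and r → ∞ can be sketched with one small function. The function name, the use of `math.inf` to request the supremum distance, and the sample points are assumptions made for illustration.

```python
import math

def minkowski(p, q, r):
    """Minkowski distance with parameter r >= 1.

    Passing r = math.inf returns the supremum (L-infinity) distance.
    """
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

p, q = (0, 2), (3, 1)

l1 = minkowski(p, q, 1)            # |0 - 3| + |2 - 1| = 4
l2 = minkowski(p, q, 2)            # sqrt(3^2 + 1^2) = sqrt(10)
linf = minkowski(p, q, math.inf)   # max(3, 1) = 3

# Hamming distance as the L1 distance between two binary vectors.
a = (1, 0, 1, 1, 0)
b = (1, 1, 0, 1, 0)
hamming = minkowski(a, b, 1)       # the vectors differ in two positions

print(l1, l2, linf, hamming)
```

Note that r only changes how the per-attribute differences are aggregated; the same function works for any number of dimensions n.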
Mahalanobis Distance
$d(p, q)^2 = (p - q)^T \Sigma^{-1} (p - q)$
where Σ is the covariance matrix of the data.
Mahalanobis Distance
Mahalanobis Distance
Mahalanobis distance is a generalization of Euclidean distance: it reduces to Euclidean distance when the covariance matrix is the identity matrix.
Mahalanobis distance is useful when the attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal).
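A minimal two-dimensional sketch of these points. It computes the squared form (p - q)^T Σ^{-1} (p - q) (take the square root for the distance itself) and uses a hand-written 2x2 inverse to stay dependency-free; the covariance matrix, the points, and the function names are illustrative assumptions.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis_sq(p, q, cov):
    """Squared Mahalanobis distance between 2-D points p and q."""
    inv = inverse_2x2(cov)
    dx, dy = p[0] - q[0], p[1] - q[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

identity = [[1.0, 0.0], [0.0, 1.0]]
cov = [[0.3, 0.2], [0.2, 0.3]]   # hypothetical covariance: positively correlated attributes

a, b, c = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

# With the identity matrix the result is the squared Euclidean distance.
eucl_ab = mahalanobis_sq(a, b, identity)   # 0.5^2 + 0.5^2 = 0.5
eucl_ac = mahalanobis_sq(a, c, identity)   # 1^2 + 1^2 = 2

# With correlated attributes the ordering of the two distances flips:
# c lies along the direction of correlation, b lies across it.
maha_ab = mahalanobis_sq(a, b, cov)
maha_ac = mahalanobis_sq(a, c, cov)

print(eucl_ab, eucl_ac, maha_ab, maha_ac)
```

Under the Euclidean distance b is closer to a than c is, but under this covariance matrix c is the closer point, because moving along the direction in which the attributes vary together is "cheaper".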
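Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q (positive definiteness)
2. d(p, q) = d(q, p) for all p and q (symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r (triangle inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
A distance that satisfies these properties is a metric.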
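The metric properties (non-negativity with zero distance only for identical points, symmetry, and the triangle inequality) can be spot-checked numerically on a finite sample. This is only a sketch: passing the check on a few points does not prove that a function is a metric, but a single failure shows that it is not. The helper name, the tolerance, and the sample points are assumptions.

```python
import math
from itertools import product

def euclidean(p, q):
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

def is_metric_on(points, d, tol=1e-12):
    """Check the three metric properties of d on a finite sample of points."""
    for p, q in product(points, repeat=2):
        # Non-negative, and zero exactly when the two points are identical.
        if d(p, q) < 0 or (d(p, q) == 0) != (p == q):
            return False
        # Symmetry.
        if abs(d(p, q) - d(q, p)) > tol:
            return False
    for p, q, r in product(points, repeat=3):
        # Triangle inequality.
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False
    return True

def squared_euclidean(p, q):
    return euclidean(p, q) ** 2

points = [(0, 2), (2, 0), (3, 1), (5, 1)]

euclidean_ok = is_metric_on(points, euclidean)
squared_ok = is_metric_on(points, squared_euclidean)

print(euclidean_ok, squared_ok)
```

The squared Euclidean distance fails here because of the triangle inequality: going from (0, 2) to (5, 1) directly costs 26, while the detour through (3, 1) costs only 10 + 4.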
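Common Properties of a Similarity
Similarities also have some well-known properties:
1. s(p, q) = 1 (or the maximum similarity) only if p = q
2. s(p, q) = s(q, p) for all p and q (symmetry)
where s(p, q) is the similarity between points (data objects) p and q.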
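Similarity between Binary Vectors
A common situation is that objects p and q have only binary attributes.
Compute similarities using the following quantities:
  M_{01} = the number of attributes where p is 0 and q is 1
  M_{10} = the number of attributes where p is 1 and q is 0
  M_{00} = the number of attributes where p is 0 and q is 0
  M_{11} = the number of attributes where p is 1 and q is 1
Simple Matching and Jaccard Coefficients
  $\mathrm{SMC} = \frac{M_{11} + M_{00}}{M_{01} + M_{10} + M_{11} + M_{00}}$ (number of matches / number of attributes)
  $J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$ (number of 1-1 matches / number of attributes that are not both 0)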
SMC vs. Jaccard: Example
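As a minimal sketch, the two coefficients can be computed from the four match counts (the number of attributes where p and q are 0/0, 0/1, 1/0, and 1/1). The vectors below are illustrative and not necessarily those of the original slide; returning 0.0 for Jaccard when both vectors are all zeros is a convention assumed here, since the ratio is undefined in that case.

```python
def match_counts(p, q):
    """Count attribute pairs by value: keys are '00', '01', '10', '11' (p then q)."""
    counts = {"00": 0, "01": 0, "10": 0, "11": 0}
    for pk, qk in zip(p, q):
        counts[f"{pk}{qk}"] += 1
    return counts

def smc(p, q):
    """Simple matching coefficient: matches (0-0 and 1-1) over all attributes."""
    m = match_counts(p, q)
    return (m["11"] + m["00"]) / len(p)

def jaccard(p, q):
    """Jaccard coefficient: 1-1 matches over attributes that are not both 0."""
    m = match_counts(p, q)
    denom = m["01"] + m["10"] + m["11"]
    return m["11"] / denom if denom else 0.0  # assumed convention for all-zero vectors

p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

counts = match_counts(p, q)   # 7 zero-zero pairs, no one-one pairs
print(counts, smc(p, q), jaccard(p, q))
```

For sparse vectors like these, SMC is high (0.7) only because of the many shared zeros, while Jaccard is 0 because the objects share no 1s.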
Cosine Similarity
If d1 and d2 are two document vectors, then
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
where · indicates the vector dot product and ||d|| is the length of vector d.
Example:
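A small worked sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths. The two term-count vectors are illustrative assumptions and may differ from the ones on the original slide.

```python
import math

def dot(u, v):
    return sum(uk * vk for uk, vk in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; assumes neither vector is all zeros."""
    return dot(u, v) / (norm(u) * norm(v))

# Hypothetical term-count vectors for two documents.
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

d1_dot_d2 = dot(d1, d2)           # 3*1 + 2*1 = 5
len_d1 = norm(d1)                 # sqrt(9 + 4 + 25 + 4) = sqrt(42)
len_d2 = norm(d2)                 # sqrt(1 + 1 + 4) = sqrt(6)
cos = cosine_similarity(d1, d2)   # 5 / sqrt(252), roughly 0.315

print(d1_dot_d2, round(len_d1, 3), round(len_d2, 3), round(cos, 4))
```

Because the lengths are divided out, scaling either document vector (for example, doubling every count) leaves the similarity unchanged.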
Extended Jaccard Coefficient (Tanimoto)
Variation of Jaccard for continuous or count attributes:
$T(p, q) = \frac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$
Reduces to Jaccard for binary attributes.
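A minimal sketch of the extended Jaccard (Tanimoto) coefficient in its usual form, p·q / (||p||² + ||q||² - p·q), together with a check that it agrees with the Jaccard coefficient on binary vectors. The vectors and function names are illustrative assumptions.

```python
def dot(u, v):
    return sum(uk * vk for uk, vk in zip(u, v))

def tanimoto(p, q):
    """Extended Jaccard coefficient; assumes p and q are not both all zeros."""
    pq = dot(p, q)
    return pq / (dot(p, p) + dot(q, q) - pq)

def jaccard_binary(p, q):
    """Jaccard coefficient for 0/1 vectors: 1-1 matches over not-both-zero attributes."""
    m11 = sum(1 for pk, qk in zip(p, q) if pk == 1 and qk == 1)
    not_both_zero = sum(1 for pk, qk in zip(p, q) if pk == 1 or qk == 1)
    return m11 / not_both_zero

# Binary case: for 0/1 vectors p.q counts the 1-1 matches and ||p||^2 counts
# the 1s in p, so the denominator counts the not-both-zero attributes.
a = (1, 0, 1, 1, 0)
b = (1, 1, 0, 1, 0)
t_binary = tanimoto(a, b)        # 2 / (3 + 3 - 2) = 0.5
j_binary = jaccard_binary(a, b)  # 2 / 4 = 0.5

# Count attributes.
p = (3, 2, 0, 5)
q = (1, 0, 0, 2)
t_counts = tanimoto(p, q)        # 13 / (38 + 5 - 13) = 13 / 30

print(t_binary, j_binary, t_counts)
```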
Correlation
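Correlation measures the linear relationship between objects.
To compute correlation, we standardize data objects p and q, and then take their dot product:
$p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$
$q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$
$\mathrm{correlation}(p, q) = \frac{p' \cdot q'}{n - 1}$
where std is the sample standard deviation and n is the number of attributes, so that the result lies in [-1, 1].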
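The standardize-then-dot-product recipe can be sketched as follows. One detail is an assumption of this sketch: with the sample standard deviation (`statistics.stdev`), the dot product has to be divided by n - 1 for the result to fall in [-1, 1]. The sample vectors are illustrative.

```python
from statistics import mean, stdev

def standardize(x):
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sigma = mean(x), stdev(x)
    return [(v - mu) / sigma for v in x]

def correlation(p, q):
    """Pearson correlation: dot product of the standardized vectors over n - 1."""
    ps, qs = standardize(p), standardize(q)
    return sum(a * b for a, b in zip(ps, qs)) / (len(p) - 1)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # perfectly linear, increasing
y_down = [10, 8, 6, 4, 2]   # perfectly linear, decreasing
y_noisy = [2, 1, 4, 3, 5]   # roughly increasing

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
r_noisy = correlation(x, y_noisy)

print(round(r_up, 3), round(r_down, 3), round(r_noisy, 3))
```

A perfect increasing linear relationship gives +1, a perfect decreasing one gives -1, and the noisy vector lands in between.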
Visually Evaluating Correlation
Drawback of Correlation
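X = (-3, -2, -1, 0, 1, 2, 3)
Y = (9, 4, 1, 0, 1, 4, 9), i.e., Y = X^2
Mean(X) = 0, Mean(Y) = 4
Sum of products of the mean-centered values (the numerator of the correlation):
(-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5) = 0
So the correlation is 0 even though Y is completely determined by X: correlation only captures linear relationships.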
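The arithmetic above can be verified directly. This sketch only recomputes the numerator of the correlation (the sum of products of the mean-centered values) for the X and Y given in the example.

```python
from statistics import mean

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]          # Y = X^2: (9, 4, 1, 0, 1, 4, 9)

mean_x, mean_y = mean(x), mean(y)

# Each term pairs a centered X value with the matching centered Y value.
terms = [(a - mean_x) * (b - mean_y) for a, b in zip(x, y)]
numerator = sum(terms)

print(mean_x, mean_y, terms, numerator)
```

The positive and negative terms cancel exactly because Y is symmetric around X = 0, so a zero correlation here says nothing about the strong nonlinear dependence.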
Bregman Divergence
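Bregman divergences are loss or distortion functions.
Given a strictly convex function φ, the Bregman divergence D(x, y) generated by φ is
$D(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), (x - y) \rangle$
Equivalently, $D(x, y) = \phi(x) - L(x)$, where $L(x) = \phi(y) + \langle \nabla\phi(y), (x - y) \rangle$ represents the plane that is tangent to the function φ at y.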
Bregman Divergence
Example: squared Euclidean distance
$\phi(t) = t^2$ (a strictly convex function)
$D(x, y) = x^2 - y^2 - 2y(x - y) = (x - y)^2$
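A minimal scalar sketch of this example: the divergence is the gap between φ(x) and the tangent line to φ at y, evaluated at x. The function name `bregman` and the test values are assumptions; the caller supplies φ and its derivative.

```python
def bregman(phi, dphi, x, y):
    """Scalar Bregman divergence: phi(x) - phi(y) - phi'(y) * (x - y)."""
    return phi(x) - phi(y) - dphi(y) * (x - y)

def square(t):
    return t * t

def d_square(t):
    return 2 * t

# With phi(t) = t^2 the divergence equals the squared Euclidean distance.
pairs = [(5.0, 2.0), (-1.0, 3.0), (0.5, 0.5)]
divergences = [bregman(square, d_square, x, y) for x, y in pairs]
squared_distances = [(x - y) ** 2 for x, y in pairs]

print(divergences)        # 25 - 4 - 4*3 = 9 for the first pair
print(squared_distances)
```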
General Approach for Combining Similarities
Sometimes attributes are of many different types, but an overall similarity is needed.
Using Weights to Combine Similarities
May not want to treat all attributes the same.
Use weights w_k which are between 0 and 1 and sum to 1.
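One simple reading of this, assumed here rather than taken from the slides, is a weighted sum of per-attribute similarities s_k, each already scaled to [0, 1]. The function name, the validation rules, and the numbers are hypothetical.

```python
def combine_similarities(sims, weights, tol=1e-9):
    """Weighted overall similarity: sum of w_k * s_k over all attributes.

    Assumes every s_k is in [0, 1], every w_k is in [0, 1], and the
    weights sum to 1, so the result also lies in [0, 1].
    """
    if len(sims) != len(weights):
        raise ValueError("need exactly one weight per attribute similarity")
    if any(w < 0 or w > 1 for w in weights):
        raise ValueError("weights must be between 0 and 1")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, sims))

# Hypothetical per-attribute similarities, e.g. from a nominal, an ordinal,
# and a continuous attribute, each computed with its own measure.
sims = [1.0, 0.5, 0.0]
weights = [0.5, 0.3, 0.2]

overall = combine_similarities(sims, weights)   # 0.5*1.0 + 0.3*0.5 + 0.2*0.0
equal = combine_similarities(sims, [1 / 3] * 3) # equal weights: the plain average

print(round(overall, 3), round(equal, 3))
```

With equal weights this reduces to the plain average of the per-attribute similarities; unequal weights let the more important attributes dominate.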