AMEETA AGRAWAL
Outline
Parser Evaluation
Text Clustering
Common N-Grams classification method (CNG)
Parser Evaluation
PARSEVAL measure
Precision, recall, F-measure and cross-brackets
http://www.haskell.org/communities/11-2009/html/Parsing-ParseModule.jpg
Why automate evaluation?
Manual inspection is slow and error-prone.
Human evaluators may introduce bias.
Parser output vs. gold standard: formal definitions
A gold standard is a 2-tuple GS = (S, A), where:
S = (s1, s2, …, sn) is a finite sequence of grammatical structures, i.e. constituents, dependency links or sentences, and
A = (a1, a2, …, an) is a finite sequence of analyses: for each i, 1 ≤ i ≤ n, ai ∈ A is the analysis of si ∈ S.
Given a parser P, the parser output O(P, GS) = (P(s1), P(s2), …, P(sn)) is the sequence of analyses such that P(si), for each i, 1 ≤ i ≤ n, is the analysis assigned by parser P to sentence si ∈ S.
Parser evaluation
Compare each element in O(P, GS) to the corresponding element in A.
[Figure: the sets of analyses in parser evaluation.]
PARSEVAL
Parsers are usually evaluated using the PARSEVAL measures (Black et al., 1991).
To compute the PARSEVAL measures:
The parse trees are decomposed into labelled constituents (LC): triples consisting of the starting and ending points of a constituent’s span in a sentence, and the constituent’s label.
For each sentence, the set of LCs obtained from the parser tree (PT) is compared with the set from the gold standard parse tree (GT).
Labelled vs. unlabelled
Labelled PARSEVAL: two analyses match if and only if both the brackets and the labels (POS and syntactic tags) match.
Unlabelled PARSEVAL: compares only the brackets.
PARSEVAL measures
Precision
Recall
F-score
Cross-brackets
Precision, Recall, F-score
Precision = (# of correct constituents in parser output) / (total # of constituents in parser output)
Recall = (# of correct constituents in parser output) / (total # of constituents in gold standard)
The F-score is the harmonic mean of precision and recall:
F = 2 · (labelled precision) · (labelled recall) / ((labelled precision) + (labelled recall))
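To make the computation concrete, here is a minimal Python sketch of labelled PARSEVAL scoring; the (start, end, label) triple representation follows the LC definition above, and the function name and toy data are illustrative.

# Minimal sketch of labelled PARSEVAL scoring. Constituents are
# (start, end, label) triples, compared as sets.
def parseval(parser_lcs, gold_lcs):
    parser_lcs, gold_lcs = set(parser_lcs), set(gold_lcs)
    correct = len(parser_lcs & gold_lcs)  # matching labelled constituents
    precision = correct / len(parser_lcs)
    recall = correct / len(gold_lcs)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Toy example: 2 of 3 parser constituents match the gold standard.
gold = [(0, 2, "NP"), (2, 5, "VP"), (0, 5, "S")]
parsed = [(0, 1, "NP"), (2, 5, "VP"), (0, 5, "S")]
print(parseval(parsed, gold))  # approximately (0.667, 0.667, 0.667)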
Crossing brackets
The mean number of bracketed sequences in which the parser output overlaps with the gold standard structure.
Non-crossing and crossing brackets: let [i, j] and [i', j'] be phrase boundaries in the gold standard and the parser output, respectively. The pair [i, j], [i', j'] is a pair of crossing brackets if the spans overlap without nesting, that is, if i < i' < j < j'.
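A hedged sketch of counting crossing bracket pairs on unlabelled (start, end) spans; the symmetric case, where the parser bracket starts first, is included for generality.

# Sketch: count gold/parser bracket pairs that cross, i.e. overlap
# without nesting. Brackets are (start, end) spans; labels are ignored.
def crossing_brackets(parser_spans, gold_spans):
    def crosses(a, b):
        (i, j), (k, l) = a, b
        return i < k < j < l or k < i < l < j
    return sum(crosses(g, p) for g in gold_spans for p in parser_spans)

print(crossing_brackets([(0, 2), (2, 4)], [(1, 4)]))  # 1 crossing pair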
Labelled PARSEVAL example
Consider the following two sentences:
Time flies like an arrow.
He ate the cake with a spoon.
Ambiguous sentences for a parser...
Gold standard parse trees
(S (NP (NN time) (NN flies))
   (VP (VB like)
       (NP (DT an) (NN arrow))))
(S (NP (PRP he))
   (VP (VBD ate)
       (NP (DT the) (NN cake))
       (PP (IN with)
           (NP (DT a) (NN spoon)))))
Parser output parse trees
(S (NP (NN time))
   (VP (VB flies)
       (PP (IN like)
           (NP (DT an) (NN arrow)))))
(S (NP (PRP he))
   (VP (VBD ate)
       (NP (DT the) (NN cake)
           (PP (IN with)
               (NP (DT a) (NN spoon))))))
[Figures: labelled edges of the parse trees (parts 1 and 2).]
Result
Precision = 73.9% (17/23)
Recall = 77.3% (17/22)
F-score = 75.5%
Unlabelled PARSEVAL example
A) [[He [hit [the post]]] [while [[the all-star goalkeeper] [was [out [of [the goal]]]]]]]
B) [He [[hit [the post]] [while [[the [[all-star] goalkeeper]] [was [out of [the goal]]]]]]]
A) is the gold standard structure and B) the parser output (adapted from Lin, 1998).
Precision = 75.0% (9/12)
Recall = 81.8% (9/11)
F-score = 78.3%
Crossing brackets = 1 pair
Strengths & weaknesses: PARSEVAL
Strength:
State-of-the-art parsers obtain up to 90% precision and recall on the Penn Treebank data (Bod, 2003; Charniak and Johnson, 2005).
Weaknesses:
Evaluation based on phrase-structure constituents abstracts away from the basic predicate-argument relationships that are important for correctly capturing the semantics of a sentence (Lin, 1998; Carroll et al., 2002).
Using the same resource for training and testing may result in the parser learning systematic errors which are present in both the training and testing material (Rehbein and van Genabith, 2007).
Other metrics: the Leaf-Ancestor metric (G. Sampson, 1980s).
Text Clustering
Task definition
Partitional clustering: simple K-means
Hierarchical clustering: divisive & agglomerative
Evaluation of clustering: inter-cluster similarity, cluster purity, entropy or information gain
http://www.miner3d.com/images/kmeans_medium.jpg
Clustering
Partition unlabeled examples into disjoint subsets (clusters), so that:
examples within a cluster are very similar
examples in different clusters are very different
Discover new categories in an unsupervised manner.
Inter-cluster distances are maximized; intra-cluster distances are minimized.
The notion of a cluster can be ambiguous
How many clusters? The same points can plausibly be grouped into two, four, or six clusters.
[Figure from Data Mining, Cluster Analysis: Basic Concepts and Algorithms, by Tan, Steinbach, Kumar.]
Ambiguous web queries
Web queries are often truly ambiguous: jaguar, NLP, paris hilton.
It seems like word sense disambiguation should help. Different senses of jaguar: animal, car, OS X…
In practice, WSD doesn’t help for web queries: disambiguation is either impossible (“jaguar”) or trivial (“jaguar car”).
Instead, “cluster” the results into useful groupings.
Clusty: the clustering search engine
Types of clustering
Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. E.g. K-means.
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree. E.g. agglomerative and divisive.
Density-based clustering: arbitrary-shaped clusters; a cluster is regarded as a region in which the density of data objects exceeds a threshold. E.g. DBSCAN and OPTICS.
Partitional clustering
[Figure: original points and a partitional clustering of them.]
Hierarchical clustering
[Figures: traditional and non-traditional hierarchical clusterings of points p1–p4, with the corresponding traditional and non-traditional dendrograms.]
Other types of clustering
Hard vs. soft
Hard: each document is a member of exactly one cluster.
Soft: a document has fractional membership in several clusters.
Exclusive vs. non-exclusive
In non-exclusive clustering, points may belong to multiple clusters; this can represent multiple classes or ‘border’ points.
Fuzzy vs. non-fuzzy
In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1; weights must sum to 1. Probabilistic clustering has similar characteristics.
Partial vs. complete
In some cases, we only want to cluster some of the data.
Heterogeneous vs. homogeneous
Clusters of widely different sizes, shapes, and densities.
K-means clustering
Documents are represented as length-normalized vectors in a real-valued space (use normalized, TF/IDF-weighted vectors).
Initial centroids are often chosen randomly, so the clusters produced vary from one run to another.
The centroid is, typically, the mean of the points in the cluster.
‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means algorithm
1. Select K points as the initial centroids.
2. Repeat:
3.   Form K clusters by assigning all points to the closest centroid.
4.   Recompute the centroid of each cluster.
5. Until the centroids don’t change.
A minimal sketch in code follows.
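This is a minimal NumPy sketch of the loop, assuming Euclidean distance and random initialization; a practical implementation would also handle empty clusters and run several random restarts, since (as noted above) results vary with the initial centroids.

import numpy as np

# Minimal K-means sketch: assign points to the nearest centroid,
# recompute centroids, stop when they are stable.
def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # stopping criterion: stable centroids
            break
        centroids = new
    return labels, centroids

pts = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels, cents = kmeans(pts, k=2)
print(labels, cents)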
Stopping criteria
A fixed number of iterations has been completed.
The assignment of documents to clusters does not change between iterations.
Centroids do not change between iterations.
The distance between the centroids and the data points falls below a threshold.
Two different K-means clusterings
[Figure: the same original points clustered in two ways; one run yields a sub-optimal clustering, another the optimal clustering.]
Choosing initial centroids - 1
[Figure: K-means iterations 1–6 for one choice of initial centroids.]
Choosing initial centroids - 2
[Figure: K-means iterations 1–5 for a different choice of initial centroids.]
Strengths & weaknesses: K-means
Strength:
Relatively efficient: complexity is O(n · K · I), where n = number of points, K = number of clusters, I = number of iterations; normally K, I << n.
Weaknesses:
Sensitive to the initial clusters.
Need to specify K, the number of clusters, in advance.
Very sensitive to noise and outliers.
May have a problem when clusters have different sizes.
Not suitable for discovering clusters with non-convex shapes.
Often terminates at a local optimum.
Hierarchical clustering
Two main types of hierarchical clustering:
Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative clustering
Bottom-up.
The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat:
4.   Merge the two closest clusters.
5.   Update the proximity matrix.
6. Until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; a sketch using a library implementation follows.
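As a hedged illustration, SciPy’s hierarchy module runs this same bottom-up merge loop; the points are made up, and the method names map onto the inter-cluster similarities discussed below (‘single’ = MIN, ‘complete’ = MAX, ‘average’ = group average, ‘ward’ = Ward’s method).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Sketch: agglomerative clustering with SciPy. linkage() computes the
# full merge history (the dendrogram); fcluster() cuts it into flat
# clusters at any desired level.
pts = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [9., 9.]])
Z = linkage(pts, method="average")               # bottom-up merges
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)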
Agglomerative example
Start with clusters of individual points p1–p5 and a proximity matrix.
[Figure: points p1–p5 and their proximity matrix.]
Agglomerative example
After some merging steps, we have clusters C1–C5.
[Figure: clusters C1–C5 and their proximity matrix.]
Agglomerative example
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1–C5 with C2 and C5 marked for merging, and their proximity matrix.]
Agglomerative example
The question is: how do we update the proximity matrix after merging C2 and C5?
[Figure: the merged cluster C2 ∪ C5 and the proximity-matrix entries, marked “?”, that must be recomputed.]
Inter-cluster similarity
How do we define the similarity of two clusters?
MIN
MAX
Group average
Distance between centroids
Other methods driven by an objective function (Ward’s method uses squared error)
[Figure: points p1–p5 with a proximity matrix, illustrating each definition in turn.]
Hierarchical clustering comparison
[Figure: dendrograms and clusterings of the same six points under MIN, MAX, group average, and Ward’s method.]
http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4
Divisive clustering
Top-down: split a cluster iteratively.
Start with all objects in one cluster and subdivide them into smaller pieces.
Less popular than agglomerative clustering.
Hierarchical clustering: strengths & weaknesses
Strengths:
Do not have to assume any particular number of clusters; any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level.
The clusters may correspond to meaningful taxonomies, e.g. the animal kingdom in the biological sciences.
Weaknesses:
When clusters are merged or split, the decision is permanent; erroneous decisions are impossible to correct later.
Do not scale well: space complexity is O(n^2), where n is the total number of points, and time complexity is O(n^3) in many cases, since there are n steps and at each step the proximity matrix of size n^2 must be updated and searched.
Evaluating clustering
Internal criterion: high intra-cluster similarity and low inter-cluster similarity.
External criteria: compare against a gold standard produced by humans.
Purity
Normalized mutual information
Rand index
F measure
Purity
Each cluster is assigned to the class which is most frequent in the cluster.
The accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by N, the total number of data points.
Purity example
[Figure: purity as an external evaluation criterion for cluster quality. The majority class has 5 members in cluster 1 (x), 4 in cluster 2 (o), and 3 in cluster 3 (⋄), out of 17 documents, so purity is (5 + 4 + 3) / 17 ≈ 0.71.]
0 < purity ≤ 1
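A small sketch of the purity computation; the cluster ids and class labels here are toy data, not the slide’s 17-document example.

from collections import Counter

# Sketch: purity of a clustering against gold class labels. Each
# cluster contributes the count of its most frequent class.
def purity(clusters, classes):
    # clusters, classes: parallel lists of cluster ids and class labels
    pair_counts = Counter(zip(clusters, classes))
    best = Counter()
    for (k, c), n in pair_counts.items():
        best[k] = max(best[k], n)  # majority-class count per cluster
    return sum(best.values()) / len(classes)

clusters = [1, 1, 1, 2, 2, 3, 3]
classes  = ["x", "x", "o", "o", "o", "d", "x"]
print(purity(clusters, classes))  # (2 + 2 + 1) / 7 ≈ 0.714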
Pitfall of purity
Purity is 1 if each document gets its own cluster.
Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.
Solution: Normalized Mutual Information.
Normalized Mutual Information
NMI(Ω, C) = I(Ω; C) / [(H(Ω) + H(C)) / 2]
where I is mutual information and H is entropy: NMI is mutual information divided by the average entropy of the clusters and the classes.
I(Ω; C) = Σk Σj P(ωk ∩ cj) · log [ P(ωk ∩ cj) / (P(ωk) · P(cj)) ]
where P(ωk), P(cj), and P(ωk ∩ cj) are the probabilities of a document being in cluster ωk, in class cj, and in the intersection of ωk and cj, respectively.
Mutual information
MI measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are.
MI is minimal if the clustering is random, and maximal if K = N, i.e. each document is its own cluster.
So MI has the same problem as purity: it does not penalize large cardinalities, even though, other things being equal, fewer clusters are better.
The normalization by the denominator (H(Ω) + H(C)) / 2 fixes this problem, since entropy tends to increase with the number of clusters.
0 ≤ NMI ≤ 1
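For a concrete computation, scikit-learn provides an NMI implementation; its default arithmetic averaging matches the (H(Ω) + H(C)) / 2 normalization above. The labels are toy data.

from sklearn.metrics import normalized_mutual_info_score

# Sketch: NMI between gold classes and a clustering. The default
# average_method="arithmetic" divides MI by (H(classes)+H(clusters))/2.
classes  = [0, 0, 1, 1, 1, 2, 0]
clusters = [1, 1, 1, 2, 2, 3, 3]
print(normalized_mutual_info_score(classes, clusters))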
CNG - Common N-Gram analysis
Definition
Example
Similarity measure
http://afflatus.ucd.ie/attachment/2009_6/tn_1246380626644.jpg
http://home.arcor.de/David-Peters/n-Grams.png
N-grams
An n-gram model is a type of probabilistic model for predicting the next item in a sequence.
Items can be phonemes, syllables, letters, words or base pairs.
Building n-grams involves splitting a sequence into chunks of consecutive items of length n.
N-grams example
“I don’t know what to say”
1-gram (unigram): I, don’t, know, what, to, say
2-gram (bigram): I don’t, don’t know, know what, what to, to say
3-gram (trigram): I don’t know, don’t know what, know what to, …
“TEXT” (character level, with _ marking word boundaries)
unigram: {T, E, X, T}
bigram: {_T, TE, EX, XT, T_}
trigram: {_TE, TEX, EXT, XT_, T__}
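A sketch of character n-gram extraction reproducing the “TEXT” example; the underscore padding convention is an assumption made to match the slide.

# Sketch: character n-grams with '_' padding at word boundaries,
# mirroring the "TEXT" example above.
def char_ngrams(text, n):
    padded = "_" + text + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("TEXT", 2))  # ['_T', 'TE', 'EX', 'XT', 'T_']
print(char_ngrams("TEXT", 3))  # ['_TE', 'TEX', 'EXT', 'XT_', 'T__']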
Why do we want to predict items?
Author attribution
Plagiarism detection
Malicious code detection
Genre classification
Sentiment classification
Spam identification
Language and encoding identification
Spelling correction
Common N-grams method
Compares the semantics of two texts (or audio or video data files).
Builds a byte-level n-gram profile of an author’s writing.
The profile is a small set of L pairs {(x1, f1), (x2, f2), …, (xL, fL)} of frequent n-grams and their normalized frequencies, generated from training data.
Two important operations:
choose the optimal set of n-grams for a profile
calculate the similarity between two profiles
Common N-grams method
Does not use any language-dependent information (no information about the space character, newline character, uppercase, or lowercase).
The approach does not depend on a specific language, so it does not require segmentation for languages such as Chinese or Thai.
There is no text preprocessing, so we avoid the need for taggers, parsers, and feature selection.
How do n-grams work?
“Marley was dead: to begin with. There is no doubt whatever about that. …” (from A Christmas Carol by Charles Dickens)
With n = 3, slide a window over the text: Mar, arl, rle, ley, ey_, y_w, _wa, was, …
Sort the n-grams by frequency and keep the top L (here L = 5; the top 8 are shown):
_th 0.015
___ 0.013
the 0.013
he_ 0.011
and 0.007
_an 0.007
nd_ 0.007
ed_ 0.006
(Detection of New Malicious Code Using N-grams Signatures, © 2004 T. Abou-Assaleh, N. Cercone, V. Keselj, & R. Sweidan)
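A sketch of profile building; collapsing whitespace to ‘_’ is an illustrative choice, not necessarily the method’s exact byte-level preprocessing.

from collections import Counter

# Sketch of CNG profile building: the L most frequent character
# n-grams with normalized frequencies.
def cng_profile(text, n=3, L=5):
    text = "_".join(text.split())  # illustrative whitespace handling
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(L)}

print(cng_profile("Marley was dead: to begin with.", n=3, L=5))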
Comparing profiles
Dickens, A Christmas Carol: _th 0.015, ___ 0.013, the 0.013, he_ 0.011, and 0.007
Dickens, A Tale of Two Cities: _th 0.016, the 0.014, he_ 0.012, and 0.007, nd_ 0.007
Carroll, Alice’s Adventures in Wonderland: _th 0.017, ___ 0.017, the 0.014, he_ 0.014, ing 0.007
Which pairs of profiles are most similar?
Similarity measure
In order to “normalize” the differences between two profiles, we divide them by the average frequency for a given n-gram, (f1(s) + f2(s))/2. E.g. a difference of 0.1 for an n-gram with frequencies 0.9 and 0.8 in the two profiles is weighted less than the same difference for an n-gram with frequencies 0.2 and 0.1.
d(profile1, profile2) = Σs ( (f1(s) − f2(s)) / ((f1(s) + f2(s)) / 2) )^2 = Σs ( 2 · (f1(s) − f2(s)) / (f1(s) + f2(s)) )^2
where s is any n-gram from one of the two profiles, and f1(s) and f2(s) are the n-gram’s frequencies in the two profiles.
Profile dissimilarity algorithm
Returns a positive number, which is a measure of dissimilarity.
For identical texts, the dissimilarity is 0. A sketch follows.
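A sketch of the dissimilarity, reusing the profile dictionaries from the previous sketch; n-grams missing from one profile get frequency 0 and contribute the maximum term of 4.

# Sketch of the CNG dissimilarity formula above.
def cng_dissimilarity(p1, p2):
    total = 0.0
    for s in set(p1) | set(p2):
        f1, f2 = p1.get(s, 0.0), p2.get(s, 0.0)
        total += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return total

a = {"_th": 0.015, "the": 0.013, "he_": 0.011}
b = {"_th": 0.016, "the": 0.014, "ing": 0.007}
print(cng_dissimilarity(a, a))  # 0.0 for identical profiles
print(cng_dissimilarity(a, b))  # > 0 for different profiles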
Text classification using CNG
Given a test document, a test profile is produced.
The distances between the test profile and the author profiles are calculated.
The test document is classified using the k-nearest-neighbours method with k = 1: the test document is attributed to the author whose profile is closest to the test profile.
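A sketch of this 1-nearest-neighbour attribution step, reusing cng_profile() and cng_dissimilarity() from the sketches above; the author texts are placeholders.

# Sketch: attribute a test text to the author whose CNG profile is
# closest (k-NN with k = 1).
def attribute(test_text, author_texts, n=3, L=100):
    test = cng_profile(test_text, n, L)
    return min(author_texts,
               key=lambda a: cng_dissimilarity(test, cng_profile(author_texts[a], n, L)))

authors = {"dickens": "Marley was dead: to begin with...",
           "carroll": "Alice was beginning to get very tired..."}
print(attribute("There is no doubt whatever about that.", authors))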
Strengths & weaknesses: CNG method
Strengths:
Easy to compute
Easy to test
Weaknesses:
Computational resources needed for training
Imbalanced datasets
Automatic selection of n and L
As an aside: ordering doesn’t matter
Aoccdrnig to rscheearch at an Elingsh uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, olny taht the frist and lsat ltteres are at the rghit pcleas. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by ilstef, but the wrod as a wlohe.
“Humans are interesting” – Ryuk
References
E. Black et al., A procedure for quantitatively comparing the syntactic coverage of English grammars, 1991.
Tuomo Kakkonen, Framework and resources for natural language parser evaluation, 2007.
Cornoiu Sorina, Solving the heterogeneity problem in e-government using n-grams.
Alberto Barron-Cedeno and Paolo Rosso, On automatic plagiarism detection based on n-grams comparison.
V. Keselj, N. Cercone et al., N-gram-based author profiles for authorship attribution, 2003.
V. Keselj, N. Cercone, CNG method with weighted voting.
T. Abou-Assaleh, N. Cercone et al., N-gram-based detection of new malicious code, 2004.
Tan, Steinbach, Kumar, Introduction to Data Mining (book).
Thank you! Questions?
Evaluating K-means clusters
The most common measure is the Sum of Squared Error (SSE).
For each point, the error is the distance to the nearest cluster centroid; to get SSE, we square these errors and sum them:
SSE = Σ (i = 1 to K) Σ (x ∈ Ci) dist(mi, x)^2
where x is a data point in cluster Ci and mi is the representative point for cluster Ci; one can show that mi corresponds to the center (mean) of the cluster.
Given two clusterings, we can choose the one with the smallest error.
One easy way to reduce SSE is to increase K, the number of clusters.
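A direct sketch of the SSE formula above; the points, labels, and centroids are toy data.

import numpy as np

# Sketch: SSE of a K-means clustering, summing squared distances of
# each point to its cluster's centroid.
def sse(points, labels, centroids):
    return sum(np.sum((points[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

pts = np.array([[0., 0.], [0., 2.], [5., 5.]])
labels = np.array([0, 0, 1])
cents = np.array([[0., 1.], [5., 5.]])
print(sse(pts, labels, cents))  # 1 + 1 + 0 = 2.0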
Measures of cluster validity
Numerical measures:
External index: used to measure the extent to which cluster labels match externally supplied class labels. E.g. entropy.
Internal index: used to measure the goodness of a clustering structure without respect to external information. E.g. Sum of Squared Error (SSE).
Relative index: used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g. SSE or entropy.
External Measures of Cluster Validity: Entropy and Purity
Cluster validity
For supervised classification, we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
For cluster analysis, the analogous question is: how do we evaluate the “goodness” of the resulting clusters?
But “clusters are in the eye of the beholder”!
Then why do we want to evaluate them?
To avoid finding patterns in noise
To compare clustering algorithms
To compare two sets of clusters
To compare two clusters
Clusters found in random data
[Figure: uniformly random points in the unit square, and the clusters nonetheless “found” in them by K-means and complete link.]
K-means clustering
Partitional clustering approach.
Each cluster is associated with a centroid (center point).
Each point is assigned to the cluster with the closest centroid.
The number of clusters, K, must be specified.
Text clustering
Text clustering is quite different…
Feature representations of text will typically have a large number of dimensions (10^3 - 10^6).
Euclidean distance isn’t necessarily the best distance metric for feature representations.
Typically use normalized, TF/IDF-weighted vectors and cosine similarity.
Optimize computations for sparse vectors.
Applications:
During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
Cluster the results of retrieval to present more organized results to the user (e.g. Clusty, Northern Light folders).
Automated production of hierarchical taxonomies of documents for browsing purposes (e.g. Yahoo).
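A sketch of this typical pipeline with scikit-learn; TfidfVectorizer L2-normalizes rows by default, so K-means on these vectors behaves much like clustering by cosine similarity. The documents are made up.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Sketch: cluster documents via sparse, normalized TF/IDF vectors.
docs = ["jaguar runs in the rainforest",
        "the jaguar car was unveiled",
        "new sports car on the road",
        "big cats of the rainforest"]
X = TfidfVectorizer().fit_transform(docs)  # sparse TF/IDF matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)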
Cluster similarity: MIN (single link)
The similarity of two clusters is based on the two most similar (closest) points in the different clusters.
Determined by one pair of points, i.e. by one link in the proximity graph.
Similarity matrix:
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Strengths: MIN
[Figure: original points and the two clusters single link finds.]
Can handle non-elliptical shapes.
Limitations: MIN
[Figure: original points and the two clusters single link finds.]
Sensitive to noise and outliers.
Cluster similarity: MAX (complete linkage)
The similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
Determined by all pairs of points in the two clusters (same similarity matrix as above).
Strengths: MAX
[Figure: original points and the two clusters complete link finds.]
Less susceptible to noise and outliers.
Limitations: MAX
[Figure: original points and the two clusters complete link finds.]
Tends to break large clusters.
Biased towards globular clusters.
Cluster similarity: group average
The proximity of two clusters is the average of the pairwise proximities between points in the two clusters (same similarity matrix as above):
proximity(Cluster_i, Cluster_j) = Σ (p_i ∈ Cluster_i, p_j ∈ Cluster_j) proximity(p_i, p_j) / (|Cluster_i| · |Cluster_j|)
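A sketch of the group-average computation on the slides’ proximity matrix (I1–I5 mapped to indices 0–4); the cluster memberships are illustrative.

import numpy as np

# Sketch: group-average proximity between two clusters, given a full
# proximity matrix.
def group_average(prox, cluster_i, cluster_j):
    return np.mean([prox[a, b] for a in cluster_i for b in cluster_j])

# Proximity matrix from the slides (I1..I5 as indices 0..4).
prox = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                 [0.90, 1.00, 0.70, 0.60, 0.50],
                 [0.10, 0.70, 1.00, 0.40, 0.30],
                 [0.65, 0.60, 0.40, 1.00, 0.80],
                 [0.20, 0.50, 0.30, 0.80, 1.00]])
print(group_average(prox, [0, 1], [3, 4]))  # mean of 0.65, 0.20, 0.60, 0.50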
Strengths & limitations: group average
A compromise between single and complete link.
Strength: less susceptible to noise and outliers.
Limitation: biased towards globular clusters.
Cluster similarity: Ward’s method
The similarity of two clusters is based on the increase in squared error when the two clusters are merged.
Similar to group average if the distance between points is the squared distance.
Less susceptible to noise and outliers.
Biased towards globular clusters.
Can be used to initialize K-means.