Transcript of cs707_030112
Hierarchical Clustering algorithms
Agglomerative (bottom-up):
  Precondition: Start with each document as a separate cluster.
  Postcondition: Eventually all documents belong to the same cluster.
Divisive (top-down):
  Precondition: Start with all documents belonging to the same cluster.
  Postcondition: Eventually each document forms a cluster of its own.
Does not require the number of clusters k in advance.
Needs a termination/readout condition.
Hierarchical Agglomerative Clustering (HAC) Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, c_i and c_j, that are most similar.
  Replace c_i and c_j with a single cluster c_i ∪ c_j.
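As a rough illustration, here is a minimal Python sketch of the loop above, assuming documents arrive as unit-length vectors and using cosine similarity; the `linkage` parameter and all names are my own, not from the lecture (max gives single-link, min gives complete-link, both discussed on later slides).

```python
import numpy as np

def hac(vectors, linkage=max):
    """Naive agglomerative clustering over unit-length document vectors.

    `linkage` combines pairwise doc similarities into a cluster-cluster
    similarity (max -> single-link, min -> complete-link).
    Returns the merge history as a list of (cluster, cluster) pairs.
    """
    sim = vectors @ vectors.T                      # cosine similarities (unit vectors)
    clusters = [[i] for i in range(len(vectors))]  # each doc starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):             # find the most similar pair
            for b in range(a + 1, len(clusters)):
                s = linkage(sim[x, y] for x in clusters[a] for y in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        merges.append((list(clusters[a]), list(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]    # replace c_i and c_j with c_i ∪ c_j
        del clusters[b]
    return merges

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
print(hac(docs))    # merges the two near-parallel docs first
```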
Dendrogram: Document Example
As clusters agglomerate, docs likely to fall into a hierarchy of topics or concepts.
(Figure: dendrogram over documents d1-d5, merging d1,d2 and d4,d5 first, then d3 with d4,d5.)
Key notion: cluster representative
We want a notion of a representative point in a cluster, to represent the location of each cluster.
The representative should be some sort of typical or central point in the cluster, e.g.:
  the point inducing the smallest radius to docs in the cluster,
  the point with the smallest squared distances to docs in the cluster,
  the point that is the average of all docs in the cluster.
Centroid or center of gravity: measure intercluster distances by distances of centroids.
Example: n=6, k=3, closest pair of centroids
(Figure: documents d1-d6; the closest pair of centroids is merged, showing the centroid after the first step and the centroid after the second step.)
Outliers in centroid computation
Can ignore outliers when computing the centroid.
What is an outlier? Lots of statistical definitions, e.g., moment of point to centroid > M × some cluster moment; say M = 10.
(Figure: a cluster with its centroid and a distant outlier.)
Closest pair of clusters
Many variants to defining closest pair of clusters:
  Single-link: similarity of the most cosine-similar pair of points.
  Complete-link: similarity of the furthest points, the least cosine-similar pair.
  Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar.
  Average-link: average cosine between pairs of elements.
Single Link Agglomerative Clustering
Use maximum similarity of pairs:
  sim(c_i, c_j) = \max_{x \in c_i, y \in c_j} sim(x, y)
Can result in straggly (long and thin) clusters due to the chaining effect.
After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:
  sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))
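This max-rule means the merged cluster's similarities can be maintained incrementally rather than recomputed from all document pairs; a small sketch over a cluster-cluster similarity matrix (function and variable names are my own, not from the slides):

```python
import numpy as np

def merge_single_link(sim, i, j):
    """Merge clusters i and j in a cluster-cluster similarity matrix.

    Single-link update: sim(c_i ∪ c_j, c_k) = max(sim(c_i, c_k), sim(c_j, c_k)).
    Row/column i becomes the merged cluster; row/column j is removed.
    """
    merged = np.maximum(sim[i], sim[j])
    sim[i, :] = merged
    sim[:, i] = merged
    sim[i, i] = 1.0                                   # self-similarity
    return np.delete(np.delete(sim, j, axis=0), j, axis=1)
```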
Single Link Example
Complete Link Agglomerative Clustering
Use minimum similarity of pairs:
  sim(c_i, c_j) = \min_{x \in c_i, y \in c_j} sim(x, y)
Makes tighter, spherical clusters that are typically preferable.
After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:
  sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))
(Figure: clusters C_i, C_j, C_k.)
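For comparison, a small sketch using SciPy's hierarchical clustering (SciPy is my assumption here, not something the lecture prescribes) to run single-link and complete-link over cosine distances and cut each dendrogram into three flat clusters:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

docs = np.random.rand(20, 50)               # 20 toy "documents" over 50 terms
dist = pdist(docs, metric='cosine')         # condensed pairwise cosine distances

single = linkage(dist, method='single')     # chaining-prone, straggly clusters
complete = linkage(dist, method='complete') # tighter, more spherical clusters

print(fcluster(single, t=3, criterion='maxclust'))
print(fcluster(complete, t=3, criterion='maxclust'))
```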
Complete Link Example
Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n^2).
In each of the subsequent n-2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
In order to maintain an overall O(n^2) performance, computing the similarity to each cluster must be done in constant time.
Often O(n^3) if done naively, or O(n^2 log n) if done more cleverly.
Group Average Agglomerative Clustering
Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters:
  sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|(|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j} \; \sum_{y \in c_i \cup c_j,\, y \neq x} sim(x, y)
Compromise between single and complete link.
Two options:
  Averaged across all ordered pairs in the merged cluster.
  Averaged over all pairs between the two original clusters.
No clear difference in efficacy.
Computing Group Average Similarity
Assume cosine similarity and normalized vectors with unit length.
Always maintain the sum of vectors in each cluster:
  s(c_j) = \sum_{x \in c_j} x
Compute the similarity of clusters in constant time:
  sim(c_i, c_j) = \frac{(s(c_i) + s(c_j)) \cdot (s(c_i) + s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}
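A small Python sketch of this constant-time computation from maintained vector sums, with a brute-force check over all ordered pairs; it assumes unit-length document vectors, and the names are mine:

```python
import numpy as np

def group_average_sim(sum_i, n_i, sum_j, n_j):
    """Group-average similarity of two clusters from their vector sums.

    sum_i, sum_j: sums of the unit-length doc vectors in each cluster
    n_i, n_j:     cluster sizes
    """
    s, n = sum_i + sum_j, n_i + n_j
    return (s @ s - n) / (n * (n - 1))

docs = np.random.rand(6, 4)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
ci, cj = docs[:3], docs[3:]

fast = group_average_sim(ci.sum(axis=0), 3, cj.sum(axis=0), 3)
sims = docs @ docs.T
brute = (sims.sum() - np.trace(sims)) / (6 * 5)   # average over ordered pairs, x != y
assert np.isclose(fast, brute)
```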
Medoid as Cluster Representative
The centroid does not have to be a document.
Medoid: a cluster representative that is one of the documents (e.g., the document closest to the centroid).
One reason this is useful: consider the representative of a large cluster (>1K docs).
The centroid of this cluster will be a dense vector.
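As an illustration of picking a medoid, a minimal sketch that selects the in-cluster document closest (by cosine) to the centroid; unit-length vectors and the helper name are my assumptions:

```python
import numpy as np

def medoid_index(cluster_vectors):
    """Index of the document closest to the cluster centroid (by cosine)."""
    centroid = cluster_vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = cluster_vectors @ centroid      # docs are assumed unit-length already
    return int(np.argmax(sims))
```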
Efficiency: Using approximations
In the standard algorithm, we must find the closest pair of centroids at each step.
Approximation: instead, find a nearly closest pair.
Use some data structure that makes this approximation easier to maintain.
Simplistic example: maintain the closest pair based on distances in a projection onto a random line.
(Figure: points projected onto a random line.)
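A sketch of that simplistic random-line idea (all names and details are my own): project the centroids onto one random direction, sort, and only compare neighbours in that ordering.

```python
import numpy as np

def nearly_closest_pair(centroids, seed=0):
    """Approximate the closest pair of centroids via a 1-D random projection."""
    rng = np.random.default_rng(seed)
    line = rng.normal(size=centroids.shape[1])
    line /= np.linalg.norm(line)
    order = np.argsort(centroids @ line)          # positions along the random line
    best, pair = np.inf, None
    for a, b in zip(order[:-1], order[1:]):       # only adjacent candidates
        d = np.linalg.norm(centroids[a] - centroids[b])
        if d < best:
            best, pair = d, (int(a), int(b))
    return pair
```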
Major issue - labeling
After a clustering algorithm finds clusters, how can they be useful to the end user?
Need a pithy label for each cluster.
In search results, say "Animal" or "Car" in the "jaguar" example.
In topic trees (Yahoo), need navigational cues.
Often done by hand, a posteriori.
How to Label Clusters
Show titles of typical documents:
  Titles are easy to scan.
  Authors create them for quick scanning!
  But you can only show a few titles, which may not fully represent the cluster.
Show words/phrases prominent in the cluster:
  More likely to fully represent the cluster.
  Use distinguishing words/phrases.
  Differential labeling.
Labeling
Common heuristics - list the 5-10 most frequent terms in the centroid vector.
Drop stop-words; stem.
Differential labeling by frequent terms:
  Within a collection "Computers", clusters all have the word "computer" as a frequent term.
  Discriminant analysis of centroids.
Perhaps better: distinctive noun phrases.
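A rough sketch of the frequent-term heuristic, assuming a vocabulary list aligned with the centroid's dimensions and a small stop-word set (both of which are my own illustrative assumptions):

```python
import numpy as np

STOPWORDS = {"the", "a", "of", "and", "to", "in"}      # illustrative only

def label_cluster(centroid, vocabulary, k=5):
    """Return the k highest-weighted non-stop-word terms of a centroid vector."""
    ranked = np.argsort(centroid)[::-1]                # heaviest dimensions first
    terms = [vocabulary[i] for i in ranked if vocabulary[i] not in STOPWORDS]
    return terms[:k]
```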
What is a Good Clustering?
Internal criterion: a good clustering will produce high quality clusters in which:
  the intra-class (that is, intra-cluster) similarity is high
  the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used.
External criteria for clustering quality
Quality is measured by the clustering's ability to discover some or all of the hidden patterns or latent classes in gold standard data.
Assessing a clustering with respect to ground truth requires labeled data.
Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, ω_1, ω_2, ..., ω_K, each with n_i members.
External Evaluation of Cluster Quality
Simple measure: purity, the ratio between the size of the dominant class in cluster ω_i and the size of cluster ω_i:
  Purity(\omega_i) = \frac{1}{n_i} \max_{j} \, n_{ij}, \quad j \in C
where n_{ij} is the number of members of class j in cluster ω_i.
Others are entropy of classes in clusters (or mutual information between classes and clusters).
Purity example
(Figure: three example clusters, Cluster I, Cluster II, Cluster III, each containing a mix of items from three classes.)
Cluster I:   Purity = 1/6 * (max(5, 1, 0)) = 5/6
Cluster II:  Purity = 1/6 * (max(1, 4, 1)) = 4/6
Cluster III: Purity = 1/5 * (max(2, 0, 3)) = 3/5
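A tiny sketch that reproduces the three purity values above from per-cluster class counts (the counts are read off the example figure):

```python
def purity(class_counts):
    """Purity of one cluster: dominant class count over cluster size."""
    return max(class_counts) / sum(class_counts)

print(purity([5, 1, 0]))   # Cluster I   -> 5/6
print(purity([1, 4, 1]))   # Cluster II  -> 4/6
print(purity([2, 0, 3]))   # Cluster III -> 3/5
```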
Rand Index
Number of points                    | Same cluster in clustering | Different clusters in clustering
Same class in ground truth          | A                          | C
Different classes in ground truth   | B                          | D
Rand index: symmetric version
  RI = \frac{A + D}{A + B + C + D}
Compare with standard Precision and Recall:
  P = \frac{A}{A + B} \qquad R = \frac{A}{A + C}
Rand Index example: 0.68
Number of points                    | Same cluster in clustering | Different clusters in clustering
Same class in ground truth          | 20                         | 24
Different classes in ground truth   | 20                         | 72
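A one-line check of the 0.68 figure from the pair counts in the table (A = 20, B = 20, C = 24, D = 72):

```python
def rand_index(A, B, C, D):
    """Rand index: fraction of point pairs on which clustering and ground truth agree."""
    return (A + D) / (A + B + C + D)

print(rand_index(20, 20, 24, 72))   # 0.676..., i.e. about 0.68
```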
Final word and resources
In clustering, clusters are inferred from the data without human input (unsupervised learning).
However, in practice it's a bit less clear: there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...
SKIP WHAT FOLLOWS
Term vs. document space
So far, we clustered docs based on their similarities in term space.
For some applications, e.g., topic analysis for inducing navigation structures, we can dualize:
  use docs as axes
  represent (some) terms as vectors
  proximity based on co-occurrence of terms in docs
  now clustering terms, not docs
Term vs. document space
Cosine computation:
  Constant for docs in term space.
  Grows linearly with corpus size for terms in doc space.
Cluster labeling:
  Clusters have clean descriptions in terms of noun phrase co-occurrence.
Application of term clusters.
Multi-lingual docs
E.g., Canadian government docs: every doc in English and equivalent French.
Must cluster by concepts rather than language.
Simplest: pad docs in one language with dictionary equivalents in the other; thus each doc has a representation in both languages.
Axes are terms in both languages.
Feature selection
Which terms to use as axes for the vector space?
IDF is a form of feature selection, but it can exaggerate noise, e.g., mis-spellings.
Better to use the highest weight mid-frequency words - the most discriminating terms.
Pseudo-linguistic heuristics, e.g.:
  drop stop-words
  stemming/lemmatization
  use only nouns/noun phrases
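A minimal sketch of selecting mid-frequency terms by document frequency, assuming a binary term-document incidence matrix; the thresholds are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def mid_frequency_terms(incidence, low=0.01, high=0.5):
    """Select term indices whose document frequency lies strictly between low and high.

    incidence: docs x terms 0/1 matrix.
    """
    df = incidence.mean(axis=0)      # fraction of docs containing each term
    return np.where((df > low) & (df < high))[0]
```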
Evaluation of clustering
Perhaps the most substantive issue in data mining in general: how do you measure goodness?
Most measures focus on computational efficiency: time and space.
For application of clustering to search: measure retrieval effectiveness.
Approaches to evaluating
Anecdotal
Ground truth comparison
Cluster retrieval
Purely quantitative measures:
  Probability of generating clusters found
  Average distance between cluster members
Microeconomic / utility
Anecdotal evaluation
Probably the commonest (and surely the easiest).
"I wrote this clustering algorithm and look what it found!"
No benchmarks, no comparison possible.
Any clustering algorithm will pick up the easy stuff, like partition by languages.
Generally, unclear scientific value.
Ground truth comparison
Take a union of docs from a taxonomy & cluster:
  Yahoo!, ODP, newspaper sections.
Compare clustering results to the baseline, e.g., "80% of the clusters found map cleanly to taxonomy nodes".
But is it the right answer?
  There can be several equally right answers.
  For the docs given, the static prior taxonomy may be incomplete/wrong in places.
  Subjective.
Microeconomic viewpoint
Anything - including clustering - is only as good as the economic utility it provides.
For clustering: net economic gain produced by an approach (vs. another approach).
Strive for a concrete optimization problem.
Examples:
  recommendation systems
  clock time for interactive search - expensive
Evaluation example: Cluster retrieval
Ad-hoc retrieval: cluster docs in the returned set, identify the best cluster & only retrieve docs from it.
How do various clustering methods affect the quality of what's retrieved?
Concrete measure of quality: precision as measured by user judgements for these queries.
Evaluation
Compare two IR algorithms:
  1. send query, present ranked results
  2. send query, cluster results, present clusters
Experiment was simulated (no users):
  Results were clustered into 5 clusters.
  Clusters were ranked according to percentage of relevant documents.
  Documents within clusters were ranked according to similarity to the query.
Latent Semantic Indexing
Adapted from Lectures by Prabhaker Raghavan, Christopher Manning and Thomas Hoffmann
Today's topic: Latent Semantic Indexing
Term-document matrices are very large.
But the number of topics that people talk about is small (in some sense): clothes, movies, politics, ...
Can we represent the term-document space by a lower dimensional latent space?
Linear Algebra Background
Eigenvalues & Eigenvectors
Eigenvectors (for a square m×m matrix S):
  S v = \lambda v, where \lambda is an eigenvalue and v a (right) eigenvector.
How many eigenvalues are there at most?
  S v = \lambda v has a non-zero solution v only if \det(S - \lambda I) = 0.
  This is an m-th order equation in \lambda which can have at most m distinct solutions (roots of the characteristic polynomial); they can be complex even though S is real.
Matrix-vector multiplication
  S = \begin{pmatrix} 30 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 1 \end{pmatrix}
has eigenvalues 30, 20, 1 with corresponding eigenvectors
  v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad v_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
On each eigenvector, S acts as a multiple of the identity matrix: but as a different multiple on each.
Any vector (say x = \begin{pmatrix} 2 \\ 4 \\ 6 \end{pmatrix}) can be viewed as a combination of the eigenvectors: x = 2v_1 + 4v_2 + 6v_3.
Matrix vector multiplication
Thus a matrix-vector multiplication such as Sx (S, x as in the previous slide) can be rewritten in terms of the eigenvalues/vectors:
  Sx = S(2v_1 + 4v_2 + 6v_3)
  Sx = 2 S v_1 + 4 S v_2 + 6 S v_3 = 2\lambda_1 v_1 + 4\lambda_2 v_2 + 6\lambda_3 v_3
  Sx = 60 v_1 + 80 v_2 + 6 v_3
Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/vectors.
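A quick numerical check of this slide's arithmetic (NumPy is my assumption; the lecture does not prescribe a tool):

```python
import numpy as np

S = np.diag([30.0, 20.0, 1.0])
v1, v2, v3 = np.eye(3)                  # the eigenvectors from the previous slide
x = 2 * v1 + 4 * v2 + 6 * v3            # x = (2, 4, 6)

print(S @ x)                            # [60. 80.  6.]
print(60 * v1 + 80 * v2 + 6 * v3)       # the same vector, built from eigen components
```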
Matrix vector multiplication
Observation: the effect of small eigenvalues is small.
If we ignored the smallest eigenvalue (1), then instead of (60, 80, 6) we would get (60, 80, 0).
These vectors are similar (in terms of cosine similarity), or close (in terms of Euclidean distance).
Eigenvalues & Eigenvectors
For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal:
  S v_{\{1,2\}} = \lambda_{\{1,2\}} v_{\{1,2\}}, \text{ and } \lambda_1 \neq \lambda_2 \Rightarrow v_1 \cdot v_2 = 0
All eigenvalues of a real symmetric matrix are real:
  \text{if } |S - \lambda I| = 0 \text{ and } S = S^T, \text{ then } \lambda \in \mathbb{R}
All eigenvalues of a positive semi-definite matrix are non-negative:
  \forall w \in \mathbb{R}^n,\; w^T S w \ge 0; \text{ then if } S v = \lambda v \Rightarrow \lambda \ge 0
Example
Let S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}   (real, symmetric).
Then |S - \lambda I| = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = (2-\lambda)^2 - 1 = 0.
The eigenvalues are 1 and 3 (non-negative, real).
Plug in these values and solve for the eigenvectors.
The eigenvectors are orthogonal (and real): \begin{pmatrix} 1 \\ -1 \end{pmatrix} and \begin{pmatrix} 1 \\ 1 \end{pmatrix}.
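The same example checked numerically (NumPy assumed); `eigh` is appropriate for symmetric matrices and returns eigenvalues in ascending order with orthonormal eigenvectors as columns (up to sign):

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(S)           # symmetric matrix -> real spectrum
print(vals)                              # [1. 3.]
print(vecs)                              # columns proportional to (1, -1) and (1, 1)
print(vecs[:, 0] @ vecs[:, 1])           # ~0: the eigenvectors are orthogonal
```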
Eigen/diagonal Decomposition
Let S be a square m×m matrix with m linearly independent eigenvectors (a non-defective matrix).
Theorem: there exists an eigen decomposition S = U \Lambda U^{-1} (cf. matrix diagonalization theorem), where \Lambda is diagonal.
Columns of U are the eigenvectors of S.
Diagonal elements of \Lambda are the eigenvalues of S.
Unique for distinct eigenvalues.
Diagonal decomposition: why/how
Let U have the eigenvectors as columns:
  U = \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix}
Then SU can be written
  S U = S \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix} = \begin{pmatrix} \lambda_1 v_1 & \cdots & \lambda_n v_n \end{pmatrix} = \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix} \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix}
Thus SU = U\Lambda, or U^{-1} S U = \Lambda.
And S = U \Lambda U^{-1}.
Diagonal decomposition - example
Recall S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}; \; \lambda_1 = 1, \lambda_2 = 3.
The eigenvectors \begin{pmatrix} 1 \\ -1 \end{pmatrix} and \begin{pmatrix} 1 \\ 1 \end{pmatrix} form U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}.
Inverting, we have U^{-1} = \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}.
Recall U U^{-1} = I.
Then, S = U \Lambda U^{-1} = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}.
Example continued
Let's divide U (and multiply U^{-1}) by \sqrt{2}.
Then, S = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}
The first factor is Q (with Q^{-1} = Q^T). Why? Stay tuned ...
Symmetric Eigen Decomposition
If S is a symmetric matrix:
Theorem: there exists a (unique) eigen decomposition S = Q \Lambda Q^T, where Q is orthogonal:
  Q^{-1} = Q^T
Columns of Q are normalized eigenvectors.
Columns are orthogonal.
(Everything is real.)
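A short numerical sketch of this theorem on the running 2×2 example (NumPy assumed):

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, Q = np.linalg.eigh(S)              # Q holds orthonormal eigenvectors as columns
Lam = np.diag(vals)

print(np.allclose(Q @ Lam @ Q.T, S))     # True: S = Q Λ Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))   # True: Q is orthogonal, Q^{-1} = Q^T
```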