cs707_030112


Transcript of cs707_030112

  • Hierarchical Clustering algorithms

    Agglomerative (bottom-up):
    Precondition: Start with each document as a separate cluster.
    Postcondition: Eventually all documents belong to the same cluster.

    Divisive (top-down):
    Precondition: Start with all documents belonging to the same cluster.
    Postcondition: Eventually each document forms a cluster of its own.

    Hierarchical clustering does not require the number of clusters k in advance, but it needs a termination/readout condition.

  • Hierarchical Agglomerative Clustering (HAC): Algorithm

    Start with all instances in their own cluster.
    Until there is only one cluster:
      Among the current clusters, determine the two clusters, ci and cj, that are most similar.
      Replace ci and cj with a single cluster ci ∪ cj.

    A minimal code sketch of this loop follows.
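    A minimal, unoptimized Python sketch of the HAC loop above, assuming a user-supplied cluster-similarity function (here called cluster_sim, an illustrative name, e.g. single-link or complete-link as defined later); as written it runs in O(n³):

```python
# Naive HAC: repeatedly merge the two most similar clusters until one remains.
def hac(instances, cluster_sim):
    clusters = [[x] for x in instances]          # start: every instance is its own cluster
    merges = []                                  # record of merges (the dendrogram)
    while len(clusters) > 1:
        # find the most similar pair of current clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        # replace ci and cj with the single cluster ci ∪ cj
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges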

  • Dendrogram: Document Example

    As clusters agglomerate, docs are likely to fall into a hierarchy of topics or concepts.

    [Dendrogram over documents d1 through d5: d1 and d2 merge; d4 and d5 merge; d3 then joins {d4, d5}.]

  • Key notion: cluster representative

    We want a notion of a representative point in a cluster, to represent the location of each cluster.
    The representative should be some sort of typical or central point in the cluster, e.g.:
      the point inducing the smallest radii to docs in the cluster,
      the point inducing the smallest squared distances, or
      the point that is the average of all docs in the cluster: the centroid or center of gravity.
    Measure intercluster distances by distances between centroids.

  • Example: n = 6, k = 3, closest pair of centroids

    [Figure: six documents d1 through d6; the centroid is shown after the first merge step and again after the second.]

  • Outliers in centroid computation

    Outliers can be ignored when computing the centroid.
    What is an outlier? There are many statistical definitions, e.g. the moment of a point to the centroid exceeds M times some cluster moment; say M = 10.

    [Figure: a cluster with its centroid and a distant outlier.]

  • Closest pair of clusters

    There are many variants of defining the closest pair of clusters:
    Single-link: similarity of the most cosine-similar pair of points.
    Complete-link: similarity of the furthest points, i.e. the least cosine-similar pair.
    Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar.
    Average-link: average cosine similarity between pairs of elements.

    A SciPy sketch of these variants follows.
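    As an illustration (not from the slides), the same four variants are available in SciPy's hierarchical-clustering module, assuming SciPy is installed; the toy documents below are random stand-ins. Cosine distances are used for single/complete/average linkage; centroid linkage in SciPy is defined for Euclidean distances, so it is shown on the raw vectors:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

docs = np.random.rand(6, 10)                   # 6 toy "documents" in a 10-term space

d = pdist(docs, metric="cosine")               # condensed pairwise cosine distances
Z_single   = linkage(d, method="single")       # most-similar pair of points
Z_complete = linkage(d, method="complete")     # least-similar pair of points
Z_average  = linkage(d, method="average")      # average over all cross-pairs
Z_centroid = linkage(docs, method="centroid")  # distance between centroids (Euclidean)

# e.g. cut the single-link dendrogram into 3 clusters
labels = fcluster(Z_single, t=3, criterion="maxclust")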

  • Single Link Agglomerative Clustering

    Use the maximum similarity over pairs:

    $\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$

    Can result in straggly (long and thin) clusters due to the chaining effect.

    After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

    $\mathrm{sim}((c_i \cup c_j), c_k) = \max\big(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k)\big)$

  • Single Link Example (figure)

  • Complete Link Agglomerative Clustering

    Use the minimum similarity over pairs:

    $\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$

    Makes tighter, spherical clusters that are typically preferable.

    After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

    $\mathrm{sim}((c_i \cup c_j), c_k) = \min\big(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k)\big)$

    [Figure: clusters Ci, Cj, Ck.]

    Both merge-update rules are sketched in code below.
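    The two merge-update rules fit in a few lines; this sketch assumes a precomputed similarity table `sim` keyed by frozensets of cluster ids (illustrative names, not from the slides):

```python
# Merge-update rules for single-link and complete-link, as in the formulas above.
# `sim` is an assumed dict mapping frozenset({a, b}) -> similarity of clusters a, b.
def merged_similarity(sim, ci, cj, ck, linkage="single"):
    s_ik = sim[frozenset({ci, ck})]
    s_jk = sim[frozenset({cj, ck})]
    # sim((ci ∪ cj), ck) = max(...) for single-link, min(...) for complete-link
    return max(s_ik, s_jk) if linkage == "single" else min(s_ik, s_jk)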

  • Complete Link Example (figure)

  • Computational Complexity

    In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).
    In each of the subsequent n − 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
    In order to maintain an overall O(n²) performance, computing the similarity to each cluster must be done in constant time.
    The cost is often O(n³) if done naively, or O(n² log n) if done more cleverly.

  • Group Average Agglomerative Clustering

    Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters.
    This is a compromise between single and complete link.
    Two options:
      averaged across all ordered pairs in the merged cluster, or
      averaged over all pairs between the two original clusters.
    There is no clear difference in efficacy.

    $\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j} \; \sum_{y \in c_i \cup c_j,\ y \neq x} \mathrm{sim}(x, y)$

  • Computing Group Average Similarity

    Assume cosine similarity and vectors normalized to unit length.
    Always maintain the sum of vectors in each cluster:

    $\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$

    Then the similarity of clusters can be computed in constant time:

    $\mathrm{sim}(c_i, c_j) = \frac{\big(\vec{s}(c_i) + \vec{s}(c_j)\big) \cdot \big(\vec{s}(c_i) + \vec{s}(c_j)\big) - \big(|c_i| + |c_j|\big)}{\big(|c_i| + |c_j|\big)\big(|c_i| + |c_j| - 1\big)}$

    A short code sketch of this constant-time computation follows.
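    A sketch of the constant-time computation, assuming unit-length document vectors and per-cluster sums maintained incrementally (function and variable names are illustrative):

```python
import numpy as np

# Group-average similarity in O(1) given per-cluster vector sums, as in the formula above.
def group_average_sim(sum_i, n_i, sum_j, n_j):
    s = sum_i + sum_j                 # s(ci) + s(cj), maintained incrementally per cluster
    n = n_i + n_j
    # (s·s - n) is the sum of dot products over all distinct ordered pairs,
    # since each unit vector's self-dot-product contributes exactly 1.
    return (np.dot(s, s) - n) / (n * (n - 1))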

  • Medoid as Cluster Representative

    The centroid does not have to be a document.
    Medoid: a cluster representative that is one of the documents (e.g., the document closest to the centroid).
    One reason this is useful: consider the representative of a large cluster (>1K docs); the centroid of this cluster will be a dense vector, whereas the medoid remains an actual (sparse) document.

    A small code sketch follows.
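    A small sketch of picking the medoid as the document closest to the centroid, assuming the rows of `docs` are length-normalized (illustrative names):

```python
import numpy as np

# Pick the medoid: the document with the highest cosine similarity to the centroid.
def medoid(docs):
    docs = np.asarray(docs)
    centroid = docs.mean(axis=0)
    sims = docs @ centroid            # proportional to cosine when rows have unit length
    return docs[np.argmax(sims)]      # an actual document from the cluster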

  • Efficiency: Using approximations

    In the standard algorithm, we must find the closest pair of centroids at each step.
    Approximation: instead, find a nearly closest pair; use some data structure that makes this approximation easier to maintain.
    A simplistic example: maintain the closest pair based on distances in a projection onto a random line (sketched in code below).
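    A sketch of that simplistic example: project the centroids onto a random direction and only compare centroids that end up adjacent along that line (illustrative names, not the slides' implementation):

```python
import numpy as np

# Approximate the closest pair of centroids via a 1-D projection onto a random line.
def approx_closest_pair(centroids, rng=np.random.default_rng()):
    centroids = np.asarray(centroids)
    direction = rng.standard_normal(centroids.shape[1])
    order = np.argsort(centroids @ direction)        # positions along the random line
    best = None
    for a, b in zip(order[:-1], order[1:]):          # only adjacent pairs are checked
        dist = np.linalg.norm(centroids[a] - centroids[b])
        if best is None or dist < best[0]:
            best = (dist, a, b)
    return best[1], best[2]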

  • Major issue: labeling

    After the clustering algorithm finds clusters, how can they be made useful to the end user?
    We need a pithy label for each cluster.
    In search results, say "Animal" or "Car" in the "jaguar" example.
    In topic trees (Yahoo), we need navigational cues; labeling is often done by hand, a posteriori.

  • How to Label Clusters

    Show titles of typical documents:
      Titles are easy to scan; authors create them for quick scanning!
      But you can only show a few titles, which may not fully represent the cluster.
    Show words/phrases prominent in the cluster:
      More likely to fully represent the cluster.
      Use distinguishing words/phrases (differential labeling).

  • Labeling

    Common heuristic: list the 5 to 10 most frequent terms in the centroid vector (drop stop-words; stem).
    Differential labeling by frequent terms:
      Within a collection on "Computers", all clusters have the word "computer" as a frequent term.
      Use discriminant analysis of centroids.
      Perhaps better: a distinctive noun phrase.

    A small code sketch of the frequent-terms heuristic follows.
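    A sketch of the frequent-terms heuristic, assuming a dense centroid over a `vocabulary` list and a `stop_words` set (illustrative names):

```python
import numpy as np

# Label a cluster with the highest-weight terms of its centroid, minus stop-words.
def cluster_label(doc_vectors, vocabulary, stop_words, k=5):
    centroid = np.asarray(doc_vectors).mean(axis=0)
    order = np.argsort(centroid)[::-1]               # highest-weight terms first
    terms = [vocabulary[i] for i in order if vocabulary[i] not in stop_words]
    return terms[:k]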

  • What is a Good Clustering?

    Internal criterion: a good clustering will produce high-quality clusters in which:
      the intra-class (that is, intra-cluster) similarity is high, and
      the inter-class similarity is low.
    The measured quality of a clustering depends on both the document representation and the similarity measure used.

  • External criteria for clustering quality

    Quality is measured by the clustering's ability to discover some or all of the hidden patterns or latent classes in gold-standard data.
    Assessing a clustering with respect to ground truth requires labeled data.
    Assume documents with C gold-standard classes, while our clustering algorithm produces K clusters, ω1, ω2, …, ωK, where cluster ωi has ni members.

  • External Evaluation of Cluster Quality

    A simple measure: purity, the ratio between the size of the dominant class in cluster ωi and the size of cluster ωi:

    $\mathrm{Purity}(\omega_i) = \frac{1}{n_i} \max_{j \in C} n_{ij}$

    Other measures include the entropy of classes in clusters (or the mutual information between classes and clusters).

  • Purity example

    [Figure: three clusters of points drawn from three classes.]

    Cluster I:   Purity = 1/6 · max(5, 1, 0) = 5/6
    Cluster II:  Purity = 1/6 · max(1, 4, 1) = 4/6
    Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5
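    The three purity values can be recomputed directly from the per-cluster class counts in the figure:

```python
# Per-cluster counts of documents from each of the three gold-standard classes.
counts = {
    "Cluster I":   [5, 1, 0],
    "Cluster II":  [1, 4, 1],
    "Cluster III": [2, 0, 3],
}
for name, c in counts.items():
    purity = max(c) / sum(c)
    print(name, purity)   # 5/6 ≈ 0.833, 4/6 ≈ 0.667, 3/5 = 0.6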

  • Rand Index

    Number of points:

                                          Same cluster      Different clusters
                                          in clustering     in clustering
    Same class in ground truth                  A                   C
    Different classes in ground truth           B                   D

  • Rand index: symmetric version

    $RI = \frac{A + D}{A + B + C + D}$

    Compare with standard Precision and Recall:

    $P = \frac{A}{A + B} \qquad R = \frac{A}{A + C}$

  • Rand Index example: RI = 0.68

                                          Same cluster      Different clusters
                                          in clustering     in clustering
    Same class in ground truth                 20                  24
    Different classes in ground truth          20                  72
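    Plugging the table's pair counts into the formulas from the previous slide:

```python
# Rand index, precision and recall for the pair counts in the table above.
A, B, C, D = 20, 20, 24, 72

RI = (A + D) / (A + B + C + D)  # (20 + 72) / 136 ≈ 0.68
P  = A / (A + B)                # 20 / 40 = 0.5
R  = A / (A + C)                # 20 / 44 ≈ 0.45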

  • Final word and resources

    In clustering, clusters are inferred from the data without human input (unsupervised learning).
    However, in practice it is a bit less clear-cut: there are many ways of influencing the outcome of clustering, such as the number of clusters, the similarity measure, the representation of documents, and so on.

  • SKIP WHAT FOLLOWS

  • Term vs. document space

    So far, we clustered docs based on their similarities in term space.
    For some applications, e.g., topic analysis for inducing navigation structures, we can dualize:
      use docs as axes,
      represent (some) terms as vectors,
      base proximity on co-occurrence of terms in docs,
      now clustering terms, not docs.

  • Term vs. document space (continued)

    Cosine computation:
      constant for docs in term space,
      grows linearly with corpus size for terms in doc space.
    Cluster labeling: clusters have clean descriptions in terms of noun-phrase co-occurrence.
    Application of term clusters.

  • Multi-lingual docs

    E.g., Canadian government docs: every doc is in English with an equivalent in French.
    We must cluster by concepts rather than by language.
    Simplest approach: pad docs in one language with dictionary equivalents in the other; thus each doc has a representation in both languages.
    Axes are terms in both languages.

  • Feature selection

    Which terms to use as axes for the vector space?
    IDF is a form of feature selection, but it can exaggerate noise, e.g. mis-spellings.
    Better to use the highest-weight mid-frequency words: the most discriminating terms.
    Pseudo-linguistic heuristics, e.g.:
      drop stop-words,
      stemming/lemmatization,
      use only nouns/noun phrases.

  • Evaluation of clustering

    Perhaps the most substantive issue in data mining in general: how do you measure goodness?
    Most measures focus on computational efficiency (time and space).
    For the application of clustering to search: measure retrieval effectiveness.

  • Approaches to evaluating

    Anecdotal
    Ground truth comparison
    Cluster retrieval
    Purely quantitative measures:
      probability of generating the clusters found,
      average distance between cluster members.
    Microeconomic / utility

  • Anecdotal evaluation

    Probably the commonest (and surely the easiest): "I wrote this clustering algorithm and look what it found!"
    No benchmarks, no comparison possible.
    Any clustering algorithm will pick up the easy stuff, like partition by languages.
    Generally of unclear scientific value.

  • Ground truth comparison

    Take a union of docs from a taxonomy (Yahoo!, ODP, newspaper sections) and cluster.
    Compare clustering results to the baseline, e.g., "80% of the clusters found map cleanly to taxonomy nodes".
    But is it the right answer? There can be several equally right answers.
    For the docs given, the static prior taxonomy may be incomplete or wrong in places, so the comparison is subjective.

  • Microeconomic viewpoint

    Anything, including clustering, is only as good as the economic utility it provides.
    For clustering: the net economic gain produced by an approach (vs. another approach).
    Strive for a concrete optimization problem.
    Examples:
      recommendation systems,
      clock time for interactive search (expensive).

  • Evaluation example: Cluster retrieval

    Ad-hoc retrieval: cluster the docs in the returned set, identify the best cluster, and only retrieve docs from it.
    How do various clustering methods affect the quality of what is retrieved?
    Concrete measure of quality: precision as measured by user judgements for these queries.

  • Evaluation

    Compare two IR algorithms:
      1. send query, present ranked results;
      2. send query, cluster results, present clusters.
    The experiment was simulated (no users):
      results were clustered into 5 clusters,
      clusters were ranked according to the percentage of relevant documents,
      documents within clusters were ranked according to similarity to the query.

  • Latent Semantic Indexing

    Adapted from lectures by Prabhaker Raghavan, Christopher Manning, and Thomas Hoffmann.

  • Today's topic: Latent Semantic Indexing

    Term-document matrices are very large, but the number of topics that people talk about is small (in some sense): clothes, movies, politics, …
    Can we represent the term-document space by a lower-dimensional latent space?

  • Linear Algebra Background

  • Eigenvalues & Eigenvectors

    Eigenvectors (for a square m × m matrix S): a (right) eigenvector $\vec{v}$ and eigenvalue $\lambda$ satisfy

    $S\vec{v} = \lambda \vec{v}$

    How many eigenvalues are there at most? $(S - \lambda I)\vec{v} = 0$ only has a non-zero solution if $|S - \lambda I| = 0$; this is an m-th order equation in $\lambda$, which can have at most m distinct solutions (the roots of the characteristic polynomial). They can be complex even though S is real.

  • Matrix-vector multiplication

    $S = \begin{pmatrix} 30 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ has eigenvalues 30, 20, 1 with corresponding eigenvectors

    $\vec{v}_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad \vec{v}_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad \vec{v}_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$

    On each eigenvector, S acts as a multiple of the identity matrix, but as a different multiple on each.

    Any vector, say $\vec{x} = \begin{pmatrix} 2 \\ 4 \\ 6 \end{pmatrix}$, can be viewed as a combination of the eigenvectors: $\vec{x} = 2\vec{v}_1 + 4\vec{v}_2 + 6\vec{v}_3$.

  • Matrix-vector multiplication (continued)

    Thus a matrix-vector multiplication such as $S\vec{x}$ (S, $\vec{x}$ as on the previous slide) can be rewritten in terms of the eigenvalues/eigenvectors:

    $S\vec{x} = S(2\vec{v}_1 + 4\vec{v}_2 + 6\vec{v}_3) = 2S\vec{v}_1 + 4S\vec{v}_2 + 6S\vec{v}_3 = 2\lambda_1\vec{v}_1 + 4\lambda_2\vec{v}_2 + 6\lambda_3\vec{v}_3 = 60\vec{v}_1 + 80\vec{v}_2 + 6\vec{v}_3$

    Even though $\vec{x}$ is an arbitrary vector, the action of S on $\vec{x}$ is determined by the eigenvalues/eigenvectors.

  • Matrix-vector multiplication (continued)

    Observation: the effect of "small" eigenvalues is small. If we ignored the smallest eigenvalue (1), then instead of

    $\begin{pmatrix} 60 \\ 80 \\ 6 \end{pmatrix}$ we would get $\begin{pmatrix} 60 \\ 80 \\ 0 \end{pmatrix}$

    These vectors are similar (in terms of cosine similarity), or close (in terms of Euclidean distance).

    A short numerical check follows.
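    A quick NumPy check of the last two slides, assuming NumPy is available:

```python
import numpy as np

# S acts on x through its eigenvalues/eigenvectors.
S = np.diag([30.0, 20.0, 1.0])
x = np.array([2.0, 4.0, 6.0])           # x = 2·v1 + 4·v2 + 6·v3

print(S @ x)                            # [60. 80. 6.]

# Dropping the contribution of the smallest eigenvalue (1) barely changes the result:
S_approx = np.diag([30.0, 20.0, 0.0])
print(S_approx @ x)                     # [60. 80. 0.], close to [60, 80, 6]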

  • Eigenvalues & Eigenvectors (symmetric matrices)

    For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal:

    $S\vec{v}_{1,2} = \lambda_{1,2}\vec{v}_{1,2}$ and $\lambda_1 \neq \lambda_2 \Rightarrow \vec{v}_1 \cdot \vec{v}_2 = 0$

    All eigenvalues of a real symmetric matrix are real:

    $|S - \lambda I| = 0$ and $S = S^{\mathsf{T}} \Rightarrow \lambda \in \mathbb{R}$

    All eigenvalues of a positive semi-definite matrix are non-negative:

    $\forall \vec{w} \in \mathbb{R}^n,\ \vec{w}^{\mathsf{T}} S \vec{w} \geq 0$; then $S\vec{v} = \lambda\vec{v} \Rightarrow \lambda \geq 0$

  • Example

    Let $S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ (real, symmetric).

    Then $|S - \lambda I| = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = (2-\lambda)^2 - 1 = 0$.

    The eigenvalues are 1 and 3 (non-negative, real). Plug in these values and solve for the eigenvectors.
    The eigenvectors $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ are orthogonal (and real).
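    A NumPy check of this example (eigh is NumPy's eigensolver for symmetric matrices):

```python
import numpy as np

# Eigenvalues 1 and 3, with orthonormal (real) eigenvectors.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(S)   # eigh: for symmetric matrices
print(vals)                      # [1. 3.]
print(vecs.T @ vecs)             # identity: the eigenvectors are orthonormal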

  • Eigen/diagonal Decomposition

    Let S be a square m × m matrix with m linearly independent eigenvectors (a non-defective matrix).
    Theorem: there exists an eigen decomposition (cf. the matrix diagonalization theorem)

    $S = U \Lambda U^{-1}$

    where $\Lambda$ is diagonal, the columns of U are the eigenvectors of S, and the diagonal elements of $\Lambda$ are the eigenvalues of S.
    The decomposition is unique for distinct eigenvalues.

  • Diagonal decomposition: why/how

    Let U have the eigenvectors as columns: $U = \begin{pmatrix} \vec{v}_1 & \cdots & \vec{v}_n \end{pmatrix}$.

    Then SU can be written

    $SU = S\begin{pmatrix} \vec{v}_1 & \cdots & \vec{v}_n \end{pmatrix} = \begin{pmatrix} \lambda_1\vec{v}_1 & \cdots & \lambda_n\vec{v}_n \end{pmatrix} = U\Lambda$

    Thus $SU = U\Lambda$, or $U^{-1}SU = \Lambda$, and $S = U\Lambda U^{-1}$.

  • Diagonal decomposition: example

    Recall $S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ with eigenvalues $\lambda_1 = 1$, $\lambda_2 = 3$.

    The eigenvectors $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ form $U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}$.

    Inverting, we have $U^{-1} = \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$.

    Then $S = U\Lambda U^{-1} = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}\begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$.

    Recall that $U U^{-1} = I$.
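    Verifying the worked decomposition numerically with NumPy:

```python
import numpy as np

# Check that S = U Λ U⁻¹ recovers the original matrix.
U    = np.array([[1.0, 1.0],
                 [-1.0, 1.0]])          # columns: eigenvectors for λ = 1 and λ = 3
Lam  = np.diag([1.0, 3.0])
Uinv = np.linalg.inv(U)                 # [[ 0.5, -0.5], [ 0.5, 0.5]]

print(U @ Lam @ Uinv)                   # [[2. 1.] [1. 2.]] = S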

  • Example continued

    Let's divide U (and multiply $U^{-1}$) by $\sqrt{2}$. Then

    $S = \underbrace{\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}_{Q}\begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}\underbrace{\begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}_{Q^{-1} = Q^{\mathsf{T}}}$

    Why? Stay tuned.
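    And checking that the rescaled Q is orthogonal and still reconstructs S:

```python
import numpy as np

# Normalizing the eigenvectors (dividing U by √2) makes Q orthogonal: Q⁻¹ = Qᵀ.
Q = np.array([[1.0, 1.0],
              [-1.0, 1.0]]) / np.sqrt(2.0)
Lam = np.diag([1.0, 3.0])

print(np.allclose(Q.T, np.linalg.inv(Q)))   # True: Q is orthogonal
print(Q @ Lam @ Q.T)                        # [[2. 1.] [1. 2.]] = S again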

  • Symmetric Eigen Decomposition

    If S is a symmetric matrix:
    Theorem: there exists a (unique) eigen decomposition $S = Q \Lambda Q^{\mathsf{T}}$ where Q is orthogonal:
      $Q^{-1} = Q^{\mathsf{T}}$,
      the columns of Q are normalized eigenvectors,
      the columns are orthogonal,
      (everything is real).