Page 1

AAAI-2006 Tutorial, July 17, 2006

Language Independent Methods of Clustering Similar Contexts

(with applications)

Ted Pedersen

University of Minnesota, Duluth

[email protected]

http://www.d.umn.edu/~tpederse/SCTutorial.html

Page 2

Language Independent Methods

• Do not utilize syntactic information
  – No parsers, part-of-speech taggers, etc. required
• Do not utilize dictionaries or other manually created lexical resources
• Based on lexical features selected from corpora
  – Assumption: word segmentation can be done by looking for white space between strings
• No manually annotated data of any kind; the methods are completely unsupervised in the strictest sense

Page 3

Clustering Similar Contexts

• A context is a short unit of text
  – Often a phrase to a paragraph in length, although it can be longer
• Input: N contexts
• Output: K clusters
  – Each member of a cluster is more similar to the other contexts in that cluster than to the contexts found in other clusters

Page 4

Applications

• Headed contexts (contain a target word)
  – Name Discrimination
  – Word Sense Discrimination
• Headless contexts
  – Email Organization
  – Document Clustering
  – Paraphrase Identification

• Clustering Sets of Related Words

Page 5

Tutorial Outline

• Identifying lexical features
  – Measures of association & tests of significance
• Context representations
  – First & second order
• Dimensionality reduction
  – Singular Value Decomposition
• Clustering
  – Partitional techniques
  – Cluster stopping
  – Cluster labeling

• Evaluation

Page 6

SenseClusters

• A package for clustering contexts
  – http://senseclusters.sourceforge.net
  – SenseClusters Live! (Knoppix CD)
• Integrates with various other tools
  – Ngram Statistics Package
  – CLUTO
  – SVDPACKC

Page 7

Many thanks…

• Amruta Purandare (M.S., 2004)
  – Founding developer of SenseClusters (2002-2004)
  – Now a PhD student in Intelligent Systems at the University of Pittsburgh: http://www.cs.pitt.edu/~amruta/
• Anagha Kulkarni (M.S., 2006, expected)
  – Enhancing SenseClusters since Fall 2004!
  – Will start as a PhD student at CMU/LTI in Fall 2006: http://www.d.umn.edu/~kulka020/
• NSF, for supporting Amruta, Anagha, and Ted via CAREER award #0092784

Page 8

Background and Motivations

Page 9

Headed and Headless Contexts

• A headed context includes a target word
  – Our goal is to cluster the target words based on their surrounding contexts
  – The target word is the center of the context and of our attention
• A headless context has no target word
  – Our goal is to cluster the contexts based on their similarity to each other
  – The focus is on the context as a whole

Page 10

Headed Contexts (input)

• I can hear the ocean in that shell.

• My operating system shell is bash.

• The shells on the shore are lovely.

• The shell command line is flexible.

• The oyster shell is very hard and black.

Page 11

Headed Contexts (output)

• Cluster 1:
  – My operating system shell is bash.
  – The shell command line is flexible.
• Cluster 2:
  – The shells on the shore are lovely.
  – The oyster shell is very hard and black.
  – I can hear the ocean in that shell.

Page 12

Headless Contexts (input)

• The new version of Linux is more stable and has better support for cameras.

• My Chevy Malibu has had some front end troubles.

• Osborne made one of the first personal computers.

• The brakes went out, and the car flew into the house.

• With the price of gasoline, I think I’ll be taking the bus more often!

Page 13

Headless Contexts (output)

• Cluster 1:
  – The new version of Linux is more stable and has better support for cameras.
  – Osborne made one of the first personal computers.
• Cluster 2:
  – My Chevy Malibu has had some front end troubles.
  – The brakes went out, and the car flew into the house.
  – With the price of gasoline, I think I’ll be taking the bus more often!

Page 14

Web Search as Application

• Web search results are headed contexts
  – The search term is the target word (found in snippets)
• Web search results are often disorganized
  – Two people sharing the same name, two organizations sharing the same abbreviation, etc. often have their pages “mixed up”
• If you click on search results or follow links in the pages found, you will encounter headless contexts too…

Page 15

Email Foldering as Application

• Email (public or private) is made up of headless contexts
  – Short, usually focused…
• Cluster similar email messages together
  – Automatic email foldering
  – Take all messages from a sent-mail file or inbox and organize them into categories

Page 16

Clustering News as Application

• News articles are headless contexts
  – Entire article or first paragraph
  – Short, usually focused

• Cluster similar articles together

Page 17

What is it to be “similar”?

• You shall know a word by the company it keeps
  – Firth, 1957 (Studies in Linguistic Analysis)
• Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis)
  – Harris, 1968 (Mathematical Structures of Language)
• Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis)
  – Miller and Charles, 1991 (Language and Cognitive Processes)
• Various extensions…
  – Similar contexts will have similar meanings, etc.
  – Names that occur in similar contexts will refer to the same underlying person, etc.

Page 18

General Methodology

• Represent the contexts to be clustered using first or second order feature vectors
  – Lexical features
• Reduce dimensionality to make the vectors more tractable and/or understandable
  – Singular value decomposition
• Cluster the context vectors
  – Find the number of clusters
  – Label the clusters

• Evaluate and/or use the contexts!

Page 19

Identifying Lexical Features

Measures of Association and Tests of Significance

Page 20

What are features?

• Features represent the (hopefully) salient characteristics of the contexts to be clustered

• Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features

• Vectors/contexts that include many of the same features will be similar to each other

Page 21

Where do features come from?

• In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered
  – This is not cheating, since the data to be clustered does not have any labeled classes that could be used to assist feature selection
  – It may also be necessary, since we may need to cluster all available data, and not hold out some for a separate feature identification step
    • Email or news articles

Page 22

Feature Selection

• “Test” data – the contexts to be clustered
  – Assume that the feature selection data is the same as the test data, unless otherwise indicated
• “Training” data – a separate corpus of held-out feature selection data (that will not be clustered)
  – May be needed if you have a small number of contexts to cluster (e.g., web search results)
  – This sense of “training” is due to Schütze (1998)

Page 23

Lexical Features

• Unigram – a single word that occurs more than a given number of times
• Bigram – an ordered pair of words that occur together more often than expected by chance
  – Consecutive, or may have intervening words
• Co-occurrence – an unordered bigram
• Target Co-occurrence – a co-occurrence where one of the words is the target word

Page 24

Bigrams

• fine wine (window size of 2)
• baseball bat
• house of representatives (window size of 3)
• president of the republic (window size of 4)
• apple orchard

• Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?)

Page 25

Co-occurrences

• tropics water
• boat fish
• law president
• train travel

• Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations

Page 26

Bigrams and Co-occurrences

• Pairs of words tend to be much less ambiguous than unigrams
  – “bank” versus “river bank” and “bank card”
  – “dot” versus “dot com” and “dot product”
• Trigrams and beyond occur much less frequently (Ngrams are very Zipfian)

• Unigrams are noisy, but bountiful

Page 27

“occur together more often than expected by chance…”

• Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix
  – Throw out bigrams that include one or two stop words
• Expected values are calculated, based on the model of independence and the observed values
  – How often would you expect these words to occur together, if they only occurred together by chance?
  – If two words occur “significantly” more often than the expected value, then the words do not occur together by chance.

Page 28

2x2 Contingency Table

Each cell shows the observed count, with the expected count under independence in parentheses.

                Intelligence          !Intelligence              Totals
Artificial      100.0  (1.2)             300.0  (398.8)             400
!Artificial     200.0  (298.8)        99,400.0  (99,301.2)       99,600
Totals          300                   99,700                    100,000

Page 29

Measures of Association

G^2 = 2 \sum_{i,j=1}^{2} observed(w_i, w_j) \log \frac{observed(w_i, w_j)}{expected(w_i, w_j)}

X^2 = \sum_{i,j=1}^{2} \frac{[observed(w_i, w_j) - expected(w_i, w_j)]^2}{expected(w_i, w_j)}
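As a concrete illustration of these formulas, here is a minimal Python sketch that computes the expected values under independence and the resulting G^2 and X^2 scores from a 2x2 contingency table. It is not the NSP implementation (NSP is a Perl package); the function name and layout are illustrative assumptions.

    import math

    def association_scores(n11, n12, n21, n22):
        """n11 = count(w1 w2), n12 = count(w1 !w2), n21 = count(!w1 w2), n22 = count(!w1 !w2)."""
        n1p, n2p = n11 + n12, n21 + n22            # row totals
        np1, np2 = n11 + n21, n12 + n22            # column totals
        total = n1p + n2p
        observed = [n11, n12, n21, n22]
        expected = [n1p * np1 / total, n1p * np2 / total,
                    n2p * np1 / total, n2p * np2 / total]
        # X^2: squared difference between observed and expected, scaled by expected
        x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        # G^2: observed times log(observed/expected), summed and doubled (0 * log 0 = 0)
        g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
        return g2, x2

    # Observed counts from the Artificial / Intelligence table above (total = 100,000)
    g2, x2 = association_scores(100, 300, 200, 99400)
    print(round(g2, 1), round(x2, 1))   # both far above 3.841, so reject independence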

Page 30

Interpreting the Scores…

• G^2 and X^2 are asymptotically approximated by the chi-squared distribution…
• This means… if you fix the marginal totals of a table, randomly generate internal cell values, calculate the G^2 or X^2 score for each resulting table, and plot the distribution of the scores, you *should* get the chi-squared distribution (with 1 degree of freedom for a 2x2 table)

Page 31

Interpreting the Scores…

• Values above a certain level of significance can be considered grounds for rejecting the null hypothesis
  – H0: the words in the bigram are independent
  – 3.841 is associated with 95% confidence that the null hypothesis should be rejected

Page 32

Measures of Association

• There are numerous measures of association that can be used to identify bigram and co-occurrence features
• Many of these are supported in the Ngram Statistics Package (NSP)
  – http://www.d.umn.edu/~tpederse/nsp.html

Page 33

Summary

• Identify lexical features based on frequency counts or measures of association – either in the data to be clustered or in a separate set of feature selection data
  – Language independent
• Unigrams are usually selected only by frequency
  – Remember, there is no labeled data from which to learn, so they are somewhat less effective as features than in the supervised case
• Bigrams and co-occurrences can also be selected by frequency, or better yet by measures of association
  – Bigrams and co-occurrences need not be consecutive
  – Stop words should be eliminated
  – Frequency thresholds are helpful (e.g., a unigram/bigram that occurs once may be too rare to be useful)

Page 34

Context Representations

First and Second Order Methods

Page 35

Once features are selected…

• We have a set of unigrams, bigrams, co-occurrences or target co-occurrences
  – We believe/hope that these are descriptive of the contexts
  – We also have the frequency and measure of association scores that were used in their selection
• Convert the contexts to be clustered into a vector representation based on these features

Page 36

First Order Representation

• Each context is represented by a vector with M dimensions, each of which indicates whether or not a particular feature occurred in that context
  – Value may be binary, a frequency count, or an association score
• Context by Feature representation

Page 37

Contexts

• Cxt1: There was an island curse of black magic cast by that voodoo child.

• Cxt2: Harold, a known voodoo child, was gifted in the arts of black magic.

• Cxt3: Despite their military might, it was a serious error to attack.

• Cxt4: Military might is no defense against a voodoo child or an island curse.

Page 38

Unigram Feature Set

• island 1000
• black 700
• curse 500
• magic 400
• child 200

• (assume these are frequency counts obtained from some corpus…)

Page 39

First Order Vectors of Unigrams

island black curse magic child

Cxt1 1 1 1 1 1

Cxt2 0 1 0 1 1

Cxt3 0 0 0 0 0

Cxt4 1 0 1 0 1
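A minimal Python sketch of how such a first order context-by-feature matrix can be built, using binary values and the unigram features above; the whitespace-based tokenization and function name are illustrative assumptions (SenseClusters itself is a Perl package).

    features = ["island", "black", "curse", "magic", "child"]

    contexts = [
        "There was an island curse of black magic cast by that voodoo child.",
        "Harold, a known voodoo child, was gifted in the arts of black magic.",
        "Despite their military might, it was a serious error to attack.",
        "Military might is no defense against a voodoo child or an island curse.",
    ]

    def first_order_vector(context, features):
        # segment by white space, strip punctuation, lowercase
        tokens = {t.strip(".,!?").lower() for t in context.split()}
        return [1 if f in tokens else 0 for f in features]

    for row in (first_order_vector(c, features) for c in contexts):
        print(row)   # reproduces the table above, e.g. Cxt3 is all zeros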

Page 40

Bigram Feature Set

• island curse 189.2
• black magic 123.5
• voodoo child 120.0
• military might 100.3
• serious error 89.2
• island child 73.2
• voodoo might 69.4
• military error 54.9
• black child 43.2
• serious curse 21.2

• (assume these are log-likelihood scores based on frequency counts from some corpus)

Page 41

First Order Vectors of Bigrams

       black magic   island curse   military might   serious error   voodoo child
Cxt1        1             1               0                0               1
Cxt2        1             0               0                0               1
Cxt3        0             0               1                1               0
Cxt4        0             1               1                0               1

Page 42

First Order Vectors

• Can have binary values or weights associated with frequency, etc.
• Forms a context by feature matrix
• May optionally be smoothed/reduced with Singular Value Decomposition
  – More on that later…
• The contexts are ready for clustering…
  – More on that later…

Page 43

Second Order Features

• First order features encode the occurrence of a feature in a context
  – Feature occurrence represented by a binary value
• Second order features encode something ‘extra’ about a feature that occurs in a context
  – Feature occurrence represented by word co-occurrences
  – Feature occurrence represented by context occurrences

Page 44

Second Order Representation

• First, build a word by word matrix from the features
  – Based on bigrams or co-occurrences
  – First word is the row, second word is the column, cell is the score
  – (optionally) reduce dimensionality with SVD
  – Each row forms a vector of first order co-occurrences
• Second, replace each word in a context with its row/vector as found in the word by word matrix
• Average all the word vectors in the context to create the second order representation
  – Due to Schütze (1998), related to LSI/LSA
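A minimal Python sketch of this averaging step, assuming a small word by word matrix like the one on the next slide; the dictionary of word vectors and the tokenization are illustrative assumptions.

    import numpy as np

    columns = ["magic", "curse", "might", "error", "child"]
    word_vectors = {                       # rows of a word by word matrix (first order co-occurrences)
        "black":    np.array([123.5,   0.0,   0.0,  0.0,  43.2]),
        "island":   np.array([  0.0, 189.2,   0.0,  0.0,  73.2]),
        "military": np.array([  0.0,   0.0, 100.3, 54.9,   0.0]),
        "serious":  np.array([  0.0,  21.2,   0.0, 89.2,   0.0]),
        "voodoo":   np.array([  0.0,   0.0,  69.4,  0.0, 120.0]),
    }

    def second_order(context):
        tokens = [t.strip(".,").lower() for t in context.split()]
        rows = [word_vectors[t] for t in tokens if t in word_vectors]
        if not rows:                       # no known words: zero vector
            return np.zeros(len(columns))
        return np.mean(rows, axis=0)       # average the word vectors found in the context

    cxt1 = "There was an island curse of black magic cast by that voodoo child."
    print(dict(zip(columns, second_order(cxt1).round(1))))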

Page 45

Word by Word Matrix

magic curse might error child

black 123.5 0 0 0 43.2

island 0 189.2 0 0 73.2

military 0 0 100.3 54.9 0

serious 0 21.2 0 89.2 0

voodoo 0 0 69.4 0 120.0

Page 46

Word by Word Matrix

• …can also be used to identify sets of related words
• In the case of bigrams, rows represent the first word in a bigram and columns represent the second word
  – Matrix is asymmetric
• In the case of co-occurrences, rows and columns are equivalent
  – Matrix is symmetric
• The vector (row) for each word represents a set of first order features for that word
• Each word in a context to be clustered, for which a vector exists in the word by word matrix, is replaced by that vector in that context

Page 47

There was an island curse of black magic cast by that voodoo child.

magic curse might error child

black 123.5 0 0 0 43.2

island 0 189.2 0 0 73.2

voodoo 0 0 69.4 0 120.0

Page 48

Second Order Co-Occurrences

• The word vectors for “black” and “island” show similarity, as both occur with “child”
• “black” and “island” are second order co-occurrences of each other, since both occur with “child” but not with each other (i.e., “black island” is not observed)

Page 49

Second Order Representation

• There was an [curse, child] curse of [magic, child] magic cast by that [might, child] child

• [curse, child] + [magic, child] + [might, child]

Page 50

There was an island curse of black magic cast by that voodoo child.

magic curse might error child

Cxt1 41.2 63.1 24.4 0 78.8

Page 51

Second Order Representation

• Results in a Context by Feature (Word) Representation

• Cell values do not indicate if feature occurred in context. Rather, they show the strength of association of that feature with other words that occur with a word in the context.

Page 52

Summary

• First order representations are intuitive, but…
  – Can suffer from sparsity
  – Contexts are represented based only on the features that occur in those contexts
• Second order representations are harder to visualize, but…
  – Allow a word to be represented by the words it co-occurs with (i.e., the company it keeps)
  – Allow a context to be represented by the words that occur with the words in the context
  – Help combat sparsity…

Page 53

Related Work

• Pedersen and Bruce 1997 (EMNLP) presented a first order method of discrimination
  http://acl.ldc.upenn.edu/W/W97/W97-0322.pdf
• Schütze 1998 (Computational Linguistics) introduced the second order method
  http://acl.ldc.upenn.edu/J/J98/J98-1004.pdf
• Purandare and Pedersen 2004 (CoNLL) compared first and second order methods
  http://acl.ldc.upenn.edu/hlt-naacl2004/conll04/pdf/purandare.pdf
  – First order better if you have lots of data
  – Second order better with smaller amounts of data

Page 54

Dimensionality Reduction

Singular Value Decomposition

Page 55

Effect of SVD

• SVD reduces a matrix to a given number of dimensions. This may convert a word level space into a semantic or conceptual space
  – If “dog”, “collie”, and “wolf” are dimensions/columns in a word co-occurrence matrix, after SVD they may be reduced to a single dimension that represents “canines”

Page 56

Effect of SVD

• The dimensions of the matrix after SVD are principal components that represent the meaning of concepts
  – Similar columns are grouped together
• SVD is a way of smoothing a very sparse matrix, so that there are very few zero valued cells after SVD

Page 57

How can SVD be used?

• SVD on first order contexts will reduce a context by feature representation down to a smaller number of features
  – Latent Semantic Analysis typically performs SVD on a feature by context representation, where the contexts are reduced
• SVD is also used in creating second order context representations
  – Reduce the word by word matrix
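A minimal Python sketch of the reduction itself, keeping only the top k singular values of a word by word matrix. SenseClusters uses SVDPACKC for this; numpy is assumed here purely for illustration.

    import numpy as np

    def svd_reduce(matrix, k):
        """Rank-k reconstruction A_k = U_k D_k V_k' of a (word by word) matrix."""
        U, d, Vt = np.linalg.svd(matrix, full_matrices=False)
        return U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]

    # three rows of the word by word matrix on the next slide (pc, body, disk)
    A = np.array([
        [2, 0, 0, 1, 3, 1, 0, 0, 0, 0, 0],
        [0, 3, 0, 0, 0, 0, 2, 0, 0, 2, 1],
        [1, 0, 0, 2, 0, 3, 0, 1, 2, 0, 0],
    ], dtype=float)

    print(svd_reduce(A, k=2).round(2))   # smoothed: few cells remain exactly zero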

Page 58

Word by Word Matrix

      apple blood cells ibm data box tissue graphics memory organ plasma

pc 2 0 0 1 3 1 0 0 0 0 0

body 0 3 0 0 0 0 2 0 0 2 1

disk 1 0 0 2 0 3 0 1 2 0 0

petri 0 2 1 0 0 0 2 0 1 0 1

lab 0 0 3 0 2 0 2 0 2 1 3

sales 0 0 0 2 3 0 0 1 2 0 0

linux 2 0 0 1 3 2 0 1 1 0 0

debt 0 0 0 2 3 4 0 2 0 0 0

Page 59

Singular Value Decomposition

A = U D V’

Page 60

Word by Word Matrix After SVD

apple blood cells ibm data tissue graphics memory organ plasma

pc .73 .00 .11 1.3 2.0 .01 .86 .77 .00 .09

body .00 1.2 1.3 .00 .33 1.6 .00 .85 .84 1.5

disk .76 .00 .01 1.3 2.1 .00 .91 .72 .00 .00

germ .00 1.1 1.2 .00 .49 1.5 .00 .86 .77 1.4

lab .21 1.7 2.0 .35 1.7 2.5 .18 1.7 1.2 2.3

sales .73 .15 .39 1.3 2.2 .35 .85 .98 .17 .41

linux .96 .00 .16 1.7 2.7 .03 1.1 1.0 .00 .13

debt 1.2 .00 .00 2.1 3.2 .00 1.5 1.1 .00 .00

Page 61

Second Order Representation

• I got a new disk today!
• What do you think of linux?

       apple blood cells ibm data tissue graphics memory organ plasma
disk    .76   .00  .01  1.3  2.1  .00    .91     .72   .00   .00
linux   .96   .00  .16  1.7  2.7  .03    1.1     1.0   .00   .13

• These two contexts share no words in common, yet they are similar! “disk” and “linux” both occur with “Apple”, “IBM”, “data”, “graphics”, and “memory”
• The two contexts are similar because they share many second order co-occurrences

Page 62

Relationship to LSA

• Latent Semantic Analysis uses a feature by context first order representation
  – Indicates all the contexts in which a feature occurs
  – Uses SVD to reduce dimensions (contexts)
  – Clusters features based on the similarity of the contexts in which they occur
  – Represents sentences using an average of feature vectors

Page 63

Feature by Context Representation

Cxt1 Cxt2 Cxt3 Cxt4

black magic 1 1 0 1

island curse 1 0 0 1

military might 0 0 1 0

serious error 0 0 1 0

voodoo child 1 1 0 1

Page 64

References

• Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, vol. 41, 1990
• Landauer, T. and Dumais, S., A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge, Psychological Review, vol. 104, 1997
• Schütze, H., Automatic Word Sense Discrimination, Computational Linguistics, vol. 24, 1998
• Berry, M.W., Drmac, Z., and Jessup, E.R., Matrices, Vector Spaces, and Information Retrieval, SIAM Review, vol. 41, 1999

Page 65

Clustering

Partitional Methods

Cluster Stopping

Cluster Labeling

Page 66

Many many methods…

• Cluto supports a wide range of different clustering methods
  – Agglomerative
    • Average, single, complete link…
  – Partitional
    • K-means (Direct)
  – Hybrid
    • Repeated bisections
• SenseClusters integrates with Cluto
  – http://www-users.cs.umn.edu/~karypis/cluto/

Page 67

General Methodology

• Represent the contexts to be clustered as first or second order vectors
• Cluster the context vectors directly
  – vcluster
• … or convert to a similarity matrix and then cluster
  – scluster

Page 68

Partitional Methods

• Randomly create centroids equal to the number of clusters you wish to find
• Assign each context to the nearest centroid
• After all contexts are assigned, re-compute the centroids
  – “best” location decided by criterion function
• Repeat until stable clusters are found (see the sketch below)
  – Centroids don’t shift from iteration to iteration
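A minimal Python sketch of this assign/re-compute loop (a plain k-means over context vectors). Cluto's vcluster is far more sophisticated; the Euclidean distance and random choice of initial centroids here are illustrative assumptions.

    import numpy as np

    def kmeans(vectors, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # randomly pick k of the contexts as the initial centroids
        centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
        for _ in range(iters):
            # assign each context to its nearest centroid
            dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # re-compute the centroids; stop when they no longer shift
            new_centroids = np.array([
                vectors[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids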

Page 69

Partitional Methods

• Advantages: fast
• Disadvantages
  – Results can be dependent on the initial placement of centroids
  – Must specify the number of clusters ahead of time
    • maybe not…

Page 70

Partitional Criterion Functions

• Intra-Cluster (Internal) similarity/distance
  – How close together are the members of a cluster?
  – Closer together is better
• Inter-Cluster (External) similarity/distance
  – How far apart are the different clusters?
  – Further apart is better

Page 71

Intra Cluster Similarity

• Ball of String (I1)
  – How far is each member from each other member?
• Flower (I2)
  – How far is each member of the cluster from the centroid?

Page 72

Contexts to be Clustered

Page 73

Ball of String (I1 Internal Criterion Function)

Page 74

Flower (I2 Internal Criterion Function)

Page 75

Inter Cluster Similarity

• The Fan (E1)
  – How far is each centroid from the centroid of the entire collection of contexts?
  – Maximize that distance

Page 76

The Fan (E1 External Criterion Function)

Page 77

Hybrid Criterion Functions

• Balance internal and external similarity
  – H1 = I1/E1
  – H2 = I2/E1

• Want internal similarity to increase, while external similarity decreases

• Want internal distances to decrease, while external distances increase
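A rough Python sketch of the intuition behind the internal (I2, "flower") and external (E1, "fan") criteria and their H2-style ratio, using cosine similarity to centroids. These are simplified stand-ins for illustration, not Cluto's exact formulas.

    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def criterion_scores(vectors, labels):
        overall = vectors.mean(axis=0)                   # centroid of the entire collection
        internal, external = 0.0, 0.0
        for j in np.unique(labels):
            members = vectors[labels == j]
            centroid = members.mean(axis=0)
            # internal: how close members sit to their own centroid (I2-like)
            internal += sum(cos(m, centroid) for m in members)
            # external: how close the cluster centroid sits to the collection centroid (E1-like)
            external += len(members) * cos(centroid, overall)
        return internal, external, internal / external   # H2-like ratio: larger is better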

Page 78

Cluster Stopping

Page 79

Cluster Stopping

• Many Clustering Algorithms require that the user specify the number of clusters prior to clustering

• But, the user often doesn’t know the number of clusters, and in fact finding that out might be the goal of clustering

Page 80

Criterion Functions Can Help

• Run the partitional algorithm for k = 1 to deltaK
  – deltaK is a user-estimated or automatically determined upper bound on the number of clusters

• Find the value of k at which the criterion function does not significantly increase at k+1

• Clustering can stop at this value, since no further improvement in solution is apparent with additional clusters (increases in k)

Page 81

H2 versus k
T. Blair – V. Putin – S. Hussein

Page 82

PK2

• Based on Hartigan, 1975
• When the ratio approaches 1, clustering is at a plateau
• Select the value of k which is closest to but outside of the standard deviation interval

PK2(k) = \frac{H2(k)}{H2(k-1)}

Page 83

PK2 predicts 3 senses
T. Blair – V. Putin – S. Hussein

Page 84

PK3

• Related to Salvador and Chan, 2004
• Inspired by the Dice Coefficient
• Values close to 1 mean clustering is improving…
• Select the value of k which is closest to but outside of the standard deviation interval

PK3(k) = \frac{2 \cdot H2(k)}{H2(k-1) + H2(k+1)}
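A minimal Python sketch of the two stopping measures, assuming h2 is a mapping from k to the H2 criterion value obtained when clustering with k clusters; the selection helper is one reading of the "closest to but outside of the standard deviation interval" rule.

    import statistics

    def pk2(h2, k):
        # PK2(k) = H2(k) / H2(k-1): values near 1 mean the solution has plateaued
        return h2[k] / h2[k - 1]

    def pk3(h2, k):
        # PK3(k) = 2 * H2(k) / (H2(k-1) + H2(k+1)): Dice-like ratio
        return 2 * h2[k] / (h2[k - 1] + h2[k + 1])

    def select_k(scores):
        """scores: mapping from k to PK2(k) or PK3(k); returns the chosen k (or None)."""
        values = list(scores.values())
        mean, std = statistics.mean(values), statistics.stdev(values)
        outside = {k: v for k, v in scores.items() if abs(v - mean) > std}
        # among the values outside the interval, take the one closest to its boundary
        return min(outside, key=lambda k: abs(scores[k] - mean) - std) if outside else None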

Page 85

PK3 predicts 3 senses
T. Blair – V. Putin – S. Hussein

Page 86

References

• Hartigan, J., Clustering Algorithms, Wiley, 1975
  – Basis for SenseClusters stopping method PK2
• Mojena, R., Hierarchical Grouping Methods and Stopping Rules: An Evaluation, The Computer Journal, vol. 20, 1977
  – Basis for SenseClusters stopping method PK1
• Milligan, G. and Cooper, M., An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, vol. 50, 1985
  – Very extensive comparison of cluster stopping methods
• Tibshirani, R., Walther, G., and Hastie, T., Estimating the Number of Clusters in a Dataset via the Gap Statistic, Journal of the Royal Statistical Society (Series B), 2001
• Pedersen, T. and Kulkarni, A., Selecting the "Right" Number of Senses Based on Clustering Criterion Functions, Proceedings of the Posters and Demo Program of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, 2006
  – Describes the SenseClusters stopping methods

Page 87

Cluster Labeling

Page 88

Cluster Labeling

• Once a cluster is discovered, how can you generate a description of the contexts of that cluster automatically?

• In the case of contexts, you might be able to identify significant lexical features from the contents of the clusters, and use those as a preliminary label

Page 89

Results of Clustering

• Each cluster consists of some number of contexts
• Each context is a short unit of text
• Apply measures of association to the contents of each cluster to determine the N most significant bigrams
• Use those bigrams as a label for the cluster

Page 90

Label Types

• The N most significant bigrams for each cluster will act as a descriptive label

• The M most significant bigrams that are unique to each cluster will act as a discriminating label
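A minimal Python sketch of both label types, ranking bigrams inside each cluster by frequency; the tutorial itself ranks by a measure of association (the G^2 sketch earlier could be substituted), and the stop word list and function names here are illustrative assumptions.

    from collections import Counter

    STOP = {"the", "a", "an", "of", "and", "in", "is", "to", "that"}

    def descriptive_label(cluster_contexts, n=5):
        """The N most significant (here: most frequent) bigrams in one cluster."""
        bigrams = Counter()
        for context in cluster_contexts:
            tokens = [t.strip(".,!?").lower() for t in context.split()]
            bigrams.update((w1, w2) for w1, w2 in zip(tokens, tokens[1:])
                           if w1 not in STOP and w2 not in STOP)
        return [" ".join(b) for b, _ in bigrams.most_common(n)]

    def discriminating_label(labels_by_cluster, cluster_id):
        """Keep only the bigrams that appear in no other cluster's descriptive label."""
        others = {b for cid, lab in labels_by_cluster.items() if cid != cluster_id for b in lab}
        return [b for b in labels_by_cluster[cluster_id] if b not in others]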

Page 91

Evaluation Techniques

Comparison to gold standard data

Page 92

Evaluation

• If sense-tagged text is available, it can be used for evaluation
  – But don’t use the sense tags for clustering or feature selection!
• Assume that sense tags represent the “true” clusters, and compare these to the discovered clusters
  – Find the mapping of clusters to senses that attains maximum accuracy

Page 93

Evaluation

• Pseudo-words are especially useful, since it is hard to find data that is discriminated
  – Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate.
  – http://www.d.umn.edu/~tpederse/tools.html
• Baseline Algorithm
  – Group all instances into one cluster; this will reach “accuracy” equal to the majority classifier

Page 94

Evaluation

• Pseudo-words are especially useful, since it is hard to find data that is discriminated
  – Pick two or more words or names from a corpus, and conflate them into one name. Then see how well you can discriminate.
  – http://www.d.umn.edu/~kulka020/kanaghaName.html
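A minimal Python sketch of building pseudo-word data: record each occurrence's original word as the "true" sense, then conflate the targets into a single token and cluster the result. The regular expression and pseudo-word token are illustrative assumptions.

    import re

    def conflate(contexts, targets, pseudo="targetword"):
        """Returns (conflated_contexts, gold_senses) for contexts containing a target."""
        pattern = re.compile(r"\b(" + "|".join(map(re.escape, targets)) + r")\b", re.IGNORECASE)
        conflated, gold = [], []
        for c in contexts:
            m = pattern.search(c)
            if not m:
                continue                      # keep only contexts that contain a target
            gold.append(m.group(1).lower())   # the original word acts as the sense tag
            conflated.append(pattern.sub(pseudo, c))
        return conflated, gold

    texts = ["The banana was ripe.", "The door was ajar.", "He ate a banana."]
    print(conflate(texts, ["banana", "door"]))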

Page 95

Baseline Algorithm

• Baseline Algorithm
  – Group all instances into one cluster; this will reach “accuracy” equal to the majority classifier

• What if the clustering said everything should be in the same cluster?

Page 96

Baseline Performance

S1 S2 S3 Totals

C1 0 0 0 0

C2 0 0 0 0

C3 80 35 55 170

Totals 80 35 55 170

S3 S2 S1 Totals

C1 0 0 0 0

C2 0 0 0 0

C3 55 35 80 170

Totals 55 35 80 170

(0+0+55)/170 = .32 if C3 is labeled S3; (0+0+80)/170 = .47 if C3 is labeled S1 (the majority sense)

Page 97

Evaluation

• Suppose that C1 is labeled S1, C2 as S2, and C3 as S3
• Accuracy = (10 + 0 + 10) / 170 = 12%
• The diagonal shows how many members of the cluster actually belong to the sense given on the column
• Can the “columns” be rearranged to improve the overall accuracy?
  – Optimally assign clusters to senses

S1 S2 S3 Totals

C1 10 30 5 45

C2 20 0 40 60

C3 50 5 10 65

Totals 80 35 55 170

Page 98

Evaluation

• The assignment of C1 to S2, C2 to S3, and C3 to S1 results in 120/170 = 71%

• Find the ordering of the columns in the matrix that maximizes the sum of the diagonal.

• This is an instance of the Assignment Problem from Operations Research, or finding the Maximal Matching of a Bipartite Graph from Graph Theory (see the sketch after the table below).

S2 S3 S1 Totals

C1 30 5 10 45

C2 0 40 20 60

C3 5 10 50 65

Totals 35 55 80 170
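A minimal Python sketch of finding that optimal mapping by treating it as the assignment problem; it assumes scipy is available for the Hungarian-algorithm solver (a brute-force search over column permutations also works for small tables).

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    confusion = np.array([
        [10, 30,  5],    # C1 counts under S1, S2, S3
        [20,  0, 40],    # C2
        [50,  5, 10],    # C3
    ])

    rows, cols = linear_sum_assignment(confusion, maximize=True)   # maximize the diagonal sum
    accuracy = confusion[rows, cols].sum() / confusion.sum()
    print(list(zip(rows, cols)), accuracy)   # C1->S2, C2->S3, C3->S1; 120/170 ~= 0.71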

Page 99

Alternatives?

• Unsupervised methods may not discover clusters equivalent to the classes learned in supervised learning
• Evaluation based on assuming that sense tags represent the “true” clusters is likely a bit harsh. Alternatives?
  – Humans could look at the members of each cluster and determine the nature of the relationship or meaning that they all share
  – Use the contents of the cluster to generate a descriptive label that could be inspected by a human

Page 100

Thank you!

• Questions or comments on the tutorial or SenseClusters are welcome at any time: [email protected]
• SenseClusters is freely available via Live CD, the Web, and in source code form: http://senseclusters.sourceforge.net
• SenseClusters papers are available at: http://www.d.umn.edu/~tpederse/senseclusters-pubs.html