Three Approaches to Unsupervised WSD


Page 1: Three Approaches to Unsupervised WSD

Three Approaches to Unsupervised WSD

Dmitriy Dligach

Page 2: Three Approaches to Unsupervised WSD

Unsupervised WSD

• No training corpora needed
• No predefined tag set needed
• Three approaches
  – Context-group Discrimination (Schutze, 1998)
  – Graph-based Algorithms (Agirre et al., 2006)
    • HyperLex (Veronis, 2004)
    • PageRank (Brin and Page, 1998)
  – Predominant Sense (McCarthy, 2006)
    • Thesaurus generation
      – Method in (Lin, 1998)
      – Earlier version in (Hindle, 1990)

Page 3: Three Approaches to Unsupervised WSD

Context-group Discrimination Algorithm

• Sense representations
  – Generate word vectors
  – Generate context vectors (from the co-occurrence matrix)
  – Generate sense vectors (by clustering context vectors)
• Disambiguate by computing the proximity of a context vector to the sense vectors

Page 4: Three Approaches to Unsupervised WSD

Word Vectors

• [Figure: the word vector wi, with one dimension per feature word w0, w1, …, wn]
• Two strategies to select dimensions
  – Local: select words from the contexts of the ambiguous word within a 50-word window (sketched in code below)
    • Either the 1,000 most frequent words, or
    • Use the χ2 (chi-square) measure of dependence to pick 1,000 words
  – Global: select from the entire corpus regardless of the target word
    • Select the 20,000 most frequent words as features
    • 2,000 as dimensions
    • 20,000-by-2,000 co-occurrence matrix
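The local strategy can be made concrete with a short sketch (not Schutze's own code): given a tokenized corpus and a list of feature words already chosen by frequency or the χ2 test, count how often each feature word occurs near the target. The function and parameter names are illustrative, and a ±25-token window is just one way to realize a 50-word window.

def word_vector(target, corpus_tokens, feature_words, window=25):
    """Count co-occurrences of `target` with each feature word inside
    a +/- `window` token window (roughly a 50-word window)."""
    feature_index = {w: i for i, w in enumerate(feature_words)}
    vec = [0] * len(feature_words)
    for pos, tok in enumerate(corpus_tokens):
        if tok != target:
            continue
        lo, hi = max(0, pos - window), min(len(corpus_tokens), pos + window + 1)
        for neighbor in corpus_tokens[lo:pos] + corpus_tokens[pos + 1:hi]:
            j = feature_index.get(neighbor)
            if j is not None:
                vec[j] += 1
    return vec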

Page 5: Three Approaches to Unsupervised WSD

Context Vectors

[Figure: the word vector for "suit", with dimensions w0, w1, …, wj = lawsuit, …, wn = garment]

• This word-vector representation conflates senses
• Represent a context as the centroid of the word vectors of the words in it (see the sketch below)
• The vectors are IDF-weighted
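A minimal sketch of the centroid computation, assuming a precomputed word_vectors lookup and an idf table (both names are illustrative, and the exact weighting scheme in the paper may differ in detail):

def context_vector(context_tokens, word_vectors, idf):
    """Centroid of the IDF-weighted word vectors of the tokens in the context."""
    dims = len(next(iter(word_vectors.values())))
    centroid = [0.0] * dims
    n = 0
    for tok in context_tokens:
        if tok in word_vectors:
            weight = idf.get(tok, 1.0)   # IDF gives very common words a low weight
            for i, x in enumerate(word_vectors[tok]):
                centroid[i] += weight * x
            n += 1
    return [x / n for x in centroid] if n else centroid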

Page 6: Three Approaches to Unsupervised WSD

Sense Vectors

• Cluster the approx. 2,000 context vectors
• Use a combination of group-average agglomerative clustering (GAAC) and EM (a rough analogue is sketched below)
  – Choose a random sample of 50 of the ~2,000 context vectors and cluster it with GAAC, which is O(n2)
  – The centroids of the resulting clusters become the input to EM
  – The overall procedure is still linear
• Perform an SVD on the context vectors
  – Re-represent the context vectors by their values on the 100 principal dimensions
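A rough analogue of this pipeline using scikit-learn (this is not the paper's implementation; the sample size, number of senses, and the use of a Gaussian mixture as the EM step are illustrative choices):

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def induce_senses(context_vectors, n_senses=10, sample_size=50, seed=0):
    # Reduce the context vectors to 100 principal dimensions
    X = TruncatedSVD(n_components=100, random_state=seed).fit_transform(context_vectors)
    # Group-average agglomerative clustering on a small random sample
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    labels = AgglomerativeClustering(n_clusters=n_senses, linkage="average").fit_predict(sample)
    # Cluster centroids seed the EM step over all context vectors
    seeds = np.vstack([sample[labels == k].mean(axis=0) for k in range(n_senses)])
    gmm = GaussianMixture(n_components=n_senses, means_init=seeds, random_state=seed).fit(X)
    return gmm.means_, gmm.predict(X)   # sense vectors and a sense label per context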

Page 7: Three Approaches to Unsupervised WSD

Evaluation

• Hand-labeled corpus of 10 naturally ambiguous words and 10 artificial words (pseudo-words)
• Throw out low-frequency senses and keep only the 2 most frequent
• Number of clusters
  – 2 clusters: use the gold standard to evaluate
  – 10 clusters: no gold standard; use purity (see the sketch below)
• Sense-based IR
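Purity here is the standard clustering measure: each induced cluster is credited with its majority gold sense, and purity is the fraction of instances covered by those majority votes. A minimal sketch:

from collections import Counter, defaultdict

def purity(cluster_labels, gold_labels):
    """Fraction of instances that fall in their cluster's majority gold sense."""
    by_cluster = defaultdict(list)
    for c, g in zip(cluster_labels, gold_labels):
        by_cluster[c].append(g)
    majority_total = sum(Counter(golds).most_common(1)[0][1] for golds in by_cluster.values())
    return majority_total / len(gold_labels)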

Page 8: Three Approaches to Unsupervised WSD

Results (highlights)

• Overall performance for pseudo-words is higher than for naturally ambiguous words
• Some pseudo-words (wide range / consulting firm) and some real words (space in the area and volume senses) perform poorly because they are topically amorphous
• IR evaluation
  – Vector-space model with senses as dimensions
  – 7.4% improvement on the TREC-1 collection

Page 9: Three Approaches to Unsupervised WSD

Graph-based Algorithms

• Build a co-occurrence matrix
• View it as a graph
  – Small-world properties
    • Most nodes have few connections
    • A few nodes are highly connected
  – Look for densely populated regions
    • Known as high-density components
  – Map ambiguous instances to one of these regions

Page 10: Three Approaches to Unsupervised WSD

A Sample Co-Occurrence Graph

• barrage – dam, play-off, barrier, roadblock, police cordon, barricade

Page 11: Three Approaches to Unsupervised WSD

Algorithm Details

• Nodes correspond to words
• Edges reflect the degree of semantic association between words
  – Modeled with conditional probabilities:
  – wA,B = 1 – max[p(A|B), p(B|A)]
• Detect high-density components (sketched below)
  – Sort nodes by their degree
  – Take the top one (the root hub) and remove it along with all its neighbors (hoping to eliminate the entire component)
  – Iterate until all the high-density components are found
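The two pieces of this slide can be sketched as follows. This is a simplification: the frequency and weight thresholds the paper uses to prune the graph are omitted, and the stopping criterion for hub detection is reduced to "no connected nodes left".

def edge_weight(cooc_ab, count_a, count_b):
    """w_{A,B} = 1 - max[p(A|B), p(B|A)], with p(A|B) = cooc(A,B) / count(B)."""
    return 1.0 - max(cooc_ab / count_b, cooc_ab / count_a)

def root_hubs(adjacency):
    """`adjacency` maps each word to the set of words it is connected to.
    Repeatedly take the highest-degree node as a root hub and remove it
    together with all of its neighbors."""
    remaining = {w: set(nbrs) for w, nbrs in adjacency.items()}
    hubs = []
    while remaining:
        hub = max(remaining, key=lambda w: len(remaining[w]))
        if not remaining[hub]:          # only isolated nodes are left
            break
        hubs.append(hub)
        removed = remaining.pop(hub) | {hub}
        for w in list(remaining):
            if w in removed:
                remaining.pop(w)
            else:
                remaining[w] -= removed
    return hubs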

Page 12: Three Approaches to Unsupervised WSD

E.g.

Page 13: Three Approaches to Unsupervised WSD

Disambiguation

• Delineate high-density components
  – Need to attach them back to their root hubs
  – Attach the target word to all root hubs
  – Compute the minimum spanning tree (MST)
• Map the ambiguous instance to one of the components (see the sketch below)
  – Examine each word in its context
  – Compute the distance from each of these words to each root hub (each word sits under exactly one hub)
  – Compute the total score for each hub
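A hedged sketch of this mapping step using networkx. It assumes the weighted graph already contains the target word linked to every root hub; the decaying 1/(1 + distance) score is an illustrative choice, and the paper's exact weighting may differ.

import networkx as nx
from collections import defaultdict

def disambiguate(graph, target, hubs, context_words):
    tree = nx.minimum_spanning_tree(graph, weight="weight")
    scores = defaultdict(float)
    for word in context_words:
        if word not in tree or word == target:
            continue
        path = nx.shortest_path(tree, target, word, weight="weight")
        hub = path[1]                      # first step away from the target word
        if hub not in hubs:
            continue
        dist = nx.path_weight(tree, path[1:], weight="weight")
        scores[hub] += 1.0 / (1.0 + dist)  # closer context words count more
    return max(scores, key=scores.get) if scores else None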

Page 14: Three Approaches to Unsupervised WSD

PageRank

• Based on PageRank (Brin and Page, 1998), adapted for weighted graphs
• An alternative way to rank nodes
• Algorithm
  – Initialize the nodes to random values
  – Compute PageRank (formula below)
  – Iterate a fixed number of times

P(vi) = (1 – d) + d · Σ_{vj ∈ In(vi)} [ wij / Σ_{vk ∈ In(vj)} wjk ] · P(vj)
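A direct transcription of the update above into code (the damping factor and iteration count are typical illustrative values, not necessarily those used in the paper). The graph is taken to be undirected, so In(v) is simply the set of neighbors of v.

import random

def pagerank(graph, d=0.85, iterations=30, seed=0):
    """`graph[v]` maps each neighbor u of v to the edge weight w_{uv};
    every neighbor is assumed to appear as a key of `graph` as well."""
    rng = random.Random(seed)
    rank = {v: rng.random() for v in graph}                 # random initialization
    out_weight = {v: sum(graph[v].values()) for v in graph}
    for _ in range(iterations):                             # fixed number of iterations
        new_rank = {}
        for i in graph:
            incoming = sum(
                (graph[j][i] / out_weight[j]) * rank[j]
                for j in graph[i] if out_weight[j] > 0
            )
            new_rank[i] = (1 - d) + d * incoming
        rank = new_rank
    return rank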

Page 15: Three Approaches to Unsupervised WSD

Evaluation

• First need to optimize 10 parameters, for example:
  – P1: minimum frequency of edges (occurrences)
  – P2: minimum frequency of vertices (words)
  – P3: edges with weights above this value are removed
• Train (tune the parameters) on Senseval2 using unsupervised metrics
  – Entropy, purity, and F-score
• Evaluate on Senseval3
  – Lexical sample task
    • 10-point gain over the MFS baseline
    • Beat a supervised system with lexical features by 1 point
  – All-words task
    • Little training data
    • Supervised systems barely beat the MFS baseline
    • This system is less than 1 point below the best system
    • The difference in performance is not statistically significant

Page 16: Three Approaches to Unsupervised WSD

Finding Predominant Sense

• Predominant senses in WordNet are derived from SemCor (a relatively small subset of the Brown corpus)
• Idiosyncrasies
  – tiger (the audacious-person sense, not the animal)
  – star (depending on the context, celebrity or celestial body)

Page 17: Three Approaches to Unsupervised WSD

Distributional Similarity

• Nouns that occur in object position of the same verbs are similar (e.g., beer and vodka as objects of "to drink")
• Can automatically generate a thesaurus-like neighbor list for the target word (Hindle, 1990), (Lin, 1998)
  – w0:s0, w1:s1, …, wn:sn
  – The neighbor list conflates the different senses
  – The quality and quantity of the neighbors must relate to the predominant sense
  – Need to compute the proximity of each neighbor to each of the senses of the target word (e.g., Lesk, JCN)

Page 18: Three Approaches to Unsupervised WSD

Algorithm

• w – the target word
• Nw = {n1, n2, …, nk} – the ordered set of the top k most similar neighbors of the target word
• {dss(w, n1), dss(w, n2), …, dss(w, nk)} – the distributional similarity score of each of the k neighbors
• wsi ∈ senses(w) – the senses of the target word
• wnss(wsi, nj) – the WordNet similarity score between sense i of the target word and the sense of neighbor nj that maximizes this score
• PrevalenceScore(wsi) – the score that ranks sense i of the target word as the predominant sense:

PrevalenceScore(wsi) = Σ_{nj ∈ Nw} dss(w, nj) × [ wnss(wsi, nj) / Σ_{wsi' ∈ senses(w)} wnss(wsi', nj) ]
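The formula above in code: each neighbor contributes its distributional similarity score, split across the target word's senses in proportion to the WordNet similarity. Here `senses`, the (neighbor, dss) pairs, and the `wnss` similarity callable (e.g. Lesk or JCN) are all supplied by the caller; nothing about this data format is prescribed by the paper.

def prevalence_scores(senses, neighbors, wnss):
    """`neighbors` is a list of (n_j, dss(w, n_j)) pairs from the thesaurus;
    `wnss(sense, neighbor)` returns the maximizing WordNet similarity score."""
    scores = {s: 0.0 for s in senses}
    for neighbor, dss in neighbors:
        sims = {s: wnss(s, neighbor) for s in senses}
        total = sum(sims.values())
        if total == 0:
            continue
        for s in senses:
            scores[s] += dss * (sims[s] / total)
    return scores

# The predominant sense is then the argmax:
# predominant = max(scores, key=scores.get)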

Page 19: Three Approaches to Unsupervised WSD

Experiment 1

• Derive a thesaurus from the BNC
• SemCor experiments
  – Metric: accuracy of finding the MFS
  – Metric: WSD accuracy
• Baseline: random accuracy
• Upper bound for the WSD task is 67%
• Both experiments beat the random baseline (54% and 48%, respectively)
• Hand examination
  – Some errors are due to genre and time-period variations

Page 20: Three Approaches to Unsupervised WSD

Experiment 2

• Use the Senseval2 all-words task
• Label each word with the first sense computed
  – automatically
  – according to SemCor
  – according to the Senseval2 data itself (upper bound)
• Automatic precision/recall are only a few points lower than SemCor's

Page 21: Three Approaches to Unsupervised WSD

Experiment 3

• Investigate how the MFS changes across domains
• Use the SPORTS and FINANCE domains of the Reuters corpus
• No hand-annotated data, so hand-examine the output
• Most words displayed the expected change in MFS
  – e.g., tie changes from the "draw" sense to the "affiliation" sense

Page 22: Three Approaches to Unsupervised WSD

Discussion: Algorithms

• Context
  – Bag-of-words: Schutze and Agirre et al.
  – Syntactic: McCarthy et al.
  – Is bag-of-words sufficient? (e.g., for topically amorphous words)
• Co-occurrence
  – Co-occurrence matrix: Schutze and Agirre et al.
  – Used to look for similar nouns: McCarthy et al.
• Order of co-occurrence
  – First order: all three papers
  – Second order: Schutze and McCarthy et al.
  – Higher order: Agirre et al.
• PageRank computes global rankings
• The MST links all nodes to the root
• These are advantages of the graph-based methods

Page 23: Three Approaches to Unsupervised WSD

Discussion: Evaluation

• Testbeds: little ground for cross-comparison
  – Schutze: his own corpus
  – Agirre et al.: train parameters on Senseval2 and test on Senseval3 data
  – McCarthy et al.: test on SemCor, Senseval2, Reuters
• Methodology
  – Map clusters to the gold standard (Schutze and Agirre et al.)
  – Unsupervised evaluation (Schutze and Agirre et al.)
  – Compare to various baselines (MFS, Lesk, random)
  – Use an application (Schutze)