Word Sense Disambiguation
Computational Lexical Semantics
Gemma Boleda (1), Stefan Evert (2)
(1) Universitat Politècnica de Catalunya
(2) University of Osnabrück
ESSLLI. Bordeaux, France, July 2009.
Thanks
These slides are based on Jurafsky & Martin (2004: chapter 20) and material by Ann Copestake (course at UPF, 2008)
Outline
1 Overview
2 Supervised WSD
3 Evaluation
4 Dictionary and Thesaurus Methods
5 Discussion
Overview
Word Sense Disambiguation
The task of selecting the correct sense for a word in context.
potentially helpful in many applications
machine translation, question answering, information retrieval, ...
we focus on WSD as a stand-alone task
artificial!
WSD algorithm
basic form:
input: a word in context, and a fixed inventory of word senses
output: the correct word sense for that use
context?
words surrounding the target word: annotated? just the words in no particular order? context size?
inventory?
task-dependent
machine translation from English to Spanish: set of Spanish translations
speech synthesis: homographs with differing pronunciations (e.g., bass)
stand-alone task: a lexical resource (usually, WordNet)
An example
WordNet Sense   Target Word in Context
bass4           ...fish as Pacific salmon and striped bass and...
bass4           ...produce filets of smoked bass or sturgeon...
bass7           ...exciting jazz bass player since Ray Brown...
bass7           ...play bass because he doesn’t have to solo...

Figure: Possible inventory of sense tags for the word bass
Variants of the task
lexical sample task
WSD for a small set of target words
a number of corpus instances are selected and labeled
similar to the task in our case study
→ supervised approaches; word-specific classifiers
all-words
WSD for all content words in a text
similar to POS-tagging, but with a very large “tagset”! → data sparseness
not enough training data for every word
Feature extraction
supervised approach → need to identify features that are predictive of word senses
fundamental (and early) insight: look at the context words
bass:
...smoked bass or...
...jazz bass player...
window (e.g., 1-word window)
Method
process the dataset (POS-tagging, lemmatization, parsing)
build a feature representation encoding the relevant linguistic information
two main feature types:
1 collocational features
2 bag-of-words features
Collocational features
features that take order or syntactic relations into account
restricted to the immediate word context (usually a fixed window). For example:
lemma and part of speech within a two-word window
syntactic function of the target word
Collocational features: Example
Example: (20.1) An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
2-word window representation, using parts of speech:
[guitar, NN, and, CC, player, NN, stand, VB]
[w−2, P−2, w−1, P−1, w+1, P+1, w+2, P+2]
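As a minimal Python sketch (illustrative, not from the slides), the following function extracts this kind of collocational representation; it assumes the sentence has already been POS-tagged into (word, tag) pairs:

def collocational_features(tagged, i, window=2):
    # Return [w-2, P-2, w-1, P-1, w+1, P+1, w+2, P+2] around position i.
    feats = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the target word itself
        j = i + offset
        # pad at sentence boundaries
        word, tag = tagged[j] if 0 <= j < len(tagged) else ("<PAD>", "<PAD>")
        feats.extend([word, tag])
    return feats

tagged = [("An", "DT"), ("electric", "JJ"), ("guitar", "NN"), ("and", "CC"),
          ("bass", "NN"), ("player", "NN"), ("stand", "VB"), ("off", "RP")]
print(collocational_features(tagged, 4))
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']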
Bag-of-words features
lexical features
pre-selected words that are potentially relevant for sense distinctions. For example:
for the all-words task: frequent content words in the corpus
for the lexical sample task: content words in the sentences of the target word
test for presence/absence of a certain word in the selected context
Bag-of-words features: Example
Example: (20.1) An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
pre-selected words:
[fishing, big, sound, player, fly]
feature vector:
[0, 0, 0, 1, 0]
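A minimal Python sketch of this step (illustrative; the vocabulary is the one pre-selected above):

def bow_vector(vocab, context):
    # 1 if the vocabulary word occurs anywhere in the context, else 0
    present = set(context)
    return [1 if w in present else 0 for w in vocab]

vocab = ["fishing", "big", "sound", "player", "fly"]
context = "an electric guitar and bass player stand off to one side".split()
print(bow_vector(vocab, context))  # [0, 0, 0, 1, 0]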
More on features
collocational cues account for:
“collocational” effects
bass+player=bass7
syntax-related sense differences
serve breakfast to customers vs. serve Philadelphia
bag-of-words features account for topic- and domain-related effects
resemblance to semantic fields, frames, ...
complementary information → both feature types are usually combined
Combined representation: Example
simplified representation for 2 sentences:
collocational features corresponding to a 1-word window:
...jazz bass player...
...smoked bass or...
bag-of-words features: only fishing and player
Combined representation
Weka format
@relation bass
@attribute wordL1 {jazz,smoke}
@attribute posL1 {CC,VBD}
@attribute wordR1 {player,or}
@attribute posR1 {CC,NN}
@attribute fishing {0,1}
@attribute player {0,1}
@attribute sense {s4,s7}
@data
jazz,CC,player,NN,0,1,s7
smoke,VBD,or,NN,0,0,s4

...jazz bass player...
...smoked bass or...
Method
any supervised algorithm
Decision Trees (for example, J48)
Decision Lists (similar to Decision Trees)
Naive Bayes (probabilistic)
...
and tool
Weka
R
SVMTool
your own implementation
...
Interim Summary
supervised approaches use sense-annotated datasets
need many annotated examples for every word
relevant information in the context:
lexico-syntactic information (collocational features)
lexical information (bag-of-words features)
the information is encoded in the form of features ...
and a classifier is trained to distinguish different senses of a given word
Extrinsic evaluation
long-term goal: improve performance in an end-to-end application
→ extrinsic evaluation (or task-based, end-to-end, in vivo evaluation)
example: Word Sense Disambiguation for (Cross-Lingual) Information Retrieval
http://ixa2.si.ehu.es/clirwsd
Intrinsic evaluation
however, extrinsic evaluation is difficult and time-consuming
→ intrinsic evaluation (or in vitro evaluation)
treat the WSD component as if it were a stand-alone system
measure: sense accuracy (percentage of words correctly tagged)
Accuracy = matches / total
method: held-out data from the same sense-tagged corpora used for training (train-test methodology)
to standardize datasets and methods: SensEval and SemEval competitions
example: our case study
Baseline
baseline: the performance we would get without much knowledge / with a simple approach
necessary for any Machine Learning experiment
(how good is 70%?)
simplest baseline: most frequent sense
WordNet: first sense heuristic (senses are ordered)
a very powerful baseline! → skewed distribution of senses in corpora
BUT we need access to annotated data for every word in the dataset to estimate sense frequencies
this is a “knowledge-laden” baseline
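A minimal Python sketch (illustrative, not from the slides) of computing this baseline from sense-tagged data:

from collections import Counter, defaultdict

def mfs_baseline(train, test):
    # train/test: lists of (word, sense) pairs
    counts = defaultdict(Counter)
    for word, sense in train:
        counts[word][sense] += 1
    mfs = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    correct = sum(1 for w, s in test if mfs.get(w) == s)
    return correct / len(test)  # baseline sense accuracy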
Ceiling
ceiling or upper bound for performance: inter-coder agreement
all-words corpora using WordNet: Ao ≈ 0.75–0.8
more coarse-grained sense distinctions: Ao ≈ 0.9
another possibility: avoid annotation by using pseudowords
banana-door
however: unrealistic → real polysemy is not like banana-doors!
need to find better ways to create pseudowords
Overview
sense-labeled corpora give accurate information – but they are scarce!
need other sources: dictionaries, thesauri, selectional restrictions, ...
idea: use dictionaries as corpora (identifying related words in definitions and examples)
An example
Example: (20.10) The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.
bank1 Gloss: a financial institution that accepts deposits and channels the money into lending activities
      Examples: “he cashed a check at the bank”; “that bank holds the mortgage on my home”
bank2 Gloss: sloping land (especially beside a body of water)
      Examples: “they pulled the canoe up on the bank”; “he sat on the bank of the river”
Figure: WordNet information for two senses of bank
Signatures
signature: a set of words that characterizes a given sense of a target word
extracted from dictionaries, thesauri, tagged corpora, ...
for example (20.10):
bank1: financial, institution, accept, deposit, channel, money, lending, activity, cash, check, hold, mortgage, home
bank2: sloping, land, body, water, pull, canoe, bank, sit, river
Lesk Algorithm
function SIMPLIFIED-LESK(word, sentence) returns best sense of word
  best-sense ← most frequent sense for word
  max-overlap ← 0
  context ← set of words in sentence
  for each sense in senses of word do
    signature ← set of words in the gloss and examples of sense
    overlap ← COMPUTE-OVERLAP(signature, context)
    if overlap > max-overlap then
      max-overlap ← overlap
      best-sense ← sense
  end
  return(best-sense)
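A direct Python transcription (a minimal sketch; the sense inventory and signatures are passed in as data, e.g. extracted from WordNet glosses and examples):

def simplified_lesk(sentence, senses):
    # senses: list of (sense_id, signature_word_set), most frequent sense first
    context = set(sentence.lower().split())
    best_sense = senses[0][0]  # default: most frequent sense
    max_overlap = 0
    for sense_id, signature in senses:
        overlap = len(set(signature) & context)
        if overlap > max_overlap:
            max_overlap, best_sense = overlap, sense_id
    return best_sense

senses = [("bank1", {"financial", "institution", "deposit", "mortgage"}),
          ("bank2", {"sloping", "land", "water", "canoe", "river"})]
print(simplified_lesk("she strolled by the river bank", senses))  # bank2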
Lesk Algorithm
Example: she strolled by the river bank.
best-sense ← bank1; max-overlap ← 0
context ← {she, stroll, river}
sense bank1:
  signature ← {financial, institution, accept, deposit, channel, money, lending, activity, cash, check, hold, mortgage, home}
  overlap ← 0; 0 > 0 fails
sense bank2:
  signature ← {sloping, land, body, water, pull, canoe, bank, sit, river}
  overlap ← 1; 1 > 0 succeeds
  best-sense ← bank2; max-overlap ← 1
return bank2
Discussion
right intuition: words that appear in dictionary definitions and examples are relevant to a given sense
problem: data sparseness: dictionary entries are short, and examples are not always given
→ the Lesk algorithm is currently used as a baseline
BUT many extensions are possible and have been tried (generalizations over lemmata, corpus data, weighting, ...)
AND dictionary-derived features can be used (are used) in standard supervised approaches
Interim Summary
information encoded in dictionaries (definitions, examples) is useful for WSD
can be used exclusively or in addition to other information (collocations, bag of words) for supervised approaches
the Lesk algorithm disambiguates solely on the basis of dictionary information
overlap between the dictionary entry and the context of the word occurrence
the most frequent sense and the Lesk algorithm are used as baselines for evaluation
Overview
we have a huge number of classes (senses)
need large hand-built resources:
supervised approaches need large annotated corpora (unrealistic)
dictionary methods need large dictionaries, which, even if available, often do not provide enough information
alternatives:
Minimally supervised WSD
Unsupervised WSD
both make use of unannotated data
these approaches are not as successful as supervised approaches
Minimally supervised WSD: Bootstrapping
for a given word, for example plant
start with a small number of annotated examples (seeds) for each sense
collect additional examples for each sense based on their similarity to the annotated examples
iterate
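As an illustrative sketch (toy data, not Yarowsky's full algorithm), the following Python code grows a labeled set from seeds using a one-sense-per-collocation style cue:

from collections import defaultdict

def bootstrap(seeds, unlabeled, rounds=5):
    # seeds: {context_tuple: sense}; unlabeled: list of context tuples
    labeled = dict(seeds)
    for _ in range(rounds):
        # 1. cues = words that (so far) occur with exactly one sense
        cue_senses = defaultdict(set)
        for context, sense in labeled.items():
            for w in context:
                cue_senses[w].add(sense)
        cues = {w: s.pop() for w, s in cue_senses.items() if len(s) == 1}
        # 2. label contexts whose cue words all point to the same sense
        grew = False
        for context in unlabeled:
            votes = {cues[w] for w in context if w in cues}
            if context not in labeled and len(votes) == 1:
                labeled[context] = votes.pop()
                grew = True
        if not grew:
            break  # no more confident additions
    return labeled

seeds = {("plant", "life"): "A", ("manufacturing", "plant"): "B"}
unlabeled = [("plant", "life", "animal"), ("animal", "species"),
             ("manufacturing", "plant", "equipment"), ("equipment", "factory")]
print(bootstrap(seeds, unlabeled))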
Bootstrapping: example
plant (Yarowsky 1995)
sense A: living entity; sense B: building
first examples: those that appear with life (sense A) and manufacturing (sense B)

Figure: Bootstrapping word senses. Figure 20.4 in Jurafsky & Martin.
Yarowsky 1995
Influential insights (used as heuristics in Yarowsky’s algorithm):
→ one sense per collocation
life+plant = plantA
manufacturing+plant = plantB
→ one sense per discourse
if a word appears multiple times in a text, all its occurrences will probably bear the same sense
also useful to enlarge datasets
Unsupervised WSD
no previous knowledge
no human-defined word senses
simply group examples according to their similarity
clustering
and infer senses from that
problem: hard to interpret and evaluate
Interim summary
WSD can be framed as a standard classification task
training data, feature definition, classifier, evaluation → supervised approaches
most useful information:
syntactic and lexical context (collocational features)
words related to the different senses of a given word (bag-of-words features)
words in dictionary (thesaurus, etc.) entries
other approaches try to make use of unannotated data
bootstrapping, unsupervised learning
would be great, but not as successful as supervised approaches (and harder to interpret and work with)
Useful empirical facts
skewed distribution of senses
→ most frequent sense baseline
→ heuristic when no other information is available
BUT the distribution varies with the text/corpus! (cone in a geometry textbook)
one sense per collocation
bass + player = bass7
→ simple cues for sense classification (heuristic)
one sense per discourse
different occurrences of a word in a given text tend to be used in the same sense
→ heuristic for classification and for data gathering
Conceptual problems
the task as currently defined does not allow for generalization over different words → learning is word-specific
number of classes = number of senses: equal to or greater than the number of words!
need training data for every sense of every word
most words have low frequency (Zipf’s law)
no chance with unknown words
this wouldn’t be a problem if word sense alternation were like bank1 − bank2 (homonymy)...
...but many alternations are systematic! (regular polysemy, metonymy, metaphor)
Regular polysemy
conversion
bank (N): financial institution
bank (V): put money in a bank
same for sugar, hammer, tango, etc. (also derivation: -ize)
adjectives (Boleda 2007)
qualitative vs. relational: cara familiar (‘familiar face’) vs. reunió familiar (‘family meeting’)
event-related vs. qualitative: fet sabut (‘known fact’) vs. home sabut (‘wise man’)
Regular polysemy: mass/count
animal/meat
chicken1: animal; chicken2: meat
lamb1: animal; lamb2: meat
...
portions/kinds: two beers
two servings of beer
two types of beer
generally: thing/derived substance (grinding)
After several lorries had run over the body, there was rabbit splattered all over the road.
Regular polysemy
verb alternations
causative/inchoative (Levin 1993)
John broke the window
The window broke
Spanish psychological verbs
Le preocupa la situación (Dative + Subject)
Bruna no quiere preocuparla (Subject + Accusative)
Contextual coercion / Logical metonymy
(Also see course by Louise McNally.)
object to eventuality (Pustejovsky 1995)
Mary enjoyed the book.
After three martinis, Kim felt much happier.
adjectives (Pustejovsky 1995): event selection
fast runner vs. fast typist vs. fast car
Metonymy
container/content
He drank a bottle of whisky.
Morphology again: He drank a bottleful of whisky. (-ful suffixation)
fruit/plant
olive, grapefruit, ...
Spanish: often the tree is masculine (olivo, naranjo), the fruit feminine (oliva, naranja)
figure/ground
Kim painted the door
Kim walked through the door
Metonymy
country names
Location: I live in China.
Government: The US and Libya have agreed to work together to solve...
Team (sports): England won last year’s World Cup.
more generally: institutions
Barcelona applied for the Olympic Games.
The banks won’t give credit now.
The newspapers criticized this policy.
object/person
The cello is playing badly.
Not so regular: contextual metaphor: The ham sandwich wants his check. (Lakoff & Johnson 1980)
Metaphor
physical → mental
depart1: physical transfer; arrive1: physical transfer; go1: physical transfer
depart2: mental transfer; arrive2: mental transfer; go2: mental transfer
concrete → abstract
aigua clara (‘clear water’) vs. estil clar (‘clear style’)
cabells negres (‘black hair’) vs. humor negre (‘black humour’)
To sum up
pervasive systematicity in sense alternations: regular polysemy, metonymy, metaphor
productive:
We found a little, hairy wampimuk sleeping behind the tree (McDonald & Ramscar 2001)
Wampimuk soup is delicious!
inherent property of language
analogical reasoning (psychology again)
WSD as currently handled cannot capture these regularities
theoretical and practical problem!
WSD and regularities: what one can do
generalize on FEATURES
e.g., jazz → MUSIC-STYLE → jazz, rock, blues, ...
provided some lexical resource is available that encodes this information
He is a jazz bass player. → I love bass solos in rock music.
problem: when (how) to generalize? when to stop?
WSD and regularities: what would be desirable
train on chicken and use the data for lamb, wampimuk, ...
Resources such as WordNet encode the meat/animal distinction:
WordNet info for chicken:
chicken1: the flesh of a chicken used for food.
chicken2: a domesticated gallinaceous bird (hyponym).
chicken3: a person who lacks confidence.
chicken4: a foolhardy competition.
WordNet info for lamb:
lamb1: young sheep.
lamb2: a person easily deceived or cheated.
lamb3: a sweet innocent mild-mannered person.
lamb4: the flesh of a young domestic sheep eaten as food.
WHAT IS MISSING: the link between chicken2 and lamb1, and between chicken1 and lamb4 (note the other senses)
Classifier example 1: Naive Bayes
probabilistic classifier (related to HMMs)
choosing the best sense amounts to choosing the most probable sense given the feature vector
a conditional probability
BUT it is impossible to train directly (too many feature combinations)
2 strategies:
decomposing the probabilities (Bayes’ rule) → easier to estimate
making an unrealistic assumption: words are independent (→ Naive Bayes)
training the classifier = estimating probabilities from the sense-tagged corpus
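Spelled out as a worked decision rule (standard Naive Bayes, matching the two strategies above):

ŝ = argmax_s P(s | f1, ..., fn)
  = argmax_s P(f1, ..., fn | s) · P(s)     (Bayes’ rule; the denominator is constant across senses)
  ≈ argmax_s P(s) · ∏j P(fj | s)           (naive independence assumption)

where the prior P(s) and the conditionals P(fj | s) are estimated as (smoothed) relative frequencies in the sense-tagged training corpus.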
Classifier example 2: Decision Lists
similar to decision trees (difference: only one condition)

Rule                    Sense
fish within window   →  bass4
striped bass         →  bass4
guitar within window →  bass7
play/V bass          →  bass7

Figure: Decision list for the word bass
to learn a decision list classifier:
generate and order tests according to the training data
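A minimal Python sketch of this learning step (toy data; tests are ranked by a smoothed log-likelihood ratio, in the spirit of Yarowsky's decision lists):

import math
from collections import Counter

def learn_decision_list(examples, alpha=0.1):
    # examples: list of (feature_set, sense) pairs with senses 's4'/'s7'
    counts = Counter()
    for feats, sense in examples:
        for f in feats:
            counts[(f, sense)] += 1
    rules = []
    for f in {f for f, _ in counts}:
        a = counts[(f, "s4")] + alpha   # smoothed counts per sense
        b = counts[(f, "s7")] + alpha
        rules.append((abs(math.log(a / b)), f, "s4" if a > b else "s7"))
    rules.sort(reverse=True)            # strongest test first
    return [(f, sense) for _, f, sense in rules]

def classify(rules, feats, default="s4"):
    for f, sense in rules:
        if f in feats:
            return sense                # the first matching test decides
    return default

examples = [({"fish", "striped"}, "s4"), ({"smoked", "fish"}, "s4"),
            ({"guitar", "player"}, "s7"), ({"play", "jazz"}, "s7")]
print(classify(learn_decision_list(examples), {"jazz", "player"}))  # s7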