Annotating the WordNet Glosses

Ben Haskell <[email protected]>

2004/10/08

Annotating the Glosses

• Annotating open-class words with their WordNet sense tag (a.k.a. sense-tagging)

• A disambiguation task: Process of linking an instance of a word to the WordNet synset representing its context-appropriate meaning, e.g.

run a company vs. run an errand

[Figure: the two senses of “run” in their WordNet neighborhoods]

. . . run an errand . . . → { run#29 }, v: carry out; “run an errand”

. . . run a company . . . → { run#12, operate }, v: direct or control; projects, businesses, etc.; “She is running a relief operation in the Sudan”

Other synsets shown in the figure: { control, command }, { change, alter, modify }, { end, terminate }, { complete, finish }, { carry_through, accomplish, execute, carry_out, action, fulfil, fulfill }, { make, create }, { cause, do, make }, { effect, effectuate, bring_about, set_up }, { manage, deal, care, handle }, { direct }

Glosses as node points in the network of relations

• Once a word’s gloss is annotated, the synsets for all conceptually-related words used in the gloss can be accessed via their sense tags

• Situates the word in an expanded network of links to other semantically-related words/concepts in WordNet
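A minimal sketch of what "accessed via their sense tags" could look like, using a toy fragment built from the dance#2 figure in this deck. The dictionaries `SYNSETS` and `GLOSS_TAGS` and the function `related_via_gloss` are hypothetical names for illustration, not WordNet's storage format.

```python
# Toy fragment of the network, limited to words shown in the dance#2 figure.
# SYNSETS and GLOSS_TAGS are hypothetical structures, not WordNet's own format.
SYNSETS = {
    "graceful#1": {"ANT": ["awkward"], "SIM": ["deft", "elegant", "fluent", "gainly"]},
    "rhythmical#1": {"ANT": ["unrhythmical"], "SIM": ["cadenced", "danceable", "pulsing"]},
    "way#8": {"SYN": ["manner", "mode", "style", "fashion"]},
}

# Sense tags attached to the open-class words of one annotated gloss.
GLOSS_TAGS = {
    "dance#2": ["graceful#1", "rhythmical#1", "way#8"],
}


def related_via_gloss(sense):
    """Collect words reachable from `sense` through its tagged gloss words."""
    reachable = set()
    for tag in GLOSS_TAGS.get(sense, []):
        for targets in SYNSETS.get(tag, {}).values():
            reachable.update(targets)
    return reachable


print(sorted(related_via_gloss("dance#2")))
```

Without the sense tags, the gloss words would be plain strings and none of these links could be followed.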

[Figure, built up over several slides: the gloss of { dance#2 }, v as a node in the WordNet network]

{ dance#2 }, v: move in a graceful and rhythmical way;

• Relations of dance#2 itself: IS-A { move }; ENTAIL { step }; DERIV dancer#1 and dancer#2 (the figure also shows social_dancer and professional_dancer)

• Via the gloss word { graceful#1 }, a: ANT awkward; SIM deft, elegant, fluent/fluid/liquid, gainly, . . .

• Via the gloss word { rhythmical#1 }, a: ANT unrhythmical; SIM beating/pulsating/pulsing, cadenced/cadent, danceable, . . .

• Via the gloss word { way#8 }, n: manner, mode, style, fashion

Annotating the Glosses

• Automatically tag monosemous words/collocations

• For gold standard quality, sense-tagging of polysemous words must be done manually

• More accurate sense-tagged data means better results for WSD systems, which means better performance from applications that depend on WSD

System overview

• Preprocessor
  – Gloss “parser” and tokenizer/lemmatizer
  – Semantic class recognizer
  – Noun phrase chunker
  – Collocation recognizer (globber)

• Automatic sense tagger for monosemous terms

• Manual tagging interface

Logical structure of a Gloss

• Smallest unit is a word, contracted form, or non-lexical punctuation

• Collocations are decomposed into their constituent parts
  – Allows coding of discontinuous collocations
  – A collocation can be treated either as a single unit or a sequence of forms
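One way to picture this structure is a flat list of word forms, each carrying the ids of the collocations it belongs to. The sketch below is an assumption about a possible representation; the class `WordForm`, its field names, and `collocation_forms` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WordForm:
    """Smallest unit of a gloss: a word, contracted form, or punctuation mark."""
    text: str
    colls: set = field(default_factory=set)  # ids of collocations this form is part of


def collocation_forms(forms, coll_id):
    """Read one collocation back as a sequence of forms, skipping what lies between."""
    return [f.text for f in forms if coll_id in f.colls]


# "North and South America": two collocations that share the form "America".
forms = [
    WordForm("North", {"a"}),
    WordForm("and"),
    WordForm("South", {"b"}),
    WordForm("America", {"a", "b"}),
]

print("_".join(collocation_forms(forms, "a")))  # the collocation as a single unit
print(collocation_forms(forms, "b"))            # the collocation as a sequence of forms
```

Because membership is stored on the individual forms, a collocation whose parts are separated by other words needs no special case.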

Example glosses

• n. pass, toss, flip: (sports) the act of throwing the ball to another member of your team; "the pass was fumbled"

• n. brace, suspender: elastic straps that hold trousers up (usually used in the plural)

• v. kick: drive or propel with the foot

[Figure: the parts of a gloss, in order]

• Optional info preceding def: domain category, etc.

• def

• Optional info following def: usage info, etc.

• ex* (zero or more examples)

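{ allegro#2 }, n

def: [a] [musical] [composition] [or] [passage] [performed] [quickly] . . .

• [musical] (shared by both collocations): coll=a coll=b

• [musical] . . . [composition]: coll=a, sk=musical_composition%1:10:00::

• [musical] . . . [passage]: coll=b, sk=musical_passage%1:10:00::

• [performed]: sk=perform%2:36:01::

• [quickly]: sk=quickly%4:02:00::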
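The sk values above are WordNet sense keys, which follow the documented layout lemma%ss_type:lex_filenum:lex_id:head_word:head_id. The following sketch splits a key into those fields; the `SenseKey` tuple and `parse_sense_key` function are illustrative names, not part of any WordNet tool.

```python
from typing import NamedTuple

# ss_type codes used in WordNet sense keys.
SS_TYPE = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb", 5: "adjective satellite"}


class SenseKey(NamedTuple):
    lemma: str
    pos: str
    lex_filenum: int
    lex_id: int
    head_word: str  # only filled in for adjective satellites
    head_id: str


def parse_sense_key(sk: str) -> SenseKey:
    """Split 'lemma%ss_type:lex_filenum:lex_id:head_word:head_id' into its fields."""
    lemma, _, rest = sk.partition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return SenseKey(lemma, SS_TYPE[int(ss_type)], int(lex_filenum), int(lex_id),
                    head_word, head_id)


for key in ["musical_composition%1:10:00::", "perform%2:36:01::", "quickly%4:02:00::"]:
    print(parse_sense_key(key))
```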

Gloss “parser”

• Regularization & clean-up of the gloss

• Recognize & XML tag <def>, <aux>, <ex>, <qf>, verb arguments, domain <classif>

• <aux> and <classif> contents do not get tagged

• Replace XML-unfriendly characters (&, <, >) with XML entities
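A rough sketch of two of these steps, tagging <def> and <ex> and escaping XML-unfriendly characters, under the simplifying assumption that examples are exactly the double-quoted parts of the gloss. The function `gloss_to_xml` is hypothetical, and <aux>, <qf>, <classif> and verb arguments are left out.

```python
import re
from xml.sax.saxutils import escape  # replaces &, < and > with XML entities


def gloss_to_xml(gloss: str) -> str:
    """Wrap the definition in <def> and each quoted example in <ex>."""
    examples = re.findall(r'"([^"]*)"', gloss)
    # What remains after removing the quoted examples is treated as the definition.
    definition = re.sub(r'"[^"]*"', "", gloss).strip(" ;")
    parts = [f"<def>{escape(definition)}</def>"]
    parts += [f"<ex>{escape(ex)}</ex>" for ex in examples]
    return "".join(parts)


print(gloss_to_xml('the act of throwing the ball to another member of your team; '
                   '"the pass was fumbled"'))
```

Escaping is applied to the text content only, after the segments have been identified, so the tags themselves are never escaped.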

Tokenizer

• Isolate word forms

• Differentiate non-lexical from lexical punctuation
  – E.g., sentence-ending periods vs. periods in abbreviations

• Recognize apostrophe vs. quotation marks
  – E.g., states’ rights vs. `college-bound students’
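The period distinction can be illustrated with a small regex tokenizer. This is a sketch under stated assumptions: `ABBREVIATIONS` is a tiny hypothetical list, and the apostrophe vs. quotation mark distinction is not handled here.

```python
import re

# Hypothetical abbreviation list; a real tokenizer would need a much larger one.
ABBREVIATIONS = {"e.g.", "etc.", "U.S."}

# A word form (allowing internal hyphens or apostrophes) or one punctuation mark.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(text: str) -> list[str]:
    """Isolate word forms, keeping the period of a known abbreviation attached."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # lexical period: part of the word form
        else:
            tokens.extend(TOKEN.findall(chunk))  # any period here is non-lexical
    return tokens


print(tokenize("used in the U.S. etc."))
print(tokenize("kick the ball."))
```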

Lemmatizer

• A lemma is the WordNet entry form plus WordNet part of speech

• Inflected forms are uninflected using a stemmer developed in-house specifically for this task

• A <wf> may be assigned multiple potential lemmas
  – saw: lemma=“saw%1|saw%2|see%2”
  – feeling: lemma=“feeling%1|feel%2”
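The two lemma strings above can be reproduced with a toy lookup. This is not the in-house stemmer mentioned on the slide; `LEXICON`, `IRREGULAR`, and `SUFFIX_RULES` are made-up miniature tables that only cover these examples.

```python
# Toy lexicon: entry form -> WordNet part-of-speech numbers (1 = noun, 2 = verb).
LEXICON = {"saw": {1, 2}, "see": {2}, "feeling": {1}, "feel": {2}}
# Irregular inflected forms -> (entry form, pos).
IRREGULAR = {"saw": [("see", 2)]}
# Regular suffixes to strip, with the part of speech each one implies.
SUFFIX_RULES = [("ing", 2), ("ed", 2), ("s", 1), ("s", 2)]


def candidate_lemmas(form: str) -> str:
    """Return every potential lemma of `form` as a 'lemma%pos|lemma%pos' string."""
    candidates = [(form, pos) for pos in sorted(LEXICON.get(form, ()))]
    candidates += [c for c in IRREGULAR.get(form, []) if c not in candidates]
    for suffix, pos in SUFFIX_RULES:
        stem = form[: -len(suffix)]
        if (form.endswith(suffix) and pos in LEXICON.get(stem, ())
                and (stem, pos) not in candidates):
            candidates.append((stem, pos))
    return "|".join(f"{lemma}%{pos}" for lemma, pos in candidates)


print(candidate_lemmas("saw"))
print(candidate_lemmas("feeling"))
```

Keeping all candidates, rather than picking one, defers the choice to the later tagging stages.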

Lemmatizer, cont.

• Exceptions: stopwords/phrases
  – Closed-class words (prepositions, pronouns, conjunctions, etc.)
  – Multi-word terms such as “by means of”, “according to”, “granted that”

• Hyphenated terms not in WordNet get split and separately lemmatized
  – E.g., over-fed becomes over + fed
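Both exceptions can be sketched in a few lines. The sets `STOPWORDS`, `STOP_PHRASES`, and `IN_WORDNET` below are tiny hypothetical samples, and the function names are illustrative.

```python
STOPWORDS = {"of", "the", "by", "to", "and"}  # closed-class words (tiny sample)
STOP_PHRASES = {("by", "means", "of"), ("according", "to"), ("granted", "that")}
IN_WORDNET = {"well-known", "over", "fed"}    # hypothetical stand-in for a WordNet lookup


def split_hyphenated(form):
    """Keep a hyphenated form if WordNet has it; otherwise split it on hyphens."""
    if "-" in form and form not in IN_WORDNET:
        return form.split("-")
    return [form]


def mark_stop_phrases(tokens):
    """Return (token, is_stop) pairs, trying multi-word phrases before single words."""
    out, i = [], 0
    while i < len(tokens):
        for phrase in sorted(STOP_PHRASES, key=len, reverse=True):
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.extend((t, True) for t in phrase)
                i += len(phrase)
                break
        else:  # no phrase starts here: fall back to the single-word list
            out.append((tokens[i], tokens[i] in STOPWORDS))
            i += 1
    return out


print(split_hyphenated("over-fed"))
print(mark_stop_phrases(["propel", "by", "means", "of", "force"]))
```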

Semantic class recognizer

• Recognizes and marks up parenthesized and free text belonging to a finite set of semantic classes

• chem(ical symbol), curr(ency), date, d(ate)range, math, meas(ure phrase), n(umeric)range, num(ber), punc(tuation), symb(olic text), time, year

• Words and phrases in these classes will not be sense-tagged
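A recognizer of this kind could be approximated with ordered regular expressions. The patterns below cover only three of the listed classes and are assumptions for illustration; the actual rules are not given on the slide.

```python
import re

# Illustrative patterns for three classes; earlier entries take priority on overlap.
CLASS_PATTERNS = [
    ("curr", re.compile(r"\$\d+(?:\.\d{2})?")),
    ("year", re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")),
    ("num", re.compile(r"\b\d+(?:\.\d+)?\b")),
]


def mark_classes(text):
    """Return (class, matched text) pairs in text order, without overlapping spans."""
    found, taken = [], []
    for name, pattern in CLASS_PATTERNS:
        for m in pattern.finditer(text):
            # Accept a match only if it does not overlap a span already claimed.
            if all(m.end() <= start or m.start() >= end for start, end in taken):
                found.append((m.start(), name, m.group()))
                taken.append(m.span())
    return [(name, matched) for _, name, matched in sorted(found)]


print(mark_classes("founded in 1746 with 3 colleges"))
```

Spans marked this way would then be skipped by the sense-tagging stages.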

Noun Phrase chunker

• Isolates noun phrases (“chunks”) in order to narrow the scope for finding noun collocations in the next stage

• Glosses are not otherwise syntactically parsed

• Trained and tagged POS using Thorsten Brants’s TnT statistical tagger

Noun Phrase chunker, cont.

• Trained and chunked noun phrases using Steven Abney’s partial parser Cass

• Enabled automatic recognition of otherwise ambiguous noun compounds and fixed expressions
  – E.g., opening move (JJ NN vs. VBG NN vs. VBG VB vs. NN VB), bill of fare (NN IN NN vs. VB IN NN)

• Increased noun collocation coverage by 25% (types) and 29% (tokens)
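To show how POS tags narrow the search for noun collocations, here is a toy chunker over already-tagged tokens. It is a stand-in for illustration only, not Cass or TnT; the tag sets `NP_TAGS` and `HEAD_TAGS` and the function `np_chunks` are assumptions.

```python
NP_TAGS = {"DT", "JJ", "VBG", "NN", "NNS"}  # tags allowed inside a chunk
HEAD_TAGS = {"NN", "NNS"}                   # a chunk must end in a noun


def np_chunks(tagged):
    """Group maximal runs of NP-internal tags into chunks that end in a noun head."""
    chunks, run = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag in NP_TAGS:
            run.append((word, tag))
            continue
        # Drop trailing non-noun tokens so every chunk ends in a noun.
        while run and run[-1][1] not in HEAD_TAGS:
            run.pop()
        if run:
            chunks.append([w for w, _ in run])
        run = []
    return chunks


tagged = [("an", "DT"), ("opening", "VBG"), ("move", "NN"), ("in", "IN"), ("chess", "NN")]
print(np_chunks(tagged))
```

With "opening move" inside a single chunk, the collocation stage only has to look within that chunk.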

Collocation recognizer

• Bag of Words approach
  – To find ‘North_America’, find glosses that have both ‘North’ and ‘America’

• Four passes
  1. Ghost: ‘bring_home_the_bacon’
     – mark ‘bacon’ so it won’t be tagged as monosemous
  2. Contiguous: ‘North_America’
  3. Disjoint: North (and) [(South) America]
  4. Examples: tag the synset’s collocations in its gloss
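The bag-of-words filter and the contiguous pass can be sketched as follows. The gloss ids, token lists, and function names are made up for illustration; the ghost, disjoint, and examples passes are not shown.

```python
def candidate_glosses(collocation, glosses):
    """Bag-of-words filter: ids of glosses that contain every part of the collocation."""
    parts = set(collocation.split("_"))
    return [gid for gid, tokens in glosses.items() if parts <= set(tokens)]


def find_contiguous(collocation, tokens):
    """Start indices where the collocation's parts occur as adjacent tokens."""
    parts = collocation.split("_")
    n = len(parts)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == parts]


# Hypothetical tokenized glosses.
glosses = {
    "g1": ["a", "country", "in", "North", "America"],
    "g2": ["North", "and", "South", "America"],
    "g3": ["South", "America"],
}

for gid in candidate_glosses("North_America", glosses):
    print(gid, find_contiguous("North_America", glosses[gid]))
```

Gloss g2 passes the cheap filter but has no contiguous match, which is the kind of case the disjoint pass is meant to pick up.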

Automatic sense-tagger

• Tag monosemous words.

• Words that have…
  – …only one lemmatized form
  – …only one WordNet sense
  – …not been marked as possibly ambiguous (i.e. non wait-list words, non ‘bacon’ words)
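The three conditions translate directly into a small check. The inventory `SENSES` and the set `AMBIGUOUS` below are toy data, not real WordNet counts, and `auto_tag` is a hypothetical function name.

```python
# Toy sense inventory (lemma%pos -> sense ids); not real WordNet counts.
SENSES = {
    "errand%1": ["errand#1"],
    "bacon%1": ["bacon#1"],
    "run%2": ["run#12", "run#29"],
}
# Words marked as possibly ambiguous: wait-listed words and ghost-collocation parts.
AMBIGUOUS = {"bacon%1"}


def auto_tag(lemma_attr, senses=SENSES, ambiguous=AMBIGUOUS):
    """Return the single sense of a monosemous word form, or None to leave it untagged."""
    candidates = lemma_attr.split("|")
    if len(candidates) != 1:  # more than one lemmatized form
        return None
    lemma = candidates[0]
    if lemma in ambiguous:    # marked as possibly ambiguous
        return None
    found = senses.get(lemma, [])
    return found[0] if len(found) == 1 else None  # exactly one sense required


for attr in ["errand%1", "bacon%1", "run%2", "saw%1|saw%2|see%2"]:
    print(attr, "->", auto_tag(attr))
```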

The mantag interface

• Simplicity
  – Taggers will repeat the same actions hundreds of times per day

• Automation
  – Instead of typing the 148,000 search terms, use a centralized list
  – Also allows for easy tracking of double-checking process

Statistics

Total number of glosses                117,549
Total number of words (tokens)       1,221,341
Total taggable words (tokens)          658,958   (57.9%)
  – auto-tagged                         86,914    13.2%
  – mono sense/pos                       3,872     0.6%
  – poly sense and/or pos              567,944    86.2%
  – not in WN                              228    ~0.0%

Statistics, cont.

Initial taggable collocations (tokens)   49,726
  – auto-tagged                          41,475    83.4%
  – mono sense/pos                          462     0.9%
  – poly sense and/or pos                 6,888    13.8%
  – not in WN                                 0     0.0%

Statistics, cont.

Total taggable word types                61,811
  – auto-tagged                          19,117    30.9%
  – mono sense/pos                          760     1.2%
  – poly sense and/or pos                41,650    67.4%
  – words not in WN                         127     0.2%
  – non-word forms                           30    ~0.0%

Statistics, cont.

Done thus far…

  automatic tags             130,770
  automatic collocations      49,726
  manual tags                 42,020
  manual collocations          2,961

Aim of ISI Effort

• Jerry Hobbs, Ulf Hermjakob, Nishit Rathod, Fahad al-Qahtani

• Gold standard translation of glosses into first-order logic with reified events

ISI Effort examples

In:

• move: move#v#2
• in: ignore
• a: ignore
• graceful: graceful#a#1
• and: ignore
• rhythmic: rhythmic#a#1
• way: way#n#8

Out:

gloss for dance, v, 2:

dance-V-2'(e0,x) -> move-V-2'(e1,x) & in'(e2,e1,y) & graceful-A-1'(e3,y) & rhythmic-A-1'(e4,y) & way-N-8'(e5,y)
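The predicate names in the output follow a visible pattern: a sense tag such as move#v#2 becomes move-V-2'. The sketch below reproduces only that surface naming, as inferred from the dance example; it does not model how event and entity variables are assigned, and the adverb in the allegro example (quickly#r#4 becoming quick-D-4') does not follow this simple rule. The function names are hypothetical.

```python
def predicate_name(sense_tag: str) -> str:
    """Turn a sense tag like 'move#v#2' into a predicate name like "move-V-2'"."""
    lemma, pos, number = sense_tag.split("#")
    return f"{lemma}-{pos.upper()}-{number}'"


def conjoin(atoms):
    """Render (sense tag, argument list) pairs as a conjunction of atoms."""
    return " & ".join(f"{predicate_name(tag)}({','.join(args)})" for tag, args in atoms)


# Argument variables are copied from the slide, not computed.
print(conjoin([
    ("move#v#2", ["e1", "x"]),
    ("graceful#a#1", ["e3", "y"]),
    ("rhythmic#a#1", ["e4", "y"]),
    ("way#n#8", ["e5", "y"]),
]))
```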

ISI Effort examples, cont.

In:

• a: ignore
• musical composition: musical_composition#n#1
• or: ignore
• (musical) passage: musical_passage#n#1
• performed: perform#v#2
• quickly: quickly#r#4

Out:

gloss for allegro, n, 2:

allegro-N-2'(e0,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x) & perform-V-2'(e2,y,x) & quick-D-4'(e3,e2)

musical_composition-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)

musical_passage-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)

ISI Method

• Identify the most common gloss patterns and convert them first

• Parse
  – using Charniak’s parser: uneven, sometimes bizarre results (“aspen”: VBN)
  – Hermjakob’s CONTEX parser: greater local control

ISI Progress

• Completed glosses of nouns with patterns:
  – NG (P NG)*: 45% of nouns
  – + NG ((VBN | VING) NG): 15% of nouns

• 45 + 15 = 60% complete!

• But gloss patterns are in a Zipf distribution:

Distribution of noun glosses

Pattern             Count   Share
NP (NP,PP)           7181     41%
NP (NP,SBAR)         2978     17%
NP (NP,VP)           2684     15%
NP (NP,PP,SBAR)       363      2%
NP (NP,CC,NP)         280      2%
NP (NP,DT,JJ,NN omitted: see next row)
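The consequence of the Zipf-like shape shows up when the listed shares are accumulated. The sketch below uses the counts and rounded percentages exactly as given in the distribution; `PATTERNS` and `cumulative_share` are illustrative names.

```python
# (pattern, count, share in percent) as listed in the distribution of noun glosses.
PATTERNS = [
    ("NP (NP,PP)", 7181, 41),
    ("NP (NP,SBAR)", 2978, 17),
    ("NP (NP,VP)", 2684, 15),
    ("NP (NP,PP,SBAR)", 363, 2),
    ("NP (NP,CC,NP)", 280, 2),
    ("NP (DT,JJ,NN)", 272, 2),
]


def cumulative_share(rows):
    """Running total of the rounded shares, in the order the patterns are listed."""
    total, out = 0, []
    for pattern, _count, share in rows:
        total += share
        out.append((pattern, total))
    return out


for pattern, covered in cumulative_share(PATTERNS):
    print(f"{pattern:<18} {covered:>3}%")
```

The first three patterns account for roughly 73% of the listed glosses, while the next three add only about 6 points together, so each further pattern handled yields less coverage than the one before.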