
Annotating the WordNet Glosses

Ben Haskell <ben@clarity.princeton.edu>

2004/10/08

Annotating the Glosses

• Annotating open-class words with their WordNet sense tag (a.k.a. sense-tagging)

• A disambiguation task: the process of linking an instance of a word to the WordNet synset representing its context-appropriate meaning, e.g.

run a company vs. run an errand
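As a concrete illustration of the ambiguity, here is a minimal sketch using NLTK's WordNet corpus reader (a modern stand-in; the project itself predates this interface) that lists the candidate verb senses of run:

    from nltk.corpus import wordnet as wn  # assumes NLTK with the WordNet corpus installed

    # Every verb synset of "run" is a candidate tag; disambiguation must
    # pick the context-appropriate one (one sense for "run a company",
    # another for "run an errand").
    for i, syn in enumerate(wn.synsets('run', pos=wn.VERB), start=1):
        lemmas = ', '.join(l.name() for l in syn.lemmas())
        print(f"run#{i}: {{ {lemmas} }} -- {syn.definition()}")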

[Figure: fragment of the WordNet verb hierarchy contrasting the two senses. "run an errand" maps to { run#29 }, v ("carry out; ‘run an errand’"); "run a company" maps to { run#12, operate }, v ("direct or control; projects, businesses, etc.; ‘She is running a relief operation in the Sudan’"). The two senses sit under distinct hypernym chains that pass through synsets such as { carry_through, accomplish, execute, carry_out, action, fulfil, fulfill }, { complete, finish }, { end, terminate }, { change, alter, modify }, { make, create }, { cause, do, make }, { effect, effectuate, bring_about, set_up }, { manage, deal, care, handle }, { control, command }, and { direct }.]

Glosses as node points in the network of relations

• Once a word’s gloss is annotated, the synsets for all conceptually-related words used in the gloss can be accessed via their sense tags

• Situates the word in an expanded network of links to other semantically-related words/concepts in WordNet
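For example, a minimal NLTK sketch (again a modern stand-in; the sense number is illustrative) follows links outward from one sense-tagged gloss word:

    from nltk.corpus import wordnet as wn

    # Suppose "graceful" in the gloss of dance#2 has been sense-tagged.
    # The tag is a direct pointer into the relational network, so the
    # word's antonyms and similar adjectives become reachable from dance#2.
    graceful = wn.synset('graceful.a.01')
    for lemma in graceful.lemmas():
        print(lemma.name(), '-> ANT:', [a.name() for a in lemma.antonyms()])
    print('SIM:', [s.name() for s in graceful.similar_tos()])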

[Figure, built up over four slides: the synset { dance#2 }, v, with gloss "move in a graceful and rhythmical way", shown as a node in the network of relations. Before gloss annotation, dance#2 already has an IS-A link to { move }, an ENTAIL link to { step }, and DERIV links to dancer#1 (social_dancer) and dancer#2 (professional_dancer). Annotating the gloss adds links through { graceful#1 }, a (ANT: awkward; SIM: deft, elegant, fluent, fluid, liquid, gainly, ...), { rhythmical#1 }, a (ANT: unrhythmical; SIM: beating, pulsating, pulsing, cadenced, cadent, danceable, ...), and { way#8 }, n (manner, mode, style, fashion).]

Annotating the Glosses

• Automatically tag monosemous words/collocations

• For gold standard quality, sense-tagging of polysemous words must be done manually

• More accurate sense-tagged data means better results for WSD systems, which means better performance from applications that depend on WSD


System overview

• Preprocessor
  – Gloss “parser” and tokenizer/lemmatizer
  – Semantic class recognizer
  – Noun phrase chunker
  – Collocation recognizer (“globber”)

• Automatic sense tagger for monosemous terms

• Manual tagging interface


Logical structure of a Gloss

• The smallest unit is a word, a contracted form, or non-lexical punctuation

• Collocations are decomposed into their constituent parts
  – Allows coding of discontinuous collocations
  – A collocation can be treated either as a single unit or as a sequence of forms (see the sketch below)
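A minimal sketch of the decomposition (the field names are hypothetical, but the shared-id scheme mirrors the coll=a/coll=b annotation shown later for allegro#2):

    # A discontinuous collocation coded as constituent word forms that
    # share a collocation id: "North America" spread across the phrase
    # "North and South America".
    gloss = [
        {'wf': 'North',   'coll': 'a'},
        {'wf': 'and'},
        {'wf': 'South',   'coll': 'b'},
        {'wf': 'America', 'coll': 'a,b'},  # constituent of both collocations
    ]

    def collocation(gloss, cid):
        # Reassemble one collocation from its (possibly discontinuous) parts.
        return '_'.join(wf['wf'] for wf in gloss
                        if cid in wf.get('coll', '').split(','))

    print(collocation(gloss, 'a'))  # North_America
    print(collocation(gloss, 'b'))  # South_America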


Example glosses

• n. pass, toss, flip: (sports) the act of throwing the ball to another member of your team; "the pass was fumbled"

• n. brace, suspender: elastic straps that hold trousers up (usually used in the plural)

• v. kick: drive or propel with the foot

[Figure: the logical structure of a gloss – optional info preceding the def (domain category, etc.), then the def itself, then optional info following the def (usage info, etc.), then zero or more examples (ex*).]

[Figure: the annotated def of { allegro#2 }, n – "[a] [musical] [composition] [or] [passage] [performed] [quickly]". "musical composition" is collocation a (sk=musical_composition%1:10:00::) and "musical passage" is collocation b (sk=musical_passage%1:10:00::), with the shared word form "musical" carrying both coll=a and coll=b; "performed" is tagged sk=perform%2:36:01:: and "quickly" sk=quickly%4:02:00::.]

Gloss “parser”

• Regularization & clean-up of the gloss• Recognize & XML tag <def>, <aux>,

<ex>, <qf>, verb arguments, domain <classif>

• <aux> and <classif> contents do not get tagged

• Replace XML-unfriendly characters (&, <, >) with XML entities
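For the last step, Python's standard library does the equivalent job (a stand-in for whatever the preprocessor actually used):

    from xml.sax.saxutils import escape

    # Replace the XML-unfriendly characters &, <, > with entities before
    # wrapping gloss text in markup.
    print(escape('bread & butter, a < b'))  # bread &amp; butter, a &lt; b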


Tokenizer

• Isolate word forms

• Differentiate non-lexical from lexical punctuation
  – E.g., sentence-ending periods vs. periods in abbreviations (see the sketch below)

• Recognize apostrophe vs. quotation marks
  – E.g., states’ rights vs. `college-bound students’
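A toy version of the abbreviation-period test (the abbreviation list is illustrative, not the project's actual lexicon):

    # Distinguish a sentence-ending period from one that is part of an
    # abbreviation token.
    ABBREVS = {'e.g.', 'i.e.', 'etc.', 'cf.', 'vs.'}

    def is_sentence_final_period(token):
        return token.endswith('.') and token.lower() not in ABBREVS

    print(is_sentence_final_period('etc.'))  # False
    print(is_sentence_final_period('dog.'))  # True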


Lemmatizer

• A lemma is the WordNet entry form plus WordNet part of speech

• Inflected forms are uninflected using a stemmer developed in-house specifically for this task

• A <wf> may be assigned multiple potential lemmas– saw: lemma=“saw%1|saw%2|see%2”

– feeling: lemma=“feeling%1|feel%2”
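WordNet's own morphology gives a rough stand-in for the in-house stemmer (NLTK's morphy returns at most one base form per part of speech, whereas the project kept every candidate):

    from nltk.corpus import wordnet as wn

    def candidate_lemmas(wordform):
        # Collect each (base form, pos) that WordNet morphology maps the
        # inflected form to, e.g. "feeling" -> feeling (noun) and feel
        # (verb), mirroring lemma="feeling%1|feel%2" above.
        candidates = set()
        for pos in (wn.NOUN, wn.VERB, wn.ADJ, wn.ADV):
            base = wn.morphy(wordform, pos)
            if base:
                candidates.add((base, pos))
        return sorted(candidates)

    print(candidate_lemmas('feeling'))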


Lemmatizer, cont.

• Exceptions: stopwords/phrases
  – Closed-class words (prepositions, pronouns, conjunctions, etc.)
  – Multi-word terms such as “by means of”, “according to”, “granted that”

• Hyphenated terms not in WordNet get split and separately lemmatized
  – E.g., over-fed becomes over + fed


Semantic class recognizer

• Recognizes and marks up parenthesized and free text belonging to a finite set of semantic classes

• chem(ical symbol), curr(ency), date, d(ate)range, math, meas(ure phrase), n(umeric)range, num(ber), punc(tuation), symb(olic text), time, year

• Words and phrases in these classes will not be sense-tagged
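A toy recognizer for a few of these classes (the patterns are illustrative only):

    import re

    # Illustrative patterns for three of the classes; the real recognizer
    # covered chem, curr, date, drange, math, meas, nrange, num, punc,
    # symb, time, and year.
    CLASS_PATTERNS = [
        ('year', re.compile(r'^[12]\d{3}$')),
        ('time', re.compile(r'^\d{1,2}:\d{2}$')),
        ('num',  re.compile(r'^\d+(?:\.\d+)?$')),
    ]

    def semantic_class(token):
        # Matched tokens are marked up and excluded from sense-tagging.
        for name, pattern in CLASS_PATTERNS:
            if pattern.match(token):
                return name
        return None

    print(semantic_class('1066'))  # year
    print(semantic_class('3.14'))  # num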


Noun Phrase chunker

• Isolates noun phrases (“chunks”) in order to narrow the scope for finding noun collocations in the next stage

• Glosses are not otherwise syntactically parsed

• POS tagging was trained and run using Thorsten Brants’s TnT statistical tagger


Noun Phrase chunker, cont.

• Trained and chunked noun phrases using Steven Abney’s partial parser Cass

• Enabled automatic recognition of otherwise ambiguous noun compounds and fixed expressions
  – E.g., opening move (JJ NN vs. VBG NN vs. VBG VB vs. NN VB), bill of fare (NN IN NN vs. VB IN NN)

• Increased noun collocation coverage by 25% (types) and 29% (tokens)
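A rough stand-in for the TnT + Cass pipeline using NLTK (the chunk grammar is illustrative; the project used trained statistical tools, not a hand-written grammar):

    import nltk

    # POS-tag a gloss, then chunk noun phrases with a toy grammar; the
    # resulting NP chunks narrow the search space for noun collocations.
    grammar = 'NP: {<DT>?<JJ>*<NN.*>+}'
    chunker = nltk.RegexpParser(grammar)
    tokens = nltk.pos_tag(nltk.word_tokenize('the act of throwing the ball'))
    print(chunker.parse(tokens))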


Collocation recognizer

• Bag-of-words approach
  – To find ‘North_America’, find glosses that have both ‘North’ and ‘America’ (see the sketch below)

• Four passes
  1. Ghost: ‘bring_home_the_bacon’ (mark ‘bacon’ so it won’t be tagged as monosemous)
  2. Contiguous: ‘North_America’
  3. Disjoint: North (and) [(South) America]
  4. Examples: tag the synset’s collocations in its gloss
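A minimal version of the bag-of-words filter (the data structures are hypothetical):

    # A gloss can contain the collocation 'North_America' only if it
    # contains both constituent words, so candidates are filtered cheaply
    # before any positional matching.
    def candidate_glosses(collocation, glosses):
        parts = set(collocation.lower().split('_'))
        for gloss_id, words in glosses.items():  # id -> set of word forms
            if parts <= words:
                yield gloss_id

    glosses = {'g1': {'north', 'and', 'south', 'america'},
               'g2': {'northern', 'lights'}}
    print(list(candidate_glosses('North_America', glosses)))  # ['g1']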


Automatic sense-tagger

• Tag monosemous words
• Words that have…
  – …only one lemmatized form
  – …only one WordNet sense
  – …not been marked as possibly ambiguous (i.e. non wait-list words, non ‘bacon’ words)
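A sketch of that decision using NLTK as a stand-in (the wait-list argument is hypothetical):

    from nltk.corpus import wordnet as wn

    def auto_tag(lemma_candidates, waitlist=frozenset()):
        # Tag only when there is exactly one lemmatized form, the form is
        # not flagged as possibly ambiguous (wait-list or ghost-marked
        # 'bacon' words), and it has exactly one WordNet sense.
        if len(lemma_candidates) != 1:
            return None
        lemma, pos = lemma_candidates[0]
        if lemma in waitlist:
            return None
        senses = wn.synsets(lemma, pos=pos)
        return senses[0] if len(senses) == 1 else None

    print(auto_tag([('angstrom', wn.NOUN)]))  # monosemous, so it gets tagged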


The mantag interface

• Simplicity
  – Taggers will repeat the same actions hundreds of times per day

• Automation
  – Instead of typing the 148,000 search terms, use a centralized list
  – Also allows easy tracking of the double-checking process


Statistics

Total number of glosses          117,549
Total number of words (tokens)   1,221,341
Total taggable words (tokens)    658,958 (57.9% of all tokens)
  auto-tagged                    86,914 (13.2% of taggable)
  mono sense/pos                 3,872 (0.6%)
  poly sense and/or pos          567,944 (86.2%)
  not in WN                      228 (~0.0%)


Statistics, cont.

Initial taggable collocations (tokens)   49,726
  auto-tagged                            41,475 (83.4%)
  mono sense/pos                         462 (0.9%)
  poly sense and/or pos                  6,888 (13.8%)
  not in WN                              0 (0.0%)


Statistics, cont.

Total taggable word types   61,811
  auto-tagged               19,117 (30.9%)
  mono sense/pos            760 (1.2%)
  poly sense and/or pos     41,650 (67.4%)
  words not in WN           127 (0.2%)
  non-word forms            30 (~0.0%)


Statistics, cont.

Done thus far…

automatic tags           130,770
automatic collocations   49,726
manual tags              42,020
manual collocations      2,961


Aim of ISI Effort

• Jerry Hobbs, Ulf Hermjakob, Nishit Rathod, Fahad al-Qahtani

• Gold standard translation of glosses into first-order logic with reified events


ISI Effort examples

In: gloss for dance, v, 2 – "move in a graceful and rhythmical way" – with each word aligned to a sense tag or marked ignore (function words map to logical primitives rather than WordNet senses):

  move → move#v#2
  in → ignore
  a → ignore
  graceful → graceful#a#1
  and → ignore
  rhythmical → rhythmic#a#1
  way → way#n#8

Out:

  dance-V-2'(e0,x) -> move-V-2'(e1,x) & in'(e2,e1,y) & graceful-A-1'(e3,y)
                      & rhythmic-A-1'(e4,y) & way-N-8'(e5,y)

ISI Effort examples, cont.

In: gloss for allegro, n, 2 – "a musical composition or passage performed quickly":

  a → ignore
  musical composition → musical_composition#n#1
  or → ignore
  (musical) passage → musical_passage#n#1
  performed → perform#v#2
  quickly → quickly#r#4

Out:

  allegro-N-2'(e0,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)
                        & perform-V-2'(e2,y,x) & quick-D-4'(e3,e2)

  musical_composition-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)

  musical_passage-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)

ISI Method

• Identify the most common gloss patterns and convert them first

• Parse
  – using Charniak’s parser: uneven, sometimes bizarre results (“aspen”: VBN)
  – using Hermjakob’s CONTEX parser: greater local control


ISI Progress

• Completed glosses of nouns with patterns:
  – NG (P NG)*: 45% of nouns
  – NG ((VBN | VING) NG): 15% of nouns

• 45 + 15 = 60% complete!

• But gloss patterns follow a Zipf distribution:


Distribution of noun glosses

  NP (NP,PP)        7,181   41%
  NP (NP,SBAR)      2,978   17%
  NP (NP,VP)        2,684   15%
  NP (NP,PP,SBAR)     363    2%
  NP (NP,CC,NP)       280    2%
  NP (DT,JJ,NN)       272    2%