SENSEVAL2 Scott Cotton and Martha Palmer ISLE Meeting Dec 11, 2000 University of Pennsylvania.

Page 1

SENSEVAL2

Scott Cotton and Martha Palmer

ISLE Meeting

Dec 11, 2000

University of Pennsylvania

Page 2

SENSEVAL

• SENSEVAL/SIGLEX98 (Brighton, Sep 98)
– Workshop on Word Sense Disambiguation
– Hector, corpus-based sense inventory
– 34 words: nouns, verbs, adjectives, mixed
– Inter-annotator agreement over 90%
– English (18 participating systems)
– Also Italian (2) and French (5)

Page 3

Siglex99: All words Experiment

• WSJ 5K word corpus
– running text
– WordNet 1.6
• 2100 words sense tagged twice (10 days)
– 89% inter-annotator agreement
– 700 verb tokens
– 81% agreement (disagreement in 90/350 verb tokens)
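Assuming the agreement figures above are raw percent agreement (the share of tokens on which both annotators chose the same sense), the calculation can be sketched as follows. The function name and the toy sense labels are illustrative only and are not taken from the experiment.

```python
def percent_agreement(tags_a, tags_b):
    """Share of tokens on which two annotators chose the same sense tag."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return matches / len(tags_a)


# Hypothetical sense tags for five tokens (not real SIGLEX99 data).
annotator_1 = ["bank%1", "run%2", "run%2", "board%1", "call%3"]
annotator_2 = ["bank%1", "run%2", "run%4", "board%1", "call%3"]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"{agreement:.0%}")
```

Raw agreement does not correct for chance; a chance-corrected measure would need the sense distribution as well.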

Page 4

SENSEVAL2

• Toulouse, France, July 5-6 (ACL’01)
– Samples, mid-December
– Training data, April
– Testing data, May

• 13 Languages
• Lexical sample and all words
• Standardized data and formats, central server
• Closer tie to applications

Page 5

13 Languages

• Swedish - lexical sample
– Dimitrios Kokkinakis <[email protected]>
• Chinese - lexical sample
– Chu-Ren Huang <[email protected]>
– Keh-jiann Chen <[email protected]>
• Danish - lexical sample
– Bolette Pedersen <[email protected]>
• Estonian - all words (in principle)
– Haldur Oim <[email protected]>

Page 6

13 Languages, cont.

• Japanese - lexical sample
– Sadao Kurohashi <[email protected]>
• Bangla - lexical sample
– Niladri Sekhar Dash <[email protected]>
• Italian - lexical sample
– Nicoletta Calzolari <[email protected]>
• English - lexical sample and all words
– Adam Kilgarriff <[email protected]>
– Martha Palmer <[email protected]>

Page 7

13 Languages, cont.

• Basque - lexical sample
– Eneko Agirre <[email protected]>
• Spanish - lexical sample
– Mariona Taulé <[email protected]>
– German Rigau <[email protected]>
• Korean
– Key-Sun Choi <[email protected]>
• Czech
– Ondrej Cikhart <[email protected]>
• Dutch
– Antal van den Bosch <[email protected]>

Page 8

Lexical Sample DTD

<!ELEMENT corpus (lexset+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT lexset (instance+)>
<!ATTLIST lexset item CDATA #REQUIRED>
<!ELEMENT instance (answer*,context)>
<!ELEMENT answer EMPTY>
<!ATTLIST answer senseid CDATA #REQUIRED weight CDATA #REQUIRED>
<!ELEMENT context (#PCDATA | itemloc)+>
<!ELEMENT itemloc (#PCDATA)>

Page 9

<!DOCTYPE corpus SYSTEM "lexical-sample.dtd">

<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine. </context>
    </instance>
  </lexset>
</corpus>
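As a rough illustration of how a participating system might read the lexical sample format, the sketch below parses a document shaped like the example above with Python's standard xml.etree.ElementTree module. The function name read_lexical_sample and the dictionary layout are assumptions made for this sketch, not part of the SENSEVAL2 specification. The DOCTYPE line is left out because ElementTree does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Sample document mirroring the slide's example (DOCTYPE omitted, since
# ElementTree performs no DTD validation).
SAMPLE = """\
<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine. </context>
    </instance>
  </lexset>
</corpus>
"""


def read_lexical_sample(xml_text):
    """Return one dict per <instance>: language, item, target, context, answers."""
    corpus = ET.fromstring(xml_text)
    rows = []
    for lexset in corpus.iter("lexset"):
        for instance in lexset.iter("instance"):
            context = instance.find("context")
            target = context.find("itemloc")
            rows.append({
                "lang": corpus.get("lang"),
                "item": lexset.get("item"),
                "target": target.text if target is not None else None,
                # itertext() joins the text around and inside <itemloc>.
                "context": "".join(context.itertext()).strip(),
                "answers": [
                    (a.get("senseid"), float(a.get("weight")))
                    for a in instance.findall("answer")
                ],
            })
    return rows


rows = read_lexical_sample(SAMPLE)
print(rows[0]["item"], rows[0]["target"], rows[0]["answers"])
```

Because the DTD allows answer*, an instance with no answer elements (as test data would presumably have) simply yields an empty answers list.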

Page 10

XML version?

<!ELEMENT corpus (descr?,rtext+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT descr (#PCDATA)>
<!ELEMENT rtext (descr?, (tloc | #PCDATA)+, answer*)>
<!ELEMENT tloc (#PCDATA)>
<!ATTLIST tloc id ID #REQUIRED>
<!ELEMENT answer (lexentry,loc+,sense+)>
<!ELEMENT lexentry (#PCDATA)>
<!ELEMENT loc EMPTY>
<!ATTLIST loc ids IDREFS #REQUIRED>
<!ELEMENT sense EMPTY>
<!ATTLIST sense senseid CDATA #REQUIRED weight CDATA #IMPLIED>

Page 11

<!DOCTYPE corpus SYSTEM "all-words.dtd">

<corpus lang="en">

<rtext>
<descr>taken from the man page for intro in section 3 of a FreeBSD 4.0 system.</descr>
<text>

Page 12

Words in text are tagged:

This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc> of the C <tloc id="w3">library</tloc> <tloc id="w4">functions</tloc>, their <tloc id="w5">error</tloc> <tloc id="w6">returns</tloc> and other <tloc id="w7">common</tloc> <tloc id="w8">definitions</tloc> and <tloc id="w9">concepts</tloc>.

Most of these <tloc id="w10">functions</tloc> <tloc id="w11">are</tloc> <tloc id="w12">available</tloc> from the C <tloc id="w13">library</tloc>, libc. Other <tloc id="w14">libraries</tloc>,

Page 13

Then, for each tag:

</text>

<answer>

<lexentry>section</lexentry>

<loc ids="w0"/>

<sense senseid="1"/>

</answer>

</rtext>

</corpus>
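To show how the answer elements tie back to the tagged words, the following sketch resolves each loc reference to its tloc token using Python's standard xml.etree.ElementTree module. The sample is a shortened, hypothetical version of the man page example; it keeps the text wrapper used in the example and writes the sense attribute as senseid, the name the all-words DTD declares. The function name read_all_words and the result layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the slide's example; the <descr> content is made up.
SAMPLE = """\
<corpus lang="en">
  <rtext>
    <descr>man page excerpt</descr>
    <text>This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an
    <tloc id="w2">overview</tloc> of the C <tloc id="w3">library</tloc>.</text>
    <answer>
      <lexentry>section</lexentry>
      <loc ids="w0"/>
      <sense senseid="1"/>
    </answer>
  </rtext>
</corpus>
"""


def read_all_words(xml_text):
    """Resolve each <answer> to the tagged tokens its <loc> elements point at."""
    corpus = ET.fromstring(xml_text)
    results = []
    for rtext in corpus.iter("rtext"):
        # Map tloc ids to surface tokens for this running text.
        tokens = {t.get("id"): t.text for t in rtext.iter("tloc")}
        for answer in rtext.findall("answer"):
            ids = []
            for loc in answer.findall("loc"):
                # An IDREFS value is a whitespace-separated list of ids.
                ids.extend(loc.get("ids").split())
            results.append({
                "lexentry": answer.findtext("lexentry"),
                "tokens": [tokens[i] for i in ids],
                "senses": [s.get("senseid") for s in answer.findall("sense")],
            })
    return results


answers = read_all_words(SAMPLE)
print(answers)
```

Since ids is declared as IDREFS, a single loc can point at several tloc elements, which would let one answer cover a multi-word expression.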