SENSEVAL2 Scott Cotton and Martha Palmer ISLE Meeting Dec 11, 2000 University of Pennsylvania.
SENSEVAL2
Scott Cotton and Martha Palmer
ISLE Meeting
Dec 11, 2000
University of Pennsylvania
SENSEVAL
• SENSEVAL/SIGLEX98 (Brighton, Sep 98)
– Workshop on Word Sense Disambiguation
– Hector, corpus-based sense inventory
– 34 words: nouns, verbs, adjectives, mixed
– Inter-annotator agreement over 90%
– English (18 participating systems)
– Also Italian (2) and French (5)
Siglex99: All words Experiment
• WSJ 5K word corpus
– running text
– WordNet 1.6
• 2100 words sense tagged twice (10 days)
– 89% inter-annotator agreement
– 700 verb tokens
– 81% agreement (disagreement in 90/350 verb tokens)
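The slides do not say how the agreement percentages were computed. A minimal sketch, assuming plain observed agreement (the share of tokens on which both annotators picked the same sense), could look like this; the sense labels and the function name `observed_agreement` are hypothetical, not taken from the experiment.

```python
def observed_agreement(tags_a, tags_b):
    """Fraction of tokens on which two annotators chose the same sense."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return matches / len(tags_a)


# Hypothetical sense tags for five tokens (not data from the experiment).
annotator_1 = ["bank%1", "bank%2", "run%3", "run%1", "call%2"]
annotator_2 = ["bank%1", "bank%2", "run%3", "run%2", "call%2"]

agreement = observed_agreement(annotator_1, annotator_2)
print(f"{agreement:.0%}")
```

Raw agreement of this kind does not correct for chance; a chance-corrected statistic would need the sense distribution as well.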
SENSEVAL2
• Toulouse, France, July 5-6 (ACL’01)
– Samples, mid-Dec
– Training data, April
– Testing data, May
• 13 Languages
• Lexical sample and all words
• Standardized data and formats, central server
• Closer tie to applications
13 Languages
• Swedish - lexical sample
– Dimitrios Kokkinakis <[email protected]>
• Chinese - lexical sample
– Chu-Ren Huang [email protected]
– Keh-jiann Chen <[email protected]>
• Danish - lexical sample
– Bolette Pedersen <[email protected]>
• Estonian - all words (in principle)
– Haldur Oim <[email protected]>
13 Languages, cont.
• Japanese - lexical sample
– Sadao Kurohashi [email protected]
• Bangla - lexical sample
– Niladri Sekhar Dash [email protected]
• Italian - lexical sample
– Nicoletta Calzolari <[email protected]>
• English - lexical sample and all words
– Adam Kilgarriff [email protected]
– Martha Palmer [email protected]
13 Languages, cont.
• Basque - lexical sample
– Eneko Agirre <[email protected]>
• Spanish - lexical sample
– Mariona Taulé <[email protected]>
– German Rigau <[email protected]>
• Korean
– Key-Sun Choi <[email protected]>
• Czech
– Ondrej Cikhart <[email protected]>
• Dutch
– Antal van den Bosch <[email protected]>
Lexical Sample DTD
<!ELEMENT corpus (lexset+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT lexset (instance+)>
<!ATTLIST lexset item CDATA #REQUIRED>
<!ELEMENT instance (answer*,context)>
<!ELEMENT answer EMPTY>
<!ATTLIST answer senseid CDATA #REQUIRED weight CDATA #REQUIRED>
<!ELEMENT context (#PCDATA | itemloc)+>
<!ELEMENT itemloc (#PCDATA)>
<!DOCTYPE corpus SYSTEM "lexical-sample.dtd">
<corpus lang="en">
<lexset item="banana">
<instance>
<answer senseid="0" weight="0.3"/>
<context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine. </context>
</instance>
</lexset>
</corpus>
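As an illustration of how a participating system might read this format, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not part of the SENSEVAL2 tooling: the function name `read_lexical_sample` and the embedded `SAMPLE` string are assumptions made for the example, the DOCTYPE line is left out of the sample, and no DTD validation is performed.

```python
import xml.etree.ElementTree as ET

# Sample document following the lexical sample DTD (DOCTYPE omitted).
SAMPLE = """<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine. </context>
    </instance>
  </lexset>
</corpus>"""


def read_lexical_sample(xml_text):
    """Yield (item, target word, context text, [(senseid, weight), ...]) per instance."""
    corpus = ET.fromstring(xml_text)
    for lexset in corpus.findall("lexset"):
        item = lexset.get("item")
        for instance in lexset.findall("instance"):
            # Each instance may carry zero or more weighted answers.
            answers = [(a.get("senseid"), float(a.get("weight")))
                       for a in instance.findall("answer")]
            context = instance.find("context")
            # The word to disambiguate is marked with <itemloc>.
            target = context.findtext("itemloc")
            # Flatten the mixed content and normalize whitespace.
            text = " ".join("".join(context.itertext()).split())
            yield item, target, text, answers


instances = list(read_lexical_sample(SAMPLE))
for item, target, text, answers in instances:
    print(item, target, answers)
    print(text)
```

Since `instance` allows `answer*`, the same reader would handle unlabeled test instances, which simply come back with an empty answer list.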
XML version?
<!ELEMENT corpus (descr?,rtext+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT descr (#PCDATA)>
<!ELEMENT rtext (descr?, (tloc | #PCDATA)+, answer*)>
<!ELEMENT tloc (#PCDATA)>
<!ATTLIST tloc id ID #REQUIRED>
<!ELEMENT answer (lexentry,loc+,sense+)>
<!ELEMENT lexentry (#PCDATA)>
<!ELEMENT loc EMPTY>
<!ATTLIST loc ids IDREFS #REQUIRED>
<!ELEMENT sense EMPTY>
<!ATTLIST sense senseid CDATA #REQUIRED weight CDATA #IMPLIED>
<!DOCTYPE corpus SYSTEM "all-words.dtd">
<corpus lang="en">
<rtext> <descr> taken from the intro man page in section 3
of a FreeBSD 4.0 system. </descr><text>
Words in text are tagged:
This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc> of the C <tloc id="w3">library</tloc> <tloc id="w4">functions</tloc>, their <tloc id="w5">error</tloc> <tloc id="w6">returns</tloc> and other <tloc id="w7">common</tloc> <tloc id="w8">definitions</tloc> and <tloc id="w9">concepts</tloc>.
Most of these <tloc id="w10">functions</tloc> <tloc id="w11">are</tloc> <tloc id="w12">available</tloc> from the C <tloc id="w13">library</tloc>, libc. Other <tloc id="w14">libraries</tloc>,
Then, for each tag:
</text>
<answer>
<lexentry>section</lexentry>
<loc ids="w0"/>
<sense senseid="1"/>
</answer>
</rtext>
</corpus>
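To show how the `loc` references tie an answer back to the tagged words, here is a minimal sketch in Python using the standard `xml.etree.ElementTree`. It follows the element and attribute names declared in the all-words DTD; the function name `read_all_words` and the shortened `SAMPLE` text are assumptions for illustration only, and no DTD validation is performed.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample using the names declared in the DTD.
SAMPLE = """<corpus lang="en">
  <rtext>
    <descr>hypothetical excerpt</descr>
    This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc>.
    <answer>
      <lexentry>section</lexentry>
      <loc ids="w0"/>
      <sense senseid="1"/>
    </answer>
  </rtext>
</corpus>"""


def read_all_words(xml_text):
    """Return (lexentry, [tagged words], [(senseid, weight), ...]) per answer."""
    corpus = ET.fromstring(xml_text)
    results = []
    for rtext in corpus.findall("rtext"):
        # Index every tagged word by its id; iter() also finds nested <tloc>.
        words = {t.get("id"): t.text for t in rtext.iter("tloc")}
        for answer in rtext.findall("answer"):
            lexentry = answer.findtext("lexentry")
            # 'ids' is an IDREFS value: whitespace-separated tloc ids.
            tokens = [words[i]
                      for loc in answer.findall("loc")
                      for i in loc.get("ids").split()]
            # 'weight' is optional, so it may come back as None.
            senses = [(s.get("senseid"), s.get("weight"))
                      for s in answer.findall("sense")]
            results.append((lexentry, tokens, senses))
    return results


answers = read_all_words(SAMPLE)
print(answers)
```

Because `ids` is declared as IDREFS, one `loc` can point at several `tloc` elements, which would let a single answer cover a multi-word expression.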