Evaluating the Waspbench
Transcript of Evaluating the Waspbench
A Lexicography Tool Incorporating Word Sense Disambiguation
Rob Koeling, Adam Kilgarriff,
David Tugwell, Roger Evans
ITRI, University of Brighton
Credits: UK EPSRC grant WASPS, M34971
Lexicographers need NLP
NLP needs lexicography
Word senses: nowhere truer
– Lexicography: the second hardest part
– NLP: word sense disambiguation (WSD)
  SENSEVAL-1 (1998): 77% (Hector); SENSEVAL-2 (2001): 64% (WordNet)
– Machine Translation: main cost is lexicography
Synergy
The WASPBENCH
Inputs and outputs
Inputs
– Corpus (processed)
– Lexicographic expertise
Outputs
– Analysis of meaning/translation repertoire
– Implemented: a word expert that can disambiguate
A “disambiguating dictionary”
Inputs and outputs
MT needs rules of the form: in context C, S => T
– Major determinant of MT quality
– Manual production: expensive
– Eng oil => Fr huile or pétrole?
SYSTRAN: 400 rules
Waspbench output: thousands of rules
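The “in context C, S => T” rule form can be sketched as a context-triggered lookup. This is an illustrative sketch only, not the actual Waspbench rule format; the clue-word lists and the `translate_oil` function are invented for the oil => huile/pétrole example above.

```python
# Illustrative sketch of a "in context C, S => T" translation rule
# for English "oil". Clue lists are invented, not from Waspbench.

def translate_oil(context_words):
    """Pick a French translation for 'oil' from nearby clue words."""
    mineral_clues = {"crude", "barrel", "drilling", "pipeline", "opec"}
    culinary_clues = {"olive", "cooking", "frying", "salad", "vegetable"}

    words = {w.lower() for w in context_words}
    if words & mineral_clues:
        return "pétrole"   # mineral oil
    if words & culinary_clues:
        return "huile"     # cooking oil
    return "huile"         # default sense when no clue fires

print(translate_oil(["crude", "prices", "rose"]))   # pétrole
print(translate_oil(["olive", "and", "vinegar"]))   # huile
```

A hand-built MT lexicon encodes a few hundred such rules; the point of the talk is that a tool like Waspbench can induce thousands of them from corpus evidence plus lexicographer input.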
Evaluation: hard
– Three communities
– No precedents
– The art and craft of lexicography
– MT personpower budgets
Five threads
– as WSD: SENSEVAL
– for lexicography: MED expert reports
– quantitative experiments with human subjects (India)
– within-group consistency (Leeds)
– comparison with commercial MT
Method
– Human1 creates word experts
– Computer uses word experts to disambiguate test instances
– MT system translates the same test instances
– Human2 evaluates computer and MT performance on each instance:
  good / bad / unsure / preferred / alternative
Words
– mid-frequency: 1,500–20,000 instances in the BNC
– at least two clearly distinct meanings
  (checked with reference to translations into French/German/Dutch)
– 33 words: 16 nouns, 10 verbs, 7 adjectives
– around 40 test instances per word
Words
Nouns: bank, chest, coat, fit, line, lot, mass, party, policy, record, seal, step, term, volume
Verbs: charge, float, move, observe, offend, post, pray, toast, undermine
Adjectives: bright, free, funny, hot, moody, strong
Human subjects
– translation studies students, University of Leeds (thanks: Tony Hartley)
– native/near-native in English and their other language
– twelve people, working with: Chinese (4), French (3), German (2), Italian (1), Japanese (2) (no MT system for Japanese)
– circa four days’ work: introduction/training, two days to create word experts, two days to evaluate output
Method
– Human1 creates word experts, average 30 mins/word
– Computer uses word experts to disambiguate test instances
– MT system (Babelfish via AltaVista) translates the same test instances
– Human2 evaluates computer and MT performance on each instance:
  good / bad / unsure / preferred / alternative
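One plausible way to turn the per-instance judgements into the headline percentages is sketched below. This is an assumption about the scoring, not the paper’s documented procedure: here an output judged good, preferred, or alternative counts as acceptable, and “both”/“neither” mean both or neither of the two systems produced an acceptable output for that instance.

```python
# Sketch of an assumed scoring scheme for paired judgements.
# Each test instance yields (waspbench_judgement, mt_judgement).

OK = {"good", "preferred", "alternative"}  # assumed "acceptable" set

def summarise(judgements):
    """Return rounded percentages: Waspbench ok, MT ok, both, neither."""
    n = len(judgements)
    wasps = sum(w in OK for w, m in judgements)
    mt = sum(m in OK for w, m in judgements)
    both = sum(w in OK and m in OK for w, m in judgements)
    neither = sum(w not in OK and m not in OK for w, m in judgements)
    pct = lambda k: round(100 * k / n)
    return pct(wasps), pct(mt), pct(both), pct(neither)

data = [("good", "bad"), ("good", "good"),
        ("bad", "bad"), ("preferred", "good")]
print(summarise(data))  # (75, 50, 50, 25)
```

Under this reading, the result tables that follow report, per language and per part of speech, how often each system’s output was judged acceptable.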
Results (%)
Lang Wasps MT both neither unsure
Ger 60 28 19 26 5
Fr 61 45 37 28 4
Ch 68 42 37 23 3
It 67 29 23 22 5
All 64 36 29 25 4
Results by POS (%)
POS Wasps MT both neither
Nouns 69 40 35 24
Verbs 61 38 32 27
Adjs 63 41 31 24
Observations
– grad student users, 4-hour training
– 30 mins per (not-too-complex) word
– ‘fuzzy’ words intrinsically harder
– no great inter-subject disparities (it’s the words that vary, not the people)
Conclusion
WSD can improve MT (using a tool like WASPS)
Future work
– multiwords
– n > 2
– thesaurus
– other source languages
– new corpora, bigger corpora: the web