1 Evaluating word sketches Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd...
-
Upload
kellie-harrington -
Category
Documents
-
view
214 -
download
0
Transcript of 1 Evaluating word sketches Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd...
1
Evaluating word sketches
Adam KilgarriffLexical Computing LtdLexicography MasterClass LtdUniversities of Leeds and Sussex
Adam Kilgarriff 2 Geneva, April 2010
Overview Research programme Examples:
Case study• Word sketching
Evaluating word sketches
Adam Kilgarriff 3 Geneva, April 2010
What is language?
Adam Kilgarriff 4 Geneva, April 2010
What is language? In our heads
Adam Kilgarriff 5 Geneva, April 2010
What is language? In our heads In texts and sound signals
Adam Kilgarriff 6 Geneva, April 2010
What is language? In our heads In texts and sound signals Both
Adam Kilgarriff 7 Geneva, April 2010
Methodology
Study language in our headsCompetenceChomsky“rationalist” (Descartes, Leibniz)
Adam Kilgarriff 8 Geneva, April 2010
Methodology
Study language in our headsCompetenceChomsky“rationalist” (Descartes, Leibniz)Odd method for objective sciencePractical problems: coverage,
arbitrariness
Adam Kilgarriff 9 Geneva, April 2010
Methodology
Study text“empiricist” (Locke, Hume)
Physics: forces, matterChemistry: chemicals, bondsLanguage: text, speech signals
Adam Kilgarriff 10 Geneva, April 2010
It goes against the grain
What is important about a sentence?its meaning
Corpus methodology:Throw away individual sentence
meaningFind patterns
Adam Kilgarriff 11 Geneva, April 2010
Twenty years of rapid ascent
Computer power Corpora
bigger and bigger data sets Language technology tools
lemmatizers, POS-taggers, parsersMachine learning, pattern-finding
Adam Kilgarriff 12 Geneva, April 2010
A virtuous circle
Pattern finding
Linguisticprocessing
Corpus
Lexicon•Part-of-speech tagging
•Parsing
•Lemmatizing
More data →
gets richer eachtime round
Adam Kilgarriff 13 Geneva, April 2010
Case study: corpus lexicography
- four ages
Adam Kilgarriff 14 Geneva, April 2010
Age 1:Pre-computer
Oxford English Dictionary:• 20 million index cards
Adam Kilgarriff 15 Geneva, April 2010
Age 2: KWIC Concordances
From 1980 Computerised
Adam Kilgarriff 16 Geneva, April 2010
Age 2: KWIC Concordance
1 arity, which will be used to take a party of under-privileged children 2 from outside. You are invited to a party and after a couple of drinks 3 tion, we believe politicians of all parties will listen to our views.4 ould be reaching agreement with all parties concerned, as to which 5 lack people. I have certainly been party to one or two discussions 6 . These should be discussed by both parties before entering into the7 presents They had hosted a cocktail party at Kensington palace, for8 akes. By midnight the end-of-course party is in full swing, but most 9 e should be a right for the injured party to terminate the contract. A10 by the Safran Peoples ' Liberation Party. This presents the powerful11 s. Ahead I could see the rest of my party plodding towards the final 12 cial ethic. The two main political parties - the Tories and the 13 ritish successes in Perth The small party of British players competing 14 to help control. One member of the party went to summon the rescue15 rket society fashion magazine. The party was held at his flat which16 security and secrecy than any Tory Party Conference : it seems that
Adam Kilgarriff 17 Geneva, April 2010
Age 2: KWIC Concordances
From 1980 Computerised COBUILD project was innovator the coloured-pens method
Adam Kilgarriff 18 Geneva, April 2010
1 arity, which will be used to take a party of under-privileged children to D2 from outside. You are invited to a party and after a couple of drinks you d3 tion, we believe politicians of all parties will listen to our views. &equo4 ould be reaching agreement with all parties concerned , as to which events,5 lack people. I have certainly been party to one or two discussions amongst 6 . These should be discussed by both parties before entering into the relatio7 presents They had hosted a cocktail party at Kensington palace, for example8 akes. By midnight the end-of-course party is in full swing, but most cadet9 e should be a right for the injured party to terminate the contract. A mana10 by the Safran Peoples ' Liberation Party . This presents the powerful neigh11 s. Ahead I could see the rest of my party plodding towards the final slope t12 cial ethic. The two main political parties - the Tories and the Liberals -13 ritish successes in Perth The small party of British players competing in th14 to help control. One member of the party went to summon the rescue team and15 rket society fashion magazine. The party was held at his flat which was a l16 security and secrecy than any Tory Party Conference : it seems that bootleg
The coloured pens method
Adam Kilgarriff 19 Geneva, April 2010
Age 2: limitations
as corpora get bigger:too much data
• 50 lines for a word: read all • 500 lines: could read all, takes a long
time• 5000 lines: no
Adam Kilgarriff 20 Geneva, April 2010
Age 3: Collocation statistics
Problem:too much data - how to summarise?
Solution:list of words occurring in neighbourhood of headword, with frequencies
Sorted by salience
Adam Kilgarriff 21 Geneva, April 2010
Collocation listing For collocates of save (>5 hits), window 1-5 words to right of nodeword
wordword
yourmoney
estimatedjobs
faceannually
thousandsenormous
costslives
dollars$1.2
lifeforests
Adam Kilgarriff 22 Geneva, April 2010
Age 4: The word sketch
A corpus-derived one-page summary of a word’s grammatical and collocational behaviour
Adam Kilgarriff 23 Geneva, April 2010
Age 4: The word sketch
Large corpus Parse to find
subjects, objects, heads, modifiers etc
One list for each grammatical relation
Statistics to sort each list, as before
Adam Kilgarriff 24 Geneva, April 2010
Macmillan English DictionaryFor Advanced Learners
Ed: Rundell, 2002
Adam Kilgarriff 25 Geneva, April 2010
Euralex 2002
Adam Kilgarriff 26 Geneva, April 2010
Euralex 2002
Can I have them for my language please
Adam Kilgarriff 27 Geneva, April 2010
The Sketch Engine
Input: any corpus, any language
• Lemmatised, part-of-speech tagged specification of grammatical
relations Word sketches integrated with
corpus query system
Developer: Pavel Rychly, Brno
Adam Kilgarriff 28 Geneva, April 2010
Users: Dictionary publishers
• Oxford UP, Collins, Chambers, Macmillan, Cambridge UP
Universities • Teaching, research• Framenet
Language teaching http://www.sketchengine.co.uk/
• Self-registration for free trial account
Adam Kilgarriff 29 Geneva, April 2010
Lexical Computing Ltd
Since 2003Directors
• Adam Kilgarriff (UK), Pavel Rychly (Cz), • Diana McCarthy (UK, since Oct 2009)
Main activities• Sketch engine service• Corpus development
Research-led
Adam Kilgarriff 30 Geneva, April 2010
(demo)
Adam Kilgarriff 31 Geneva, April 2010
Evaluating word sketches
10 years 1999-2009
Feedback Good but anecdotal
Formal evaluation
Adam Kilgarriff 32 Geneva, April 2010
Goal
Collocations dictionary Model: Oxford Collocations Dictionary Publication-quality
Ask a lexicographer For 42 headwords
• For 20 best collocates per headwords “should we include this collocation in a
published dictionary?”
Adam Kilgarriff 33 Geneva, April 2010
Sample of headwords Nouns verbs adjectives, random High (Top 3000) N space solution opinion mass corporation leader V serve incorporate mix desire Adj high detailed open academic Mid (3000- 9999) N cattle repayment fundraising elder biologist sanitation V grieve classify ascertain implant Adj adjacent eldest prolific ill Low (10,000- 30,000) N predicament adulterer bake bombshell candy shellfish V slap outgrow plow traipse Adj neoclassical votive adulterous expandable
Adam Kilgarriff 34 Geneva, April 2010
Precision and recall We test precision Recall is harder
How do we find all the collocations that the system should have found?
Current work• 200 collocates per headword
• Selected from
• All the corpora we have
• Various parameter settings
• Plus just-in-time evaluation for 'new' collocates
Adam Kilgarriff 35 Geneva, April 2010
Four languages, three families
Dutch ANW, 102m-word lexicographic corpus
English UKWaC, 1.5b web corpus
Japanese JpWaC, 400m web corpus
Slovene FidaPlus, 620m lexicographic corpus
Adam Kilgarriff 36 Geneva, April 2010
User evaluation
Evaluate whole system Will it help with my task
• Eg preparing a collocations dictionary
Contrast: developer evaluation Can I make the system better?
• Evaluate each module separately
• Current work
Adam Kilgarriff 37 Geneva, April 2010
Components
Corpus NLP tools
Segmenter, lemmatiser, POS-tagger
Sketch grammar Statistics
Adam Kilgarriff 38 Geneva, April 2010
Practicalities
Interface Good, Good-but
• Merge to good Maybe, Maybe-specialised, Bad
• Merge to bad
For each language Two/three linguists/lexicographers If they disagree
• Don't use for computing performance
Adam Kilgarriff 39 Geneva, April 2010
Results
Dutch 66% English 71% Japanese 87% Slovene 71%
Adam Kilgarriff 40 Geneva, April 2010
Corpus evaluation Collocation-finding
Typical corpus task Recall Hold all else constant
Statistic, NLP tools, grammarBest results: best corpus
• (for collocation-finding)
Pomikalek: de-duplication
Adam Kilgarriff 41 Geneva, April 2010
Other topics Dante
a new lexical database for English
Corpus building (mostly from the web) Instant corpora with WebBootCaT Bigger and better (English)
• BiWeC and New Model Corpus
Corpus Factory (many languages) Corpus comparison, similarity, evaluation Statistics: collocations, keyword lists
Word frequency lists Word senses and lexicography
SADD: semi-automatic dictionary drafting
Adam Kilgarriff 42 Geneva, April 2010
Thank you
http://www.sketchengine.co.uk
Adam Kilgarriff 43 Geneva, April 2010
Words and word senses automatic thesauruses
words
Adam Kilgarriff 44 Geneva, April 2010
Words and word senses automatic thesauruses
words manual thesauruses
simple hierarchy is appealinghomonyms
Adam Kilgarriff 45 Geneva, April 2010
Words and word senses automatic thesauruses
words manual thesauruses
simple hierarchy is appealinghomonyms“aha! objects must be word senses”
Adam Kilgarriff 46 Geneva, April 2010
Problems
Theoretical Practical
Adam Kilgarriff 47 Geneva, April 2010
Theoretical
Adam Kilgarriff 48 Geneva, April 2010
Adam Kilgarriff 49 Geneva, April 2010
Adam Kilgarriff 50 Geneva, April 2010
Wittgenstein
Don’t ask for the meaning, ask for the use
Adam Kilgarriff 51 Geneva, April 2010
Practical
Adam Kilgarriff 52 Geneva, April 2010
Problems Practical
a thesaurus is a toolif the tool organises words senses
you must do WSD before you can use it
WSD: state of the art, optimal conditions: 80%
Adam Kilgarriff 53 Geneva, April 2010
Problems
“To use this tool, first replace one fifth of your input with junk”
Adam Kilgarriff 54 Geneva, April 2010
Avoid word senses
Adam Kilgarriff 55 Geneva, April 2010
Avoid word senses This word has three meanings/senses
Adam Kilgarriff 56 Geneva, April 2010
Avoid word senses This word has three meanings/senses This word has three kinds of use
well foundedempiricalwe can build on it