Evaluating word sketches and corpora

Transcript of Evaluating word sketches and corpora

Page 1: Evaluating word sketches and corpora

Evaluating word sketches and corpora

Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds and Sussex

Page 2: Evaluating word sketches and corpora

Adam Kilgarriff2 IWSG-1, 2010

Word sketches

Over 10 years
• Since 1999

Feedback
• Good but anecdotal

Formal evaluation

Page 3: Evaluating word sketches and corpora

Goal

Collocations dictionary
• Model: Oxford Collocations Dictionary
• Publication-quality

Ask a lexicographer
• For 42 headwords
• For the 20 best collocates per headword: "Should we include this collocation in a published dictionary?"

Page 4: Evaluating word sketches and corpora

Sample of headwords

Nouns, verbs, adjectives; random

High (top 3000)
• N: space, solution, opinion, mass, corporation, leader
• V: serve, incorporate, mix, desire
• Adj: high, detailed, open, academic

Mid (3000-9999)
• N: cattle, repayment, fundraising, elder, biologist, sanitation
• V: grieve, classify, ascertain, implant
• Adj: adjacent, eldest, prolific, ill

Low (10,000-30,000)
• N: predicament, adulterer, bake, bombshell, candy, shellfish
• V: slap, outgrow, plow, traipse
• Adj: neoclassical, votive, adulterous, expandable

Page 5: Evaluating word sketches and corpora

Precision and recall
• We test precision
• Recall is harder

How do we find all the collocations that the system should have found?

Current work
• 200 collocates per headword
• Selected from
  • All the corpora we have
  • Various parameter settings
• Plus just-in-time evaluation for 'new' collocates
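
The pooling idea on this slide can be sketched roughly as below. This is only an illustration under stated assumptions, not the procedure actually used: the slide does not say how the 200 candidates are selected, so the round-robin merge by rank, the function names (`build_pool`, `recall`, `unjudged`), and the tiny sample data are all hypothetical.

```python
from itertools import zip_longest


def build_pool(runs, headword, size=200):
    """Merge ranked collocate lists from several runs into one candidate pool.

    runs: list of dicts, one per corpus / parameter setting, each mapping
    headword -> ranked list of collocates. Candidates are taken round-robin
    by rank (an assumption), skipping duplicates, until `size` is reached.
    """
    ranked_lists = [run.get(headword, []) for run in runs]
    pool, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for collocate in tier:
            if collocate is not None and collocate not in seen:
                seen.add(collocate)
                pool.append(collocate)
                if len(pool) == size:
                    return pool
    return pool


def recall(found, gold):
    """Share of the gold-standard collocates that a run actually found."""
    gold = set(gold)
    if not gold:
        raise ValueError("empty gold set")
    return len(gold & set(found)) / len(gold)


def unjudged(found, judged):
    """Collocates a new run proposes that no one has judged yet.

    These are the 'new' collocates that need just-in-time evaluation.
    """
    return [c for c in found if c not in judged]


# Invented sample data, for illustration only.
runs = [
    {"opinion": ["public", "express", "the"]},
    {"opinion": ["express", "dissenting", "public"]},
]
pool = build_pool(runs, "opinion", size=4)

# Human judgements on the pooled candidates (True = good collocation).
judged = {"public": True, "express": True, "dissenting": True, "the": False}
gold = {c for c, ok in judged.items() if ok}

new_run = ["public", "strong", "express"]
print(pool)
print(recall(new_run, gold))
print(unjudged(new_run, judged))
```

Recall measured this way is relative to the pool, so it can only be as complete as the set of corpora and settings that fed the pool.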

Page 6: Evaluating word sketches and corpora

Four languages, three families

• Dutch: ANW, 102m-word lexicographic corpus
• English: UKWaC, 1.5b-word web corpus
• Japanese: JpWaC, 400m-word web corpus
• Slovene: FidaPlus, 620m-word lexicographic corpus

Page 7: Evaluating word sketches and corpora

User evaluation

Evaluate whole system
• Will it help with my task?
  • E.g. preparing a collocations dictionary

Contrast: developer evaluation
• Can I make the system better?
  • Evaluate each module separately
  • Current work

Page 8: Evaluating word sketches and corpora

Components

Grammar

NLP tools
• Segmenter, lemmatiser, POS-tagger

Sketch grammar

Statistics

Page 9: Evaluating word sketches and corpora

Practicalities

Interface
• Good, Good-but
  • Merge to good
• Maybe, Maybe-specialised, Bad
  • Merge to bad

For each language
• Two/three linguists/lexicographers
• If they disagree
  • Don't use for computing performance
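
The judging scheme on this slide could be scored along the following lines. This is a minimal sketch, not the evaluation script behind the talk: the label spellings, the layout of the `judgements` mapping, the function names, and the sample data are assumptions made for illustration.

```python
GOOD_LABELS = {"good", "good-but"}
BAD_LABELS = {"maybe", "maybe-specialised", "bad"}


def merge_label(label):
    """Collapse the five interface labels into a binary judgement.

    Returns True for good, False for bad.
    """
    key = label.strip().lower()
    if key in GOOD_LABELS:
        return True
    if key in BAD_LABELS:
        return False
    raise ValueError(f"unknown label: {label!r}")


def precision(judgements):
    """Precision over the items on which all judges agree after merging.

    judgements maps (headword, collocate) -> list of labels, one per judge.
    Returns (precision, number of items used).
    """
    good = used = 0
    for labels in judgements.values():
        merged = {merge_label(label) for label in labels}
        if len(merged) != 1:
            continue  # judges disagree (or no labels): leave the item out
        used += 1
        good += merged.pop()  # True counts as 1, False as 0
    if used == 0:
        raise ValueError("no items with agreement")
    return good / used, used


# Invented sample data, for illustration only.
sample = {
    ("solution", "find"): ["Good", "Good-but"],              # agree: good
    ("solution", "the"): ["Bad", "Maybe"],                   # agree: bad
    ("solution", "aqueous"): ["Good", "Maybe-specialised"],  # disagree: skipped
    ("opinion", "public"): ["Good", "Good"],                 # agree: good
}
score, n = precision(sample)
print(f"precision = {score:.2f} over {n} agreed items")
```

Dropping disagreements keeps the score from depending on how ties are broken, at the cost of evaluating on fewer items.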

Page 10: Evaluating word sketches and corpora

Results

• Dutch: 66%
• English: 71%
• Japanese: 87%
• Slovene: 71%

Page 11: Evaluating word sketches and corpora

Corpus evaluation

Collocation-finding
• Typical corpus task

Recall

Hold all else constant
• Statistic, NLP tools, grammar

Best results: best corpus
• (for collocation-finding)

Pomikalek: de-duplication