Evaluating word sketches and corpora

Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds and Sussex

Adam Kilgarriff, IWSG-1, 2010

Word sketches

Over 10 years
• Since 1999

Feedback
• Good but anecdotal

Formal evaluation

Goal

Collocations dictionary
• Model: Oxford Collocations Dictionary
• Publication-quality

Ask a lexicographer
• For 42 headwords
• For the 20 best collocates per headword: “Should we include this collocation in a published dictionary?”

Sample of headwords

Nouns, verbs, adjectives; random

High (top 3000)
• N: space, solution, opinion, mass, corporation, leader
• V: serve, incorporate, mix, desire
• Adj: high, detailed, open, academic

Mid (3000-9999)
• N: cattle, repayment, fundraising, elder, biologist, sanitation
• V: grieve, classify, ascertain, implant
• Adj: adjacent, eldest, prolific, ill

Low (10,000-30,000)
• N: predicament, adulterer, bake, bombshell, candy, shellfish
• V: slap, outgrow, plow, traipse
• Adj: neoclassical, votive, adulterous, expandable

Precision and recall

We test precision

Recall is harder
• How do we find all the collocations that the system should have found?

Current work
• 200 collocates per headword
• Selected from
  • All the corpora we have
  • Various parameter settings
• Plus just-in-time evaluation for 'new' collocates
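
The pooling idea in the current work can be sketched in Python. This is only an illustration under stated assumptions: the slides do not say how the 200 collocates are selected, so the round-robin interleaving below is an assumed strategy, and `build_pool` and `needs_judgement` are hypothetical names.

```python
from itertools import zip_longest


def build_pool(ranked_runs, limit=200):
    """Interleave ranked collocate lists from several runs, dropping repeats.

    ranked_runs: one best-first list of collocates per run, where a run is
    one corpus combined with one parameter setting.
    The round-robin order is an assumption, not the method from the talk.
    """
    pool = []
    seen = set()
    for tier in zip_longest(*ranked_runs):
        for collocate in tier:
            if collocate is not None and collocate not in seen:
                seen.add(collocate)
                pool.append(collocate)
                if len(pool) == limit:
                    return pool
    return pool


def needs_judgement(candidates, judged):
    """Just-in-time step: collocates from a new run that nobody has judged yet."""
    return [c for c in candidates if c not in judged]
```

Only the collocates returned by `needs_judgement` would have to go back to the lexicographers when a new corpus or setting is tried.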

Four languages, three families

Dutch: ANW, 102m-word lexicographic corpus

English: UKWaC, 1.5b-word web corpus

Japanese: JpWaC, 400m-word web corpus

Slovene: FidaPlus, 620m-word lexicographic corpus

User evaluation

Evaluate whole system
• Will it help with my task?
• E.g. preparing a collocations dictionary

Contrast: developer evaluation
• Can I make the system better?
• Evaluate each module separately
• Current work

Components

Grammar

NLP tools
• Segmenter, lemmatiser, POS-tagger

Sketch grammar

Statistics

Practicalities

Interface
• Good, Good-but
  • Merge to good
• Maybe, Maybe-specialised, Bad
  • Merge to bad

For each language
• Two/three linguists/lexicographers
• If they disagree
  • Don't use for computing performance
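
The scoring procedure above can be sketched as follows. This is a minimal illustration, not the project's actual tooling: the label names come from the slide, while `merge_label`, `precision` and the input layout (one list of raw labels per collocation, one label per judge) are assumptions.

```python
GOOD_LABELS = {"Good", "Good-but"}
BAD_LABELS = {"Maybe", "Maybe-specialised", "Bad"}


def merge_label(label):
    """Collapse the five interface labels into 'good' or 'bad'."""
    if label in GOOD_LABELS:
        return "good"
    if label in BAD_LABELS:
        return "bad"
    raise ValueError(f"unknown label: {label!r}")


def precision(judgements):
    """Share of 'good' items among the items on which all judges agree.

    judgements: dict mapping a collocation to the list of raw labels
    given by the judges (hypothetical layout).
    """
    good = 0
    agreed = 0
    for labels in judgements.values():
        merged = {merge_label(label) for label in labels}
        if len(merged) != 1:
            continue  # judges disagree after merging: item is not scored
        agreed += 1
        if merged == {"good"}:
            good += 1
    return good / agreed if agreed else float("nan")
```

Note that agreement is checked after merging, so a "Good" from one judge and a "Good-but" from another still count as agreement.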

Results

Dutch: 66%
English: 71%
Japanese: 87%
Slovene: 71%

Corpus evaluation

Collocation-finding
• Typical corpus task

Recall

Hold all else constant
• Statistic, NLP tools, grammar

Best results: best corpus
• (for collocation-finding)

Pomikalek: de-duplication
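
Comparing corpora by recall, with everything else held constant, can be sketched as below. The sketch assumes a gold set of collocations already judged good and one list of system output per corpus; `recall` and `rank_corpora` are hypothetical names, and nothing here reflects the de-duplication work itself.

```python
def recall(found, gold):
    """Fraction of the gold collocations that appear in the system output."""
    gold = set(gold)
    if not gold:
        return float("nan")
    return len(gold & set(found)) / len(gold)


def rank_corpora(found_by_corpus, gold):
    """Order corpora from highest to lowest recall against the same gold set.

    found_by_corpus: dict mapping a corpus name to the collocations found
    in it with the same statistic, NLP tools and sketch grammar.
    """
    scores = {name: recall(found, gold) for name, found in found_by_corpus.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because the statistic, tools and grammar are the same for every entry in `found_by_corpus`, any difference in the ranking can be attributed to the corpus.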