Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd.

Page 1: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd.

Using Corpora and how to build them

Adam Kilgarriff

Lexical Computing Ltd

Page 2:

Madrid 2010 Kilgarriff: Web Corpora

• Corpora show us the facts of the language

Page 3:

What is a corpus?

• A corpus is a collection of texts
  – when viewed as an object of language research

Page 4:

Which texts?

• Written

• Spoken

Page 5:

Written

• Books
  – Fiction
  – Non-fiction
  – Textbooks
• Newspapers
• Letters, unpublished
• Web pages
• Academic journals
• Student essays
• …

Page 6:

Spoken

Must be transcribed, for text corpora

• Conversation
  – Who? Region, class, age-group, situation…

• Lectures

• TV and Radio

• Film transcripts

• Meetings, seminars

• …

Page 7:

Which texts?

• Different purposes, different text types

• Making dictionaries:
  – Cover the whole language
  – Some of everything

Page 8:

How much?

• Most words are rare

• Zipf’s Law

• To get enough data for most words, we need very big corpora

Page 9:

Zipf’s Law

Word (pos)            r         f     r x f
the (det)             1   6187267   6187267
to (prep)            10    917579   9175790
as (adv)            100     91583   9158300
playing (vb)       1000      9738   9738000
paint (vb)         2000      4539   9078000
amateur (adj)    10,000       741   7410000
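The rank x frequency products in the table can be checked with a few lines of Python. This is only an illustrative sketch using the figures from the slide; the variable and function names are invented here. Zipf's Law predicts that r x f stays roughly constant, and the products do stay within the same order of magnitude across four orders of magnitude of rank.

```python
# Rank (r) and frequency (f) figures copied from the slide.
zipf_rows = [
    ("the", 1, 6187267),
    ("to", 10, 917579),
    ("as", 100, 91583),
    ("playing", 1000, 9738),
    ("paint", 2000, 4539),
    ("amateur", 10000, 741),
]

def rank_times_freq(rows):
    """Return (word, r * f) pairs for each (word, r, f) row."""
    return [(word, r * f) for word, r, f in rows]

products = rank_times_freq(zipf_rows)
for word, product in products:
    print(f"{word:<8} r*f = {product:>10,}")
```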

Page 10:

Zipf’s Law

• the: 6%

• 100 most frequent: 45%

• 7500 most frequent: 90%

• all others: rare
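Coverage figures like these come from a cumulative count over a frequency-sorted word list. A minimal sketch, assuming a plain dict of word counts; the toy counts are invented for illustration and are not corpus data.

```python
def coverage(freqs, n):
    """Share of all tokens accounted for by the n most frequent words."""
    total = sum(freqs.values())
    top = sorted(freqs.values(), reverse=True)[:n]
    return sum(top) / total

# Toy counts, invented for illustration only.
toy_freqs = {"the": 60, "of": 30, "corpus": 6, "zipf": 3, "amateur": 1}

print(coverage(toy_freqs, 1))  # share covered by the single most frequent word
print(coverage(toy_freqs, 2))  # share covered by the two most frequent words
```

Run over a real frequency list, the same function would give the share of text covered by the top 100 or top 7500 words.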

Page 11:

Zipf’s Law

[Bar chart: % of all texts (axis 0 to 100) covered by 'the', the 100 most frequent, the 3500 most frequent, and the 7500 most frequent words]

Page 12:

Leading English Corpora: Size

[Chart: size of corpora (in words), axis from 10^6 to 10^9, by decade from the 1960s to the 2000s; corpora shown: Brown/LOB, COBUILD, BNC, OEC]

Page 13:

Good news

• The web

Page 14:

You can’t help noticing

• Replaceable or replacable?
  – http://googlefight.com

Page 15:

• Very very large
  – 2006 estimates for duplicate-free, linguistic, Google-indexed web
    • German: 44 billion words
    • Italian: 25 billion words
    • English: 1,000 billion to 10,000 billion words
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access

Page 16:

What is a corpus?

• A corpus is a collection of texts
  – when viewed as an object of language research

Page 17:

Is the web a corpus?

Yes

Page 18:

but it’s not representative

Page 19:

Sublanguage

• Language = core + sublanguages
• Options for corpus construction
  – none
  – some
  – all
• None
  – impoverished view of language
• Some: BNC
  – cake recipes and gastro-uterine disease
  – not car repair manuals or astronomy or …
• All: until recently, not viable

Page 20:

Representativeness

• The web is not representative
• but nor is anything else
• Text type variation
  – under-researched, lacking in theory
  – Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988, Kilgarriff 2001
• Text type is an issue across NLP
  – Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there

Page 21:

What is out there?

• What text types are there on the web?
  – some are new: chatroom
  – proportions
    • is it overwhelmed by porn? How much?

• Hard question

Page 22:

Comparing frequency lists

• Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    • that’s 1,000,000,000,000
• Compare with BNC
  – 100 words with highest Web1T:BNC ratio
  – 100 words with lowest ratio
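One way to carry out such a comparison is to normalise each count to a per-million rate and rank words by the ratio of the two rates. The sketch below assumes that approach; the smoothing constant, the function name and the toy counts are assumptions made for illustration, not details from the talk.

```python
def ratio_ranking(counts_a, size_a, counts_b, size_b, smooth=1.0):
    """Rank words by (per-million rate in A + smooth) / (per-million rate in B + smooth)."""
    words = set(counts_a) | set(counts_b)
    ratios = {}
    for w in words:
        rate_a = counts_a.get(w, 0) * 1_000_000 / size_a
        rate_b = counts_b.get(w, 0) * 1_000_000 / size_b
        # Smoothing avoids division by zero for words missing from corpus B.
        ratios[w] = (rate_a + smooth) / (rate_b + smooth)
    # Highest ratio first: words typical of corpus A come at the top.
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Toy counts and corpus sizes, invented for illustration only.
web = {"browser": 900, "she": 100, "the": 60000}
bnc = {"browser": 2, "she": 400, "the": 6000}
ranked = ratio_ranking(web, 1_000_000, bnc, 100_000)
print(ranked[0][0], ranked[-1][0])
```

The top of the ranking plays the role of the "web-high" list and the bottom that of the "web-low" list.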

Page 23:

Web-high (155 terms)

• 61 web and computing
  – config browser spyware url www forum
• 38 porn
• 22 US English (incl Spanish influence – los)
• 18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
• 4 legal
  – trademarks pursuant accordance herein

Page 24:

Web-low

• Exclude British English, transcription/tokenisation anomalies

– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Page 25:

Observations

• Pronouns and past tense verbs
  – Fiction
• Masc vs fem
• Yesterday
  – Probably daily newspapers
• Constancy of ratios:
  – He/him/himself
  – She/her/herself

Page 26:

• The web
  – a social, cultural, political phenomenon
  – new, little understood
  – a legitimate object of science
  – mostly language
• We are well placed
  – a lot of people will be interested
• Let’s
  – study the web
  – source of language data
  – apply our tools for web use (dictionaries, MT)
  – use the web as infrastructure

Page 27:

Using Search Engines

No setup costs
Start querying today

Methods
• Hit counts
• ‘Snippets’
  – Metasearch engines, WebCorp
• Find pages and download

Page 28:

Googleology

• Google hit counts for language modelling

  – Example: Keller & Lapata (2003)
  – 36 queries to estimate freq(fulfil, obligation), to each of Google and Altavista

• Very interesting work

• Great interest in query syntax
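To see why a single verb-object pair needs so many queries, note that a search engine matches exact strings, so every inflected form and every intervening determiner needs its own query. The sketch below enumerates such variants. The particular word forms and determiners are assumptions chosen for illustration; they do not reproduce Keller and Lapata's actual query set, and no search engine is called.

```python
from itertools import product

def phrase_queries(verb_forms, determiners, noun_forms):
    """Build one quoted exact-phrase query per verb/determiner/noun combination."""
    queries = []
    for verb, det, noun in product(verb_forms, determiners, noun_forms):
        # An empty determiner means the noun follows the verb directly.
        words = [verb, det, noun] if det else [verb, noun]
        queries.append('"' + " ".join(words) + '"')
    return queries

# Illustrative forms only (assumed, not taken from the paper).
verbs = ["fulfil", "fulfils", "fulfilled", "fulfilling"]
dets = ["", "an", "the"]
nouns = ["obligation", "obligations"]

queries = phrase_queries(verbs, dets, nouns)
print(len(queries))
```

Summing the hit counts returned for each query would then give one frequency estimate for the pair.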

Page 29:

The Trouble with Google

• not enough instances
  – max 1000
• not enough queries
  – max 1000 per day with API
• not enough context
  – 10-word snippet around search term
• sort order
  – search term in titles and headings
• untrustworthy hit counts
• limited search options
• linguistically dumb, eg not lemmatised
  – aime/aimer/aimes/aimons/aimez/aiment …

Page 30:

• Appeal
  – Zero-cost entry, just start googling
• Reality
  – High-quality work: high-cost methodology

Page 32:

Also:

• No replicability

• Methods, stats not published

• At mercy of commercial corporation

• Googleology is bad science

• So…

Page 33:

Basic steps

• Gather pages
  – Google hits
  – Select and gather whole sites
  – General crawl

• Filter

• De-duplicate

• Linguistic processing

• Load into corpus tool
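The de-duplicate step can be sketched for the easiest case, exact duplicates after case and whitespace normalisation. This is an assumed minimal approach, not the method used for any corpus named in the talk; near-duplicate detection needs more than this.

```python
import hashlib

def normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())

def deduplicate(pages):
    """Keep the first copy of each page whose normalised text has not been seen."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

# Toy pages, invented for illustration only.
pages = [
    "Corpora show  us the facts.",
    "corpora show us the facts.",
    "Most words are rare.",
]
print(deduplicate(pages))
```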

Page 34:

Oxford English Corpus

• Whole domains chosen and harvested
  – control over text type

• 2 billion words (Mar 08)

Page 35:

Oxford English Corpus

Page 36:

DeWaC, ItWaC, UKWaC

• 1.5 B words each

• Marco Baroni, Adriano Ferraresi

• Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora

• Google on seed words, then crawl

Page 37:

Filtering

• Non-text (sound, image etc) files
• Boilerplate (within file)
  – Copyright notices, navigation bars
  – “high markup” heuristic
• Not “text in sentences”
  – Look for function words
  – Lists?? Sports results?? Crossword puzzles??
• Spam, pornography
  – Tough
• De-duplication (also tough)
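The "look for function words" test can be sketched as a ratio check: running prose contains a high share of function words, while lists and tables of results do not. The word list and the 0.25 threshold below are assumptions for illustration, not values from the talk.

```python
import re

# Small illustrative set of English function words (assumed, not exhaustive).
FUNCTION_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that", "it", "for",
    "was", "on", "with", "as", "by", "at", "be", "this", "are", "or",
}

def function_word_ratio(text):
    """Fraction of alphabetic tokens that are function words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)

def looks_like_prose(text, threshold=0.25):
    """Heuristic: treat text as running prose if enough tokens are function words."""
    return function_word_ratio(text) >= threshold

# Toy inputs, invented for illustration only.
prose = "The web is a source of language data that is easy to use."
results = "Arsenal 2 Chelsea 1 Liverpool 0 Everton 3 Leeds 1 Fulham 1"
print(looks_like_prose(prose), looks_like_prose(results))
```

A real filter would need a function-word list per language and a threshold tuned on sample pages.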

Page 38:

Small, specialised corpora

• Terminologists

• Translators needing target-language domain-specific vocab

• Specialist dictionaries
  – Don’t exist
  – Expensive/inaccessible
  – Out of date

Page 39:

BootCaT (Bootstrapping Corpora and Terms)

• Put in seed terms
• Google/Yahoo search
• Retrieve Google/Yahoo hits
  – Remove duplicates, boilerplate
• Small instant corpora
• Baroni and Bernardini, LREC 2004
• Web version
  – WebBootCaT
  – At Sketch Engine site
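BootCaT, as described by Baroni and Bernardini, turns seed terms into search queries by combining them into small tuples. The sketch below shows only that query-building step. The tuple size, the number of queries, the seed terms and the function name are assumptions for illustration, and no search engine is actually called.

```python
import random
from itertools import combinations

def seed_queries(seeds, tuple_size=3, max_queries=10, rng_seed=0):
    """Pick random combinations of seed terms, each joined into one query string."""
    all_tuples = list(combinations(sorted(set(seeds)), tuple_size))
    rng = random.Random(rng_seed)  # fixed seed so the sketch is repeatable
    chosen = rng.sample(all_tuples, min(max_queries, len(all_tuples)))
    return [" ".join(terms) for terms in chosen]

# Hypothetical seed terms for a specialist domain.
seeds = ["corpus", "lemma", "collocation", "concordance", "tokenisation"]
for query in seed_queries(seeds):
    print(query)
```

Each query would then be sent to the search engine, and the pages behind the hits downloaded and cleaned.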

Page 40:

Task

• Choose area of specialist interest
  – English or Spanish
• Select at least 5 seed terms
  – Specialist: good
• Build corpus
  – At least 100,000 words
  – Iterate if necessary

• Find at least six words/phrases/meanings you did not know before

• Write up