1 Googleology is bad science Adam Kilgarriff Lexical Computing Ltd Universities of Sussex, Leeds.
1
Googleology is bad science
Adam Kilgarriff
Lexical Computing Ltd
Universities of Sussex, Leeds
2
Web as language resource
Replaceable or replacable? check
3
Very very large Most languages Most language types Up-to-date Free Instant access
4
How to use the web?
Google or other commercial search engines (CSEs)
not
5
Using CSEs
No setup costs
Start querying today
Methods Hit counts ‘snippets’
Metasearch engines, WebCorp Find pages and download
6
Googleology
CSE hit counts for language modelling: 36 queries to each of Google and Altavista to estimate freq(fulfil, obligation) (Keller & Lapata 2003)
Finding noun-noun relations: “we issue exact phrase Google queries of type noun2 THAT * noun1” (Nakov and Hearst 2006)
Small community of researchers Corpora mailing list
Very interesting work Intense interest in query syntax
Creativity and person-years
7
The Trouble with Google
not enough instances: max 1000
not enough queries: max 1000 per day with API
not enough context: 10-word snippet around search term
ridiculous sort order: search term in titles and headings
untrustworthy hit counts
limited search syntax: no regular expressions
linguistically dumb: not lemmatised (aime/aimer/aimes/aimons/aimez/aiment …), not POS-tagged, not parsed
8
Appeal Zero-cost entry, just start googling
Reality High-quality work: high-cost methodology
9
Also:
No replicability Methods, stats not published At mercy of commercial corporation
10
Also:
No replicability Methods, stats not published At mercy of commercial corporation Bad science
11
The 5-grams
A present from Google
All 1-, 2-, 3-, 4-, 5-grams with fr>=40 in a terabyte of English
A large dataset
12
Prognosis
Next 3 years Exciting new ideas Dazzlingly clever uses Drives progress in NLP
13
Prognosis
Next 3 years Exciting new ideas Dazzlingly clever uses
After 5+ years A chain round our necks
Cf Penn Treebank (others? Brickbats?)
Resource-led vs. ideas-led research
14
How to use the web?
Google or other commercial search engines (CSEs)
not
15
Language and the web
Web is mostly linguistic Text on web << whole web (in GB)
Not many TB of text Special hardware not needed
We are the experts
16
Community-building ACL SIGWAC WAC Kool Ynitiative (WaCKY)
Mailing list Open source
WAC workshops WAC1, Birmingham 2005 WAC2, Trento (EACL), April 2006 WAC3, Louvain, Sept 15-16 2007
17
Proof of concept: DeWaC, ItWaC
1.5 B words each, German and Italian Marco Baroni, Bologna (+ AK)
18
What is out there?
What text types? some are new: chatroom proportions
is it overwhelmed by porn? How much? Hard question
19
What is out there The web
a social, cultural, political phenomenon new, little understood a legitimate object of science mostly language
we are well placed a lot of people will be interested
Let’s study the web source of language data apply our tools for web use (dictionaries, MT) use the web as infrastructure
20
How to do it: Components
1. Web crawler
2. Filters and classifiers (including de-duplication)
3. Linguistic processing: lemmatise, POS-tag, parse
4. Database: indexing, user interface
21
1. Crawling
How big is your hard disk? When will your sysadmin ban you?
DeWaC/ItWaC Open source crawler: heritrix
22
1.1 Seeding the crawl
Mid-frequency words
Spread of text types: formal and informal, not just newspaper
DeWaC: words from newspaper corpus; words from list with “kitchen” vocab
Use Google to get seeds for crawls
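The seeding recipe can be sketched roughly as follows. This is a minimal illustration, assuming a word-frequency dictionary is already available. The function names, the rank band and the two-word query size are hypothetical choices, not details from the talk. The resulting query strings would be sent to a search engine and the returned URLs used as crawl seeds.

```python
import random

def mid_frequency_words(freqs, lo_rank=1000, hi_rank=6000):
    """Return words whose frequency rank lies in [lo_rank, hi_rank).

    The rank band is an assumed, tunable definition of "mid-frequency".
    """
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    return ranked[lo_rank:hi_rank]

def seed_queries(words, n_queries=10, words_per_query=2, seed=0):
    """Build random multi-word queries to submit to a search engine."""
    rng = random.Random(seed)  # fixed seed so the query list is repeatable
    return [" ".join(rng.sample(words, words_per_query))
            for _ in range(n_queries)]

# Toy frequency list with hypothetical counts.
toy_freqs = {"der": 900, "und": 800, "haus": 120, "küche": 90,
             "löffel": 60, "garten": 55, "xylophon": 2}
mid = mid_frequency_words(toy_freqs, lo_rank=2, hi_rank=6)
queries = seed_queries(mid, n_queries=3)
```

Drawing the words from two different lists (newspaper and “kitchen” vocabulary) would simply mean calling `seed_queries` once per list.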
23
2. Filtering
non ‘running-text’ stripping Function word filtering Porn filtering De-duplication
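The talk does not say how de-duplication is done. One common approach, shown here purely as an assumed stand-in, compares documents by the overlap of their word n-gram “shingles” (Jaccard similarity) and keeps a document only if it is not too similar to one already kept. The shingle size and threshold are hypothetical.

```python
def shingles(text, n=5):
    """Set of word n-grams; short texts yield a single shingle."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(docs, n=5, threshold=0.8):
    """Keep each document unless it is a near-duplicate of a kept one.

    Pairwise comparison is quadratic, so this only suits small samples.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = ["a b c d e f g",
        "a b c d e f g",
        "completely different text about something else entirely"]
unique = deduplicate(docs)
```

At web scale the pairwise loop would need replacing with a hashing scheme, which is outside this sketch.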
24
2.1 Filtering: Sentences
What is the text that we want? Lists? Links? Catalogues? …
For linguistics, NLP: text in sentences
Use function words
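A function-word filter might look something like the sketch below: running text contains a high proportion of function words, while menus, link lists and catalogues contain few. The word list, the threshold and the function names are illustrative assumptions, not the filter actually used for DeWaC or ItWaC.

```python
# Tiny English function-word list, for illustration only.
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "is",
                  "that", "it", "for"}

def function_word_ratio(text):
    """Proportion of whitespace-separated tokens that are function words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(t.strip(".,;:!?\"'()") in FUNCTION_WORDS for t in tokens)
    return hits / len(tokens)

def looks_like_running_text(text, min_ratio=0.25):
    """min_ratio is an assumed threshold that would need tuning per language."""
    return function_word_ratio(text) >= min_ratio

sentence = "The cat is in the garden and it is sleeping."
menu = "Home Products Contact Login Cart"
```

Here `looks_like_running_text(sentence)` is true and `looks_like_running_text(menu)` is false.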
25
2.2 Filtering: CLEANEVAL “Text cleaning”
Lots to be done, not glamorous Many kinds of dirt needing many kinds of filter
Open competition/shared task: who can produce the cleanest text?
Input: arbitrary web pages
“Gold standard”: paragraph-marked plain text, prepared by people
Workshop Sept 2007. Do join us! http://cleaneval.sigwac.org.uk
26
3. Linguistic processing
Lemmatise, POS-tag, parse
Find leading NLP group for each language
Be nice to them
Use their tools
27
4. Database, interface
Solved problem (at least for 1.5 BW) Sketch Engine
28
“Despite all the disadvantages, it’s still so much bigger”
29
How much bigger?
Method Sample words
30 words: mid-to-high frequency, not common words in other major languages, min 5 chars
Compare freqs, Google vs ItWaC/DeWaC
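The sampling criteria could be applied along these lines. This is a sketch under assumptions: the rank band standing in for “mid-to-high frequency”, the toy vocabulary and the function name are all hypothetical.

```python
def pick_sample_words(freqs, other_lang_vocab, n=30, min_chars=5,
                      lo_rank=200, hi_rank=5000):
    """Pick up to n words from an assumed frequency-rank band.

    Words shorter than min_chars, or present in the vocabulary of
    other major languages, are skipped.
    """
    ranked = sorted(freqs, key=freqs.get, reverse=True)[lo_rank:hi_rank]
    candidates = [w for w in ranked
                  if len(w) >= min_chars and w not in other_lang_vocab]
    return candidates[:n]

# Toy data with hypothetical counts.
toy_freqs = {"che": 1000, "ufficio": 300, "giornata": 250,
             "sera": 150, "radio": 140}
other_languages = {"radio"}
sample = pick_sample_words(toy_freqs, other_languages,
                           n=2, lo_rank=0, hi_rank=5)
```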
30
Google results (Italian)
Arbitrariness
Repeat identical searches: 9/30: > 10% difference; 6/30: > 100% difference
API: typically 1/18th of ‘manual’ figure
Language filter
mista, bomba, clima: mostly non-Italian pages
use MAX and MIN of 6 language-filtered results
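A check of this kind can be sketched as follows, assuming the hit counts from repeated identical queries have already been collected. The counts below are made up, and “difference” is taken here to mean (max - min) / min, which is an assumption about how the slide's percentages were computed.

```python
def relative_spread(counts):
    """(max - min) / min over repeated hit counts for one query."""
    lo, hi = min(counts), max(counts)
    if lo == 0:
        return float("inf")
    return (hi - lo) / lo

def summarise(hit_counts):
    """Per-word MIN, MAX and relative spread of repeated counts."""
    return {word: {"min": min(c), "max": max(c),
                   "spread": relative_spread(c)}
            for word, c in hit_counts.items()}

# Hypothetical repeated counts, not real search-engine output.
hypothetical = {"word_a": [1000, 1050, 1020],
                "word_b": [400, 1200, 900]}
report = summarise(hypothetical)
over_10_percent = [w for w, r in report.items() if r["spread"] > 0.10]
over_100_percent = [w for w, r in report.items() if r["spread"] > 1.00]
```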
31
Clima = Computational Logic in Multi-Agent Systems; Centre for Legumes in Mediterranean Agriculture
(5-char limit too short)
32
Ratios, Google:DeWaC
WORD          MAX     MIN      RAW    CLEAN
-------------------------------------------
besuchte     10.5     3.8    81840    18228
stirn        3.38    0.62    32320    11137
gerufen      7.14    3.72    66720    27187
verringert   6.86    3.46    52160    15987
bislang      24.4    11.6   239000    90098
brach        4.36    2.26    44520    19824
-------------------------------------------
MAX/MIN: max/min of 6 Google values (millions)
RAW: DeWaC document frequency before filters, dedupe
CLEAN: DeWaC document frequency after filters, dedupe
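The per-word ratios for these six words can be worked through as below. This is only an illustration: it reads MAX and MIN as millions of Google hits, as the legend states, and takes the mid-point of the two Google:RAW ratios. The six rows are a sample, so the averages here are not the 30-word figures.

```python
from statistics import mean

# (MAX, MIN) in millions of Google hits; RAW and CLEAN are DeWaC
# document frequencies, copied from the table.
rows = {
    "besuchte":   (10.5, 3.8, 81840, 18228),
    "stirn":      (3.38, 0.62, 32320, 11137),
    "gerufen":    (7.14, 3.72, 66720, 27187),
    "verringert": (6.86, 3.46, 52160, 15987),
    "bislang":    (24.4, 11.6, 239000, 90098),
    "brach":      (4.36, 2.26, 44520, 19824),
}

def word_ratios(max_m, min_m, raw, clean):
    """Mid-point Google:RAW ratio and RAW:CLEAN ratio for one word."""
    max_ratio = max_m * 1_000_000 / raw
    min_ratio = min_m * 1_000_000 / raw
    return {"google_to_raw": (max_ratio + min_ratio) / 2,
            "raw_to_clean": raw / clean}

ratios = {word: word_ratios(*values) for word, values in rows.items()}
mean_raw_to_clean = mean(r["raw_to_clean"] for r in ratios.values())
```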
33
ItWaC:Google ratio, best estimate
For each of 30 words:
Calculate ratio, max:raw
Calculate ratio, min:raw
Take mid-point and average: 1:33 or 3%
Calculate raw:vert
Average = 4.4; half (for conservativeness/uncertainty) = 2.2
3% x 2.2 = 6.6%
ItWaC:Google = 6.6%
34
Italian web size
ItWaC = 1.67 bn words
Google indexes 1.67/.066 = 25 bn words of sentential, non-duplicate Italian
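The arithmetic behind the Italian estimate can be replayed in a few lines. All figures are the ones given on the slides. The function name and its parameters are introduced here for illustration only.

```python
def estimate_indexed_size(corpus_bn, raw_share, raw_to_vert_avg):
    """Replay the slide arithmetic.

    corpus_bn        corpus size in billions of words
    raw_share        corpus:Google share before adjustment (3%)
    raw_to_vert_avg  average raw:vert ratio (4.4), halved to be conservative
    """
    adjustment = raw_to_vert_avg / 2      # 4.4 / 2 = 2.2
    share = raw_share * adjustment        # 3% x 2.2 = 6.6%
    return share, corpus_bn / share       # 1.67 / 0.066

itwac_share, google_italian_bn = estimate_indexed_size(
    corpus_bn=1.67, raw_share=0.03, raw_to_vert_avg=4.4)
```

With these inputs `itwac_share` comes out at 6.6% and `google_italian_bn` at roughly 25, matching the slide.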
35
German web size
Analysis as for Italian
DeWaC: 3% of Google
DeWaC = 1.41 bn words
Google indexes 1.41/.03 = 44 bn words of sentential, non-duplicate German
36
Effort
ItWaC, DeWaC: less than 6 person months
Developing the method
(EnWaC: in progress)
37
Plan
ACL adopts it (like ACL Anthology) (LDC?)
Say: 3 core staff, 3 years
Goals could be:
English: 2% G-scale (still biggest part)
6 other major languages: 30% G-scale
30 other languages: 10% G-scale
Online for:
Searching as in SkE
Specifying, downloading subcorpora for intensive NLP: “corpora on demand”
Don’t quote me
38
Logjams
Cleaning See CLEANEVAL
Text type “what kind of page is it?” Critical but under-researched WebDoc proposal
(with Serge Sharoff, Tony Hartley) (a different talk)
39
Moral
Google, CSEs are wonderful Start today but
bad science Not
Good science, reliable counts We (the NLP community) have the skills With collective effort, mid-sized project
Google-scale is achievable
40
Thank you
http://www.sketchengine.co.uk
41
Scale and speed, LSE Commercial search engines
banks of computers highly optimised code
but this is for performance no downtime instant responses to millions of queries
This proposal crawling: once a year downtime: acceptable not so many users
42
…but it’s not representative
The web is not representative, but nor is anything else
Text type variation: under-researched, lacking in theory
Atkins, Clear, Ostler 1993 on design brief for BNC; Biber 1988, Baayen 2001, Kilgarriff 2001
Text type is an issue across NLP
Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there
43
Oxford English Corpus
Method as above
Whole domains chosen and harvested: control over text type
1 billion words
Public launch April 2006
Loaded into Sketch Engine
44
Oxford English Corpus
45
Oxford English Corpus
46
Examples
DeWaC, ItWaC Baroni and Kilgarriff, EACL 2006
Serge Sharoff, Leeds Univ UK English Chinese Russian English French
Spanish, all searchable online Oxford English corpus
47
Options for academics
Give up: niche markets, obscure languages; leave the mainstream to the big guys
Work out how to work on that scale: web is free, data availability not a problem