The Web as a Corpus: Going Beyond the n-gram
Preslav Nakov, National University of Singapore
(joint work with Marti Hearst, UC Berkeley)
University of Heidelberg, February 3, 2011
Plan
Introduction
Surface Features & Paraphrases
Syntactic Tasks
Semantic Tasks
Application to Machine Translation
Introduction
Dave Bowman: “Open the pod bay doors, HAL”
HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”
NLP: The Dream
This is too hard!
So, we tackle subproblems instead.
How to Tackle the Problem?
The field was stuck for quite some time.
e.g., CYC: manually annotate all semantic concepts and relations
A new statistical approach started in the 90s.
Get large text collections.
Compute statistics (over the words).
Size Matters
Banko & Brill ’01: “Scaling to Very, Very Large Corpora for Natural Language Disambiguation”, ACL
Spelling correction: Which word should we use? <principal> <principle>
Use context:
I am in my third year as the principal of Anamosa High School.
Power without principle is barren, but principle without power is futile. (Tony Blair)
Banko & Brill ’01
Log-linear improvement even to a billion words!
Getting more data is better than fine-tuning algorithms!
Bigger is better than smarter!
Great idea! Can it be extended to other tasks?
For this problem, one can get a lot of training data.
Web as a Baseline
“Web as a baseline” (Lapata & Keller 04; 05): applied simple n-gram models to
machine translation candidate selection
article generation
noun compound interpretation
noun compound bracketing
adjective ordering
spelling correction
countability detection
prepositional phrase attachment
Their conclusion: Web n-grams should be used as a baseline.
Significantly better than the best supervised algorithm.
Not significantly different than the best supervised algorithm.
These are all UNSUPERVISED!
Contribution
New features
paraphrases
surface features
The ultimate goal
Use the Web as a corpus, and not just as a source of page hit frequencies!
Noun Compound Bracketing
Noun Compound Bracketing: The Problem
Examples: liver cell line, liver cell antibody
(a) [ liver [ cell line ] ] (right bracketing)
(b) [ [ liver cell ] antibody ] (left bracketing)
In (a), the cell line is derived from the liver.
In (b), the antibody targets the liver cell.
Measuring Word Associations Using n-gram Statistics
Frequencies
Dependency: #(w1,w2) vs. #(w1,w3)
Adjacency: #(w1,w2) vs. #(w2,w3)
Probabilities
Dependency: Pr(w1→w2|w2) vs. Pr(w1→w3|w3)
Adjacency: Pr(w1→w2|w2) vs. Pr(w2→w3|w3)
Also: Pointwise Mutual Information, Chi Square, etc.
[Figure: w1 w2 w3, with arcs labeled adjacency and dependency]
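The two models differ only in which pair is compared against (w1, w2). The sketch below is one way to turn that comparison into a bracketing decision. It assumes the usual rule that a stronger (w1, w2) association favours left bracketing, and it assumes the conditional probability is estimated as #(wi,wj)/#(wj). The function names and all counts are hypothetical, made up for illustration, and not taken from the talk.

```python
def association(bigrams, unigrams, a, b, measure):
    """Association of the pair (a, b): raw frequency or Pr(a -> b | b)."""
    n_ab = bigrams.get((a, b), 0)
    if measure == "freq":
        return float(n_ab)
    if measure == "prob":
        n_b = unigrams.get(b, 0)
        # Assumed estimate: #(a, b) / #(b); zero if b was never seen.
        return n_ab / n_b if n_b else 0.0
    raise ValueError(f"unknown measure: {measure}")


def bracket(bigrams, unigrams, w1, w2, w3, model="dependency", measure="freq"):
    """Return 'left', 'right' or 'undecided' for the compound w1 w2 w3."""
    if model == "dependency":
        other = (w1, w3)
    elif model == "adjacency":
        other = (w2, w3)
    else:
        raise ValueError(f"unknown model: {model}")
    left = association(bigrams, unigrams, w1, w2, measure)
    right = association(bigrams, unigrams, other[0], other[1], measure)
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "undecided"


# Hypothetical counts, for illustration only.
TOY_BIGRAMS = {("liver", "cell"): 900, ("cell", "antibody"): 40,
               ("liver", "antibody"): 15}
TOY_UNIGRAMS = {"cell": 50_000, "antibody": 8_000}

for model in ("adjacency", "dependency"):
    for measure in ("freq", "prob"):
        print(model, measure,
              bracket(TOY_BIGRAMS, TOY_UNIGRAMS,
                      "liver", "cell", "antibody", model, measure))
```

With these toy counts every combination favours left bracketing, which agrees with example (b), [ [ liver cell ] antibody ].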
Web-derived Surface Features
Observations
Authors often disambiguate noun compounds using surface markers.
The enormous size of the Web makes them frequent enough to be useful.
Idea
Look for instances of the target noun compound where it occurs with suitable surface markers.
Here starts the new work…
Web-derived Surface Features: Dash (hyphen)
Left dash: cell-cycle analysis → left
Right dash: donor T-cell → right
Web-derived Surface Features: Possessive Marker
Attached to the first word: brain’s stem cell → right
Attached to the second word: brain stem’s cell → left
Web-derived Surface Features: Capitalization
don’t-care – lowercase – uppercase
Plasmodium vivax Malaria → left
plasmodium vivax Malaria → left
lowercase – uppercase – don’t-care
brain Stem cell → right
brain Stem Cell → right
Web-derived Surface Features: Embedded Slash
Left embedded slash: leukemia/lymphoma cell → right
Web-derived Surface Features: Parentheses
Single word
growth factor (beta) → left
(brain) stem cell → right
Two words
(growth factor) beta → left
brain (stem cell) → right
Web-derived Surface Features: Comma, dot, colon, semicolon, …
Following the second word
lung cancer: patients → left
health care, provider → left
Following the first word
home. health care → right
adult, male rat → right
Web-derived Surface Features: Dash to External Word
External word to the left: mouse-brain stem cell → right
External word to the right: tumor necrosis factor-alpha → left
Web-derived Surface Features: Problems & Solutions
Problem: search engines ignore punctuation
“brain-stem cell” does not work
Solution:
query for “brain stem cell”
obtain 1,000 document summaries
scan for the features in these summaries
One can get much more than 1,000 results using the “*” operator and inflections.
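The "scan for the features in these summaries" step can be sketched with regular expressions. The Python below is only an illustration: the function name, the choice of four marker types (dash, possessive, parentheses, punctuation) and the example snippets are assumptions, and the talk does not say how the scan was actually implemented.

```python
import re
from collections import Counter


def surface_votes(snippets, w1, w2, w3):
    """Count left/right surface-marker evidence for the compound w1 w2 w3."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{a}-{b}\s+{c}\b",            # left dash: w1-w2 w3
            rf"\b{a}\s+{b}['’]s\s+{c}\b",     # possessive on the second word
            rf"\({a}\s+{b}\)\s+{c}\b",        # (w1 w2) w3
            rf"\b{a}\s+{b}[,.;:]\s+{c}\b",    # punctuation after the second word
        ],
        "right": [
            rf"\b{a}\s+{b}-{c}\b",            # right dash: w1 w2-w3
            rf"\b{a}['’]s\s+{b}\s+{c}\b",     # possessive on the first word
            rf"\b{a}\s+\({b}\s+{c}\)",        # w1 (w2 w3)
            rf"\b{a}[,.;:]\s+{b}\s+{c}\b",    # punctuation after the first word
        ],
    }
    votes = Counter()
    for text in snippets:
        for label, regexes in patterns.items():
            for regex in regexes:
                votes[label] += len(re.findall(regex, text, flags=re.IGNORECASE))
    return votes


# Made-up snippets standing in for search-engine summaries.
example_snippets = [
    "Adult brain-stem cell lines were studied.",
    "The (brain stem) cell cultures were frozen.",
    "We isolated the brain's stem cell population.",
]
votes = surface_votes(example_snippets, "brain", "stem", "cell")
print(dict(votes))
```

Because the summaries are fetched with a plain “brain stem cell” query and inspected locally, the search engine's disregard for punctuation no longer matters.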
Other Web-derived Features: Abbreviation
After the second word: tumor necrosis (TN) factor → left
After the third word: tumor necrosis factor (NF) → right
Query for, e.g., “tumor necrosis tn factor” and “tumor necrosis factor nf”
Other Web-derived Features: Concatenation
Consider “health care reform”
healthcare: 79,500,000
carereform: 269
healthreform: 812
Adjacency model: healthcare vs. carereform
Dependency model: healthcare vs. healthreform
Triples: “healthcare reform” vs. “health carereform”
Tests for lexicalization
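The concatenation test can be replayed with the hit counts quoted on this slide. This is a minimal Python sketch: the function name is hypothetical, and the rule that the larger count wins is an assumption about how the comparison is turned into a prediction.

```python
# Hit counts as quoted on the slide for "health care reform".
SLIDE_HITS = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}


def concatenation_vote(hits, w1, w2, w3, model):
    """Compare the concatenation w1w2 with w2w3 (adjacency) or w1w3 (dependency)."""
    left = hits.get(w1 + w2, 0)
    if model == "adjacency":
        right = hits.get(w2 + w3, 0)
    elif model == "dependency":
        right = hits.get(w1 + w3, 0)
    else:
        raise ValueError(f"unknown model: {model}")
    if left == right:
        return "undecided"
    return "left" if left > right else "right"


for model in ("adjacency", "dependency"):
    print(model, concatenation_vote(SLIDE_HITS, "health", "care", "reform", model))
```

Both models predict left bracketing here, since "healthcare" is far more frequent than either "carereform" or "healthreform".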
Other Web-derived Features: Using the star operator “*”
Single star
“health care * reform” → left
“health * care reform” → right
More stars and/or reverse order
“care reform * * health” → right
“reform * * * health care” → left
Other Web-derived Features: Reorder
Reorders for “health care reform”
“care reform health” → right
“reform health care” → left
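The star-operator and reorder queries follow a fixed template, so they can be generated mechanically. The sketch below reproduces the six query shapes shown on these two slides, each paired with the bracketing it supports. The function names are hypothetical, and tallying by summed hit counts is an assumption.

```python
def star_and_reorder_queries(w1, w2, w3):
    """Exact-phrase queries for w1 w2 w3, each paired with the bracketing it supports."""
    return [
        (f'"{w1} {w2} * {w3}"', "left"),       # single star before the third word
        (f'"{w1} * {w2} {w3}"', "right"),      # single star after the first word
        (f'"{w2} {w3} * * {w1}"', "right"),    # reverse order, two stars
        (f'"{w3} * * * {w1} {w2}"', "left"),   # reverse order, three stars
        (f'"{w2} {w3} {w1}"', "right"),        # plain reorder
        (f'"{w3} {w1} {w2}"', "left"),         # plain reorder
    ]


def tally(queries, hits):
    """Sum hypothetical hit counts per bracketing; hits maps query -> count."""
    totals = {"left": 0, "right": 0}
    for query, label in queries:
        totals[label] += hits.get(query, 0)
    return totals


queries = star_and_reorder_queries("health", "care", "reform")
for query, label in queries:
    print(label, query)
```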
Other Web-derived Features: Internal Inflection Variability
First word: bone mineral density, bones mineral density → right
Second word: bone mineral density, bone minerals density → left
Other Web-derived Features: Switch The First Two Words
Predict right if we can reorder “adult male rat” as “male adult rat”.
Paraphrases
Prepositional
cells in (the) bone marrow → left (61,700)
cells from (the) bone marrow → left (16,500)
marrow cells from (the) bone → right (12)
Verbal
cells extracted from (the) bone marrow → left (17)
marrow cells found in (the) bone → right (1)
Copula
cells that are bone marrow → left (3)
“bone marrow cell”: left- or right-bracketed?
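The paraphrase evidence for “bone marrow cell” can be tallied directly from the counts quoted on this slide. In the Python sketch below, summing hits per side is an assumption, since the slide does not say how the counts are combined. The helper that expands the optional “(the)” into two concrete queries is likewise hypothetical.

```python
# (paraphrase pattern, predicted bracketing, hit count) as quoted on the slide.
SLIDE_EVIDENCE = [
    ("cells in (the) bone marrow", "left", 61_700),
    ("cells from (the) bone marrow", "left", 16_500),
    ("marrow cells from (the) bone", "right", 12),
    ("cells extracted from (the) bone marrow", "left", 17),
    ("marrow cells found in (the) bone", "right", 1),
    ("cells that are bone marrow", "left", 3),
]


def expand_optional(pattern):
    """Turn an optional '(the)' into two concrete exact-phrase queries."""
    if "(the)" not in pattern:
        return [pattern]
    return [pattern.replace("(the) ", ""), pattern.replace("(the)", "the")]


def paraphrase_totals(evidence):
    """Sum the hit counts supporting each bracketing."""
    totals = {"left": 0, "right": 0}
    for _pattern, label, hits in evidence:
        totals[label] += hits
    return totals


totals = paraphrase_totals(SLIDE_EVIDENCE)
print(totals)
print(expand_optional("cells in (the) bone marrow"))
```

On these counts the left reading dominates by several orders of magnitude.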
Evaluation
Method: Exact phrase queries limited to English
Dataset: Lauer’s Dataset
244 noun compounds from Grolier’s encyclopedia
Evaluation Results (1)
Co-occurrences
Evaluation Results (2)
Paraphrases, surface features, majority vote
Comparison to Others
Application: Query Segmentation
Segmentation
[ used car parts ]
[ used car ] [ parts ]
[ used ] [ car parts ]
[ used ] [ car ] [ parts ]
Bracketing
[ [ used car ] parts ]
[ used [ car parts ] ]
S. Bergsma, Q. Wang. Learning Noun Phrase Query Segmentation. EMNLP'07, pp. 819–826.
D. Vadas, J. R. Curran. Adding Noun Phrase Structure to the Penn Treebank. ACL’07.
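The four segmentations listed for “used car parts” are all the ways of cutting or not cutting at each of the two word boundaries. The enumeration can be sketched in Python as follows. This illustrates only the search space; it is not the segmentation method of the cited papers, and the function name is hypothetical.

```python
from itertools import product


def segmentations(words):
    """Enumerate every way to split a word sequence into contiguous segments."""
    if not words:
        return []
    results = []
    # One boolean per boundary between adjacent words: True means "cut here".
    for cuts in product([False, True], repeat=len(words) - 1):
        segments, current = [], [words[0]]
        for word, cut in zip(words[1:], cuts):
            if cut:
                segments.append(current)
                current = [word]
            else:
                current.append(word)
        segments.append(current)
        results.append(segments)
    return results


for seg in segmentations(["used", "car", "parts"]):
    print(" ".join("[ " + " ".join(s) + " ]" for s in seg))
```

A query of n words has 2^(n-1) segmentations, so three words give the four options listed above.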
Prepositional Phrase Attachment
PP attachment
(a) Peter spent millions of dollars. (noun)
PP combines with the NP to form another NP
(b) Peter spent time with his family. (verb)
PP is an indirect object of the verb
quadruple: (v, n1, p, n2)
(a) (spent, millions, of, dollars)
(b) (spent, time, with, family)
Results
Simpler but not significantly different from 84.3% (Pantel & Lin, 00).
Noun Phrase Coordination
NP Coordination: Ellipsis
Ellipsis: “car and truck production” means “car production and truck production”
No ellipsis: “president and chief executive”
NP Coordination: Ellipsis
Penn Treebank annotations
ellipsis: (NP car/NN and/CC truck/NN production/NN)
no ellipsis: (NP (NP president/NN) and/CC (NP chief/NN executive/NN))
Results
428 examples from Penn TB
Comparable to other researchers (but no standard dataset).
Paraphrasing Noun Compounds
Noun Compound Semantics
Traditionally: choose one abstract relation
Fixed set of abstract relations (Girju & al., 2005): malaria mosquito → CAUSE; olive oil → SOURCE
Prepositions (Lauer, 1995): malaria mosquito → WITH; olive oil → FROM
Recoverably Deletable Predicates (Levi, 1978): malaria mosquito → CAUSE; olive oil → FROM
Our approach: use multiple paraphrasing verbs
malaria mosquito: carries, spreads, causes, transmits, brings, has
olive oil: comes from, is obtained from, is extracted from
NC Semantics: Method
For NC “noun1 noun2”, query for:
“noun2 THAT * noun1”
THAT can be that, which or who; up to 8 “*”s.
POS tag the snippets.
Extract verbal paraphrases.
[Figure labels: postmodifier, premodifier]
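The query template above expands into a small, fixed set of exact-phrase queries per noun compound. The Python sketch below generates them. The function name and the literal query syntax (double quotes, space-separated stars) are assumptions, and the POS tagging and verb extraction steps are not covered.

```python
def paraphrase_queries(noun1, noun2, max_stars=8):
    """Build '"noun2 THAT * ... * noun1"' queries for the compound noun1 noun2."""
    queries = []
    for that in ("that", "which", "who"):
        for k in range(1, max_stars + 1):
            stars = " ".join("*" * k)  # k space-separated wildcards
            queries.append(f'"{noun2} {that} {stars} {noun1}"')
    return queries


queries = paraphrase_queries("malaria", "mosquito")
print(len(queries))
print(queries[0])
print(queries[-1])
```

With three values for THAT and up to 8 stars, each compound yields 24 queries.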
NC Semantics: Sample Verbal Paraphrases
Verbs+prepositions for migraine treatment
7 prevent
3 be given for
3 be for
2 reduce
2 benefit
1 relieve
Dynamic componential analysis
Classic componential analysis
Example: Treatments
Comparing to (Girju & al., 05)
14 out of 21 relations are shown.
Amazon’s Mechanical Turk: Malaria Mosquito
Five judges: 5 carries, 3 causes, 2 transmits, 2 infects with, 1 has, 1 supplies
The program: 23 carry, 16 spread, 12 cause, 9 transmit, 7 bring, 4 have, 3 be infected with, 3 be responsible for, …
MTurk: Comparison to 30 Humans
Average cosine correlation
Average cosine correlation (in %) between human- and program-generated verbs for the Levi-250 dataset.
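The cosine between a human verb vector and a program verb vector can be illustrated with the “malaria mosquito” counts from the Mechanical Turk slide. This Python sketch rests on two assumptions: the judges' verbs are lemmatised by hand so they line up with the program's lemmas, and the program list is taken only as far as the slide shows it (it is truncated with “…”). The resulting score is therefore illustrative and is not a figure from the talk.

```python
import math


def cosine(a, b):
    """Cosine between two sparse frequency vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Judges' verbs, lemmatised by hand (assumption) to match the program's lemmas.
human = {"carry": 5, "cause": 3, "transmit": 2, "infect with": 2,
         "have": 1, "supply": 1}
# Program output as far as it is listed on the slide (truncated there).
program = {"carry": 23, "spread": 16, "cause": 12, "transmit": 9, "bring": 7,
           "have": 4, "be infected with": 3, "be responsible for": 3}

score = cosine(human, program)
print(f"{100 * score:.1f}%")
```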
Levi’s Recoverably Deletable Predicates
MTurk (human) vs. Web (program): Aggregated by Levi’s RDP
Cosine correlation (in %) between the human- and the program-generated verbs by Levi’s RDP: using all human-proposed verbs vs. using the first verb from each worker only.
Average Cosine Correlation
Left: calculated for each noun compound
Right: aggregated by relation
Predicting Abstract Semantic Relations
Levi’s Recoverably Deletable Predicates
Search Engine Queries
Given noun1 and noun2, query for:
“noun2 * noun1”
“noun1 * noun2”
Use up to 8 “*”s.
POS tag the snippets.
Extract: verbs, prep, verb+prep, coordinations.
Most Frequent Features for committee member
Predicting Semantic Relations: Levi’s RDPs
v – verb; p – preposition; c – coordinating conjunction
Vector-space model
kNN classifier
Dice coefficient (freqs)
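The classifier named here, a nearest-neighbour classifier over feature frequency vectors compared with the Dice coefficient, can be sketched as below. The frequency form of Dice, 2·Σ min(a_i, b_i) / (Σ a_i + Σ b_i), is one common generalisation and is assumed here rather than confirmed by the slide. The feature names, the training vectors and the use of k = 1 are hypothetical.

```python
def dice(a, b):
    """Dice coefficient generalised to frequency vectors (assumed form)."""
    shared = sum(min(a[f], b[f]) for f in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0


def nearest_label(query, training):
    """1-nearest-neighbour: return the label of the most similar training vector."""
    best_vector, best_label = max(training, key=lambda item: dice(query, item[0]))
    return best_label


# Hypothetical training examples: (feature frequencies, Levi RDP label).
training = [
    ({"v:cause": 5, "v:carry": 3}, "CAUSE"),
    ({"p:from": 6, "v:come from": 4}, "FROM"),
]
query = {"v:carry": 2, "v:cause": 1, "p:with": 1}

print(dice(query, training[0][0]))
print(nearest_label(query, training))
```

Features are prefixed with v, p or c to mirror the verb, preposition and coordinating-conjunction feature types listed above.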
Relations Between Complex Nominals
SemEval’07: Data
There are seven such relations.
SemEval’07: Results
Using up to 10 stars: 67.0
kNN classifier with the Dice coefficient
SemEval’07: Results
Using up to 10 stars: 68.1
kNN classifier with the Dice coefficient
SAT Analogy Questions
SAT Analogy Questions
SAT: Nouns Only
Head-Modifier Relations in Noun-Noun Compounds
30 Relations from (Nastase & Szpakowicz, 2003)
Noun-Modifier Relations: 30 classes
v – verb; p – preposition; c – coordinating conjunction
Application to Machine Translation
MT: Parallel Text
Paraphrasing the Phrase Table (1)
Phrase Table Entry
, spain 's economy ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
Paraphrased Entries
, economy of spain ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
, the economy of spain ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
, spain economy ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
, economy of a spain ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
, economy of an spain ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
, economy of the spain ||| , la economía española ||| 1 0.0056263 1 0.00477047 2.718
Web-based filtering
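The expansion on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single-token restriction on both sides of "'s", the `min_count` threshold, and the toy counts are all hypothetical. In particular, the counts are invented placeholders standing in for Web frequencies, and the slide does not say which candidates actually survive filtering.

```python
from typing import Callable, List

SEP = " ||| "  # field separator used in the phrase table entries above


def paraphrase_genitive(source: str) -> List[str]:
    """Generate candidate paraphrases for a source phrase containing "X 's Y".

    Assumption: X and Y are single tokens (enough for the slide's example).
    """
    tokens = source.split()
    if "'s" not in tokens:
        return []
    i = tokens.index("'s")
    if i == 0 or i + 1 >= len(tokens):
        return []
    prefix, x = tokens[:i - 1], tokens[i - 1]
    y, suffix = tokens[i + 1], tokens[i + 2:]
    variants = [
        [y, "of", x],
        ["the", y, "of", x],
        [x, y],
        [y, "of", "a", x],
        [y, "of", "an", x],
        [y, "of", "the", x],
    ]
    return [" ".join(prefix + v + suffix) for v in variants]


def expand_entry(entry: str, count: Callable[[str], int],
                 min_count: int = 1) -> List[str]:
    """Copy the target side and scores onto every candidate that passes the filter."""
    source, rest = entry.split(SEP, 1)
    return [cand + SEP + rest
            for cand in paraphrase_genitive(source)
            if count(cand) >= min_count]


ENTRY = (", spain 's economy ||| , la economía española ||| "
         "1 0.0056263 1 0.00477047 2.718")

# Hypothetical counts standing in for Web frequencies (not real data).
TOY_COUNTS = {
    ", economy of spain": 120,
    ", the economy of spain": 300,
    ", spain economy": 40,
}

kept = expand_entry(ENTRY, lambda phrase: TOY_COUNTS.get(phrase, 0))
for line in kept:
    print(line)
```

Passing the frequency lookup in as a function keeps the filter independent of where the counts come from, so a search-engine query or an n-gram collection could replace the toy dictionary without touching the expansion logic.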
71
Paraphrasing the Phrase Table (2)
72
Paraphrasing the Training Corpus
73
Paraphrasing a Sentence
74
Paraphrasing NPs/NCs
purely syntactic
use Web stats
75
Results
76
Conclusion
77
Conclusion
Tapped the potential of very large corpora for unsupervised algorithms
Go beyond n-grams: surface features, paraphrases
Results: competitive with the best unsupervised algorithms; can rival supervised algorithms
78
Resume
Surface Features & Paraphrases
Syntactic Tasks: Noun Compound Bracketing; Prepositional Phrase Attachment; Noun Compound Coordination
Semantic Tasks: Paraphrasing Noun Compounds; Predicting Abstract Semantic Relations; Relations Between Complex Nominals; SAT Analogy Questions; Head-Modifier Relations
Application: Machine Translation
79
Future Work
New exciting features
Other problems
Use fewer queries
Use the Web as a corpus, and not just as a source of page hit frequencies!
80
Thank You
Questions?