1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005...

23
1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University of California, Berkeley Marti Hearst SIMS University of California, Berkeley

Transcript of 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005...

Page 1: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

1

Search Engine Statistics Beyond the n-gram:Application to Noun Compound Bracketing

CoNLL-2005Preslav NakovEECS, Computer Science DivisionUniversity of California, BerkeleyMarti HearstSIMSUniversity of California, Berkeley

Page 2: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

2

Outline

Introduction Related Work Models and Features

Page 3: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

3

Introduction

Noun compound bracketing-> Noun compound interpretation

liver cell antibody [[liver cell] antibody]

liver cell line [liver [cell line]]

POS equivalent, different syntactic trees

Page 4: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

4

This Paper A highly accurate unsupervised method for

making bracketing decisions for noun compounds (NCs) Current: using bigram estimates to compute

adjacency and dependency scores Improvement

χ2 measure a new set of surface features for querying Web

search engines Evaluate on 2 domains, encyclopedia &

bioscience

Page 5: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

5

Related Work NC syntax and semantics

Still active -> J. of Com. Speech and Language – Special Issue on Multiword Expressions

Adjacency model Probabilistic dependency model, Laucer

(1995) Data sparseness (use categories instead) 244 NCs from encyclopedia Inter-annotator agreement 81.5% Baseline 66.8% -> 77.5% Adding POS -> state-of-the-art result of 80.7%

Page 6: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

6

2003~2005 Keller and Lapata (2003)

Use Web Search Engines for obtaining frequencies for unseen bigrams

(2004) apply to six NLP tasks including disambiguation of NCs

Simpler version (use frequency only) - 78.68%

Girju et al. (2005) supervised (decision tree) (5 WordNet semantic features) 83.1%

Page 7: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

7

Models and Features Adjacency and dependency model w1w2w3 -> [w1 [w2w3]] (two reasons)

take on right bracketing1. w2w3 is a compound (modified by w1)

home health care Adjacency model checks 1.

2. w1 and w2 independently modify w3 adult male rat

(Better) Dependency model checks 2. Left bracketing -> only 1 choice

[law enforcement] agent

Page 8: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

8

Computing Probabilities

Alternative

Calculations

Page 9: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

9

χ2 measure

B=#(wi)-(A) C=#(wj)-(A) D=~N-A-B-C N=8T

=google 8B pages X 1000 words/page

(Yang and Pedersen, 1997) χ2 better than MI

Page 10: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

10

蛋包飯 蛋 2067593 蛋包 2217 包 10207448 包飯 3398 飯 1672224 χ2 包飯 750.34 > 蛋包 67.32

Page 11: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

11

Web-Derived Surface (1/2) Authors sometimes (consciously or not) disambiguate

the words they write by using surface-level markers to suggest the correct meaning.

Dash (hyphen) left bracketing

cell cycle analysis -> cell-cycle right bracketing less reliable

donor T-cell fiber optics-system t-cell-depletion

Possessive marker brain’s stem cells, brain stem’s cells, brain’s stem-cells

Internal capitalization Plasmodium vivax Malaria, brain Stem cells disable this feature on Roman digits and single-letter

words vitamin D deficiency

Page 12: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

12

Web-Derived Surface (2/2) Embedded slashes

leukemia/lymphoma cell growth factor (beta) or (growth factor) beta (brain) stem cells

a comma, a dot or a colon “health care, provider” or “lung cancer: patients” (weak

indicator) mouse-brain stem cells (weak indicator)

Unfortunately, Web SE ignore punctuation characters - hyphens, brackets, apostrophes, etc.

collect them indirectly – post-processing the resulting summaries (up to 1000 results)

Above features are clearly more reliable than others, we do not try to weight them

Features verifying Counts returned by SE, page hits as a proxy for n-gram

frequencies from 1000 summaries

Page 13: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

13

Other Web-Derived Features Abbreviations

tumor necrosis factor (NF) tumor necrosis (TN) factor

Concatenation health care reform -> healthcare, carereform

Wildcard (*) “ health care * reform” <-> “health * care reform”

Reorder reform health care <-> care reform health

myosin heavy chain, heavy chain myosin Internal inflection variability

tyrosine kinase activation, tyrosine kinases activation Switching

“adult male rat”, we would also expect “male adult rat”.

Page 14: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

14

新發現

Page 15: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

15

Paraphrases Warren (1978) proposes

stem cells in the brain cells from the brain stem

Copula paraphrase office building that/which is a skyscraper pain associated with arthritis migraine

search engines lack linguistic annotations small set of hand-chosen paraphrases associated with, caused by, contained in, derived

from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for

Page 16: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

16

Evaluations Lauer’s Dataset (1995)

244 unambiguous 3-noun NC-s Biomedical Dataset (Nakov et al.,

2005, SIG BioLink) Open NLP tools

sentence splitted, tokenized, POS tagged and shallow parsed a set of 1.4 million MEDLINE abstracts (citations between 1994 and 2003)

500 NCs, 361 left, 69 right, 70 ambiguous

Page 17: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

17

Experiments used MSN Search statistics for the

n-grams and the paraphrases (unless the pattern contained a “*”) MSN always returned exact numbers

Google for the surface features Google and Yahoo rounded their page

hits, which generally leads to lower accuracy (Yahoo was better than Google for these estimates)

Page 18: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

18

Tools Mentioned

UMLS Specialist lexicon 得到生物領域字不同的拼法 http://www.nlm.nih.gov/pubs/

factsheets/umlslex.html Carroll’s morphological tools

http://www.cogs.susx.ac.uk/lab/nlp/carroll/morph.html

Page 19: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

19

UMLS Lexicon {base=AAAentry=E0000049

cat=nounvariants=metareg variants=uncountacronym_of=abdominal aortic aneurysmectomy|E0429482

acronym_of=acne-associated arthritis|E0429483acronym_of=acquired aplastic anemia|E0429484acronym_of=acute anxiety attack|E0429485

acronym_of=androgenic anabolic agent|E0429486acronym_of=aneurysm of ascending aortaacronym_of=aromatic amino acid|E0356310acronym_of=acute apical abscess|E0356309abbreviation_of=abdominal aortic aneurysm|E0006446}

{base=AAMDspelling_variant=A.A.M.D.entry=E0000050cat=noun variants=groupuncountacronym_of=American Association on Mental Deficiency|E0000277}

Page 20: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

20

Page 21: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

21

Page 22: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

22

Page 23: 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

23

Conclusions and Future Work Improved upon the state-of-the-art

approaches to NC bracketing Future include

test on > 3 words recognize the ambiguous case Include determiners and modifiers on other NLP problems refine the parser output

Parser typically assume right bracketing