1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005...
-
Upload
clyde-bishop -
Category
Documents
-
view
213 -
download
0
Transcript of 1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005...
1
Search Engine Statistics Beyond the n-gram:Application to Noun Compound Bracketing
CoNLL-2005Preslav NakovEECS, Computer Science DivisionUniversity of California, BerkeleyMarti HearstSIMSUniversity of California, Berkeley
2
Outline
Introduction Related Work Models and Features
3
Introduction
Noun compound bracketing-> Noun compound interpretation
liver cell antibody [[liver cell] antibody]
liver cell line [liver [cell line]]
POS equivalent, different syntactic trees
4
This Paper A highly accurate unsupervised method for
making bracketing decisions for noun compounds (NCs) Current: using bigram estimates to compute
adjacency and dependency scores Improvement
χ2 measure a new set of surface features for querying Web
search engines Evaluate on 2 domains, encyclopedia &
bioscience
5
Related Work NC syntax and semantics
Still active -> J. of Com. Speech and Language – Special Issue on Multiword Expressions
Adjacency model Probabilistic dependency model, Laucer
(1995) Data sparseness (use categories instead) 244 NCs from encyclopedia Inter-annotator agreement 81.5% Baseline 66.8% -> 77.5% Adding POS -> state-of-the-art result of 80.7%
6
2003~2005 Keller and Lapata (2003)
Use Web Search Engines for obtaining frequencies for unseen bigrams
(2004) apply to six NLP tasks including disambiguation of NCs
Simpler version (use frequency only) - 78.68%
Girju et al. (2005) supervised (decision tree) (5 WordNet semantic features) 83.1%
7
Models and Features Adjacency and dependency model w1w2w3 -> [w1 [w2w3]] (two reasons)
take on right bracketing1. w2w3 is a compound (modified by w1)
home health care Adjacency model checks 1.
2. w1 and w2 independently modify w3 adult male rat
(Better) Dependency model checks 2. Left bracketing -> only 1 choice
[law enforcement] agent
8
Computing Probabilities
Alternative
Calculations
9
χ2 measure
B=#(wi)-(A) C=#(wj)-(A) D=~N-A-B-C N=8T
=google 8B pages X 1000 words/page
(Yang and Pedersen, 1997) χ2 better than MI
10
蛋包飯 蛋 2067593 蛋包 2217 包 10207448 包飯 3398 飯 1672224 χ2 包飯 750.34 > 蛋包 67.32
11
Web-Derived Surface (1/2) Authors sometimes (consciously or not) disambiguate
the words they write by using surface-level markers to suggest the correct meaning.
Dash (hyphen) left bracketing
cell cycle analysis -> cell-cycle right bracketing less reliable
donor T-cell fiber optics-system t-cell-depletion
Possessive marker brain’s stem cells, brain stem’s cells, brain’s stem-cells
Internal capitalization Plasmodium vivax Malaria, brain Stem cells disable this feature on Roman digits and single-letter
words vitamin D deficiency
12
Web-Derived Surface (2/2) Embedded slashes
leukemia/lymphoma cell growth factor (beta) or (growth factor) beta (brain) stem cells
a comma, a dot or a colon “health care, provider” or “lung cancer: patients” (weak
indicator) mouse-brain stem cells (weak indicator)
Unfortunately, Web SE ignore punctuation characters - hyphens, brackets, apostrophes, etc.
collect them indirectly – post-processing the resulting summaries (up to 1000 results)
Above features are clearly more reliable than others, we do not try to weight them
Features verifying Counts returned by SE, page hits as a proxy for n-gram
frequencies from 1000 summaries
13
Other Web-Derived Features Abbreviations
tumor necrosis factor (NF) tumor necrosis (TN) factor
Concatenation health care reform -> healthcare, carereform
Wildcard (*) “ health care * reform” <-> “health * care reform”
Reorder reform health care <-> care reform health
myosin heavy chain, heavy chain myosin Internal inflection variability
tyrosine kinase activation, tyrosine kinases activation Switching
“adult male rat”, we would also expect “male adult rat”.
14
新發現
15
Paraphrases Warren (1978) proposes
stem cells in the brain cells from the brain stem
Copula paraphrase office building that/which is a skyscraper pain associated with arthritis migraine
search engines lack linguistic annotations small set of hand-chosen paraphrases associated with, caused by, contained in, derived
from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for
16
Evaluations Lauer’s Dataset (1995)
244 unambiguous 3-noun NC-s Biomedical Dataset (Nakov et al.,
2005, SIG BioLink) Open NLP tools
sentence splitted, tokenized, POS tagged and shallow parsed a set of 1.4 million MEDLINE abstracts (citations between 1994 and 2003)
500 NCs, 361 left, 69 right, 70 ambiguous
17
Experiments used MSN Search statistics for the
n-grams and the paraphrases (unless the pattern contained a “*”) MSN always returned exact numbers
Google for the surface features Google and Yahoo rounded their page
hits, which generally leads to lower accuracy (Yahoo was better than Google for these estimates)
18
Tools Mentioned
UMLS Specialist lexicon 得到生物領域字不同的拼法 http://www.nlm.nih.gov/pubs/
factsheets/umlslex.html Carroll’s morphological tools
http://www.cogs.susx.ac.uk/lab/nlp/carroll/morph.html
19
UMLS Lexicon {base=AAAentry=E0000049
cat=nounvariants=metareg variants=uncountacronym_of=abdominal aortic aneurysmectomy|E0429482
acronym_of=acne-associated arthritis|E0429483acronym_of=acquired aplastic anemia|E0429484acronym_of=acute anxiety attack|E0429485
acronym_of=androgenic anabolic agent|E0429486acronym_of=aneurysm of ascending aortaacronym_of=aromatic amino acid|E0356310acronym_of=acute apical abscess|E0356309abbreviation_of=abdominal aortic aneurysm|E0006446}
{base=AAMDspelling_variant=A.A.M.D.entry=E0000050cat=noun variants=groupuncountacronym_of=American Association on Mental Deficiency|E0000277}
20
21
22
23
Conclusions and Future Work Improved upon the state-of-the-art
approaches to NC bracketing Future include
test on > 3 words recognize the ambiguous case Include determiners and modifiers on other NLP problems refine the parser output
Parser typically assume right bracketing