Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.
![Page 1: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/1.jpg)
Term Informativeness for Named Entity Detection
Jason D. M. Rennie, MIT
Tommi Jaakkola, MIT
![Page 2: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/2.jpg)
Information Extraction
President Bush signed the Central America Free Trade Agreement into law Tuesday…
Who What When
![Page 3: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/3.jpg)
Named Entity Detection
President Bush signed the Central America Free Trade Agreement into law Tuesday, hailing the seven-nation pact as an open-door policy that will benefit U.S. exporters and seed prosperity and democracy in Central America and the Dominican Republic.
![Page 4: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/4.jpg)
Informal Communication
• Other Sources of Information
  – E-mail
  – Web Bulletin Boards
  – Mailing Lists
• More specialized, up-to-date information
• But, harder to extract
![Page 5: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/5.jpg)
IE for Informal Comm.
SUBJECT: Two New Ipswich Seafood Joints to Open Soon.
ALL HOUNDS ON DECK! #1 Across from the new HS, at the old White Cap Seafood is a renovated new joint and the sign says "Salt Box". I suspect they are opening soon; they look ready. Lets hope its great as there is too much 'just average' around here. #2: In the…
![Page 6: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/6.jpg)
NED for Informal Comm.
Subject: finale harvard square
has anyone been to the recently opened finale in harvard square?
![Page 7: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/7.jpg)
Restaurant Bulletin Board
• Gathered from a Restaurant BBoard
  – 6 sets of ~100 posts
  – 132 threads
  – Applied Ratnaparkhi’s POS tagger
  – Hand-labeled each token In/Out of restaurant name
![Page 8: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/8.jpg)
Detecting Named Entities
[Diagram relating three term properties: Named Entity, Informative, Bursty]
![Page 9: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/9.jpg)
Quantifying Informativeness
[Figure: occurrences of “the”, “clandestine”, and “Brazil” across Document 1, Document 2, and Document 3]
![Page 10: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/10.jpg)
A Little History…
Z-measure [Brookes, 1968]
Inverse Doc. Freq. [Jones, 1973]
xI [Bookstein & Swanson, 1974]
Residual IDF [Church & Gale, 1995]
Gain [Papineni, 2001]
![Page 11: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/11.jpg)
Main Idea
• Informative words are:
  – Rare (IDF)
  – Modal (Mixture Score)
• Rarity and Modality are independent qualities
• We quantify informativeness using a product of IDF and Mixture Score
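A minimal sketch of the product score described above, under stated assumptions: the slides do not give the IDF formula, so this uses the common definition (negative log of the fraction of documents containing the term) with a natural logarithm. The function names and the example numbers are hypothetical.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: -log(fraction of documents containing the term).
    The log base is an assumption; the slides do not state it."""
    return -math.log(doc_freq / num_docs)

def informativeness(doc_freq: int, num_docs: int, mixture_score: float) -> float:
    """Product of rarity (IDF) and modality (Mixture Score)."""
    return idf(doc_freq, num_docs) * mixture_score

# Hypothetical example: a term found in 10 of 1000 documents,
# with a hypothetical mixture score of 40.0.
example_idf = idf(10, 1000)
example_score = informativeness(10, 1000, 40.0)
```

As a consistency check on the deck's own numbers: "sichaun" has a Mixture Score of 99.62 and an IDF*Mixture score of 379.97, which implies an IDF of about 3.81 for that token.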
![Page 12: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/12.jpg)
Binomial Distribution
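The formula on this slide did not survive extraction. For reference, the standard binomial probability is given below; the symbols are assumptions, not taken from the slide: x is the number of times a term occurs in a document, n is the document length in tokens, and θ is the per-token probability of the term.

```latex
P(x \mid n, \theta) = \binom{n}{x}\, \theta^{x} (1 - \theta)^{n - x}
```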
![Page 13: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/13.jpg)
Term Frequency Distributions
[Figure: per-document term counts for “the” and “Brazil”; extracted values, in reading order: 7, 0, 4, 0, 8, 0, 5, 5, 6, 0]
![Page 14: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/14.jpg)
Mixture Models
[Figure: mixture model illustration; extracted labels: 0.1%, 5%, 10%, 90%, 0, 5]
![Page 15: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/15.jpg)
Modality
• Modal words fit a mixture much better than a single binomial
• We separately fit the binomial and mixture models to each term frequency distribution
• We quantify modality by comparing the fitness of the two models
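The comparison above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the mixture is taken to be a two-component mixture of binomials with a mixing weight `lam` and rates `theta1`, `theta2` (hypothetical names), the toy counts are invented, and the mixture parameters are hand-picked rather than fitted.

```python
import math

def log_binom_pmf(x: int, n: int, theta: float) -> float:
    """Log-probability of x occurrences in a document of n tokens (0 < theta < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(theta) + (n - x) * math.log(1.0 - theta))

def binomial_loglik(counts, lengths, theta):
    """Log-likelihood of all documents under a single binomial."""
    return sum(log_binom_pmf(x, n, theta) for x, n in zip(counts, lengths))

def mixture_loglik(counts, lengths, lam, theta1, theta2):
    """Log-likelihood under a two-binomial mixture, using log-sum-exp per document."""
    total = 0.0
    for x, n in zip(counts, lengths):
        a = math.log(lam) + log_binom_pmf(x, n, theta1)
        b = math.log(1.0 - lam) + log_binom_pmf(x, n, theta2)
        m = max(a, b)
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

# Hypothetical toy data: a topical term that is absent from most documents
# but occurs several times in the documents where it appears.
counts = [0, 0, 0, 0, 5, 0, 0, 6, 0, 0]
lengths = [100] * 10

theta_hat = sum(counts) / sum(lengths)  # maximum-likelihood rate for one binomial
single = binomial_loglik(counts, lengths, theta_hat)
mixed = mixture_loglik(counts, lengths, lam=0.8, theta1=0.0005, theta2=0.055)
print(single, mixed)
```

On this toy data the hand-picked mixture already has a higher log-likelihood than the best single binomial, which is the pattern the slide calls modal.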
![Page 16: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/16.jpg)
Learning Mixture Parameters
Use gradient descent to learn the mixture parameters λ, θ1, θ2 (the mixing weight and the two binomial rates)
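A minimal sketch of fitting a two-binomial mixture by gradient descent. Everything beyond "gradient descent" is an assumption made for illustration: the sigmoid reparameterization that keeps each parameter in (0, 1), the finite-difference gradient, the backtracking step size, the starting point, and the toy counts. All names are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clip so the result stays strictly inside (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neg_loglik(params, counts, lengths):
    """Negative log-likelihood of a two-binomial mixture. Binomial coefficients
    are omitted because they do not depend on the parameters."""
    lam, t1, t2 = (sigmoid(p) for p in params)
    total = 0.0
    for x, n in zip(counts, lengths):
        a = math.log(lam) + x * math.log(t1) + (n - x) * math.log(1.0 - t1)
        b = math.log(1.0 - lam) + x * math.log(t2) + (n - x) * math.log(1.0 - t2)
        m = max(a, b)
        total -= m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

def numeric_grad(f, params, eps=1e-5):
    """Central finite-difference approximation of the gradient of f."""
    grad = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

def fit_mixture(counts, lengths, init=(0.0, -6.0, -3.0), steps=500):
    """Gradient descent with a halving (backtracking) step size."""
    f = lambda p: neg_loglik(p, counts, lengths)
    params = list(init)
    loss = f(params)
    for _ in range(steps):
        grad = numeric_grad(f, params)
        step = 1.0
        while step > 1e-10:
            trial = [p - step * g for p, g in zip(params, grad)]
            trial_loss = f(trial)
            if trial_loss < loss:
                params, loss = trial, trial_loss
                break
            step *= 0.5
        else:
            break  # no improving step found: treat as converged
    lam, t1, t2 = (sigmoid(p) for p in params)
    return lam, t1, t2, loss

# Hypothetical toy counts of one term in ten 100-token documents.
counts = [0, 0, 0, 0, 5, 0, 0, 6, 0, 0]
lengths = [100] * 10
start = (0.0, -6.0, -3.0)
lam, theta1, theta2, final_loss = fit_mixture(counts, lengths, init=start)
print(lam, theta1, theta2, final_loss)
```

Because every accepted step must lower the loss, the fit never ends worse than its starting point. Like any gradient method on a mixture it can still stop at a local optimum, so trying several starting points is a reasonable precaution.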
![Page 17: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/17.jpg)
Comparing Fitness
• Use log-odds to compare fitness of the two models
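The slide's formula did not survive extraction. One natural reading, offered as an assumption rather than the authors' exact definition, is the log of the ratio between the best mixture likelihood and the best single-binomial likelihood. Here x_d is the term's count in document d, n_d is that document's length, λ is the mixing weight, and θ, θ_1, θ_2 are binomial rates; all symbols are this sketch's notation.

```latex
s_{\mathrm{mix}} = \log
  \frac{\max_{\lambda, \theta_1, \theta_2} \prod_{d}
        \left[ \lambda\, \theta_1^{x_d} (1 - \theta_1)^{n_d - x_d}
             + (1 - \lambda)\, \theta_2^{x_d} (1 - \theta_2)^{n_d - x_d} \right]}
       {\max_{\theta} \prod_{d} \theta^{x_d} (1 - \theta)^{n_d - x_d}}
```

The binomial coefficients cancel between numerator and denominator. Under this sign convention a larger score means the mixture fits much better than a single binomial, i.e. the term is more modal.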
![Page 18: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/18.jpg)
Top Mixture Score Words
| Token | Score | Rest. Occur. |
| --- | --- | --- |
| sichaun | 99.62 | 31/52 |
| fish | 50.59 | 7/73 |
| was | 48.79 | 0/483 |
| speed | 44.69 | 16/19 |
| tacos | 43.77 | 4/19 |
![Page 19: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/19.jpg)
Independence
Rareness(IDF)
Modality(Mixture Score)
?
![Page 20: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/20.jpg)
Correlation Coefficient
| Score Pair | Corr. Coefficient |
| --- | --- |
| IDF/Mixture | -.0139 |
| IDF/RIDF | .4113 |
| Mixture/RIDF | .7380 |
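A small sketch of how such a coefficient could be computed for two score lists over the same vocabulary. The slide does not say which coefficient was used; this assumes the Pearson correlation, and the score values below are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-token scores, aligned by token.
idf_scores = [0.1, 0.3, 1.5, 3.2, 3.0]
mixture_scores = [2.0, 48.0, 50.0, 20.0, 30.0]
print(pearson(idf_scores, mixture_scores))
```

A value near zero, like the IDF/Mixture entry on the slide, is consistent with the two scores carrying largely independent information, although zero correlation alone does not prove independence.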
![Page 21: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/21.jpg)
Top Words Overlap Plot
• Two sorted lists
  – Sorted by IDF
  – Sorted by Mixture Score
• Look at % overlap among top N in both lists
• Plot % overlap as we vary N
• Independent scores would produce a line along the diagonal
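The overlap computation can be sketched as below. The function name and the toy scores are hypothetical. The diagonal follows from a short argument: if the two rankings are independent over a vocabulary of V words, each word in one top-N list has probability N/V of also being in the other, so the expected overlap is N/V, which grows linearly with N.

```python
def top_n_overlap(scores_a, scores_b, n):
    """Percent of the top-n tokens under scores_a that are also top-n under scores_b.
    Both arguments map token -> score; higher scores rank first."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:n])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:n])
    return 100.0 * len(top_a & top_b) / n

# Hypothetical scores for a five-word vocabulary.
idf_scores = {"the": 0.1, "was": 0.3, "fish": 1.5, "villa": 3.2, "tokyo": 3.0}
mixture_scores = {"the": 2.0, "was": 48.0, "fish": 50.0, "villa": 20.0, "tokyo": 30.0}

# One point per N gives the curve to plot against N.
curve = [top_n_overlap(idf_scores, mixture_scores, n) for n in range(1, 6)]
print(curve)
```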
![Page 22: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/22.jpg)
Overlap Plot
[Plot: percent overlap (y-axis) against number of top words (x-axis); curves for IDF/Mixture and IDF/RIDF]
![Page 23: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/23.jpg)
Top IDF*Mixture Words
| Token | Score | Rest. Occur. |
| --- | --- | --- |
| sichaun | 379.97 | 31/52 |
| villa | 197.08 | 10/11 |
| tokyo | 191.72 | 7/11 |
| ribs | 181.57 | 0/13 |
| speed | 156.23 | 16/19 |
![Page 24: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/24.jpg)
Intro to NED Experiments
• Task: Identify Restaurant Names
• Use standard NED features (capitalization, punctuation, POS) as “Baseline”
• Add informativeness score as an additional feature
• Use F1 Breakeven as performance metric
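A hedged sketch of one common way to compute a breakeven point; the slides do not say how theirs was computed. Tokens are ranked by classifier score and the top k are predicted to be restaurant-name tokens, where k is the number of truly positive tokens. At that cutoff precision and recall share the same numerator and denominator, so they are equal, and F1 equals both. The names and toy data are hypothetical.

```python
def f1_breakeven(scores, labels):
    """Precision (= recall = F1) when exactly as many tokens are predicted positive
    as are truly positive. labels are 0/1 and must contain at least one 1."""
    num_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _, label in ranked[:num_pos])
    return true_pos / num_pos

# Hypothetical classifier scores and gold labels (1 = token is in a restaurant name).
scores = [0.9, 0.8, 0.7, 0.1]
labels = [1, 0, 1, 0]
print(f1_breakeven(scores, labels))  # prints 0.5
```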
![Page 25: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/25.jpg)
NED Experiments
| Feature Set | F1 Breakeven |
| --- | --- |
| Baseline | 55.0% |
| IDF | 56.0% |
| Mixture | 56.0% |
| IDF,Mixture | 56.9% |
| Residual IDF | 57.4% |
| IDF*RIDF | 58.5% |
| IDF*Mixture | 59.3% |
Better
![Page 26: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/26.jpg)
Summary
• Traditional syntax-based features are not enough for IE in e-mail & bulletin boards
• We used term occurrence statistics to construct an informativeness score (IDF*Mixture)
• We found IDF*Mixture to be useful for identifying topic-centric words and named entities
![Page 27: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.](https://reader030.fdocuments.in/reader030/viewer/2022032612/56649ef95503460f94c0b07d/html5/thumbnails/27.jpg)
Discussion
• Phrases
• Foreign languages, Speech
• Co-reference resolution, context tracking
• Collaborative filtering