CS 6998 NLP for the Web Columbia University 04/22/2010
Analyzing Wikipedia and Gold-Standard Corpora for
NER Training
William Y. Wang, Computer Science
Nothman et al. 2009, EACL
Outline
1. Motivation
2. NER and Gold-Standard Corpora
3. The Problem: Cross-corpora Performance
4. Wikipedia for NER
5. Results
6. Conclusion and My Observation
Motivation
1. Manual annotation is “expensive”: (1) it is costly, (2) it is time-consuming, and (3) it introduces extra problems.
Can we use linguistic resources to create an NER corpus automatically?
2. What is cross-corpus NER performance like?
3. How can we utilize Web resources (e.g. Wikipedia) to improve NER?
NER Gold Corpora
1. MUC-7: locations (LOC), organizations (ORG), personal names (PER)
2. CoNLL-03: LOC, ORG, PER, miscellaneous (MISC)
3. BBN: 54 tags from the Penn Treebank
Corpus     Tags   Train Tokens   Dev Tokens   Test Tokens
MUC-7      3      83,601         18,655       60,436
CoNLL-03   4      203,621        51,362       46,435
BBN        54     901,894        142,218      129,654
Problem: Poor Cross-corpus Performance
F-scores by test corpus (columns) for each training corpus (rows):

Train    With MISC          Without MISC
         CoNLL   BBN        MUC    CoNLL   BBN
MUC      —       —          73.5   55.5    67.5
CoNLL    81.2    62.3       65.9   82.1    62.4
BBN      54.7    86.7       77.9   53.9    88.4
Corpus and Error Analysis
• N-gram tag variation: check all n-grams that appear multiple times to see whether their NE tags are consistent across occurrences
• Entity type frequency: (1) POS tags paired with NE tags (e.g. nationalities often carry JJ or NNPS), (2) wordtypes, (3) wordtypes with function words (e.g. Bank of New England -> Aaa of Aaa Aaa)
• Tag sequence confusion: look into the details of the confusion matrix
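The wordtype abstraction used in the analysis (Bank of New England -> Aaa of Aaa Aaa) can be sketched in a few lines of Python. The specific collapsing rules below (capitalized words to Aaa, all-caps to AAA, digits to 000, lowercase function words kept verbatim) are my assumption for illustration, not the paper's exact specification:

```python
def wordtype(token):
    """Collapse a token to a shape pattern, e.g. 'Bank' -> 'Aaa'.

    The buckets below (000/AAA/Aaa) are an illustrative guess at a
    wordtype scheme, not the paper's exact definition.
    """
    if token.isdigit():
        return "000"
    if token.isalpha() and token.isupper():
        return "AAA"
    if token[:1].isupper():
        return "Aaa"
    return token  # lowercase function words stay as-is

def phrase_wordtype(phrase):
    # "Bank of New England" -> "Aaa of Aaa Aaa"
    return " ".join(wordtype(tok) for tok in phrase.split())
```

Counting NE tags per wordtype pattern like this makes systematic annotation inconsistencies across corpora easy to spot.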
Using Wikipedia to Build NER Corpus
1. Classify all articles into entity classes
2. Split Wikipedia articles into sentences
3. Label NEs according to link targets
4. Select sentences for inclusion in a corpus
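Step 3 (labelling NEs according to link targets) can be sketched as follows; the article-to-class map and the IOB2 output format are assumptions for illustration, not the paper's actual data structures:

```python
# Hypothetical map from article titles to entity classes; in the
# pipeline above this would come from step 1, classifying all
# Wikipedia articles into entity classes.
ARTICLE_CLASS = {
    "Australia": "LOC",
    "Bank of New England": "ORG",
}

def label_sentence(tokens, links):
    """links: (start, end, target_article) token spans.
    Returns IOB2 NE tags derived from each link target's class."""
    tags = ["O"] * len(tokens)
    for start, end, target in links:
        cls = ARTICLE_CLASS.get(target)
        if cls is None:
            continue  # target not classified: leave untagged
        tags[start] = "B-" + cls
        for i in range(start + 1, end):
            tags[i] = "I-" + cls
    return tags
```

For example, `label_sentence(["He", "visited", "Australia"], [(2, 3, "Australia")])` yields `["O", "O", "B-LOC"]`.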
Improve Wikipedia NER
• Baseline: 58.9% and 62.3% F-score on CoNLL and BBN
1. Inferring extra links using Wikipedia disambiguation pages
2. Personal titles: not all titles preceding a name indicate PER (e.g. Prime Minister of Australia)
3. Previously missed adjectival (JJ) entities (e.g. American / MISC)
4. Miscellaneous changes
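The personal-title problem (item 2) can be illustrated with a toy heuristic. The title lexicon and the "of"-phrase test below are my invention for illustration, not the paper's actual rule:

```python
# Illustrative title lexicon, not the paper's.
TITLES = {"President", "Prime", "Minister", "Senator"}

def title_precedes_per(tokens, i):
    """Heuristic: a run of title words at position i signals a
    following PER span, unless the run heads an 'of'-phrase such
    as 'Prime Minister of Australia', where the whole phrase
    names a role rather than a person."""
    j = i
    while j < len(tokens) and tokens[j] in TITLES:
        j += 1
    return j > i and not (j < len(tokens) and tokens[j] == "of")
```

So `["Senator", "Smith"]` keeps the PER reading, while `["Prime", "Minister", "of", "Australia"]` does not.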
Results

Train    With MISC          Without MISC
         CoNLL   BBN        MUC    CoNLL   BBN
MUC      —       —          82.3   54.9    69.3
CoNLL    85.9    61.9       69.9   86.9    60.2
BBN      59.4    86.5       80.2   59.0    88.0
WP0      62.8    69.7       69.7   64.7    70.0
WP1      67.2    73.4       75.3   67.7    73.6
WP2      69.0    74.0       76.6   69.4    75.1
WP3      68.9    73.5       77.2   69.5    73.7
WP4      66.2    72.3       75.6   67.3    73.3
DEV set results (higher than, but similar in pattern to, the test set results)
Conclusion
• The choice of training corpus has a huge impact on NER performance on the corresponding test set
• Annotation-free, Wikipedia-derived NER corpora were created
• Wikipedia-derived data performs better on the cross-corpus NER task
• There is still much room for improvement
Comments
What I like about this paper:
• The scope of this paper is unique (analogy: cross-cultural studies)
• It utilizes a novel linguistic resource to solve a basic NLP problem
• Good results
• Relatively clear and easy to understand
What I don’t like about this paper:
• The overall method for improving Wikipedia NER training is not a principled approach
Overall Assessment:
8/10
Thank you!