
Transcript of Named-Entity Recognition with Character-Level Models

Page 1: Named-Entity Recognition with Character-Level Models

Named-Entity Recognition with Character-Level Models

Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning

Stanford University

CoNLL-2003: Seventh Conference on Natural Language Learning


Page 2: Named-Entity Recognition with Character-Level Models


Unknown Words are a Central Challenge for NER

Recognizing known named entities (NEs) is relatively simple and accurate

Recognizing novel NEs requires exploiting context and/or word-internal features

External context and frequent internal words (e.g. “Inc.”) are the most commonly used features

The internal composition of NEs alone provides surprisingly strong evidence for classification (Smarr & Manning, 2002): Staffordshire, Abdul-Karim al-Kabariti, CentrInvest

Page 3: Named-Entity Recognition with Character-Level Models


Are Names Self-Describing?

NO: names can be opaque/ambiguous

Word-Level: “Washington” occurs as LOC, PER, and ORG

Char-Level: “-ville” suggests LOC, but there are exceptions like “Neville”

YES: names can be highly distinctive/descriptive

Word-Level: “National Bank” is a bank (i.e. ORG)

Char-Level: “Cotramoxazole” is clearly a drug name

Question: Overall, how informative are names alone?

Page 4: Named-Entity Recognition with Character-Level Models


How Internally Descriptive are Isolated Named Entities?

Classification accuracy of pre-segmented CoNLL NEs without context is ~90%

Using character n-grams as features instead of words yields a 25% error reduction

On single-word unknown NEs, the word model is at chance; the char n-gram model fixes 38% of errors

[Chart: NE classification accuracy (%) — not the CoNLL task]

All NEs: Words 89.1, Char N-Grams 91.8

Single-word UNKs: Words 37.5, Char N-Grams 60.7

Page 5: Named-Entity Recognition with Character-Level Models


Exploiting Word-Internal Features

Many existing systems use some word-internal features (suffix, capitalization, punctuation, etc.)

e.g. Mikheev 97, Wacholder et al. 97, Bikel et al. 97

Features are usually language-dependent (e.g. morphology)

Our approach: use char n-grams as the primary representation

Use all substrings as classification features:

Char n-grams subsume word features

Features are language-independent (assuming the language is alphabetic)

Similar in spirit to Cucerzan and Yarowsky (1999), but uses ALL char n-grams vs. just prefix/suffix

#Tom# → #Tom#, #Tom, Tom#, #To, Tom, om#, #T, To, om, m#, T, o, m
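As a concrete illustration of the substring expansion above, here is a minimal Python sketch; the function name is illustrative, not from the paper:

```python
def char_substring_features(word):
    """All character substrings of the boundary-marked word, as in the
    #Tom# example above; substrings consisting only of the boundary
    marker itself are dropped, matching the listed output."""
    marked = "#" + word + "#"
    feats = set()
    for n in range(1, len(marked) + 1):
        for i in range(len(marked) - n + 1):
            sub = marked[i:i + n]
            if sub.strip("#"):          # skip pure-marker substrings like "#"
                feats.add(sub)
    return feats

# char_substring_features("Tom") ->
# {'#Tom#', '#Tom', 'Tom#', '#To', 'Tom', 'om#',
#  '#T', 'To', 'om', 'm#', 'T', 'o', 'm'}
```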

Page 6: Named-Entity Recognition with Character-Level Models


Character-Feature Based Classifier

Model I: Independent classification at each word

Maxent classifiers, trained using conjugate gradient

Equal-scale Gaussian priors for smoothing

Trained models with >800K features in ~2 hrs

POS tags and contextual features complement n-grams

Description         Added Features                              Overall F1 (English Dev)
Words               w0                                          52.29
Official Baseline   -                                           71.18
Char N-Grams        n(w0)                                       73.10
POS Tags            t0                                          74.17
Simple Context      w-1, w0, t-1, t1                            82.39
More Context        ‹w-1, w0›, ‹w0, w1›, ‹t-1, t0›, ‹t0, w1›    83.09
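A minimal sketch of Model I's setup, assuming scikit-learn: L2-penalized logistic regression stands in for a maxent classifier with an equal-scale Gaussian prior, and char_substring_features is the helper sketched earlier. Feature names mirror the table above but are otherwise illustrative:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(words, tags, i):
    """Feature dict for word i: char substrings plus the word/POS/context
    features from the table above (w0, t0, w-1, t-1, t1)."""
    f = {"ng=" + g: 1.0 for g in char_substring_features(words[i])}
    f["w0=" + words[i]] = 1.0
    f["t0=" + tags[i]] = 1.0
    if i > 0:
        f["w-1=" + words[i - 1]] = 1.0
        f["t-1=" + tags[i - 1]] = 1.0
    if i + 1 < len(words):
        f["t1=" + tags[i + 1]] = 1.0
    return f

# L2-penalized logistic regression is a maxent classifier with a
# Gaussian prior; C sets the prior scale. (The paper trained with
# conjugate gradient; sklearn's default lbfgs is a stand-in here.)
model = make_pipeline(DictVectorizer(), LogisticRegression(C=1.0, max_iter=500))
# model.fit(X, y) with X = [token_features(ws, ts, i), ...] and y = NE labels
```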

Page 7: Named-Entity Recognition with Character-Level Models


Character-Based CMM

Model II: Joint classification along the sequence

Previous classification decisions are clearly relevant: “Grace Road” is a single location, not a person + location

Include neighboring classification decisions as features

Perform joint inference across the chain of classifiers

Conditional Markov Model (CMM, a.k.a. maxent Markov model): Borthwick 1999, McCallum et al. 2000
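The slides don't spell out the inference procedure; left-to-right beam search over the chain of local classifiers is one standard way to do this joint inference in a CMM. Here, local_log_probs is a hypothetical callback standing in for the trained maxent classifier:

```python
def beam_decode(words, local_log_probs, beam_size=5):
    """Left-to-right beam search over a chain of local classifiers.
    local_log_probs(words, i, prev_labels) is assumed to return a
    {label: log_prob} dict for position i, conditioned on the labels
    already assigned (so the s-1, s-2 features are visible to it)."""
    beam = [([], 0.0)]                      # (label sequence, total log prob)
    for i in range(len(words)):
        candidates = []
        for labels, score in beam:
            for label, lp in local_log_probs(words, i, labels).items():
                candidates.append((labels + [label], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]       # keep the best partial sequences
    return beam[0][0]                       # highest-scoring full sequence
```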

Page 8: Named-Entity Recognition with Character-Level Models


Character-Based CMM

Final extra features:

Letter-type patterns for each word: United → Xx, 12-month → d-x, etc.

Conjunction features, e.g. previous state and current signature

Repeated last words of multi-word names, e.g. Jones after having seen Doug Jones

… and a few more
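A minimal sketch of the letter-type signature, assuming adjacent character classes collapse (consistent with the United → Xx and 12-month → d-x examples above):

```python
def word_signature(word):
    """Collapse a word to its letter-type pattern, e.g.
    "United" -> "Xx", "12-month" -> "d-x"."""
    sig = []
    for ch in word:
        if ch.isupper():
            cls = "X"
        elif ch.islower():
            cls = "x"
        elif ch.isdigit():
            cls = "d"
        else:
            cls = ch                     # keep punctuation as-is
        if not sig or sig[-1] != cls:    # collapse adjacent repeats
            sig.append(cls)
    return "".join(sig)

assert word_signature("United") == "Xx"
assert word_signature("12-month") == "d-x"
# A conjunction feature then pairs the previous state with this
# signature, e.g. "s-1=O&sig=Xx".
```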

Description                  Added Features                              Overall F1 (English Dev)
More Context                 ‹w-1, w0›, ‹w0, w1›, ‹t-1, t0›, ‹t0, w1›    83.09
Simple Sequence              s-1, ‹s-1, t-1, t0›                         85.44
More Sequence                ‹s-2, s-1›, ‹s-2, s-1, t-1, t0›             87.21
Final misc. extra features   -                                           92.27

Page 9: Named-Entity Recognition with Character-Level Models


Final Results

The drop from English dev to test is largely due to inconsistent labeling

The lack of capitalization cues in German hurts recall more, because the maxent classifier is precision-biased when faced with weak evidence

[Chart: Precision, Recall, and F1 by dataset; F1 values shown]

Eng Dev 92.27, Eng Test 86.31, Ger Dev 67.03, Ger Test 71.90

Page 10: Named-Entity Recognition with Character-Level Models


Conclusions

Character substrings are valuable and underexploited model features

Named entities are internally quite descriptive: 25-30% error reduction vs. word-level models

Discriminative maxent models allow productive feature engineering: 30% error reduction vs. the basic model

What distinguishes our approach? More and better features; regularization is crucial for preventing overfitting