Towards a semantic extraction of named entities Diana Maynard, Kalina Bontcheva, Hamish Cunningham...

Towards a semantic extraction of named entities

Diana Maynard, Kalina Bontcheva, Hamish Cunningham

University of Sheffield, UK

Introduction

• Challenges posed by progression from traditional IE to a more semantic representation of NEs

• What techniques are best for the deeper level of analysis necessary?

• Can traditional rule-based methods cope with such a transition, or does the future lie solely with machine learning?

The ACE program

“A program to develop technology to extract and characterise meaning from human language”

Aims:

• produce structured information about entities, events and the relations that hold between them

• promote design of more generic systems rather than those tuned to a very specific domain and text type (as with MUC)

The ACE tasks

• Identification of entities and classification into semantic types (Person, Organisation, Location, GPE, Facility)

• Identification and coreference of all mentions of each entity in the text (name, pronominal, nominal)

• Identification of relations holding between such entities
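The three tasks suggest a simple data model: entities with a semantic type, mentions grouped under each entity, and relations between entities. The following is a minimal sketch of such a model, not the ACE annotation format itself. The class names, the sample sentence and the relation label `LOCATED_AT` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Entity and mention types as listed on the slide.
ENTITY_TYPES = {"Person", "Organisation", "Location", "GPE", "Facility"}
MENTION_TYPES = {"name", "pronominal", "nominal"}


@dataclass
class Mention:
    text: str
    start: int
    end: int
    mention_type: str


@dataclass
class Entity:
    entity_id: str
    entity_type: str
    mentions: List[Mention] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")

    def add_mention(self, document: str, surface: str, mention_type: str) -> Mention:
        """Attach a coreferent mention, located by its first occurrence in the text."""
        if mention_type not in MENTION_TYPES:
            raise ValueError(f"unknown mention type: {mention_type}")
        start = document.index(surface)
        mention = Mention(surface, start, start + len(surface), mention_type)
        self.mentions.append(mention)
        return mention


@dataclass
class Relation:
    relation_type: str  # hypothetical label, not taken from the ACE guidelines
    arg1_id: str
    arg2_id: str


# Illustrative sentence (not from the ACE corpus).
document = "Tony Blair visited Sheffield. He met the city's mayor."

blair = Entity("E1", "Person")
blair.add_mention(document, "Tony Blair", "name")
blair.add_mention(document, "He", "pronominal")

sheffield = Entity("E2", "GPE")
sheffield.add_mention(document, "Sheffield", "name")
sheffield.add_mention(document, "the city", "nominal")

visit = Relation("LOCATED_AT", blair.entity_id, sheffield.entity_id)
```

Grouping mentions under one entity is what makes coreference explicit: "Tony Blair" and "He" are two mentions of the same entity rather than two separate annotations.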

The MACE System

• Rule-based NE system developed within GATE, adapted from ANNIE

• PRs: tokeniser, sentence splitter, POS tagger, gazetteer, semantic tagger, orthomatcher, pronominal and nominal coreferencer

• Also: genre ID, switching controller to select different PRs automatically
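The idea of a switching controller can be sketched as a pipeline whose list of processing resources (PRs) is chosen after a genre identification step. This is a toy illustration in plain Python, not GATE's API. The genre names, the lowercase heuristic and the genre-specific tagger names are all assumptions.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]


def make_pr(name: str) -> Callable[[Doc], Doc]:
    """Build a stand-in PR that only records that it ran."""
    def run(doc: Doc) -> Doc:
        doc.setdefault("applied", []).append(name)
        return doc
    return run


# PR order as listed on the slide; the semantic tagger slot is filled per genre.
BEFORE_TAGGER = ["tokeniser", "sentence splitter", "POS tagger", "gazetteer"]
AFTER_TAGGER = ["orthomatcher", "pronominal coreferencer", "nominal coreferencer"]

# Hypothetical genre-specific pipelines.
PIPELINES: Dict[str, List[str]] = {
    "newswire": BEFORE_TAGGER + ["semantic tagger (newswire)"] + AFTER_TAGGER,
    "broadcast": BEFORE_TAGGER + ["semantic tagger (broadcast)"] + AFTER_TAGGER,
}


def identify_genre(doc: Doc) -> str:
    """Assumed heuristic: text without capitalisation is treated as a broadcast transcript."""
    text = str(doc["text"])
    return "broadcast" if text == text.lower() else "newswire"


def run_switching_controller(text: str) -> Doc:
    doc: Doc = {"text": text}
    doc["genre"] = identify_genre(doc)
    for pr_name in PIPELINES[str(doc["genre"])]:
        doc = make_pr(pr_name)(doc)
    return doc


doc = run_switching_controller("tony blair visited sheffield")
```

The design point is that the choice of PRs is data (a lookup keyed by genre), so adding a genre means adding an entry rather than changing the controller.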

Differences between ANNIE and MACE

• Locations are split into Location / GPE

• GPEs have roles (GPE, Per, Org, Loc)

• New type Facility (subsumes some Orgs)

• Metonymy means context is necessary for disambiguation (e.g. England cricket team vs England country)

• No Date, Time, Money, Percent, Address, Identifier
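The metonymy point can be illustrated with a context check around a gazetteer match. The sketch below assigns a role to a GPE name by looking at the following tokens. The cue words and the two-token window are assumptions chosen for the England example. MACE itself does this with JAPE grammars, which are not reproduced here.

```python
from typing import List

# Hypothetical cue words suggesting the GPE name stands for an organisation.
ORG_CUES = {"team", "squad", "side"}


def gpe_role(tokens: List[str], index: int, window: int = 2) -> str:
    """Return 'Org' if an organisation cue follows the GPE name, else 'GPE'."""
    following = tokens[index + 1 : index + 1 + window]
    if any(token.lower() in ORG_CUES for token in following):
        return "Org"
    return "GPE"


team_role = gpe_role("The England cricket team won".split(), 1)
country_role = gpe_role("He lives in England".split(), 3)
```

The same gazetteer entry yields two different roles, which is why the adaptation changes the grammars rather than the gazetteer lists.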

What does this mean in practical terms?

• Separation of specific from general information makes adaptation easier

• Reclassification of gazetteers unnecessary

• Changes mainly to semantic grammars to

- use different gazetteer lookups

- use more contextual information

- group rules together differently

Semantic Grammars

• ANNIE uses 21 phases, 187 rules, 9 entity types (av. 20.8 rules per entity type)

• MACE uses 15 phases, 180 rules, 5 entity types (av. 36 rules per entity type)

• The important factor is the increased complexity of new rules, rather than the number

• Rules may be hand-crafted, but an experienced JAPE user can write several rules per minute

• 6 weeks for adaptation
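The per-type averages quoted above follow directly from the rule and entity type counts. A short check, using only the figures on this slide:

```python
# Counts as given on the slide.
grammars = {
    "ANNIE": {"phases": 21, "rules": 187, "entity_types": 9},
    "MACE": {"phases": 15, "rules": 180, "entity_types": 5},
}


def rules_per_entity_type(grammar: dict) -> float:
    """Average number of rules per entity type, rounded to one decimal place."""
    return round(grammar["rules"] / grammar["entity_types"], 1)


averages = {name: rules_per_entity_type(g) for name, g in grammars.items()}
```

MACE has slightly fewer rules overall but covers fewer types, so the rule density per type is noticeably higher.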

Evaluation (1)

Text               Precision   Recall   Fmeasure

ACE                82.4        82       82.2

MUC ENAMEX only    89          90       89.5
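The F-measure column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The slide does not state the formula, so equal weighting is an assumption here. A quick check against the reported figures:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


ace_f = round(f_measure(82.4, 82.0), 1)
muc_f = round(f_measure(89.0, 90.0), 1)
```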

Evaluation (2)

• NEWS – 92 articles (business news)

• ACE – 86 broadcast news texts from the September 2002 evaluation

• Difference on ACE task

• MACE on MUC-style annotations

- GPEs are left as GPE (so count as errors)

- GPEs are mapped to Locations
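The effect of the two scoring conditions can be sketched by comparing system output with MUC-style gold annotations, once with GPE left as it is and once with GPE mapped to Location. The annotations below are toy data invented for illustration, not taken from the evaluation. Exact span and type matching is also an assumption.

```python
from typing import Set, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, entity type)


def map_gpe_to_location(annotations: Set[Annotation]) -> Set[Annotation]:
    """Rewrite every GPE annotation as a Location, leaving other types unchanged."""
    return {(s, e, "Location" if t == "GPE" else t) for s, e, t in annotations}


def precision_recall(system: Set[Annotation], gold: Set[Annotation]) -> Tuple[float, float]:
    """Exact-match precision and recall over (start, end, type) triples."""
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy MUC-style gold standard: there is no GPE type, only Location.
gold = {(0, 10, "Person"), (19, 28, "Location"), (40, 47, "Organisation")}
# Toy system output using the ACE-style GPE type.
system = {(0, 10, "Person"), (19, 28, "GPE"), (50, 55, "Location")}

unmapped = precision_recall(system, gold)
mapped = precision_recall(map_gpe_to_location(system), gold)
```

In the unmapped condition the span at 19 to 28 counts as both a spurious GPE and a missed Location, so it lowers precision and recall at once. Mapping removes that penalty without changing what the system actually found.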

Comparison of ANNIE vs MACE

[Bar chart: Precision, Recall and Fmeasure (scale 0 to 100) for each system: ANNIE-Ace, ANNIE-News, MACE-Ace, MACE-News]

72% Precision, 84% Recall if GPEs mapped to Locations

Conclusions

• MACE is a rule-based NE system, in contrast with most systems which use ML.

• Its advantages are that it doesn’t require much training data and is fast to adapt because of its robust design

• If large amounts of training data are available, HMM-based systems tend to perform slightly better

• Rule-based systems tend to be good at recall but sometimes low on precision unless supported additionally by ML methods
