Towards a semantic extraction of named entities
Diana Maynard, Kalina Bontcheva, Hamish Cunningham
University of Sheffield, UK
Introduction
• Challenges posed by progression from traditional IE to a more semantic representation of NEs
• What techniques are best for the deeper level of analysis necessary?
• Can traditional rule-based methods cope with such a transition, or does the future lie solely with machine learning?
The ACE program
“A program to develop technology to extract and characterise meaning from human language”
Aims:
• produce structured information about entities, events and the relations that hold between them
• promote design of more generic systems rather than those tuned to a very specific domain and text type (as with MUC)
The ACE tasks
• Identification of entities and classification into semantic types (Person, Organisation, Location, GPE, Facility)
• Identification and coreference of all mentions of each entity in the text (name, pronominal, nominal)
• Identification of relations holding between such entities
<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE" entity_type = "ORGANIZATION">
  <entity_mention ID="M003" TYPE = "NAME" string = "National Air Traffic Services"> </entity_mention>
  <entity_mention ID="M004" TYPE = "NAME" string = "NATS"> </entity_mention>
  <entity_mention ID="M005" TYPE = "PRO" string = "its"> </entity_mention>
  <entity_mention ID="M006" TYPE = "NAME" string = "Nats"> </entity_mention>
</entity>
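The entity record above can be read with a standard XML parser. The following is a minimal sketch, assuming the record is available as a string. The function name `mentions_by_type` is an illustrative choice and is not part of the ACE tooling.

```python
import xml.etree.ElementTree as ET

# The ACE entity record from the slide, copied as-is.
ACE_ENTITY = """
<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE" entity_type = "ORGANIZATION">
  <entity_mention ID="M003" TYPE = "NAME" string = "National Air Traffic Services"> </entity_mention>
  <entity_mention ID="M004" TYPE = "NAME" string = "NATS"> </entity_mention>
  <entity_mention ID="M005" TYPE = "PRO" string = "its"> </entity_mention>
  <entity_mention ID="M006" TYPE = "NAME" string = "Nats"> </entity_mention>
</entity>
"""


def mentions_by_type(xml_text):
    """Return the entity type and its mention strings grouped by mention TYPE."""
    entity = ET.fromstring(xml_text.strip())
    grouped = {}
    for mention in entity.findall("entity_mention"):
        grouped.setdefault(mention.get("TYPE"), []).append(mention.get("string"))
    return entity.get("entity_type"), grouped


entity_type, grouped = mentions_by_type(ACE_ENTITY)
print(entity_type)
for mention_type, strings in grouped.items():
    print(mention_type, strings)
```

All four mentions (three names and one pronoun) belong to a single entity, which is what the coreference task requires.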
The MACE System
• Rule-based NE system developed within GATE, adapted from ANNIE
• PRs: tokeniser, sentence splitter, POS tagger, gazetteer, semantic tagger, orthomatcher, pronominal and nominal coreferencer
• Also: genre ID, switching controller to select different PRs automatically
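The idea of a switching controller can be sketched as a lookup from genre to a list of PRs. This is not GATE code: the PR names come from the slide, but the genre label `genre-x`, the variant gazetteer and the stand-in PR functions are all hypothetical and exist only to show the control flow.

```python
from typing import Callable, Dict, List

Trace = List[str]

# PR names as listed on the slide.
ALL_PRS = [
    "tokeniser", "sentence splitter", "POS tagger", "gazetteer",
    "semantic tagger", "orthomatcher",
    "pronominal coreferencer", "nominal coreferencer",
]


def make_pr(name: str) -> Callable[[Trace], Trace]:
    """Build a stand-in PR that only records that it ran (no real processing)."""
    def run(trace: Trace) -> Trace:
        return trace + [name]
    return run


# Hypothetical genre-to-pipeline table: "genre-x" swaps in a variant gazetteer.
PIPELINES: Dict[str, List[str]] = {
    "default": ALL_PRS,
    "genre-x": [
        "gazetteer (hypothetical variant)" if name == "gazetteer" else name
        for name in ALL_PRS
    ],
}


def run_pipeline(genre: str) -> Trace:
    """Select the PR list for the identified genre and run the PRs in order."""
    names = PIPELINES.get(genre, PIPELINES["default"])
    trace: Trace = []
    for name in names:
        trace = make_pr(name)(trace)
    return trace


print(run_pipeline("genre-x"))
```

Keeping the genre decision in one table means the individual PRs do not need to know which genre they are processing.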
Differences between ANNIE and MACE
• Locations are split into Location / GPE
• GPEs have roles (GPE, Per, Org, Loc)
• New type Facility (subsumes some Orgs)
• Metonymy means context is necessary for disambiguation (e.g. England cricket team vs England country)
• No Date, Time, Money, Percent, Address, Identifier
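The England example can be illustrated with a toy context rule. This is a sketch in Python, not a JAPE grammar from MACE. The cue words in `ORG_CUES` and the two-word context window are assumptions chosen for illustration.

```python
import re

# Hypothetical cue words suggesting a GPE name is used for an organisation.
ORG_CUES = {"team", "squad"}


def gpe_role(text, name):
    """Guess the role of a GPE name from up to two following words.

    Returns "Org" if a cue word follows, "GPE" otherwise, or None if the
    name does not occur in the text.
    """
    pattern = re.compile(rf"\b{re.escape(name)}\b((?:\s+\w+){{0,2}})")
    match = pattern.search(text)
    if match is None:
        return None
    following = set(match.group(1).lower().split())
    return "Org" if ORG_CUES & following else "GPE"


print(gpe_role("The England cricket team won the match.", "England"))
print(gpe_role("She has lived in England for years.", "England"))
```

A gazetteer lookup alone would label both occurrences identically. Only the surrounding words separate the two readings, which is why the rules need more contextual information.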
What does this mean in practical terms?
• Separation of specific from general information makes adaptation easier
• Reclassification of gazetteers unnecessary
• Changes mainly to semantic grammars to
- use different gazetteer lookups
- use more contextual information
- group rules together differently
Semantic Grammars
• ANNIE uses 21 phases, 187 rules, 9 entity types (av. 20.8 rules per entity type)
• MACE uses 15 phases, 180 rules, 5 entity types (av. 36 rules per entity type)
• The important factor is the increased complexity of new rules, rather than the number
• Rules may be hand-crafted, but an experienced JAPE user can write several rules per minute
• 6 weeks for adaptation
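The per-type averages quoted above follow directly from the rule and entity-type counts. A short check, using only the figures on the slide:

```python
# Grammar statistics as given on the slide.
systems = {
    "ANNIE": {"phases": 21, "rules": 187, "entity_types": 9},
    "MACE": {"phases": 15, "rules": 180, "entity_types": 5},
}


def rules_per_type(stats):
    """Average number of rules per entity type."""
    return stats["rules"] / stats["entity_types"]


for name, stats in systems.items():
    print(f"{name}: {rules_per_type(stats):.1f} rules per entity type")
```

MACE has slightly fewer rules overall but covers fewer entity types, so the number of rules per type is noticeably higher.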
Evaluation (1)
Text               Precision   Recall   F-measure
ACE                82.4        82       82.2
MUC ENAMEX only    89          90       89.5
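The slide does not say how F-measure was computed. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall), which reproduces the reported figures.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (an assumption here)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide.
results = {"ACE": (82.4, 82.0), "MUC ENAMEX only": (89.0, 90.0)}

for text, (p, r) in results.items():
    print(f"{text}: F = {f_measure(p, r):.1f}")
```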
Evaluation (2)
• NEWS – 92 articles (business news)
• ACE – 86 broadcast news texts from the September 2002 evaluation
• Difference on ACE task
• MACE on MUC-style annotations
- GPEs are left as GPE (so count as errors)
- GPEs are mapped to Locations
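The effect of the two scoring options can be shown on toy data. The sketch below assumes strict exact-match scoring over (start, end, type) triples. The annotations are invented for illustration and are not taken from the evaluation corpus.

```python
def map_types(annotations, mapping):
    """Rename entity types according to mapping; other types are left unchanged."""
    return {(start, end, mapping.get(etype, etype)) for start, end, etype in annotations}


def precision_recall(system, gold):
    """Strict scoring: an annotation is correct only if span and type both match."""
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: MUC-style gold standard vs. ACE-style system output.
gold = {(0, 7, "Location"), (12, 20, "Person")}
system = {(0, 7, "GPE"), (12, 20, "Person")}

unmapped = precision_recall(system, gold)
mapped = precision_recall(map_types(system, {"GPE": "Location"}), gold)
print("GPEs left as GPE:       ", unmapped)
print("GPEs mapped to Location:", mapped)
```

With GPEs left unmapped, a correctly located span is still counted as an error because its type does not match the MUC-style key.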
Comparison of ANNIE vs MACE
[Bar chart: Precision, Recall and F-measure (scale 0 to 100) for four system/corpus pairs: ANNIE-Ace, ANNIE-News, MACE-Ace, MACE-News]
72% Precision, 84% Recall if GPEs mapped to Locations
Conclusions
• MACE is a rule-based NE system, in contrast with most systems which use ML.
• Advantages: it doesn’t require much training data, and it is fast to adapt because of its robust design
• If large amounts of training data are available, HMM-based systems tend to perform slightly better
• Rule-based systems tend to be good at recall but sometimes low on precision unless supported additionally by ML methods