HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor...
Getting Things
Gregor William Stewart
Director of Product Management, Text Analytics
Basis Technology Corporation
Introduction
Product Manager for Text Analytics, including:
– Rosette Linguistics Platform
– Entity Analytics
– Name Indexing and Translation
– Chat Translator
– Highlight
Questing for:
– Quality: accuracy, performance
– Coverage: languages, domains, genres
– Integration: tasks, workflows, UX
– Innovation: new aggregates, functions
Overview
Source | Tasks | Technologies | Adaptations
– Source: Properties, Comparison, Description
– Tasks: Problem(s), Challenge(s), Approach(es), Solution, Signal(s), Action (Input/Output)
– Technologies: Components, Process, Adaptation Opportunities
– Adaptations: “Out of Box”, Suggested Adaptations, Potential Benefits, Costs
Focus on entity analytics in four stages of the processing and exploitation of SOCOM-2012-0000011-HT
Reaching “state of the art” in practice means adapting to source, task and user.
Source: SOCOM-2012-0000011-HT
An Arabic language source document
Letters/emails from one colleague to others, regarding policy
Written years before it was acquired, processed
Perhaps imperfectly transcribed, or OCRed into our forensics platform
Of uncertain provenance, content, value
Not a current web news article for wide consumption, with metadata
Dimensions of variation: Form, Domain, Vocabulary, “Grammar”
Task: Triage
Triage: should we process further and/or urgently?
Too few trained, trusted linguists to review all the documents in time
Enable a non-linguist to do a linguist’s job
Gisting: MT All vs. MT Names alone
Combine Entity Extraction with Specialized Machine Translation
Integrate into Triage workflow
Signal: Documents Selected (How are guidelines interpreted?)
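The gisting idea on this slide, translating extracted names alone rather than the whole document, can be sketched as a toy. The gazetteer, the name-translation table, and the sample strings below are all invented stand-ins for the real extraction and name-translation components:

```python
# Toy sketch of name-only gisting for triage: translate only the extracted
# entity names and show them to a non-linguist reviewer, instead of
# machine-translating the full document. All names/data here are invented.

NAME_TRANSLATIONS = {  # name-translation resource (illustrative entries)
    "حلب": "Aleppo",
    "دمشق": "Damascus",
}

def extract_names(text, gazetteer):
    """Toy extractor: return gazetteer entries found in the text."""
    return [name for name in gazetteer if name in text]

def gist(text):
    """Return translated names only, as a triage gist."""
    names = extract_names(text, NAME_TRANSLATIONS)
    return [NAME_TRANSLATIONS[n] for n in names]
```

A real system would use a trained extractor and a name-translation engine; the point is only that the gist surfaces decision-relevant names without a full MT pass.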
Technology: Entity Extraction (1)
Technology: Entity Extraction (2)
Technology: Entity Extraction (3)
[Diagram: entity extraction pipeline. Input text passes through a Probabilistic Extractor (supervised and unsupervised models, trained on domain text) and a Deterministic Extractor (exact match against gazetteers, including user-defined lists; pattern match via regexes, including user-defined patterns). Candidate entities go through joining, filtering, and overlap adjudication; an Entity Redactor produces the tagged output text.]
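The joining and overlap-adjudication stage shown in the pipeline above might look like this minimal sketch. The span format and the policy ("deterministic beats probabilistic, then longer span wins") are assumptions for illustration, not the product's actual rules:

```python
# Minimal sketch of joining + overlap adjudication: spans from a
# probabilistic and a deterministic extractor are merged, and overlapping
# spans are adjudicated. Span format: (start, end, type, source).

def overlaps(a, b):
    """True if the two (start, end, ...) spans overlap."""
    return a[0] < b[1] and b[0] < a[1]

def adjudicate(spans):
    """Keep non-overlapping spans, preferring deterministic, then longer."""
    ranked = sorted(spans, key=lambda s: (s[3] != "det", -(s[1] - s[0])))
    kept = []
    for s in ranked:
        if not any(overlaps(s, k) for k in kept):
            kept.append(s)
    return sorted(kept)
```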
Adaptation: Entity Extraction to Triage
Out of the box:
– False +/- because contextual cues are fewer/different.
– Weapon in this document missed, because not a default entity type.
Adaptation:
– Add custom entity type(s) via deterministic extractor, e.g. weapons list
Benefit:
– Highlights important documents that might otherwise be missed.
– Fast, and unlikely to affect performance of other components
Difficulties:
– Requires forethought and maintenance of lists and patterns in many languages, but much less work than developing a new model
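A hedged sketch of the suggested adaptation: adding a custom WEAPON type through the deterministic extractor, via a user-defined list plus a user-defined pattern. The list entries and the regex are illustrative, not a real product resource:

```python
# Sketch of a custom "WEAPON" entity type via the deterministic extractor:
# an exact-match list (gazetteer) plus a regex pattern. Entries and pattern
# are invented for illustration.
import re

WEAPON_LIST = {"rpg-7", "ak-47"}                        # user-defined list
WEAPON_PATTERN = re.compile(r"\b[A-Z]{2,3}-\d{1,3}\b")  # user-defined pattern

def extract_weapons(text):
    """Return weapon mentions found by exact match or by pattern."""
    hits = set()
    lowered = text.lower()
    for term in WEAPON_LIST:
        if term in lowered:
            hits.add(term.upper())
    hits.update(m.group(0) for m in WEAPON_PATTERN.finditer(text))
    return sorted(hits)
```

Because this runs alongside the statistical models rather than replacing them, it adds coverage for the missing type without retraining anything.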
Task: Translation
Produce standardized, “user language” versions of the source document
Too few translators; name standardization is particularly labor intensive
Speed up translation without compromising quality
MT All reduces translation productivity
NER, Coref, and Name Translation/Standardization
Signals: Resource Selections, Corrections, Resolutions
Adaptation: Extraction to Translation (1)
Out of the box:
– Same problems as in the Gisting case, only now they matter more.
Adaptation:
– Train unsupervised model to help with form and domain differences
– Tune co-reference algorithm to most important entity types
– Develop form/domain-specific resource sets, and allow users to select them.
Benefit:
– Fewer errors in highlighting should mean translation actually speeds up
Difficulties:
– Often hard to amass a big enough corpus of like material for model building.
– Form/Domain may be ephemeral
Adaptation: Extraction to Translation (2)
Unsupervised algorithm clusters words with distributional similarities together
Word cluster ID is one feature used in learning the sequence model
Based on Collins & Singer (1999)
Part of REX Field Training Kit
Shown: random sample of words clustered with “Aleppo” in a ~10GB English model
Note they’re almost all LOCs
Would an annotated training corpus ever cover so many remote entities?
Thanks: Itai Rolnick
$ cat en_wc.txt | grep -i " aleppo " | tr ' ' '\n' | shuf | head
Loveland -- City in Colorado
Svetogorsk -- Town in Russia
MASSOUD -- ?Probably also of a village.
Atiak -- Town in Uganda
Waltha -- typo for Waltham? - town in Mass
BASILICA -- type of Church?
Sapukai -- Town in Paraguay
Yeisk -- Town in Russia
Descoberto -- Town in Brazil
SINKHOLE -- ? A pub in Belgium ??
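How a cluster ID feeds the sequence model as a feature can be sketched as follows. The cluster assignments below are invented for illustration; the point is that an unseen place name inherits the cluster of distributionally similar known names, so the model generalizes beyond its annotated corpus:

```python
# Sketch of word-cluster IDs as sequence-model features: words that behave
# alike distributionally share a cluster ID, so rare names get useful
# features even if unseen in training. Cluster values below are invented.

WORD_CLUSTERS = {  # word -> cluster ID from unsupervised clustering
    "Aleppo": 101, "Loveland": 101, "Yeisk": 101,  # location-like cluster
    "ran": 207, "walked": 207,                     # verb-like cluster
}

def token_features(tokens, i):
    """Build a feature dict for token i, including its cluster ID."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "cluster": WORD_CLUSTERS.get(word, -1),  # -1 = not in any cluster
        "is_title": word.istitle(),
    }
```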
Task: Cataloging
Distill content into an index, to facilitate search and further refinement at scale
Impossible to annotate more than a tiny fraction of documents by hand
High quality automated enrichment that makes efficient use of knowledge resources and structure in data
Many approaches, e.g. LSI, topic modeling, document classification
Entity resolution is robust extension of NER; data and knowledge driven.
Signals: mentions/aliases, shallow relationships between entities
Technology: Entity Resolution (1)
[Diagram: entity resolution example. Many mentions of “Alberto” and variant name strings (Alberto Amos Fernandez, Alberto M. Fernandez, Alberto Fernandiz, Albert Fernandez, Alburto Fernandez, Alberto Fernandez de la Puebla) resolve to distinct entities distinguished by context: an Alberto Fernandez who is Chief of Cabinet in Argentina and a professor of criminal law; one born Sept 7, 1984, a cyclist from Madrid; one born in Cuba, a US Ambassador. Resolution then supports questions such as: ratio of politicians to sportsmen? (2:1); is this Alberto Fernandez a sportsman? (yes); nickname “El Galleta”? (?)]
Technology: Entity Resolution (2)
Technology: Entity Resolution (3)
Technology: Entity Resolution (4)
Technology: Entity Resolution (5)
[Diagram: Resolution Engine. An entity mention enters candidate selection against an Entity Index built over the Knowledge Base (seeded, then learned); ranking over the candidates yields either a link to a known entity or a “ghost”.]
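The candidate-selection and ranking loop in the diagram might be sketched like this. The toy KB, the surname-based candidate test, and the context-overlap score are all stand-ins for the engine's real similarity and ranking machinery:

```python
# Toy sketch of the resolution engine: select KB candidates by name
# similarity, rank them by context overlap, then link or emit a "ghost".
# KB entries, matching rule, and scoring are invented for illustration.

KB = {  # entity ID -> (canonical name, context terms)
    "E1": ("Alberto Fernandez", {"cabinet", "argentina", "law"}),
    "E2": ("Alberto Fernandez", {"cycling", "madrid"}),
}

def resolve(mention, context, threshold=1):
    """Link a mention to the best-scoring KB candidate, or return GHOST."""
    candidates = [eid for eid, (name, _) in KB.items()
                  if name.split()[-1].lower() in mention.lower()]
    best, best_score = None, 0
    for eid in candidates:
        score = len(KB[eid][1] & context)  # shared context terms
        if score > best_score:
            best, best_score = eid, score
    return best if best_score >= threshold else "GHOST"
```

Ghosts are informative in themselves: a mention that matches no candidate signals an entity the KB has never seen.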
Adaptation: EntRes to Cataloging (1)
Out of the box:
– Quality dependent on output of extraction and order of input
– Lots of ghosts, poor links if Wikipedia-based KB doesn’t contain entities in document
– Seeding context selection may not be suited to domain
Adaptations:
– Custom KB, sized and suited to the domain and languages
– Seeding using context most likely to match in your domain
– Choose Linking or Learning mode
– Choose evidence factoring scheme that meets your operational needs
Benefits:
– Linking throughput is high, accuracy is high, ghosts are informative (because fewer confounders)
– System can maintain low latency after ingestion of many documents
– Linking accuracy can remain high after ingestion of many documents
Difficulties:
– Each element requires experimentation and thought
– Changes likely to cause discontinuities unless re-indexing
Adaptation: Ent Res to Cataloging (2)
In Linking mode:
– Link to existing KB or declare unknown, discarding context
– State size is constant, latency stable
In Learning mode:
– Link to existing KB or create New, storing context
– State size increases, increasing latency
– Semantic drift
– Confidence measure gets complicated
Scaling with learning introduces the need to factor evidence.
Evidence factoring schemes need to be customized to use cases.
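The Linking/Learning contrast above can be captured in a toy class; the interface and the KB contents are invented, but the state-size behavior matches the slide's point:

```python
# Toy contrast of the two modes: Linking mode never grows the index
# (unknown mentions become ghosts), while Learning mode creates a new
# entry, so state grows and latency can drift. Interface is invented.

class Resolver:
    def __init__(self, learning=False):
        self.kb = {"alberto fernandez": "E1"}  # seeded KB (illustrative)
        self.learning = learning
        self.next_id = 2

    def resolve(self, mention):
        key = mention.lower()
        if key in self.kb:
            return self.kb[key]
        if self.learning:                # Learning: store the new entity
            eid = f"E{self.next_id}"
            self.next_id += 1
            self.kb[key] = eid
            return eid
        return "GHOST"                   # Linking: constant state size
```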
Task: Retrieval
Find relevant information for further analysis
String-based retrieval methods are easy to understand, but require a lot of effort and distract from the task.
Deliver search modalities that are more productive but still interpretable and correctable
Search using entity-driven facets, as well as keywords
Signals: query log, click through, curation, corrections
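Entity-driven faceted retrieval can be sketched as an index keyed by resolved entity ID, with the observed mention strings kept as cues that explain the expansion to the user. All data below are invented:

```python
# Sketch of entity-faceted search: documents are indexed by resolved entity
# ID, so one query returns all of an entity's documents regardless of which
# name variant each document used. Index contents are invented.

INDEX = {  # entity ID -> document IDs
    "E1": {"doc1", "doc3"},
    "E2": {"doc2"},
}
MENTIONS = {  # entity ID -> surface forms seen (cues shown to the user)
    "E1": {"Alberto Fernandez", "Alberto Fernandiz"},
    "E2": {"Alberto Fernandez"},
}

def search(entity_id):
    """Return matching docs plus the string cues explaining the expansion."""
    return (sorted(INDEX.get(entity_id, set())),
            sorted(MENTIONS.get(entity_id, set())))
```

Surfacing the mention strings alongside the hits is what keeps the expansion interpretable and correctable, the requirement stated above.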
Adaptation: EntRes to Retrieval
Out of the box:
– Entity labels not in the user’s language are confusing
– Returns results that can’t be easily summarized as a Boolean, cf. aliases
– Complex, potentially misleading measure of confidence
Adaptations:
– Use name translation for non user-language labels, e.g. from KB
– Present users with cues to expansion in string terms, e.g. mentions
– Present confidence measure carefully
Benefits:
– User spends less time confused, search is more productive
Difficulties:
– Users still want to do things like exclude certain mentions.
Summary
News-trained NER OK for Triage, but adding entity types via lists and patterns could improve results considerably.
Speeding up Translation requires a better fit: unsupervised adaptation and custom resource selection could make the difference between time saved or wasted.
Cataloging by resolved entities enables powerful search, but relies on high quality extraction; Learning mode requires evidence factoring at scale.
Entity-based search is incredibly productive compared to Boolean and keyword approaches, but users need cues that explain expansion and robust measures of confidence.
Remaining Challenges
Current reality: even “simple” adaptation can be difficult:
– Too much knowledge, experience required
– Too much data required, e.g. 10GB for unsupervised
– Mostly “out of band”
– Usually Offline
Through the REX Field Training Kit and the Entity Resolution API, Basis is lowering the barriers to manual adaptation to sources, tasks, and users today.
Integration of explicit signals (e.g. corrections) and implicit signals (e.g. selections) is ongoing.