HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor...

Getting Things

Gregor William Stewart

Director of Product Management, Text Analytics

Basis Technology Corporation

2

Introduction

Product Manager for Text Analytics, including:
– Rosette Linguistics Platform
– Entity Analytics
– Name Indexing and Translation
– Chat Translator
– Highlight

Questing for:
– Quality: accuracy, performance
– Coverage: languages, domains, genres
– Integration: tasks, workflows, UX
– Innovation: new aggregates, functions

3

Overview

Source (1): Properties; Comparison
Tasks (4): Description; Problem(s); Challenge(s); Approach(es); Solution; Signal(s)
Technologies (2): Action (Input/Output); Components; Process
Adaptations (+): Adaptation Opportunities; “Out of Box”; Suggested Adaptations; Potential Benefits; Costs

Focus on entity analytics in four stages of the processing and exploitation of SOCOM-2012-0000011-HT

Reaching “state of the art” in practice means adapting to source, task and user.

4

Source: SOCOM-2012-0000011-HT

An Arabic language source document

Letters/emails from one colleague to others, regarding policy

Written years before it was acquired, processed

Perhaps imperfectly transcribed, or OCRed into our forensics platform

Of uncertain provenance, content, value

Not a current web news article for wide consumption, with metadata

[Diagram labels: Form, Domain, Vocabulary, “Grammar”]

5

Task: Triage

Triage: should we process further and/or urgently?

Too few trained, trusted linguists to review all the documents in time

Enable non-linguist to do linguist’s job

Gisting: MT All vs. MT Names alone

Combine Entity Extraction with Specialized Machine Translation

Integrate into Triage workflow

Signal: Documents Selected (How are guidelines interpreted?)


6

Technology: Entity Extraction (1)

7

Technology: Entity Extraction (2)

8

Technology: Entity Extraction (3)

[Diagram: entity extraction pipeline; numbered callouts omitted]
Input Text
Probabilistic Extractor: Supervised Model; Unsupervised Model
Deterministic Extractor: Exact Match (Gazetteer); Pattern Match (Regex)
Entity Redactor: Entity Joining; Overlap Adjudication; Filtering
Tagged Text; Output Text
User-supplied inputs: Domain Text; User Defined Lists; User Defined Patterns
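The slide names an overlap adjudication step but does not say how conflicts between extractors are settled. The sketch below assumes one plausible policy (prefer the longer span, then the more trusted source); the `Span` layout, the `SOURCE_PRIORITY` table and the `adjudicate` function are all hypothetical and are not taken from the product.

```python
from typing import List, Tuple

# (start offset, end offset, entity label, extractor that produced it)
Span = Tuple[int, int, str, str]

# Assumed trust order for tie-breaking; lower value wins.
SOURCE_PRIORITY = {"gazetteer": 0, "regex": 1, "statistical": 2}


def adjudicate(spans: List[Span]) -> List[Span]:
    """Keep a non-overlapping subset of spans.

    Assumed policy: longer spans win; ties go to the more trusted source.
    """
    ranked = sorted(
        spans,
        key=lambda s: (-(s[1] - s[0]), SOURCE_PRIORITY[s[3]], s[0]),
    )
    kept: List[Span] = []
    for span in ranked:
        # Accept the span only if it does not overlap anything already kept.
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)
```

Other policies (for example, always preferring user-defined lists regardless of length) would only change the sort key.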

9

Adaptation: Entity Extraction to Triage

Out of the box:
– False positives/negatives because contextual cues are fewer/different.
– Weapon in this document missed, because not a default entity type.

Adaptation:
– Add custom entity type(s) via deterministic extractor, e.g. weapons list

Benefit:
– Highlights important documents that might otherwise be missed.
– Fast and unlikely to affect performance of other components

Difficulties:
– Requires forethought, maintenance of lists and patterns in many languages, but much less work than developing a new model
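A custom entity type added through a deterministic extractor can be sketched as a user-defined list plus a user-defined pattern. This is a minimal illustration, not the Rosette API: the `WEAPON_TERMS` list, the `CALIBRE_PATTERN` regex, the labels and the `extract` function are hypothetical, and English terms are used only for readability.

```python
import re
from typing import List, NamedTuple


class Mention(NamedTuple):
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    text: str
    label: str


# Hypothetical user-defined list for a custom WEAPON entity type.
WEAPON_TERMS = ["AK-47", "RPG-7", "mortar"]

# Hypothetical user-defined pattern: calibres such as "7.62mm" or "9 mm".
CALIBRE_PATTERN = re.compile(r"\b\d{1,2}(?:\.\d{1,2})?\s?mm\b", re.IGNORECASE)


def gazetteer_pattern(terms: List[str]) -> re.Pattern:
    # Try longer terms first so a long entry is not shadowed by a shorter one.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds stop matches inside larger words (Unicode-aware).
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def extract(text: str) -> List[Mention]:
    mentions = []
    for m in gazetteer_pattern(WEAPON_TERMS).finditer(text):
        mentions.append(Mention(m.start(), m.end(), m.group(), "WEAPON"))
    for m in CALIBRE_PATTERN.finditer(text):
        mentions.append(Mention(m.start(), m.end(), m.group(), "CALIBRE"))
    return sorted(mentions)  # ordered by position in the text
```

Because the list and pattern are data rather than a trained model, they can be edited without touching the statistical extractor, which is the maintenance trade-off the slide describes.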

10

Task: Translation

Produce standardized, “user language” versions of the source document

Too few translators; name standardization particularly labor intensive

Speed up translation without compromising quality

MT All reduces translation productivity

NER, Coref and Name Translation/Standardization

Signals: Resource Selections, Corrections, Resolutions


11

Adaptation: Extraction to Translation (1)

Out of the box:
– Same problems as in Gisting case, only now they matter more.

Adaptation:
– Train unsupervised model to help with form and domain differences
– Tune co-reference algorithm to most important entity types
– Develop form/domain specific resource sets, and allow users to select them.

Benefit:
– Fewer errors in highlighting should mean translation actually speeds up

Difficulties:
– Often hard to amass a big enough corpus of like material for model building.
– Form/Domain may be ephemeral

12

Adaptation: Extraction to Translation (2)

Unsupervised algorithm clusters words with distributional similarities together

Word cluster ID is one feature used in learning the sequence model

Based on Collins & Singer (1999)

Part of REX Field Training Kit

Shown: random sample of words clustered with “Aleppo” in a ~10GB English model

Note they’re almost all LOCs

Would an annotated training corpus ever cover so many remote entities?


Thanks:~ Itai_Rolnick$ cat en_wc.txt | grep -i " aleppo " | tr ' ' '\n' | shuf | head

Loveland -- City in Colorado
Svetogorsk -- Town in Russia
MASSOUD -- ? Probably also the name of a village.
Atiak -- Town in Uganda
Waltha -- Typo for Waltham? Town in Massachusetts
BASILICA -- Type of church?
Sapukai -- Town in Paraguay
Yeisk -- Town in Russia
Descoberto -- Town in Brazil
SINKHOLE -- ? A pub in Belgium??
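How a word cluster ID becomes a feature for the sequence model can be sketched as follows. The file layout is an assumption inferred from the shell pipeline above (one cluster per line, words separated by spaces); `load_clusters`, `token_features` and the feature names are hypothetical and do not describe the REX Field Training Kit's internals.

```python
from typing import Dict, List


def load_clusters(lines: List[str]) -> Dict[str, int]:
    """Map each lower-cased word to the index of the line (cluster) it is on."""
    word_to_cluster: Dict[str, int] = {}
    for cluster_id, line in enumerate(lines):
        for word in line.split():
            # Keep the first cluster seen if a word is listed more than once.
            word_to_cluster.setdefault(word.lower(), cluster_id)
    return word_to_cluster


def token_features(tokens: List[str], i: int,
                   word_to_cluster: Dict[str, int]) -> Dict[str, object]:
    """Features for token i; the cluster ID sits alongside the usual cues."""
    word = tokens[i]
    cluster = word_to_cluster.get(word.lower())
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "cluster": "UNK" if cluster is None else str(cluster),
    }
```

A place name that never appears in the annotated training corpus still receives the same `cluster` value as "Aleppo" if it shares a cluster, which lets the supervised model generalize to it.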

13

Task: Cataloging

Distill content into an index, to facilitate search and further refinement at scale

Impossible to annotate more than a tiny fraction of documents by hand

High quality automated enrichment that makes efficient use of knowledge resources and structure in data

Many approaches, e.g. LSI, topic modeling, document classification

Entity resolution is a robust extension of NER; data and knowledge driven.

Signals: mentions/aliases, shallow relationships between entities


Technology: Entity Resolution (1)

[Diagram: resolving mentions of “Alberto Fernandez”]

Mentions: Alberto Amos Fernandez…; Alberto M. Fernandez…; Alberto Fernandez…; Alberto Fernandiz…; Albert Fernandez…; Alburto Fernandez…; Alberto Fernandez de la Puebla…

Candidate entity contexts:
– … Chief of Cabinet… Argentina… Prof of Criminal Law…
– … born Sept 7, 1984… cycling… Madrid
– … born in Cuba… US Ambassador

Questions raised:
– Ratio of Politicians to Sportsmen? 2:1
– Alberto Fernandez… Sportsmen? YES
– Nickname “El Galleta?” ?



[Diagram: Resolution Engine; numbered callouts omitted]
Entity Mention
Resolution Engine: Candidate Selection; Ranking
Entity Index
Knowledge Base: Seeded; Learned
Output: Link or Ghost
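The components named on the Resolution Engine slide can be read as a pipeline: look up candidates for a mention in an entity index, rank them against knowledge base context, then either link or report a ghost. The sketch below assumes that reading and a very simple ranking (count of shared context words); the entity IDs, `KB`, `ENTITY_INDEX` and `resolve` are hypothetical, with context terms borrowed from the Alberto Fernandez example.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical seeded knowledge base: entity ID -> context terms.
KB: Dict[str, Set[str]] = {
    "E1": {"chief", "cabinet", "argentina", "criminal", "law"},
    "E2": {"cycling", "madrid"},
    "E3": {"cuba", "us", "ambassador"},
}

# Hypothetical entity index: normalized name -> candidate entity IDs.
ENTITY_INDEX: Dict[str, List[str]] = {
    "alberto fernandez": ["E1", "E2", "E3"],
}


def select_candidates(mention: str) -> List[str]:
    return ENTITY_INDEX.get(mention.lower(), [])


def rank(candidates: List[str], context: Set[str]) -> List[Tuple[int, str]]:
    # Score = number of context words shared with the KB entry (an assumption).
    scored = [(len(KB[c] & context), c) for c in candidates]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


def resolve(mention: str, context_words: List[str],
            min_overlap: int = 1) -> Optional[str]:
    """Return the linked entity ID, or None for a ghost (no confident link)."""
    context = {w.lower() for w in context_words}
    ranked = rank(select_candidates(mention), context)
    if ranked and ranked[0][0] >= min_overlap:
        return ranked[0][1]
    return None
```

For instance, a mention of "Alberto Fernandez" next to "cycling" and "Madrid" links to `E2`, while the same name with no supporting context comes back as a ghost.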

19

Adaptation: EntRes to Cataloging (1)

Out of the box:
– Quality dependent on output of extraction and order of input
– Lots of ghosts, poor links if Wikipedia-based KB doesn’t contain entities in document
– Seeding context selection may not be suited to domain

Adaptations:
– Custom KB, sized and suited to the domain and languages
– Seeding using context most likely to match in your domain
– Choose Linking or Learning mode
– Choose evidence factoring scheme that meets your operational needs

Benefits:
– Linking throughput is high, accuracy is high, ghosts are informative (because fewer confounders)
– System can maintain low latency after ingestion of many documents
– Linking accuracy can remain high after ingestion of many documents

Difficulties:
– Each element requires experimentation and thought
– Changes likely to cause discontinuities unless re-indexing

20

Adaptation: EntRes to Cataloging (2)

In Linking mode:
– Link to existing KB or declare unknown, discarding context
– State size is constant, latency stable

In Learning mode:
– Link to existing KB or create New, storing context
– State size increases, increasing latency
– Semantic drift
– Confidence measure gets complicated

Scaling with learning introduces the need to factor evidence.

Evidence factoring schemes need to be customized to use cases.
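The difference in state growth between the two modes can be illustrated with a toy resolver. This is a sketch under strong simplifications: it matches on context words only and ignores names, and the `Resolver` class, its `state_size` measure and the `NEW-` IDs are hypothetical rather than part of the Entity Resolution API.

```python
from typing import Dict, Iterable, Optional, Set


class Resolver:
    def __init__(self, seeded_kb: Dict[str, Set[str]], learning: bool) -> None:
        # Copy the seed so the caller's knowledge base is never mutated.
        self.kb = {eid: set(ctx) for eid, ctx in seeded_kb.items()}
        self.learning = learning
        self._next_id = 0

    def state_size(self) -> int:
        """Total number of stored context terms (a stand-in for state size)."""
        return sum(len(ctx) for ctx in self.kb.values())

    def resolve(self, context: Iterable[str]) -> Optional[str]:
        words = {w.lower() for w in context}
        best = max(self.kb, key=lambda e: len(self.kb[e] & words), default=None)
        if best is not None and self.kb[best] & words:
            if self.learning:
                # Learning mode stores context, so the profile grows and may drift.
                self.kb[best] |= words
            return best
        if self.learning:
            # Learning mode creates a New entity from the unseen context.
            new_id = f"NEW-{self._next_id}"
            self._next_id += 1
            self.kb[new_id] = words
            return new_id
        return None  # Linking mode: declare unknown and discard the context.
```

In Linking mode `state_size()` never changes; in Learning mode every document can add terms, which is why latency and drift become concerns at scale.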


21

Task: Retrieval

Find relevant information for further analysis

String-based retrieval methods are easy to understand, but require a lot of effort and distract from the task.

Deliver search modalities that are more productive but still interpretable and correctable

Search using entity-driven facets, as well as keywords

Signals: query log, click through, curation, corrections

22

Adaptation: EntRes to Retrieval

Out of the box:
– Entity labels not in the user’s language are confusing
– Returns results that can’t be easily summarized as a Boolean, cf. aliases
– Complex, potentially misleading measure of confidence

Adaptations:
– Use name translation for non user-language labels, e.g. from KB
– Present users with cues to expansion in string terms, e.g. mentions
– Present confidence measure carefully

Benefits:
– User spends less time confused, search is more productive

Difficulties:
– Users still want to do things like exclude certain mentions.
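One way to present expansion "in string terms" is to show the user the Boolean query that an entity facet stands for, and let them drop mentions they do not want. The sketch below assumes a simple alias table; `ENTITY_ALIASES`, the entity ID and `expansion_query` are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical alias table: entity ID -> mentions resolved to that entity.
ENTITY_ALIASES: Dict[str, List[str]] = {
    "E1": ["Alberto Fernandez", "Alberto M. Fernandez", "Alberto Fernandiz"],
}


def expansion_query(entity_id: str,
                    aliases: Dict[str, List[str]] = ENTITY_ALIASES,
                    exclude: Iterable[str] = ()) -> str:
    """Render an entity facet as an explicit OR-query over its mentions."""
    excluded = {e.lower() for e in exclude}
    kept = [a for a in aliases[entity_id] if a.lower() not in excluded]
    return " OR ".join(f'"{a}"' for a in kept)
```

Showing the rendered query keeps entity search interpretable, and the `exclude` argument covers the case where a user wants to remove a particular mention.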

23

Summary

News-trained NER OK for Triage, but adding entity types via lists and patterns could improve results considerably.

Speeding up Translation requires a better fit: unsupervised adaptation and custom resource selection could make the difference between time saved and time wasted.

Cataloging by resolved entities enables powerful search, but relies on high-quality extraction; Learning mode requires evidence factoring at scale.

Entity-based search is incredibly productive compared to Boolean and keyword approaches, but users need cues that explain expansion and robust measures of confidence.

24

Remaining Challenges

Current reality: even “simple” adaptation can be difficult:
– Too much knowledge, experience required
– Too much data required, e.g. 10GB for unsupervised
– Mostly “out of band”
– Usually offline

Through the REX Field Training Kit and Entity Resolution API, Basis is lowering the barriers to manual adaptation to sources, tasks and users today.

Integration of explicit signals (e.g. corrections) and implicit signals (e.g. selections) is ongoing.

Q & A

[email protected]

Director of Product Management, Text Analytics

Basis Technology Corporation
