HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor...

Description

Many of the most robust Human Language Technologies, including statistical part-of-speech taggers and entity extractors, are developed primarily using high-quality newswire data sources. The performance of these technologies on texts in other genres, including short texts like tweets and even sub-genres of news like market summaries, is typically poor. Adapting such technologies to these increasingly important genres remains difficult and is an active area of commercial and academic research. In this presentation, Mr. Stewart will highlight the ways in which newswire-trained modules typically fail on the most important emerging text genres, outline the most effective and lowest-cost methods that researchers and practitioners have discovered for adapting these resources, and offer guidance on the degree of improvement users can expect in the short to medium term.

Transcript of HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor...

Page 1: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Getting Things

Gregor William Stewart

Director of Product Management, Text Analytics

Basis Technology Corporation

Page 2: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Introduction

Product Manager for Text Analytics, including:
– Rosette Linguistics Platform
– Entity Analytics
– Name Indexing and Translation
– Chat Translator
– Highlight

Questing for:
– Quality: accuracy, performance
– Coverage: languages, domains, genres
– Integration: tasks, workflows, UX
– Innovation: new aggregates, functions

Page 3: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Overview

Source (1): Properties; Comparison

Tasks (4): Description; Problem(s); Challenge(s); Approach(es); Solution; Signal(s)

Technologies (2): Action (Input/Output); Components; Process; Adaptation Opportunities

Adaptations (+): “Out of Box”; Suggested Adaptations; Potential Benefits; Costs

Focus on entity analytics in four stages of the processing and exploitation of SOCOM-2012-0000011-HT

Reaching “state of the art” in practice means adapting to source, task and user.

Page 4: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Source: SOCOM-2012-0000011-HT

An Arabic language source document

Letters/emails from one colleague to others, regarding policy

Written years before it was acquired, processed

Perhaps imperfectly transcribed, or OCRed into our forensics platform

Of uncertain provenance, content, value

Not a current web news article for wide consumption, with metadata

Form, Domain, Vocabulary, “Grammar”

Page 5: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Task: Triage

Triage: should we process further and/or urgently?

Too few trained, trusted linguists to review all the documents in time

Enable non-linguist to do linguist’s job

Gisting: MT All vs. MT Names alone

Combine Entity Extraction with Specialized Machine Translation

Integrate into Triage workflow

Signal: Documents Selected (How are guidelines interpreted?)

Page 6: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Technology: Entity Extraction (1)

Page 7: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Technology: Entity Extraction (2)

Page 8: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Technology: Entity Extraction (3)

[Diagram labels: Input Text; Probabilistic Extractor (Supervised Model, Unsupervised Model); Deterministic Extractor (Exact Match (Gazetteer), Pattern Match (Regex)); Entity Redactor (Entity Joining, Filtering, Overlap Adjudication); Tagged Text; Domain Text; Output Text; User Defined Lists; User Defined Patterns]
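The overlap adjudication step named in the diagram can be pictured with a minimal Python sketch. The `Span` type, the `adjudicate` function, the source names and the longest-span-wins rule with a source-priority tiebreak are all assumptions made for illustration; the slides do not say how the product actually resolves overlaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # entity type, e.g. "PERSON"
    source: str  # which extractor proposed the span


# Hypothetical tiebreak order: user-defined deterministic matches win ties.
PRIORITY = {"gazetteer": 0, "regex": 1, "statistical": 2}


def adjudicate(spans):
    """Keep a non-overlapping subset: longer spans first, then source priority."""
    ranked = sorted(
        spans,
        key=lambda s: (-(s.end - s.start), PRIORITY.get(s.source, 99), s.start),
    )
    kept = []
    for cand in ranked:
        # Accept the candidate only if it does not overlap anything already kept.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)


text = "Alberto Fernandez visited Aleppo"
proposed = [
    Span(0, 7, "PERSON", "statistical"),      # "Alberto"
    Span(0, 17, "PERSON", "gazetteer"),       # "Alberto Fernandez"
    Span(26, 32, "LOCATION", "statistical"),  # "Aleppo"
]
resolved = adjudicate(proposed)
```

Here the shorter "Alberto" span is dropped because it overlaps the longer gazetteer match, while the non-overlapping "Aleppo" span survives.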

Page 9: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Adaptation: Entity Extraction to Triage

Out of the box:
– False positives and negatives, because contextual cues are fewer or different.
– The weapon in this document is missed, because weapons are not a default entity type.

Adaptation:
– Add custom entity type(s) via the deterministic extractor, e.g. a weapons list.

Benefit:
– Highlights important documents that might otherwise be missed.
– Fast, and unlikely to affect the performance of other components.

Difficulties:
– Requires forethought and maintenance of lists and patterns in many languages, but much less work than developing a new model.
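A custom entity type added through lists and patterns might look roughly like the sketch below. It is plain Python `re`, not the product's API; the `WEAPON_TERMS` list, the calibre pattern, the labels and the English sample sentence are hypothetical, and an Arabic deployment would need its own lists and tokenization-aware boundaries.

```python
import re

# Hypothetical user-defined list for a custom WEAPON entity type.
WEAPON_TERMS = ["rocket-propelled grenade", "mortar", "AK-47"]

# Hypothetical user-defined pattern: calibre expressions such as "82 mm" or "7.62mm".
CALIBRE_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?mm\b", re.IGNORECASE)


def build_gazetteer_regex(terms):
    """Compile one alternation, longest term first so longer entries win."""
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(t) for t in ordered)
    # Lookarounds stop matches that start or end inside a longer word.
    return re.compile(rf"(?<!\w)(?:{body})(?!\w)", re.IGNORECASE)


def extract_weapons(text):
    """Return (start, end, matched_text, label) tuples sorted by offset."""
    gazetteer = build_gazetteer_regex(WEAPON_TERMS)
    hits = [(m.start(), m.end(), m.group(0), "WEAPON")
            for m in gazetteer.finditer(text)]
    hits += [(m.start(), m.end(), m.group(0), "WEAPON_CALIBRE")
             for m in CALIBRE_PATTERN.finditer(text)]
    return sorted(hits)


sample = "They moved two 82 mm mortar tubes and an AK-47."
matches = extract_weapons(sample)
```

Because nothing here touches a statistical model, a list like this can be changed without retraining, which is the "fast and unlikely to affect other components" benefit; the maintenance cost is the list itself.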

Page 10: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Task: Translation

Produce standardized, “user language” versions of the source document

Too few translators; name standardization particularly labor intensive

Speed up translation without compromising quality

MT All reduces translation productivity

NER, Coref and Name Translation/Standardization

Signals: Resource Selections, Corrections, Resolutions

Page 11: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Adaptation: Extraction to Translation (1)

Out of the box:
– Same problems as in Gisting case, only now they matter more.

Adaptation:
– Train unsupervised model to help with form and domain differences
– Tune co-reference algorithm to most important entity types
– Develop form/domain specific resource sets, and allow users to select them.

Benefit:
– Fewer errors in highlighting should mean translation actually speeds up

Difficulties:
– Often hard to amass a big enough corpus of like material for model building.
– Form/Domain may be ephemeral

Page 12: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Adaptation: Extraction to Translation (2)

Unsupervised algorithm clusters words with distributional similarities together

Word cluster ID is one feature used in learning the sequence model

Based on Collins & Singer (1999)

Part of REX Field Training Kit

Shown: random sample of words clustered with “Aleppo” in a ~10GB English model

Note they’re almost all LOCs

Would an annotated training corpus ever cover so many remote entities?

Thanks:~ Itai_Rolnick$ cat en_wc.txt | grep -i " aleppo " | tr ' ' '\n' | shuf | head

Loveland -- City in Colorado
Svetogorsk -- Town in Russia
MASSOUD -- ? Probably also the name of a village.
Atiak -- Town in Uganda
Waltha -- Typo for Waltham? Town in Massachusetts
BASILICA -- Type of church?
Sapukai -- Town in Paraguay
Yeisk -- Town in Russia
Descoberto -- Town in Brazil
SINKHOLE -- ? A pub in Belgium??
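How a word cluster ID can serve as one feature for a sequence model is sketched below in Python. The cluster table, the cluster IDs and the `token_features` function are invented for illustration; a real table would be induced from a large unannotated corpus, and this is not the REX Field Training Kit's feature set.

```python
# Hypothetical word -> cluster-ID table; real IDs would come from unsupervised
# clustering over a large corpus (the slide mentions a ~10GB English model).
WORD_CLUSTERS = {
    "aleppo": "c417", "svetogorsk": "c417", "yeisk": "c417", "loveland": "c417",
    "visited": "c052", "reached": "c052",
}


def token_features(tokens, i, clusters=WORD_CLUSTERS):
    """Build the feature dict for token i, as might be fed to a sequence tagger."""
    word = tokens[i]
    feats = {
        "lower": word.lower(),
        "is_title": word.istitle(),
        # Words with similar distributions share a cluster ID, so the tagger
        # can generalize from annotated words to unannotated ones.
        "cluster": clusters.get(word.lower(), "UNK"),
    }
    if i > 0:
        feats["prev_cluster"] = clusters.get(tokens[i - 1].lower(), "UNK")
    return feats


seen = token_features(["She", "visited", "Aleppo"], 2)
unseen = token_features(["She", "reached", "Yeisk"], 2)
```

Even if "Yeisk" never appears in the annotated training corpus, it shares its cluster features with "Aleppo", which is how cluster IDs help with remote entities that annotation alone would not cover.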

Page 13: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Task: Cataloging

Distill content into an index, to facilitate search and further refinement at scale

Impossible to annotate more than a tiny fraction of documents by hand

High quality automated enrichment that makes efficient use of knowledge resources and structure in data

Many approaches, e.g. LSI, topic modeling, document classification

Entity resolution is a robust extension of NER; it is data and knowledge driven.

Signals: mentions/aliases, shallow relationships between entities

Page 14: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Technology: Entity Resolution (1)

[Figure: many bare “Alberto” mentions and name variants (Alberto Amos Fernandez, Alberto M. Fernandez, Alberto Fernandez, Alberto Fernandiz, Albert Fernandez, Alburto Fernandez, Alberto Fernandez de la Puebla) set against candidate entities]

Candidate entities:
– Alberto Fernandez: Chief of Cabinet, Argentina; Prof of Criminal Law
– Alberto Fernandez: born Sept 7, 1984; cycling; Madrid
– Alberto Fernandez: born in Cuba; US Ambassador

Callouts:
– Ratio of Politicians to Sportsmen? 2:1
– Alberto Fernandez… Sportsmen? YES
– Nickname “El Galleta?” ?

Page 15: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Technology: Entity Resolution (2)

Page 16: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Technology: Entity Resolution (3)

Page 17: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Technology: Entity Resolution (4)

Page 18: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Technology: Entity Resolution (5)

[Diagram labels: Entity Mention; Resolution Engine; Candidate Selection; Ranking; Entity Index; Knowledge Base; Seeded; Learned; Link or Ghost]
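The candidate selection, ranking and link-or-ghost steps named in the diagram can be sketched in Python as follows. Everything here is an assumption for illustration: the `KBEntity` type, the toy knowledge base built from the "Alberto Fernandez" example, string similarity via `difflib` for candidate selection, context-word overlap for ranking, and the reading of "ghost" as a mention with no confident link in the knowledge base.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class KBEntity:
    entity_id: str
    name: str
    context: set = field(default_factory=set)


# Hypothetical seeded knowledge base using the "Alberto Fernandez" example.
KB = [
    KBEntity("E1", "Alberto Fernandez", {"argentina", "cabinet", "law"}),
    KBEntity("E2", "Alberto Fernandez", {"cycling", "madrid"}),
    KBEntity("E3", "Alberto Fernandez", {"cuba", "ambassador"}),
]


def select_candidates(mention, kb, min_name_sim=0.8):
    """Candidate selection: keep entities whose name is close to the mention."""
    return [
        e for e in kb
        if SequenceMatcher(None, mention.lower(), e.name.lower()).ratio() >= min_name_sim
    ]


def resolve(mention, doc_context, kb, min_overlap=1):
    """Rank candidates by shared context words; return an entity ID or 'GHOST'."""
    ranked = sorted(
        select_candidates(mention, kb),
        key=lambda e: len(e.context & doc_context),
        reverse=True,
    )
    if ranked and len(ranked[0].context & doc_context) >= min_overlap:
        return ranked[0].entity_id
    return "GHOST"  # no confident link: treated as an entity unknown to the KB


cyclist = resolve("Alberto Fernandiz", {"cycling", "race", "madrid"}, KB)
no_context_match = resolve("Alberto Fernandez", {"poetry"}, KB)
no_name_match = resolve("Gregor Stewart", {"cycling"}, KB)
```

The misspelled mention still reaches all three candidates by name, and the document context picks the cyclist; without supporting context, or without any name match, the mention becomes a ghost.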

Page 19: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Adaptation: EntRes to Cataloging (1)

Out of the box:
– Quality dependent on output of extraction and order of input
– Lots of ghosts, poor links if Wikipedia-based KB doesn’t contain entities in document
– Seeding context selection may not be suited to domain

Adaptations:
– Custom KB, sized and suited to the domain and languages
– Seeding using context most likely to match in your domain
– Choose Linking or Learning mode
– Choose evidence factoring scheme that meets your operational needs

Benefits:
– Linking throughput is high, accuracy is high, ghosts are informative (because fewer confounders)
– System can maintain low latency after ingestion of many documents
– Linking accuracy can remain high after ingestion of many documents

Difficulties:
– Each element requires experimentation and thought
– Changes likely to cause discontinuities unless re-indexing

Page 20: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Adaptation: EntRes to Cataloging (2)

In Linking mode:
– Link to existing KB or declare unknown, discarding context
– State size is constant, latency stable

In Learning mode:
– Link to existing KB or create New, storing context
– State size increases, increasing latency
– Semantic drift
– Confidence measure gets complicated

Scaling with learning introduces the need to factor evidence.

Evidence factoring schemes need to be customized to use cases.
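The difference in state growth between the two modes can be shown with a toy Python resolver. The `Resolver` class, its context-overlap matching and the `NEW…` identifiers are hypothetical; the sketch only illustrates why Linking mode keeps state constant while Learning mode accumulates it, and it does not attempt evidence factoring.

```python
class Resolver:
    """Toy resolver whose state maps entity ID -> set of context words."""

    def __init__(self, seeded, learning=False):
        self.state = {eid: set(ctx) for eid, ctx in seeded.items()}
        self.learning = learning
        self._next_id = 0

    def state_size(self):
        return sum(len(ctx) for ctx in self.state.values())

    def ingest(self, doc_context):
        """Link one mention's context to the best entity, or handle an unknown."""
        best = max(
            self.state,
            key=lambda eid: len(self.state[eid] & doc_context),
            default=None,
        )
        if best is not None and self.state[best] & doc_context:
            if self.learning:
                self.state[best] |= doc_context  # store context: state grows
            return best
        if self.learning:
            self._next_id += 1
            new_id = f"NEW{self._next_id}"
            self.state[new_id] = set(doc_context)  # create a new entity
            return new_id
        return None  # Linking mode: declare unknown and discard the context


seed = {"E2": {"cycling", "madrid"}}
docs = [{"cycling", "race"}, {"cuba", "ambassador"}, {"ambassador", "washington"}]

linker = Resolver(seed, learning=False)
learner = Resolver(seed, learning=True)
link_out = [linker.ingest(d) for d in docs]
learn_out = [learner.ingest(d) for d in docs]
```

After the same three documents the linker's state is unchanged, while the learner has created an entity and absorbed extra context words. That growth is what raises latency over time, and absorbing context from each new link is also how semantic drift can creep in.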

Page 21: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Task: Retrieval

Find relevant information for further analysis

String-based retrieval methods are easy to understand, but require a lot of effort and distract from the task.

Deliver search modalities that are more productive but still interpretable and correctable

Search using entity-driven facets, as well as keywords

Signals: query log, click through, curation, corrections

Page 22: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Adaptation: EntRes to Retrieval

Out of the box:
– Entity labels not in the user’s language are confusing
– Returns results that can’t be easily summarized as a Boolean, cf. aliases
– Complex, potentially misleading measure of confidence

Adaptations:
– Use name translation for non user-language labels, e.g. from KB
– Present users with cues to expansion in string terms, e.g. mentions
– Present confidence measure carefully

Benefits:
– User spends less time confused, search is more productive

Difficulties:
– Users still want to do things like exclude certain mentions.
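One way to surface expansion cues and let users exclude mentions is sketched below in Python. The `INDEX` layout, the entity IDs, the mention strings and the `search_by_entity` function are hypothetical and stand in for whatever entity-annotated index a real system would use.

```python
# Hypothetical entity-annotated index: doc ID -> list of (entity_id, mention).
INDEX = {
    "d1": [("E3", "Alberto Fernandez")],
    "d2": [("E3", "Amb. Fernandez"), ("E2", "Alberto Fernandez")],
    "d3": [("E2", "A. Fernandez")],
}


def search_by_entity(entity_id, index, exclude_mentions=()):
    """Return matching doc IDs with the mention strings that caused each match."""
    excluded = {m.lower() for m in exclude_mentions}
    hits = {}
    for doc_id, annotations in index.items():
        mentions = sorted({
            m for eid, m in annotations
            if eid == entity_id and m.lower() not in excluded
        })
        if mentions:
            hits[doc_id] = mentions  # the cue shown to the user
    return hits


all_hits = search_by_entity("E3", INDEX)
filtered = search_by_entity("E3", INDEX, exclude_mentions=["Amb. Fernandez"])
```

Returning the matched mention strings alongside each document shows the user which aliases the entity query expanded to, and the exclusion list covers the case where one of those aliases is unwanted.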

Page 23: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Summary

News-trained NER is OK for Triage, but adding entity types via lists and patterns could improve results considerably.

Speeding up Translation requires a better fit: unsupervised adaptation and custom resource selection could make the difference between time saved and time wasted.

Cataloging by resolved entities enables powerful search, but relies on high-quality extraction; Learning mode requires evidence factoring at scale.

Entity-based search is incredibly productive compared to Boolean and keyword approaches, but users need cues that explain expansion and robust measures of confidence.

Page 24: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Remaining Challenges

Current reality: even “simple” adaptation can be difficult:
– Too much knowledge, experience required
– Too much data required, e.g. 10GB for unsupervised
– Mostly “out of band”
– Usually offline

Through the REX Field Training Kit and Entity Resolution API, Basis is lowering the barriers to manual adaptation to sources, tasks and users today.

Integration of explicit signals (e.g. corrections) and implicit signals (e.g. selections) is ongoing.

Page 25: HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Q & A

Director of Product Management, Text Analytics

Basis Technology Corporation