Extraction as Classification
  • date post

    20-Dec-2015

What is “Information Extraction”

As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

IE

QA

End User

What is “Information Extraction”

As a family of techniques: Information Extraction = segmentation + classification + association + clustering


Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation

aka Named Entity Recognition

Landscape of IE Tasks (1/4): Degree of Formatting

Text paragraphs without formatting

Grammatical sentences and some formatting & links

Non-grammatical snippets, rich formatting & links

Tables

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Landscape of IE Tasks (3/4): Complexity of extraction task

E.g. word patterns:

Closed set: U.S. states

He was born in Alabama…

The big Wyoming sky…

Regular set: U.S. phone numbers

Phone: (413) 545-1323

The CALD main office can be reached at 412-268-1299

Complex pattern: U.S. postal addresses

University of Arkansas, P.O. Box 140, Hope, AR 71802

Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence: Person names

…was among the six houses sold by Hope Feldman that year.

Pawel Opalinski, Software Engineer at WhizBang Labs.

Landscape of IE Tasks (4/4): Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity (“Named entity” extraction)

Person: Jack Welch
Person: Jeffrey Immelt
Location: Connecticut

Binary relationship

Relation: Person-Title; Person: Jack Welch; Title: CEO
Relation: Company-Location; Company: General Electric; Location: Connecticut

N-ary record

Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

Models for NER

Lexicons

Alabama, Alaska, …, Wisconsin, Wyoming

Abraham Lincoln was born in Kentucky.

member?
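The lexicon model amounts to a membership test per token. A minimal sketch follows, assuming a toy five-state lexicon and single-word entries only (multi-word names such as "New York" would need phrase matching); the function name `lexicon_matches` is made up for illustration.

```python
import re

# Toy lexicon (assumption); a real one would list every U.S. state.
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}

def lexicon_matches(text, lexicon):
    """Return (token, start_offset) for every token that is a lexicon member."""
    return [(m.group(), m.start())
            for m in re.finditer(r"[A-Za-z]+", text)
            if m.group() in lexicon]

matches = lexicon_matches("Abraham Lincoln was born in Kentucky.", US_STATES)
print(matches)
```

The approach is only as good as the list: it cannot tell "Hope, AR" from "Hope Feldman", which is why the later models add a classifier.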

Classify Pre-segmented Candidates

Abraham Lincoln was born in Kentucky.

Classifier

which class?

Sliding Window

Abraham Lincoln was born in Kentucky.

Classifier

which class?

Try alternate window sizes:
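The sliding-window idea can be sketched as: enumerate every window up to some maximum size, score each with a classifier, and keep the best one. In the sketch below `toy_person_score` is a hand-written stand-in for a trained classifier, and the maximum window length of 3 is an arbitrary assumption.

```python
def sliding_windows(tokens, max_len=3):
    """Yield (start, end, window) for every window of 1..max_len tokens."""
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            yield start, start + size, tokens[start:start + size]

def toy_person_score(window):
    """Hypothetical classifier: favors two-token, fully capitalized windows."""
    if not all(tok[0].isupper() for tok in window):
        return 0.0
    return 1.0 if len(window) == 2 else 0.5

tokens = "Abraham Lincoln was born in Kentucky".split()
best = max(sliding_windows(tokens), key=lambda w: toy_person_score(w[2]))
print(best)
```

Trying several window sizes makes the number of classifier calls grow with both sentence length and `max_len`, which is one motivation for the boundary and token-tagging models.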

Boundary Models

Abraham Lincoln was born in Kentucky.

Classifier

which class?

BEGIN END BEGIN END

BEGIN
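A boundary model replaces one window classifier with two detectors, one for BEGIN positions and one for END positions, and pairs them up afterwards. The sketch below uses capitalization rules as hypothetical stand-ins for the two learned classifiers; the pairing rule (nearest END within `max_len` tokens) is also an assumption.

```python
def is_begin(tokens, i):
    """Stand-in BEGIN detector: capitalized token not preceded by one."""
    return tokens[i][0].isupper() and (i == 0 or not tokens[i - 1][0].isupper())

def is_end(tokens, i):
    """Stand-in END detector: capitalized token not followed by one."""
    last = i == len(tokens) - 1
    return tokens[i][0].isupper() and (last or not tokens[i + 1][0].isupper())

def extract_spans(tokens, max_len=4):
    """Pair each BEGIN with the nearest END at most max_len tokens away."""
    spans = []
    for b in range(len(tokens)):
        if not is_begin(tokens, b):
            continue
        for e in range(b, min(b + max_len, len(tokens))):
            if is_end(tokens, e):
                spans.append((b, e + 1))  # half-open token span
                break
    return spans

tokens = "Abraham Lincoln was born in Kentucky".split()
spans = extract_spans(tokens)
print([tokens[b:e] for b, e in spans])
```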

Token Tagging

Abraham Lincoln was born in Kentucky.

Most likely state sequence?

This is often treated as a structured prediction problem…classifying tokens sequentially

HMMs, CRFs, ….

MUC-7

• Last Message Understanding Conference (forerunner to ACE), about 1998
  – 200 articles in development set (newswire text about aircraft accidents)
  – 200 articles in final test (launch events)
  – Names of: persons, organizations, locations, dates, times, currency & percentage

TE: Template Elements (attributes)
TR: Template Relations (binary relations)
ST: Scenario Template (events)
CO: coreference
NE: named entity recognition

LTG

Identifinder (HMMs)

MENE+Proteus

Manitoba (NB filtered names)

NetOwl (commercial RBS)

Borthwick, Sterling, Agichtein, Grishman: the MENE system

Borthwick et al: MENE system

• Simple idea: tag every token:
  – 4 tags/field for x=person, organization:
    • x_start, x_continue, x_end, x_unique
    • Also “other”
  – Learner:
    • Maxent/logistic regression
    • Regularize by dropping rare features
  – To extract:
    • Compute P(y|xi) for each xi
    • Use Viterbi to find ML consistent sequence:
      – continue follows start
      – end follows continue or start
      – …
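The extraction step can be sketched as a Viterbi search over per-token probabilities in which inconsistent tag transitions are simply disallowed. This is an illustration for a single field, not the MENE code: the probabilities are invented, the tag names drop the `x_` prefix, and the full constraint set (including "continue may follow continue") is an assumption that fills in the slide's "…".

```python
import math

TAGS = ["start", "continue", "end", "unique", "other"]
INSIDE = {"start", "continue"}  # tags that leave a name open

def allowed(prev, cur):
    """Assumed constraints: an open name must continue or end before anything else."""
    if prev in INSIDE:
        return cur in ("continue", "end")
    return cur in ("start", "unique", "other")

def viterbi(probs):
    """probs[i][tag] = P(tag | x_i), all > 0. Returns the best consistent sequence."""
    # best[tag] = (log-probability, path) of the best consistent path ending in tag
    best = {t: (math.log(probs[0][t]), [t]) for t in TAGS if allowed("other", t)}
    for p in probs[1:]:
        nxt = {}
        for cur in TAGS:
            cands = [(score + math.log(p[cur]), path + [cur])
                     for prev, (score, path) in best.items() if allowed(prev, cur)]
            if cands:
                nxt[cur] = max(cands, key=lambda c: c[0])
        best = nxt
    # A sequence may not stop in the middle of a name.
    finals = [v for t, v in best.items() if t not in INSIDE]
    return max(finals, key=lambda c: c[0])[1]

# Invented P(y|x_i) for the tokens "Abraham", "Lincoln", "was".
probs = [
    {"start": 0.50, "continue": 0.05, "end": 0.05, "unique": 0.30, "other": 0.10},
    {"start": 0.10, "continue": 0.35, "end": 0.30, "unique": 0.05, "other": 0.20},
    {"start": 0.02, "continue": 0.03, "end": 0.05, "unique": 0.05, "other": 0.85},
]
print(viterbi(probs))
```

Picking the most probable tag independently per token would give start, continue, other, which leaves the name unclosed; the constrained search instead returns start, end, other.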

Viterbi in MENE

Borthwick et al: MENE system

• Features g(h,f) for the loglinear model
  – function of “history” (features of token) and “future” (predicted class)
  – Lexical features combine
    • Identity of token in window and
    • Predicted class
    • eg g(h,f)=[token-1(h)=“Mr.” and f=person_unique]
  – Section features combine:
    • section name & class
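A binary feature g(h,f) of this kind, and the loglinear probability built from such features, might look like the sketch below. The dictionary layout of the history, the helper names, and the weight of 2.0 are all assumptions made for illustration; in a trained model the weights come from maxent estimation.

```python
import math

def make_lexical_feature(offset, word, target_class):
    """Build g(h, f): 1 iff the token at `offset` from the current one is `word`
    and the predicted class f is `target_class`, else 0."""
    def g(history, future):
        i = history["index"] + offset
        tokens = history["tokens"]
        return int(0 <= i < len(tokens) and tokens[i] == word
                   and future == target_class)
    return g

def maxent_prob(history, classes, weighted_features):
    """P(f | h) proportional to exp(sum_j w_j * g_j(h, f))."""
    scores = {f: math.exp(sum(w * g(history, f) for g, w in weighted_features))
              for f in classes}
    z = sum(scores.values())
    return {f: s / z for f, s in scores.items()}

# The slide's example feature: previous token is "Mr." and class is person_unique.
g = make_lexical_feature(-1, "Mr.", "person_unique")
h = {"tokens": ["Mr.", "Gates", "spoke"], "index": 1}
dist = maxent_prob(h, ["person_unique", "other"], [(g, 2.0)])
print(g(h, "person_unique"), g(h, "other"), round(dist["person_unique"], 3))
```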

Borthwick et al: MENE system

• Features g(h,f) for the loglinear model
  – Dictionary features:
    • Match each multi-word dictionary d to text
    • For token sequences record d_start, d_cont, ..
    • Combine these values with category
    • eg g(h,f)=[places0(h)=“places_uniq” and f=organization_start]

Pittsburgh   coach         Tomlin …
Places_uniq  Places_other  Places_other
Celeb_other  Celeb_other   Celeb_uniq
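The dictionary annotations in this example could be produced by a longest-match lookup such as the sketch below. The two small dictionaries and the `_end` tag name are assumptions; the slide names only `d_start` and `d_cont` explicitly.

```python
def dictionary_tags(tokens, entries, name):
    """Tag each token as name_start/_cont/_end/_uniq/_other by longest match.
    `entries` is a non-empty set of token tuples."""
    tags = [f"{name}_other"] * len(tokens)
    max_len = max(len(e) for e in entries)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # longest first
            if tuple(tokens[i:i + n]) in entries:
                if n == 1:
                    tags[i] = f"{name}_uniq"
                else:
                    tags[i] = f"{name}_start"
                    tags[i + 1:i + n - 1] = [f"{name}_cont"] * (n - 2)
                    tags[i + n - 1] = f"{name}_end"
                i += n
                break
        else:
            i += 1  # no entry starts here
    return tags

places = {("Pittsburgh",), ("New", "York"), ("Salt", "Lake", "City")}
celebs = {("Tomlin",)}
tokens = ["Pittsburgh", "coach", "Tomlin"]
print(dictionary_tags(tokens, places, "Places"))
print(dictionary_tags(tokens, celebs, "Celeb"))
```

Each resulting tag is then paired with a candidate class to form a feature, so the model can learn, for example, how often a token flagged `Places_uniq` is actually the start of an organization name.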

Dictionaries in MENE

Borthwick et al: MENE system

• Features g(h,f) for the loglinear model
  – External system features:
    • Run someone else’s system s on text
    • For token sequences record sx_start, sx_cont, ..
    • Combine these values with category
    • eg g(h,f)=[proteus0(h)=“places_uniq” and f=organization_start]

MENE results (dry run)

MENE learning curves

96.3  93.3  92.2

• Largest U.S. Cable Operator Makes Bid for Walt Disney• By ANDREW ROSS SORKIN

• The Comcast Corporation, the largest cable television operator in the United States, made a $54.1 billion unsolicited takeover bid today for The Walt Disney Company, the storied family entertainment colossus.

• If successful, Comcast's audacious bid would once again reshape the entertainment landscape, creating a new media behemoth that would combine the power of Comcast's powerful distribution channels to some 21 million subscribers in the nation with Disney's vast library of content and production assets. Those include its ABC television network, ESPN and other cable networks, and the Disney and Miramax movie studios.

Short names

Longer names

LTG system

• Another MUC-7 competitor
• Handcoded rules for “easy” cases (amounts, etc)
• Process of repeated tagging and “matching” for hard cases
  – Sure-fire (high precision) rules for names where type is clear (“Philip Morris, Inc”, “The Walt Disney Company”)
  – Partial matches to sure-fire rule are filtered with a maxent classifier (candidate filtering) using contextual information, etc
  – Higher-recall rules, avoiding conflicts with partial-match output (“Philip Morris announced today…”, “Disney’s …”)
  – Final partial-match & filter step on titles with different learned filter.
• Exploits discourse/context information

LTG Results

LTG

Identifinder (HMMs)

MENE+Proteus

Manitoba (NB filtered names)

NetOwl (commercial RBS)

Jansche & Abney paper

He was a grammarian, and could doubtless see further into the future than others. -- J.R.R. Tolkien, "Farmer Giles of Ham"

echo golf echo x-ray | tr "ladyfinger orthodontics rx-" [email protected].

Background on JA paper

SCAN: Search & Summarization for Audio Collections (AT&T Labs)

Why IE from personal voicemail

• Unified interface for email, voicemail, fax, … requires uniform headers:
  – Sender, Time, Subject, …
  – Headers are key for uniform interface
• Independently, voicemail access is slow:
  – useful to have fast access to important parts of message (contact number, caller)

Background on JA – cont’d

• Quick review of Huang, Zweig & Padmanabhan (IBM Yorktown) “Information Extraction from Voicemail”:
  – Goal: find identity and contact number of callers in voicemail (NER + role classification)
  – Evaluated three systems on ~5000 labeled, manually transcribed messages:
    • Baseline system:
      – 200 hand-coded rules based on “trigger phrases”
    • State-of-art Ratnaparkhi-style MaxEnt tagger:
      – Lexical unigrams, bigrams, dictionary features for names, numbers, “trigger phrases” + feature selection
  – Poor results:
    • On manually transcribed data, F1 in 80s for both tasks (good!)
    • On ASR data, F1 about 50% for caller names, 80% for contact numbers even with a very loose performance metric
    • Best learning method barely beat the baseline rule-based system.

What’s interesting in this paper

• How and when do we use ML?
• Robust information extraction
  – Generalizing from manual transcripts (i.e., human-produced written version of voicemail) to automatic (ASR) transcripts
• Place of hand-coding vs learning in information extraction
  – How to break up task
  – Where and how to use engineering

Candidate Generator → Candidate phrase → Learned filter → Extracted phrase

Voicemail corpus

• About 10,000 manually transcribed and annotated voice messages.

• 1869 used for evaluation

• Not quite the usual NER task: we only want the caller’s name

Observation: caller phrases are short and near the beginning of the message.

Caller-phrase extraction

• Propose start positions i1,…,iN

• Use a learned decision tree to pick the best i

• Propose end positions i+j1,i+j2,…,i+jM

• Use a learned decision tree to pick the best j
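The two-stage procedure above can be sketched as two argmax steps, first over start positions and then over lengths. The scoring functions here are hand-written stand-ins for the two learned decision trees, and the limits `max_start` and `max_len` are assumed values that reflect the observation that caller phrases are short and early.

```python
INTRO_WORDS = {"it's", "is"}  # assumed cue words, e.g. "it's Bill", "this is Bill"

def toy_start_score(tokens, i):
    """Stand-in for the start-position tree: prefer a token right after a cue."""
    return 1.0 if i > 0 and tokens[i - 1] in INTRO_WORDS else 0.0

def toy_length_score(tokens, i, n):
    """Stand-in for the length tree: reward capitalized tokens, penalize others."""
    window = tokens[i:i + n]
    caps = sum(tok[0].isupper() for tok in window)
    return caps - (len(window) - caps)

def extract_caller_phrase(tokens, score_start, score_length,
                          max_start=8, max_len=4):
    """Pick the best start i, then the best length j given i (tokens non-empty)."""
    starts = range(min(max_start, len(tokens)))
    i = max(starts, key=lambda s: score_start(tokens, s))
    lengths = range(1, min(max_len, len(tokens) - i) + 1)
    j = max(lengths, key=lambda n: score_length(tokens, i, n))
    return tokens[i:i + j]

tokens = ["hi", "there", "it's", "Bill", "Smith", "calling", "about", "lunch"]
phrase = extract_caller_phrase(tokens, toy_start_score, toy_length_score)
print(phrase)
```

Splitting the decision in two keeps each classifier's candidate set small: N starts plus M lengths, instead of N × M spans.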

Baseline (HZP, Collins log-linear)

• IE as tagging, similar to Borthwick:

• Pr(tag i|word i,word i-1,…,word i+1,…,tag i-1,…) estimated via MAXENT model

• Beam search to find best tag sequence given word sequence (we’ll talk more about this next week)

• Features of model are words, word pairs, word pair+tag trigrams, ….

Hi     there  it’s          Bill         and…
Other  Other  Caller_start  Caller_cont  other
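Beam search over tag sequences can be sketched as follows: extend every partial sequence in the beam by every tag, score the extensions with the local model, and keep only the top few. The local model `toy_local_prob` is a hypothetical, unnormalized stand-in for the MAXENT estimate of Pr(tag i | words, previous tags), and the beam width of 3 is an arbitrary choice.

```python
import math

TAGS = ["Other", "Caller_start", "Caller_cont"]

def toy_local_prob(words, i, prev_tags, tag):
    """Hypothetical local score for `tag` at position i given earlier tags."""
    prev = prev_tags[-1] if prev_tags else "Other"
    if tag == "Caller_start":
        return 0.8 if words[i] == "it's" else 0.1
    if tag == "Caller_cont":
        inside = prev in ("Caller_start", "Caller_cont")
        return 0.8 if inside and words[i][0].isupper() else 0.05
    return 0.6

def beam_search(words, tags, local_prob, beam_width=3):
    """Return the highest-scoring tag sequence found with a fixed-width beam."""
    beam = [(0.0, [])]  # (log score, tags so far)
    for i in range(len(words)):
        expanded = [(score + math.log(local_prob(words, i, seq, t)), seq + [t])
                    for score, seq in beam for t in tags]
        expanded.sort(key=lambda c: c[0], reverse=True)
        beam = expanded[:beam_width]  # prune to the best partial sequences
    return beam[0][1]

words = ["Hi", "there", "it's", "Bill", "and"]
best_tags = beam_search(words, TAGS, toy_local_prob)
print(list(zip(words, best_tags)))
```

Unlike Viterbi, beam search is not guaranteed to find the best sequence, but it lets the local model condition on arbitrarily many previous tags.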

Performance

Observation: caller names are really short and near the beginning of the message.

What about ASR transcripts?

Extracting phone numbers

• Phase 1: hand-coded grammar proposes candidate phone numbers
  – Not too hard, due to limited vocabulary
  – Optimize recall (96%) not precision (30%)
• Phase 2: a learned decision tree filters candidates
  – Use length, position, a few context features

Results

Their Conclusions