Transcript: Information Extraction with Markov Random Fields
Andrew McCallum, University of Massachusetts Amherst

Page 1

Information Extraction with Markov Random Fields

Andrew McCallum, University of Massachusetts Amherst

Joint work with

Aron Culotta, Wei Li and Ben Wellner

Bruce Croft, James Allan, David Pinto, Xing Wei, Andres Corrada-Emmanuel

David Jensen, Jen Neville,

John Lafferty (CMU), Fernando Pereira (UPenn)

Page 3

IE from Chinese Documents regarding Weather

Department of Terrestrial System, Chinese Academy of Sciences

200k+ documents, several millennia old

- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

Page 4

IE from Research Papers [McCallum et al '99]

Page 5

IE in Context

[Pipeline diagram: Spider → Document collection → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search → Data mining, Prediction, Outlier detection, Decision support; supporting steps: Create ontology, Label training data, Train extraction models.]

Page 6

What is “Information Extraction”

As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Page 7

What is “Information Extraction”

As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..

IE

Page 8

What is “Information Extraction”

Information Extraction = segmentation + classification + association + clustering

As a family of techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

Page 9

What is “Information Extraction”

Information Extraction = segmentation + classification + association + clustering

As a family of techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation


Page 11

What is “Information Extraction”

Information Extraction = segmentation + classification + association + clustering

As a family of techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..

Page 12

Issues that arise

Application issues:
• Directed spidering
• Page filtering
• Sequence segmentation
• Segment classification
• Record association
• Co-reference

Scientific issues:
• Learning more than 100k parameters from limited and noisy training data
• Taking advantage of rich, interdependent features
• Deciding which input feature representation to use
• Efficient inference in models with many interrelated predictions
• Clustering massive data sets


Page 14

Outline

• The need. The task.

• Review of Random Field Models for IE.

• Recent work

– New results for Table Extraction

– New training procedure using Feature Induction

– New extended model for Co-reference Resolution

• Conclusions

Page 15

Hidden Markov Models

[Diagrams: finite state model and the corresponding graphical model, with hidden states S_{t-1}, S_t, S_{t+1}, ... emitting observations O_{t-1}, O_t, O_{t+1}, ...]

Parameters, for all states S = {s_1, s_2, ...}:
• Start state probabilities: P(s_t)
• Transition probabilities: P(s_t | s_{t-1})
• Observation (emission) probabilities: P(o_t | s_t)

Training: maximize probability of training observations (with prior).

$$P(\mathbf{s},\mathbf{o}) = \prod_{t=1}^{|\mathbf{o}|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

[Diagram: the HMM generates a state sequence (transitions) and an observation sequence o_1 o_2 ... o_8 (emissions); each observation is usually a multinomial over an atomic, fixed alphabet.]

Page 16

IE with Hidden Markov Models

Yesterday Arny Rosenberg spoke this example sentence.

Yesterday Arny Rosenberg spoke this example sentence.

Person name: Arny Rosenberg

Given a sequence of observations:

and a trained HMM:

Find the most likely state sequence: (Viterbi)

Any words said to be generated by the designated “person name” state are extracted as a person name:

$$\mathbf{s}^{*} = \arg\max_{\mathbf{s}} P(\mathbf{s}, \mathbf{o})$$

[HMM state diagram with states: person name, location name, background.]
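To make the decoding step concrete, here is a minimal sketch of Viterbi decoding for an HMM-based extractor (not code from the talk); the state names and the `extract_person` helper are hypothetical illustrations.

```python
import numpy as np

# Hypothetical state inventory for the example above.
states = ["background", "person_name", "location_name"]

def viterbi(obs_logprob, trans_logprob, start_logprob):
    """Most likely state sequence, argmax_s P(s, o), for one sentence.

    obs_logprob:   T x S matrix of log P(o_t | s)
    trans_logprob: S x S matrix of log P(s_t | s_{t-1})
    start_logprob: length-S vector of log P(s_1)
    """
    T, S = obs_logprob.shape
    delta = np.full((T, S), -np.inf)      # best log-score of any path ending in state s at time t
    back = np.zeros((T, S), dtype=int)    # backpointers
    delta[0] = start_logprob + obs_logprob[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + trans_logprob   # scores[i, j]: come from state i, move to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + obs_logprob[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):          # trace the backpointers
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))

def extract_person(tokens, state_path):
    """Return the tokens whose Viterbi state is the designated 'person name' state."""
    person = states.index("person_name")
    return [w for w, s in zip(tokens, state_path) if s == person]
```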

Page 17

Regrets from Atomic View of Tokens

Would like richer representation of text: multiple overlapping features, whole chunks of text.

Line, sentence, or paragraph features:
– length
– is centered in page
– percent of non-alphabetics
– white-space aligns with next line
– containing sentence has two verbs
– grammatically contains a question
– contains links to “authoritative” pages
– emissions that are uncountable
– features at multiple levels of granularity

Example word features:
– identity of word
– is in all caps
– ends in “-ski”
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are “and Associates”

Page 18

Problems with Richer Representation and a Generative Model

• These arbitrary features are not independent:
  – Overlapping and long-distance dependences
  – Multiple levels of granularity (words, characters)
  – Multiple modalities (words, formatting, layout)
  – Observations from past and future

• HMMs are generative models of the text: they model $P(\mathbf{s}, \mathbf{o})$.

• Generative models do not easily handle these non-independent features. Two choices:
  – Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
  – Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes). A big problem when combining evidence, as in Viterbi!

Page 19

Conditional Sequence Models

• We would prefer a conditional model: P(s|o) instead of P(s,o).
  – Can examine features, but not responsible for generating them.
  – Don’t have to explicitly model their dependencies.
  – Don’t “waste modeling effort” trying to generate what we are given at test time anyway.

Page 20

Conditional Finite State Sequence Models: From HMMs to MEMMs to CRFs

State sequence $\mathbf{s} = s_1, s_2, \ldots, s_n$; observation sequence $\mathbf{o} = o_1, o_2, \ldots, o_n$

[Graphical-model diagrams for the HMM, MEMM, and CRF chains over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}.]

HMM (generative, joint):
$$P(\mathbf{s}, \mathbf{o}) = \prod_{t=1}^{|\mathbf{o}|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$

MEMM (conditional) [McCallum, Freitag & Pereira, 2000]:
$$P(\mathbf{s} \mid \mathbf{o}) = \prod_{t=1}^{|\mathbf{o}|} P(s_t \mid s_{t-1}, o_t)$$

CRF (conditional) [Lafferty, McCallum, Pereira 2001]:
$$P(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z_{\mathbf{o}}} \exp\!\left( \sum_{t=1}^{|\mathbf{o}|} \Big[ \sum_j \lambda_j f_j(s_{t-1}, s_t) + \sum_k \mu_k g_k(s_t, o_t) \Big] \right)$$

(A special case of MEMMs and CRFs.)

Page 21

Conditional Random Fields (CRFs)

[Linear-chain CRF graphical model: states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, each connected to the whole observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}.]

Markov on s, conditional dependency on o.

$$P(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z_{\mathbf{o}}} \exp\!\left( \sum_{t=1}^{|\mathbf{o}|} \sum_k \lambda_k f_k(s_{t-1}, s_t, \mathbf{o}, t) \right)$$

Hammersley-Clifford theorem stipulates that the CRF has this form—an exponential function of the cliques in the graph.

Assuming that the dependency structure of the states is tree-shaped (linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|2)—just like HMMs.

Set parameters by maximum likelihood, using an optimization method on the log-likelihood L.

[Lafferty, McCallum, Pereira 2001]
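A minimal sketch (not from the paper) of computing the CRF probability above: per-position log-potential matrices bundle the weighted feature sums, and the normalizer Z_o comes from a forward-style dynamic program, the same O(|o| |S|^2) recursion mentioned above. The array conventions here are assumptions for illustration.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(state_seq, log_potentials):
    """log P(s | o) for a linear-chain CRF.

    log_potentials: T x S x S array; entry [t, i, j] = sum_k lambda_k * f_k(s_{t-1}=i, s_t=j, o, t).
                    Position 0 uses a dummy 'start' previous state stored in row 0.
    state_seq:      length-T list of state indices.
    """
    T, S, _ = log_potentials.shape
    # Unnormalized log-score of the given state sequence.
    score = log_potentials[0, 0, state_seq[0]]
    for t in range(1, T):
        score += log_potentials[t, state_seq[t - 1], state_seq[t]]
    # log Z_o via the forward algorithm (dynamic programming in log space).
    alpha = log_potentials[0, 0, :]
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + log_potentials[t], axis=0)
    return score - logsumexp(alpha)
```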

Page 22

General CRFs vs. HMMs

• More general and expressive modeling technique

• Comparable computational efficiency for inference

• Features may be arbitrary functions of any or all observations

• Parameters need not fully specify generation of observations; require less training data

• Easy to incorporate domain knowledge

Page 23

Main Point #1

Conditional probability sequence models

give great flexibility regarding features used,

and have efficient dynamic-programming-based algorithms for inference.

Page 24

Outline

• The need. The task.

• Review of Random Field Models for IE.

• Recent work

– New results for Table Extraction

– New training procedure using Feature Induction

– New extended model for Co-reference Resolution

• Conclusions

Page 25

Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.

               Milk Cows and Production of Milk and Milkfat:
                         United States, 1993-95
--------------------------------------------------------------------------------
       :            :        Production of Milk and Milkfat 2/
       :   Number   :-------------------------------------------------------
 Year  :     of     :   Per Milk Cow    :  Percentage   :       Total
       :Milk Cows 1/:-------------------: of Fat in All :------------------
       :            :  Milk  : Milkfat  : Milk Produced :  Milk  : Milkfat
--------------------------------------------------------------------------------
       : 1,000 Head     --- Pounds ---      Percent       Million Pounds
       :
 1993  :   9,589      15,704      575        3.66        150,582    5,514.4
 1994  :   9,500      16,175      592        3.66        153,664    5,623.7
 1995  :   9,461      16,451      602        3.66        155,644    5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

Page 26

Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was

slightly below 1994. Producer returns averaged $12.93 per hundredweight,

$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,

1 percent above 1994. Marketings include whole milk sold to plants and dealers

as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,

8 percent less than 1994. Calves were fed 78 percent of this milk with the

remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat:

United States, 1993-95

--------------------------------------------------------------------------------

: : Production of Milk and Milkfat 2/

: Number :-------------------------------------------------------

Year : of : Per Milk Cow : Percentage : Total

:Milk Cows 1/:-------------------: of Fat in All :------------------

: : Milk : Milkfat : Milk Produced : Milk : Milkfat

--------------------------------------------------------------------------------

: 1,000 Head --- Pounds --- Percent Million Pounds

:

1993 : 9,589 15,704 575 3.66 150,582 5,514.4

1994 : 9,500 16,175 592 3.66 153,664 5,623.7

1995 : 9,461 16,451 602 3.66 155,644 5,694.3

--------------------------------------------------------------------------------

1/ Average number during year, excluding heifers not yet fresh.

2/ Excludes milk sucked by calves.

CRF labels:
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ... (12 in all)

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, at time offsets {0,0}, {-1,0}, {0,1}, {1,2}.

100+ documents from www.fedstats.gov
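As an illustration only (not the authors' implementation), line-level features like those listed above reduce to simple predicates over a line of text and its neighbor; the feature names here are hypothetical.

```python
def line_features(line, prev_line):
    """A few line-level features for plain-text table extraction."""
    n = max(len(line), 1)
    feats = {
        "pct_digit_chars": sum(c.isdigit() for c in line) / n,
        "pct_alpha_chars": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith((" ", "\t")),
        "has_5_consecutive_spaces": "     " in line,
        # Crude alignment check: a run of whitespace starts at the same column in both lines.
        "whitespace_aligns_with_prev": any(
            line[i:i + 2] == "  " and prev_line[i:i + 2] == "  "
            for i in range(max(0, min(len(line), len(prev_line)) - 1))
        ),
    }
    return feats
```

Conjunctions of such features at time offsets {0,0}, {-1,0}, {0,1}, {1,2} can then be formed by pairing the feature dictionaries of neighboring lines.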

Page 27

Table Extraction Experimental Results

Method                     Line labels, percent correct    Table segments, F1
HMM                        65 %                            64 %
Stateless MaxEnt           85 %                            -
CRF w/out conjunctions     52 %                            68 %
CRF                        95 %                            92 %

(Error reduction of CRF relative to HMM: 85 % on line labels, 77 % on table segments.)

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Page 28

Main Point #2

Conditional Random Fields are more accurate in practice than HMMs

... on a table extraction task.

... (and others)

Page 29

Outline

• The need. The task.

• Review of Random Field Models for IE.

• Recent work

– New results for Table Extraction

– New training procedure using Feature Induction

– New extended model for Co-reference Resolution

• Conclusions

Page 30

A key remaining question for CRFs, given their freedom to include the kitchen sink's worth of features, is: what features to include?

Page 31

Feature Induction for CRFs

1. Begin with knowledge of atomic features, but no features yet in the model.

2. Consider many candidate features, including atomic and conjunctions.

3. Evaluate each candidate feature.

4. Add to the model some that are ranked highest.

5. Train the model.

[McCallum, 2003, UAI, submitted]
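A minimal sketch of the induction loop just described (not the paper's code); `likelihood_gain` and `train` are hypothetical callables standing in for steps 3 and 5.

```python
def induce_features(atomic_features, data, likelihood_gain, train,
                    rounds=10, add_per_round=100):
    """Greedy feature induction: repeatedly add the highest-gain candidates, then retrain.

    likelihood_gain(feature, weights, data) -> estimated gain in training log-likelihood
    train(features, data)                   -> trained weights for the current feature set
    """
    model_features = []      # 1. no features in the model yet
    weights = {}
    for _ in range(rounds):
        # 2. Candidates: atomic features plus conjunctions with features already in the model.
        candidates = set(atomic_features)
        candidates |= {(f, g) for f in atomic_features for g in model_features}
        candidates -= set(model_features)
        # 3. Evaluate each candidate.
        ranked = sorted(candidates, key=lambda f: likelihood_gain(f, weights, data), reverse=True)
        # 4. Add some of the highest-ranked candidates to the model.
        model_features.extend(ranked[:add_per_round])
        # 5. Train the model.
        weights = train(model_features, data)
    return model_features, weights
```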

Page 32

Candidate Feature Evaluation

Common method: Information Gain

$$\text{InfoGain}(C, F) = H(C) - \sum_{f \in F} P(f)\, H(C \mid f)$$

True optimization criterion: Likelihood of training data

$$\text{LikelihoodGain}(f, \lambda) = L_{\Lambda + \{f, \lambda\}} - L_{\Lambda}$$

Technical meat is in how to calculate this efficiently for CRFs:
• Mean field approximation
• Emphasize error instances (related to Boosting)
• Newton's method to set the candidate feature's weight λ

[McCallum, 2003, UAI, submitted]

Page 33

Named Entity Recognition

CRICKET - MILLNS SIGNS FOR BOLAND

CAPE TOWN 1996-08-22

South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels:   Examples:
PER       Yayuk Basuki, Innocent Butare
ORG       3M, KDP, Cleveland
LOC       Cleveland, Nirmal Hriday, The Oval
MISC      Java, Basque, 1,000 Lakes Rally

Page 34

Automatically Induced Features

Index   Feature
0       inside-noun-phrase (ot-1)
5       stopword (ot)
20      capitalized (ot+1)
75      word=the (ot)
100     in-person-lexicon (ot-1)
200     word=in (ot+2)
500     word=Republic (ot+1)
711     word=RBI (ot) & header=BASEBALL
1027    header=CRICKET (ot) & in-English-county-lexicon (ot)
1298    company-suffix-word (first mention, t+2)
4040    location (ot) & POS=NNP (ot) & capitalized (ot) & stopword (ot-1)
4945    moderately-rare-first-name (ot-1) & very-common-last-name (ot)
4474    word=the (ot-2) & word=of (ot)

[McCallum & Li, 2003, CoNLL]

Page 35

Named Entity Extraction Results

Method                                F1     # features
CRFs w/out Feature Induction          75 %   ~1M
CRFs with Feature Induction           89 %   ~60k
(induction based on LikelihoodGain)

[McCallum & Li, 2003, CoNLL]

Page 36

Main Point #3

One of CRFs' chief challenges---choosing the feature set---can also be addressed in a formal way, based on maximum likelihood.

For CRFs (and many other methods) Information Gain is not the appropriate metric for selecting features.

Page 37

Outline

• The need. The task.

• Review of Random Field Models for IE.

• Recent work

– New results for Table Extraction

– New training procedure using Feature Induction

– New extended model for Co-reference Resolution

• Conclusions

Page 38

IE in Context

[Same "IE in Context" pipeline diagram as on Page 5.]


Page 40

Coreference Resolution

AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"

Input: a news article, with named-entity "mentions" tagged.
[Mock article: "Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . ."]

Output: number of entities, N = 3
#1  Secretary of State Colin Powell, he, Mr. Powell, Powell
#2  Condoleezza Rice, she, Rice
#3  President Bush, Bush

Page 41

Inside the Traditional Solution

Mention (3) Mention (4)

. . . Mr Powell . . . . . . Powell . . .

Fires?  Feature                                           Weight
N       Two words in common                                 29
Y       One word in common                                  13
Y       "Normalized" mentions are string identical          39
Y       Capitalized word in common                          17
Y       > 50% character tri-gram overlap                    19
N       < 25% character tri-gram overlap                   -34
Y       In same sentence                                     9
Y       Within two sentences                                 8
N       Further than 3 sentences apart                      -1
Y       "Hobbs Distance" < 3                                11
N       Number of entities in between two mentions = 0      12
N       Number of entities in between two mentions > 4      -3
Y       Font matches                                          1
Y       Default                                            -19

OVERALL SCORE = 98 > threshold=0

Pair-wise Affinity Metric

Y/N?
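A minimal sketch (not the actual system) of such a pair-wise affinity metric: the weights of whichever features fire are summed and compared to a threshold. The predicates and the handful of weights below are hypothetical, merely echoing a few rows of the table.

```python
def affinity(a, b, weights, features):
    """Sum the weights of the features that fire for mention pair (a, b), plus a default bias."""
    score = weights.get("default", 0.0)
    for name, fires in features.items():
        if fires(a, b):
            score += weights[name]
    return score

# Illustrative weights and predicates (hypothetical; mentions are dicts with 'text', 'words', 'sentence').
weights = {
    "one_word_in_common": 13,
    "normalized_mentions_identical": 39,
    "capitalized_word_in_common": 17,
    "within_two_sentences": 8,
    "default": -19,
}
features = {
    "one_word_in_common": lambda a, b: bool(set(a["words"]) & set(b["words"])),
    "normalized_mentions_identical": lambda a, b: a["text"].lower() == b["text"].lower(),
    "capitalized_word_in_common": lambda a, b: bool(
        {w for w in a["words"] if w[:1].isupper()} & {w for w in b["words"] if w[:1].isupper()}),
    "within_two_sentences": lambda a, b: abs(a["sentence"] - b["sentence"]) <= 2,
}

m3 = {"text": "Mr Powell", "words": ["Mr", "Powell"], "sentence": 4}
m4 = {"text": "Powell", "words": ["Powell"], "sentence": 6}
coreferent = affinity(m3, m4, weights, features) > 0   # threshold = 0, as above
```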

Page 42

The Problem

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . .

affinity = 98

affinity = 11

affinity = 104

Pair-wise merging decisions are being made independently from each other (here: Y, Y, N).

Affinity measures are noisy and imperfect.

They should be made in relational dependence with each other.

Page 43

A UC Berkeley Solution [Russell 2001], [Pasula et al 2002]

[Generative graphical-model diagram; nodes include mention-level variables (id, words, context, distance, fonts) and entity-level variables (id, surname, age, gender), replicated in a plate of size N.]

(Applied to citation matching, and object correspondence in vision)

Problems:
1) Generative model makes it difficult to use complex features.
2) Number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify model structure during inference---complicated.

Page 44

A Markov Random Field for Co-reference

[Diagram: three mentions (". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . .") connected pair-wise by edges with weights 45, 30, and 11, each edge carrying a Y/N coreference decision.]

[McCallum & Wellner, 2003, ICML submitted]

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \exp\!\left( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) \;+\; \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \right) \qquad \text{(MRF)}$$

Make pair-wise merging decisions in dependent relation to each other by:
- calculating a joint probability,
- including all edge weights,
- adding dependence on consistent triangles.

Page 45

A Markov Random Field for Co-reference

[Diagram: the same three mentions and edge weights (45, 30, 11), now with a consistent edge labeling Y, N, N; the score of this configuration is 4.]

[McCallum & Wellner, 2003, ICML submitted]

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \exp\!\left( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) \;+\; \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \right) \qquad \text{(MRF)}$$

Page 46

A Markov Random Field for Co-reference

[Diagram: the same three mentions and edge weights, now with the edge labeling Y, Y, N; this configuration is disallowed (infinite penalty for an inconsistent triangle).]

[McCallum & Wellner, 2003, ICML submitted]

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \exp\!\left( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) \;+\; \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \right) \qquad \text{(MRF)}$$

Page 47

A Markov Random Field for Co-reference

[Diagram: the same three mentions and edge weights, now with a consistent edge labeling N, Y, N; the score of this configuration is 64.]

[McCallum & Wellner, 2003, ICML submitted]

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \exp\!\left( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) \;+\; \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \right) \qquad \text{(MRF)}$$

Page 48

Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

[Graph over four mentions (". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . .", ". . . Condoleezza Rice . . .") with pair-wise edge weights 45, 30, 11, 134, 10, 106.]

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \exp\!\left( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) \;+\; \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \right)$$

Page 49

Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

[The same graph of four mentions and edge weights.]

$$\log P(\mathbf{y} \mid \mathbf{x}) \;\propto\; \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) \;=\; \sum_{i,j\ \text{within partitions}} w_{ij} \;-\; \sum_{i,j\ \text{across partitions}} w_{ij}$$

Page 50

Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

[The same graph, with one candidate partition of the mentions indicated.]

$$\log P(\mathbf{y} \mid \mathbf{x}) \;\propto\; \sum_{i,j\ \text{within partitions}} w_{ij} \;-\; \sum_{i,j\ \text{across partitions}} w_{ij} \;=\; 22$$
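To make the partitioning objective concrete, here is a small sketch (not from the paper) that scores a candidate partition by summing edge weights within clusters and subtracting edge weights across clusters; the mentions, weights, and signs below are hypothetical stand-ins.

```python
def partition_score(edge_weights, partition):
    """Sum of w_ij within clusters minus sum of w_ij across clusters.

    edge_weights: dict mapping frozenset({i, j}) -> w_ij for each mention pair.
    partition:    list of sets of mentions (the candidate clustering).
    """
    cluster_of = {m: k for k, cluster in enumerate(partition) for m in cluster}
    score = 0.0
    for pair, w in edge_weights.items():
        i, j = tuple(pair)
        score += w if cluster_of[i] == cluster_of[j] else -w
    return score

# Hypothetical 4-mention example in the spirit of the slides (weights are illustrative only).
edge_weights = {
    frozenset({"Mr Powell", "Powell"}): 45,
    frozenset({"Mr Powell", "she"}): -30,
    frozenset({"Powell", "she"}): 11,
    frozenset({"she", "Condoleezza Rice"}): 134,
    frozenset({"Mr Powell", "Condoleezza Rice"}): -10,
    frozenset({"Powell", "Condoleezza Rice"}): -106,
}
candidates = [
    [{"Mr Powell", "Powell"}, {"she", "Condoleezza Rice"}],
    [{"Mr Powell", "Powell", "she", "Condoleezza Rice"}],
]
best = max(candidates, key=lambda p: partition_score(edge_weights, p))
```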

Page 51

Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

[The same graph, with a different candidate partition indicated.]

$$\log P(\mathbf{y} \mid \mathbf{x}) \;\propto\; \sum_{i,j\ \text{within partitions}} w_{ij} \;-\; \sum_{i,j\ \text{across partitions}} w_{ij} \;=\; 314$$

Page 52

Markov Random Fields for Co-reference

• Train by maximum likelihood
  – Can calculate the gradient by Gibbs sampling, or approximate it by stochastic gradient ascent (e.g. voted perceptron).
  – Given labeled training data in which partitions are given, learn an affinity measure for which partitioning will reproduce those partitions.

• In need of better algorithms for graph partitioning
  – Standard algorithms (e.g. Fiduccia-Mattheyses) do not apply with negative edge weights.
  – Currently beginning to use "Typical Cut" [Gdalyahu, Weinshall, Werman, 1999], a greedy, randomized algorithm.

Page 53

Co-reference Experimental Results

Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories

Method                       Partition F1    Pair F1
Single-link threshold        16 %            18 %
Best prev match [Morton]     83 %            89 %
MRFs                         88 %            92 %
(error reduction vs. best previous match: 30 % partition, 28 % pair)

DARPA MUC-6 newswire article corpus, 30 stories

Method                       Partition F1    Pair F1
Single-link threshold        11 %            7 %
Best prev match [Morton]     70 %            76 %
MRFs                         74 %            80 %
(error reduction vs. best previous match: 13 % partition, 17 % pair)

[McCallum & Wellner, 2003, ICML submitted]

Page 54

Outline

• The need. The task.

• Review of Random Field Models for IE.

• Recent work

– New results for Table Extraction

– New training procedure using Feature Induction

– New extended model for Co-reference Resolution

• Conclusions

Page 55

MEMM & CRF Related Work• Maximum entropy for language tasks:

– Language modeling [Rosenfeld ‘94, Chen & Rosenfeld ‘99]– Part-of-speech tagging [Ratnaparkhi ‘98]– Segmentation [Beeferman, Berger & Lafferty ‘99]– Named entity recognition “MENE” [Borthwick, Grishman,…’98]

• HMMs for similar language tasks– Part of speech tagging [Kupiec ‘92]– Named entity recognition [Bikel et al ‘99]– Other Information Extraction [Leek ‘97], [Freitag & McCallum ‘99]

• Serial Generative/Discriminative Approaches– Speech recognition [Schwartz & Austin ‘93]– Reranking Parses [Collins, ‘00]

• Other conditional Markov models– Non-probabilistic local decision models [Brill ‘95], [Roth ‘98]– Gradient-descent on state path [LeCun et al ‘98]– Markov Processes on Curves (MPCs) [Saul & Rahim ‘99]– Voted Perceptron-trained FSMs [Collins ’02]

Page 56

Conclusions

• Conditional Markov Random fields combine the benefits of
  – Conditional probability models (arbitrary features)
  – Markov models (for sequences or other relations)

• Success in
  – Feature induction
  – Coreference analysis

• Future work:
  – Factorial CRFs, structure learning, semi-supervised learning.
  – Tight integration of IE and Data Mining

Page 57

End of talk

Page 58

Voted Perceptron Sequence Models

Given training data $\{(\mathbf{o}^{(i)}, \mathbf{s}^{(i)})\}$
Initialize parameters to zero: $\lambda_k = 0$
Iterate to convergence:
  for all training instances $i$:
    $\mathbf{s}_{\text{Viterbi}} = \arg\max_{\mathbf{s}} \exp\big( \sum_t \sum_k \lambda_k f_k(s_{t-1}, s_t, \mathbf{o}, t) \big)$
    $\lambda_k \leftarrow \lambda_k + C_k(\mathbf{s}^{(i)}, \mathbf{o}^{(i)}) - C_k(\mathbf{s}_{\text{Viterbi}}, \mathbf{o}^{(i)})$
where $C_k(\mathbf{s}, \mathbf{o}) = \sum_t f_k(s_{t-1}, s_t, \mathbf{o}, t)$, as before.

[Collins 2002]

Like CRFs with stochastic gradient ascent and a Viterbi approximation.

Avoids calculating the partition function (normalizer) Z_o, but uses gradient ascent rather than a 2nd-order or conjugate-gradient method.

Analogous to the gradient for this one training instance.
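A minimal sketch of the update above in code (following Collins '02 at a high level, not his implementation); `viterbi_decode` and `feature_counts` are hypothetical callables corresponding to the argmax and to C_k(s, o).

```python
import numpy as np

def perceptron_train(data, num_features, viterbi_decode, feature_counts, epochs=5):
    """Structured (voted/averaged) perceptron training for a finite-state sequence model.

    data:           list of (observations o, true state sequence s) pairs
    viterbi_decode: (o, weights) -> argmax_s sum_t sum_k weights[k] * f_k(s_{t-1}, s_t, o, t)
    feature_counts: (s, o) -> length-num_features vector C(s, o) = sum_t f(s_{t-1}, s_t, o, t)
    """
    weights = np.zeros(num_features)      # initialize parameters to zero
    weight_sum = np.zeros(num_features)   # running sum, for the averaged ("voted") variant
    steps = 0
    for _ in range(epochs):               # iterate (to convergence, in the idealized version)
        for o, s_true in data:
            s_pred = viterbi_decode(o, weights)
            if s_pred != s_true:
                # Analogous to the gradient for this one training instance.
                weights = weights + feature_counts(s_true, o) - feature_counts(s_pred, o)
            weight_sum += weights
            steps += 1
    return weight_sum / steps             # averaged weights
```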

Page 59

Part-of-speech Tagging

The asbestos fiber , crocidolite, is unusually resilient once

it enters the lungs , with even brief exposures to it causing

symptoms that show up decades later , researchers said .

DT NN NN , NN , VBZ RB JJ IN

PRP VBZ DT NNS , IN RB JJ NNS TO PRP VBG

NNS WDT VBP RP NNS JJ , NNS VBD .

45 tags, 1M words training data, Penn Treebank

                 Without spelling features        Using spelling features*
                 error       oov error            error      Δerr      oov error    Δerr
HMM              5.69 %      45.99 %
CRF              5.55 %      48.05 %              4.27 %     -24 %     23.76 %      -50 %

* use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.

[Pereira 2001 personal comm.]

Page 60

Big-Picture Future Directions

• Unification of IE, data mining, decision making
  – Build large, working systems to test these ideas
  – Bring CiteSeer to UMass; build super-replacement

• Configuring & training complex systems with less human effort
  – Resource-bounded human/machine interaction
  – Learning from labeled and unlabeled data, semi-supervised learning

• Combinations of text with other modalities
  – images, video, sound
  – robotics, state estimation, model-building
  – bioinformatics, security ...

• Use of IE-inspired models for applications in
  – networking, compilers, security, bio-sequences, ...

Page 61

IE from ...

• SEC Filings
  – Dale Kirkland (NASD), charged with detecting securities trading fraud

• Open Source software
  – Lee Giles (PennState), building a system that supports code sharing, interoperability and assessment of reliability

• Biology and Bioinformatics Research Pubs
  – Karen Duca (Virginia Bioinformatics Institute), part of a large team trying to assemble facts that will help guide future experiments

• Web pages about researchers, groups, pubs, conferences and grants
  – Steve Mahaney (NSF), charged with a Congressionally-mandated effort by NSF to model the research impact of its grants