Information Extraction with Markov Random Fields

Andrew McCallum
University of Massachusetts Amherst
Joint work with
Aron Culotta, Wei Li and Ben Wellner
Bruce Croft, James Allan, David Pinto, Xing Wei, Andres Corrada-Emmanuel
David Jensen, Jen Neville,
John Lafferty (CMU), Fernando Pereira (UPenn)
IE from Cargo Container Ship Manifests
Cargo Tracking Div., US Navy

IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

IE from Research Papers [McCallum et al '99]
IE in Context

Spider → Document collection → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search → Data mining (Prediction, Outlier detection, Decision support)

Supporting steps: Create ontology; Label training data; Train extraction models
What is “Information Extraction”
Filling slots in a database from sub-segments of text. As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATION
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
What is "Information Extraction"

Information Extraction = segmentation + classification + association + clustering

As a family of techniques, applied to the same news story: segmentation and classification pick out the typed mentions:

Microsoft Corporation / CEO / Bill Gates / Microsoft / Gates / Microsoft / Bill Veghte / Microsoft / VP / Richard Stallman / founder / Free Software Foundation
Association and clustering then yield the filled table:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Issues that arise

Application issues
• Directed spidering
• Page filtering
• Sequence segmentation
• Segment classification
• Record association
• Co-reference

Scientific issues
• Learning more than 100k parameters from limited and noisy training data
• Taking advantage of rich, interdependent features
• Deciding which input feature representation to use
• Efficient inference in models with many interrelated predictions
• Clustering massive data sets
Outline
• The need. The task.
• Review of Random Field Models for IE.
• Recent work
– New results for Table Extraction
– New training procedure using Feature Induction
– New extended model for Co-reference Resolution
• Conclusions
Hidden Markov Models
Finite state model / graphical model: a chain of states ... S_{t-1}, S_t, S_{t+1} ..., each emitting an observation O_t.

Parameters, for all states S = {s1, s2, ...}:
  Start state probabilities: P(s_1)
  Transition probabilities: P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t)

Training: maximize probability of training observations (w/ prior).

  P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...

Generates: a state sequence and an observation sequence o1 o2 o3 ... The emission distribution is usually a multinomial over an atomic, fixed alphabet.
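The factored joint probability above can be computed directly; a minimal sketch in Python, where the two states and all probability values are invented for illustration (they are not from the talk):

```python
# Toy HMM with exactly the parameters the slide names: start, transition,
# and emission probabilities. States and values are illustrative only.
states = ["background", "person"]
start = {"background": 0.8, "person": 0.2}
trans = {("background", "background"): 0.7, ("background", "person"): 0.3,
         ("person", "background"): 0.6, ("person", "person"): 0.4}
emit = {("background", "yesterday"): 0.5, ("background", "spoke"): 0.5,
        ("person", "arny"): 0.5, ("person", "rosenberg"): 0.5}

def joint_prob(s, o):
    """P(s, o) = P(s1) P(o1|s1) * product over t of P(st|st-1) P(ot|st)."""
    p = start[s[0]] * emit.get((s[0], o[0]), 0.0)
    for t in range(1, len(o)):
        p *= trans[(s[t - 1], s[t])] * emit.get((s[t], o[t]), 0.0)
    return p

print(joint_prob(["background", "person", "person", "background"],
                 ["yesterday", "arny", "rosenberg", "spoke"]))
```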
IE with Hidden Markov Models
Given a sequence of observations:

  Yesterday Arny Rosenberg spoke this example sentence.

and a trained HMM with states such as "person name", "location name", and "background",

find the most likely state sequence (Viterbi):

  s* = argmax_s P(s, o)

Any words said to be generated by the designated "person name" state are extracted as a person name:

  Person name: Arny Rosenberg
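Viterbi decoding for such an HMM can be sketched as below; the parameters are the same invented toy values as above, with a designated "person" state whose tokens get extracted:

```python
# Viterbi: find argmax_s P(s, o) by dynamic programming, then read off
# the tokens labeled by the "person" state. Toy parameters, illustrative only.
states = ["background", "person"]
start = {"background": 0.8, "person": 0.2}
trans = {("background", "background"): 0.7, ("background", "person"): 0.3,
         ("person", "background"): 0.6, ("person", "person"): 0.4}
emit = {("background", "yesterday"): 0.5, ("background", "spoke"): 0.5,
        ("person", "arny"): 0.5, ("person", "rosenberg"): 0.5}

def viterbi(obs):
    # delta[t][s]: probability of the best state sequence ending in s at t;
    # back[t][s]: that sequence's state at t-1, for path recovery.
    delta = [{s: start[s] * emit.get((s, obs[0]), 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            best = max(states, key=lambda r: delta[t - 1][r] * trans[(r, s)])
            delta[t][s] = delta[t - 1][best] * trans[(best, s)] * emit.get((s, obs[t]), 0.0)
            back[t][s] = best
    path = [max(states, key=lambda s: delta[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

obs = ["yesterday", "arny", "rosenberg", "spoke"]
labels = viterbi(obs)
print([w for w, s in zip(obs, labels) if s == "person"])  # extracted person-name tokens
```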
Regrets from Atomic View of Tokens
Would like richer representation of text: multiple overlapping features, whole chunks of text.
Example line, sentence, or paragraph features:
– length
– is centered in page
– percent of non-alphabetics
– white-space aligns with next line
– containing sentence has two verbs
– grammatically contains a question
– contains links to "authoritative" pages
– emissions that are uncountable
– features at multiple levels of granularity

Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"
Problems with Richer Representation and a Generative Model

• These arbitrary features are not independent:
  – Overlapping and long-distance dependences
  – Multiple levels of granularity (words, characters)
  – Multiple modalities (words, formatting, layout)
  – Observations from past and future

• HMMs are generative models of the text: they model P(s, o).

• Generative models do not easily handle these non-independent features. Two choices:
  – Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
  – Ignore the dependencies. This causes "over-counting" of evidence (a la naive Bayes). Big problem when combining evidence, as in Viterbi!
Conditional Sequence Models
• We would prefer a conditional model: P(s|o) instead of P(s,o):
  – Can examine features, but not responsible for generating them.
  – Don't have to explicitly model their dependencies.
  – Don't "waste modeling effort" trying to generate what we are given at test time anyway.

  s = s1, s2, ..., sn    o = o1, o2, ..., on
From HMMs to MEMMs to CRFs
[McCallum, Freitag & Pereira, 2000], [Lafferty, McCallum, Pereira 2001]

HMM (generative; states emit the observations):

  P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

MEMM (conditional; locally normalized next-state distributions):

  P(s | o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}, o_t)

CRF (conditional; globally normalized):

  P(s | o) = (1 / Z_o) ∏_{t=1}^{|o|} exp( Σ_j λ_j f_j(s_{t-1}, s_t) + Σ_k μ_k g_k(s_t, o_t) )

(This linear-chain form is a special case of MEMMs and CRFs in general.)

MEMMs and CRFs are conditional finite state sequence models.
Conditional Random Fields (CRFs)   [Lafferty, McCallum, Pereira 2001]

A linear chain of states S_t, S_{t+1}, ..., all conditioned on the entire observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}: Markov on s, conditional dependency on o.

  P(s | o) = (1 / Z_o) ∏_{t=1}^{|o|} exp( Σ_k λ_k f_k(s_{t-1}, s_t, o, t) )

The Hammersley-Clifford theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.

Assuming the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.

Set parameters by maximum likelihood, using an optimization method on the log-likelihood L.
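A minimal illustration of the CRF distribution, with hypothetical features and weights (not the paper's). Z_o is computed by brute-force enumeration here for clarity; the dynamic program noted above (the forward algorithm) computes the same sum in O(|o| |S|^2):

```python
import itertools
import math

# Tiny linear-chain CRF sketch. Score(s, o) = exp(sum_t sum_k lambda_k f_k(...));
# P(s|o) = Score(s, o) / Z_o. Features and weights are illustrative assumptions.
STATES = ["B", "P"]  # background / person

def features(prev, cur, obs, t):
    # Two illustrative templates: a transition indicator and a
    # capitalization cue on the current token.
    return {("trans", prev, cur): 1.0,
            ("caps", cur): 1.0 if obs[t][0].isupper() else 0.0}

weights = {("trans", None, "B"): 0.3, ("trans", "B", "P"): 0.5,
           ("trans", "P", "P"): 1.0, ("caps", "P"): 2.0}

def score(s, obs):
    total, prev = 0.0, None
    for t, cur in enumerate(s):
        for k, v in features(prev, cur, obs, t).items():
            total += weights.get(k, 0.0) * v
        prev = cur
    return math.exp(total)

def prob(s, obs):
    # Z_o by brute force over all |S|^|o| state sequences.
    z = sum(score(q, obs) for q in itertools.product(STATES, repeat=len(obs)))
    return score(s, obs) / z

obs = ["spoke", "Arny", "Rosenberg"]
print(prob(("B", "P", "P"), obs))
```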
General CRFs vs. HMMs
• More general and expressive modeling technique
• Comparable computational efficiency for inference
• Features may be arbitrary functions of any or all observations
• Parameters need not fully specify generation of observations; require less training data
• Easy to incorporate domain knowledge
Main Point #1
Conditional probability sequence models
give great flexibility regarding features used,
and have efficient dynamic-programming-based algorithms for inference.
Outline
• The need. The task.
• Review of Random Field Models for IE.
• Recent work
– New results for Table Extraction
– New training procedure using Feature Induction
– New extended model for Co-reference Resolution
• Conclusions
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat:
United States, 1993-95
--------------------------------------------------------------------------------
: : Production of Milk and Milkfat 2/
: Number :-------------------------------------------------------
Year : of : Per Milk Cow : Percentage : Total
:Milk Cows 1/:-------------------: of Fat in All :------------------
: : Milk : Milkfat : Milk Produced : Milk : Milkfat
--------------------------------------------------------------------------------
: 1,000 Head --- Pounds --- Percent Million Pounds
:
1993 : 9,589 15,704 575 3.66 150,582 5,514.4
1994 : 9,500 16,175 592 3.66 153,664 5,623.7
1995 : 9,461 16,451 602 3.66 155,644 5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
CRF labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, at time offsets {0,0}, {-1,0}, {0,1}, {1,2}

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov
Table Extraction Experimental Results   [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

                          Line labels,      Table segments,
                          percent correct   F1
HMM                       65 %              64 %
Stateless MaxEnt          85 %              -
CRF w/out conjunctions    52 %              68 %
CRF                       95 %              92 %

CRF vs. HMM: error reduced by 85% on line labels, 77% on table segments.
Main Point #2
Conditional Random Fields are more accurate in practice than HMMs
... on a table extraction task.
... (and others)
Outline
• The need. The task.
• Review of Random Field Models for IE.
• Recent work
– New results for Table Extraction
– New training procedure using Feature Induction
– New extended model for Co-reference Resolution
• Conclusions
A key remaining question for CRFs, given their freedom to include the kitchen sink's worth of features, is: what features to include?
Feature Induction for CRFs
1. Begin with knowledge of atomic features, but no features yet in the model.
2. Consider many candidate features, including atomic and conjunctions.
3. Evaluate each candidate feature.
4. Add to the model some that are ranked highest.
5. Train the model.
[McCallum, 2003, UAI, submitted]
Candidate Feature Evaluation

Common method: Information Gain

  InfoGain(C, F) = H(C) − Σ_{f∈F} P(f) H(C|f)

True optimization criterion: likelihood of training data

  LikelihoodGain(f, λ) = L_{Λ+fλ} − L_Λ

The technical meat is in how to calculate this efficiently for CRFs:
• Mean field approximation
• Emphasize error instances (related to Boosting)
• Newton's method to set the candidate feature's weight λ

[McCallum, 2003, UAI, submitted]
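The information-gain formula above can be computed in a few lines; the tiny labeled dataset is invented for illustration (and, per the slide, likelihood gain rather than this quantity is the right criterion for CRFs):

```python
import math
from collections import Counter

# The slide's information-gain formula for ranking a candidate feature:
# InfoGain(C, F) = H(C) - sum over f of P(f) H(C|f).
# The four-token labeled dataset below is an illustrative assumption.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature_values):
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [lab for lab, f in zip(labels, feature_values) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

labels = ["PER", "PER", "ORG", "ORG"]
caps = [1, 1, 1, 0]  # candidate feature: token is capitalized
print(info_gain(labels, caps))
```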
Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Labels and examples:

PER   Yayuk Basuki, Innocent Butare
ORG   3M, KDP, Cleveland
LOC   Cleveland, Nirmal Hriday, The Oval
MISC  Java, Basque, 1,000 Lakes Rally
Automatically Induced Features
Index Feature
0 inside-noun-phrase (ot-1)
5 stopword (ot)
20 capitalized (ot+1)
75 word=the (ot)
100 in-person-lexicon (ot-1)
200 word=in (ot+2)
500 word=Republic (ot+1)
711 word=RBI (ot) & header=BASEBALL
1027 header=CRICKET (ot) & in-English-county-lexicon (ot)
1298 company-suffix-word (firstmention+2)
4040 location (ot) & POS=NNP (ot) & capitalized (ot) & stopword (ot-1)
4945 moderately-rare-first-name (ot-1) & very-common-last-name (ot)
4474 word=the (ot-2) & word=of (ot)
[McCallum & Li, 2003, CoNLL]
Named Entity Extraction Results   [McCallum & Li, 2003, CoNLL]

Method                                                  F1     # features
CRFs w/out Feature Induction                            75%    ~1M
CRFs with Feature Induction (based on Likelihood Gain)  89%    ~60k
Main Point #3
One of CRFs' chief challenges, choosing the feature set, can also be addressed in a formal way, based on maximum likelihood.
For CRFs (and many other methods) Information Gain is not the appropriate metric for selecting features.
Outline
• The need. The task.
• Review of Random Field Models for IE.
• Recent work
– New results for Table Extraction
– New training procedure using Feature Induction
– New extended model for Co-reference Resolution
• Conclusions
IE in Context

Spider → Document collection → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search → Data mining (Prediction, Outlier detection, Decision support)

Supporting steps: Create ontology; Label training data; Train extraction models
Coreference Resolution

AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"

Input: news article, with named-entity "mentions" tagged:

  Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . .

Output: number of entities, N = 3
  #1  Secretary of State Colin Powell; he; Mr. Powell; Powell
  #2  Condoleezza Rice; she; Rice
  #3  President Bush; Bush
Inside the Traditional Solution

Pair-wise affinity metric: decide Y/N for each pair of mentions, e.g. Mention (3) ". . . Mr Powell . . ." vs. Mention (4) ". . . Powell . . .":

  N  Two words in common                              29
  Y  One word in common                               13
  Y  "Normalized" mentions are string identical       39
  Y  Capitalized word in common                       17
  Y  > 50% character tri-gram overlap                 19
  N  < 25% character tri-gram overlap                -34
  Y  In same sentence                                  9
  Y  Within two sentences                              8
  N  Further than 3 sentences apart                   -1
  Y  "Hobbs Distance" < 3                             11
  N  Number of entities in between two mentions = 0   12
  N  Number of entities in between two mentions > 4   -3
  Y  Font matches                                      1
  Y  Default                                         -19

  OVERALL SCORE = 98 > threshold = 0, so: Y
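The hand-tuned affinity score is just a sum of weights over the features that fire; a sketch reproducing the "Mr Powell" / "Powell" example, with the weights and firings copied from the slide:

```python
# Traditional pairwise affinity: each firing feature contributes its
# weight, and the pair is merged if the total exceeds the threshold (0).
WEIGHTS = {
    "two words in common": 29,
    "one word in common": 13,
    "normalized mentions string identical": 39,
    "capitalized word in common": 17,
    ">50% character trigram overlap": 19,
    "<25% character trigram overlap": -34,
    "in same sentence": 9,
    "within two sentences": 8,
    "further than 3 sentences apart": -1,
    "Hobbs distance < 3": 11,
    "entities in between = 0": 12,
    "entities in between > 4": -3,
    "font matches": 1,
    "default": -19,
}

def affinity(firing):
    return sum(WEIGHTS[name] for name in firing)

# Features marked Y on the slide for "Mr Powell" vs "Powell":
firing = ["one word in common", "normalized mentions string identical",
          "capitalized word in common", ">50% character trigram overlap",
          "in same sentence", "within two sentences", "Hobbs distance < 3",
          "font matches", "default"]

s = affinity(firing)
print(s, "merge" if s > 0 else "don't merge")
```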
The Problem

Mentions: ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ."
Pairwise affinities: 98, 11, 104

Pair-wise merging decisions (Y, Y, N) are being made independently from each other.

Affinity measures are noisy and imperfect. They should be made in relational dependence with each other.
A UC Berkeley Solution   [Russell 2001], [Pasula et al 2002]

A generative probabilistic model: each entity has attributes (id, surname, age, gender); each of N mentions has observed properties (id, words, context, distance, fonts) generated from its entity.

(Applied to citation matching, and object correspondence in vision)

Problems:
1) Generative model makes it difficult to use complex features.
2) Number of entities is hard-coded into the model structure, but we are supposed to predict num entities! Thus we must modify model structure during inference, which is complicated.
A Markov Random Field for Co-reference   [McCallum & Wellner, 2003, ICML submitted]

Mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." are connected by Y/N edge variables with weights 45, 11, and 30.

  P(y|x) = (1/Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + Σ_{i,j,k} λ' f'(y_ij, y_jk, y_ik) )

Make pair-wise merging decisions in dependent relation to each other by:
- calculating a joint probability
- including all edge weights
- adding dependence on consistent triangles
Scoring example joint labelings of the three edges (weights 45, 11, 30):

  (Y, N, N):  45 − 11 − 30 = 4
  (Y, Y, N):  −∞  (an inconsistent triangle, ruled out by the f' factor)
  (N, Y, N):  −45 + 11 − 30 = −64
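The scoring above can be reproduced by enumerating joint edge labelings and ruling out inconsistent triangles; which weight sits on which mention pair is an illustrative assumption:

```python
import itertools

# Score every joint Y/N labeling of the three mention-pair edges. A Y edge
# adds its weight, an N edge subtracts it, and the f' triangle factor
# rules out intransitive labelings (a triangle with exactly two Y's).
# The pair-to-weight assignment is an illustrative assumption.
EDGES = {("Mr Powell", "Powell"): 45,
         ("Powell", "she"): 11,
         ("Mr Powell", "she"): 30}

def consistent(y):
    # Exactly two Y edges in a triangle violates transitivity.
    return list(y).count(True) != 2

def score(y):
    return sum(w if lab else -w for w, lab in zip(EDGES.values(), y))

for y in itertools.product([True, False], repeat=3):
    print(y, score(y) if consistent(y) else float("-inf"))
```

Note that with these all-positive illustrative weights the best consistent labeling merges everything; learned weights can be negative, which is what makes partitioning non-trivial.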
Inference in these MRFs = Graph Partitioning
[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

Four mentions, ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . .", ". . . Condoleezza Rice . . .", with pairwise edge weights 45, 11, 30, 134, 10, 106.
The log-probability is, up to a constant, the sum of edge weights within partitions minus the sum of edge weights across partitions:

  log P(y|x) ∝ Σ_{i,j within partitions} w_ij − Σ_{i,j across partitions} w_ij,  where w_ij = Σ_l λ_l f_l(x_i, x_j, y_ij)
Example: the partition {Mr Powell, Powell} | {she, Condoleezza Rice} scores 45 + 134 − (11 + 30 + 10 + 106) = 22.
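The partition objective can be sketched directly; the pair-to-weight assignment, beyond what the figure's arithmetic pins down, is an illustrative assumption:

```python
# Partition objective from the slide: sum of edge weights within
# partitions minus the sum of edge weights across partitions.
# Specific pair-to-weight assignments are illustrative assumptions.
WEIGHTS = {("Mr Powell", "Powell"): 45,
           ("Powell", "she"): 11,
           ("Mr Powell", "she"): 30,
           ("she", "Condoleezza Rice"): 134,
           ("Mr Powell", "Condoleezza Rice"): 10,
           ("Powell", "Condoleezza Rice"): 106}

def partition_score(partition):
    # Map each mention to the index of its block, then sum signed weights.
    block_of = {m: i for i, block in enumerate(partition) for m in block}
    return sum(w if block_of[a] == block_of[b] else -w
               for (a, b), w in WEIGHTS.items())

print(partition_score([{"Mr Powell", "Powell"}, {"she", "Condoleezza Rice"}]))
```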
With a second set of weights w', the illustrated partition scores 314.
Markov Random Fields for Co-reference

• Train by maximum likelihood
  – Can calculate the gradient by Gibbs sampling, or approximate by stochastic gradient ascent (e.g. voted perceptron).
  – Given labeled training data in which the partitions are given, learn an affinity measure for which partitioning will reproduce those partitions.

• In need of better algorithms for graph partitioning
  – Standard algorithms (e.g. Fiduccia-Mattheyses) do not apply with negative edge weights.
  – Currently beginning to use "Typical Cut" [Gdalyahu, Weinshall, Werman, 1999], a greedy, randomized algorithm.
Co-reference Experimental Results   [McCallum & Wellner, 2003, ICML submitted]

Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories:

                           Partition F1   Pair F1
Single-link threshold      16 %           18 %
Best prev match [Morton]   83 %           89 %
MRFs                       88 %           92 %

MRFs vs. best prev match: error reduced by 30% (partition), 28% (pair).

DARPA MUC-6 newswire article corpus, 30 stories:

                           Partition F1   Pair F1
Single-link threshold      11 %            7 %
Best prev match [Morton]   70 %           76 %
MRFs                       74 %           80 %

MRFs vs. best prev match: error reduced by 13% (partition), 17% (pair).
Outline
• The need. The task.
• Review of Random Field Models for IE.
• Recent work
– New results for Table Extraction
– New training procedure using Feature Induction
– New extended model for Co-reference Resolution
• Conclusions
MEMM & CRF Related Work

• Maximum entropy for language tasks:
  – Language modeling [Rosenfeld '94, Chen & Rosenfeld '99]
  – Part-of-speech tagging [Ratnaparkhi '98]
  – Segmentation [Beeferman, Berger & Lafferty '99]
  – Named entity recognition "MENE" [Borthwick, Grishman, ... '98]

• HMMs for similar language tasks:
  – Part-of-speech tagging [Kupiec '92]
  – Named entity recognition [Bikel et al '99]
  – Other information extraction [Leek '97], [Freitag & McCallum '99]

• Serial generative/discriminative approaches:
  – Speech recognition [Schwartz & Austin '93]
  – Reranking parses [Collins '00]

• Other conditional Markov models:
  – Non-probabilistic local decision models [Brill '95], [Roth '98]
  – Gradient-descent on state path [LeCun et al '98]
  – Markov Processes on Curves (MPCs) [Saul & Rahim '99]
  – Voted Perceptron-trained FSMs [Collins '02]
Conclusions

• Conditional Markov random fields combine the benefits of
  – Conditional probability models (arbitrary features)
  – Markov models (for sequences or other relations)

• Success in
  – Feature induction
  – Coreference analysis

• Future work:
  – Factorial CRFs, structure learning, semi-supervised learning
  – Tight integration of IE and data mining
End of talk
Voted Perceptron Sequence Models   [Collins 2002]

Given training data {(o^(i), s^(i))}:

  Initialize parameters to zero: λ_k = 0

  Iterate to convergence; for all training instances i:

    s_Viterbi = argmax_s exp( Σ_t Σ_k λ_k f_k(s_{t-1}, s_t, o^(i), t) )

    λ_k ← λ_k + C_k(s^(i), o^(i)) − C_k(s_Viterbi, o^(i)),
    where C_k(s, o) = Σ_t f_k(s_{t-1}, s_t, o, t), as before.

The update is analogous to the gradient for this one training instance.

Like CRFs with stochastic gradient ascent and a Viterbi approximation. Avoids calculating the partition function (normalizer) Z_o, but uses plain gradient ascent rather than a 2nd-order or conjugate-gradient method.
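A runnable sketch of the structured-perceptron update above (without the voting/averaging step), on a single toy instance; the feature map and data are invented, and argmax is done by brute-force enumeration in place of Viterbi to keep the sketch short:

```python
import itertools
from collections import defaultdict

# Collins-style structured perceptron: decode with current weights, then
# add the gold feature counts C_k(s, o) and subtract the predicted ones.
STATES = ["B", "P"]  # background / person

def feats(s, obs):
    c = defaultdict(float)
    prev = None
    for t, cur in enumerate(s):
        c[("trans", prev, cur)] += 1   # transition counts
        c[("emit", cur, obs[t])] += 1  # emission counts
        prev = cur
    return c

def decode(obs, w):
    # Brute-force argmax over all state sequences (Viterbi stand-in).
    return max(itertools.product(STATES, repeat=len(obs)),
               key=lambda s: sum(w.get(k, 0.0) * v
                                 for k, v in feats(s, obs).items()))

def perceptron_epoch(data, w):
    for obs, gold in data:
        pred = decode(obs, w)
        if pred != gold:
            for k, v in feats(gold, obs).items():
                w[k] = w.get(k, 0.0) + v
            for k, v in feats(pred, obs).items():
                w[k] = w.get(k, 0.0) - v
    return w

data = [(["Yesterday", "Arny", "Rosenberg", "spoke"], ("B", "P", "P", "B"))]
w = perceptron_epoch(data, {})
```

After this single epoch the updated weights already decode the toy instance correctly.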
Part-of-speech Tagging
The asbestos fiber , crocidolite, is unusually resilient once
it enters the lungs , with even brief exposures to it causing
symptoms that show up decades later , researchers said .
DT NN NN , NN , VBZ RB JJ IN
PRP VBZ DT NNS , IN RB JJ NNS TO PRP VBG
NNS WDT VBP RP NNS JJ , NNS VBD .
45 tags, 1M words training data, Penn Treebank
              error    oov error     using spelling features*:  error    Δerr    oov error   Δerr
HMM           5.69%    45.99%
CRF           5.55%    48.05%                                   4.27%    -24%    23.76%      -50%

* use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.

[Pereira 2001 personal comm.]
Big-Picture Future Directions

• Unification of IE, data mining, decision making
  – Build large, working systems to test these ideas
  – Bring CiteSeer to UMass; build super-replacement

• Configuring & training complex systems with less human effort
  – Resource-bounded human/machine interaction
  – Learning from labeled and unlabeled data; semi-supervised learning

• Combinations of text with other modalities
  – images, video, sound
  – robotics, state estimation, model-building
  – bioinformatics, security, ...

• Use of IE-inspired models for applications in
  – networking, compilers, security, bio-sequences, ...
IE from ...

• SEC Filings
  – Dale Kirkland (NASD) charged with detecting securities trading fraud.

• Open source software
  – Lee Giles (PennState) building a system that supports code sharing, interoperability and assessment of reliability.

• Biology and bioinformatics research pubs
  – Karen Duca (Virginia Bioinformatics Institute) part of a large team trying to assemble facts that will help guide future experiments.

• Web pages about researchers, groups, pubs, conferences and grants
  – Steve Mahaney (NSF) charged with a Congressionally-mandated effort by NSF to model the research impact of its grants.