
CRFs and Joint Inference in NLP

Andrew McCallum

Computer Science Department

University of Massachusetts Amherst

Joint work with Charles Sutton, Aron Culotta, Xuerui Wang, Ben Wellner, Fuchun Peng, Michael Hay.

From Text to Actionable Knowledge

[Figure: pipeline from Document collection → Spider → IE (Segment, Classify, Associate, Cluster, Filter) → Database → Data Mining (Discover patterns: entity types, links/relations, events) → Prediction, Outlier detection, Decision support → Actionable knowledge. IE passes Uncertainty Info forward to Data Mining; Data Mining feeds Emerging Patterns back to IE: Joint Inference.]

An HLT Pipeline

[Figure: pipeline ASR → MT → Parsing → NER → Relations → Coreference → SNA, KDD, Events, TDT, Summarization. Errors cascade & accumulate.]

An HLT Pipeline

[Figure: the same pipeline (ASR → MT → Parsing → NER → Relations → Coreference → SNA, KDD, TDT, Summarization), now with unified, joint inference across all stages.]


[Figure: the same text-to-knowledge pipeline, with the Database replaced by a unified Probabilistic Model shared between IE (Segment, Classify, Associate, Cluster, Filter) and Data Mining (Discover patterns: entity types, links/relations, events), leading to Prediction, Outlier detection, Decision support, and Actionable knowledge.]

Solution: a Unified Model

Conditional Random Fields [Lafferty, McCallum, Pereira]

Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]

Discriminatively-trained undirected graphical models.

Complex inference and learning: just what we researchers like to sink our teeth into!

(Linear Chain) Conditional Random Fields

[Figure: finite state model / graphical model. A chain of FSM states y_{t−1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, … (the output sequence), each connected to its observation x_{t−1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, … (the input sequence).]

Undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence.

input seq:  said  Jones  a     Microsoft  VP    …
output seq: OTHER PERSON OTHER ORG        TITLE …

Wide-spread interest, positive experimental results in many applications:

Noun phrase, Named entity [HLT'03], [CoNLL'03]
Protein structure prediction [ICML'04]
IE from Bioinformatics text [Bioinformatics '04], …
Asian word segmentation [COLING'04], [ACL'04]
IE from Research papers [HLT'04]
Object classification in images [CVPR '04]

[Lafferty, McCallum, Pereira 2001]

p(y | x) = (1 / Z_x) ∏_t Φ(y_t, y_{t−1}, x, t)

where

Φ(y_t, y_{t−1}, x, t) = exp( ∑_k λ_k f_k(y_t, y_{t−1}, x, t) )
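The formula above can be made concrete with a brute-force sketch: p(y|x) is a product of local potentials Φ(y_t, y_{t−1}, x, t) = exp(∑_k λ_k f_k), normalized by Z_x. The feature functions and weights below are toy assumptions for illustration; real implementations compute Z_x with forward-backward dynamic programming rather than enumeration.

```python
import itertools
import math

LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]

def log_phi(y_t, y_prev, x, t, weights):
    # log Phi = sum_k lambda_k f_k(y_t, y_{t-1}, x, t), with binary features
    # on (word, label) and (previous label, label) pairs.
    feats = [("word", x[t], y_t), ("trans", y_prev, y_t)]
    return sum(weights.get(f, 0.0) for f in feats)

def log_score(y, x, weights):
    # Sum of log-potentials over the chain; "START" stands in for y_{-1}.
    return sum(log_phi(y[t], y[t - 1] if t > 0 else "START", x, t, weights)
               for t in range(len(x)))

def prob(y, x, weights):
    # p(y|x) = exp(score(y)) / Z_x, with Z_x summed over all label sequences.
    z = sum(math.exp(log_score(s, x, weights))
            for s in itertools.product(LABELS, repeat=len(x)))
    return math.exp(log_score(y, x, weights)) / z

# Toy weights favoring "Jones" as PERSON, as in the example above.
weights = {("word", "Jones", "PERSON"): 2.0,
           ("word", "Microsoft", "ORG"): 2.0,
           ("trans", "START", "OTHER"): 0.5}
x = ["said", "Jones", "a"]
p = prob(("OTHER", "PERSON", "OTHER"), x, weights)
```

Because the potentials are log-linear, any feature of the whole input and a label pair can be added without changing inference.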

Outline

• Motivating Joint Inference for NLP.

• Brief introduction to Conditional Random Fields

• Joint inference: Motivation and examples

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Labeling of Distant Entities (BP by Tree Reparameterization)

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Segmentation and Co-ref (Sparse BP)

– Joint Extraction and Data Mining (Iterative)

• Topical N-gram models

1. Jointly labeling cascaded sequences: Factorial CRFs

[Figure: stacked chains over English words: part-of-speech, noun-phrase boundaries, and named-entity tag, linked within and across layers.]

[Sutton, Khashayar, McCallum, ICML 2004]


But errors cascade: each stage must be near-perfect for the pipeline to do well.


Joint prediction of part-of-speech and noun-phrase in newswire,matching accuracy with only 50% of the training data.

Inference: Loopy Belief Propagation
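A minimal sketch of the factorial structure, under an assumed two-layer layout (a POS chain and an NER chain over the same words): each time step gets observation factors for both layers, a cotemporal factor linking the two labels, and within-chain transition factors. This only enumerates the graph; inference over it is what loopy belief propagation handles.

```python
def factorial_crf_factors(n_tokens):
    # Enumerate the factors of a two-layer factorial CRF over n_tokens words.
    factors = []
    for t in range(n_tokens):
        factors.append(("obs", "pos", t))        # word_t -- pos_t
        factors.append(("obs", "ner", t))        # word_t -- ner_t
        factors.append(("cotemporal", t))        # pos_t -- ner_t (the joint link)
        if t > 0:
            factors.append(("trans", "pos", t - 1, t))  # pos_{t-1} -- pos_t
            factors.append(("trans", "ner", t - 1, t))  # ner_{t-1} -- ner_t
    return factors

factors = factorial_crf_factors(4)
```

The cotemporal factors are what let NER evidence flow back into POS decisions, rather than POS errors flowing only forward.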

2. Jointly labeling distant mentions: Skip-chain CRFs

Senator Joe Green said today … . Green ran for …

[Sutton, McCallum, SRL 2004]

Dependency among similar, distant mentions ignored.


14% reduction in error on most repeated field in email seminar announcements.

Inference: Tree reparameterization BP

[Wainwright et al, 2002]

See also [Finkel et al., 2005]
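One way to construct the skip edges can be sketched as follows: besides the usual linear chain, connect pairs of identical capitalized tokens, so the two "Green" mentions above share label information. The capitalization heuristic is an illustrative assumption, not necessarily the paper's exact rule.

```python
def skip_edges(tokens):
    # Return (i, j) pairs of positions of identical capitalized words,
    # to be added as long-range factors on top of the linear chain.
    seen = {}       # capitalized word -> earlier positions
    edges = []
    for i, w in enumerate(tokens):
        if w[:1].isupper():
            for j in seen.get(w, []):
                edges.append((j, i))    # skip edge between identical mentions
            seen.setdefault(w, []).append(i)
    return edges

tokens = "Senator Joe Green said today . Green ran for".split()
edges = skip_edges(tokens)
```

The resulting graph has loops, which is why inference falls back to approximate BP such as tree reparameterization.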

3. Joint co-reference among all pairs: Affinity Matrix CRF

[Figure: fully connected affinity graph over the mentions ". . . Mr Powell . . .", ". . . Powell . . .", and ". . . she . . .", with pairwise affinity scores (45, 99, 11 in the slide) and a Y/N coreference decision on each edge.]

[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

~25% reduction in error on co-reference of proper nouns in newswire.

Inference: Correlational clustering / graph partitioning

[Bansal, Blum, Chawla, 2002]
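The partitioning step can be sketched with a greedy approximation: merge clusters while the total affinity across the merged pair is positive. This is in the spirit of correlation clustering, not the exact algorithm of Bansal et al.; the affinity values below reuse the slide's numbers with assumed signs.

```python
def partition(mentions, affinity):
    # Greedy agglomerative correlation clustering (sketch): repeatedly merge
    # the pair of clusters with the largest positive cross-cluster affinity.
    clusters = [[m] for m in mentions]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gain = sum(affinity(x, y)
                           for x in clusters[a] for y in clusters[b])
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters[a].extend(clusters.pop(b))

# Toy affinities (signs assumed for illustration).
aff = {frozenset(["Mr Powell", "Powell"]): 99.0,
       frozenset(["Mr Powell", "she"]): -45.0,
       frozenset(["Powell", "she"]): -11.0}
clusters = partition(["Mr Powell", "Powell", "she"],
                     lambda x, y: aff[frozenset([x, y])])
```

Deciding all pairs jointly this way avoids the intransitivity of independent pairwise decisions ("A = B", "B = C", but "A ≠ C").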

Also known as "entity resolution" and "object correspondence".

4. Joint segmentation and co-reference

[Figure: graphical model with observed citations (o), segmentation variables (s), citation attributes (y), co-reference decisions (c), and database field values.]

Inference: Sparse Generalized Belief Propagation

[Wellner, McCallum, Peng, Hay, UAI 2004]

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

[Pal, Sutton, McCallum, 2005]

World Knowledge

35% reduction in co-reference error by using segmentation uncertainty.

6-14% reduction in segmentation error by using co-reference.

Extraction from and matching of research paper citations.

see also [Marthi, Milch, Russell, 2003]

Joint IE and Coreference from Research Paper Citations

Textual citation mentions (noisy, with duplicates)

Paper database, with fields, clean, duplicates collapsed


AUTHORS              TITLE      VENUE
Cowell, Dawid…       Probab…    Springer
Montemerlo, Thrun…   FastSLAM…  AAAI…
Kjaerulff            Approxi…   Technic…

4. Joint segmentation and co-reference

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

Citation Segmentation and Coreference

1) Segment citation fields
2) Resolve coreferent citations (Y?N)
3) Form canonical database record, resolving conflicts between the mentions:

AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

Perform all three steps jointly.

IE + Coreference Model

[Figure, built up in stages: observed citation x, CRF segmentation s, citation mention attributes c.]

Observed citation and CRF segmentation:

J Besag 1986 On the…
AUT AUT YR TITL TITL

Citation mention attributes:

AUTHOR = "J Besag"
YEAR = "1986"
TITLE = "On the…"

One such structure for each citation mention, plus binary coreference variables (y/n) for each pair of mentions:

J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…

Research paper entity attribute nodes:

AUTHOR = "P Smyth"
YEAR = "2001"
TITLE = "Data Mining…"
...

Inference by Sparse "Generalized BP" [Pal, Sutton, McCallum 2005]

Step 1: Exact inference on the linear-chain (IE) regions; from each chain, pass an N-best list into coreference.

J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…
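The N-best interface between the chains and coreference can be sketched as follows: enumerate label sequences for a short citation, rank them with a scoring function, and hand the top N to the coreference stage instead of a single best segmentation. The scorer below is a toy assumption; real systems use N-best Viterbi decoding rather than enumeration.

```python
import itertools

def n_best_segmentations(tokens, labels, score, n=3):
    # Brute-force N-best list of label sequences, highest score first.
    candidates = itertools.product(labels, repeat=len(tokens))
    return sorted(candidates, key=lambda seq: -score(tokens, seq))[:n]

def toy_score(tokens, seq):
    # Assumed scorer: reward labeling 4-digit tokens as YR, other tokens as AUT.
    s = 0.0
    for tok, lab in zip(tokens, seq):
        if tok.isdigit() and len(tok) == 4:
            s += 2.0 if lab == "YR" else -1.0
        else:
            s += 1.0 if lab == "AUT" else 0.0
    return s

best = n_best_segmentations(["J", "Besag", "1986"], ["AUT", "YR", "TITL"],
                            toy_score, n=3)
```

Passing several hypotheses, rather than one, is what lets segmentation uncertainty survive into the coreference stage.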

Step 2: Approximate inference by graph partitioning, integrating out uncertainty in samples of extraction. Made to scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000].

Step 3: Exact (exhaustive) inference over entity attributes.

Step 4: Revisit exact inference on the IE linear chain, now conditioned on entity attributes.

Parameter Estimation: Piecewise Training [Sutton & McCallum 2005]

Divide-and-conquer parameter estimation:

Coref graph edge weights: MAP on individual edges
IE linear chain: exact MAP
Entity attribute potentials: MAP, pseudo-likelihood

In all cases: climb the MAP gradient with a quasi-Newton method.
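A sketch of training one "piece" in isolation: the coreference edge weights are fit as a local MAP problem over individual edges, independently of the rest of the model. Plain logistic-regression gradient ascent stands in here for the quasi-Newton climb, and the features and data are toy assumptions; the IE chain and entity-attribute pieces would be trained the same way, separately, then combined at inference time.

```python
import math

def train_edge_piece(examples, lr=0.5, iters=300):
    # Fit weights for one edge-factor piece by maximizing its own local
    # conditional likelihood (logistic regression on edge labels).
    w = {}
    for _ in range(iters):
        for feats, label in examples:
            s = sum(w.get(f, 0.0) for f in feats)
            p = 1.0 / (1.0 + math.exp(-s))
            for f in feats:
                # Gradient of the local log-likelihood for this edge.
                w[f] = w.get(f, 0.0) + lr * (label - p)
    return w

# Toy edge examples: (features, coreferent?).
examples = [(["same-last-name"], 1), (["same-last-name"], 1),
            (["string-mismatch"], 0), (["string-mismatch"], 0)]
w = train_edge_piece(examples)
```

Training the pieces separately trades a little accuracy for the ability to avoid global inference inside the training loop.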

4. Joint segmentation and co-reference (recap)

[Figure: the same graphical model: observed citations (o), segmentation variables (s), citation attributes (y), co-reference decisions (c), and database field values.]

Inference: Variant of Iterated Conditional Modes

[Wellner, McCallum, Peng, Hay, UAI 2004]


[Besag, 1986]
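Iterated Conditional Modes [Besag, 1986] can be sketched in a few lines: cycle through the variables, setting each to the value that minimizes the energy given the current values of all the others, until a full sweep changes nothing. The variables, domains, and energy function below are toy assumptions for illustration.

```python
def icm(domains, energy, max_sweeps=20):
    # Coordinate-wise energy minimization: a greedy local search that
    # converges to a (possibly local) minimum.
    assign = {v: vals[0] for v, vals in domains.items()}
    for _ in range(max_sweeps):
        changed = False
        for v, vals in domains.items():
            best = min(vals, key=lambda val: energy({**assign, v: val}))
            if best != assign[v]:
                assign[v] = best
                changed = True
        if not changed:
            break
    return assign

# Toy energy pulling 'a' toward 1 and 'b' toward 'a'.
result = icm({"a": [0, 1], "b": [0, 1]},
             lambda d: 2 * (d["a"] - 1) ** 2 + (d["a"] - d["b"]) ** 2)
```

ICM is fast but can get stuck in local minima, which is one motivation for the sparse generalized BP alternative described earlier.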


Outline

• Motivating Joint Inference for NLP.

• Brief introduction to Conditional Random Fields

• Joint inference: Motivation and examples

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Labeling of Distant Entities (BP by Tree Reparameterization)

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Segmentation and Co-ref (Sparse BP)

– Joint Extraction and Data Mining (Iterative)

• Topical N-gram models

“George W. Bush’s father is George H. W. Bush (son of Prescott Bush).”


Relation Extraction as Sequence Labeling

Subject: George W. Bush

…George H. W. Bush (son of Prescott Bush) …
   Father                  Grandfather

Learning Relational Database Features


Name               | Son
Prescott Bush      | George H. W. Bush
George H. W. Bush  | George W. Bush

Search DB for “relational paths” between subject and token

Subject_Is_SonOf_SonOf_Token=1.0
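The relational-path search can be sketched as a breadth-first walk over a toy relation database from the subject to the candidate token, emitting a feature string for each path found. The triples below are built from the slide's example; the DB layout is an assumption for illustration.

```python
def relational_path_features(triples, subject, token, max_len=3):
    # BFS over (head, relation, tail) triples, collecting relation paths
    # from subject to token as features like Subject_Is_SonOf_SonOf_Token.
    feats = []
    frontier = [(subject, [])]
    for _ in range(max_len):
        nxt = []
        for node, path in frontier:
            for head, rel, tail in triples:
                if head == node:
                    new_path = path + [rel]
                    if tail == token:
                        feats.append("Subject_Is_" + "_".join(new_path) + "_Token")
                    nxt.append((tail, new_path))
        frontier = nxt
    return feats

triples = [("George W. Bush", "SonOf", "George H. W. Bush"),
           ("George H. W. Bush", "SonOf", "Prescott Bush")]
feats = relational_path_features(triples, "George W. Bush", "Prescott Bush")
```

Each emitted path becomes a binary feature for the sequence labeler, so the CRF can learn which relational paths predict which relation labels.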

Highly weighted relational paths

• Many family equivalences
  – Sibling = Parent_Offspring
  – Cousin = Parent_Sibling_Offspring
• College = Parent_College
• Religion = Parent_Religion
• Ally = Opponent_Opponent
• Friend = Person_Same_School

• Preliminary results: nice performance boost using relational features (~8% absolute F1)

Testing on Unknown Entities

John F. Kennedy

… son of Joseph P. Kennedy, Sr. and Rose Fitzgerald

Name                        | Son
Joseph P. Kennedy (Father)  | John F. Kennedy
Rose Fitzgerald (Mother)    | John F. Kennedy

Fill the DB with a "first-pass" CRF; use relational features with a "second-pass" CRF.

Next Steps

• Feature induction to discover complex rules

• Measure relational features’ sensitivity to noise in DB

• Collective inference among related relations

Outline

• Motivating Joint Inference for NLP.

• Brief introduction to Conditional Random Fields

• Joint inference: Motivation and examples

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Labeling of Distant Entities (BP by Tree Reparameterization)

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Segmentation and Co-ref (Sparse BP)

– Joint Extraction and Data Mining (Iterative)

• Topical N-gram models

Topical N-gram Model - Our first attempt

[Figure: plate diagram with topic variables z1…z4, words w1…w4, and bigram-status variables y1…y4; plates over D documents and parameters of size T, W, and T×W.]

Wang & McCallum

Beyond bag-of-words

[Figure: plate diagram with topic variables z1…z4 over words w1…w4; plates over D documents and T×W parameters.]

Wallach

LDA-COL (Collocation) Model

[Figure: plate diagram with topic variables z1…z4, words w1…w4, and bigram-status switches y1…y4; plates over D documents and parameters of size T and W.]

Griffiths & Steyvers

Topical N-gram Model

[Figure: plate diagram with topic variables z1…z4, words w1…w4, and bigram-status variables y1…y4; plates over D documents and parameters of size T, W, and T×W.]

Wang & McCallum
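The bookkeeping in the Topical N-gram model can be sketched as follows: each token i carries a topic z_i and a bigram-status bit y_i (y_i = 1 means w_i attaches to the previous word, continuing an n-gram). Given assigned values (assumed below, not sampled), tokens can be grouped back into topical n-grams; this shows only the output step, not the Gibbs sampler, and tagging each n-gram with its first token's topic is a simplifying assumption.

```python
def topical_ngrams(words, topics, bigram_status):
    # Group tokens into n-grams using the bigram-status bits, pairing each
    # n-gram with the topic of its first token.
    ngrams = []
    cur_words, cur_topic = [words[0]], topics[0]
    for i in range(1, len(words)):
        if bigram_status[i] == 1:
            cur_words.append(words[i])      # continue the current n-gram
        else:
            ngrams.append((" ".join(cur_words), cur_topic))
            cur_words, cur_topic = [words[i]], topics[i]
    ngrams.append((" ".join(cur_words), cur_topic))
    return ngrams

grams = topical_ngrams(["reinforcement", "learning", "with", "optimal", "policy"],
                       [7, 7, 7, 7, 7],
                       [0, 1, 0, 0, 1])
```

This is what produces the phrase-level topic lists ("reinforcement learning", "optimal policy") compared against LDA's unigram topics below.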


Topic Comparison

LDA:
learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning

Topical N-grams (2+):
reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning RL, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods

Topical N-grams (1):
policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

Topic Comparison

LDA:
motion, visual, field, position, figure, direction, fields, eye, location, retina, receptive, velocity, vision, moving, system, flow, edge, center, light, local

Topical N-grams (2+):
receptive field, spatial frequency, temporal frequency, visual motion, motion energy, tuning curves, horizontal cells, motion detection, preferred direction, visual processing, area mt, visual cortex, light intensity, directional selectivity, high contrast, motion detectors, spatial phase, moving stimuli, decision strategy, visual stimuli

Topical N-grams (1):
motion, response, direction, cells, stimulus, figure, contrast, velocity, model, responses, stimuli, moving, cell, intensity, population, image, center, tuning, complex, directions

Topic Comparison

LDA:
word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models

Topical N-grams (2+):
speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent

Topical N-grams (1):
speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid

Summary

• Joint inference can avoid accumulating errors in a pipeline from extraction to data mining.

• Examples:
  – Factorial finite state models
  – Jointly labeling distant entities
  – Coreference analysis
  – Segmentation uncertainty aiding coreference & vice versa
  – Joint extraction and data mining

• Many examples of sequential topic models.