A Conditional Random Field for Discriminatively-trained
Finite-state String Edit Distance
Andrew McCallum
Kedar Bellare
Fernando Pereira
Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.
String Edit Distance
• Distance between sequences x and y:
  – “cost” of lowest-cost sequence of edit operations that transform string x into y.
String Edit Distance
• Distance between sequences x and y:
  – “cost” of lowest-cost sequence of edit operations that transform string x into y.
• Applications
  – Database Record Deduplication
Apex International Hotel Grassmarket Street
Apex Internat’l Grasmarket Street
Records are duplicates of the same hotel?
String Edit Distance
• Distance between sequences x and y:
  – “cost” of lowest-cost sequence of edit operations that transform string x into y.
• Applications
  – Database Record Deduplication
– Biological Sequences
AGCTCTTACGATAGAGGACTCCAGA
AGGTCTTACCAAAGAGGACTTCAGA
String Edit Distance
• Distance between sequences x and y:
  – “cost” of lowest-cost sequence of edit operations that transform string x into y.
• Applications
  – Database Record Deduplication
– Biological Sequences
– Machine Translation
Il a acheté une pomme
He bought an apple
String Edit Distance
• Distance between sequences x and y:
  – “cost” of lowest-cost sequence of edit operations that transform string x into y.
• Applications
  – Database Record Deduplication
– Biological Sequences
– Machine Translation
– Textual Entailment
    He bought a new car last night
He purchased a brand new automobile yesterday evening
Levenshtein Distance
Edit operations:
  copy    Copy a character from x to y (cost 0)
  insert  Insert a character into y (cost 1)
  delete  Delete a character from y (cost 1)
  subst   Substitute one character for another (cost 1)
Lowest-cost alignment of two strings [Levenshtein 1966]:

  x1 = W i l l i a m _ W . _ C o h o n
  x2 = W i l l l e a m _ C o h e n

  edit operations:  copy copy copy copy insert subst copy copy delete delete delete copy copy copy copy subst copy
  operation cost:   0    0    0    0    1      1     0    0    1      1      1      0    0    0    0    1     0

Total cost = 6 = Levenshtein Distance
Levenshtein Distance
Edit operations:
  copy    Copy a character from x to y (cost 0)
  insert  Insert a character into y (cost 1)
  delete  Delete a character from y (cost 1)
  subst   Substitute one character for another (cost 1)
Dynamic program. D(i,j) = score of best alignment from x1…xi to y1…yj:

  D(i,j) = min { D(i−1,j−1) + 1[xi ≠ yj],   (copy / subst)
                 D(i−1,j) + 1,              (consume a character of x)
                 D(i,j−1) + 1 }             (consume a character of y)

        W  i  l  l  l  e  a  m
     0  1  2  3  4  5  6  7  8
  W  1  0  1  2  3  4  5  6  7
  i  2  1  0  1  2  3  4  5  6
  l  3  2  1  0  1  2  3  4  5
  l  4  3  2  1  0  1  2  3  4
  i  5  4  3  2  1  1  2  3  4
  a  6  5  4  3  2  2  2  2  4
  m  7  6  5  4  3  3  3  3  2

total cost in bottom-right cell = distance
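The recurrence above translates directly into code. A minimal sketch in Python (the function name and the table layout are ours):

```python
def levenshtein(x, y):
    """D(i,j) dynamic program: edit distance between strings x and y."""
    m, n = len(x), len(y)
    # D[i][j] = cost of the best alignment of x[:i] with y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                      # i characters of x consumed alone
    for j in range(1, n + 1):
        D[0][j] = j                      # j characters of y inserted
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy (0) / subst (1)
                D[i - 1][j] + 1,                            # advance in x only
                D[i][j - 1] + 1)                            # advance in y only
    return D[m][n]

print(levenshtein("William W. Cohon", "Willleam Cohen"))  # 6, as on the slide
```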
Levenshtein Distance with Markov Dependencies

Edit operations (cost depends on the previous operation; columns: after copy, after insert, after delete, after subst):

  copy    Copy a character from x to y          0    0    0    0
  insert  Insert a character into y             1    ½    1    1
  delete  Delete a character from y             1    1    ½    1
  subst   Substitute one character for another  1    1    1    1

(repeated insert and repeated delete are cheaper)

The DP table becomes a 3D DP table: one layer per previous edit operation.
Learn these costs from training data.
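One way to realize the Markov dependency is to index the DP table by the previous operation as well. A sketch, assuming the slide's cost table (repeated insert/delete discounted to ½); the convention of pricing the first operation as if it followed a copy is our assumption:

```python
OPS = ("copy", "insert", "delete", "subst")

# COST[op][prev]: cost of performing `op` right after `prev`.
# Repeated insert/delete is discounted to 0.5, as on the slide.
COST = {
    "copy":   {"copy": 0.0, "insert": 0.0, "delete": 0.0, "subst": 0.0},
    "insert": {"copy": 1.0, "insert": 0.5, "delete": 1.0, "subst": 1.0},
    "delete": {"copy": 1.0, "insert": 1.0, "delete": 0.5, "subst": 1.0},
    "subst":  {"copy": 1.0, "insert": 1.0, "delete": 1.0, "subst": 1.0},
}

def markov_edit_distance(x, y):
    INF = float("inf")
    m, n = len(x), len(y)
    # D[i][j][op]: best cost of aligning x[:i] with y[:j], last operation = op.
    D = [[dict.fromkeys(OPS, INF) for _ in range(n + 1)] for _ in range(m + 1)]
    D[0][0]["copy"] = 0.0   # start convention: first op priced as if after a copy
    for i in range(m + 1):
        for j in range(n + 1):
            for prev, c in D[i][j].items():
                if c == INF:
                    continue
                if i < m and j < n:   # copy or substitute
                    op = "copy" if x[i] == y[j] else "subst"
                    D[i + 1][j + 1][op] = min(D[i + 1][j + 1][op],
                                              c + COST[op][prev])
                if i < m:             # advance in x only
                    D[i + 1][j]["delete"] = min(D[i + 1][j]["delete"],
                                                c + COST["delete"][prev])
                if j < n:             # advance in y only
                    D[i][j + 1]["insert"] = min(D[i][j + 1]["insert"],
                                                c + COST["insert"][prev])
    return min(D[m][n].values())
```

With this table, deleting two characters in a row costs 1 + ½ rather than 2.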
Ristad & Yianilos (1997): essentially a pair-HMM, generating an edit/state/alignment sequence and two strings
p(a, x1, x2) = ∏_t p(a_t | a_{t−1}) p(x1[a_t.i1], x2[a_t.i2] | a_t)        complete data likelihood
Learn via EM:
  Expectation step: calculate the posterior distribution over alignment paths.
  Maximization step: make those paths more likely.
string 1:  x1 = W i l l i a m _ W . _ C o h o n
string 2:  x2 = W i l l l e a m _ C o h e n

alignment a (edit operations a.e, positions a.i1 in x1 and a.i2 in x2):

  a.e:   copy copy copy copy insert subst copy copy delete delete delete copy copy copy copy subst copy
  a.i1:  1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
  a.i2:  1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14
p(x1, x2) = Σ_{a : x1,x2} ∏_t p(a_t | a_{t−1}) p(x1[a_t.i1], x2[a_t.i2] | a_t)

incomplete data likelihood (sum over all alignments consistent with x1 and x2)

Match score = p(x1, x2)

Given a training set of matching string pairs, the objective fn is

O = ∏_j p(x1^(j), x2^(j))
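The sum over alignments is the same dynamic program as before, with min-of-costs replaced by sum-of-products of probabilities (a forward pass). A sketch for a memoryless edit model, ignoring the Markov dependence on the previous operation; the operation and emission probabilities below are made-up placeholders, not Ristad & Yianilos's learned parameterization:

```python
P_OP = {"copy": 0.8, "subst": 0.1, "insert": 0.05, "delete": 0.05}
V = 26   # alphabet size; emission distributions are uniform for simplicity

def p_pair(x, y):
    """Incomplete data likelihood p(x, y): sum over all alignments."""
    m, n = len(x), len(y)
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                if x[i - 1] == y[j - 1]:            # copy: one shared character
                    F[i][j] += F[i - 1][j - 1] * P_OP["copy"] / V
                else:                               # subst: two distinct characters
                    F[i][j] += F[i - 1][j - 1] * P_OP["subst"] / (V * (V - 1))
            if i > 0:   # delete: emit only the x character
                F[i][j] += F[i - 1][j] * P_OP["delete"] / V
            if j > 0:   # insert: emit only the y character
                F[i][j] += F[i][j - 1] * P_OP["insert"] / V
    return F[m][n]
```

EM's E-step combines this forward pass with a symmetric backward pass to obtain expected counts of each edit operation; the M-step renormalizes those counts.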
Ristad & Yianilos Regrets
• Limited features of input strings
  – Examine only a single character pair at a time
  – Difficult to use upcoming string context, lexicons, ...
  – Example: “Senator John Green” vs. “John Green”
• Limited edit operations
  – Difficult to generate arbitrary jumps in both strings
  – Example: “UMass” vs. “University of Massachusetts”
• Trained only on positive match data
  – Doesn’t include information-rich “near misses”
  – Example: “ACM SIGIR” ≠ “ACM SIGCHI”

So, consider a model trained by conditional probability.
Conditional Probability (Sequence) Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x)
  – Can examine features, but is not responsible for generating them.
  – Don’t have to explicitly model their dependencies.

(Figure: joint model, a chain of states y_{t−1}, y_t, y_{t+1}, each generating an observation x_{t−1}, x_t, x_{t+1})
[Lafferty, McCallum, Pereira 2001]
From HMMs to Conditional Random Fields
Joint (HMM):

  P(y, x) = ∏_{t=1}^{|x|} P(y_t | y_{t−1}) P(x_t | y_t)

  (state sequence s = s1, s2, … sn; observation sequence o = o1, o2, … on)

Wide-spread interest, positive experimental results in many applications:
  Noun phrase, Named entity [HLT’03], [CoNLL’03]
  Protein structure prediction [ICML’04]
  IE from Bioinformatics text [Bioinformatics ’04], …
  Asian word segmentation [COLING’04], [ACL’04]
  IE from Research papers [HLT’04]
  Object classification in images [CVPR ’04]

Conditional:

  P(y | x) = (1 / P(x)) ∏_{t=1}^{|x|} P(y_t | y_{t−1}) P(x_t | y_t)
           = (1 / Z(x)) ∏_{t=1}^{|x|} Φ_s(y_t, y_{t−1}) Φ_o(x_t, y_t)

where

  Φ_o(x_t, y_t) = exp( Σ_k λ_k f_k(y_t, x_t) )

Set parameters by maximum likelihood, using an optimization method on L.

(A super-special case of linear-chain Conditional Random Fields.)
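For concreteness, a log-linear potential of this form can be computed as below; the feature functions, label set, and weights are invented for illustration:

```python
import math

def phi_o(y_t, x_t, features, weights):
    """Phi_o(x_t, y_t) = exp( sum_k lambda_k * f_k(y_t, x_t) )."""
    return math.exp(sum(w * f(y_t, x_t) for f, w in zip(features, weights)))

# Hypothetical binary features for a toy label set {"NAME", "OTHER"}.
features = [
    lambda y, x: 1.0 if x[:1].isupper() else 0.0,                   # capitalized word
    lambda y, x: 1.0 if y == "NAME" else 0.0,                       # label identity
    lambda y, x: 1.0 if x[:1].isupper() and y == "NAME" else 0.0,   # conjunction
]
weights = [0.5, -0.2, 1.3]

print(phi_o("NAME", "Cohen", features, weights))   # exp(0.5 - 0.2 + 1.3)
```

Note the features may look at the whole observation x_t (and, in a CRF, at arbitrary input context) without the model having to generate it.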
CRF String Edit Distance
string 1:  x1 = W i l l i a m _ W . _ C o h o n
string 2:  x2 = W i l l l e a m _ C o h e n

alignment a:
  a.e:   copy copy copy copy insert subst copy copy delete delete delete copy copy copy copy subst copy
  a.i1:  1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
  a.i2:  1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14

conditional complete data likelihood:

  p(a | x1, x2) = (1 / Z_{x1,x2}) ∏_t Φ(a_t, a_{t−1}, x1, x2)

(compare the joint complete data likelihood of Ristad & Yianilos:)

  p(a, x1, x2) = ∏_t p(a_t | a_{t−1}) p(x1[a_t.i1], x2[a_t.i2] | a_t)
Want to train from a set of string pairs, each labeled one of {match, non-match}:

  match      “William W. Cohon”   “Willlleam Cohen”
  non-match  “Bruce D’Ambrosio”   “Bruce Croft”
  match      “Tommi Jaakkola”     “Tommi Jakola”
  match      “Stuart Russell”     “Stuart Russel”
  non-match  “Tom Deitterich”     “Tom Dean”
CRF String Edit Distance FSM
(FSM with one state per edit operation: copy, insert, delete, subst)
CRF String Edit Distance FSM
(Two copies of the edit-operation FSM, each with states copy, insert, delete, subst, reached from a Start state: a match branch (m = 1) and a non-match branch (m = 0).)

conditional incomplete data likelihood:

  p(m | x1, x2) = (1 / Z_{x1,x2}) Σ_{a ∈ S_m} ∏_t Φ(a_t, a_{t−1}, x1, x2)

where S_m is the set of alignments passing through branch m.
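Classification then amounts to running the forward sum once per branch and normalizing. A sketch with hand-set potentials Φ, one weight per operation per branch (real training learns these from rich features of x1 and x2):

```python
def branch_score(x, y, phi):
    """Sum over all alignments of the product of per-operation potentials."""
    m, n = len(x), len(y)
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                F[i][j] += F[i - 1][j - 1] * (
                    phi["copy"] if x[i - 1] == y[j - 1] else phi["subst"])
            if i > 0:
                F[i][j] += F[i - 1][j] * phi["delete"]
            if j > 0:
                F[i][j] += F[i][j - 1] * phi["insert"]
    return F[m][n]

# Hand-set potentials: the match branch rewards copies;
# the non-match branch is indifferent to which operation is used.
PHI_MATCH = {"copy": 2.0, "subst": 0.1, "insert": 0.1, "delete": 0.1}
PHI_NONMATCH = {"copy": 0.5, "subst": 0.5, "insert": 0.5, "delete": 0.5}

def p_match(x1, x2):
    """p(m = 1 | x1, x2): normalize the two branch sums (Z is their total)."""
    s1 = branch_score(x1, x2, PHI_MATCH)
    s0 = branch_score(x1, x2, PHI_NONMATCH)
    return s1 / (s1 + s0)
```

Similar strings accumulate many high-weight copy paths in the match branch, pushing the normalized score toward 1; dissimilar strings leave most mass in the non-match branch.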
CRF String Edit Distance FSM
(match branch m = 1 and non-match branch m = 0, as above)

x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”

Probability summed over all alignments in match states: 0.8
Probability summed over all alignments in non-match states: 0.2
CRF String Edit Distance FSM
(match branch m = 1 and non-match branch m = 0, as above)

x1 = “Tom Dietterich”
x2 = “Tom Dean”

Probability summed over all alignments in match states: 0.1
Probability summed over all alignments in non-match states: 0.9
Parameter Estimation
Given a training set of string pairs and match/non-match labels, the objective fn is the incomplete log likelihood:

  O = Σ_j log p(m^(j) | x1^(j), x2^(j))

Expectation Maximization
• E-step: Estimate the distribution over alignments, p(a | x1^(j), x2^(j)), using current parameters
• M-step: Change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS)

The complete log likelihood (in expectation under the current alignment distribution):

  Σ_j Σ_a p(a | x1^(j), x2^(j)) log p(m^(j) | a, x1^(j), x2^(j))
This is “conditional EM”, but it avoids the complexities of [Jebara 1998] because there is no need to solve the M-step in closed form.
Efficient Training
• Dynamic programming table is 3D; with |x1| = |x2| = 100 and |S| = 12, that is 120,000 entries
• Use beam search during the E-step [Pal, Sutton, McCallum 2005]
• Unlike completely observed CRFs, objective function is not convex.
• Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.
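The beam idea can be sketched as a forward pass that keeps only the best few DP cells on each anti-diagonal; this is our simplification of the pruning in [Pal, Sutton, McCallum 2005], dropping the per-state dimension:

```python
import heapq

def beam_forward(x, y, phi, beam=8):
    """Forward sum over alignments, pruned: keep only the `beam`
    highest-scoring DP cells on each anti-diagonal (i + j = d).
    `phi` maps each edit operation to a potential."""
    m, n = len(x), len(y)
    strata = {0: {(0, 0): 1.0}}          # i+j -> {cell: summed score}
    for d in range(1, m + n + 1):
        cur = {}
        # insert/delete steps arrive from stratum d-1
        for (i, j), f in strata.get(d - 1, {}).items():
            if i < m:
                cur[(i + 1, j)] = cur.get((i + 1, j), 0.0) + f * phi["delete"]
            if j < n:
                cur[(i, j + 1)] = cur.get((i, j + 1), 0.0) + f * phi["insert"]
        # diagonal copy/subst steps arrive from stratum d-2
        for (i, j), f in strata.get(d - 2, {}).items():
            if i < m and j < n:
                w = phi["copy"] if x[i] == y[j] else phi["subst"]
                cur[(i + 1, j + 1)] = cur.get((i + 1, j + 1), 0.0) + f * w
        # prune: keep only the best `beam` cells of this stratum
        strata[d] = dict(heapq.nlargest(beam, cur.items(),
                                        key=lambda kv: kv[1]))
    return strata.get(m + n, {}).get((m, n), 0.0)
```

With a wide beam this equals the exact forward sum; narrowing the beam can only drop (non-negative) alignment mass, trading a lower-bound approximation for speed.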
What Alignments are Learned?
(match/non-match FSM as above)

x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”

(Figure: learned alignment lattice of “Tommi Jakola” against “T o m m i _ J a a k k o l a”)
What Alignments are Learned?
(match/non-match FSM as above)

x1 = “Bruce Croft”
x2 = “Tom Dean”

(Figure: learned alignment lattice of “Tom Dean” against “B r u c e _ C r o f t”)
What Alignments are Learned?
(match/non-match FSM as above)

x1 = “Jaime Carbonell”
x2 = “Jamie Callan”

(Figure: learned alignment lattice of “Jamie Callan” against “J a i m e _ C a r b o n e l l”)
Summary of Advantages
• Arbitrary features of the input strings
  – Examine past, future context
  – Use lexicons, WordNet
• Extremely flexible edit operations
  – A single operation may make arbitrary jumps in both strings, of size determined by input features
• Discriminative Training
  – Maximize ability to predict match vs. non-match
Experimental Results: Data Sets
• Restaurant name, Restaurant address
  – 864 records, 112 matches
  – E.g. “Abe’s Bar & Grill, E. Main St” vs. “Abe’s Grill, East Main Street”
• People names, UIS DB generator
  – synthetic noise
  – E.g. “John Smith” vs. “Snith, John”
• CiteSeer Citations
  – In four sections: Reason, Face, Reinforce, Constraint
  – E.g. “Rusell & Norvig, “Artificial Intelligence: A Modern...” vs. “Russell & Norvig, “Artificial Intelligence: An Intro...”
Experimental Results: Features

• same, different
• same-alphabetic, different-alphabetic
• same-numeric, different-numeric
• punctuation1, punctuation2
• alphabet-mismatch, numeric-mismatch
• end-of-1, end-of-2
• same-next-character, different-next-character
Experimental Results: Edit Operations

• insert, delete, substitute/copy
• swap-two-characters
• skip-word-if-in-lexicon
• skip-parenthesized-words
• skip-any-word
• substitute-word-pairs-in-translation-lexicon
• skip-word-if-present-in-other-string
Experimental Results

  Distance metric   CiteSeer                               Restaurant   Restaurant
                    Reason   Face    Reinf   Constraint    name         address
  Levenshtein       0.927    0.952   0.893   0.924         0.290        0.686
  Learned Leven.    0.938    0.966   0.907   0.941         0.354        0.712
  Vector            0.897    0.922   0.903   0.923         0.365        0.380
  Learned Vector    0.924    0.875   0.808   0.913         0.433        0.532

[Bilenko & Mooney 2003]
F1 (harmonic mean of precision and recall)
Experimental Results

  Distance metric     CiteSeer                               Restaurant   Restaurant
                      Reason   Face    Reinf   Constraint    name         address
  Levenshtein         0.927    0.952   0.893   0.924         0.290        0.686
  Learned Leven.      0.938    0.966   0.907   0.941         0.354        0.712
  Vector              0.897    0.922   0.903   0.923         0.365        0.380
  Learned Vector      0.924    0.875   0.808   0.913         0.433        0.532
  CRF Edit Distance   0.964    0.918   0.917   0.976         0.448        0.783

[Bilenko & Mooney 2003]
F1 (harmonic mean of precision and recall)
Experimental Results

Data set: person names, with word-order noise added

                                             F1
  Without skip-if-present-in-other-string    0.856
  With skip-if-present-in-other-string       0.981
Related Work

• Learned Edit Distance
  – [Bilenko & Mooney 2003], [Cohen et al 2003], ...
  – [Joachims 2003]: Max-margin, trained on alignments
• Conditionally-trained models with latent variables
  – [Jebara 1999]: “Conditional Expectation Maximization”
  – [Quattoni, Collins, Darrell 2005]: CRF for visual object recognition, with latent classes for object sub-patches
  – [Zettlemoyer & Collins 2005]: CRF for mapping sentences to logical form, with latent parses
“Predictive Random Fields”: Latent Variable Models fit by Multi-way Conditional Probability

• For clustering structured data, à la Latent Dirichlet Allocation and its successors
• But an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]
• But trained by a “multi-conditional” objective:
    O = P(A|B,C) P(B|A,C) P(C|A,B)
  e.g. A, B, C are different modalities
  (c.f. “Predictive Likelihood”)
[McCallum, Wang, Pal, 2005]
Predictive Random Fields: mixture of Gaussians on synthetic data
(Figure panels: data, classified by color; generatively trained; conditionally-trained [Jebara 1998]; Predictive Random Field)
[McCallum, Wang, Pal, 2005]
Predictive Random Fields vs. Harmonium on a document retrieval task
(Figure legend:)
  Harmonium, joint with words
  Harmonium, joint, with class labels and words
  Conditionally-trained, to predict class labels
  Predictive Random Field, multi-way conditionally trained
[McCallum, Wang, Pal, 2005]
Summary

• String edit distance
  – Widely used in many fields
• As in CRF sequence labeling, benefit by
  – conditional-probability training, and
  – ability to use arbitrary, non-independent input features
• Example of a conditionally-trained model with latent variables
  – “Find the alignments that most help distinguish match from non-match.”
  – May ultimately want the alignments, but only have relatively-easier-to-label +/− labels at training time: “distantly-labeled data”, “semi-supervised learning”
• Future work: edit distance on trees.
• See also “Predictive Random Fields”http://www.cs.umass.edu/~pal/PRFTR.pdf
End of talk