A Conditional Random Field for Discriminatively-trained
Finite-state String Edit Distance
Andrew McCallum
Kedar Bellare
Fernando Pereira
Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.
String Edit Distance
• Distance between sequences x and y:
  – “cost” of lowest-cost sequence of edit operations that transform string x into y.
String Edit Distance
• Distance between sequences x and y:
  – “cost” of lowest-cost sequence of edit operations that transform string x into y.
• Applications
  – Database Record Deduplication
Apex International Hotel Grassmarket Street
Apex Internat’l Grasmarket Street
Records are duplicates of the same hotel?
String Edit Distance
• Distance between sequences x and y:
  – “cost” of lowest-cost sequence of edit operations that transform string x into y.
• Applications
  – Database Record Deduplication
– Biological Sequences
AGCTCTTACGATAGAGGACTCCAGA
AGGTCTTACCAAAGAGGACTTCAGA
String Edit Distance
• Distance between sequences x and y:
  – “cost” of lowest-cost sequence of edit operations that transform string x into y.
• Applications
  – Database Record Deduplication
– Biological Sequences
– Machine Translation
Il a acheté une pomme
He bought an apple
String Edit Distance
• Distance between sequences x and y:
  – “cost” of lowest-cost sequence of edit operations that transform string x into y.
• Applications
  – Database Record Deduplication
– Biological Sequences
– Machine Translation
– Textual Entailment
    He bought a new car last night
He purchased a brand new automobile yesterday evening
Levenshtein Distance
Edit operations:
  copy    Copy a character from x to y (cost 0)
  insert  Insert a character into y (cost 1)
  delete  Delete a character from y (cost 1)
  subst   Substitute one character for another (cost 1)
Lowest-cost alignment of two strings [Levenshtein 1966]:

  x1 = W i l l i a m _ W . _ C o h o n
  x2 = W i l l l e a m _ C o h e n

  edit operations:  copy copy copy copy insert subst copy copy delete delete delete copy copy copy copy subst copy
  operation cost:   0    0    0    0    1      1     0    0    1      1      1      0    0    0    0    1     0

Total cost = 6 = Levenshtein Distance
Levenshtein Distance
Edit operations:
  copy    Copy a character from x to y (cost 0)
  insert  Insert a character into y (cost 1)
  delete  Delete a character from y (cost 1)
  subst   Substitute one character for another (cost 1)
Dynamic program. D(i,j) = score of best alignment from x1…xi to y1…yj:

  D(i,j) = min { D(i−1,j−1) + 1[xi ≠ yj],   (copy / subst)
                 D(i−1,j) + 1,              (consume a character of x)
                 D(i,j−1) + 1 }             (consume a character of y)

        W  i  l  l  l  e  a  m
     0  1  2  3  4  5  6  7  8
  W  1  0  1  2  3  4  5  6  7
  i  2  1  0  1  2  3  4  5  6
  l  3  2  1  0  1  2  3  4  5
  l  4  3  2  1  0  1  2  3  4
  i  5  4  3  2  1  1  2  3  4
  a  6  5  4  3  2  2  2  2  4
  m  7  6  5  4  3  3  3  3  2

total cost in bottom-right cell = distance
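The recurrence above translates directly into code. A minimal sketch in Python (the function name and the table layout are ours):

```python
def levenshtein(x, y):
    """D(i,j) dynamic program: edit distance between strings x and y."""
    m, n = len(x), len(y)
    # D[i][j] = cost of the best alignment of x[:i] with y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                      # i characters of x consumed alone
    for j in range(1, n + 1):
        D[0][j] = j                      # j characters of y inserted
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy (0) / subst (1)
                D[i - 1][j] + 1,                            # advance in x only
                D[i][j - 1] + 1)                            # advance in y only
    return D[m][n]

print(levenshtein("William W. Cohon", "Willleam Cohen"))  # 6, as on the slide
```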
Levenshtein Distance with Markov Dependencies

Edit operations (cost depends on the previous operation; columns: after copy, after insert, after delete, after subst):

  copy    Copy a character from x to y          0    0    0    0
  insert  Insert a character into y             1    ½    1    1
  delete  Delete a character from y             1    1    ½    1
  subst   Substitute one character for another  1    1    1    1

(repeated insert and repeated delete are cheaper)

The DP table becomes a 3D DP table: one layer per previous edit operation.
Learn these costs from training data.
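One way to realize the Markov dependency is to index the DP table by the previous operation as well. A sketch, assuming the slide's cost table (repeated insert/delete discounted to ½); the convention of pricing the first operation as if it followed a copy is our assumption:

```python
OPS = ("copy", "insert", "delete", "subst")

# COST[op][prev]: cost of performing `op` right after `prev`.
# Repeated insert/delete is discounted to 0.5, as on the slide.
COST = {
    "copy":   {"copy": 0.0, "insert": 0.0, "delete": 0.0, "subst": 0.0},
    "insert": {"copy": 1.0, "insert": 0.5, "delete": 1.0, "subst": 1.0},
    "delete": {"copy": 1.0, "insert": 1.0, "delete": 0.5, "subst": 1.0},
    "subst":  {"copy": 1.0, "insert": 1.0, "delete": 1.0, "subst": 1.0},
}

def markov_edit_distance(x, y):
    INF = float("inf")
    m, n = len(x), len(y)
    # D[i][j][op]: best cost of aligning x[:i] with y[:j], last operation = op.
    D = [[dict.fromkeys(OPS, INF) for _ in range(n + 1)] for _ in range(m + 1)]
    D[0][0]["copy"] = 0.0   # start convention: first op priced as if after a copy
    for i in range(m + 1):
        for j in range(n + 1):
            for prev, c in D[i][j].items():
                if c == INF:
                    continue
                if i < m and j < n:   # copy or substitute
                    op = "copy" if x[i] == y[j] else "subst"
                    D[i + 1][j + 1][op] = min(D[i + 1][j + 1][op],
                                              c + COST[op][prev])
                if i < m:             # advance in x only
                    D[i + 1][j]["delete"] = min(D[i + 1][j]["delete"],
                                                c + COST["delete"][prev])
                if j < n:             # advance in y only
                    D[i][j + 1]["insert"] = min(D[i][j + 1]["insert"],
                                                c + COST["insert"][prev])
    return min(D[m][n].values())
```

With this table, deleting two characters in a row costs 1 + ½ rather than 2.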
Ristad & Yianilos (1997): essentially a pair-HMM, generating an edit/state/alignment sequence and two strings
p(a, x1, x2) = ∏_t p(a_t | a_{t−1}) p(x1[a_t.i1], x2[a_t.i2] | a_t)        complete data likelihood
Learn via EM:
  Expectation step: calculate the posterior distribution over alignment paths.
  Maximization step: make those paths more likely.
string 1:  x1 = W i l l i a m _ W . _ C o h o n
string 2:  x2 = W i l l l e a m _ C o h e n

alignment a (edit operations a.e, positions a.i1 in x1 and a.i2 in x2):

  a.e:   copy copy copy copy insert subst copy copy delete delete delete copy copy copy copy subst copy
  a.i1:  1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
  a.i2:  1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14
p(x1, x2) = Σ_{a : x1,x2} ∏_t p(a_t | a_{t−1}) p(x1[a_t.i1], x2[a_t.i2] | a_t)

incomplete data likelihood (sum over all alignments consistent with x1 and x2)

Match score = p(x1, x2)

Given a training set of matching string pairs, the objective fn is

O = ∏_j p(x1^(j), x2^(j))
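The sum over alignments is the same dynamic program as before, with min-of-costs replaced by sum-of-products of probabilities (a forward pass). A sketch for a memoryless edit model, ignoring the Markov dependence on the previous operation; the operation and emission probabilities below are made-up placeholders, not Ristad & Yianilos's learned parameterization:

```python
P_OP = {"copy": 0.8, "subst": 0.1, "insert": 0.05, "delete": 0.05}
V = 26   # alphabet size; emission distributions are uniform for simplicity

def p_pair(x, y):
    """Incomplete data likelihood p(x, y): sum over all alignments."""
    m, n = len(x), len(y)
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                if x[i - 1] == y[j - 1]:            # copy: one shared character
                    F[i][j] += F[i - 1][j - 1] * P_OP["copy"] / V
                else:                               # subst: two distinct characters
                    F[i][j] += F[i - 1][j - 1] * P_OP["subst"] / (V * (V - 1))
            if i > 0:   # delete: emit only the x character
                F[i][j] += F[i - 1][j] * P_OP["delete"] / V
            if j > 0:   # insert: emit only the y character
                F[i][j] += F[i][j - 1] * P_OP["insert"] / V
    return F[m][n]
```

EM's E-step combines this forward pass with a symmetric backward pass to obtain expected counts of each edit operation; the M-step renormalizes those counts.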
Ristad & Yianilos Regrets
• Limited features of input strings
  – Examine only a single character pair at a time
  – Difficult to use upcoming string context, lexicons, ...
  – Example: “Senator John Green” vs. “John Green”
• Limited edit operations
  – Difficult to generate arbitrary jumps in both strings
  – Example: “UMass” vs. “University of Massachusetts”
• Trained only on positive match data
  – Doesn’t include information-rich “near misses”
  – Example: “ACM SIGIR” ≠ “ACM SIGCHI”

So, consider a model trained by conditional probability.
Conditional Probability (Sequence) Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x)
  – Can examine features, but is not responsible for generating them.
  – Don’t have to explicitly model their dependencies.

(Figure: joint model, a chain of states y_{t−1}, y_t, y_{t+1}, each generating an observation x_{t−1}, x_t, x_{t+1})
[Lafferty, McCallum, Pereira 2001]
From HMMs to Conditional Random Fields
Joint (HMM):

  P(y, x) = ∏_{t=1}^{|x|} P(y_t | y_{t−1}) P(x_t | y_t)

  (state sequence s = s1, s2, … sn; observation sequence o = o1, o2, … on)

Wide-spread interest, positive experimental results in many applications:
  Noun phrase, Named entity [HLT’03], [CoNLL’03]
  Protein structure prediction [ICML’04]
  IE from Bioinformatics text [Bioinformatics ’04], …
  Asian word segmentation [COLING’04], [ACL’04]
  IE from Research papers [HLT’04]
  Object classification in images [CVPR ’04]

Conditional:

  P(y | x) = (1 / P(x)) ∏_{t=1}^{|x|} P(y_t | y_{t−1}) P(x_t | y_t)
           = (1 / Z(x)) ∏_{t=1}^{|x|} Φ_s(y_t, y_{t−1}) Φ_o(x_t, y_t)

where

  Φ_o(x_t, y_t) = exp( Σ_k λ_k f_k(y_t, x_t) )

Set parameters by maximum likelihood, using an optimization method on L.

(A super-special case of linear-chain Conditional Random Fields.)
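For concreteness, a log-linear potential of this form can be computed as below; the feature functions, label set, and weights are invented for illustration:

```python
import math

def phi_o(y_t, x_t, features, weights):
    """Phi_o(x_t, y_t) = exp( sum_k lambda_k * f_k(y_t, x_t) )."""
    return math.exp(sum(w * f(y_t, x_t) for f, w in zip(features, weights)))

# Hypothetical binary features for a toy label set {"NAME", "OTHER"}.
features = [
    lambda y, x: 1.0 if x[:1].isupper() else 0.0,                   # capitalized word
    lambda y, x: 1.0 if y == "NAME" else 0.0,                       # label identity
    lambda y, x: 1.0 if x[:1].isupper() and y == "NAME" else 0.0,   # conjunction
]
weights = [0.5, -0.2, 1.3]

print(phi_o("NAME", "Cohen", features, weights))   # exp(0.5 - 0.2 + 1.3)
```

Note the features may look at the whole observation x_t (and, in a CRF, at arbitrary input context) without the model having to generate it.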
CRF String Edit Distance
string 1:  x1 = W i l l i a m _ W . _ C o h o n
string 2:  x2 = W i l l l e a m _ C o h e n

alignment a:
  a.e:   copy copy copy copy insert subst copy copy delete delete delete copy copy copy copy subst copy
  a.i1:  1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
  a.i2:  1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14

conditional complete data likelihood:

  p(a | x1, x2) = (1 / Z_{x1,x2}) ∏_t Φ(a_t, a_{t−1}, x1, x2)

(compare the joint complete data likelihood of Ristad & Yianilos:)

  p(a, x1, x2) = ∏_t p(a_t | a_{t−1}) p(x1[a_t.i1], x2[a_t.i2] | a_t)
Want to train from a set of string pairs, each labeled one of {match, non-match}:

  match      “William W. Cohon”   “Willlleam Cohen”
  non-match  “Bruce D’Ambrosio”   “Bruce Croft”
  match      “Tommi Jaakkola”     “Tommi Jakola”
  match      “Stuart Russell”     “Stuart Russel”
  non-match  “Tom Deitterich”     “Tom Dean”
CRF String Edit Distance FSM
(FSM with one state per edit operation: copy, insert, delete, subst)
CRF String Edit Distance FSM
(Two copies of the edit-operation FSM, each with states copy, insert, delete, subst, reached from a Start state: a match branch (m = 1) and a non-match branch (m = 0).)

conditional incomplete data likelihood:

  p(m | x1, x2) = (1 / Z_{x1,x2}) Σ_{a ∈ S_m} ∏_t Φ(a_t, a_{t−1}, x1, x2)

where S_m is the set of alignments passing through branch m.
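Classification then amounts to running the forward sum once per branch and normalizing. A sketch with hand-set potentials Φ, one weight per operation per branch (real training learns these from rich features of x1 and x2):

```python
def branch_score(x, y, phi):
    """Sum over all alignments of the product of per-operation potentials."""
    m, n = len(x), len(y)
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                F[i][j] += F[i - 1][j - 1] * (
                    phi["copy"] if x[i - 1] == y[j - 1] else phi["subst"])
            if i > 0:
                F[i][j] += F[i - 1][j] * phi["delete"]
            if j > 0:
                F[i][j] += F[i][j - 1] * phi["insert"]
    return F[m][n]

# Hand-set potentials: the match branch rewards copies;
# the non-match branch is indifferent to which operation is used.
PHI_MATCH = {"copy": 2.0, "subst": 0.1, "insert": 0.1, "delete": 0.1}
PHI_NONMATCH = {"copy": 0.5, "subst": 0.5, "insert": 0.5, "delete": 0.5}

def p_match(x1, x2):
    """p(m = 1 | x1, x2): normalize the two branch sums (Z is their total)."""
    s1 = branch_score(x1, x2, PHI_MATCH)
    s0 = branch_score(x1, x2, PHI_NONMATCH)
    return s1 / (s1 + s0)
```

Similar strings accumulate many high-weight copy paths in the match branch, pushing the normalized score toward 1; dissimilar strings leave most mass in the non-match branch.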
CRF String Edit Distance FSM
(match branch m = 1 and non-match branch m = 0, as above)

x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”

Probability summed over all alignments in match states: 0.8
Probability summed over all alignments in non-match states: 0.2
CRF String Edit Distance FSM
(match branch m = 1 and non-match branch m = 0, as above)

x1 = “Tom Dietterich”
x2 = “Tom Dean”

Probability summed over all alignments in match states: 0.1
Probability summed over all alignments in non-match states: 0.9
Parameter Estimation
Given a training set of string pairs and match/non-match labels, the objective fn is the incomplete log likelihood:

  O = Σ_j log p(m^(j) | x1^(j), x2^(j))

Expectation Maximization
• E-step: Estimate the distribution over alignments, p(a | x1^(j), x2^(j)), using current parameters
• M-step: Change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS)

The complete log likelihood (in expectation under the current alignment distribution):

  Σ_j Σ_a p(a | x1^(j), x2^(j)) log p(m^(j) | a, x1^(j), x2^(j))
This is “conditional EM”, but it avoids the complexities of [Jebara 1998] because there is no need to solve the M-step in closed form.
Efficient Training
• Dynamic programming table is 3D; with |x1| = |x2| = 100 and |S| = 12, that is 120,000 entries
• Use beam search during the E-step [Pal, Sutton, McCallum 2005]
• Unlike completely observed CRFs, objective function is not convex.
• Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.
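The beam idea can be sketched as a forward pass that keeps only the best few DP cells on each anti-diagonal; this is our simplification of the pruning in [Pal, Sutton, McCallum 2005], dropping the per-state dimension:

```python
import heapq

def beam_forward(x, y, phi, beam=8):
    """Forward sum over alignments, pruned: keep only the `beam`
    highest-scoring DP cells on each anti-diagonal (i + j = d).
    `phi` maps each edit operation to a potential."""
    m, n = len(x), len(y)
    strata = {0: {(0, 0): 1.0}}          # i+j -> {cell: summed score}
    for d in range(1, m + n + 1):
        cur = {}
        # insert/delete steps arrive from stratum d-1
        for (i, j), f in strata.get(d - 1, {}).items():
            if i < m:
                cur[(i + 1, j)] = cur.get((i + 1, j), 0.0) + f * phi["delete"]
            if j < n:
                cur[(i, j + 1)] = cur.get((i, j + 1), 0.0) + f * phi["insert"]
        # diagonal copy/subst steps arrive from stratum d-2
        for (i, j), f in strata.get(d - 2, {}).items():
            if i < m and j < n:
                w = phi["copy"] if x[i] == y[j] else phi["subst"]
                cur[(i + 1, j + 1)] = cur.get((i + 1, j + 1), 0.0) + f * w
        # prune: keep only the best `beam` cells of this stratum
        strata[d] = dict(heapq.nlargest(beam, cur.items(),
                                        key=lambda kv: kv[1]))
    return strata.get(m + n, {}).get((m, n), 0.0)
```

With a wide beam this equals the exact forward sum; narrowing the beam can only drop (non-negative) alignment mass, trading a lower-bound approximation for speed.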
What Alignments are Learned?
(match/non-match FSM as above)

x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”

(Figure: learned alignment lattice of “Tommi Jakola” against “T o m m i _ J a a k k o l a”)
What Alignments are Learned?
(match/non-match FSM as above)

x1 = “Bruce Croft”
x2 = “Tom Dean”

(Figure: learned alignment lattice of “Tom Dean” against “B r u c e _ C r o f t”)
What Alignments are Learned?
(match/non-match FSM as above)

x1 = “Jaime Carbonell”
x2 = “Jamie Callan”

(Figure: learned alignment lattice of “Jamie Callan” against “J a i m e _ C a r b o n e l l”)
Summary of Advantages
• Arbitrary features of the input strings
  – Examine past, future context
  – Use lexicons, WordNet
• Extremely flexible edit operations
  – A single operation may make arbitrary jumps in both strings, of size determined by input features
• Discriminative Training
  – Maximize ability to predict match vs. non-match
Experimental Results: Data Sets
• Restaurant name, Restaurant address
  – 864 records, 112 matches
  – E.g. “Abe’s Bar & Grill, E. Main St” vs. “Abe’s Grill, East Main Street”
• People names, UIS DB generator
  – synthetic noise
  – E.g. “John Smith” vs. “Snith, John”
• CiteSeer Citations
  – In four sections: Reason, Face, Reinforce, Constraint
  – E.g. “Rusell & Norvig, “Artificial Intelligence: A Modern...” vs. “Russell & Norvig, “Artificial Intelligence: An Intro...”
Experimental Results: Features

• same, different
• same-alphabetic, different-alphabetic
• same-numeric, different-numeric
• punctuation1, punctuation2
• alphabet-mismatch, numeric-mismatch
• end-of-1, end-of-2
• same-next-character, different-next-character
Experimental Results: Edit Operations

• insert, delete, substitute/copy
• swap-two-characters
• skip-word-if-in-lexicon
• skip-parenthesized-words
• skip-any-word
• substitute-word-pairs-in-translation-lexicon
• skip-word-if-present-in-other-string
Experimental Results

  Distance metric   CiteSeer                               Restaurant   Restaurant
                    Reason   Face    Reinf   Constraint    name         address
  Levenshtein       0.927    0.952   0.893   0.924         0.290        0.686
  Learned Leven.    0.938    0.966   0.907   0.941         0.354        0.712
  Vector            0.897    0.922   0.903   0.923         0.365        0.380
  Learned Vector    0.924    0.875   0.808   0.913         0.433        0.532

[Bilenko & Mooney 2003]
F1 (harmonic mean of precision and recall)
Experimental Results

  Distance metric     CiteSeer                               Restaurant   Restaurant
                      Reason   Face    Reinf   Constraint    name         address
  Levenshtein         0.927    0.952   0.893   0.924         0.290        0.686
  Learned Leven.      0.938    0.966   0.907   0.941         0.354        0.712
  Vector              0.897    0.922   0.903   0.923         0.365        0.380
  Learned Vector      0.924    0.875   0.808   0.913         0.433        0.532
  CRF Edit Distance   0.964    0.918   0.917   0.976         0.448        0.783

[Bilenko & Mooney 2003]
F1 (harmonic mean of precision and recall)
Experimental Results

Data set: person names, with word-order noise added

                                             F1
  Without skip-if-present-in-other-string    0.856
  With skip-if-present-in-other-string       0.981
Related Work

• Learned Edit Distance
  – [Bilenko & Mooney 2003], [Cohen et al 2003], ...
  – [Joachims 2003]: Max-margin, trained on alignments
• Conditionally-trained models with latent variables
  – [Jebara 1999]: “Conditional Expectation Maximization”
  – [Quattoni, Collins, Darrell 2005]: CRF for visual object recognition, with latent classes for object sub-patches
  – [Zettlemoyer & Collins 2005]: CRF for mapping sentences to logical form, with latent parses
“Predictive Random Fields”: Latent Variable Models fit by Multi-way Conditional Probability

• For clustering structured data, à la Latent Dirichlet Allocation and its successors
• But an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]
• But trained by a “multi-conditional” objective:
    O = P(A|B,C) P(B|A,C) P(C|A,B)
  e.g. A, B, C are different modalities
  (c.f. “Predictive Likelihood”)
[McCallum, Wang, Pal, 2005]
Predictive Random Fields: mixture of Gaussians on synthetic data
(Figure panels: data, classified by color; generatively trained; conditionally-trained [Jebara 1998]; Predictive Random Field)
[McCallum, Wang, Pal, 2005]
Predictive Random Fields vs. Harmonium on a document retrieval task
(Figure legend:)
  Harmonium, joint with words
  Harmonium, joint, with class labels and words
  Conditionally-trained, to predict class labels
  Predictive Random Field, multi-way conditionally trained
[McCallum, Wang, Pal, 2005]
Summary

• String edit distance
  – Widely used in many fields
• As in CRF sequence labeling, benefit by
  – conditional-probability training, and
  – ability to use arbitrary, non-independent input features
• Example of a conditionally-trained model with latent variables
  – “Find the alignments that most help distinguish match from non-match.”
  – May ultimately want the alignments, but only have relatively-easier-to-label +/− labels at training time: “distantly-labeled data”, “semi-supervised learning”
• Future work: edit distance on trees.
• See also “Predictive Random Fields”http://www.cs.umass.edu/~pal/PRFTR.pdf
End of talk