CSE 573 Finite State Machines for Information Extraction
Transcript of CSE 573 Finite State Machines for Information Extraction
CSE 573 Finite State Machines for Information Extraction
• Topics
  – Administrivia
  – Background
  – DIRT
  – Finite State Machine Overview
  – HMMs
  – Conditional Random Fields
  – Inference and Learning
Mini-Project Options
1. Write a program to solve the counterfeit coin problem on the midterm.
2. Build a DPLL and/or a WalkSAT satisfiability solver.
3. Build a spam filter using naïve Bayes, decision trees, or compare learners in the Weka ML package.
4. Write a program which learns Bayes nets.
What is “Information Extraction”?
Information Extraction = segmentation + classification + association + clustering
As a family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
(Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Bill Veghte, VP, Richard Stallman, founder, Free Software Foundation)

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Software Foundation
Slides from Cohen & McCallum
Landscape: Our Focus
Pattern complexity:      closed set → regular → complex → ambiguous
Pattern feature domain:  words → words + formatting → formatting
Pattern scope:           site-specific → genre-specific → general
Pattern combinations:    entity → binary → n-ary
Models:                  lexicon, regex, sliding window, boundary, FSM, CFG
Slides from Cohen & McCallum
Landscape of IE Techniques: Models
Any of these models can be used to capture words, formatting or both.
Lexicons
Alabama, Alaska, …, Wisconsin, Wyoming
Abraham Lincoln was born in Kentucky.
member?
Classify Pre-segmented Candidates
Abraham Lincoln was born in Kentucky.
Classifier
which class?
…and beyond
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Try alternate window sizes:
Boundary Models
Abraham Lincoln was born in Kentucky.
Classifier
which class?
BEGIN and END boundary markers
Context Free Grammars
Abraham Lincoln was born in Kentucky.
(Parse tree figure: POS tags NNP, V, P over the words, combined into NP, PP, VP, and S constituents.)
Most likely parse?
Finite State Machines
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
Slides from Cohen & McCallum
Simple Extractor
(FSM figure: states emitting any word (*), the trigger phrase "such as", a Cities state — Boston, Seattle, … — and an end-of-phrase (EOP) state.)
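A minimal sketch of this kind of trigger-phrase + lexicon extractor (hypothetical lexicon and regex, not the exact FSM on the slide):

```python
import re

# Hypothetical lexicon for the "Cities" state.
CITIES = {"Boston", "Seattle", "Chicago"}

def extract_cities(text):
    """Return lexicon members that appear after the trigger phrase 'such as'."""
    found = []
    for m in re.finditer(r"such as ((?:[A-Z]\w+(?:, | and )?)+)", text):
        for token in re.split(r", | and ", m.group(1)):
            if token in CITIES:          # lexicon membership ~ accepting state
                found.append(token)
    return found

print(extract_cities("We flew to cities such as Boston, Seattle and Austin."))
# -> ['Boston', 'Seattle']
```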
DIRT
• How related to IE?
• Why unsupervised?
• Distributional Hypothesis?

DIRT
• Dependency Tree?

DIRT
• Path Similarity
• Path Database

DIRT
• Evaluation? "X is author of Y"

Overall
• Accept?
• Proud?
Finite State Models
Generative, directed:  Naïve Bayes → (sequence) → HMMs → (general graphs) → Generative directed models
Conditional:           Logistic Regression → (sequence) → Linear-chain CRFs → (general graphs) → General CRFs
(Each generative model sits directly above its conditional counterpart.)
Graphical Models
• Family of probability distributions that factorize in a certain way
• Directed (Bayes Nets)
• Undirected (Markov Random Fields)
• Factor Graphs
Directed (Bayes net), over variables $\mathbf{x} = x_1 x_2 \ldots x_K$:
  $p(\mathbf{x}) = \prod_{i=1}^{K} p(x_i \mid \mathrm{Parents}(x_i))$
  A node is independent of its non-descendants given its parents.

Factor graph:
  $p(\mathbf{x}) = \frac{1}{Z} \prod_{A} \Psi_A(\mathbf{x}_A)$, where each $\Psi_A$ is a factor function over a subset $A \subseteq \{x_1, \ldots, x_K\}$.

Undirected (Markov random field):
  $p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \Psi_C(\mathbf{x}_C)$, where each $C \subseteq \{x_1, \ldots, x_K\}$ is a clique and $\Psi_C$ is a potential function.
  A node is independent of all other nodes given its neighbors.

(Figures: small example graphs over nodes $x_0, \ldots, x_5$.)
Recap: Naïve Bayes
• Assumption: features independent given label
• Generative Classifier
  – Model joint distribution p(x,y)
  – Inference
  – Learning: counting
  – Example

$p(y, \mathbf{x}) = p(y) \prod_{k=1}^{K} p(x_k \mid y)$
$p(y \mid \mathbf{x}) = \frac{1}{p(\mathbf{x})}\, p(y) \prod_{k=1}^{K} p(x_k \mid y)$
(Figure: label node y with feature nodes $x_1, x_2, \ldots, x_K$.)
The article appeared in the Seattle Times.
Features: length, capitalization, suffix — city?
Need to consider the sequence!
Labels of neighboring words are dependent!
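A minimal sketch of the naïve Bayes training (counting) and inference above (hypothetical feature names; add-one smoothing is an extra assumption):

```python
import math
from collections import defaultdict

def train_nb(examples):
    """examples: list of (feature_list, label). Learning is just counting."""
    label_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    for feats, y in examples:
        label_counts[y] += 1
        for f in feats:
            feat_counts[y][f] += 1
    return label_counts, feat_counts

def predict_nb(feats, label_counts, feat_counts):
    """argmax_y of log p(y) + sum_k log p(x_k | y), with add-one smoothing."""
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for y, cy in label_counts.items():
        score = math.log(cy / total)
        for f in feats:
            score += math.log((feat_counts[y][f] + 1) / (cy + 2))
        if score > best_score:
            best, best_score = y, score
    return best

data = [(["capitalized", "suffix=-tle"], "city"),
        (["lowercase", "suffix=-cle"], "other")]
print(predict_nb(["capitalized", "suffix=-tle"], *train_nb(data)))  # -> city
```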
Hidden Markov Models
• Generative Sequence Model
  – 2 assumptions to make the joint distribution tractable
    1. Each state depends only on its immediate predecessor.
    2. Each observation depends only on the current state.
(Figure: the finite state model — states other, person, location, … — and the corresponding graphical model, a chain of states $y_1 \ldots y_8$ (state sequence), each emitting an observation $x_1 \ldots x_8$ (observation sequence), over the example "Yesterday Pedro …".)
Hidden Markov Models
• Generative Sequence Model
• Model Parameters
  – Start state probabilities $p(y_1) := p(y_1 \mid y_0)$
  – Transition probabilities $p(y_t \mid y_{t-1})$
  – Observation probabilities $p(x_t \mid y_t)$

$p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$

(Same finite state / graphical model figure as above.)
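A minimal sketch of this joint probability under hypothetical toy parameters (state and observation names are made up for illustration):

```python
# Hypothetical toy HMM parameters, for illustration only.
start = {"other": 0.8, "person": 0.2}                      # p(y1)
trans = {"other":  {"other": 0.7, "person": 0.3},          # p(y_t | y_{t-1})
         "person": {"other": 0.5, "person": 0.5}}
emit  = {"other":  {"Yesterday": 0.4, "Pedro": 0.1},       # p(x_t | y_t)
         "person": {"Yesterday": 0.05, "Pedro": 0.6}}

def joint_prob(states, observations):
    """p(y, x) = p(y1) p(x1|y1) * prod_t p(y_t | y_{t-1}) p(x_t | y_t)."""
    p = start[states[0]] * emit[states[0]][observations[0]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][observations[t]]
    return p

print(joint_prob(["other", "person"], ["Yesterday", "Pedro"]))
# 0.8 * 0.4 * 0.3 * 0.6 = 0.0576
```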
IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM (states: person name, location name, background),
find the most likely state sequence (Viterbi):
  $\mathbf{s}^* = \arg\max_{\mathbf{s}} P(\mathbf{s}, \mathbf{o})$
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Pedro Domingos
Slide by Cohen & McCallum
IE with Hidden Markov Models
For sparse extraction tasks:
• Separate HMM for each type of target
• Each HMM should
  – Model the entire document
  – Consist of target and non-target states
  – Not necessarily be fully connected
Slide by Okan Basegmez
Information Extraction with HMMs
• Example – Research Paper Headers
Slide by Okan Basegmez
HMM Example: “Nymble”
Task: Named Entity Extraction [Bikel et al. 1998], [BBN "IdentiFinder"]
(FSM figure: states for Person, Org, five other name classes, and Other, plus start-of-sentence and end-of-sentence.)

Transition probabilities: $p(y_t \mid y_{t-1}, x_{t-1})$, backing off to $p(y_t \mid y_{t-1})$, then to $p(y_t)$.
Observation probabilities: $p(x_t \mid y_t, y_{t-1})$ or $p(x_t \mid y_t, x_{t-1})$, backing off to $p(x_t \mid y_t)$, then to $p(x_t)$.

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]

Results (trained on ~500k words of news wire text):
  Case   Language  F1
  Mixed  English   93%
  Upper  English   91%
  Mixed  Spanish   90%

Slide by Cohen & McCallum
A parse of a sequence
Given a sequence $\mathbf{x} = x_1 \ldots x_N$, a parse of $\mathbf{x}$ is a sequence of states $\mathbf{y} = y_1, \ldots, y_N$.
(Trellis figure: states 1…K at each position over observations $x_1 x_2 x_3 \ldots$; example states: person, other, location.)
Slide by Serafim Batzoglou
Question #1 – Evaluation
GIVEN
  A sequence of observations $x_1 x_2 x_3 x_4 \ldots x_N$
  A trained HMM $\theta = (\,p(y_t \mid y_{t-1}),\ p(x_t \mid y_t),\ p(y_1)\,)$
QUESTION
  How likely is this sequence, given our HMM? $P(\mathbf{x} \mid \theta)$
Why do we care?
  Need it for learning, to choose among competing models!
Question #2 – Decoding
GIVEN
  A sequence of observations $x_1 x_2 x_3 x_4 \ldots x_N$
  A trained HMM $\theta = (\,p(y_t \mid y_{t-1}),\ p(x_t \mid y_t),\ p(y_1)\,)$
QUESTION
  How do we choose the corresponding parse (state sequence) $y_1 y_2 y_3 y_4 \ldots y_N$ which "best" explains $x_1 x_2 x_3 x_4 \ldots x_N$?
There are several reasonable optimality criteria: single optimal sequence, average statistics for individual states, …
Question #3 – Learning
GIVEN
  A sequence of observations $x_1 x_2 x_3 x_4 \ldots x_N$
QUESTION
  How do we learn the model parameters $\theta = (\,p(y_t \mid y_{t-1}),\ p(x_t \mid y_t),\ p(y_1)\,)$ to maximize $P(\mathbf{x} \mid \theta)$?
Solution to #1: Evaluation
Given observations $\mathbf{x} = x_1 \ldots x_N$ and HMM $\theta$, what is $p(\mathbf{x})$?
Naïve: enumerate every possible state sequence $\mathbf{y} = y_1 \ldots y_N$.
Probability of $\mathbf{x}$ given a particular $\mathbf{y}$:
  $p(\mathbf{x} \mid \mathbf{y}) = \prod_{t=1}^{T} p(x_t \mid y_t)$
Probability of a particular $\mathbf{y}$:
  $p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})$
Summing over all possible state sequences we get
  $p(\mathbf{x}) = \sum_{\text{all } \mathbf{y}} p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{y})$
$N^T$ state sequences! 2T multiplications per sequence.
For small HMMs (T = 10, N = 10) there are already 10 billion sequences!
Solution to #1: Evaluation
Use Dynamic Programming: define a forward variable — the probability that at time t
  – the state is $y_i$
  – the partial observation sequence $x_1 \ldots x_t$ has been emitted

$p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{y})\, p(\mathbf{x} \mid \mathbf{y}) = \sum_{\mathbf{y}} \prod_{t=1 \ldots T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$
$\phantom{p(\mathbf{x})} = \sum_{y_T} \sum_{y_{T-1}} p(y_T \mid y_{T-1})\, p(x_T \mid y_T) \sum_{y_{T-2}} p(y_{T-1} \mid y_{T-2})\, p(x_{T-1} \mid y_{T-1}) \sum_{y_{T-3}} \ldots$
Solution to #1: Evaluation
• Use Dynamic Programming
• Cache and reuse inner sums
• Define forward variables
  $\alpha_t(i) := P(x_1 x_2 \ldots x_t,\, y_t = S_i)$
  the probability that at time t
  – the state is $y_t = S_i$
  – the partial observation sequence $x_1 \ldots x_t$ has been emitted
The Forward Algorithm
  $\alpha_t(i) := P(x_1 x_2 \ldots x_t,\, y_t = S_i)$
INITIALIZATION
  $\alpha_1(i) = p(y_1 = S_i)\, p(x_1 \mid y_1)$
INDUCTION
  $\alpha_t(i) = p(x_1 x_2 \ldots x_t,\, y_t = S_i) = \sum_{j \in S} \alpha_{t-1}(j)\, p(y_t = S_i \mid y_{t-1} = S_j)\, p(x_t \mid y_t)$
TERMINATION
  $p(\mathbf{x}) = \sum_{j \in S} \alpha_T(j)$
Time: O(K²N), Space: O(KN), where K = |S| is the number of states and N is the length of the sequence.
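A minimal sketch of the forward recursion above, using hypothetical toy parameters (dictionaries keyed by state name):

```python
def forward(observations, states, start, trans, emit):
    """Return p(x) and the table of forward variables alpha_t(i)."""
    # INITIALIZATION: alpha_1(i) = p(y1 = S_i) p(x1 | y1 = S_i)
    alpha = [{s: start[s] * emit[s][observations[0]] for s in states}]
    # INDUCTION: alpha_t(i) = sum_j alpha_{t-1}(j) p(S_i | S_j) p(x_t | S_i)
    for x in observations[1:]:
        prev = alpha[-1]
        alpha.append({s: sum(prev[j] * trans[j][s] for j in states) * emit[s][x]
                      for s in states})
    # TERMINATION: p(x) = sum_j alpha_T(j)
    return sum(alpha[-1].values()), alpha

# Hypothetical toy parameters (two states, two observed words).
states = ["other", "person"]
start = {"other": 0.8, "person": 0.2}
trans = {"other": {"other": 0.7, "person": 0.3},
         "person": {"other": 0.5, "person": 0.5}}
emit = {"other": {"Yesterday": 0.4, "Pedro": 0.1},
        "person": {"Yesterday": 0.05, "Pedro": 0.6}}

p_x, alphas = forward(["Yesterday", "Pedro"], states, start, trans, emit)
print(p_x)  # sums p(x, y) over all 2^2 state sequences
```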
The Forward Algorithm
(Trellis figure: $\alpha_t(3)$ is computed from $\alpha_{t-1}(1), \ldots, \alpha_{t-1}(N)$ using the transition probabilities $p(y_t = S_3 \mid y_{t-1} = S_j)$ and the observation probability $p(x_t = o_t \mid y_t = S_3)$.)
The Backward Algorithm
(Trellis figure: $\beta_t(3)$ is computed from $\beta_{t+1}(1), \ldots, \beta_{t+1}(N)$ using the transition probabilities $p(y_{t+1} = S_j \mid y_t = S_3)$ and the observation probabilities for $o_{t+1}$.)
  $\beta_t(i) := P(x_{t+1} x_{t+2} \ldots x_T \mid y_t = S_i)$
The Backward Algorithm
  $\beta_t(i) := P(x_{t+1} x_{t+2} \ldots x_T \mid y_t = S_i)$
INITIALIZATION
  $\beta_T(i) = 1$
INDUCTION
  $\beta_t(i) = \sum_{j \in S} p(y_{t+1} = S_j \mid y_t = S_i)\, p(x_{t+1} \mid y_{t+1})\, \beta_{t+1}(j)$
TERMINATION
  $p(\mathbf{x}) = \sum_{j \in S} p(y_1 = S_j)\, p(x_1 \mid y_1)\, \beta_1(j)$
Time: O(K²N), Space: O(KN)
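A matching sketch of the backward recursion (same hypothetical toy-parameter format; it should return the same p(x) as the forward sketch):

```python
def backward(observations, states, start, trans, emit):
    """Return p(x) and the table of backward variables beta_t(i)."""
    # INITIALIZATION: beta_T(i) = 1
    beta = [{s: 1.0 for s in states}]
    # INDUCTION: beta_t(i) = sum_j p(S_j | S_i) p(x_{t+1} | S_j) beta_{t+1}(j)
    for x_next in reversed(observations[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(trans[s][j] * emit[j][x_next] * nxt[j]
                               for j in states)
                        for s in states})
    # TERMINATION: p(x) = sum_j p(y1 = S_j) p(x1 | y1 = S_j) beta_1(j)
    p_x = sum(start[j] * emit[j][observations[0]] * beta[0][j] for j in states)
    return p_x, beta

# With the same toy parameters as the forward sketch, backward(...) returns
# the same p(x) — a useful sanity check.
```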
Solution to #2 – Decoding
Given $\mathbf{x} = x_1 \ldots x_N$ and HMM $\theta$, what is the "best" parse $y_1 \ldots y_N$?
Several optimal solutions
• 1. States which are individually most likely:
  the most likely state $y_t^*$ at time t is then
  $P(y_t = S_i \mid \mathbf{x}) = \dfrac{\alpha_t(i)\,\beta_t(i)}{P(\mathbf{x})} = \dfrac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}$
  $y_t^* = \arg\max_{1 \le i \le N} P(y_t = S_i \mid \mathbf{x})$
But some transitions may have 0 probability!
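A small sketch of this "individually most likely" (posterior) decoding, assuming the forward and backward sketches above are in scope:

```python
def posterior_decode(observations, states, start, trans, emit):
    """Pick, at each position t, the state maximizing P(y_t = S_i | x)."""
    p_x, alpha = forward(observations, states, start, trans, emit)
    _, beta = backward(observations, states, start, trans, emit)
    # gamma_t(i) = alpha_t(i) * beta_t(i) / p(x)
    return [max(states, key=lambda s: alpha[t][s] * beta[t][s] / p_x)
            for t in range(len(observations))]

# With the toy parameters above, posterior_decode(["Yesterday", "Pedro"], ...)
# labels "Yesterday" as other and "Pedro" as person.
```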
Solution to #2 – Decoding
Given $\mathbf{x} = x_1 \ldots x_N$ and HMM $\theta$, what is the "best" parse $y_1 \ldots y_N$?
Several optimal solutions
• 1. States which are individually most likely
• 2. Single best state sequence
  We want to find the sequence $y_1 \ldots y_N$ such that $P(\mathbf{x}, \mathbf{y})$ is maximized:
  $\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{x}, \mathbf{y})$
Again, we can use dynamic programming!
(Trellis figure: states 1…K at each position over observations $o_1 o_2 o_3 \ldots$.)
The Viterbi Algorithm
DEFINE
  $\delta_t(i) = \max_{y_1, y_2, \ldots, y_{t-1}} P(y_1, y_2, \ldots, y_{t-1},\, y_t = i,\, o_1, o_2, \ldots, o_t \mid \theta)$
INITIALIZATION
  $\delta_1(i) = p(y_1 = S_i)\, p(x_1 \mid y_1 = S_i)$
INDUCTION
  $\delta_t(j) = \max_{i \in S} \delta_{t-1}(i)\, p(y_t = S_j \mid y_{t-1} = S_i)\, p(x_t \mid y_t = S_j)$
TERMINATION
  $p^* = \max_{i \in S} \delta_T(i)$
Backtracking to get the state sequence y*.
The Viterbi Algorithm
Time: O(K²T), Space: O(KT) — linear in the length of the sequence.
(Trellis figure: one column per observation $x_1 x_2 \ldots x_T$ and one row per state; cell (i, t) holds $\delta_t(i)$, computed as a max over the previous column.)
Remember: $\delta_t(i)$ = probability of the most likely state sequence ending in state $S_i$ at time t.
Slides from Serafim Batzoglou
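A minimal sketch of Viterbi with backtracking, again under hypothetical toy parameters:

```python
def viterbi(observations, states, start, trans, emit):
    """Return the most likely state sequence y* and its joint probability."""
    # INITIALIZATION: delta_1(i) = p(y1 = S_i) p(x1 | y1 = S_i)
    delta = [{s: start[s] * emit[s][observations[0]] for s in states}]
    back = []
    # INDUCTION: delta_t(j) = max_i delta_{t-1}(i) p(S_j | S_i) p(x_t | S_j)
    for x in observations[1:]:
        prev = delta[-1]
        row, ptr = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: prev[i] * trans[i][j])
            row[j] = prev[best_i] * trans[best_i][j] * emit[j][x]
            ptr[j] = best_i
        delta.append(row)
        back.append(ptr)
    # TERMINATION + backtracking
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path, delta[-1][last]

states = ["other", "person"]
start = {"other": 0.8, "person": 0.2}
trans = {"other": {"other": 0.7, "person": 0.3},
         "person": {"other": 0.5, "person": 0.5}}
emit = {"other": {"Yesterday": 0.4, "Pedro": 0.1},
        "person": {"Yesterday": 0.05, "Pedro": 0.6}}
print(viterbi(["Yesterday", "Pedro"], states, start, trans, emit))
# (['other', 'person'], 0.0576)  since 0.8 * 0.4 * 0.3 * 0.6 = 0.0576
```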
The Viterbi Algorithm
(Worked example on "Pedro Domingos".)
Solution to #3 – Learning
Given $x_1 \ldots x_N$, how do we learn $\theta = (\,p(y_t \mid y_{t-1}),\ p(x_t \mid y_t),\ p(y_1)\,)$ to maximize $P(\mathbf{x})$?
• Unfortunately, there is no known way to analytically find a global maximum $\theta^*$ such that
  $\theta^* = \arg\max_\theta P(\mathbf{o} \mid \theta)$
• But it is possible to find a local maximum; given an initial model $\theta$, we can always find a model $\theta'$ such that
  $P(\mathbf{o} \mid \theta') \ge P(\mathbf{o} \mid \theta)$
Solution to #3 – Learning
• Use hill-climbing
  – Called the forward-backward (or Baum-Welch) algorithm
• Idea
  – Use an initial parameter instantiation
  – Loop:
    • Compute the forward and backward probabilities for the given model parameters and our observations
    • Re-estimate the parameters
  – Until estimates don't change much
Expectation Maximization
• The forward-backward algorithm is an instance of the more general EM algorithm
  – The E step: Compute the forward and backward probabilities for the given model parameters and our observations
  – The M step: Re-estimate the model parameters
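A compact sketch of one Baum-Welch (forward-backward) iteration following these E and M steps; it assumes the forward and backward sketches above, a single training sequence with T >= 2, and omits smoothing:

```python
def baum_welch_step(observations, states, start, trans, emit):
    """One EM iteration: the E step computes state and transition posteriors,
    the M step re-estimates start, transition and emission probabilities."""
    p_x, alpha = forward(observations, states, start, trans, emit)
    _, beta = backward(observations, states, start, trans, emit)
    T = len(observations)
    # E step: gamma_t(i) = P(y_t=S_i | x), xi_t(i,j) = P(y_t=S_i, y_{t+1}=S_j | x)
    gamma = [{i: alpha[t][i] * beta[t][i] / p_x for i in states}
             for t in range(T)]
    xi = [{(i, j): alpha[t][i] * trans[i][j]
                   * emit[j][observations[t + 1]] * beta[t + 1][j] / p_x
           for i in states for j in states}
          for t in range(T - 1)]
    # M step: normalized expected counts.
    new_start = {i: gamma[0][i] for i in states}
    new_trans = {i: {j: sum(x[(i, j)] for x in xi) /
                        sum(gamma[t][i] for t in range(T - 1))
                     for j in states} for i in states}
    new_emit = {i: {o: sum(gamma[t][i] for t in range(T) if observations[t] == o) /
                       sum(gamma[t][i] for t in range(T))
                    for o in set(observations)} for i in states}
    return new_start, new_trans, new_emit
```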
Chicken & Egg Problem• If we knew the actual sequence of states
– It would be easy to learn transition and emission probabilities
– But we can’t observe states, so we don’t!
• If we knew transition & emission probabilities– Then it’d be easy to estimate the sequence of states
(Viterbi)– But we don’t know them!
Slide by Daniel S. Weld
41
Simplest Version
• Mixture of two distributions
• Know: form of distribution & variance
• Just need the mean of each distribution
(Figure: data points on an axis from .01 to .09.)
Slide by Daniel S. Weld
Input Looks Like
(Figure: unlabeled data points on the .01–.09 axis.)
Slide by Daniel S. Weld
We Want to Predict
(Figure: the same points, with one point marked "?" — which distribution generated it?)
Slide by Daniel S. Weld
Chicken & Egg
.01 .03 .05 .07 .09
Note that coloring instances would be easy
if we knew Gausians….
Slide by Daniel S. Weld
45
Chicken & Egg
.01 .03 .05 .07 .09
And finding the Gausians would be easy
If we knew the coloring
Slide by Daniel S. Weld
46
Expectation Maximization (EM)
• Pretend we do know the parameters
  – Initialize randomly: set $\mu_1 = ?$; $\mu_2 = ?$
Slide by Daniel S. Weld
Expectation Maximization (EM)
• Pretend we do know the parameters
  – Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable
Slide by Daniel S. Weld

Expectation Maximization (EM)
• Pretend we do know the parameters
  – Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values
Slide by Daniel S. Weld
ML Mean of a Single Gaussian
  $\mu_{ML} = \arg\min_\mu \sum_i (x_i - \mu)^2$  (i.e., the sample mean)
Slide by Daniel S. Weld
Expectation Maximization (EM)
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values
Slide by Daniel S. Weld

Expectation Maximization (EM)
• [E step] Compute the probability of each instance having each possible value of the hidden variable
Slide by Daniel S. Weld

Expectation Maximization (EM)
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values
Slide by Daniel S. Weld
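A minimal sketch of this two-mean EM (hypothetical data points; the variance is assumed known and shared by both equal-weight components):

```python
import math

def em_two_gaussians(xs, mu1, mu2, sigma=0.01, iters=20):
    """EM for a mixture of two equal-weight Gaussians with known variance."""
    for _ in range(iters):
        # E step: responsibility of component 1 for each point.
        resp = []
        for x in xs:
            p1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            p2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
            resp.append(p1 / (p1 + p2))
        # M step: new means are responsibility-weighted averages.
        mu1 = sum(r * x for r, x in zip(resp, xs)) / sum(resp)
        mu2 = sum((1 - r) * x for r, x in zip(resp, xs)) / sum(1 - r for r in resp)
    return mu1, mu2

data = [0.01, 0.02, 0.03, 0.07, 0.08, 0.09]   # hypothetical points on the slide's axis
print(em_two_gaussians(data, mu1=0.04, mu2=0.06))  # converges near (0.02, 0.08)
```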
The Problem with HMMs
• We want more than an atomic view of words
• We want many arbitrary, overlapping features of words:
  – identity of word
  – ends in "-ski"
  – is capitalized
  – is part of a noun phrase
  – is in a list of city names
  – is under node X in WordNet
  – is in bold font
  – is indented
  – is in hyperlink anchor
  – last person name was female
  – next two words are "and Associates"
(Figure: states $y_{t-1}, y_t, y_{t+1}$ over observations $x_{t-1}, x_t, x_{t+1}$, with overlapping features such as is "Wisniewski", ends in "-ski", part of noun phrase attached to one observation.)
Slide by Cohen & McCallum
Finite State Models
Generative, directed:  Naïve Bayes → (sequence) → HMMs → (general graphs) → Generative directed models
Conditional:           Logistic Regression → (sequence) → Linear-chain CRFs → (general graphs) → General CRFs
(Each generative model sits directly above its conditional counterpart.)
Problems with a Richer Representation and a Joint Model
These arbitrary features are not independent:
  – Multiple levels of granularity (chars, words, phrases)
  – Multiple dependent modalities (words, formatting, layout)
  – Past & future
Two choices:
  – Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
  – Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
(Figure: two state-observation chains $S_{t-1}, S_t, S_{t+1}$ over $O_{t-1}, O_t, O_{t+1}$, one per choice.)
Slide by Cohen & McCallum
Discriminative and Generative Models
• So far: all models generative
• Generative models … model P(y, x)
• Discriminative models … model P(y | x)
P(y|x) does not include a model of P(x), so it does not need to model the dependencies between features!
Discriminative Models Often Better
• Eventually, what we care about is p(y|x)!
  – A Bayes net describes a family of joint distributions whose conditionals take a certain form
  – But there are many other joint models whose conditionals also have that form
• We want to make independence assumptions among y, but not among x.
Conditional Sequence Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x)
  – Can examine features, but is not responsible for generating them
  – Doesn't have to explicitly model their dependencies
  – Doesn't "waste modeling effort" trying to generate what we are given at test time anyway
Slide by Cohen & McCallum
Finite State Models
Generative, directed:  Naïve Bayes → (sequence) → HMMs → (general graphs) → Generative directed models
Conditional:           Logistic Regression → (sequence) → Linear-chain CRFs → (general graphs) → General CRFs
(Each generative model sits directly above its conditional counterpart.)
Key Ideas
• Problem Spaces
  – Use of KR to Represent States
  – Compilation to SAT
• Search
  – Dynamic Programming, Constraint Satisfaction, Heuristics
• Learning
  – Decision Trees, Need for Bias, Ensembles
• Probabilistic Inference
  – Bayes Nets, Variable Elimination, Decisions: MDPs
• Probabilistic Learning
  – Naïve Bayes, Parameter & Structure Learning
  – EM
  – HMMs: Viterbi, Baum-Welch
Applications
• SAT, CSP, Scheduling – Everywhere
• Planning – NASA, Xerox
• Machine Learning – Everywhere
• Probabilistic Reasoning – Spam filters, robot localization, etc.
http://www.cs.washington.edu/education/courses/cse473/06au/schedule/lect27.pdf