CSE 573 Finite State Machines for Information Extraction
Transcript of CSE 573 Finite State Machines for Information Extraction
CSE 573 Finite State Machines for Information Extraction
• Topics
  – Administrivia
  – Background
  – DIRT
  – Finite State Machine Overview
  – HMMs
  – Conditional Random Fields
  – Inference and Learning
Mini-Project Options
1. Write a program to solve the counterfeit coin problem on the midterm.
2. Build a DPLL and/or a WalkSAT satisfiability solver.
3. Build a spam filter using naïve Bayes, decision trees, or compare learners in the Weka ML package.
4. Write a program which learns Bayes nets.
What is “Information Extraction”?
Information Extraction = segmentation + classification + association + clustering
As a family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
(Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Bill Veghte, VP, Richard Stallman, founder, Free Software Foundation)

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Software Foundation
Slides from Cohen & McCallum
Landscape: Our Focus
Pattern complexity:      closed set → regular → complex → ambiguous
Pattern feature domain:  words → words + formatting → formatting
Pattern scope:           site-specific → genre-specific → general
Pattern combinations:    entity → binary → n-ary
Models:                  lexicon, regex, sliding window, boundary, FSM, CFG
Slides from Cohen & McCallum
Landscape of IE Techniques: Models
Any of these models can be used to capture words, formatting or both.
Lexicons
Alabama, Alaska, …, Wisconsin, Wyoming
Abraham Lincoln was born in Kentucky.
member?
Classify Pre-segmented Candidates
Abraham Lincoln was born in Kentucky.
Classifier
which class?
…and beyond
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Try alternate window sizes:
Boundary Models
Abraham Lincoln was born in Kentucky.
Classifier
which class?
BEGIN and END boundary markers
Context Free Grammars
Abraham Lincoln was born in Kentucky.
(Parse tree figure: POS tags NNP, V, P over the words, combined into NP, PP, VP, and S constituents.)
Most likely parse?
Finite State Machines
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
Slides from Cohen & McCallum
Simple Extractor
(FSM figure: states emitting any word (*), the trigger phrase "such as", a Cities state — Boston, Seattle, … — and an end-of-phrase (EOP) state.)
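A minimal sketch of this kind of trigger-phrase + lexicon extractor (hypothetical lexicon and regex, not the exact FSM on the slide):

```python
import re

# Hypothetical lexicon for the "Cities" state.
CITIES = {"Boston", "Seattle", "Chicago"}

def extract_cities(text):
    """Return lexicon members that appear after the trigger phrase 'such as'."""
    found = []
    for m in re.finditer(r"such as ((?:[A-Z]\w+(?:, | and )?)+)", text):
        for token in re.split(r", | and ", m.group(1)):
            if token in CITIES:          # lexicon membership ~ accepting state
                found.append(token)
    return found

print(extract_cities("We flew to cities such as Boston, Seattle and Austin."))
# -> ['Boston', 'Seattle']
```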
DIRT
• How related to IE?
• Why unsupervised?
• Distributional Hypothesis?

DIRT
• Dependency Tree?

DIRT
• Path Similarity
• Path Database

DIRT
• Evaluation? "X is author of Y"

Overall
• Accept?
• Proud?
Finite State Models
Generative, directed:  Naïve Bayes → (sequence) → HMMs → (general graphs) → Generative directed models
Conditional:           Logistic Regression → (sequence) → Linear-chain CRFs → (general graphs) → General CRFs
(Each generative model sits directly above its conditional counterpart.)
Graphical Models
• Family of probability distributions that factorize in a certain way
• Directed (Bayes Nets)
• Undirected (Markov Random Fields)
• Factor Graphs
Directed (Bayes net), over variables $\mathbf{x} = x_1 x_2 \ldots x_K$:
  $p(\mathbf{x}) = \prod_{i=1}^{K} p(x_i \mid \mathrm{Parents}(x_i))$
  A node is independent of its non-descendants given its parents.

Factor graph:
  $p(\mathbf{x}) = \frac{1}{Z} \prod_{A} \Psi_A(\mathbf{x}_A)$, where each $\Psi_A$ is a factor function over a subset $A \subseteq \{x_1, \ldots, x_K\}$.

Undirected (Markov random field):
  $p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \Psi_C(\mathbf{x}_C)$, where each $C \subseteq \{x_1, \ldots, x_K\}$ is a clique and $\Psi_C$ is a potential function.
  A node is independent of all other nodes given its neighbors.

(Figures: small example graphs over nodes $x_0, \ldots, x_5$.)
Recap: Naïve Bayes
• Assumption: features independent given label
• Generative Classifier
  – Model joint distribution p(x,y)
  – Inference
  – Learning: counting
  – Example

$p(y, \mathbf{x}) = p(y) \prod_{k=1}^{K} p(x_k \mid y)$
$p(y \mid \mathbf{x}) = \frac{1}{p(\mathbf{x})}\, p(y) \prod_{k=1}^{K} p(x_k \mid y)$
(Figure: label node y with feature nodes $x_1, x_2, \ldots, x_K$.)
The article appeared in the Seattle Times.
Features: length, capitalization, suffix — city?
Need to consider the sequence!
Labels of neighboring words are dependent!
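A minimal sketch of the naïve Bayes training (counting) and inference above (hypothetical feature names; add-one smoothing is an extra assumption):

```python
import math
from collections import defaultdict

def train_nb(examples):
    """examples: list of (feature_list, label). Learning is just counting."""
    label_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    for feats, y in examples:
        label_counts[y] += 1
        for f in feats:
            feat_counts[y][f] += 1
    return label_counts, feat_counts

def predict_nb(feats, label_counts, feat_counts):
    """argmax_y of log p(y) + sum_k log p(x_k | y), with add-one smoothing."""
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for y, cy in label_counts.items():
        score = math.log(cy / total)
        for f in feats:
            score += math.log((feat_counts[y][f] + 1) / (cy + 2))
        if score > best_score:
            best, best_score = y, score
    return best

data = [(["capitalized", "suffix=-tle"], "city"),
        (["lowercase", "suffix=-cle"], "other")]
print(predict_nb(["capitalized", "suffix=-tle"], *train_nb(data)))  # -> city
```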
Hidden Markov Models
• Generative Sequence Model
  – 2 assumptions to make the joint distribution tractable
    1. Each state depends only on its immediate predecessor.
    2. Each observation depends only on the current state.
(Figure: the finite state model — states other, person, location, … — and the corresponding graphical model, a chain of states $y_1 \ldots y_8$ (state sequence), each emitting an observation $x_1 \ldots x_8$ (observation sequence), over the example "Yesterday Pedro …".)
Hidden Markov Models
• Generative Sequence Model
• Model Parameters
  – Start state probabilities $p(y_1) := p(y_1 \mid y_0)$
  – Transition probabilities $p(y_t \mid y_{t-1})$
  – Observation probabilities $p(x_t \mid y_t)$

$p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$

(Same finite state / graphical model figure as above.)
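A minimal sketch of this joint probability under hypothetical toy parameters (state and observation names are made up for illustration):

```python
# Hypothetical toy HMM parameters, for illustration only.
start = {"other": 0.8, "person": 0.2}                      # p(y1)
trans = {"other":  {"other": 0.7, "person": 0.3},          # p(y_t | y_{t-1})
         "person": {"other": 0.5, "person": 0.5}}
emit  = {"other":  {"Yesterday": 0.4, "Pedro": 0.1},       # p(x_t | y_t)
         "person": {"Yesterday": 0.05, "Pedro": 0.6}}

def joint_prob(states, observations):
    """p(y, x) = p(y1) p(x1|y1) * prod_t p(y_t | y_{t-1}) p(x_t | y_t)."""
    p = start[states[0]] * emit[states[0]][observations[0]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][observations[t]]
    return p

print(joint_prob(["other", "person"], ["Yesterday", "Pedro"]))
# 0.8 * 0.4 * 0.3 * 0.6 = 0.0576
```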
IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM (states: person name, location name, background),
find the most likely state sequence (Viterbi):
  $\mathbf{s}^* = \arg\max_{\mathbf{s}} P(\mathbf{s}, \mathbf{o})$
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Pedro Domingos
Slide by Cohen & McCallum
IE with Hidden Markov Models
For sparse extraction tasks:
• Separate HMM for each type of target
• Each HMM should
  – Model the entire document
  – Consist of target and non-target states
  – Not necessarily be fully connected
Slide by Okan Basegmez
Information Extraction with HMMs
• Example – Research Paper Headers
Slide by Okan Basegmez
HMM Example: “Nymble”
Task: Named Entity Extraction [Bikel et al. 1998], [BBN "IdentiFinder"]
(FSM figure: states for Person, Org, five other name classes, and Other, plus start-of-sentence and end-of-sentence.)

Transition probabilities: $p(y_t \mid y_{t-1}, x_{t-1})$, backing off to $p(y_t \mid y_{t-1})$, then to $p(y_t)$.
Observation probabilities: $p(x_t \mid y_t, y_{t-1})$ or $p(x_t \mid y_t, x_{t-1})$, backing off to $p(x_t \mid y_t)$, then to $p(x_t)$.

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]

Results (trained on ~500k words of news wire text):
  Case   Language  F1
  Mixed  English   93%
  Upper  English   91%
  Mixed  Spanish   90%

Slide by Cohen & McCallum
A parse of a sequence
Given a sequence $\mathbf{x} = x_1 \ldots x_N$, a parse of $\mathbf{x}$ is a sequence of states $\mathbf{y} = y_1, \ldots, y_N$.
(Trellis figure: states 1…K at each position over observations $x_1 x_2 x_3 \ldots$; example states: person, other, location.)
Slide by Serafim Batzoglou
Question #1 – Evaluation
GIVEN
  A sequence of observations $x_1 x_2 x_3 x_4 \ldots x_N$
  A trained HMM $\theta = (\,p(y_t \mid y_{t-1}),\ p(x_t \mid y_t),\ p(y_1)\,)$
QUESTION
  How likely is this sequence, given our HMM? $P(\mathbf{x} \mid \theta)$
Why do we care?
  Need it for learning, to choose among competing models!
Question #2 – Decoding
GIVEN
  A sequence of observations $x_1 x_2 x_3 x_4 \ldots x_N$
  A trained HMM $\theta = (\,p(y_t \mid y_{t-1}),\ p(x_t \mid y_t),\ p(y_1)\,)$
QUESTION
  How do we choose the corresponding parse (state sequence) $y_1 y_2 y_3 y_4 \ldots y_N$ which "best" explains $x_1 x_2 x_3 x_4 \ldots x_N$?
There are several reasonable optimality criteria: single optimal sequence, average statistics for individual states, …
Question #3 – Learning
GIVEN
  A sequence of observations $x_1 x_2 x_3 x_4 \ldots x_N$
QUESTION
  How do we learn the model parameters $\theta = (\,p(y_t \mid y_{t-1}),\ p(x_t \mid y_t),\ p(y_1)\,)$ to maximize $P(\mathbf{x} \mid \theta)$?
Solution to #1: Evaluation
Given observations $\mathbf{x} = x_1 \ldots x_N$ and HMM $\theta$, what is $p(\mathbf{x})$?
Naïve: enumerate every possible state sequence $\mathbf{y} = y_1 \ldots y_N$.
Probability of $\mathbf{x}$ given a particular $\mathbf{y}$:
  $p(\mathbf{x} \mid \mathbf{y}) = \prod_{t=1}^{T} p(x_t \mid y_t)$
Probability of a particular $\mathbf{y}$:
  $p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})$
Summing over all possible state sequences we get
  $p(\mathbf{x}) = \sum_{\text{all } \mathbf{y}} p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{y})$
$N^T$ state sequences! 2T multiplications per sequence.
For small HMMs (T = 10, N = 10) there are already 10 billion sequences!
Solution to #1: Evaluation
Use Dynamic Programming: define a forward variable — the probability that at time t
  – the state is $y_i$
  – the partial observation sequence $x_1 \ldots x_t$ has been emitted

$p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{y})\, p(\mathbf{x} \mid \mathbf{y}) = \sum_{\mathbf{y}} \prod_{t=1 \ldots T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$
$\phantom{p(\mathbf{x})} = \sum_{y_T} \sum_{y_{T-1}} p(y_T \mid y_{T-1})\, p(x_T \mid y_T) \sum_{y_{T-2}} p(y_{T-1} \mid y_{T-2})\, p(x_{T-1} \mid y_{T-1}) \sum_{y_{T-3}} \ldots$
Solution to #1: Evaluation
• Use Dynamic Programming
• Cache and reuse inner sums
• Define forward variables
  $\alpha_t(i) := P(x_1 x_2 \ldots x_t,\, y_t = S_i)$
  the probability that at time t
  – the state is $y_t = S_i$
  – the partial observation sequence $x_1 \ldots x_t$ has been emitted
The Forward Algorithm
  $\alpha_t(i) := P(x_1 x_2 \ldots x_t,\, y_t = S_i)$
INITIALIZATION
  $\alpha_1(i) = p(y_1 = S_i)\, p(x_1 \mid y_1)$
INDUCTION
  $\alpha_t(i) = p(x_1 x_2 \ldots x_t,\, y_t = S_i) = \sum_{j \in S} \alpha_{t-1}(j)\, p(y_t = S_i \mid y_{t-1} = S_j)\, p(x_t \mid y_t)$
TERMINATION
  $p(\mathbf{x}) = \sum_{j \in S} \alpha_T(j)$
Time: O(K²N), Space: O(KN), where K = |S| is the number of states and N is the length of the sequence.
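A minimal sketch of the forward recursion above, using hypothetical toy parameters (dictionaries keyed by state name):

```python
def forward(observations, states, start, trans, emit):
    """Return p(x) and the table of forward variables alpha_t(i)."""
    # INITIALIZATION: alpha_1(i) = p(y1 = S_i) p(x1 | y1 = S_i)
    alpha = [{s: start[s] * emit[s][observations[0]] for s in states}]
    # INDUCTION: alpha_t(i) = sum_j alpha_{t-1}(j) p(S_i | S_j) p(x_t | S_i)
    for x in observations[1:]:
        prev = alpha[-1]
        alpha.append({s: sum(prev[j] * trans[j][s] for j in states) * emit[s][x]
                      for s in states})
    # TERMINATION: p(x) = sum_j alpha_T(j)
    return sum(alpha[-1].values()), alpha

# Hypothetical toy parameters (two states, two observed words).
states = ["other", "person"]
start = {"other": 0.8, "person": 0.2}
trans = {"other": {"other": 0.7, "person": 0.3},
         "person": {"other": 0.5, "person": 0.5}}
emit = {"other": {"Yesterday": 0.4, "Pedro": 0.1},
        "person": {"Yesterday": 0.05, "Pedro": 0.6}}

p_x, alphas = forward(["Yesterday", "Pedro"], states, start, trans, emit)
print(p_x)  # sums p(x, y) over all 2^2 state sequences
```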
The Forward Algorithm
(Trellis figure: $\alpha_t(3)$ is computed from $\alpha_{t-1}(1), \ldots, \alpha_{t-1}(N)$ using the transition probabilities $p(y_t = S_3 \mid y_{t-1} = S_j)$ and the observation probability $p(x_t = o_t \mid y_t = S_3)$.)
The Backward Algorithm
(Trellis figure: $\beta_t(3)$ is computed from $\beta_{t+1}(1), \ldots, \beta_{t+1}(N)$ using the transition probabilities $p(y_{t+1} = S_j \mid y_t = S_3)$ and the observation probabilities for $o_{t+1}$.)
  $\beta_t(i) := P(x_{t+1} x_{t+2} \ldots x_T \mid y_t = S_i)$
The Backward Algorithm
  $\beta_t(i) := P(x_{t+1} x_{t+2} \ldots x_T \mid y_t = S_i)$
INITIALIZATION
  $\beta_T(i) = 1$
INDUCTION
  $\beta_t(i) = \sum_{j \in S} p(y_{t+1} = S_j \mid y_t = S_i)\, p(x_{t+1} \mid y_{t+1})\, \beta_{t+1}(j)$
TERMINATION
  $p(\mathbf{x}) = \sum_{j \in S} p(y_1 = S_j)\, p(x_1 \mid y_1)\, \beta_1(j)$
Time: O(K²N), Space: O(KN)
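A matching sketch of the backward recursion (same hypothetical toy-parameter format; it should return the same p(x) as the forward sketch):

```python
def backward(observations, states, start, trans, emit):
    """Return p(x) and the table of backward variables beta_t(i)."""
    # INITIALIZATION: beta_T(i) = 1
    beta = [{s: 1.0 for s in states}]
    # INDUCTION: beta_t(i) = sum_j p(S_j | S_i) p(x_{t+1} | S_j) beta_{t+1}(j)
    for x_next in reversed(observations[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(trans[s][j] * emit[j][x_next] * nxt[j]
                               for j in states)
                        for s in states})
    # TERMINATION: p(x) = sum_j p(y1 = S_j) p(x1 | y1 = S_j) beta_1(j)
    p_x = sum(start[j] * emit[j][observations[0]] * beta[0][j] for j in states)
    return p_x, beta

# With the same toy parameters as the forward sketch, backward(...) returns
# the same p(x) — a useful sanity check.
```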
Solution to #2 – Decoding
Given $\mathbf{x} = x_1 \ldots x_N$ and HMM $\theta$, what is the "best" parse $y_1 \ldots y_N$?
Several optimal solutions
• 1. States which are individually most likely:
  the most likely state $y_t^*$ at time t is then
  $P(y_t = S_i \mid \mathbf{x}) = \dfrac{\alpha_t(i)\,\beta_t(i)}{P(\mathbf{x})} = \dfrac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}$
  $y_t^* = \arg\max_{1 \le i \le N} P(y_t = S_i \mid \mathbf{x})$
But some transitions may have 0 probability!
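A small sketch of this "individually most likely" (posterior) decoding, assuming the forward and backward sketches above are in scope:

```python
def posterior_decode(observations, states, start, trans, emit):
    """Pick, at each position t, the state maximizing P(y_t = S_i | x)."""
    p_x, alpha = forward(observations, states, start, trans, emit)
    _, beta = backward(observations, states, start, trans, emit)
    # gamma_t(i) = alpha_t(i) * beta_t(i) / p(x)
    return [max(states, key=lambda s: alpha[t][s] * beta[t][s] / p_x)
            for t in range(len(observations))]

# With the toy parameters above, posterior_decode(["Yesterday", "Pedro"], ...)
# labels "Yesterday" as other and "Pedro" as person.
```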
Solution to #2 – Decoding
Given $\mathbf{x} = x_1 \ldots x_N$ and HMM $\theta$, what is the "best" parse $y_1 \ldots y_N$?
Several optimal solutions
• 1. States which are individually most likely
• 2. Single best state sequence
  We want to find the sequence $y_1 \ldots y_N$ such that $P(\mathbf{x}, \mathbf{y})$ is maximized:
  $\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{x}, \mathbf{y})$
Again, we can use dynamic programming!
(Trellis figure: states 1…K at each position over observations $o_1 o_2 o_3 \ldots$.)
The Viterbi Algorithm
DEFINE
  $\delta_t(i) = \max_{y_1, y_2, \ldots, y_{t-1}} P(y_1, y_2, \ldots, y_{t-1},\, y_t = i,\, o_1, o_2, \ldots, o_t \mid \theta)$
INITIALIZATION
  $\delta_1(i) = p(y_1 = S_i)\, p(x_1 \mid y_1 = S_i)$
INDUCTION
  $\delta_t(j) = \max_{i \in S} \delta_{t-1}(i)\, p(y_t = S_j \mid y_{t-1} = S_i)\, p(x_t \mid y_t = S_j)$
TERMINATION
  $p^* = \max_{i \in S} \delta_T(i)$
Backtracking to get the state sequence y*.
The Viterbi Algorithm
Time: O(K²T), Space: O(KT) — linear in the length of the sequence.
(Trellis figure: one column per observation $x_1 x_2 \ldots x_T$ and one row per state; cell (i, t) holds $\delta_t(i)$, computed as a max over the previous column.)
Remember: $\delta_t(i)$ = probability of the most likely state sequence ending in state $S_i$ at time t.
Slides from Serafim Batzoglou
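A minimal sketch of Viterbi with backtracking, again under hypothetical toy parameters:

```python
def viterbi(observations, states, start, trans, emit):
    """Return the most likely state sequence y* and its joint probability."""
    # INITIALIZATION: delta_1(i) = p(y1 = S_i) p(x1 | y1 = S_i)
    delta = [{s: start[s] * emit[s][observations[0]] for s in states}]
    back = []
    # INDUCTION: delta_t(j) = max_i delta_{t-1}(i) p(S_j | S_i) p(x_t | S_j)
    for x in observations[1:]:
        prev = delta[-1]
        row, ptr = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: prev[i] * trans[i][j])
            row[j] = prev[best_i] * trans[best_i][j] * emit[j][x]
            ptr[j] = best_i
        delta.append(row)
        back.append(ptr)
    # TERMINATION + backtracking
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path, delta[-1][last]

states = ["other", "person"]
start = {"other": 0.8, "person": 0.2}
trans = {"other": {"other": 0.7, "person": 0.3},
         "person": {"other": 0.5, "person": 0.5}}
emit = {"other": {"Yesterday": 0.4, "Pedro": 0.1},
        "person": {"Yesterday": 0.05, "Pedro": 0.6}}
print(viterbi(["Yesterday", "Pedro"], states, start, trans, emit))
# (['other', 'person'], 0.0576)  since 0.8 * 0.4 * 0.3 * 0.6 = 0.0576
```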
The Viterbi Algorithm
(Worked example on "Pedro Domingos".)
Solution to #3 – Learning
Given $x_1 \ldots x_N$, how do we learn $\theta = (\,p(y_t \mid y_{t-1}),\ p(x_t \mid y_t),\ p(y_1)\,)$ to maximize $P(\mathbf{x})$?
• Unfortunately, there is no known way to analytically find a global maximum $\theta^*$ such that
  $\theta^* = \arg\max_\theta P(\mathbf{o} \mid \theta)$
• But it is possible to find a local maximum; given an initial model $\theta$, we can always find a model $\theta'$ such that
  $P(\mathbf{o} \mid \theta') \ge P(\mathbf{o} \mid \theta)$
Solution to #3 – Learning
• Use hill-climbing
  – Called the forward-backward (or Baum-Welch) algorithm
• Idea
  – Use an initial parameter instantiation
  – Loop:
    • Compute the forward and backward probabilities for the given model parameters and our observations
    • Re-estimate the parameters
  – Until estimates don't change much
Expectation Maximization
• The forward-backward algorithm is an instance of the more general EM algorithm
  – The E step: Compute the forward and backward probabilities for the given model parameters and our observations
  – The M step: Re-estimate the model parameters
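A compact sketch of one Baum-Welch (forward-backward) iteration following these E and M steps; it assumes the forward and backward sketches above, a single training sequence with T >= 2, and omits smoothing:

```python
def baum_welch_step(observations, states, start, trans, emit):
    """One EM iteration: the E step computes state and transition posteriors,
    the M step re-estimates start, transition and emission probabilities."""
    p_x, alpha = forward(observations, states, start, trans, emit)
    _, beta = backward(observations, states, start, trans, emit)
    T = len(observations)
    # E step: gamma_t(i) = P(y_t=S_i | x), xi_t(i,j) = P(y_t=S_i, y_{t+1}=S_j | x)
    gamma = [{i: alpha[t][i] * beta[t][i] / p_x for i in states}
             for t in range(T)]
    xi = [{(i, j): alpha[t][i] * trans[i][j]
                   * emit[j][observations[t + 1]] * beta[t + 1][j] / p_x
           for i in states for j in states}
          for t in range(T - 1)]
    # M step: normalized expected counts.
    new_start = {i: gamma[0][i] for i in states}
    new_trans = {i: {j: sum(x[(i, j)] for x in xi) /
                        sum(gamma[t][i] for t in range(T - 1))
                     for j in states} for i in states}
    new_emit = {i: {o: sum(gamma[t][i] for t in range(T) if observations[t] == o) /
                       sum(gamma[t][i] for t in range(T))
                    for o in set(observations)} for i in states}
    return new_start, new_trans, new_emit
```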
Chicken & Egg Problem• If we knew the actual sequence of states
– It would be easy to learn transition and emission probabilities
– But we can’t observe states, so we don’t!
• If we knew transition & emission probabilities– Then it’d be easy to estimate the sequence of states
(Viterbi)– But we don’t know them!
Slide by Daniel S. Weld
41
Simplest Version
• Mixture of two distributions
• Know: form of distribution & variance
• Just need the mean of each distribution
(Figure: data points on an axis from .01 to .09.)
Slide by Daniel S. Weld
Input Looks Like
(Figure: unlabeled data points on the .01–.09 axis.)
Slide by Daniel S. Weld
We Want to Predict
(Figure: the same points, with one point marked "?" — which distribution generated it?)
Slide by Daniel S. Weld
Chicken & Egg
.01 .03 .05 .07 .09
Note that coloring instances would be easy
if we knew Gausians….
Slide by Daniel S. Weld
45
Chicken & Egg
.01 .03 .05 .07 .09
And finding the Gausians would be easy
If we knew the coloring
Slide by Daniel S. Weld
46
Expectation Maximization (EM)
• Pretend we do know the parameters
  – Initialize randomly: set $\mu_1 = ?$; $\mu_2 = ?$
Slide by Daniel S. Weld
Expectation Maximization (EM)
• Pretend we do know the parameters
  – Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable
Slide by Daniel S. Weld

Expectation Maximization (EM)
• Pretend we do know the parameters
  – Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values
Slide by Daniel S. Weld
ML Mean of a Single Gaussian
  $\mu_{ML} = \arg\min_\mu \sum_i (x_i - \mu)^2$  (i.e., the sample mean)
Slide by Daniel S. Weld
Expectation Maximization (EM)
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values
Slide by Daniel S. Weld

Expectation Maximization (EM)
• [E step] Compute the probability of each instance having each possible value of the hidden variable
Slide by Daniel S. Weld

Expectation Maximization (EM)
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values
Slide by Daniel S. Weld
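A minimal sketch of this two-mean EM (hypothetical data points; the variance is assumed known and shared by both equal-weight components):

```python
import math

def em_two_gaussians(xs, mu1, mu2, sigma=0.01, iters=20):
    """EM for a mixture of two equal-weight Gaussians with known variance."""
    for _ in range(iters):
        # E step: responsibility of component 1 for each point.
        resp = []
        for x in xs:
            p1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            p2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
            resp.append(p1 / (p1 + p2))
        # M step: new means are responsibility-weighted averages.
        mu1 = sum(r * x for r, x in zip(resp, xs)) / sum(resp)
        mu2 = sum((1 - r) * x for r, x in zip(resp, xs)) / sum(1 - r for r in resp)
    return mu1, mu2

data = [0.01, 0.02, 0.03, 0.07, 0.08, 0.09]   # hypothetical points on the slide's axis
print(em_two_gaussians(data, mu1=0.04, mu2=0.06))  # converges near (0.02, 0.08)
```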
The Problem with HMMs
• We want more than an atomic view of words
• We want many arbitrary, overlapping features of words:
  – identity of word
  – ends in "-ski"
  – is capitalized
  – is part of a noun phrase
  – is in a list of city names
  – is under node X in WordNet
  – is in bold font
  – is indented
  – is in hyperlink anchor
  – last person name was female
  – next two words are "and Associates"
(Figure: states $y_{t-1}, y_t, y_{t+1}$ over observations $x_{t-1}, x_t, x_{t+1}$, with overlapping features such as is "Wisniewski", ends in "-ski", part of noun phrase attached to one observation.)
Slide by Cohen & McCallum
Finite State Models
Generative, directed:  Naïve Bayes → (sequence) → HMMs → (general graphs) → Generative directed models
Conditional:           Logistic Regression → (sequence) → Linear-chain CRFs → (general graphs) → General CRFs
(Each generative model sits directly above its conditional counterpart.)
Problems with a Richer Representation and a Joint Model
These arbitrary features are not independent:
  – Multiple levels of granularity (chars, words, phrases)
  – Multiple dependent modalities (words, formatting, layout)
  – Past & future
Two choices:
  – Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
  – Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
(Figure: two state-observation chains $S_{t-1}, S_t, S_{t+1}$ over $O_{t-1}, O_t, O_{t+1}$, one per choice.)
Slide by Cohen & McCallum
Discriminative and Generative Models
• So far: all models generative
• Generative models … model P(y, x)
• Discriminative models … model P(y | x)
P(y|x) does not include a model of P(x), so it does not need to model the dependencies between features!
Discriminative Models Often Better
• Eventually, what we care about is p(y|x)!
  – A Bayes net describes a family of joint distributions whose conditionals take a certain form
  – But there are many other joint models whose conditionals also have that form
• We want to make independence assumptions among y, but not among x.
Conditional Sequence Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x)
  – Can examine features, but is not responsible for generating them
  – Doesn't have to explicitly model their dependencies
  – Doesn't "waste modeling effort" trying to generate what we are given at test time anyway
Slide by Cohen & McCallum
Finite State Models
Generative, directed:  Naïve Bayes → (sequence) → HMMs → (general graphs) → Generative directed models
Conditional:           Logistic Regression → (sequence) → Linear-chain CRFs → (general graphs) → General CRFs
(Each generative model sits directly above its conditional counterpart.)
Key Ideas
• Problem Spaces
  – Use of KR to Represent States
  – Compilation to SAT
• Search
  – Dynamic Programming, Constraint Satisfaction, Heuristics
• Learning
  – Decision Trees, Need for Bias, Ensembles
• Probabilistic Inference
  – Bayes Nets, Variable Elimination, Decisions: MDPs
• Probabilistic Learning
  – Naïve Bayes, Parameter & Structure Learning
  – EM
  – HMMs: Viterbi, Baum-Welch
Applications
• SAT, CSP, Scheduling – Everywhere
• Planning – NASA, Xerox
• Machine Learning – Everywhere
• Probabilistic Reasoning – Spam filters, robot localization, etc.
http://www.cs.washington.edu/education/courses/cse473/06au/schedule/lect27.pdf