Graphical Models for Sequence Labeling in NLP


M.Tech Seminar Report

Submitted by

Anup Abhay Kulkarni

Roll No: 08305045

Under the guidance of

Prof. Pushpak Bhattacharyya

Department of Computer Science and Engineering
Indian Institute of Technology Bombay

Mumbai

2009


Abstract

In NLP tasks it is important to understand the interaction between words and their labels. More information can be extracted from the context than from treating each word as an isolated unit. It has therefore proved beneficial to represent an NLP task in the form of a graph and to train a probabilistic model on that graph. In this report we discuss how NLP tasks can be represented and solved using different probabilistic graphical models.


Table of Contents

1 Introduction
  1.1 POS Tagging
  1.2 Chunking
  1.3 Named Entity Recognition
2 Random Fields
  2.1 Labeling Problem
  2.2 Neighboring System
  2.3 Random Fields
    2.3.1 Markov Random Fields
    2.3.2 Gibbs Random Field
  2.4 Gibbs-Markov Equivalence
    2.4.1 Proof of GRF → MRF
    2.4.2 Proof of MRF → GRF
3 Maximum Entropy Model
  3.1 HMM to MEMM
    3.1.1 Problems in Traditional HMM
    3.1.2 Basics of Maximum Entropy
  3.2 Formulation
    3.2.1 Formulating the Dual Objective
  3.3 Training
    3.3.1 Parameter Estimation
    3.3.2 Training Algorithm
  3.4 Inferencing
  3.5 Application
    3.5.1 Feature Selection for POS Tagging
    3.5.2 Example
4 Cyclic Dependency Network
  4.1 From HMM to Cyclic Dependency Network
  4.2 Cyclic Dependency Network
  4.3 Inferencing in Cyclic Dependency Network
5 Conditional Random Fields
  5.1 Label Bias Problem
  5.2 Conditional Random Field
    5.2.1 Formulation
  5.3 Training
    5.3.1 Parameter Estimation
    5.3.2 Feature Selection
  5.4 Inferencing
  5.5 Application
    5.5.1 Chunking
    5.5.2 Named Entity Recognition
6 Conclusion


Chapter 1

Introduction

The ultimate goal of natural language processing is to understand language, but this is a hard task to achieve. It is therefore broken down into the relatively simple and computationally tractable form of sequence labeling. Sequence labeling looks for patterns in sequences of text and attempts to label them accordingly. We will discuss three main sequence labeling tasks:

• POS Tagging

• Chunking

• Named Entity Recognition

1.1 POS Tagging

One of the important tasks in Natural Language Processing is to classify each word into its lexical category. In the simplest form the lexical categories can be Noun, Verb, Adjective, Adverb, etc. Depending on the usage and morphology of words, these categories are refined into a larger number of POS tags. POS tagging is the task of tagging each word with its correct POS. Example:

The AT postman NN collected VBD letters NNS and CNJ left VBD.

In the example given above, "The" is an article, so it is marked AT. "Collected" and "left" are verbs in their past-tense forms and hence tagged VBD. Though "postman" and "letters" are both nouns, "postman" is singular while "letters" is plural, so they are tagged differently. In this example every word seems to have a unique POS tag. Consider another example in which the same word is tagged differently depending on the role it plays.

He PN plays VBZ violin NN in PP the AT play NN.

Here the word "play" appears twice and is tagged differently. The first occurrence is used as a verb and hence tagged VBZ (verb, 3rd person singular). The second occurrence is used as a noun and tagged NN.

These two examples show how POS tags are assigned, and that tagging is more than simply keeping a dictionary of words and their POS tags. To correctly identify the POS tag of a word we need to know what role the word plays in the sentence. To identify that role we need certain rules of the language which help identify the correct tags. Example: In English, an article is followed by a noun, so if we correctly identify articles then we can correctly label the nouns following them. The word "been" is always followed by a verb in past participle or gerund form.

We use graphical models to find such rules from the training dataset.


1.2 Chunking

Chunking is the task of understanding the structure of a sentence without fully parsing it. The output of chunking is a segmentation of the sentence into groups of words that belong to the same grammatical unit. The labels produced by chunking are the grammatical phrase types prefixed with B, I or O: B for begins the phrase, I for continues the current phrase, and O for outside any phrase. The question, then, is how graphical models can be used for chunking.

Example:

He <B-NP> reckons <B-VP> the <B-NP> current <I-NP> account <I-NP> deficit <I-NP> will <B-VP> narrow <I-VP> . <O>

In this example "He" begins a noun phrase. "Reckons" begins a new verb phrase. "The" starts a new noun phrase which is continued by "current account deficit". The presence of auxiliary verbs and verbs in infinitive form demands I-VP, i.e. continuation of the same verb phrase.

It can be seen that, since I marks the continuation of the current phrase, I will never occur as the label of the first word of a sentence. Also I-NP (continue the current noun phrase) can never directly follow I-VP (continue the current verb phrase); there must be a B-VP label in between, marking the beginning of the verb phrase. Thus the labels associated with the words are not independent of each other, and their mutual interaction can be captured by constructing a graph structure over them. Since a chunk is likely to contain semantically related words, chunking plays an important role in information extraction.
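The constraints described above are easy to state programmatically. The following is a minimal sketch (not from the report) of a checker for the two BIO constraints just mentioned: an I-X label may not start a sentence and may only continue a chunk of the same phrase type.

def is_valid_bio(labels):
    """Return True if a chunk label sequence like ['B-NP', 'I-NP', 'O'] is consistent."""
    prev = "O"
    for label in labels:
        if label.startswith("I-"):
            phrase = label[2:]
            # I-X is only legal right after B-X or I-X of the same phrase type.
            if prev not in ("B-" + phrase, "I-" + phrase):
                return False
        prev = label
    return True

print(is_valid_bio(["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP", "O"]))  # True
print(is_valid_bio(["I-NP", "B-VP"]))   # False: I cannot start a sentence
print(is_valid_bio(["B-VP", "I-NP"]))   # False: I-NP cannot continue a VP chunk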

1.3 Named Entity Recognition

A named entity, as the name suggests, is an entity with a name given to it. The entities are proper nouns, and the task is to classify them into the name of a person, the name of a location, the name of an organization, etc. Recognizing named entities plays a very important role in question answering systems, information extraction systems and search engines.

Example:

I bought 100 shares of IBM Corp <ORG>.

In the above example "IBM Corp" is tagged as the name of an organization. Now let us consider an example in which the same word is labeled with different tags.

IIT Bombay <ORG> is located in Powai, which is in north Bombay <LOC>.

In this example "Bombay" appears twice: the first occurrence is tagged as part of an organization name and the second as a location.

Wesley <PER> went to buy the book published by Addison Wesley <ORG>.

Again, "Wesley" is first tagged as the name of a person whereas the second time it is tagged as part of an organization name.


Chapter 2

Random Fields

2.1 Labeling Problem

The labeling problem is formally defined as assigning a label from a set L to each node yi of the graph. The set Y = {y1, y2, . . . , yn} is called a labeling or configuration.

Suppose the total number of possible labels at each node is |L| = l. Then the total number of possible configurations is

|Y| = l × l × · · · × l = l^n

where n is the number of nodes. Example: Consider POS tagging with the label set L = {N, V, ADJ, ADV, PP}, where N is noun, V is verb, ADJ is adjective, ADV is adverb and PP is preposition. For the sentence "IIT-Bombay is one of the best institutes in India" there are 9 words to label and each word can be given 5 possible tags, so there are 5^9 possible labelings.

In a sequence labeling task an image or a text is viewed as a graph, with every pixel or every word being a node of that graph. The edges of the graph are defined implicitly by the mutual interactions among the nodes. For example, in image labeling the task is to label each pixel with the object in the image to which it belongs; in text labeling the task is to find which phrase of the sentence a word belongs to. Edges are then placed between the words (or pixels) that directly influence, or are assumed to influence, the labels of other words (or pixels).

2.2 Neighboring System

Once we define a graph over the text data we define a neighboring system on that graph. A neighboring system Ne is defined as

Ne = {Ne(i) | ∀i ∈ Y}

where each Ne(i) is the set of nodes directly connected to i. Ne(i) has the following properties: a site is not a neighbor of itself, i.e. i ∉ Ne(i), and neighborhood is mutual, i.e. if j ∈ Ne(i) then i ∈ Ne(j). For images, the neighboring set of a pixel is usually defined by its 4-connectivity or 8-connectivity. For text data the graph is usually a chain graph, so each node has 2-connectivity: each word is connected to its previous word and its next word.

In graph theory a clique is a completely connected subgraph. In this discussion we take all possible cliques into consideration, so the set of cliques is

C = C1 ∪ C2 ∪ C3 ∪ . . .

where C1 is the set of single nodes, C2 the set of pairs of connected nodes, C3 the set of triples of mutually neighboring nodes, and so on. Since in text applications the graph is usually a chain graph, only cliques up to size 2 are considered.
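As a small illustration (a sketch assuming the chain-graph view of a sentence described above, with made-up indexing helpers), the neighboring system and the size-1 and size-2 cliques of a chain can be built as follows.

words = "IIT-Bombay is one of the best institutes in India".split()
n = len(words)

# 2-connectivity: each word is linked to its previous and next word.
Ne = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

cliques_1 = [(i,) for i in range(n)]            # node cliques
cliques_2 = [(i, i + 1) for i in range(n - 1)]  # edge cliques of the chain

print(Ne[5])         # neighbors of "best" -> [4, 6] ("the", "institutes")
print(cliques_2[:3])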


2.3 Random Fields

A random field is a generalization of a stochastic process in which the underlying index need no longer be a single real number but can be a multidimensional vector. In simple words, when instead of a single random variable over a sample space we have a list or vector of random variables, the corresponding random process is called a random field.

Let Y = {y1, y2, . . . , yn} be a set of random variables defined over the set S, in which each random variable yi takes a value li ∈ L. The probability associated with an individual random variable is denoted Pr(yi = li), whereas the probability of the joint event is given by

Pr(Y) = Pr(y1, y2, . . . , yn)

2.3.1 Markov Random Fields

A Markov random field is a model of the full joint probability distribution of a set of random variables having the Markov property. A random field is said to be a Markov Random Field (MRF) when it satisfies the following properties:

• Positivity: every configuration has nonzero probability,

∀y : Pr(y) > 0

• Markovianity:

Pr(yi | yS−{i}) = Pr(yi | yNe(i))

where Ne(i) is the set of all neighbors of node i in the graph.

The Markov property says that the probability of a node taking a particular label given the labels of all the other nodes in the graph equals the probability of that node taking the label given only the labels of its neighbors.

Example: Consider the same graph for the sentence "IIT-Bombay is one of the best institutes in India". By the Markov property, the label of the word "institutes" given all the other labels in the sentence depends only on the labels of "best" and "in", as shown in Figure 2.1.

Figure 2.1: Graphical model induced on a sentence. The word "institute" depends only on "best" and "in".

2.3.2 Gibbs Random Field

A random field is said to be a Gibbs random field (GRF) if and only if its configurations obey a Gibbs distribution. The Gibbs distribution is defined as

Pr(y) = exp(−U(y)/T) / Z

where Z is defined as

Z = Σ_{y∈Y} exp(−U(y)/T)


Z is called the normalizing constant. T is called the temperature and is taken to be 1 unless explicitly stated otherwise. U(y) is called the energy function, defined as

U(y) = Σ_{c∈C} Vc(y)

that is, the sum of all clique potentials. The value of Vc(y) depends on the local configuration of the clique; these potentials are recast as features in the later discussion. The energy function can also be written as

U(y) = Σ_{c∈C1} V1(y) + Σ_{c∈C2} V2(y) + . . .

Here V1 is the potential over single nodes, or node potential, and V2 is the potential over pairs of neighboring nodes, or edge potential.
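The following is a minimal sketch (a toy 3-node chain with made-up node and edge potentials, not anything from the report) of how a Gibbs distribution Pr(y) = exp(−U(y)/T)/Z is computed from clique potentials.

import itertools
import math

LABELS = ["N", "V"]
n = 3  # chain of 3 nodes

def V1(i, y):          # node potential (assumed values, for illustration only)
    return 0.5 if y == "N" else 0.0

def V2(i, y_prev, y):  # edge potential on the chain clique (i-1, i)
    return 1.0 if y_prev == y else 0.0

def energy(ys):
    U = sum(V1(i, y) for i, y in enumerate(ys))
    U += sum(V2(i, ys[i - 1], ys[i]) for i in range(1, n))
    return U

# Pr(y) = exp(-U(y)/T) / Z with T = 1
configs = list(itertools.product(LABELS, repeat=n))
Z = sum(math.exp(-energy(ys)) for ys in configs)
pr = {ys: math.exp(-energy(ys)) / Z for ys in configs}
print(sum(pr.values()))      # 1.0
print(max(pr, key=pr.get))   # the lowest-energy configuration has the highest probability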

2.4 Gibbs-Markov Equivalence

An MRF is characterized by its local property (Markovianity) whereas a GRF is characterized by its global property (the Gibbs distribution). The Hammersley-Clifford theorem establishes the equivalence of these two types of properties.

The Hammersley-Clifford theorem states that F is an MRF if and only if F is a GRF.

2.4.1 Proof of GRF →MRF

Let Di = Ne(i) ∪ {Xi} be the set of neighbors of Xi together with Xi itself. We have to prove that if the Gibbs distribution is satisfied then the Markov property holds. That is, given the Gibbs distribution, we have to arrive at

Pr(Xi | XS−i) = Pr(Xi | XNe(i))    (2.1)

Starting with the LHS,

Pr(Xi | XS−i) = Pr(Xi, XS−i) / Pr(XS−i)
             = Pr(Xi, XS−i) / Σ_{Xi} Pr(Xi, XS−i)
             = [ exp(Σ_{c∈C} Vc(X)) / Z(X) ] / [ Σ_{Xi} exp(Σ_{c∈C} Vc(X)) / Z(X) ]
             = exp(Σ_{c∈C} Vc(X)) / Σ_{Xi} exp(Σ_{c∈C} Vc(X))

Now the set of cliques C in the graph splits into the set A of cliques which contain the node Xi and the set B of cliques which do not contain Xi. Using this,

Pr(Xi | XS−i) = exp(Σ_{c∈C} Vc(X)) / Σ_{Xi} exp(Σ_{c∈C} Vc(X))
             = [ exp(Σ_{c∈A} Vc(X)) × exp(Σ_{c∈B} Vc(X)) ] / Σ_{Xi} [ exp(Σ_{c∈A} Vc(X)) × exp(Σ_{c∈B} Vc(X)) ]

Since the cliques in B do not contain the node Xi, the factor exp(Σ_{c∈B} Vc(X)) can be brought out of the summation and cancelled:

Pr(Xi | XS−i) = exp(Σ_{c∈A} Vc(X)) / Σ_{Xi} exp(Σ_{c∈A} Vc(X))


The term in the numerator contains all the cliques which contain Xi. We arrive at the same expression from the RHS:

Pr(Xi | XNe(i)) = Pr(Xi, XNe(i)) / Pr(XNe(i))
               = [ exp(Σ_{c∈A} Vc(X)) / Z(X) ] / [ Σ_{Xi} exp(Σ_{c∈A} Vc(X)) / Z(X) ]
               = exp(Σ_{c∈A} Vc(X)) / Σ_{Xi} exp(Σ_{c∈A} Vc(X))

Thus 2.1 is proved.

2.4.2 Proof of MRF → GRF

For this the Möbius inversion principle is used:

F(A) = Σ_{B:B⊆A} G(B)  ⇔  G(B) = Σ_{C:C⊆B} (−1)^{|B|−|C|} F(C)

Written for the energy function U(X) and the potentials V(X),

U(xV) = Σ_{B:B⊆V} V(xB)  ⇔  V(xB) = Σ_{C:C⊆B} (−1)^{|B|−|C|} U(xC)

where V is our graph and B is a subgraph. The Gibbs distribution requires the energy to be defined over the cliques of the graph, so we have to prove that

V(xB) = 0    (2.2)

whenever B is not completely connected, i.e. B is not a clique. When B is not a clique, let Z1 and Z2 be two nodes that are not connected in B, and let S be the separator, so that {Z1} ∪ S ∪ {Z2} = B. Now

V(xB) = Σ_{C:C⊆B} (−1)^{|B|−|C|} U(xC)

can be written, splitting the subsets C of B according to whether they contain Z1, Z2, both, or neither, as

V(xB) = Σ_{C′:C′⊆S} (−1)^{|B|−|C′|} U(xC′)
      + Σ_{C′:C′⊆S} (−1)^{|B|−|Z1C′|} U(x_{Z1C′})
      + Σ_{C′:C′⊆S} (−1)^{|B|−|C′Z2|} U(x_{C′Z2})
      + Σ_{C′:C′⊆S} (−1)^{|B|−|Z1C′Z2|} U(x_{Z1C′Z2})

Adjusting the exponents of (−1),

V(xB) = Σ_{C′:C′⊆S} (−1)^{|B|−|C′|} [ U(xC′) − U(x_{Z1C′}) − U(x_{C′Z2}) + U(x_{Z1C′Z2}) ]    (2.3)

Now we show that the bracketed term [·] is 0. Taking the exponential of the [·] term,

exp[·] = exp( U(xC′) − U(x_{Z1C′}) − U(x_{C′Z2}) + U(x_{Z1C′Z2}) )
       = [ exp(U(xC′)) × exp(U(x_{Z1C′Z2})) ] / [ exp(U(x_{Z1C′})) × exp(U(x_{C′Z2})) ]
       = [ exp(U(xC′))/Z × exp(U(x_{Z1C′Z2}))/Z ] / [ exp(U(x_{Z1C′}))/Z × exp(U(x_{C′Z2}))/Z ]
       = [ Pr(xC′) Pr(x_{Z1C′Z2}) ] / [ Pr(x_{Z1C′}) Pr(x_{C′Z2}) ]


Using Bayes' theorem,

exp[·] = Pr(x_{Z1} | x_{C′Z2}) / Pr(x_{Z1} | x_{C′})

Since X_{Z1} and X_{Z2} are not directly connected, the Markov property gives

Pr(x_{Z1} | x_{C′Z2}) = Pr(x_{Z1} | x_{C′})

and hence exp[·] = 1, so the bracketed term in 2.3 is 0. Therefore, by 2.2, every MRF is a GRF.


Chapter 3

Maximum Entropy Model

3.1 HMM to MEMM

3.1.1 Problems in Traditional HMM

[5] discusses the need to move beyond Hidden Markov Models. In a Hidden Markov Model, observation probabilities are expressed as a multinomial distribution over the observation data and the parameters are then estimated. There are two main problems with this approach:

1. Though the input is raw text, many important characteristics of the text can be used in constructing the model, such as capitalization, suffixes, position on the page, and the POS tag of the word. For example, for unseen words we can make use of the suffix of the unknown word and look at the distribution of labels for that suffix in the corpus; the same holds for capitalization. This extra information is not used by a traditional HMM.

2. In almost all text labeling applications the problem is to find the label (or label sequence) given the observation sequence. So even though the problem requires a conditional model, the HMM models the joint probability. Because of this joint modeling, the dependency graph of an HMM has the observation depending on the label, whereas intuitively the label should depend on the observation.

With these drawbacks of the traditional HMM in mind, we look for a new probability model.

Figure 3.1: A. Dependency graph for a traditional HMM. B. Dependency graph for a Maximum Entropy Markov Model (MEMM).

3.1.2 Basics of Maximum Entropy

Consider one sequence labeling task, POS tagging, in which we tag every word in a sentence with its part-of-speech tag. We are given an observed dataset in which words are correctly tagged by human experts. Our main aim is to build a model which correctly fits the observed data. For simplicity, suppose we have only 5 tags: Noun (N), Verb (V), Adjective (ADJ), Adverb (ADV) and Preposition (PP). So given a word W we have only 5 choices of tag. This imposes one constraint on the probability model:

Pr(N |W ) + Pr(V |W ) + Pr(ADJ |W ) + Pr(ADV |W ) + Pr(PP |W ) = 1


Given only this equation, infinitely many probability models are possible. One model might make very strong assumptions and assign Pr(N|W) = 1 and 0 to the others, meaning it would assign the POS tag N to every word. Another model might assign equal probabilities to Noun and Verb and 0 to the others:

Pr(N|W) + Pr(V|W) = 1

Both of these models make stronger assumptions than what is observed in the data. Yet another model can be given as

Pr(N|W) = Pr(V|W) = Pr(ADJ|W) = Pr(ADV|W) = Pr(PP|W) = 1/5

which assumes every tag is equally likely. Now suppose it is seen from the observed data that the word W is tagged as a Verb 70% of the time. We can incorporate this information into our model as:

Pr(N|W) + Pr(ADJ|W) + Pr(ADV|W) + Pr(PP|W) = 0.3

Pr(V|W) = 0.7

Now we need a model for how the remaining probability mass of 0.3 is distributed among N, ADJ, ADV and PP, but for this we have no information available from the observed data. In general, the question we need to answer is what to assume about unknown information. Here we make use of the maximum entropy principle, which states that if incomplete information about a probability distribution is available, the only unbiased assumption is a distribution which is as uniform as possible subject to the known information as constraints.

Since uniformity of a probability distribution corresponds to maximizing its entropy, we select the probability model which maximizes entropy under the given constraints.
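The running example can be worked out directly. A minimal sketch (using the toy numbers above, not any trained model): with the single constraint Pr(V|W) = 0.7, the maximum-entropy choice spreads the remaining mass uniformly over the other tags.

import math

tags = ["N", "V", "ADJ", "ADV", "PP"]
constraint = {"V": 0.7}

remaining_mass = 1.0 - sum(constraint.values())
free_tags = [t for t in tags if t not in constraint]

p = dict(constraint)
p.update({t: remaining_mass / len(free_tags) for t in free_tags})

entropy = -sum(q * math.log(q) for q in p.values())
print(p)        # {'V': 0.7, 'N': 0.075, 'ADJ': 0.075, 'ADV': 0.075, 'PP': 0.075}
print(entropy)  # entropy of the chosen distribution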

3.2 Formulation

In this section we formally define the Maximum Entropy model and derive its formula, as discussed in [2]. To restate, our goal is to come up with a model which correctly fits the observed data. We are given N samples in the form of (word, label) pairs (x1, y1), (x2, y2), . . . , (xN, yN). We use these training examples to calculate the empirical probability distribution (denoted with a tilde)

Pr~(x, y) = (number of times (x, y) occurs in the sample) / N

An important part of this model is the statistics collected from the sample data. These statistics should give as much information as possible about the relationship between words and their labels. Example: in most cases an article (a, an, the) is followed by a noun; or, if the word 'play' is preceded by a noun then 'play' is a verb. Such information about the interaction between words and labels is captured using "features". A feature is an indicator function of the form

f(x, y) = 1 if x = 'play' and y = verb, and 0 otherwise

We populate these features from the training data and put the constraint on our model that the expectation of each feature under the model should equal its expectation in the training data:

Pr~(f) = Pr(f)    (3.1)

where the expected value of feature f under the empirical distribution is

Pr~(f) = Σ_{x,y} Pr~(x, y) f(x, y)

and the expected value of feature f under our probability model is

Pr(f) = Σ_{x,y} Pr(x, y) f(x, y)


Using the chain rule of probability,

Pr(f) = Σ_{x,y} Pr(x) Pr(y|x) f(x, y)

Here the assumption is made that the probability of a word under the empirical distribution equals its probability under the model we are formulating, so

Pr(f) = Σ_{x,y} Pr~(x) Pr(y|x) f(x, y)

Putting this into 3.1,

Σ_{x,y} Pr~(x, y) f(x, y) = Σ_{x,y} Pr~(x) Pr(y|x) f(x, y)    (3.2)

This is one constraint on our model. Another constraint is imposed by the definition of probability:

Σ_y Pr(y|x) = 1    (3.3)

We maximize the entropy of the model under the constraints 3.2 and 3.3. The conditional entropy is defined as

H(y|x) = − Σ_{x,y} Pr~(x) Pr(y|x) log Pr(y|x)

Our probability model is then

Pr*(y|x) = argmax_{Pr(y|x)} H(y|x)

3.2.1 Formulating the Dual Objective

In order to maximize the entropy subject to the constraints, we introduce a Lagrange multiplier for every constraint. Let λ1, . . . , λm be the Lagrange multipliers associated with constraint 3.2 for the features f1, . . . , fm, and let λm+1 be the Lagrange multiplier associated with constraint 3.3. Then our dual objective is

D(Pr(y|x), λ) = H(y|x) + Σ_{i=1}^{m} λi (Pr(fi) − Pr~(fi)) + λ_{m+1} (Σ_y Pr(y|x) − 1)

where m is the total number of features. We now differentiate this with respect to Pr(y|x) to find the optimum. For the entropy term,

∂/∂Pr(y|x) H(y|x) = ∂/∂Pr(y|x) [ − Σ_{x,y} Pr~(x) Pr(y|x) log Pr(y|x) ]
                  = − Pr~(x) (log Pr(y|x) + 1)

For the feature constraints,

∂/∂Pr(y|x) Σ_{i=1}^{m} λi (Pr(fi) − Pr~(fi))
    = ∂/∂Pr(y|x) Σ_{i=1}^{m} λi ( Σ_{x,y} Pr~(x) Pr(y|x) fi(x, y) − Σ_{x,y} Pr~(x, y) fi(x, y) )
    = Σ_{i=1}^{m} λi Pr~(x) fi(x, y)

and for the normalization constraint,

∂/∂Pr(y|x) λ_{m+1} (Σ_y Pr(y|x) − 1) = λ_{m+1}

The derivative of the complete dual is therefore

∂/∂Pr(y|x) D(Pr(y|x), λ) = − Pr~(x)(1 + log Pr(y|x)) + Σ_{i=1}^{m} λi Pr~(x) fi(x, y) + λ_{m+1}    (3.4)


Equating 3.4 to 0,

Pr~(x)(1 + log Pr(y|x)) = Σ_{i=1}^{m} λi Pr~(x) fi(x, y) + λ_{m+1}

1 + log Pr(y|x) = Σ_{i=1}^{m} λi fi(x, y) + λ_{m+1}/Pr~(x)

Pr(y|x) = exp( Σ_{i=1}^{m} λi fi(x, y) + λ_{m+1}/Pr~(x) − 1 )

Pr(y|x) = exp( Σ_{i=1}^{m} λi fi(x, y) ) × exp( λ_{m+1}/Pr~(x) − 1 )    (3.5)

Putting this into 3.3,

Σ_y Pr(y|x) = 1

Σ_y exp( Σ_{i=1}^{m} λi fi(x, y) ) × exp( λ_{m+1}/Pr~(x) − 1 ) = 1

exp( λ_{m+1}/Pr~(x) − 1 ) = 1 / Σ_y exp( Σ_{i=1}^{m} λi fi(x, y) )

Putting this value back into 3.5 we get our final probability model

Pr(y|x) = exp( Σ_{i=1}^{m} λi fi(x, y) ) / Σ_y exp( Σ_{i=1}^{m} λi fi(x, y) )    (3.6)

Once we have this form of the model, the next task is to find the values of the parameters λ1, . . . , λm, for which the IIS training algorithm is used.
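Before turning to training, the following is a minimal sketch of evaluating the model in 3.6 (the feature functions and weights below are hypothetical, chosen only to make the example run, not taken from the report).

import math

TAGS = ["N", "V", "ADJ", "ADV", "PP"]

# Hypothetical indicator features f_i(x, y) and weights lambda_i.
features = [
    (lambda x, y: 1.0 if x == "play" and y == "V" else 0.0, 1.2),
    (lambda x, y: 1.0 if x == "play" and y == "N" else 0.0, 0.6),
    (lambda x, y: 1.0 if x.endswith("tion") and y == "N" else 0.0, 1.5),
]

def maxent_prob(x):
    # Pr(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z(x)
    scores = {y: math.exp(sum(lam * f(x, y) for f, lam in features)) for y in TAGS}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

print(maxent_prob("play"))         # "V" gets the highest probability
print(maxent_prob("information"))  # the "tion" suffix feature favors "N"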

3.3 Training

3.3.1 Parameter Estimation

The parameters are estimated using the Improved Iterative Scaling (IIS) algorithm [1]. We select Λ so as to minimize D_KL(Pr~(x, y) || PrΛ(y|x)).

D_KL(Pr~(x, y) || PrΛ(y|x)) = Σ_{x,y} Pr~(x, y) log( Pr~(x, y) / PrΛ(y|x) )
                            = Σ_{x,y} Pr~(x, y) log Pr~(x, y) − Σ_{x,y} Pr~(x, y) log PrΛ(y|x)

Since Σ_{x,y} Pr~(x, y) log Pr~(x, y) does not change with respect to Λ, D_KL(Pr~(x, y) || PrΛ(y|x)) is minimized when − Σ_{x,y} Pr~(x, y) log PrΛ(y|x) is minimized, that is, when Σ_{x,y} Pr~(x, y) log PrΛ(y|x) is maximized.

Let O(Λ) = Σ_{x,y} Pr~(x, y) log PrΛ(y|x) be this objective. Now suppose Λ is changed by ∆, that is, each λi is changed by δi. Then

O(Λ + ∆) = Σ_{x,y} Pr~(x, y) log PrΛ+∆(y|x)

and the change in the objective from Λ to Λ + ∆ is


O(Λ + ∆) − O(Λ) = Σ_{x,y} Pr~(x, y) log PrΛ+∆(y|x) − Σ_{x,y} Pr~(x, y) log PrΛ(y|x)

Putting in the values from 3.6,

O(Λ + ∆) − O(Λ) = Σ_{x,y} Pr~(x, y) log[ exp(Σ_{i=1}^{m} (λi + δi) fi(x, y)) / Σ_y exp(Σ_{i=1}^{m} (λi + δi) fi(x, y)) ]
                − Σ_{x,y} Pr~(x, y) log[ exp(Σ_{i=1}^{m} λi fi(x, y)) / Σ_y exp(Σ_{i=1}^{m} λi fi(x, y)) ]
                = Σ_{x,y} Pr~(x, y) Σ_i δi fi(x, y) − Σ_{x,y} Pr~(x, y) log( Z_{Λ+∆}(x) / Z_Λ(x) )
                = Σ_{x,y} Pr~(x, y) Σ_i δi fi(x, y) − Σ_x Pr~(x) log( Z_{Λ+∆}(x) / Z_Λ(x) )

We now use the inequality −log(a) ≥ 1 − a, valid for a > 0:

O(Λ + ∆) − O(Λ) ≥ Σ_{x,y} Pr~(x, y) Σ_i δi fi(x, y) + 1 − Σ_x Pr~(x) Z_{Λ+∆}(x)/Z_Λ(x)
                = Σ_{x,y} Pr~(x, y) Σ_i δi fi(x, y) + 1 − Σ_x Pr~(x) [ Σ_y exp(Σ_{i=1}^{m} (λi + δi) fi(x, y)) / Σ_y exp(Σ_{i=1}^{m} λi fi(x, y)) ]
                = Σ_{x,y} Pr~(x, y) Σ_i δi fi(x, y) + 1 − Σ_x Pr~(x) Σ_y PrΛ(y|x) exp(Σ_i δi fi(x, y))

Renaming the RHS A(Λ, ∆),

A(Λ, ∆) = Σ_{x,y} Pr~(x, y) Σ_i δi fi(x, y) + 1 − Σ_x Pr~(x) Σ_y PrΛ(y|x) exp(Σ_i δi fi(x, y))

Since O(Λ + ∆) − O(Λ) ≥ A(Λ, ∆), any ∆ for which A(Λ, ∆) > 0 improves the objective. Let the total feature count be T(x, y) = Σ_i fi(x, y). Using this, we can write A(Λ, ∆) as

A(Λ, ∆) = Σ_{x,y} Pr~(x, y) Σ_i δi fi(x, y) + 1 − Σ_x Pr~(x) Σ_y PrΛ(y|x) exp( T(x, y) Σ_i δi fi(x, y)/T(x, y) )

By the definition of T(x, y), the terms fi(x, y)/T(x, y) form a probability distribution over the features. Hence we can apply Jensen's inequality,

exp( Σ_x p(x) q(x) ) ≤ Σ_x p(x) exp(q(x))

to obtain

A(Λ, ∆) ≥ Σ_{x,y} Pr~(x, y) Σ_i δi fi(x, y) + 1 − Σ_x Pr~(x) Σ_y PrΛ(y|x) Σ_i ( fi(x, y)/T(x, y) ) exp( δi T(x, y) )

Let the RHS of this inequality be B(∆, Λ). Since O(Λ + ∆) − O(Λ) ≥ A(Λ, ∆) ≥ B(∆, Λ), B(∆, Λ) gives a lower bound on O(Λ + ∆) − O(Λ). Differentiating B(∆, Λ) with respect to δi gives

∂B(∆, Λ)/∂δi = Σ_{x,y} Pr~(x, y) fi(x, y) − Σ_x Pr~(x) Σ_y PrΛ(y|x) fi(x, y) exp( δi T(x, y) )


Equating to 0,

Σ_{x,y} Pr~(x, y) fi(x, y) = Σ_x Pr~(x) Σ_y PrΛ(y|x) fi(x, y) exp( δi T(x, y) )    (3.7)

Solving this equation gives the update δi.

3.3.2 Training Algorithm

Using the parameter estimation objective given in the previous discussion, we can write the training algorithm for the maximum entropy model as:

Algorithm 1 Algorithm for training a MaxEnt model

Input: feature functions f1, f2, ..., fm; empirical distribution Pr~(x, y)
Output: optimal parameter values Λ*; optimal model PrΛ*(y|x)

repeat
    for i = 1 to m do
        let δi be the solution to 3.7
        update λi ← λi + δi
    end for
until Λ has converged

The key step in the algorithm is finding the solution to 3.7. If T(x, y) = M for all x, y then 3.7 can be solved in closed form as

δi = (1/M) log [ Σ_{x,y} Pr~(x, y) fi(x, y) / Σ_x Pr~(x) Σ_y PrΛ(y|x) fi(x, y) ]

which becomes the update rule.
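The following is a minimal sketch of one such scaling iteration, assuming a constant total feature count M as in the closed-form update above. The names `data` (a list of (x, y) pairs), `features` (plain feature functions), `prob_fn` (a conditional model such as the earlier `maxent_prob` sketch) and `tags` are assumptions supplied by the caller, not anything defined in the report.

import math

def empirical_feature_expectations(data, features):
    E_tilde = [0.0] * len(features)
    for x, y in data:
        for i, f in enumerate(features):
            E_tilde[i] += f(x, y)
    return [e / len(data) for e in E_tilde]

def model_feature_expectations(data, features, prob_fn, tags):
    E_model = [0.0] * len(features)
    for x, _ in data:                       # Pr~(x) approximated by the sample of x's
        p = prob_fn(x)                      # PrΛ(y|x) for every tag y
        for y in tags:
            for i, f in enumerate(features):
                E_model[i] += p[y] * f(x, y)
    return [e / len(data) for e in E_model]

def scaling_step(lambdas, E_tilde, E_model, M):
    # delta_i = (1/M) * log(E~[f_i] / E_model[f_i]); skip features with zero expectation.
    return [
        lam + (1.0 / M) * math.log(et / em) if et > 0 and em > 0 else lam
        for lam, et, em in zip(lambdas, E_tilde, E_model)
    ]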

3.4 Inferencing

The Maximum Entropy model is a discriminative model, as opposed to the HMM, which is a generative model. Graphical representations of both are given in Figure 3.1. Being generative, the HMM assumes that the observation is generated by the state, whereas the Maximum Entropy model does not make this assumption. In terms of the alpha-beta notation, the recurrence for the probability of being in state y for the HMM is

α_{t+1}(y) = Σ_{y′∈Y} Pr(y|y′) × Pr(x_{t+1}|y) α_t(y′)

but in the case of the Maximum Entropy Markov model this equation becomes

α_{t+1}(y) = Σ_{y′∈Y} Pr_{y′}(y|x_{t+1}) α_t(y′)

where Pr_{y′}(y|x_{t+1}) means that we use the edge features f(y′, y). Thus we can write the probability of an entire tag sequence as

Pr(Y|X) = Pr_{start}(y1|X) × Pr_{y1}(y2|X) × Pr_{y2}(y3|X) × . . .

and in general

Pr(Y|X) = Π_i Pr_{y_{i−1}}(yi|X)


Condition          Features
xi is not rare     xi = W, yi = K
xi is rare         W is a prefix of xi, |W| ≤ 4, yi = K
                   W is a suffix of xi, |W| ≤ 4, yi = K
                   xi contains a number, yi = K
                   xi contains an upper-case character, yi = K
                   xi contains a hyphen, yi = K
∀ xi               yi−1 = T, yi = K
                   yi−2 yi−1 = T2 T1, yi = K
                   xi−2 = W, yi = K
                   xi−1 = W, yi = K
                   xi+1 = W, yi = K
                   xi+2 = W, yi = K

Table 3.1: Summary of the features used by [6]

3.5 Application

This section discusses the application of the Maximum Entropy Markov model to POS tagging as described in [6]. As we have seen in earlier sections, features play an important role in maximum entropy models, so emphasis is given to what features are used for the POS tagging application. An important issue in POS tagging is how to handle words which do not appear in the training dataset; this section also discusses the unknown-word features used by [6].

3.5.1 Feature Selection for POS tagging

The features mainly depend upon the local history of the current word. Typically, for tagging word xi the history hi is chosen as

hi = {xi, xi−1, xi−2, yi−1, yi−2}

The typical feature set for the POS tagging application is given in Table 3.1.

• When the word xi is not rare, i.e. xi appears in the training dataset, we can use the information associated with that word directly for modeling the features related to it. But,

• When word xi is rare, which means xi is not in training dataset then can not use the informationin the proximity of that word for modeling features. So in order to find features related to thatword, we look for the proximity of the words which “look like” this new word xi. By “look like”means we check for suffixes or prefixes of the word. Example.: if the word has suffix “tion” thenit is most likely that the word is “Noun”. Also, the unknown words are usually proper nouns, tocheck if a word is proper noun, it is checked if it contains capital character. If word is beginingwith capital character then it is more likely to be proper noun. Also, numbers are given differenttag.

• Along with the word based features as discussed in previous points, there are features which takescare of interaction between successive words. In sequence labeling task its believed that label atposition i is influenced by its neighbours. So, tags used by two preceding words, means yi−2yi−1

is considered while fixing yi. This is called as tri-gram assumption, co-occurence upto 3 labels isconsidered.

3.5.2 Example

Consider one example sentence: "Young generation chooses Information Technology." Suppose we are currently tagging the word "Information". The Maximum Entropy model will look at which features fire for each possible tag for this word:


• xi = Information and yi = N

• xi = Information and yi = V

• xi = Information and yi = ADJ

• xi = Information and yi = ADV

It will also check whether in the training data the word "Information" is preceded by

• xi−1 = chooses

• xi−2 = generation

Following the tri-gram feature, the model will also look at the likely label sequence of "generation chooses" and whether that sequence is followed by "Information". Looking at the capitalized first letter and the noun-preferring suffix "tion", features for the proper-noun tag will also fire.


Chapter 4

Cyclic Dependency Network

4.1 From HMM to Cyclic Dependency Network

In a sequence labeling task, the probability of a label sequence is obtained from the probabilities of local label configurations. If we consider a sentence as a chain graph, then this local model consists only of the neighbors of the node currently under investigation: the label yi is influenced by both yi−1 and yi+1. A Hidden Markov Model, however, considers only one direction of influence: in a left-to-right HMM, yi is influenced by yi−1, whereas in a right-to-left HMM, yi is influenced by yi+1. In a unidirectional HMM the interaction between yi−1 and yi is explicit, while the interaction between yi+1 and yi is only considered implicitly when calculating the label of yi+1. Intuitively, one should consider the effect of both yi−1 and yi+1 while deciding the label at yi. A cyclic dependency network jointly considers the effect of both the preceding and the succeeding labels on the current label.

Figure 4.1: (A) Left-to-right unidirectional CMM, (B) right-to-left CMM, (C) cyclic dependency network

Example: The example given in [Manning] illustrates this phenomenon. In the sequence "Will to fight", "will" plays the role of a noun rather than a modal verb. But this effect is not captured by a unidirectional HMM, because Pr(<MOD> | will, <start>) is higher than Pr(<Noun> | will, <start>) and the lexical probability Pr(<TO> | to) is 1; hence the unidirectional model fails to tag "will" as <Noun>.

4.2 Cyclic Dependency Network

A very simple unidirectional dependency network is shown in Figure 4.2 below. We can factorize such graphs easily using the chain rule as

Pr(X) = Π_i Pr(Xi | Pa(Xi))


Figure 4.2: (A) Left-to-right unidirectional dependency network, (B) right-to-left unidirectional dependency network, (C) cyclic dependency network

where Pa(Xi) denotes the set of parents of node Xi.

Using this formula for the network in Fig. 4.2(A) we get Pr(a, b) = Pr(a) Pr(b|a). Similarly, for the network in Fig. 4.2(B) we can write Pr(a, b) = Pr(b) Pr(a|b). But if we try to apply the chain rule to the network in Fig. 4.2(C) we get Pr(a, b) = Pr(b|a) Pr(a|b). [Manning] gives a very simple set of observation data to show that the chain rule does not hold for cyclic dependency networks such as the one in Fig. 4.2(C). Example: Suppose we have two random variables a, b represented by the dependency network of Fig. 4.2(C), and suppose we have the observation sample over (a, b): <11, 11, 11, 12, 21, 33>. In this sample, Pr(a = 1, b = 1) = 0.5. Now Pr(a = 1 | b = 1) = 3/4 and Pr(b = 1 | a = 1) = 3/4, so applying the chain rule we would get

Pr(a = 1, b = 1) = Pr(a = 1|b = 1)Pr(b = 1|a = 1) = 3/4 ∗ 3/4 = 9/16

which is wrong. Thus we cannot use the chain rule to calculate the joint distribution over random variables when they follow a cyclic dependency.
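A quick check of the same toy sample (nothing beyond the counts already given above) confirms that the chain-rule product of the two conditionals does not reproduce the observed joint probability.

from collections import Counter

sample = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]
counts = Counter(sample)
n = len(sample)

p_joint = counts[(1, 1)] / n
p_a_given_b = counts[(1, 1)] / sum(1 for a, b in sample if b == 1)
p_b_given_a = counts[(1, 1)] / sum(1 for a, b in sample if a == 1)

print(p_joint)                     # 0.5
print(p_a_given_b * p_b_given_a)   # 0.5625 = 9/16, not the true joint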

4.3 Inferencing in Cyclic Dependency Network

Though we cannot easily compute the joint probability distribution, our main aim is not to find the joint probability but to find the highest scoring sequence of labels. Hence we can use the product term Π_i Pr(Xi | Pa(Xi)) to score the sequences labeled by our model:

score(x) = Π_i Pr(Xi | Pa(Xi))

Using this score we can use a modified Viterbi algorithm to find the maximizing label sequence in polynomial time. The pseudocode for the algorithm as given in [Manning] is:

Algorithm 2 Pseudocode for inference in the first-order bidirectional cyclic dependency model

function bestScore()
    return bestScoreSub(n + 2, <end, end, end>)

function bestScoreSub(i + 1, <ti−1, ti, ti+1>)
    /* left boundary case */
    if i = −1 then
        if <ti−1, ti, ti+1> = <start, start, start> then
            return 1
        else
            return 0
        end if
    end if
    /* recursive case */
    return max over ti−2 of bestScoreSub(i, <ti−2, ti−1, ti>) × Pr(ti | ti−1, ti+1, wi)

The state diagram for the inference algorithm is shown in Figure 4.3. In this diagram, 1, 2 and 3 are the sites whose labels are to be inferred, and each site can take one of the labels A, B and C. The base case of the recursion is <ti−1, ti, ti+1> = <start, start, start> when i = −1.


Figure 4.3: State diagram for inferencing algorithm for Cyclic Dependency Network

In the next step, when i = 0, the best score is obtained from the recursive step bestScoreSub(i, <ti−2, ti−1, ti>) × Pr(ti | ti−1, ti+1, wi); bestScoreSub returns 1 because i = −1, and Pr(ti | ti−1, ti+1, wi) is calculated from the model. Each time the window is at position i, instead of receiving Pr(ti | ti−1, ti+1, wi) one receives the score for Pr(ti−1 | ti−2, ti, wi−1). Example: When the window is at state 2A, the score received is the maximum over all states at i − 2 (here only the single state start) of the bestScore for <start, 1A, 2A>, <start, 1B, 2A>, <start, 1C, 2A>. Similarly, when the window is at state 2B, the scores received are the bestScore values for <start, 1A, 2B>, <start, 1B, 2B>, <start, 1C, 2B>, and the process continues.


Chapter 5

Conditional Random Fields

5.1 Label Bias Problem

As we have seen in Chapter 3, Maximum Entropy models normalize scores at each stage, imposing the constraint Σy Pr(y|x) = 1. This leads to the label bias problem. Consider Figure 5.1A, in which all the state transitions are given scores. In Figure 5.1B those scores are normalized after every state to obtain the corresponding probability values. Maximum Entropy models use these probabilities to find the highest scoring label sequence. The label bias problem occurs when a particular state is biased towards only a few of all the possible next states. Consider state 1A, for example. This state gives score 10 to both 2A and 2B, and score 0 to all the other states in stage 2. Since scores are normalized after every state, the probability of reaching 2A or 2B from 1A becomes 0.5. On the other hand, consider state 1B, which gives score 35 to 2A and 2B, and 10 to all the others. After normalizing, these become 0.35 for 2A and 2B and 0.1 for all the others. Though the actual transition scores from 1A are smaller than those from 1B, after normalization their probability values go up, and hence the maximum entropy model can end up choosing the path with the lower score. The solution to this problem is not to normalize the scores after each state but to normalize at the end with respect to all possible path scores. This gives rise to Conditional Random Fields.

Figure 5.1: A. Transitions with unnormalized scores. B. Transitions with probability values.

5.2 Conditional Random Field

A Conditional Random Field (CRF) is defined by [3] as follows. Let G = (V, E) be a graph such that Y = (Yv)v∈V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when every random variable Yv, conditioned on X, obeys the Markov property with respect to the graph G. Here X is the input sequence, also called the observation sequence, a set of observations X = (x1, x2, . . .), and Y is the output sequence, also called the label sequence, Y = (y1, y2, . . .).

5.2.1 Formulation

The expression for a CRF is derived as

Pr(Y|X) = Pr(X, Y) / Pr(X)
        = Pr(X, Y) / Σ_Y Pr(X, Y)

By the definition of a CRF, every CRF is an MRF, and as seen in Chapter 2 every MRF follows a Gibbs distribution, so

Pr(Y|X) = Pr(X, Y) / Σ_Y Pr(X, Y)
        = [ (1/Z) exp(Σ_{c∈C} Vc(Xc, Yc)) ] / [ (1/Z) Σ_Y exp(Σ_{c∈C} Vc(Xc, Yc)) ]
        = exp(Σ_{c∈C} Vc(Xc, Yc)) / Σ_Y exp(Σ_{c∈C} Vc(Xc, Yc))

So a CRF can be written mathematically as

Pr(Y|X) = exp(Σ_{c∈C} Vc(Xc, Yc)) / Z(X)    (5.1)

where Vc(Xc, Yc) is the clique potential over clique c. As discussed in the previous chapters, since in text labeling tasks the cliques are single nodes and pairs of nodes, Vc(Xc, Yc) takes the form of node potentials and edge potentials. Such potentials are specified using local features fi and corresponding weights λi. Each feature can be a node feature, also called a state feature, s(yj, X, j), or an edge feature, also called a transition feature, t(yj−1, yj, X, j), for every position j.

And Z(X) = Σ_Y exp(Σ_{c∈C} Vc(Xc, Yc)). It can be seen from the CRF equation that the normalizing constant Z(X) is over all possible label sequences Y. Thus the scores are not normalized at every state but kept as they are till the end, and finally converted into a probability distribution by dividing by the normalizing constant. Thus the CRF does not suffer from the label bias problem.

Suppose there are M features f1, . . . , fM in total and corresponding weights λ1, . . . , λM. Then the clique potentials can be obtained by summing λi fi, and hence 5.1 can be written as

Pr(Y|X) = exp( Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) ) / Z(X)    (5.2)

where Z(X) = Σ_Y exp( Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) ).

The task of finding the features and the appropriate weights is carried out during training.
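The following is a minimal sketch of equation 5.2 over a tiny label set; the features and weights are hypothetical, and Z(X) is computed here by brute-force enumeration only to keep the example short (the forward-backward recursions discussed later are what make this efficient in practice).

import itertools
import math

LABELS = ["B", "I", "O"]

# Hypothetical feature functions f_i(y_prev, y, X, j) with weights lambda_i.
features = [
    (lambda yp, y, X, j: 1.0 if y == "B" and X[j][0].isupper() else 0.0, 1.5),  # state feature
    (lambda yp, y, X, j: 1.0 if yp == "O" and y == "I" else 0.0, -2.0),         # transition feature
]

def seq_score(Y, X):
    total = 0.0
    for j in range(len(X)):
        y_prev = Y[j - 1] if j > 0 else "<start>"
        total += sum(lam * f(y_prev, Y[j], X, j) for f, lam in features)
    return total

def crf_prob(Y, X):
    Z = sum(math.exp(seq_score(list(cand), X)) for cand in itertools.product(LABELS, repeat=len(X)))
    return math.exp(seq_score(Y, X)) / Z   # exp(score) / Z(X), normalized globally

X = ["He", "reckons", "the", "deficit"]
print(crf_prob(["B", "O", "B", "I"], X))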

5.3 Training

During training, the data is given in the form of pairs of observation sequences and label sequences, T = {(X, Y)}. The CRF is formulated using a set of features and corresponding weights. The two main tasks carried out during the training phase are

• Parameter Estimation

• Efficient Feature Selection


5.3.1 Parameter Estimation

During parameter estimation the λi are estimated so that the probability model best describes the training data. Methods for parameter estimation are:

1. Penalized Maximum Likelihood Method

2. L-BFGS Method

3. Voted Perceptron Method

Penalized Maximum Likelihood Method

In the penalized maximum likelihood method, the log-likelihood of the probability model is maximized with a penalty term to avoid overfitting. The log-likelihood of the CRF is

L(T) = Σ_{(X,Y)∈T} log Pr(Y|X)
     = Σ_{(X,Y)∈T} log [ exp( Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) ) / Σ_Y exp( Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) ) ]

To avoid overfitting, the likelihood is penalized with the term − Σ_{i=1}^{M} λi²/(2σ²):

L(T) = Σ_{(X,Y)∈T} log [ exp( Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) ) / Σ_Y exp( Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) ) ] − Σ_{i=1}^{M} λi²/(2σ²)

     = Σ_{(X,Y)∈T} [ Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) − log( Σ_Y exp( Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) ) ) ] − Σ_{i=1}^{M} λi²/(2σ²)

     = Σ_{(X,Y)∈T} Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) − Σ_{(X,Y)∈T} log( Σ_Y exp( Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) ) ) − Σ_{i=1}^{M} λi²/(2σ²)

The partial derivatives of these three terms with respect to λk are

∂/∂λk Σ_{(X,Y)∈T} Σ_{j=1}^{n} Σ_{i=1}^{M} λi fi(yj−1, yj, X, j) = Σ_{(X,Y)∈T} Σ_{j=1}^{n} fk(yj−1, yj, X, j)

∂/∂λk [ − Σ_{(X,Y)∈T} log ZΛ(X) ] = − Σ_{(X,Y)∈T} Σ_Y PrΛ(Y|X) Σ_{j=1}^{n} fk(yj−1, yj, X, j)

∂/∂λk [ − Σ_{i=1}^{M} λi²/(2σ²) ] = − λk/σ²

Equating the full derivative to zero, ∂L(T)/∂λk = 0:

Σ_{(X,Y)∈T} Σ_{j=1}^{n} fk(yj−1, yj, X, j) − Σ_{(X,Y)∈T} Σ_Y PrΛ(Y|X) Σ_{j=1}^{n} fk(yj−1, yj, X, j) − λk/σ² = 0

Solving this equation, the optimal Λ is obtained. The second term in the above equation involves a summation over all possible tag sequences. This can be calculated efficiently using the alpha-beta message passing given by [3]:

αj(s|X) = Σ_{s′∈Pred(s)} αj−1(s′|X) × exp( Σ_{i=1}^{M} λi fi(yj−1 = s′, yj = s, X, j) )

βj(s|X) = Σ_{s′∈Succ(s)} βj+1(s′|X) × exp( Σ_{i=1}^{M} λi fi(yj = s, yj+1 = s′, X, j + 1) )


Using these definitions,

Σ_{(X,Y)∈T} Σ_Y PrΛ(Y|X) Σ_{j=1}^{n} fk(yj−1, yj, X, j)
    = Σ_{(X,Y)∈T} (1/ZΛ(X)) Σ_{j=1}^{n} Σ_{s∈S} Σ_{s′∈S} fk(yj−1 = s, yj = s′, X, j) × αj−1(s|X) exp( Σ_{i=1}^{M} λi fi(yj−1 = s, yj = s′, X, j) ) βj(s′|X)
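The recursions above can be coded directly. The following is a minimal sketch (a tiny chain with a hypothetical potential function, not the report's trained model) that computes Z(X) and the expected count of one transition under PrΛ(Y|X) using the alpha and beta tables.

import math

LABELS = ["B", "I", "O"]

def psi(y_prev, y, X, j):
    """exp(sum_i lambda_i f_i(y_{j-1}=y_prev, y_j=y, X, j)); toy weights for illustration."""
    score = 0.0
    if y == "B" and X[j][0].isupper():
        score += 1.5
    if y_prev == "O" and y == "I":
        score -= 2.0
    return math.exp(score)

def forward_backward(X):
    n = len(X)
    alpha = [dict() for _ in range(n)]
    beta = [dict() for _ in range(n)]
    for s in LABELS:                                   # position 0 from the start state
        alpha[0][s] = psi("<start>", s, X, 0)
    for j in range(1, n):
        for s in LABELS:
            alpha[j][s] = sum(alpha[j - 1][sp] * psi(sp, s, X, j) for sp in LABELS)
    for s in LABELS:
        beta[n - 1][s] = 1.0
    for j in range(n - 2, -1, -1):
        for s in LABELS:
            beta[j][s] = sum(psi(s, sp, X, j + 1) * beta[j + 1][sp] for sp in LABELS)
    Z = sum(alpha[n - 1][s] for s in LABELS)
    return alpha, beta, Z

def expected_transition_count(X, a, b):
    """E under PrΛ of the number of positions j >= 1 with y_{j-1}=a, y_j=b."""
    alpha, beta, Z = forward_backward(X)
    total = sum(alpha[j - 1][a] * psi(a, b, X, j) * beta[j][b] for j in range(1, len(X)))
    return total / Z

X = ["He", "reckons", "the", "deficit"]
print(expected_transition_count(X, "O", "I"))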

L-BFGS Method

Using the Taylor expansion,

L(Λ + ∆) ≈ L(Λ) + ∆ᵀ∇L(Λ) + (1/2) ∆ᵀ H(Λ) ∆

Differentiating with respect to ∆,

d/d∆ L(Λ + ∆) = ∇L(Λ) + H(Λ)∆

and equating to 0 to get the optimal step ∆,

∇L(Λ) + H(Λ)∆ = 0

∆ = −H⁻¹(Λ) ∇L(Λ)

so the update rule becomes

Λ_{k+1} = Λ_k + ∆

The inverse of the Hessian is computationally hard to compute, so it is approximated. Let H⁻¹(Λ) = B for notational convenience; then

B_{k+1} = B_k + ( s_k s_kᵀ / (y_kᵀ s_k) ) ( y_kᵀ B_k y_k / (y_kᵀ s_k) + 1 ) − ( 1 / (y_kᵀ s_k) ) ( s_k y_kᵀ B_k + B_k y_k s_kᵀ )

where

s_k = Λ_k − Λ_{k−1}
y_k = ∇_k − ∇_{k−1}
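In practice the limited-memory variant of this update is usually provided by an optimization library. The following is a minimal sketch of that usage pattern with SciPy; the objective and gradient bodies below are placeholders (only the Gaussian penalty term is filled in), and in a real setting they would be built from the data expectations and the forward-backward computation sketched earlier.

import numpy as np
from scipy.optimize import minimize

def neg_penalized_loglik(lambdas):
    # placeholder objective: -L(T) with the Gaussian penalty sum(lambda^2)/(2*sigma^2)
    sigma2 = 10.0
    return 0.5 * np.sum(lambdas ** 2) / sigma2   # replace with the real negative log-likelihood

def neg_gradient(lambdas):
    sigma2 = 10.0
    return lambdas / sigma2                      # replace with -(E_data[f] - E_model[f]) + lambda/sigma^2

x0 = np.zeros(5)                                 # one weight per feature
result = minimize(neg_penalized_loglik, x0, jac=neg_gradient, method="L-BFGS-B")
print(result.x)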

Voted Perceptron Method

The perceptron uses misclassified examples to train the weights. In a CRF, after every iteration with weights Λt, each training example is labeled using the current Λt. This labeled observation sequence is used to obtain a feature vector, and the difference between the feature vector obtained from the actual training data and the feature vector obtained from the model's labeling is taken as the step for changing the weights. The step size is

∆t = F(Y, X) − F(Y*, X)

where Y* is the best scoring label sequence obtained using the model, and

Λ_{t+1} = Λt + ∆t

The plain perceptron tends to overfit. To avoid this, in the voted (averaged) perceptron the unweighted average of the weight vectors across iterations is taken as the final weights.
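The following is a minimal sketch of the structured perceptron update just described, with weight averaging as in the voted/averaged variant. The feature map and the brute-force argmax over a tiny label set are assumptions made to keep the example self-contained; a real implementation would use Viterbi decoding for the argmax.

import itertools
from collections import defaultdict

LABELS = ["B", "I", "O"]

def feature_vector(X, Y):
    F = defaultdict(float)
    for j, (x, y) in enumerate(zip(X, Y)):
        F[("word", x.lower(), y)] += 1.0
        F[("trans", Y[j - 1] if j > 0 else "<start>", y)] += 1.0
    return F

def best_sequence(X, weights):
    def score(Y):
        return sum(weights.get(k, 0.0) * v for k, v in feature_vector(X, Y).items())
    return max(itertools.product(LABELS, repeat=len(X)), key=score)

def perceptron_train(data, epochs=5):
    weights, total = defaultdict(float), defaultdict(float)
    for _ in range(epochs):
        for X, Y in data:
            Y_star = list(best_sequence(X, weights))
            if Y_star != list(Y):
                # Lambda_{t+1} = Lambda_t + F(Y, X) - F(Y*, X)
                for k, v in feature_vector(X, Y).items():
                    weights[k] += v
                for k, v in feature_vector(X, Y_star).items():
                    weights[k] -= v
            for k, v in weights.items():
                total[k] += v
    n_updates = epochs * len(data)
    return {k: v / n_updates for k, v in total.items()}   # averaged weights

data = [(["He", "reckons", "the", "deficit"], ["B", "O", "B", "I"])]
print(perceptron_train(data)[("trans", "B", "I")])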


5.3.2 Feature Selection

Typically the features are a set of hand-crafted rules in the form of templates, some of which were discussed in [6]. [4] gives an algorithm to select those features which most increase the log-likelihood at each step; a sketch of the loop is given after the list below.

1. Start with no features

2. Consider a set of candidate new features

3. Select for inclusion only those candidate features that will most increase the log-likelihood of the correct state path

4. train weights for all included features

5. iterate to step 2.
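The loop itself is simple; most of the work is in estimating the gain. The sketch below is only a skeleton: `candidates`, `estimate_gain` (which would compute the gain GΛ(g, µ) defined below) and `train_weights` are assumed callables supplied by the caller, not anything defined in the report.

def induce_features(candidates, estimate_gain, train_weights, rounds=3, per_round=1):
    """Greedy feature induction: add the highest-gain candidates, then retrain the weights."""
    selected = []
    model = train_weights(selected)          # start with no features
    for _ in range(rounds):
        ranked = sorted(candidates, key=lambda g: estimate_gain(model, g), reverse=True)
        new = [g for g in ranked if g not in selected][:per_round]
        if not new:
            break
        selected.extend(new)                 # include the selected candidates
        model = train_weights(selected)      # retrain weights for all included features
    return selected, model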

The probability after adding a new feature g with weight µ can be calculated efficiently using the formula given by [4]:

Pr_{Λ+(g,µ)}(Y|X) = PrΛ(Y|X) exp( Σ_{t=1}^{T} µ g(yt−1, yt, X, t) ) / ZX(Λ, g, µ)

where

ZX(Λ, g, µ) = Σ_Y PrΛ(Y|X) exp( Σ_{t=1}^{T} µ g(yt−1, yt, X, t) )

The log-likelihood of the training data after adding the new feature is

L_{Λ,g,µ} = ( Σ_{i=1}^{N} log Pr_{Λ,g,µ}(y^i | x^i) ) − µ²/(2σ²) − Σ_{k=1}^{K} λk²/(2σ²)

which is approximated as

L_{Λ,g,µ} ≈ ( Σ_{i=1}^{N} Σ_{t=1}^{Ti} log Pr_{Λ,g,µ}(y^i_t | x^i, t) ) − µ²/(2σ²) − Σ_{k=1}^{K} λk²/(2σ²)

Rather than summing over all the output variables of all instances, Σ_{i=1}^{N} Σ_{t=1}^{Ti}, it is more efficient to sum only over the variables which are not correctly labeled. Let x(i), i = 1, . . . , M, be such variables; then the likelihood is

L_{Λ,g,µ} = ( Σ_{i=1}^{M} log Pr_{Λ,g,µ}(y^i_t | x^i, t) ) − µ²/(2σ²) − Σ_{k=1}^{K} λk²/(2σ²)

The gain of a feature is defined as the improvement in log-likelihood,

GΛ(g, µ) = L_{Λ,g,µ} − LΛ

5.4 Inferencing

The problem of inference is to find the most likely sequence Y given the observations X. The quantity δj(s|X), the highest score along a single path that ends in state s at position j, is defined as

δj(s|X) = max_{y1, y2, ..., yj−1} Pr(y1, y2, . . . , yj−1, yj = s | X)

This can be computed efficiently using the following recursive algorithm (a Python sketch of the full recursion is given after the steps below).

1. Initialization: The values of all transitions from the start state ⊥ to each possible first state s are set to the corresponding factor value:

∀s ∈ S : δ1(s) = V1(X, ⊥, s),  ψ1(s) = ⊥


q(yi−1, yi)             p(x, i)
yi = y                  true
yi = y, yi−1 = y′       wi = w
c(yi) = c               wi−1 = w
                        wi+1 = w
                        wi−2 = w
                        wi+2 = w
                        wi = w, wi−1 = w′
                        wi = w, wi+1 = w′
                        ti = t
                        ti−1 = t
                        ti−2 = t
                        ti+1 = t
                        ti+2 = t
                        ti = t, ti−1 = t′
                        ti = t, ti+1 = t′

Table 5.1: Summary of the feature templates used by [7]; each feature is a product q(yi−1, yi) p(x, i) of a label pattern and an observation predicate.

2. Recursion: The values for the next step are computed from the current values by maximizing over all possible preceding states s′:

∀s ∈ S, 1 < j ≤ n : δj(s) = max_{s′∈S} δj−1(s′) Vj(X, s′, s)

ψj(s) = argmax_{s′∈S} δj−1(s′) Vj(X, s′, s)

3. Termination:

p* = max_{s′∈S} δn(s′)

y*_n = argmax_{s′∈S} δn(s′)

4. Path backtracking: the optimal path is recovered from the lattice using the bookkeeping values ψt:

y*_t = ψt+1(y*_{t+1}),  for t = n − 1, n − 2, . . . , 1
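The following is a minimal sketch of the four steps above; the factor function V is a hypothetical stand-in (the same toy potential idea used in the earlier forward-backward sketch), defined again here so the block is self-contained.

import math

LABELS = ["B", "I", "O"]
START = "<start>"

def V(X, y_prev, y, j):
    """Factor value exp(sum_i lambda_i f_i(y_prev, y, X, j)); toy weights for illustration."""
    score = 0.0
    if y == "B" and X[j][0].isupper():
        score += 1.5
    if y_prev == "O" and y == "I":
        score -= 2.0
    return math.exp(score)

def viterbi(X):
    n = len(X)
    delta = [{s: V(X, START, s, 0) for s in LABELS}]   # 1. initialization
    psi = [{s: START for s in LABELS}]
    for j in range(1, n):                              # 2. recursion
        delta.append({})
        psi.append({})
        for s in LABELS:
            best_prev = max(LABELS, key=lambda sp: delta[j - 1][sp] * V(X, sp, s, j))
            delta[j][s] = delta[j - 1][best_prev] * V(X, best_prev, s, j)
            psi[j][s] = best_prev
    y = [max(LABELS, key=lambda s: delta[n - 1][s])]   # 3. termination
    for j in range(n - 1, 0, -1):                      # 4. path backtracking
        y.append(psi[j][y[-1]])
    return list(reversed(y))

print(viterbi(["He", "reckons", "the", "deficit"]))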

5.5 Application

5.5.1 Chunking

[7] discusses the application of CRFs to text chunking. Each word can take one of the tags B, I and O. The CRF is modelled as a second-order Markov network, which means the chunk tag at position i depends on the chunk tags at positions i − 1 and i − 2. This is achieved by making the CRF labels pairs of consecutive chunk tags, such as BI, BO, IO, etc. So the CRF labels each word with two chunk tags: the label at word i is yi = ci−1 ci, where ci is the actual chunk tag of word i, and the label at position 0 is explicitly taken to be O. The features are divided into two parts and represented as

f(yi−1, yi, X, i) = p(x, i) q(yi−1, yi)

where p(x, i) takes into account the words and POS tags around position i, whereas q(yi−1, yi) takes into account the influence of one label on the other. The feature templates are summarized in Table 5.1.


feature
inside-noun-phrase
stopword
capitalized
previous-words and next-words
in-lexicon
header
company-suffix

Table 5.2: Summary of the features used by [4]

5.5.2 Named Entity Recognition

[4] applies CRFs to named entity recognition. The labels given to entities are PER, LOC, ORG and MISC, for person names, location names, organization names and miscellaneous entities. The observational tests made are:

1. the word itself

2. POS tag

3. regular expression for capitalization, digit and number identification

4. lexicons of names of countries, companies, organizations, given names, surnames, universities, etc.

5. when a capitalized word appears a second time, the results of all the tests applied to its first occurrence are copied to the second occurrence.

The feature templates are summarized in Table 5.2. The use of the header feature can be explained with an example: when the name of a country appears in an article about sports, then most likely the country name does not refer to a location but to a participating team. So if the header of the article is sports, then even though the country name occurs in the country-name lexicon it is not directly tagged as a location. Also, most company names end with "Inc."; such suffixes are used to separate organization names from person names. If wi−3 = 'the' and wi−1 = 'of', then wi is likely to be an organization, e.g. "the CEO of IBM". If wi+1 = 'Republic', then wi is most likely the name of a country.


Chapter 6

Conclusion

Sequence labeling plays a very important role in applications like question answering and information extraction. Representing text as a graph helps find the inter-relationships between the words in context. Traditionally such a graph structure over text is modelled and trained using a Hidden Markov Model, but such a model fails to capture additional information from the text beyond the joint distribution.

Maximum Entropy models extend HMMs to capture characteristics of the text as features. Also, being designed as discriminative models, ME models do not encode the wrong dependencies in the graph structure of the text. ME models are trained by finding a lower bound on the change in likelihood using the Improved Iterative Scaling algorithm.

Since a word is influenced by both of its neighbors, both the right and the left neighbor should have a dependency with that word in the graph. Cyclic dependency networks impose these more intuitive dependencies on the graph.

Normalizing scores after every state causes the label bias problem in ME models. Conditional Random Fields solve this problem by normalizing scores only at the end, but the computation of the normalization constant then becomes computationally heavy during CRF training. This is addressed using message passing algorithms, which are modified versions of the traditional Viterbi-style message passing.

Using all these models for sequence labeling makes better use of the context and of the morphological characteristics of words. These characteristics are captured in features. The features are generated using hand-crafted templates and may not always be useful; the feature induction algorithm helps choose only those features which actually give a gain in the likelihood of the model. Using these feature-based models, applications like POS tagging, chunking and named entity recognition are addressed with enhanced strength.


Bibliography

[1] Adam Berger. The improved iterative scaling algorithm: a gentle introduction, 1997.

[2] Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22:39–71, 1996.

[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA, 2001.

[4] Andrew McCallum. Efficiently inducing features of conditional random fields. In Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI03), 2003.

[5] Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. 17th International Conf. on Machine Learning, pages 591–598. Morgan Kaufmann, San Francisco, CA, 2000.

[6] Adwait Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Eric Brill and Kenneth Church, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133–142. Association for Computational Linguistics, 1996.

[7] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 134–141. Association for Computational Linguistics, 2003.
