
1

Feature Selection & Maximum Entropy

Advanced Statistical Methods in NLP, Ling 572

January 26, 2012

2

Roadmap

Feature selection and weighting:
Feature weighting
Chi-square feature selection
Chi-square feature selection example

HW #4

Maximum Entropy:
Introduction: the Maximum Entropy Principle
Maximum Entropy NLP examples


7

Feature Selection Recap

Problem: Curse of dimensionality
Data sparseness, computational cost, overfitting

Solution: Dimensionality reduction
New feature set r’ s.t. |r’| < |r|

Approaches:
Global & local approaches
Feature extraction: new features in r’ are transformations of the features in r
Feature selection: wrapper techniques, feature scoring


12

Feature Weighting

For text classification, typical weights include:

Binary: weights in {0,1}

Term frequency (tf): # occurrences of t_k in document d_i

Inverse document frequency (idf): df_k = # of docs in which t_k appears; N = # docs
idf_k = log(N / (1 + df_k))

tfidf = tf * idf
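To make these weighting schemes concrete, here is a minimal sketch (not from the original slides) that computes binary, tf, idf, and tf*idf weights over a toy document collection; the document texts and variable names are purely illustrative.

```python
import math
from collections import Counter

# Toy document collection; purely illustrative.
docs = [
    "budget cuts announced in new budget".split(),
    "team wins game after close game".split(),
    "budget debate continues".split(),
]

N = len(docs)
df = Counter()                      # df_k: number of docs containing term t_k
for d in docs:
    df.update(set(d))

def weights(doc):
    tf = Counter(doc)               # tf: raw count of t_k in this document
    out = {}
    for term, count in tf.items():
        idf = math.log(N / (1 + df[term]))   # idf = log(N / (1 + df_k)), as on the slide
        out[term] = {"binary": 1, "tf": count, "idf": idf, "tfidf": count * idf}
    return out

print(weights(docs[0])["budget"])
```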


16

Chi Square

Tests for the presence/absence of a relation between random variables
Bivariate analysis: tests 2 random variables
Can test the strength of the relationship
(Strictly speaking) doesn’t test direction


20

Chi Square Example

Can gender predict shoe choice?
A: male/female (the feature)
B: shoe choice (the classes: {sandal, sneaker, …})

Due to F. Xia

          sandal   sneaker   leather shoe   boot   other
Male         6        17          13          9      5
Female      13         5           7         16      9


28

Comparing Distributions

Observed distribution (O):

          sandal   sneaker   leather shoe   boot   other
Male         6        17          13          9      5
Female      13         5           7         16      9

Expected distribution (E):

          sandal   sneaker   leather shoe   boot   other   Total
Male        9.5       11          10        12.5     7       50
Female      9.5       11          10        12.5     7       50
Total       19        22          20        25      14      100

Due to F. Xia


33

Computing Chi Square

Expected value for a cell = row_total * column_total / table_total

X² = Σ (O - E)² / E
   = (6 - 9.5)²/9.5 + (17 - 11)²/11 + … = 14.026
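A small sketch (using the observed counts from the shoe example above) that builds the expected values from row/column totals and accumulates X²; it should reproduce the value of roughly 14.03.

```python
# Chi-square statistic for the gender/shoe contingency table above.
observed = [
    [6, 17, 13, 9, 5],    # Male
    [13, 5, 7, 16, 9],    # Female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total   # expected = row_total * col_total / table_total
        chi_sq += (o - e) ** 2 / e

print(round(chi_sq, 3))   # ~14.026
```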


37

Calculating X²

Tabulate the contingency table of observed values: O
Compute row and column totals
Compute the table of expected values from the row/column totals, assuming no association
Compute X²


44

For a 2x2 Table

O:
        !ci   ci
!tk      a    b
tk       c    d

E:
        !ci             ci              Total
!tk     (a+b)(a+c)/N    (a+b)(b+d)/N    a+b
tk      (c+d)(a+c)/N    (c+d)(b+d)/N    c+d
Total   a+c             b+d             N
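For feature selection, the same computation applied to this 2x2 term/class table gives a per-term score. Below is a minimal sketch; the function name and the example counts are illustrative, not from the slides.

```python
def chi_square_2x2(a, b, c, d):
    """X^2 for the 2x2 table above:
              !ci   ci
        !tk    a    b
         tk    c    d
    Each expected cell = row_total * col_total / N, as in the E table.
    Assumes nonzero row and column totals."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row = [a + b, c + d]
    col = [a + c, b + d]
    chi_sq = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n
            chi_sq += (observed[i][j] - e) ** 2 / e
    return chi_sq

# Illustrative counts for one term and one class:
print(chi_square_2x2(a=70, b=10, c=5, d=15))
```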


52

X² Test

Tests whether two random variables are independent
Null hypothesis: the 2 R.V.s are independent
Compute the X² statistic
Compute the degrees of freedom: df = (# rows - 1)(# cols - 1); shoe example: df = (2-1)(5-1) = 4
Look up the probability of the X² statistic value in an X² table
If the probability is low (below some significance level), the null hypothesis can be rejected
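To check the statistic against a significance level without a printed X² table, one option is the chi-squared survival function; a small sketch, assuming SciPy is available, for the shoe example (X² ≈ 14.03, df = 4).

```python
from scipy.stats import chi2

chi_sq = 14.026          # statistic from the shoe example
df = (2 - 1) * (5 - 1)   # (# rows - 1)(# cols - 1) = 4

p_value = chi2.sf(chi_sq, df)   # probability of a value this large under independence
print(df, round(p_value, 4))    # p is roughly 0.007, below 0.05: reject the null hypothesis
```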


56

Requirements for the X² Test

Events assumed independent and drawn from the same distribution
Outcomes must be mutually exclusive
Use raw frequencies, not percentages
Sufficient values per cell: > 5


60

X² Example

Shared Task Evaluation: Topic Detection and Tracking (aka TDT)

Sub-task: Topic Tracking
Given a small number of exemplar documents (1-4) defining a topic, create a model that allows tracking of the topic, i.e. find all subsequent documents on this topic

Exemplars: 1-4 newswire articles, 300-600 words each


62

Challenges

Many news articles look alike
Create a profile (feature representation) that highlights terms strongly associated with the current topic and differentiates it from all other topics

Not all documents are labeled; only a small subset belong to topics of interest
Must differentiate from other topics AND ‘background’


68

Approach

X² feature selection:
Assume terms have a binary representation
Positive class: term occurrences from the exemplar docs
Negative class: term occurrences from other classes’ exemplars and ‘earlier’ uncategorized docs
Compute X² for terms; retain the terms with the highest X² scores (keep the top N terms)
Create one feature set per topic to be tracked
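A sketch of the selection step described above, assuming per-term document counts for the positive (on-topic) and negative classes are already available. The function names and the `doc_counts` format are illustrative choices, not from the slides.

```python
def chi_square_term(pos_with, pos_total, neg_with, neg_total):
    """X^2 for one term: 2x2 table of (term present/absent) x (negative/positive class)."""
    table = [
        [neg_total - neg_with, pos_total - pos_with],   # !t_k row: !c_i, c_i
        [neg_with, pos_with],                           #  t_k row: !c_i, c_i
    ]
    n = pos_total + neg_total
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    score = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n
            if e > 0:                                   # skip degenerate cells
                score += (table[i][j] - e) ** 2 / e
    return score

def select_topic_features(doc_counts, n_pos, n_neg, top_n=100):
    """doc_counts: {term: (docs with term in positive class, docs with term in negative class)}.
    Keep the top_n terms with the highest X^2 scores for this topic."""
    scored = sorted(
        ((chi_square_term(p, n_pos, q, n_neg), t) for t, (p, q) in doc_counts.items()),
        reverse=True,
    )
    return [t for _, t in scored[:top_n]]
```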


73

Tracking Approach

Build a vector space model
Feature weighting: tf*idf (with some modifications)
Distance measure: cosine similarity
For each topic, select documents scoring above a threshold
Result: improved retrieval
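A minimal sketch of the scoring step: represent the topic profile and each candidate document as sparse tf*idf vectors, compute cosine similarity, and keep documents above a threshold. The threshold value and variable names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def track(topic_profile, documents, threshold=0.2):
    """Return the documents whose tf*idf vectors score above the threshold for this topic."""
    return [doc for doc in documents if cosine(topic_profile, doc) >= threshold]
```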

74

HW #4

Topic: Feature Selection for kNN
Build a kNN classifier using Euclidean distance and cosine similarity
Write a program to compute X² on a data set
Use X² at different significance levels to filter features
Compare the effects of the different feature filters on kNN classification

75

Maximum Entropy

76

Maximum Entropy

“MaxEnt”: a popular machine learning technique for NLP
First uses in NLP circa 1996 (Rosenfeld, Berger)
Applied to a wide range of tasks: sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.


79

Readings & Comments

Several readings: (Berger, 1996), (Ratnaparkhi, 1997), (Klein & Manning, 2003) tutorial
Note: some of these are very ‘dense’
Don’t spend huge amounts of time on every detail; take a first pass before class and review after lecture

Going forward, the techniques get more complex
Goal: understand the basic model and concepts
Training is especially complex; we’ll discuss it, but not implement it

80

Notation Note

Notation is not entirely consistent across the readings.
We’ll use: input = x; output = y; pair = (x,y), consistent with Berger, 1996
Ratnaparkhi, 1996: input = h; output = t; pair = (h,t)
Klein & Manning, 2003: input = d; output = c; pair = (c,d)


86

Joint vs Conditional Models

Assuming some training data {(x,y)}, we need to learn a model Θ s.t. given a new x, we can predict the label y.

Different types of models:

Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
Most models so far: n-gram, Naïve Bayes, HMM, etc.
Conceptually easy to compute weights: relative frequency

Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
Models going forward: MaxEnt, SVM, CRF, …
Computing the weights is more complex

Naïve Bayes Model

The Naïve Bayes model assumes the features f are independent of each other, given the class c:

c → f1, f2, f3, …, fk


92

Naïve Bayes Model

Makes the assumption of conditional independence of the features given the class
However, this is generally unrealistic

P(“cuts”|politics) = p_cuts
What about P(“cuts”|politics, “budget”)? Is it still p_cuts?

We would like a model that doesn’t assume this

Model Parameters

Our model: c* = argmax_c P(c) Π_j P(f_j|c)

Two types of parameters:
P(c): class priors
P(f_j|c): class-conditional feature probabilities

|C| + |V||C| parameters in total, if the features are words in vocabulary V
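A small sketch of this parameterization with toy numbers (the classes, words, and probabilities are illustrative only): class priors P(c), class-conditional probabilities P(f_j|c), and prediction by c* = argmax_c P(c) Π_j P(f_j|c).

```python
import math

# Illustrative parameters: |C| priors plus |V| * |C| class-conditional probabilities.
priors = {"politics": 0.5, "sports": 0.5}
cond = {
    "politics": {"budget": 0.10, "cuts": 0.08, "game": 0.01},
    "sports":   {"budget": 0.01, "cuts": 0.02, "game": 0.12},
}

def classify(doc_words):
    """c* = argmax_c P(c) * prod_j P(f_j|c), computed in log space for stability."""
    scores = {}
    for c, prior in priors.items():
        scores[c] = math.log(prior) + sum(math.log(cond[c][w]) for w in doc_words)
    return max(scores, key=scores.get)

print(classify(["budget", "cuts"]))   # -> "politics"
```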

94

Weights in Naïve Bayes

         c1           c2           c3           …    ck
f1       P(f1|c1)     P(f1|c2)     P(f1|c3)     …    P(f1|ck)
f2       P(f2|c1)     P(f2|c2)     …
…
f|V|     P(f|V| | c1)  …


Weights in Naïve Bayes and Maximum Entropy

Naïve Bayes: the P(f|y) are probabilities in [0,1], used as weights
P(y|x) ∝ P(y) Π_j P(f_j|y)

MaxEnt: weights are real numbers of any magnitude and sign
P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x)


103

MaxEnt Overview

Prediction: P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x), where Z(x) = Σ_y' exp(Σ_j λ_j f_j(x,y'))

f_j(x,y): binary feature function, indicating the presence of feature j in instance x with class y
λ_j: feature weights, learned in training

Prediction: compute P(y|x) for each y and pick the highest
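A sketch of MaxEnt prediction under this formulation: binary feature functions f_j(x,y), weights λ_j, and a per-x normalizer Z(x). The feature functions and weight values below are illustrative placeholders, not trained values.

```python
import math

def maxent_probs(x, classes, feature_funcs, weights):
    """P(y|x) = exp(sum_j lambda_j * f_j(x, y)) / Z(x), with Z(x) summing over all classes y."""
    scores = {
        y: math.exp(sum(lam * f(x, y) for f, lam in zip(feature_funcs, weights)))
        for y in classes
    }
    z = sum(scores.values())                      # normalizer Z(x)
    return {y: s / z for y, s in scores.items()}

def predict(x, classes, feature_funcs, weights):
    probs = maxent_probs(x, classes, feature_funcs, weights)
    return max(probs, key=probs.get)              # pick the y with highest P(y|x)

# Illustrative binary feature functions over (document, class) pairs:
features = [
    lambda x, y: 1.0 if "budget" in x and y == "politics" else 0.0,
    lambda x, y: 1.0 if "game" in x and y == "sports" else 0.0,
]
print(predict({"budget", "cuts"}, ["politics", "sports"], features, weights=[1.2, 0.8]))
```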

104

Weights in MaxEnt

         c1    c2    c3    …    ck
f1       λ1    λ8    …
f2       λ2    …
…
f|V|     λ6


109

Maximum Entropy Principle

Intuitively: model all that is known, and assume as little as possible about what is unknown

Maximum entropy = minimum commitment

Related to concepts like Occam’s razor

Laplace’s “Principle of Insufficient Reason”: when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely


114

Example I: (K&M 2003)

Consider a coin flip: H(X) = -Σ_x P(X=x) log P(X=x)

What values of P(X=H), P(X=T) maximize H(X)?
P(X=H) = P(X=T) = 1/2
With no prior information, the best guess is a fair coin

What if you know P(X=H) = 0.3?
Then P(X=T) = 0.7
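A quick numeric check of the coin example (illustrative): entropy peaks at the uniform distribution, and once P(X=H) = 0.3 is fixed the remaining mass must go to tails.

```python
import math

def entropy(ps):
    """H(X) = -sum_x P(x) * log2 P(x)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: the maximum for a binary variable
print(entropy([0.3, 0.7]))   # ~0.881 bits: the maxent distribution once P(H)=0.3 is fixed
```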


117

Example II: MT (Berger, 1996)

Task: English-to-French machine translation; specifically, translating ‘in’

Suppose we’ve seen ‘in’ translated as: {dans, en, à, au cours de, pendant}
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

If there is no other constraint, what is the maxent model?
p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5


123

Example II: MT (Berger, 1996)

What if we find out that the translator uses dans or en 30% of the time?
Constraint: p(dans) + p(en) = 3/10

Now what is the maxent model?
p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30

What if we also know the translator picks à or dans 50% of the time?
Add a new constraint: p(à) + p(dans) = 0.5

Now what is the maxent model? Not intuitively obvious… (see the numeric sketch below)
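The last question can be answered numerically: maximize H(p) over the five translation probabilities subject to the three constraints. A sketch, assuming SciPy is available; the solver choice and starting point are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# p = [p(dans), p(en), p(a), p(au cours de), p(pendant)]
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)          # avoid log(0)
    return np.sum(p * np.log(p))        # minimizing this maximizes H(p)

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},     # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},   # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},   # p(a) + p(dans) = 1/2
]

result = minimize(neg_entropy, x0=np.full(5, 0.2), bounds=[(0, 1)] * 5,
                  constraints=constraints, method="SLSQP")
print(np.round(result.x, 4))   # the maximum entropy distribution satisfying all constraints
```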


Example III: POS (K&M, 2003)


130

Example III

Problem: too uniform

What else do we know? Nouns are more common than verbs
So f_N = {NN, NNS, NNP, NNPS}, and E[f_N] = 32/36

Also, proper nouns are more frequent than common nouns, so E[f_NNP,NNPS] = 24/36

Etc.