
1

Feature Selection & Maximum Entropy

Advanced Statistical Methods in NLP, Ling 572

January 26, 2012

2

Roadmap

Feature selection and weighting:
Feature weighting
Chi-square feature selection
Chi-square feature selection example

HW #4

Maximum Entropy:
Introduction: the Maximum Entropy Principle
Maximum Entropy NLP examples


7

Feature Selection Recap

Problem: Curse of dimensionality
Data sparseness, computational cost, overfitting

Solution: Dimensionality reduction
New feature set r’ s.t. |r’| < |r|

Approaches:
Global & local approaches
Feature extraction: new features in r’ are transformations of the features in r
Feature selection: wrapper techniques, feature scoring


12

Feature Weighting

For text classification, typical weights include:

Binary: weights in {0,1}

Term frequency (tf): # occurrences of t_k in document d_i

Inverse document frequency (idf): df_k = # of docs in which t_k appears; N = # docs
idf_k = log(N / (1 + df_k))

tfidf = tf * idf
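To make these weighting schemes concrete, here is a minimal sketch (not from the original slides) that computes binary, tf, idf, and tf*idf weights over a toy document collection; the document texts and variable names are purely illustrative.

```python
import math
from collections import Counter

# Toy document collection; purely illustrative.
docs = [
    "budget cuts announced in new budget".split(),
    "team wins game after close game".split(),
    "budget debate continues".split(),
]

N = len(docs)
df = Counter()                      # df_k: number of docs containing term t_k
for d in docs:
    df.update(set(d))

def weights(doc):
    tf = Counter(doc)               # tf: raw count of t_k in this document
    out = {}
    for term, count in tf.items():
        idf = math.log(N / (1 + df[term]))   # idf = log(N / (1 + df_k)), as on the slide
        out[term] = {"binary": 1, "tf": count, "idf": idf, "tfidf": count * idf}
    return out

print(weights(docs[0])["budget"])
```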


16

Chi Square

Tests for the presence/absence of a relation between random variables
Bivariate analysis: tests 2 random variables
Can test the strength of the relationship
(Strictly speaking) doesn’t test direction


20

Chi Square Example

Can gender predict shoe choice?
A: male/female (the feature)
B: shoe choice (the classes: {sandal, sneaker, …})

Due to F. Xia

          sandal   sneaker   leather shoe   boot   other
Male         6        17          13          9      5
Female      13         5           7         16      9


28

Comparing Distributions

Observed distribution (O):

          sandal   sneaker   leather shoe   boot   other
Male         6        17          13          9      5
Female      13         5           7         16      9

Expected distribution (E):

          sandal   sneaker   leather shoe   boot   other   Total
Male        9.5       11          10        12.5     7       50
Female      9.5       11          10        12.5     7       50
Total       19        22          20        25      14      100

Due to F. Xia


33

Computing Chi Square

Expected value for a cell = row_total * column_total / table_total

X² = Σ (O - E)² / E
   = (6 - 9.5)²/9.5 + (17 - 11)²/11 + … = 14.026
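A small sketch (using the observed counts from the shoe example above) that builds the expected values from row/column totals and accumulates X²; it should reproduce the value of roughly 14.03.

```python
# Chi-square statistic for the gender/shoe contingency table above.
observed = [
    [6, 17, 13, 9, 5],    # Male
    [13, 5, 7, 16, 9],    # Female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total   # expected = row_total * col_total / table_total
        chi_sq += (o - e) ** 2 / e

print(round(chi_sq, 3))   # ~14.026
```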


37

Calculating X²

Tabulate the contingency table of observed values: O
Compute row and column totals
Compute the table of expected values from the row/column totals, assuming no association
Compute X²


44

For a 2x2 Table

O:
        !ci   ci
!tk      a    b
tk       c    d

E:
        !ci             ci              Total
!tk     (a+b)(a+c)/N    (a+b)(b+d)/N    a+b
tk      (c+d)(a+c)/N    (c+d)(b+d)/N    c+d
Total   a+c             b+d             N
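For feature selection, the same computation applied to this 2x2 term/class table gives a per-term score. Below is a minimal sketch; the function name and the example counts are illustrative, not from the slides.

```python
def chi_square_2x2(a, b, c, d):
    """X^2 for the 2x2 table above:
              !ci   ci
        !tk    a    b
         tk    c    d
    Each expected cell = row_total * col_total / N, as in the E table.
    Assumes nonzero row and column totals."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row = [a + b, c + d]
    col = [a + c, b + d]
    chi_sq = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n
            chi_sq += (observed[i][j] - e) ** 2 / e
    return chi_sq

# Illustrative counts for one term and one class:
print(chi_square_2x2(a=70, b=10, c=5, d=15))
```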


52

X² Test

Tests whether two random variables are independent
Null hypothesis: the 2 R.V.s are independent
Compute the X² statistic
Compute the degrees of freedom: df = (# rows - 1)(# cols - 1); shoe example: df = (2-1)(5-1) = 4
Look up the probability of the X² statistic value in an X² table
If the probability is low (below some significance level), the null hypothesis can be rejected
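To check the statistic against a significance level without a printed X² table, one option is the chi-squared survival function; a small sketch, assuming SciPy is available, for the shoe example (X² ≈ 14.03, df = 4).

```python
from scipy.stats import chi2

chi_sq = 14.026          # statistic from the shoe example
df = (2 - 1) * (5 - 1)   # (# rows - 1)(# cols - 1) = 4

p_value = chi2.sf(chi_sq, df)   # probability of a value this large under independence
print(df, round(p_value, 4))    # p is roughly 0.007, below 0.05: reject the null hypothesis
```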


56

Requirements for the X² Test

Events assumed independent and drawn from the same distribution
Outcomes must be mutually exclusive
Use raw frequencies, not percentages
Sufficient values per cell: > 5


60

X² Example

Shared Task Evaluation: Topic Detection and Tracking (aka TDT)

Sub-task: Topic Tracking
Given a small number of exemplar documents (1-4) defining a topic, create a model that allows tracking of the topic, i.e. find all subsequent documents on this topic

Exemplars: 1-4 newswire articles, 300-600 words each


62

Challenges

Many news articles look alike
Create a profile (feature representation) that highlights terms strongly associated with the current topic and differentiates it from all other topics

Not all documents are labeled; only a small subset belong to topics of interest
Must differentiate from other topics AND ‘background’


68

Approach

X² feature selection:
Assume terms have a binary representation
Positive class: term occurrences from the exemplar docs
Negative class: term occurrences from other classes’ exemplars and ‘earlier’ uncategorized docs
Compute X² for terms; retain the terms with the highest X² scores (keep the top N terms)
Create one feature set per topic to be tracked
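A sketch of the selection step described above, assuming per-term document counts for the positive (on-topic) and negative classes are already available. The function names and the `doc_counts` format are illustrative choices, not from the slides.

```python
def chi_square_term(pos_with, pos_total, neg_with, neg_total):
    """X^2 for one term: 2x2 table of (term present/absent) x (negative/positive class)."""
    table = [
        [neg_total - neg_with, pos_total - pos_with],   # !t_k row: !c_i, c_i
        [neg_with, pos_with],                           #  t_k row: !c_i, c_i
    ]
    n = pos_total + neg_total
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    score = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n
            if e > 0:                                   # skip degenerate cells
                score += (table[i][j] - e) ** 2 / e
    return score

def select_topic_features(doc_counts, n_pos, n_neg, top_n=100):
    """doc_counts: {term: (docs with term in positive class, docs with term in negative class)}.
    Keep the top_n terms with the highest X^2 scores for this topic."""
    scored = sorted(
        ((chi_square_term(p, n_pos, q, n_neg), t) for t, (p, q) in doc_counts.items()),
        reverse=True,
    )
    return [t for _, t in scored[:top_n]]
```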


73

Tracking Approach

Build a vector space model
Feature weighting: tf*idf (with some modifications)
Distance measure: cosine similarity
For each topic, select documents scoring above a threshold
Result: improved retrieval
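A minimal sketch of the scoring step: represent the topic profile and each candidate document as sparse tf*idf vectors, compute cosine similarity, and keep documents above a threshold. The threshold value and variable names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def track(topic_profile, documents, threshold=0.2):
    """Return the documents whose tf*idf vectors score above the threshold for this topic."""
    return [doc for doc in documents if cosine(topic_profile, doc) >= threshold]
```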

74

HW #4

Topic: Feature Selection for kNN
Build a kNN classifier using Euclidean distance and cosine similarity
Write a program to compute X² on a data set
Use X² at different significance levels to filter features
Compare the effects of the different feature filters on kNN classification

75

Maximum Entropy

76

Maximum Entropy

“MaxEnt”: a popular machine learning technique for NLP
First uses in NLP circa 1996 (Rosenfeld, Berger)
Applied to a wide range of tasks: sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.


79

Readings & Comments

Several readings: (Berger, 1996), (Ratnaparkhi, 1997), (Klein & Manning, 2003) tutorial
Note: some of these are very ‘dense’
Don’t spend huge amounts of time on every detail; take a first pass before class and review after lecture

Going forward, the techniques get more complex
Goal: understand the basic model and concepts
Training is especially complex; we’ll discuss it, but not implement it

80

Notation Note

Notation is not entirely consistent across the readings.
We’ll use: input = x; output = y; pair = (x,y), consistent with Berger, 1996
Ratnaparkhi, 1996: input = h; output = t; pair = (h,t)
Klein & Manning, 2003: input = d; output = c; pair = (c,d)


86

Joint vs Conditional Models

Assuming some training data {(x,y)}, we need to learn a model Θ s.t. given a new x, we can predict the label y.

Different types of models:

Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
Most models so far: n-gram, Naïve Bayes, HMM, etc.
Conceptually easy to compute weights: relative frequency

Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
Models going forward: MaxEnt, SVM, CRF, …
Computing the weights is more complex

Naïve Bayes Model

The Naïve Bayes model assumes the features f are independent of each other, given the class c:

c → f1, f2, f3, …, fk


92

Naïve Bayes Model

Makes the assumption of conditional independence of the features given the class
However, this is generally unrealistic

P(“cuts”|politics) = p_cuts
What about P(“cuts”|politics, “budget”)? Is it still p_cuts?

We would like a model that doesn’t assume this

Model Parameters

Our model: c* = argmax_c P(c) Π_j P(f_j|c)

Two types of parameters:
P(c): class priors
P(f_j|c): class-conditional feature probabilities

|C| + |V||C| parameters in total, if the features are words in vocabulary V
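A small sketch of this parameterization with toy numbers (the classes, words, and probabilities are illustrative only): class priors P(c), class-conditional probabilities P(f_j|c), and prediction by c* = argmax_c P(c) Π_j P(f_j|c).

```python
import math

# Illustrative parameters: |C| priors plus |V| * |C| class-conditional probabilities.
priors = {"politics": 0.5, "sports": 0.5}
cond = {
    "politics": {"budget": 0.10, "cuts": 0.08, "game": 0.01},
    "sports":   {"budget": 0.01, "cuts": 0.02, "game": 0.12},
}

def classify(doc_words):
    """c* = argmax_c P(c) * prod_j P(f_j|c), computed in log space for stability."""
    scores = {}
    for c, prior in priors.items():
        scores[c] = math.log(prior) + sum(math.log(cond[c][w]) for w in doc_words)
    return max(scores, key=scores.get)

print(classify(["budget", "cuts"]))   # -> "politics"
```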

94

Weights in Naïve Bayes

         c1           c2           c3           …    ck
f1       P(f1|c1)     P(f1|c2)     P(f1|c3)     …    P(f1|ck)
f2       P(f2|c1)     P(f2|c2)     …
…
f|V|     P(f|V| | c1)  …


Weights in Naïve Bayes and Maximum Entropy

Naïve Bayes: the P(f|y) are probabilities in [0,1], used as weights
P(y|x) ∝ P(y) Π_j P(f_j|y)

MaxEnt: weights are real numbers of any magnitude and sign
P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x)


103

MaxEnt Overview

Prediction: P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x), where Z(x) = Σ_y' exp(Σ_j λ_j f_j(x,y'))

f_j(x,y): binary feature function, indicating the presence of feature j in instance x with class y
λ_j: feature weights, learned in training

Prediction: compute P(y|x) for each y and pick the highest
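A sketch of MaxEnt prediction under this formulation: binary feature functions f_j(x,y), weights λ_j, and a per-x normalizer Z(x). The feature functions and weight values below are illustrative placeholders, not trained values.

```python
import math

def maxent_probs(x, classes, feature_funcs, weights):
    """P(y|x) = exp(sum_j lambda_j * f_j(x, y)) / Z(x), with Z(x) summing over all classes y."""
    scores = {
        y: math.exp(sum(lam * f(x, y) for f, lam in zip(feature_funcs, weights)))
        for y in classes
    }
    z = sum(scores.values())                      # normalizer Z(x)
    return {y: s / z for y, s in scores.items()}

def predict(x, classes, feature_funcs, weights):
    probs = maxent_probs(x, classes, feature_funcs, weights)
    return max(probs, key=probs.get)              # pick the y with highest P(y|x)

# Illustrative binary feature functions over (document, class) pairs:
features = [
    lambda x, y: 1.0 if "budget" in x and y == "politics" else 0.0,
    lambda x, y: 1.0 if "game" in x and y == "sports" else 0.0,
]
print(predict({"budget", "cuts"}, ["politics", "sports"], features, weights=[1.2, 0.8]))
```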

104

Weights in MaxEnt

         c1    c2    c3    …    ck
f1       λ1    λ8    …
f2       λ2    …
…
f|V|     λ6


109

Maximum Entropy Principle

Intuitively: model all that is known, and assume as little as possible about what is unknown

Maximum entropy = minimum commitment

Related to concepts like Occam’s razor

Laplace’s “Principle of Insufficient Reason”: when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely


114

Example I: (K&M 2003)

Consider a coin flip: H(X) = -Σ_x P(X=x) log P(X=x)

What values of P(X=H), P(X=T) maximize H(X)?
P(X=H) = P(X=T) = 1/2
With no prior information, the best guess is a fair coin

What if you know P(X=H) = 0.3?
Then P(X=T) = 0.7
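A quick numeric check of the coin example (illustrative): entropy peaks at the uniform distribution, and once P(X=H) = 0.3 is fixed the remaining mass must go to tails.

```python
import math

def entropy(ps):
    """H(X) = -sum_x P(x) * log2 P(x)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: the maximum for a binary variable
print(entropy([0.3, 0.7]))   # ~0.881 bits: the maxent distribution once P(H)=0.3 is fixed
```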


117

Example II: MT (Berger, 1996)

Task: English-to-French machine translation; specifically, translating ‘in’

Suppose we’ve seen ‘in’ translated as: {dans, en, à, au cours de, pendant}
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

If there is no other constraint, what is the maxent model?
p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5


123

Example II: MT (Berger, 1996)

What if we find out that the translator uses dans or en 30% of the time?
Constraint: p(dans) + p(en) = 3/10

Now what is the maxent model?
p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30

What if we also know the translator picks à or dans 50% of the time?
Add a new constraint: p(à) + p(dans) = 0.5

Now what is the maxent model? Not intuitively obvious… (see the numeric sketch below)
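The last question can be answered numerically: maximize H(p) over the five translation probabilities subject to the three constraints. A sketch, assuming SciPy is available; the solver choice and starting point are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# p = [p(dans), p(en), p(a), p(au cours de), p(pendant)]
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)          # avoid log(0)
    return np.sum(p * np.log(p))        # minimizing this maximizes H(p)

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},     # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},   # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},   # p(a) + p(dans) = 1/2
]

result = minimize(neg_entropy, x0=np.full(5, 0.2), bounds=[(0, 1)] * 5,
                  constraints=constraints, method="SLSQP")
print(np.round(result.x, 4))   # the maximum entropy distribution satisfying all constraints
```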


Example III: POS (K&M, 2003)


130

Example III

Problem: too uniform

What else do we know? Nouns are more common than verbs
So f_N = {NN, NNS, NNP, NNPS}, and E[f_N] = 32/36

Also, proper nouns are more frequent than common nouns, so E[f_NNP,NNPS] = 24/36

Etc.