
Feature Selection & Maximum Entropy

Advanced Statistical Methods in NLP, Ling 572

January 26, 2012


Roadmap

Feature selection and weighting
  Feature weighting
  Chi-square feature selection
  Chi-square feature selection example

HW #4

Maximum Entropy
  Introduction: the Maximum Entropy Principle
  Maximum Entropy NLP examples


Feature Selection Recap

Problem: Curse of dimensionality
  Data sparseness, computational cost, overfitting

Solution: Dimensionality reduction
  New feature set r' s.t. |r'| < |r|

Approaches:
  Global & local approaches
  Feature extraction: new features in r' are transformations of features in r
  Feature selection: wrapper techniques, feature scoring


Feature Weighting

For text classification, typical weights include:
  Binary: weights in {0,1}
  Term frequency (tf): number of occurrences of term t_k in document d_i
  Inverse document frequency (idf): df_k = number of docs in which t_k appears; N = total number of docs
    idf_k = log(N / (1 + df_k))
  tf-idf: tfidf = tf * idf
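A minimal sketch of these weights in Python. The tokenized toy corpus and the helper names are illustrative assumptions, and the idf uses the log(N / (1 + df_k)) variant given above.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative only).
docs = [["budget", "cuts", "vote"],
        ["game", "score", "cuts"],
        ["budget", "vote", "vote"]]

N = len(docs)
df = Counter()                         # df_k: number of docs containing term k
for d in docs:
    df.update(set(d))

def tf(term, doc):
    return doc.count(term)             # raw term frequency in one document

def idf(term):
    return math.log(N / (1 + df[term]))    # idf variant from the slide

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tf("score", docs[1]), idf("score"), tfidf("score", docs[1]))
```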


Chi Square

Tests for the presence/absence of a relation between random variables
  Bivariate analysis: tests 2 random variables
  Can test strength of relationship
  (Strictly speaking) doesn't test direction


Chi Square Example

Can gender predict shoe choice?
  A: male/female        (the features)
  B: shoe choice        (the classes: {sandal, sneaker, ...})

(Due to F. Xia)

            sandal   sneaker   leather shoe   boot   other
  Male         6        17          13          9      5
  Female      13         5           7         16      9


Comparing Distributions

Observed distribution (O):

            sandal   sneaker   leather shoe   boot   other
  Male         6        17          13          9      5
  Female      13         5           7         16      9

Expected distribution (E):

            sandal   sneaker   leather shoe   boot   other   Total
  Male        9.5       11          10        12.5     7       50
  Female      9.5       11          10        12.5     7       50
  Total       19        22          20        25      14      100

(Due to F. Xia)


Computing Chi Square

Expected value for a cell = row_total * column_total / table_total

X^2 = (6 - 9.5)^2/9.5 + (17 - 11)^2/11 + ... = 14.026
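A quick numeric check of this calculation, as a sketch (numpy is assumed available; the counts are the observed values from the shoe example above):

```python
import numpy as np

# Observed counts: rows = {Male, Female}, cols = {sandal, sneaker, leather shoe, boot, other}
O = np.array([[ 6, 17, 13,  9, 5],
              [13,  5,  7, 16, 9]], dtype=float)

row_totals = O.sum(axis=1)     # [50, 50]
col_totals = O.sum(axis=0)     # [19, 22, 20, 25, 14]
N = O.sum()                    # 100

# Expected count for each cell: row_total * column_total / table_total
E = np.outer(row_totals, col_totals) / N

chi2 = ((O - E) ** 2 / E).sum()
print(round(chi2, 3))          # ~14.027, matching the slide's 14.026
```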


Calculating X^2

Tabulate the contingency table of observed values: O
Compute row and column totals
Compute the table of expected values E, given the row/column totals
  Assuming no association between the variables
Compute X^2 = sum over cells of (O_ij - E_ij)^2 / E_ij
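For reference, SciPy packages this whole procedure in one call; a sketch, assuming scipy is installed (the shoe-example counts are reused):

```python
import numpy as np
from scipy.stats import chi2_contingency

O = np.array([[ 6, 17, 13,  9, 5],
              [13,  5,  7, 16, 9]])

# chi2_contingency derives the expected table from the margins and
# returns the statistic, p-value, degrees of freedom, and expected counts.
stat, p, dof, expected = chi2_contingency(O)
print(stat, dof)    # ~14.03 with 4 degrees of freedom
```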


For 2x2 Table

O:
            !c_i    c_i
  !t_k       a       b
  t_k        c       d

E:
            !c_i              c_i               Total
  !t_k      (a+b)(a+c)/N      (a+b)(b+d)/N       a+b
  t_k       (c+d)(a+c)/N      (c+d)(b+d)/N       c+d
  Total      a+c               b+d                N
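A small sketch of the 2x2 case used for term/class scoring; the function name and the example counts are assumptions for illustration, with a, b, c, d laid out exactly as in the table above.

```python
def chi2_2x2(a, b, c, d):
    """X^2 for the 2x2 term/class table:
              !c_i   c_i
        !t_k    a     b
        t_k     c     d
    """
    N = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / N   # row_total * col_total / N
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts: a term present in 40/50 docs of class c_i and 10/50 docs outside c_i.
print(round(chi2_2x2(40, 10, 10, 40), 2))   # 36.0
```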


X^2 Test

Tests whether random variables are independent
  Null hypothesis: the 2 R.V.s are independent

Compute the X^2 statistic
Compute the degrees of freedom
  df = (# rows - 1)(# cols - 1)
  Shoe example: df = (2-1)(5-1) = 4
Look up the probability of the X^2 statistic value in a X^2 table
If the probability is low (below some significance level), we can reject the null hypothesis
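A sketch of the lookup step using scipy instead of a printed X^2 table (scipy assumed available; the statistic and df come from the shoe example):

```python
from scipy.stats import chi2

stat = 14.026   # X^2 statistic from the shoe example
df = 4          # (2-1)(5-1)

p_value = chi2.sf(stat, df)   # survival function: P(X^2 >= stat) under independence
print(p_value)                # ~0.007, well below 0.05, so reject independence
```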


Requirements for the X^2 Test

Events assumed independent and drawn from the same distribution
Outcomes must be mutually exclusive
Use raw frequencies, not percentages
Sufficient values per cell: > 5


X^2 Example

Shared task evaluation: Topic Detection and Tracking (aka TDT)

Sub-task: Topic Tracking
  Given a small number of exemplar documents (1-4) that define a topic
  Create a model that allows tracking of the topic,
    i.e. find all subsequent documents on this topic
  Exemplars: 1-4 newswire articles, 300-600 words each


Challenges

Many news articles look alike
Create a profile (feature representation) that:
  Highlights terms strongly associated with the current topic
  Differentiates it from all other topics

Not all documents are labeled
  Only a small subset belong to topics of interest
  Must differentiate from other topics AND from 'background' documents


Approach

X^2 feature selection:
  Assume terms have a binary representation
  Positive class: term occurrences from the exemplar docs
  Negative class: term occurrences from other classes' exemplars and 'earlier' uncategorized docs
  Compute X^2 for each term
  Retain the terms with the highest X^2 scores: keep the top N terms

Create one feature set per topic to be tracked (see the sketch below)
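A sketch of that selection step under the assumptions above (binary term presence, one positive topic vs. a negative background); the helper names and the tiny document lists are illustrative.

```python
def chi2_2x2(a, b, c, d):
    """X^2 for a 2x2 contingency table [[a, b], [c, d]], via the closed form."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def select_features(pos_docs, neg_docs, top_n=10):
    """Score each term by X^2 against the positive topic and keep the top N."""
    vocab = set().union(*pos_docs, *neg_docs)
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    scores = {}
    for term in vocab:
        pos_with = sum(term in d for d in pos_docs)   # term present, positive class
        neg_with = sum(term in d for d in neg_docs)   # term present, negative class
        # 2x2 table: rows = term present/absent, cols = positive/negative class
        scores[term] = chi2_2x2(pos_with, neg_with,
                                n_pos - pos_with, n_neg - neg_with)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Illustrative exemplar vs. background documents (sets of terms).
pos = [{"quake", "rescue", "aftershock"}, {"quake", "damage", "rescue"}]
neg = [{"election", "vote"}, {"game", "score"}, {"vote", "budget"}]
print(select_features(pos, neg, top_n=3))
```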


Tracking Approach

Build a vector space model
  Feature weighting: tf*idf, with some modifications
  Distance measure: cosine similarity

For each topic, select documents scoring above a threshold

Result: Improved retrieval
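A sketch of the scoring step: cosine similarity between a topic profile vector and a document vector, followed by a threshold. The weight values and the threshold below are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

topic_profile = {"quake": 2.1, "rescue": 1.7, "aftershock": 1.3}   # tf*idf weights (made up)
doc = {"quake": 1.2, "damage": 0.9, "rescue": 0.4}

threshold = 0.3
score = cosine(topic_profile, doc)
print(score, score >= threshold)   # track the document if it scores above the threshold
```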


HW #4

Topic: Feature Selection for kNN
  Build a kNN classifier using Euclidean distance and cosine similarity
  Write a program to compute X^2 on a data set
  Use X^2 at different significance levels to filter features
  Compare the effects of the different feature filterings on kNN classification


Maximum Entropy


Maximum Entropy

"MaxEnt": a popular machine learning technique for NLP
  First uses in NLP circa 1996 (Rosenfeld, Berger)

Applied to a wide range of tasks:
  Sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.


Readings & Comments

Several readings:
  (Berger, 1996), (Ratnaparkhi, 1997)
  (Klein & Manning, 2003): tutorial
  Note: some of these are very 'dense'
    Don't spend huge amounts of time on every detail
    Take a first pass before class, review after lecture

Going forward: techniques get more complex
  Goal: understand the basic model and concepts
  Training is especially complex; we'll discuss it, but not implement it


Notation Note

The literature is not entirely consistent:
  We'll use: input = x; output = y; pair = (x,y)
    Consistent with Berger, 1996
  Ratnaparkhi, 1996: input = h; output = t; pair = (h,t)
  Klein/Manning, '03: input = d; output = c; pair = (c,d)


Joint vs Conditional Models

Assuming some training data {(x,y)}, we need to learn a model Θ s.t. given a new x, we can predict label y.

Different types of models:
  Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
    Most models so far: n-gram, Naïve Bayes, HMM, etc.
    Conceptually easy to compute weights: relative frequency
  Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
    Models going forward: MaxEnt, SVM, CRF, ...
    Computing weights is more complex


Naïve Bayes Model

Naïve Bayes Model assumes features f are independent of each other, given the class C

[Diagram: class node c with feature nodes f1, f2, f3, ..., fk as its children]


Naïve Bayes Model

Makes the assumption of conditional independence of features given the class

However, this is generally unrealistic:
  P("cuts" | politics) = p_cuts
  What about P("cuts" | politics, "budget")? Is it still p_cuts?

We would like a model that doesn't make this assumption


Model Parameters

Our model: c* = argmax_c P(c) * Π_j P(f_j|c)

Two types of parameters:
  P(c): class priors
  P(f_j|c): class-conditional feature probabilities

In total: |C| + |V|·|C| parameters, if the features are words in vocabulary V
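A minimal sketch of this decision rule in log space; the priors and class-conditional probabilities below are made-up numbers, not values from the slides.

```python
import math

# Hypothetical parameters: class priors and class-conditional word probabilities.
priors = {"politics": 0.6, "sports": 0.4}
cond = {
    "politics": {"cuts": 0.02,  "budget": 0.03,  "game": 0.001},
    "sports":   {"cuts": 0.005, "budget": 0.001, "game": 0.04},
}

def classify(words):
    """c* = argmax_c P(c) * prod_j P(f_j | c), computed with log probabilities."""
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c]) + sum(math.log(cond[c][w]) for w in words)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["budget", "cuts"]))   # -> "politics" with these made-up numbers
```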


Weights in Naïve Bayes

          c1            c2            c3            …    ck
  f1      P(f1|c1)      P(f1|c2)      P(f1|c3)           P(f1|ck)
  f2      P(f2|c1)      P(f2|c2)      …
  …       …
  f|V|    P(f|V| | c1)


Weights in Naïve Bayes and Maximum Entropy

Naïve Bayes:
  The weights P(f|y) are probabilities in [0,1]
  P(y|x) ∝ P(y) * Π_j P(f_j|y)

MaxEnt:
  Weights are real numbers: any magnitude, any sign
  P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x)


MaxEnt Overview

Prediction:
  P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x),  where Z(x) = Σ_y' exp(Σ_j λ_j f_j(x,y'))

  f_j(x,y): binary feature function, indicating the presence of feature j in instance x with class y
  λ_j: feature weights, learned in training

Prediction: compute P(y|x) for each y, pick the highest-scoring y
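A small sketch of that prediction step; the feature functions, weights, and classes below are invented for illustration, and training (setting the λ's) is not shown.

```python
import math

classes = ["politics", "sports"]

# Binary feature functions f_j(x, y): fire when a word is in x and y is a given class.
def make_feature(word, cls):
    return lambda x, y: 1.0 if (word in x and y == cls) else 0.0

features = [make_feature("budget", "politics"),
            make_feature("game", "sports"),
            make_feature("cuts", "politics")]
lambdas = [1.2, 1.5, 0.7]   # hypothetical learned weights

def p_y_given_x(x):
    """P(y|x) = exp(sum_j lambda_j * f_j(x,y)) / Z(x)."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
              for y in classes}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

x = {"budget", "cuts", "vote"}
probs = p_y_given_x(x)
print(max(probs, key=probs.get), probs)   # pick the highest-probability class
```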


Weights in MaxEnt

          c1      c2      c3      …    ck
  f1      λ1      λ8      …
  f2      λ2      …
  …       …
  f|V|    λ6


Maximum Entropy Principle

Intuitively: model all that is known, and assume as little as possible about what is unknown

Maximum entropy = minimum commitment

Related to concepts like Occam's razor

Laplace's "Principle of Insufficient Reason": when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely



Example I: (K&M 2003)

Consider a coin flip, with entropy H(X) = -Σ_x P(X=x) log P(X=x)

What values of P(X=H), P(X=T) maximize H(X)?
  P(X=H) = P(X=T) = 1/2
  If there is no prior information, the best guess is a fair coin

What if you know P(X=H) = 0.3?
  Then P(X=T) = 0.7
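A quick sketch that checks this numerically by evaluating H(X) over a grid of P(X=H) values (the grid resolution is an arbitrary choice):

```python
import math

def entropy(p_heads):
    """H(X) = -sum_x P(x) log P(x) for a binary variable, in bits."""
    probs = [p_heads, 1.0 - p_heads]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Scan P(X=H) in steps of 0.01 and find the entropy-maximizing value.
grid = [i / 100 for i in range(101)]
best = max(grid, key=entropy)
print(best, entropy(best))   # 0.5, 1.0 bit: the fair coin maximizes H(X)
print(entropy(0.3))          # ~0.881 bits once P(X=H) is constrained to 0.3
```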



Example II: MT (Berger, 1996)

Task: English-to-French machine translation; specifically, translating 'in'

Suppose we've seen 'in' translated as: {dans, en, à, au cours de, pendant}

Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

If there is no other constraint, what is the maxent model?
  p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5


Example II: MT (Berger, 1996), continued

What if we find out that the translator uses dans or en 30% of the time?
  Constraint: p(dans) + p(en) = 3/10
  Now what is the maxent model?
    p(dans) = p(en) = 3/20
    p(à) = p(au cours de) = p(pendant) = 7/30

What if we also know the translator picks à or dans 50% of the time?
  Add a new constraint: p(à) + p(dans) = 0.5
  Now what is the maxent model?
    Not intuitively obvious... (see the sketch below)
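A sketch of solving that last case numerically by maximizing entropy under the two constraints (scipy assumed available; this uses a generic constrained optimizer, not the iterative scaling algorithms actually used to train MaxEnt models):

```python
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))    # minimizing -H(p) maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},        # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},    # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},    # p(à) + p(dans) = 1/2
]

result = minimize(neg_entropy, x0=np.full(5, 0.2), method="SLSQP",
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
print(dict(zip(words, np.round(result.x, 3))))
```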


Example III: POS (K&M, 2003)

Problem: the model is too uniform

What else do we know?
  Nouns are more common than verbs
    So with f_N = {NN, NNS, NNP, NNPS}, E[f_N] = 32/36
  Also, proper nouns are more frequent than common nouns, so E[{NNP, NNPS}] = 24/36
  Etc.