1
Feature Selection & Maximum Entropy
Advanced Statistical Methods in NLP
Ling 572
January 26, 2012
2
Roadmap
Feature selection and weighting
  Feature weighting
  Chi-square feature selection
  Chi-square feature selection example
HW #4
Maximum Entropy
  Introduction: Maximum Entropy Principle
  Maximum Entropy NLP examples
7
Feature Selection Recap
Problem: Curse of dimensionality
  Data sparseness, computational cost, overfitting
Solution: Dimensionality reduction
  New feature set r’ s.t. |r’| < |r|
Approaches:
  Global & local approaches
  Feature extraction:
    New features in r’ are transformations of features in r
  Feature selection:
    Wrapper techniques
    Feature scoring
12
Feature Weighting
For text classification, typical weights include:
  Binary: weights in {0,1}
  Term frequency (tf): # occurrences of t_k in document d_i
  Inverse document frequency (idf):
    df_k: # of docs in which t_k appears; N: # docs
    idf = log(N / (1 + df_k))
  tfidf = tf * idf
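A minimal sketch of these weights in Python, assuming whitespace-tokenized toy documents and natural logarithms. The documents and the helper name `tfidf_weights` are illustrative and not from the lecture; the idf uses the smoothed form given on the slide, log(N / (1 + df_k)).

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: tf*idf} dict per document.

    docs: list of token lists. idf follows the slide: log(N / (1 + df_k)).
    """
    n_docs = len(docs)
    # df_k: number of documents in which term t_k appears at least once
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / (1 + count)) for term, count in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights

# Toy corpus (hypothetical)
docs = [
    "budget cuts budget vote".split(),
    "team wins game".split(),
    "vote on game day".split(),
    "storm hits coast".split(),
]
weights = tfidf_weights(docs)
```

With this smoothing, a term that occurs in every document gets a slightly negative idf, since N / (1 + N) < 1.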
16
Chi Square
Tests for presence/absence of relation between random variables
Bivariate analysis: tests 2 random variables
Can test strength of relationship
(Strictly speaking) doesn’t test direction
20
Chi Square Example
Can gender predict shoe choice?
  A: male/female → Features
  B: shoe choice → Classes: {sandal, sneaker, …}
Due to F. Xia

         sandal  sneaker  leather shoe  boot  other
Male          6       17            13     9      5
Female       13        5             7    16      9
28
Comparing Distributions
Observed distribution (O):

         sandal  sneaker  leather shoe  boot  other
Male          6       17            13     9      5
Female       13        5             7    16      9

Expected distribution (E):

         sandal  sneaker  leather shoe  boot  other  Total
Male        9.5       11            10  12.5      7     50
Female      9.5       11            10  12.5      7     50
Total        19       22            20    25     14    100

Due to F. Xia
33
Computing Chi Square
Expected value for cell = row_total * column_total / table_total
$X^2 = \sum_{\text{cells}} \frac{(O - E)^2}{E} = \frac{(6 - 9.5)^2}{9.5} + \frac{(17 - 11)^2}{11} + \dots \approx 14.027$
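The computation for the shoe table can be sketched in a few lines of Python. The function name `chi_square` is an illustrative choice; the counts are the observed values from the example.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row_total * column_total / table_total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return stat, expected

# Columns: sandal, sneaker, leather shoe, boot, other
shoes = [
    [6, 17, 13, 9, 5],   # male
    [13, 5, 7, 16, 9],   # female
]
stat, expected = chi_square(shoes)
print(round(stat, 3))  # about 14.027
```

Summing all ten cells, rather than only the first two shown in the formula, is what produces the final value.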
37
Calculating X2
Tabulate contingency table of observed values: O
Compute row, column totals
Compute table of expected values, given row/col totals
  Assuming no association
Compute X2
44
For 2x2 Table
O:

         !ci   ci
!tk       a     b
tk        c     d

E:

         !ci             ci              Total
!tk      (a+b)(a+c)/N    (a+b)(b+d)/N    a+b
tk       (c+d)(a+c)/N    (c+d)(b+d)/N    c+d
Total    a+c             b+d             N
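As a sketch, the 2x2 case can be computed directly from the expected-value table above. It also has a well-known closed form, N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), which is not on the slide but is algebraically equivalent. The function names and the counts below are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for the 2x2 table [[a, b], [c, d]] via expected counts."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,  # !tk, !ci
        (a + b) * (b + d) / n,  # !tk,  ci
        (c + d) * (a + c) / n,  #  tk, !ci
        (c + d) * (b + d) / n,  #  tk,  ci
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_2x2_shortcut(a, b, c, d):
    """Closed form for a 2x2 table; equivalent to the cell-by-cell sum."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical document counts for one term tk and one class ci
a, b, c, d = 40, 10, 5, 45
full = chi_square_2x2(a, b, c, d)
short = chi_square_2x2_shortcut(a, b, c, d)
```

The shortcut avoids building the expected table, which is convenient when scoring every term in a large vocabulary.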
52
X2 Test
Test whether random variables are independent
Null hypothesis: 2 R.V.s are independent
Compute X2 statistic: $X^2 = \sum (O - E)^2 / E$
Compute degrees of freedom
  df = (# rows - 1)(# cols - 1)
  Shoe example: df = (2-1)(5-1) = 4
Test probability of X2 statistic value
  X2 table
If probability is low (below some significance level)
  Can reject null hypothesis
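Instead of a printed X2 table, the tail probability can be computed directly. The sketch below uses a closed form that holds only for an even number of degrees of freedom, which covers the shoe example (df = 4). The function name `chi2_sf_even_df` and the 0.05 significance level are assumptions made for illustration.

```python
import math

def chi2_sf_even_df(x, df):
    """P(X2 >= x) for a chi-square distribution with an even number of df.

    For df = 2k the tail has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only covers positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

rows, cols = 2, 5                      # shoe example: gender x shoe type
df = (rows - 1) * (cols - 1)           # = 4
p_value = chi2_sf_even_df(14.027, df)  # statistic computed from the shoe table
reject_null = p_value < 0.05           # 0.05 is a conventional, assumed level
```

For the shoe data the probability comes out well below 0.05, so under that assumed level the null hypothesis of independence would be rejected.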
56
Requirements for X2 Test
Events assumed independent, same distribution
Outcomes must be mutually exclusive
Raw frequencies, not percentages
Sufficient values per cell: > 5
60
X2 Example
Shared Task Evaluation:
  Topic Detection and Tracking (aka TDT)
Sub-task: Topic Tracking Task
  Given a small number of exemplar documents (1-4)
  Define a topic
  Create a model that allows tracking of the topic
    I.e. find all subsequent documents on this topic
Exemplars: 1-4 newswire articles
  300-600 words each
62
Challenges
Many news articles look alike
  Create a profile (feature representation)
    Highlights terms strongly associated with current topic
    Differentiate from all other topics
Not all documents labeled
  Only a small subset belong to topics of interest
  Differentiate from other topics AND ‘background’
68
Approach
X2 feature selection:
  Assume terms have binary representation
  Positive class term occurrences from exemplar docs
  Negative class term occurrences from other class exemplars, ‘earlier’ uncategorized docs
  Compute X2 for terms
  Retain terms with highest X2 scores
    Keep top N terms
Create one feature set per topic to be tracked
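The selection step can be sketched as follows, assuming each document is a list of tokens reduced to a set (the binary representation). The function names and the toy documents are hypothetical, and the 2x2 closed form N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)) stands in for the cell-by-cell computation.

```python
def chi2_term_score(in_pos, in_neg, n_pos, n_neg):
    """2x2 chi-square between a binary term and the positive class.

    in_pos / in_neg: number of positive / negative docs containing the term.
    """
    a = n_neg - in_neg   # term absent,  negative class
    b = n_pos - in_pos   # term absent,  positive class
    c = in_neg           # term present, negative class
    d = in_pos           # term present, positive class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_top_terms(pos_docs, neg_docs, top_n):
    """Rank terms by chi-square and keep the top_n highest-scoring ones."""
    pos_sets = [set(doc) for doc in pos_docs]   # binary representation
    neg_sets = [set(doc) for doc in neg_docs]
    vocab = set().union(*pos_sets, *neg_sets)
    scores = {
        term: chi2_term_score(
            sum(term in doc for doc in pos_sets),
            sum(term in doc for doc in neg_sets),
            len(pos_sets), len(neg_sets),
        )
        for term in vocab
    }
    # Highest score first; ties broken alphabetically for reproducibility
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Toy data: 2 exemplar docs for the topic, 4 background docs
pos = ["quake hits city".split(), "quake relief effort".split()]
neg = ["city council vote".split(), "team wins game".split(),
       "relief pitcher wins".split(), "council budget vote".split()]
top = select_top_terms(pos, neg, 1)
```

Because X2 does not test direction, a term that is strongly associated with the negative class also scores highly. The toy counts here are far too small to satisfy the cell-size requirement and are only meant to show the mechanics.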
73
Tracking Approach
Build vector space model
Feature weighting: tf*idf
  with some modifications
Distance measure: Cosine similarity
Select documents scoring above threshold
  For each topic
Result: Improved retrieval
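A minimal sketch of the tracking step, assuming documents and the topic profile are already weighted sparse vectors stored as dicts. The weights, the threshold of 0.5, and the function names are made up for illustration; the lecture's unspecified modifications to tf*idf are not modeled.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

def track(topic_vector, doc_vectors, threshold):
    """Return indices of documents whose similarity to the topic exceeds threshold."""
    return [
        i for i, doc in enumerate(doc_vectors)
        if cosine_similarity(topic_vector, doc) > threshold
    ]

topic = {"quake": 2.0, "relief": 1.0}
stream = [
    {"quake": 1.0, "relief": 0.5},    # same direction as the topic vector
    {"budget": 1.5, "vote": 1.0},     # no shared terms
    {"quake": 0.5, "budget": 2.0},    # weak overlap
]
on_topic = track(topic, stream, threshold=0.5)
```

In a multi-topic setting, `track` would be called once per topic with that topic's own feature set and vector.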
74
HW #4
Topic: Feature Selection for kNN
Build a kNN classifier using:
  Euclidean distance, Cosine Similarity
Write a program to compute X2 on a data set
Use X2 at different significance levels to filter
Compare the effects of different feature filtering on kNN classification
75
Maximum Entropy
76
Maximum Entropy
“MaxEnt”:
  Popular machine learning technique for NLP
  First uses in NLP circa 1996 (Rosenfeld, Berger)
  Applied to a wide range of tasks
    Sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.
79
Readings & Comments
Several readings:
  (Berger, 1996), (Ratnaparkhi, 1997)
  (Klein & Manning, 2003): Tutorial
Note: Some of these are very ‘dense’
  Don’t spend huge amounts of time on every detail
  Take a first pass before class, review after lecture
Going forward: Techniques more complex
  Goal: Understand basic model, concepts
  Training esp. complex: we’ll discuss, but not implement
80
Notation Note
Not entirely consistent:
  We’ll use: input = x; output = y; pair = (x,y)
    Consistent with Berger, 1996
  Ratnaparkhi, 1996: input = h; output = t; pair = (h,t)
  Klein/Manning, ‘03: input = d; output = c; pair = (c,d)
86
Joint vs Conditional Models
Assuming some training data {(x,y)}, need to learn a model Θ s.t. given a new x, can predict label y.
Different types of models:
  Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
    Most models so far: n-gram, Naïve Bayes, HMM, etc.
    Conceptually easy to compute weights: relative frequency
  Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
    Models going forward: MaxEnt, SVM, CRF, …
    Computing weights more complex
Naïve Bayes Model
Naïve Bayes Model assumes features f are independent of each other, given the class C
[Figure: graphical model with class node c and feature nodes f1, f2, f3, …, fk]
92
Naïve Bayes Model
Makes assumption of conditional independence of features given class
However, this is generally unrealistic
  P(“cuts”|politics) = p_cuts
  What about P(“cuts”|politics, “budget”) ?= p_cuts
Would like a model that doesn’t assume conditional independence
Model Parameters
Our model:
  $c^* = \arg\max_c P(c) \prod_j P(f_j \mid c)$
Types of parameters: two
  P(C): Class priors
  P(f_j|c): Class conditional feature probabilities
Parameters in total: |C| + |V||C|, if features are words in vocabulary V
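The decision rule can be sketched in log space, which avoids underflow when many probabilities are multiplied. The classes, words, and probability values below are invented for illustration, and `nb_predict` is a hypothetical helper name.

```python
import math

def nb_predict(features, priors, cond_probs):
    """Return argmax_c P(c) * prod_j P(f_j | c), computed in log space."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior) + sum(math.log(cond_probs[c][f]) for f in features)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical parameters: |C| = 2 priors plus |V|*|C| = 3*2 conditionals
priors = {"politics": 0.5, "sports": 0.5}
cond_probs = {
    "politics": {"cuts": 0.5, "budget": 0.4, "game": 0.1},
    "sports":   {"cuts": 0.2, "budget": 0.1, "game": 0.7},
}
n_params = len(priors) + sum(len(p) for p in cond_probs.values())
label = nb_predict(["cuts", "budget"], priors, cond_probs)
```

Here the product for politics is 0.5 * 0.5 * 0.4 = 0.1 against 0.5 * 0.2 * 0.1 = 0.01 for sports, so politics is chosen.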
94
Weights in Naïve Bayes
         c1            c2          c3          …    ck
f1       P(f1|c1)      P(f1|c2)    P(f1|c3)    …    P(f1|ck)
f2       P(f2|c1)      P(f2|c2)    …
…        …
f|V|     P(f|V||c1)    …
95
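Weights in Naïve Bayes and Maximum Entropy
Naïve Bayes:
  P(f|y) are probabilities in [0,1], used as weights
  $P(y \mid x) = \frac{P(y) \prod_j P(f_j \mid y)}{\sum_{y'} P(y') \prod_j P(f_j \mid y')}$
MaxEnt:
  Weights are real numbers; any magnitude, sign
  $P(y \mid x) = \frac{\exp \sum_j \lambda_j f_j(x, y)}{\sum_{y'} \exp \sum_j \lambda_j f_j(x, y')}$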
103
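MaxEnt Overview
Prediction:
  $P(y \mid x) = \frac{\exp \sum_j \lambda_j f_j(x, y)}{\sum_{y'} \exp \sum_j \lambda_j f_j(x, y')}$
  f_j(x,y): binary feature function, indicating presence of feature in instance x of class y
  λ_j: feature weights, learned in training
Prediction: Compute P(y|x), pick the y with the highest probability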
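Prediction with a trained model can be sketched as an exponentiated weighted sum of feature functions, normalized over the classes. The feature functions and the λ values below are invented for illustration; in practice the weights come from training, which is not shown.

```python
import math

def maxent_probs(x, classes, feature_fns, weights):
    """P(y|x) = exp(sum_j lambda_j * f_j(x, y)) / Z(x) for every y in classes."""
    scores = {
        y: sum(lam * f(x, y) for lam, f in zip(weights, feature_fns))
        for y in classes
    }
    shift = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {y: math.exp(s - shift) for y, s in scores.items()}
    z = sum(exp_scores.values())  # normalizer Z(x)
    return {y: e / z for y, e in exp_scores.items()}

# Hypothetical binary feature functions f_j(x, y) over a bag of words x
feature_fns = [
    lambda x, y: 1 if "budget" in x and y == "politics" else 0,
    lambda x, y: 1 if "game" in x and y == "sports" else 0,
    lambda x, y: 1 if "cuts" in x and y == "sports" else 0,
]
weights = [1.2, 2.0, -0.7]  # made-up lambda_j; real ones come from training

probs = maxent_probs({"budget", "cuts"}, ["politics", "sports"], feature_fns, weights)
prediction = max(probs, key=probs.get)
```

Unlike the Naïve Bayes conditionals, the weights here are unconstrained reals, so a negative λ such as -0.7 simply lowers the score of the class it fires for.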
104
Weights in MaxEnt
         c1    c2    c3    …    ck
f1       λ1    λ8    …
f2       λ2    …
…        …
f|V|     λ6
109
Maximum Entropy Principle
Intuitively, model all that is known, and assume as little as possible about what is unknown
Maximum entropy = minimum commitment
Related to concepts like Occam’s razor
Laplace’s “Principle of Insufficient Reason”:
  When one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely
114
Example I: (K&M 2003)
Consider a coin flip
  $H(X) = -\sum_x P(X = x) \log P(X = x)$
What values of P(X=H), P(X=T) maximize H(X)?
  P(X=H) = P(X=T) = 1/2
  If no prior information, best guess is fair coin
What if you know P(X=H) = 0.3?
  P(X=T) = 0.7
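The coin example is easy to check numerically. This sketch scans P(X=H) over a grid of 101 values and measures entropy in bits (log base 2, an assumed choice of unit); the names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Scan P(X=H) over a grid and find where H(X) peaks
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: entropy([p, 1 - p]))

h_fair = entropy([0.5, 0.5])      # 1 bit
h_biased = entropy([0.3, 0.7])    # lower: the constraint removes uncertainty
```

Once P(X=H) = 0.3 is known there is nothing left to choose: the only distribution consistent with the constraint is (0.3, 0.7), and its entropy is below that of the fair coin.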
117
Example II: MT (Berger, 1996)
Task: English → French machine translation
  Specifically, translating ‘in’
Suppose we’ve seen in translated as: {dans, en, à, au cours de, pendant}
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
If no other constraint, what is maxent model?
  p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
123
Example II: MT (Berger, 1996)
What if we find out that the translator uses dans or en 30% of the time?
  Constraint: p(dans) + p(en) = 3/10
Now what is maxent model?
  p(dans) = p(en) = 3/20
  p(à) = p(au cours de) = p(pendant) = 7/30
What if we also know the translator picks à or dans 50% of the time?
  Add new constraint: p(à) + p(dans) = 0.5
  Now what is maxent model?
Not intuitively obvious…
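Although the answer is not obvious by inspection, this small case can be solved numerically. The sketch below is one possible approach, not the training procedure used for MaxEnt models: the two constraints leave p(dans) as the only free choice once the leftover mass is split evenly between the two unconstrained outcomes, so a one-dimensional ternary search over p(dans) suffices. All names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log); zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def distribution(dans):
    """Distribution satisfying both constraints for a given p(dans).

    p(en) = 0.3 - p(dans) and p(a) = 0.5 - p(dans) follow from the constraints;
    the leftover mass is split evenly between 'au cours de' and 'pendant',
    which is the maximum-entropy choice because nothing distinguishes them.
    """
    en = 0.3 - dans
    a = 0.5 - dans
    rest = (1.0 - dans - en - a) / 2.0
    return {"dans": dans, "en": en, "à": a, "au cours de": rest, "pendant": rest}

def solve(tol=1e-12):
    """Ternary search over p(dans) in [0, 0.3]; valid since entropy is concave."""
    lo, hi = 0.0, 0.3
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if entropy(distribution(m1).values()) < entropy(distribution(m2).values()):
            lo = m1
        else:
            hi = m2
    return distribution((lo + hi) / 2)

model = solve()
# Roughly: dans 0.186, en 0.114, à 0.314, au cours de 0.193, pendant 0.193
```

The result is no longer uniform within any group, which is why general constraint sets call for an iterative training algorithm rather than hand calculation.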
124
125
Example III: POS (K&M, 2003)
130
Example III
Problem: Too uniform
What else do we know?
  Nouns more common than verbs
    So f_N = {NN, NNS, NNP, NNPS}, and E[f_N] = 32/36
  Also, proper nouns more frequent than common, so E[NNP, NNPS] = 24/36
  Etc.