Linear Models (I)
Rong Jin
Review of Information Theory
• What is information? What is entropy?
• Entropy: average information, minimum coding length
$$H(P) = \sum_i p_i \log \frac{1}{p_i}$$
• Important inequality
$$H(P) = \sum_i p_i \log \frac{1}{p_i} \le \sum_i p_i \log \frac{1}{q_i}$$
where $p$ is the distribution for generating symbols and $q$ is the distribution for coding symbols.
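The inequality between the entropy and the coding length under a mismatched coding distribution can be checked numerically. The following is a minimal sketch; the function names and the two example distributions are illustrative assumptions, not taken from the slides.

```python
import math

def entropy(p):
    """Average code length in bits when symbols drawn from p are coded with p."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average code length in bits when symbols drawn from p are coded with q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # generating distribution (illustrative values)
q = [0.25, 0.25, 0.5]   # coding distribution (illustrative values)

h_p = entropy(p)             # 1.5 bits
h_pq = cross_entropy(p, q)   # 1.75 bits, never smaller than h_p
```

Coding with the wrong distribution costs 0.25 extra bits per symbol here; the gap vanishes only when q equals p.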
Review of Information Theory (cont'd)
• Mutual information: measures the correlation between two random variables; symmetric
$$I(X;Y) = H(X) - H(X|Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}$$
• Kullback-Leibler distance: difference between two distributions
$$KL(P_D \| P_M) = \sum_x P_D(x) \log \frac{P_D(x)}{P_M(x)} = E_{x \sim P_D}\left[\log \frac{P_D(x)}{P_M(x)}\right]$$
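Both quantities are short sums over a probability table. The sketch below is only an illustration: the function names and the toy distributions are assumptions, and the joint distribution is represented as a nested list indexed as joint[x][y].

```python
import math

def kl(p, q):
    """KL(p || q) in bits; p plays the role of P_D and q the role of P_M."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) in bits for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]         # marginal P(x)
    py = [sum(col) for col in zip(*joint)]   # marginal P(y)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y unrelated
dependent = [[0.5, 0.0], [0.0, 0.5]]         # Y determined by X

mi_independent = mutual_information(independent)   # 0 bits
mi_dependent = mutual_information(dependent)       # 1 bit

d_pq = kl([0.5, 0.5], [0.25, 0.75])
d_qp = kl([0.25, 0.75], [0.5, 0.5])   # differs from d_pq: KL is not symmetric
```

Mutual information is unchanged when the table is transposed, whereas swapping the arguments of the KL distance generally changes its value.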
Outline
• Classification problems
• Information theory for text classification
• Gaussian generative
• Naïve Bayes
• Logistic regression
Classification Problems
$f: X \rightarrow Y$ (input $X$, output $Y$)
• Given input X = {x1, x2, …, xm}
• Predict the class label y
• y ∈ {-1, 1}: binary classification problems
• y ∈ {1, 2, 3, …, c}: multi-class classification problems
• Goal: learn the function $f: X \rightarrow Y$
Examples of Classification Problems
• Text categorization
  Input features: words ‘campaigning’, ‘efforts’, ‘Iowa’, ‘Democrats’, …
  Class label: ‘politics’ and ‘non-politics’
  Doc: Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …
  Topic: politics
• Image classification
  Input features: color histogram, texture distribution, edge distribution, …
  Class label: ‘bird image’ and ‘non-bird image’
  Which is a bird image?
Learning Setup for Classification Problems
• Training examples: $D_{train} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
• Independent and identically distributed (i.i.d.): training examples are similar to testing examples
• Goal: find a model or a function that is consistent with the training data
Information Theory for Text Classification
• If the coding distribution is similar to the generating distribution, the coding length is short and the compression rate is good
$$H(P) = \sum_i p_i \log \frac{1}{p_i} \le \sum_i p_i \log \frac{1}{q_i}$$
where $p$ is the distribution for generating symbols and $q$ is the distribution for coding symbols.
Compression Algorithm for TC
• Compression model M1: Politics
• Compression model M2: Sports
• New document: 16K bits under M1, 10K bits under M2
• Topic: Sports (the model with the shorter coding length)
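The idea of classifying by coding length can be sketched with one simple compression model per topic. The slides do not say which compression model is used; the add-one-smoothed unigram model, the function names, and the toy documents below are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_unigram(docs):
    """Word counts for one topic; each document is a whitespace-separated string."""
    return Counter(w for d in docs for w in d.lower().split())

def coding_length(doc, counts, vocab_size):
    """Bits needed to code doc with an add-one-smoothed unigram model."""
    total = sum(counts.values())
    bits = 0.0
    for w in doc.lower().split():
        q = (counts[w] + 1) / (total + vocab_size)   # coding probability of w
        bits += math.log2(1.0 / q)
    return bits

def classify(doc, models):
    """Return the topic whose model codes doc with the fewest bits."""
    vocab = set().union(*(m.keys() for m in models.values()))
    vocab |= set(doc.lower().split())
    lengths = {t: coding_length(doc, m, len(vocab)) for t, m in models.items()}
    return min(lengths, key=lengths.get), lengths

# Toy training documents (hypothetical).
models = {
    "politics": train_unigram(["campaign vote election senate",
                               "democrats campaign in iowa"]),
    "sports": train_unigram(["team wins the game",
                             "coach praises team after game"]),
}
topic, lengths = classify("the team wins another game", models)
```

Here the sports model assigns higher probability to the words of the new document, so its coding length is shorter and the document is labeled as sports.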
Probabilistic Models for Classification Problems
• Apply statistical inference methods
• Key: finding the best parameters $\theta$
• Maximum likelihood estimation (MLE) approach
  Log-likelihood of data:
$$l(D_{train}) = \sum_{i=1}^{n} \log p(y_i | x_i; \theta)$$
  Find the parameters $\theta$ that maximize the log-likelihood:
$$\theta^* = \arg\max_{\theta} l(D_{train}) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i | x_i; \theta)$$
• Training examples $\{(x_i, y_i)\}$ → learning a statistical model $p(y|x;\theta)$ → prediction
Generative Models
• Do not estimate $p(y|x;\theta)$ directly
• Use Bayes' rule:
$$p(y|x;\theta) = \frac{p(y;\theta)\, p(x|y;\theta)}{\sum_{y'} p(y', x;\theta)}$$
• Estimate $p(x|y;\theta)$ instead of $p(y|x;\theta)$
• Why $p(x|y;\theta)$?
  Most well-known distributions are of the form $p(x|\theta)$
  Allocate a separate set of parameters for each class: $\theta = \{\theta_1, \theta_2, \ldots, \theta_c\}$
  $p(x|y;\theta) \rightarrow p(x|\theta_y)$: describes the special input patterns for each class $y$
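Gaussian Generative Model (I)
• Assume a Gaussian model for each class
• One-dimensional case:
$$p(x|y;\theta) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)$$
$$\theta = \{\theta_1, \theta_2, \ldots, \theta_c\}, \quad \theta_k = \{\mu_k, \sigma_k, p(y=k)\}$$
• Results for MLE:
$$\mu_k = \frac{\sum_{i: y_i = k} x_i}{|\{i \mid y_i = k\}|}, \quad \sigma_k^2 = \frac{\sum_{i: y_i = k} (x_i - \mu_k)^2}{|\{i \mid y_i = k\}|}, \quad p(y=k) = \frac{|\{i \mid y_i = k\}|}{n}$$
• Prediction by Bayes' rule:
$$p(y|x;\theta) = \frac{p(y;\theta)\, p(x|y;\theta)}{\sum_{y'} p(y', x;\theta)}$$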
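The MLE results for the one-dimensional case amount to a per-class mean, a per-class variance, and a class frequency. A minimal sketch follows; the function name and the six toy height values are assumptions for illustration.

```python
from collections import defaultdict

def fit_gaussian_classes(xs, ys):
    """MLE of (mu_k, sigma_k^2, p(y=k)) for a one-dimensional Gaussian per class."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n = len(xs)
    params = {}
    for k, vals in groups.items():
        mu = sum(vals) / len(vals)
        # The MLE variance divides by the class count, not by count - 1.
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        params[k] = {"mu": mu, "var": var, "prior": len(vals) / n}
    return params

# Toy data (hypothetical heights in metres).
xs = [1.6, 1.8, 1.7, 1.5, 1.4, 1.6]
ys = ["male", "male", "male", "female", "female", "female"]
params = fit_gaussian_classes(xs, ys)
```

Each class is fitted independently of the others; only the prior p(y=k) couples the classes through the total count n.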
Example
• Height histogram for males and females.
• Using Gaussian generative model:
  $\mu_{male} = 1.7,\ \sigma_{male} = 0.1,\ p_{male} = 0.5$
  $\mu_{female} = 1.5,\ \sigma_{female} = 0.2,\ p_{female} = 0.5$
• P(male|1.8) = ? , P(female|1.4) = ?
[Figure: height histogram, x-axis 1.1 to 2, y-axis 0 to 40. Legend: empirical data for male, fitted distribution for male, empirical data for female, fitted distribution for female.]
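The two posteriors asked for in the example follow from Bayes' rule applied to the fitted Gaussians. The sketch below assumes that the second number given for each class (0.1 and 0.2) is a standard deviation rather than a variance; the function names are illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian with mean mu and std. dev. sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# (mu, sigma, prior) per class; sigma is ASSUMED to be a standard deviation.
params = {"male": (1.7, 0.1, 0.5), "female": (1.5, 0.2, 0.5)}

def posterior(label, x):
    """p(label | x) by Bayes' rule over the classes in params."""
    joint = {k: prior * gaussian_pdf(x, mu, sigma)
             for k, (mu, sigma, prior) in params.items()}
    return joint[label] / sum(joint.values())

p_male = posterior("male", 1.8)       # about 0.79
p_female = posterior("female", 1.4)   # about 0.98
```

Under this reading of the parameters, a height of 1.8 is attributed to the male class with probability of roughly 0.79, and a height of 1.4 to the female class with probability of roughly 0.98.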
Gaussian Generative Model (II)
• Consider multiple input features: X = {x1, x2, …, xm}
• Multivariate Gaussian distribution:
$$p(x|y;\theta) \sim N(\mu_y, \Sigma_y) = \frac{1}{(2\pi)^{m/2} |\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_y)^T \Sigma_y^{-1} (x-\mu_y)\right)$$
$$\theta = (\mu_1, \Sigma_1, p(y=1), \ldots, \mu_c, \Sigma_c, p(y=c))$$
• $\Sigma_y$ is an $m \times m$ covariance matrix
• Results for MLE:
$$\mu_y = \frac{1}{|\{i \mid y_i = y\}|} \sum_{i: y_i = y} x_i$$
$$[\Sigma_y]_{s,t} = \frac{1}{|\{i \mid y_i = y\}|} \sum_{i: y_i = y} (x_{i,s} - \mu_{y,s})(x_{i,t} - \mu_{y,t})$$
• Problem: singularity of $\Sigma_y$; too many parameters
Overfitting Issue
• Complex model, insufficient training data
• Consider a classification problem with multiple inputs: 100 input features, 5 classes, 1000 training examples
• Total number of parameters for a full Gaussian model:
  5 means: 500 parameters
  5 covariance matrices: 50,000 parameters
  50,500 parameters in total: insufficient training data
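The parameter count above is simple arithmetic and can be reproduced in a few lines. The function names are illustrative assumptions; the second function adds a side note that is not on the slide, namely the number of distinct entries once the symmetry of each covariance matrix is taken into account.

```python
def full_gaussian_param_count(m, c):
    """Count parameters as on the slide: every covariance entry counts."""
    means = c * m
    covariances = c * m * m
    return means, covariances, means + covariances

def distinct_covariance_entries(m, c):
    """Distinct covariance entries when symmetry is exploited (side note)."""
    return c * m * (m + 1) // 2

means, covariances, total = full_gaussian_param_count(m=100, c=5)
# means = 500, covariances = 50000, total = 50500
distinct = distinct_covariance_entries(m=100, c=5)   # 25250
```

Even with symmetry taken into account, the model has far more parameters than the 1000 training examples can support.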
Another Example of Overfitting
[Figure: sequence of four plots; x-axis from about -6 to 6 (last plot -8 to 8), y-axis from about -1 to 1.]
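Naïve Bayes
• Simplify the model complexity: diagonalize the covariance matrix $\Sigma_y$
• Simplified Gaussian distribution:
$$p(x|y;\theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right) = \prod_{i=1}^{m} p(x_i|y;\theta)$$
$$\theta = \{\theta_1, \theta_2, \ldots, \theta_m\}, \quad \theta_i = \{\mu_{1,i}, \sigma_{1,i}, \ldots, \mu_{c,i}, \sigma_{c,i}\}$$
• Feature independence assumption (Naïve Bayes assumption)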
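With a diagonal covariance matrix, the log-likelihood of an input is a sum of independent one-dimensional terms, and the parameter count drops sharply. The sketch below illustrates both points; the function names and the example numbers are assumptions, and class priors are left out of the count.

```python
import math

def log_gaussian(x, mu, sigma):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def naive_bayes_log_likelihood(x, mus, sigmas):
    """log p(x | y) for one class under the feature independence assumption."""
    return sum(log_gaussian(xi, mu, s) for xi, mu, s in zip(x, mus, sigmas))

def naive_bayes_param_count(m, c):
    """One mean and one variance per feature and class (priors excluded)."""
    return 2 * c * m

ll = naive_bayes_log_likelihood([1.0, 2.0], mus=[0.0, 2.5], sigmas=[1.0, 0.5])
n_params = naive_bayes_param_count(m=100, c=5)   # 1000
```

For 100 features and 5 classes this gives 1000 parameters, compared with the 50,500 of the full Gaussian model.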
Naïve Bayes
• A terrible estimator for $p(x|y;\theta)$
• But it is a very reasonable estimator for $p(y|x;\theta)$
• Why? The ratio of likelihoods is more important:
$$p(y|x;\theta) = \frac{p(y;\theta)\, p(x|y;\theta)}{\sum_{y'=1}^{c} p(y';\theta)\, p(x|y';\theta)} = \frac{1}{\sum_{y'=1}^{c} \dfrac{p(y';\theta)\, p(x|y';\theta)}{p(y;\theta)\, p(x|y;\theta)}}$$
• Naïve Bayes does a reasonable job of estimating the ratio
$$\frac{p(x|y';\theta)}{p(x|y;\theta)}$$
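The Ratio of Likelihood
• Binary class: $y \in \{-1, +1\}$
• Both classes share the same variance: $\sigma_{+1,i} = \sigma_{-1,i} = \sigma_i$
$$\log \frac{p(y=1|x)}{p(y=-1|x)} = \log \frac{p(x|y=1)}{p(x|y=-1)} + \log \frac{p(y=1)}{p(y=-1)}$$
$$= \sum_{i=1}^{m} \frac{(x_i - \mu_{-1,i})^2 - (x_i - \mu_{+1,i})^2}{2\sigma_i^2} + \log \frac{p(y=1)}{p(y=-1)}$$
$$= \sum_{i=1}^{m} \frac{\mu_{+1,i} - \mu_{-1,i}}{\sigma_i^2}\, x_i + \sum_{i=1}^{m} \frac{\mu_{-1,i}^2 - \mu_{+1,i}^2}{2\sigma_i^2} + \log \frac{p(y=1)}{p(y=-1)}$$
$$= w^T x + c, \quad w_i = \frac{\mu_{+1,i} - \mu_{-1,i}}{\sigma_i^2}, \quad c = \sum_{i=1}^{m} \frac{\mu_{-1,i}^2 - \mu_{+1,i}^2}{2\sigma_i^2} + \log \frac{p(y=1)}{p(y=-1)}$$
• A linear model!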
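The claim that the log-odds are linear in x when both classes share per-feature variances can be checked numerically. In the sketch below the weight for feature i is (mean of class +1 minus mean of class -1) divided by the shared variance, and the bias collects the remaining constant terms; all parameter values and names are made up for illustration.

```python
import math

def log_gaussian(x, mu, sigma):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Hypothetical parameters: per-feature means for each class, shared std. devs.
mu_pos = [1.0, -0.5, 2.0]
mu_neg = [0.0, 0.5, 1.0]
sigma = [1.0, 2.0, 0.5]
prior_pos, prior_neg = 0.3, 0.7

# Linear form: weights and bias derived from the Gaussian parameters.
w = [(a - b) / s ** 2 for a, b, s in zip(mu_pos, mu_neg, sigma)]
bias = sum((b ** 2 - a ** 2) / (2 * s ** 2) for a, b, s in zip(mu_pos, mu_neg, sigma))
bias += math.log(prior_pos / prior_neg)

def log_odds_direct(x):
    """log p(y=1|x) - log p(y=-1|x) computed from the class-conditional densities."""
    lp = math.log(prior_pos) + sum(log_gaussian(xi, m, s) for xi, m, s in zip(x, mu_pos, sigma))
    ln = math.log(prior_neg) + sum(log_gaussian(xi, m, s) for xi, m, s in zip(x, mu_neg, sigma))
    return lp - ln

def log_odds_linear(x):
    """The same quantity written as w . x + bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + bias
```

The two functions agree for any input because the quadratic terms in x and the normalization constants cancel when the variances are shared.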
Decision Boundary
[Figure: height histogram, x-axis 1.1 to 2, y-axis 0 to 40. Legend: empirical data for male, fitted distribution for male, empirical data for female, fitted distribution for female.]
• Gaussian Generative Models == Finding a linear decision boundary
• Why not do it directly?