Linear Models (I)


Page 1: Linear Models (I)

Linear Models (I)

Rong Jin

Page 2: Linear Models (I)

Review of Information Theory

• What is information? What is entropy?
• Average information
• Minimum coding length
• Important inequality

$$H(P) = \sum_i p_i \log \frac{1}{p_i}$$

$$H(P) = \sum_i p_i \log \frac{1}{p_i} \;\le\; \sum_i p_i \log \frac{1}{q_i}$$

where $p$ is the distribution for generating symbols and $q$ is the distribution for coding symbols.
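A quick numerical check of this inequality (a minimal Python sketch; the distributions p and q below are made up for illustration):

```python
import numpy as np

def entropy(p):
    """Average information H(P) = sum_i p_i * log2(1 / p_i), in bits."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log2(1.0 / p)))

def coding_length(p, q):
    """Expected code length when symbols generated by p are coded with q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(1.0 / q)))

p = [0.5, 0.25, 0.25]   # distribution for generating symbols
q = [0.25, 0.5, 0.25]   # distribution for coding symbols

print(entropy(p))           # 1.5 bits: the minimum average coding length
print(coding_length(p, q))  # 1.75 bits: coding with q != p costs at least H(P)
```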

Page 3: Linear Models (I)

Review of Information Theory (cont’d)

• Mutual information: measures the correlation between two random variables; symmetric.
• Kullback-Leibler distance: the difference between two distributions.

$$I(X;Y) = H(X) - H(X \mid Y) = \sum_{x,y} P(x,y)\,\log \frac{P(x,y)}{P(x)\,P(y)}$$

$$KL(P_D \,\|\, P_M) = \sum_x P_D(x)\,\log \frac{P_D(x)}{P_M(x)} = E_{x\sim P_D}\!\left[\log \frac{P_D(x)}{P_M(x)}\right]$$
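Both quantities are straightforward to compute from a probability table; a minimal sketch with a hypothetical joint distribution:

```python
import numpy as np

# Hypothetical joint distribution P(x, y) over two binary variables.
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

# Mutual information I(X;Y) = sum_{x,y} P(x,y) log [P(x,y) / (P(x) P(y))]
I = sum(P_xy[i, j] * np.log2(P_xy[i, j] / (P_x[i] * P_y[j]))
        for i in range(2) for j in range(2))

# KL distance between two distributions P_D and P_M.
P_D, P_M = np.array([0.7, 0.3]), np.array([0.5, 0.5])
KL = float(np.sum(P_D * np.log2(P_D / P_M)))

print(I, KL)  # both non-negative; I(X;Y) is symmetric, KL(P_D || P_M) is not
```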

Page 4: Linear Models (I)

Outline

• Classification problems
• Information theory for text classification
• Gaussian generative model
• Naïve Bayes
• Logistic regression

Page 5: Linear Models (I)

Classification Problems

X (input) → f → Y (output)

• Given input X = {x_1, x_2, …, x_m}
• Predict the class label y
• y ∈ {-1, 1}: binary classification problems
• y ∈ {1, 2, 3, …, c}: multi-class classification problems
• Goal: learn the function f: X → Y

Page 6: Linear Models (I)

Examples of Classification Problems

Text categorization:
• Input features: words ‘campaigning’, ‘efforts’, ‘Iowa’, ‘Democrats’, …
• Class label: ‘politics’ and ‘non-politics’

Image classification:
• Input features: color histogram, texture distribution, edge distribution, …
• Class label: ‘bird image’ and ‘non-bird image’

Doc: Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …

Topic: politics

Which is a bird image?

Page 7: Linear Models (I)

Learning Setup for Classification Problems

Training examples:

$$D_{train} = \{\langle x_1, y_1\rangle, \langle x_2, y_2\rangle, \ldots, \langle x_n, y_n\rangle\}$$

• Independent and Identically Distributed (i.i.d.): training examples are similar to testing examples
• Goal: find a model or a function that is consistent with the training data

Page 8: Linear Models (I)

Information Theory for Text Classification

If the coding distribution is similar to the generating distribution → short coding length → good compression rate

$$H(P) = \sum_i p_i \log \frac{1}{p_i} \;\le\; \sum_i p_i \log \frac{1}{q_i}$$

where $p$ is the distribution for generating symbols and $q$ is the distribution for coding symbols.

Page 9: Linear Models (I)

Compression Algorithm for TC

• Compression model M1: built from the ‘Politics’ documents
• Compression model M2: built from the ‘Sports’ documents
• Compress the new document with both models: M1 → 16K bits, M2 → 10K bits
• The shorter code length wins → Topic: Sports
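The slide does not spell out the compression models; a minimal sketch, assuming each class model is a smoothed unigram word distribution and the code length of a document is its negative log-probability (the word lists below are hypothetical):

```python
import math
from collections import Counter

def unigram_model(docs, vocab, alpha=1.0):
    """A per-class 'compression model': smoothed unigram word probabilities."""
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def code_length_bits(doc, model):
    """Bits needed to code the document with this model: -sum_w log2 p(w)."""
    return -sum(math.log2(model[w]) for w in doc.split())

politics_docs = ["campaign vote iowa democrats", "vote senate campaign"]
sports_docs   = ["game score team win", "team season game"]
vocab = {w for d in politics_docs + sports_docs for w in d.split()}

m1 = unigram_model(politics_docs, vocab)  # compression model M1 (politics)
m2 = unigram_model(sports_docs, vocab)    # compression model M2 (sports)

new_doc = "team win team game"
bits = {"politics": code_length_bits(new_doc, m1),
        "sports":   code_length_bits(new_doc, m2)}
print(bits, "->", min(bits, key=bits.get))  # the shorter code length wins
```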

Page 10: Linear Models (I)

Probabilistic Models for Classification Problems

• Apply statistical inference methods
• Key: finding the best parameters θ
• Maximum likelihood (MLE) approach
• Log-likelihood of the data
• Find the parameters that maximize the log-likelihood

$$l(D_{train}) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)$$

$$\theta^{*} = \arg\max_{\theta}\; l(D_{train}) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)$$

Training examples $\{\langle x_i, y_i\rangle\}$ → learning a statistical model → prediction $p(y \mid x; \theta)$
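A minimal sketch of MLE by direct search, for a toy one-parameter model p(y=1|x;θ) = σ(θx) (the data and model here are made up for illustration; real models are usually fit with gradient methods):

```python
import numpy as np

def log_likelihood(theta, xs, ys):
    """l(D_train) = sum_i log p(y_i | x_i; theta) for the toy model
    p(y = 1 | x; theta) = 1 / (1 + exp(-theta * x)), with y in {-1, +1}."""
    p1 = 1.0 / (1.0 + np.exp(-theta * xs))
    p = np.where(ys == 1, p1, 1.0 - p1)
    return float(np.sum(np.log(p)))

# Hypothetical training examples (x_i, y_i).
xs = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
ys = np.array([-1, -1, 1, -1, 1, 1])

# MLE by brute force: pick the theta that maximizes the log-likelihood.
grid = np.linspace(-5.0, 5.0, 201)
theta_star = max(grid, key=lambda t: log_likelihood(t, xs, ys))
print(theta_star, log_likelihood(theta_star, xs, ys))
```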

Page 11: Linear Models (I)

Generative Models

• Do not directly estimate p(y|x;θ); use Bayes’ rule
• Estimate p(x|y;θ) instead of p(y|x;θ)
• Why p(x|y;θ)? Most well-known distributions have the form p(x|θ). Allocate a separate set of parameters for each class: θ = {θ_1, θ_2, …, θ_c}
• p(x|y;θ) = p(x|θ_y) describes the characteristic input patterns of each class y

$$p(y \mid x; \theta) = \frac{p(y;\theta)\, p(x \mid y; \theta)}{p(x; \theta)}$$

Page 12: Linear Models (I)

Gaussian Generative Model (I)

• Assume a Gaussian model for each class; one-dimensional case:

$$p(x \mid y; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\exp\!\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right),\qquad y \in \{1, 2, \ldots, c\},\quad \theta = \{\mu_k, \sigma_k, p(y=k)\}_{k=1}^{c}$$

• Results for MLE:

$$\mu_k = \frac{1}{|\{i \mid y_i = k\}|}\sum_{i:\,y_i=k} x_i,\qquad \sigma_k^2 = \frac{1}{|\{i \mid y_i = k\}|}\sum_{i:\,y_i=k} (x_i - \mu_k)^2,\qquad p(y=k) = \frac{|\{i \mid y_i = k\}|}{n}$$

• Prediction via Bayes’ rule:

$$p(y \mid x; \theta) = \frac{p(y;\theta)\, p(x \mid y; \theta)}{p(x; \theta)}$$
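These MLE formulas translate directly into code; a minimal sketch (the height data below is made up):

```python
import numpy as np

def fit_gaussian_generative_1d(x, y):
    """MLE for the 1-D Gaussian generative model: per class k, the mean mu_k,
    variance sigma_k^2 (divide by n_k), and prior p(y = k)."""
    params = {}
    for k in np.unique(y):
        xk = x[y == k]
        params[k] = {"mu": xk.mean(),
                     "sigma2": ((xk - xk.mean()) ** 2).mean(),
                     "prior": len(xk) / len(x)}
    return params

# Toy height data (metres), made up for illustration.
x = np.array([1.48, 1.55, 1.60, 1.68, 1.75, 1.80])
y = np.array(["female", "female", "female", "male", "male", "male"])
print(fit_gaussian_generative_1d(x, y))
```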

Page 13: Linear Models (I)

Example

$$\mu_{male} = 1.7,\quad \sigma_{male} = 0.1,\quad p_{male} = 0.5$$

$$\mu_{female} = 1.5,\quad \sigma_{female} = 0.2,\quad p_{female} = 0.5$$

• Height histogram for males and females.

• Using Gaussian generative model

• P(male|1.8) = ? , P(female|1.4) = ?

[Figure: height histograms (1.1–2.0 m) with empirical data and fitted Gaussian distributions for male and female]
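With the parameters above, the posed questions follow directly from Bayes’ rule; a short worked computation (assuming the male/female parameter assignment as read off the slide):

```python
import numpy as np

def gauss(x, mu, sigma):
    """1-D Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

mu_m, s_m, p_m = 1.7, 0.1, 0.5   # male parameters from the slide
mu_f, s_f, p_f = 1.5, 0.2, 0.5   # female parameters from the slide

def p_male_given_height(h):
    num = p_m * gauss(h, mu_m, s_m)
    return num / (num + p_f * gauss(h, mu_f, s_f))

print(p_male_given_height(1.8))      # P(male | height = 1.8)
print(1 - p_male_given_height(1.4))  # P(female | height = 1.4)
```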

Page 14: Linear Models (I)

Gaussian Generative Model (II)

• Consider multiple input features X = {x_1, x_2, …, x_m}
• Multi-variate Gaussian distribution, where Σ_y is an m×m covariance matrix:

$$p(x \mid y; \theta) \sim N(\mu_y, \Sigma_y) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_y|^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu_y)^T \Sigma_y^{-1} (x-\mu_y)\right)$$

$$\theta = \{\mu_1, \Sigma_1, p(y=1), \ldots, \mu_c, \Sigma_c, p(y=c)\}$$

• Results for MLE:

$$\mu_y = \frac{1}{|\{i \mid y_i = y\}|}\sum_{i:\,y_i=y} x_i,\qquad \Sigma_y[s,t] = \frac{1}{|\{i \mid y_i = y\}|}\sum_{i:\,y_i=y} (x_{i,s}-\mu_{y,s})(x_{i,t}-\mu_{y,t})$$

• Problem: singularity of Σ_y; too many parameters
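The same MLE, now with a mean vector and a full m×m covariance per class; a minimal sketch on synthetic data:

```python
import numpy as np

def fit_full_gaussian(X, y):
    """MLE for the multi-variate Gaussian generative model: per class, a mean
    vector mu_y, a full covariance Sigma_y (divide by n_y), and a prior."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        diff = Xk - mu
        params[k] = {"mu": mu,
                     "Sigma": diff.T @ diff / len(Xk),
                     "prior": len(Xk) / len(X)}
    return params

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([2.0, 1.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fit_full_gaussian(X, y)[0]["Sigma"])   # roughly the identity matrix
```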

Page 15: Linear Models (I)

Overfitting Issue

• Complex model + insufficient training data
• Consider a classification problem with multiple inputs: 100 input features, 5 classes, 1000 training examples
• Total number of parameters for a full Gaussian model: 5 means → 500 parameters; 5 covariance matrices → 50,000 parameters; 50,500 parameters in total → insufficient training data
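The arithmetic behind that count (every entry of each covariance matrix is counted; exploiting symmetry would roughly halve the second term):

```python
m, c = 100, 5                # 100 input features, 5 classes
means = c * m                # 5 mean vectors        -> 500 parameters
covariances = c * m * m      # 5 covariance matrices -> 50,000 parameters
print(means + covariances)   # 50,500 parameters vs. only 1000 training examples
```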

Page 16: Linear Models (I)

Another Example of Overfitting


Page 17: Linear Models (I)

Another Example of Overfitting


Page 18: Linear Models (I)

Another Example of Overfitting


Page 19: Linear Models (I)

Another Example of Overfitting


Page 20: Linear Models (I)

Naïve Bayes

• Simplify the model complexity: diagonalize the covariance matrix Σ_y
• Simplified Gaussian distribution
• Feature independence assumption → Naïve Bayes assumption

$$p(x \mid y; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma_{y,i}}\exp\!\left(-\frac{(x_i-\mu_{y,i})^2}{2\sigma_{y,i}^2}\right) = \prod_{i=1}^{m} p(x_i \mid y; \theta)$$

$$y \in \{1, 2, \ldots, c\},\qquad \theta = \{\mu_{y,i},\ \sigma_{y,i},\ p(y)\}\ \text{for } y = 1,\ldots,c,\ i = 1,\ldots,m$$
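A minimal sketch of this diagonal (Naïve Bayes) Gaussian model, fitting per-feature means and variances and scoring classes by log prior plus summed per-feature log-likelihoods (synthetic data):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Per class: per-feature means and variances (a diagonal Sigma_y) and a prior."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = {"mu": Xk.mean(axis=0),
                     "var": Xk.var(axis=0),
                     "log_prior": np.log(len(Xk) / len(X))}
    return params

def log_posterior_scores(x, params):
    """log p(y) + sum_i log p(x_i | y), i.e. log p(y | x) up to a constant."""
    scores = {}
    for k, p in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"])
                                + (x - p["mu"]) ** 2 / p["var"])
        scores[k] = p["log_prior"] + log_lik
    return scores

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(1.5, 1.0, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
scores = log_posterior_scores(np.array([1.4, 1.6, 1.2]), fit_naive_bayes(X, y))
print(max(scores, key=scores.get))   # predicted class
```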

Page 21: Linear Models (I)

Naïve Bayes is a terrible estimator for p(x|y;θ), but it is a very reasonable estimator for p(y|x;θ)

Why?

The ratio of likelihood is more important

Naïve Bayes does a reasonable job on the estimation of this ratio

$$p(y \mid x; \theta) = \frac{p(y;\theta)\,p(x \mid y;\theta)}{\sum_{y'=1}^{c} p(y';\theta)\,p(x \mid y';\theta)} = \frac{1}{\sum_{y'=1}^{c} \dfrac{p(y';\theta)\,p(x \mid y';\theta)}{p(y;\theta)\,p(x \mid y;\theta)}}$$

so only the likelihood ratios $\dfrac{p(x \mid y';\theta)}{p(x \mid y;\theta)}$ matter.

Page 22: Linear Models (I)

The Ratio of Likelihood

• Binary class
• Both classes share the same variance

$$\log\frac{p(y=1 \mid x)}{p(y=-1 \mid x)} = \log\frac{p(x \mid y=1)\,p(y=1)}{p(x \mid y=-1)\,p(y=-1)} = \sum_{i=1}^{m}\frac{(\mu_{1,i}-\mu_{-1,i})\,x_i}{\sigma_i^2} + \sum_{i=1}^{m}\frac{\mu_{-1,i}^2-\mu_{1,i}^2}{2\sigma_i^2} + \log\frac{p(y=1)}{p(y=-1)}$$

with shared covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)$ and class means $\{\mu_{1,1}, \ldots, \mu_{1,m}\}$ and $\{\mu_{-1,1}, \ldots, \mu_{-1,m}\}$.

• A linear model !
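To make the point concrete, here is a small check that the log-odds of such a model collapse to w·x + b (the parameter values are hypothetical; only the shared-variance assumption comes from the slide):

```python
import numpy as np

mu_pos = np.array([1.0, 2.0, 0.5])   # means for class y = +1 (made up)
mu_neg = np.array([0.0, 1.0, 1.5])   # means for class y = -1 (made up)
sigma2 = np.array([0.5, 1.0, 2.0])   # per-feature variances shared by both classes
p_pos, p_neg = 0.6, 0.4              # class priors (made up)

# Collapse the log-odds into linear form w^T x + b.
w = (mu_pos - mu_neg) / sigma2
b = np.sum((mu_neg ** 2 - mu_pos ** 2) / (2 * sigma2)) + np.log(p_pos / p_neg)

def log_odds_direct(x):
    """log [p(x|y=+1) p(y=+1) / (p(x|y=-1) p(y=-1))]; the Gaussian normalizers
    cancel because the variances are shared."""
    ll = lambda mu: -np.sum((x - mu) ** 2 / (2 * sigma2))
    return ll(mu_pos) - ll(mu_neg) + np.log(p_pos / p_neg)

x = np.array([0.3, 1.7, 1.0])
print(w @ x + b, log_odds_direct(x))  # the two values agree: a linear model in x
```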

Page 23: Linear Models (I)

Decision Boundary

[Figure: height histograms (1.1–2.0 m) with empirical data and fitted Gaussian distributions for male and female]

• Gaussian Generative Models == Finding a linear decision boundary

• Why not do it directly?