Linear Models (I)


Rong Jin

Review of Information Theory

• What is information? What is entropy?
• Entropy: the average information, and the minimum coding length
• An important inequality:

$$H(P) = \sum_i p_i \log \frac{1}{p_i}$$

$$H(P) = \sum_i p_i \log \frac{1}{p_i} \;\le\; \sum_i p_i \log \frac{1}{q_i}$$

where $P = \{p_i\}$ is the distribution generating the symbols and $Q = \{q_i\}$ is the distribution used for coding them.
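To make the inequality concrete, here is a small Python sketch (an illustration added for this transcript, not from the slides) computing the entropy of P and the average coding length when symbols from P are coded with a mismatched distribution Q:

```python
import math

def entropy(p):
    """H(P) = sum_i p_i * log2(1 / p_i): minimum average bits per symbol."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits per symbol when symbols drawn from P are coded with Q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]   # distribution generating the symbols
Q = [0.8, 0.1, 0.1]     # distribution assumed by the coder

print(entropy(P))           # 1.5 bits
print(cross_entropy(P, Q))  # ~1.82 bits >= H(P), as the inequality predicts
```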

Review of Information Theory (cont'd)

• Mutual information: measures the correlation between two random variables; it is symmetric in X and Y
• Kullback-Leibler divergence: measures the difference between two distributions

$$I(X;Y) = H(X) - H(X \mid Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)\,P(y)}$$

$$KL(P_D \,\|\, P_M) = \sum_x P_D(x) \log \frac{P_D(x)}{P_M(x)} = E_{x \sim P_D}\!\left[ \log \frac{P_D(x)}{P_M(x)} \right]$$
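A minimal sketch of both quantities (my own illustration; plain Python, natural log). Mutual information is exactly the KL divergence between the joint P(x,y) and the product of its marginals P(x)P(y):

```python
import math

def kl(p, q):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); zero iff P == Q."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def mutual_information(pxy):
    """I(X;Y) = sum_{x,y} P(x,y) log( P(x,y) / (P(x)P(y)) ), i.e. the KL
    divergence between the joint and the product of its marginals."""
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return sum(pxy[i][j] * math.log(pxy[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if pxy[i][j] > 0)

print(kl([0.5, 0.5], [0.9, 0.1]))                        # ~0.51, always >= 0
# Independent variables => I(X;Y) = 0; perfectly correlated => log 2
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # ~0.693
```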

Outline

• Classification problems
• Information theory for text classification
• Gaussian generative model
• Naïve Bayes
• Logistic regression

Classification Problems

$$f : X \rightarrow Y \qquad (\text{input } X,\ \text{output } Y)$$

• Given input X = {x1, x2, ..., xm}
• Predict the class label y
• y ∈ {-1, 1}: binary classification problems
• y ∈ {1, 2, 3, ..., c}: multi-class classification problems
• Goal: learn the function f : X → Y

Examples of Classification Problems

• Text categorization
  Input features: words such as 'campaigning', 'efforts', 'Iowa', 'Democrats', ...
  Class labels: 'politics' and 'non-politics'
  Doc: "Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, ..." → Topic: politics

• Image classification
  Input features: color histogram, texture distribution, edge distribution, ...
  Class labels: 'bird image' and 'non-bird image'
  [Slide shows candidate pictures: "Which is a bird image?"]

Learning Setup for Classification Problems

• Training examples: $D_{train} = \{\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_n, y_n \rangle\}$
• Examples are assumed independent and identically distributed (i.i.d.), so training examples are similar to testing examples
• Goal: find a model or a function that is consistent with the training data

Information Theory for Text Classification

If the coding distribution is similar to the generating distribution, the coding length is short, i.e., the compression rate is good:

$$H(P) = \sum_i p_i \log \frac{1}{p_i} \;\le\; \sum_i p_i \log \frac{1}{q_i}$$

where $P = \{p_i\}$ generates the symbols and $Q = \{q_i\}$ codes them.

Compression Algorithm for TC

• Train compression model M1 on 'politics' documents and compression model M2 on 'sports' documents
• Compress a new document with both models: say 16K bits under M1 and 10K bits under M2
• The shorter coding length indicates the better-matching distribution, so the predicted topic is sports
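The slides do not specify the compression models; as a sketch of the idea, one can approximate "coding length under a class model" with an off-the-shelf compressor (zlib here) primed on per-class text. The corpora and document below are toy stand-ins:

```python
import zlib

def extra_bits(class_text: str, doc: str) -> int:
    """Extra compressed size incurred by appending doc to the class corpus;
    a proxy for the coding length of doc under that class's model."""
    base = len(zlib.compress(class_text.encode()))
    combined = len(zlib.compress((class_text + " " + doc).encode()))
    return 8 * (combined - base)

# Toy per-class corpora standing in for the trained models M1 and M2.
corpora = {
    "politics": "campaign election senate vote democrats republicans policy",
    "sports":   "game season team score playoffs coach touchdown league",
}

doc = "the team won the game in the final season playoffs"
prediction = min(corpora, key=lambda c: extra_bits(corpora[c], doc))
print(prediction)  # expected: 'sports', the model that compresses doc best
```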

Probabilistic Models for Classification Problems

• Apply statistical inference methods
• Key: finding the best parameters θ
• Maximum likelihood estimation (MLE) approach: the log-likelihood of the data is

$$l(D_{train}) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)$$

and we find the parameters that maximize it:

$$\theta^* = \arg\max_{\theta} \; l(D_{train}) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)$$

Training examples $\{\langle x_i, y_i \rangle\}$ → learning a statistical model → prediction via $p(y \mid x; \theta)$

Generative Models

• Do not estimate p(y|x; θ) directly; use Bayes' rule
• Estimate p(x|y; θ) instead of p(y|x; θ)
• Why p(x|y; θ)?
  Most well-known distributions have the form p(x|θ)
  Allocate a separate set of parameters θy for each class: θ = {θ1, θ2, ..., θc}
  p(x|y; θ) = p(x|θy) describes the characteristic input patterns of each class y

$$p(y \mid x; \theta) = \frac{p(y; \theta)\, p(x \mid y; \theta)}{p(x; \theta)} = \frac{p(y; \theta)\, p(x \mid y; \theta)}{\sum_{y'} p(y', x; \theta)}$$

Gaussian Generative Model (I)

• Assume a Gaussian model for each class. One-dimensional case:

$$p(x \mid y; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_y} \exp\left( -\frac{(x - \mu_y)^2}{2\sigma_y^2} \right), \quad y \in \{1, 2, \ldots, c\}, \quad \theta = \{\mu_k, \sigma_k, p(y = k)\}_{k=1}^{c}$$

• Results for MLE:

$$\mu_k = \frac{1}{|\{i : y_i = k\}|} \sum_{i:\, y_i = k} x_i, \qquad \sigma_k^2 = \frac{1}{|\{i : y_i = k\}|} \sum_{i:\, y_i = k} (x_i - \mu_k)^2, \qquad p(y = k) = \frac{|\{i : y_i = k\}|}{n}$$

• Classify with Bayes' rule:

$$p(y \mid x; \theta) = \frac{p(y; \theta)\, p(x \mid y; \theta)}{p(x; \theta)}$$
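A short Python sketch of these MLE formulas and the Bayes-rule posterior (an illustration written for this transcript, not lecture code; the heights below are made-up data):

```python
import math
from collections import defaultdict

def fit_gaussian_generative(xs, ys):
    """MLE for the one-dimensional Gaussian generative model:
    per-class mean, variance, and prior p(y = k)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    theta = {}
    for k, pts in groups.items():
        mu = sum(pts) / len(pts)
        var = max(sum((x - mu) ** 2 for x in pts) / len(pts), 1e-12)
        theta[k] = (mu, var, len(pts) / len(xs))
    return theta

def posterior(theta, x):
    """p(y = k | x) via Bayes' rule over the fitted class models."""
    def joint(mu, var, prior):
        return prior * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    joints = {k: joint(*params) for k, params in theta.items()}
    z = sum(joints.values())
    return {k: v / z for k, v in joints.items()}

heights = [1.82, 1.75, 1.68, 1.55, 1.62, 1.50]
labels  = ["m", "m", "m", "f", "f", "f"]
theta = fit_gaussian_generative(heights, labels)
print(posterior(theta, 1.80))  # nearly all the mass on 'm'
```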

Example

$$\mu_{male} = 1.7, \quad \sigma_{male} = 0.1, \quad p_{male} = 0.5 \qquad\qquad \mu_{female} = 1.5, \quad \sigma_{female} = 0.2, \quad p_{female} = 0.5$$

• Height histogram for males and females
• Using the Gaussian generative model: P(male | 1.8) = ?, P(female | 1.4) = ?

[Figure: empirical height data and fitted Gaussian distributions for male and female]
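Plugging the slide's parameters into Bayes' rule, as a worked check (the values in the comments are approximate):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Parameters from the slide: (mu, sigma, prior) per class.
classes = {"male": (1.7, 0.1, 0.5), "female": (1.5, 0.2, 0.5)}

def posterior(label, x):
    """p(label | x) = p(label) p(x|label) / sum_y p(y) p(x|y)."""
    joints = {k: p * normal_pdf(x, mu, s) for k, (mu, s, p) in classes.items()}
    return joints[label] / sum(joints.values())

print(posterior("male", 1.8))    # ~0.79
print(posterior("female", 1.4))  # ~0.98
```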

Gaussian Generative Model (II)

• Consider multiple input features: X = {x1, x2, ..., xm}
• Multivariate Gaussian distribution, where Σy is an m×m covariance matrix:

$$p(x \mid y; \theta) = N(x; \mu_y, \Sigma_y) = \frac{1}{(2\pi)^{m/2} |\Sigma_y|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y) \right)$$

with $\theta = \{\mu_1, \Sigma_1, p(y=1), \ldots, \mu_c, \Sigma_c, p(y=c)\}$.

• Results for MLE:

$$\mu_y = \frac{1}{|\{i : y_i = y\}|} \sum_{i:\, y_i = y} x_i, \qquad [\Sigma_y]_{s,t} = \frac{1}{|\{i : y_i = y\}|} \sum_{i:\, y_i = y} (x_{i,s} - \mu_{y,s})(x_{i,t} - \mu_{y,t})$$

• Problem: singularity of Σy; too many parameters
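A numpy sketch of the multivariate MLE (my own illustration). Note how easily the covariance estimate becomes singular when a class has fewer examples than features, which is exactly the problem flagged above:

```python
import numpy as np

def fit_multivariate_gaussians(X, y):
    """MLE for the multivariate Gaussian generative model:
    per-class mean vector mu_y, full m x m covariance Sigma_y, and prior."""
    theta = {}
    for k in np.unique(y):
        Xk = X[y == k]                             # rows belonging to class k
        mu = Xk.mean(axis=0)
        centered = Xk - mu
        Sigma = centered.T @ centered / len(Xk)    # biased (MLE) covariance
        theta[k] = (mu, Sigma, len(Xk) / len(X))
    return theta

# With fewer examples than features, Sigma has rank < m and is singular.
X = np.random.randn(10, 100)                  # 10 examples, 100 features
y = np.array([0] * 5 + [1] * 5)
_, Sigma0, _ = fit_multivariate_gaussians(X, y)[0]
print(np.linalg.matrix_rank(Sigma0))          # at most 5, far below 100
```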

Overfitting Issue

• Complex model + insufficient training data
• Consider a classification problem with multiple inputs: 100 input features, 5 classes, 1000 training examples
• Total number of parameters for a full Gaussian model:
  5 means → 5 × 100 = 500 parameters
  5 covariance matrices → 5 × 100 × 100 = 50,000 parameters
  50,500 parameters in all, far more than the training data can support

Another Example of Overfitting

[Figure: a sequence of four plots illustrating increasingly severe overfitting of a fitted curve to the same data]

Naïve Bayes

• Simplify the model complexity: diagonalize the covariance matrix, $\Sigma_y = \mathrm{diag}(\sigma_{y,1}^2, \ldots, \sigma_{y,m}^2)$
• Simplified Gaussian distribution:

$$p(x \mid y; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma_{y,i}} \exp\left( -\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2} \right) = \prod_{i=1}^{m} p(x_i \mid y; \theta)$$

• Feature independence assumption: the Naïve Bayes assumption
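A minimal Gaussian Naïve Bayes sketch (assumed numpy implementation, written for this transcript): each class keeps only per-feature means and variances, i.e. 2m + 1 parameters per class instead of a full covariance matrix:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Naive Bayes MLE: per-class, per-feature mean and variance
    (i.e. a diagonal covariance matrix) plus the class prior."""
    theta = {}
    for k in np.unique(y):
        Xk = X[y == k]
        theta[k] = (Xk.mean(axis=0), Xk.var(axis=0) + 1e-9, len(Xk) / len(X))
    return theta

def log_joint(theta_k, x):
    """log p(y=k) + sum_i log p(x_i | y=k): sums over independent features."""
    mu, var, prior = theta_k
    return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                        + (x - mu) ** 2 / var)

def predict(theta, x):
    """Class with the highest posterior (the denominator is shared, so skip it)."""
    return max(theta, key=lambda k: log_joint(theta[k], x))
```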

Naïve Bayes (cont'd)

• Naïve Bayes is a terrible estimator of p(x|y; θ), but a very reasonable estimator of p(y|x; θ)
• Why? Because of the normalization, only the ratio of likelihoods matters:

$$p(y \mid x; \theta) = \frac{p(y; \theta)\, p(x \mid y; \theta)}{\sum_{y'=1}^{c} p(y'; \theta)\, p(x \mid y'; \theta)} = \frac{1}{\sum_{y'=1}^{c} \dfrac{p(y'; \theta)\, p(x \mid y'; \theta)}{p(y; \theta)\, p(x \mid y; \theta)}}$$

• Naïve Bayes does a reasonable job estimating the ratio $p(x \mid y'; \theta) \,/\, p(x \mid y; \theta)$

The Ratio of Likelihood

• Binary class, $y \in \{-1, +1\}$; both classes share the same variance, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)$:

$$\log \frac{p(y=1 \mid x)}{p(y=-1 \mid x)} = \log \frac{p(y=1)\, p(x \mid y=1)}{p(y=-1)\, p(x \mid y=-1)} = \log \frac{p(y=1)}{p(y=-1)} + \sum_{i=1}^{m} \frac{(x_i - \mu_{i,-1})^2 - (x_i - \mu_{i,+1})^2}{2\sigma_i^2}$$

$$= \sum_{i=1}^{m} \frac{\mu_{i,+1} - \mu_{i,-1}}{\sigma_i^2}\, x_i \;+\; \log \frac{p(y=1)}{p(y=-1)} + \sum_{i=1}^{m} \frac{\mu_{i,-1}^2 - \mu_{i,+1}^2}{2\sigma_i^2} \;=\; w^\top x + c$$

with weights $w = \{w_1, \ldots, w_m\}$, $w_i = (\mu_{i,+1} - \mu_{i,-1}) / \sigma_i^2$, and a constant offset $c$.

• A linear model!
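To check the linearity claim numerically, here is a small sketch (toy parameters of my own choosing, with the shared variances assumed above) comparing $w^\top x + c$ against the directly computed log odds:

```python
import math

# Assumed toy parameters for the two classes (shared per-feature variances).
mu_pos = [1.0, 2.0]; mu_neg = [-1.0, 0.5]
var    = [0.5, 1.5]
p_pos, p_neg = 0.6, 0.4

# Linear coefficients read off the derivation above.
w = [(mp - mn) / v for mp, mn, v in zip(mu_pos, mu_neg, var)]
c = math.log(p_pos / p_neg) + sum((mn ** 2 - mp ** 2) / (2 * v)
                                  for mp, mn, v in zip(mu_pos, mu_neg, var))

def log_odds_direct(x):
    """log [ p(y=+1) p(x|+1) / (p(y=-1) p(x|-1)) ], term by term."""
    s = math.log(p_pos / p_neg)
    for xi, mp, mn, v in zip(x, mu_pos, mu_neg, var):
        s += ((xi - mn) ** 2 - (xi - mp) ** 2) / (2 * v)
    return s

x = [0.3, 1.2]
print(sum(wi * xi for wi, xi in zip(w, x)) + c)  # linear form w.x + c
print(log_odds_direct(x))                        # same value
```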

Decision Boundary

[Figure: empirical height data and fitted Gaussian distributions for male and female, with the resulting decision boundary between the two classes]

• Gaussian Generative Models == Finding a linear decision boundary

• Why not do it directly?