
Page 1: Irina Rish - Columbia University

Probabilistic Classifiers

Irina Rish, IBM T.J. Watson Research Center, [email protected], http://www.research.ibm.com/people/r/rish/

February 20, 2002

Page 2: Irina Rish - Columbia University

Probabilistic Classification

Bayesian Decision Theory

Bayes decision rule (revisited): Bayes risk, 0/1 loss, optimal classifier, discriminability

Probabilistic classifiers and their decision surfaces: Continuous features (Gaussian distribution)

Discrete (binary) features (+ class-conditional independence)

Parameter estimation

“Classical” statistics: maximum-likelihood (ML)

Bayesian statistics: maximum a posteriori (MAP)

Commonly used classifier: naïve Bayes. VERY simple: class-conditional feature independence

VERY efficient (empirically); why and when? – still an open question

Page 3: Irina Rish - Columbia University

Bayesian decision theory

Make a decision that minimizes the overall expected cost (loss)

Advantage: theoretically guaranteed optimal decisions

Drawback: probability distributions are assumed to be known (in practice, estimating those distributions from data can be a hard problem)

Classification problem as an example of a decision problem: given the observed properties (features) of an object, find its class. Examples:

Sorting fish by type (sea bass or salmon) given observed features such as lightness and length

Video character recognition

Face recognition

Document classification using word counts

Guessing a user’s intentions (potential buyer or not) from their web transactions

Intrusion detection

Page 4: Irina Rish - Columbia University

Notation and definitions

$\omega$ – state of nature (class): a random variable with distribution $P(\omega)$; $\Omega = \{\omega_1, \ldots, \omega_n\}$ is the set of possible states of nature (class labels).

$x = (x_1, \ldots, x_d)$ – feature vector in feature space $S$:

Continuous $X$: $x \in \mathbb{R}^d$, and $x$ has a probability density $p(x|\omega)$.

Discrete $X$: $x_i \in D_i$, $i = 1, \ldots, d$, and $x$ has a probability $P(x|\omega)$.

Decision rule $\alpha(x): S \to A$, where $A = \{\alpha_1, \ldots, \alpha_a\}$ is a set of actions (decisions). Example: $A = \{\omega_1, \ldots, \omega_n\}$, and $\alpha(x)$ is a classifier.

Loss function $\lambda(\alpha_i|\omega_j)$ – cost of decision $\alpha_i$ given state $\omega_j$. Example (0/1 loss): $\lambda_{ij} = 1$ if $i \neq j$ and $\lambda_{ij} = 0$ if $i = j$.

Conditional risk of action $\alpha_i$ given $x$: $R(\alpha_i|x) = \sum_{j=1}^{n} \lambda(\alpha_i|\omega_j)\, P(\omega_j|x)$.

Risk (total expected loss): $R = \int R(\alpha(x)|x)\, p(x)\, dx$.

Page 5: Irina Rish - Columbia University

Gaussian (normal) density

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$

Expected value (mean): $\mu = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx$

Variance: $\sigma^2 = E[(x-\mu)^2]$; $\sigma$ is the standard deviation

$|x-\mu|/\sigma$ is the Mahalanobis distance from $x$ to $\mu$

Interesting property (homework problem): $N(\mu, \sigma)$ has maximum entropy $H(p(x))$ among all $p(x)$ with the given mean and variance, where

$$H(p(x)) = -\int p(x) \ln p(x)\, dx \quad \text{(in nats)}$$

When learning from data, max-entropy distributions are the most reasonable choice, since they impose ‘no additional structure’ besides what is given as constraints.

Page 6: Irina Rish - Columbia University
Page 7: Irina Rish - Columbia University

Bayes rule for binary classification

Given only the priors $P(\omega_1)$ and $P(\omega_2)$: choose $\omega_1$ if $P(\omega_1) > P(\omega_2)$, otherwise choose $\omega_2$.

Given evidence $x$, update the posterior $P(\omega_i|x)$ using the Bayes formula:

$$P(\omega_i|x) = \frac{p(x|\omega_i)\, P(\omega_i)}{p(x)} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}, \quad \text{where } p(x) = \sum_{j=1}^{n} p(x|\omega_j)\, P(\omega_j).$$

Bayes decision rule: $g^*(x) = \omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$, otherwise $g^*(x) = \omega_2$.

Page 8: Irina Rish - Columbia University

Example: one-dimensional case

Class-conditional densities $p(x|\omega_i)$; priors: $P(\omega_1) = 2/3$, $P(\omega_2) = 1/3$.

Since $p(x)$ does not depend on $\omega_i$, substituting $P(\omega_i|x) = p(x|\omega_i)\, P(\omega_i)\, /\, p(x)$ into the Bayes rule, we get:

$$g^*(x) = \omega_1 \ \text{ if } \ \frac{P(\omega_1|x)}{P(\omega_2|x)} > 1, \ \text{ i.e. if } \ p(x|\omega_1)\, P(\omega_1) > p(x|\omega_2)\, P(\omega_2).$$

[Figure: class-conditional densities $p(x|\omega_i)$ and posteriors $P(\omega_i|x)$ for the two classes.]
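To make the rule above concrete, here is a minimal sketch in Python that computes the posteriors and the Bayes decision for this one-dimensional example. The priors 2/3 and 1/3 come from the slide; the Gaussian class-conditional parameters are assumed purely for illustration.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    # p(x) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x-mu)/sigma)**2), as on the Gaussian slide
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = {1: 2 / 3, 2: 1 / 3}              # P(w1), P(w2) from the slide
params = {1: (0.0, 1.0), 2: (2.0, 1.0)}    # assumed (mu, sigma) for each class

def posterior(x):
    # P(w_i|x) = p(x|w_i) P(w_i) / p(x), with p(x) = sum_j p(x|w_j) P(w_j)
    joint = {i: gauss_pdf(x, *params[i]) * priors[i] for i in priors}
    evidence = sum(joint.values())
    return {i: joint[i] / evidence for i in joint}

def bayes_decision(x):
    # g*(x) = argmax_i p(x|w_i) P(w_i); the evidence p(x) cancels in the comparison
    joint = {i: gauss_pdf(x, *params[i]) * priors[i] for i in priors}
    return max(joint, key=joint.get)

print(posterior(0.5), bayes_decision(0.5))
```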

Page 9: Irina Rish - Columbia University

Optimality of Bayes rule: idea

For any decision rule $g(x)$, the probability of error given $x$ is

$$P(\text{error}\,|\,g, x) = P(g(x) \neq \omega\,|\,x) = P(g(x) = \omega_1,\, \omega = \omega_2\,|\,x) + P(g(x) = \omega_2,\, \omega = \omega_1\,|\,x).$$

Bayes rule (for the one-dimensional example): $g^*(x) = \omega_1$ if $x < x^*$.

Other rule: $g(x) = \omega_1$ if $x < x_g$.

Claim: $P(\text{error}\,|\,g^*) \leq P(\text{error}\,|\,g)$ for any $g(x): S \to \Omega$.

Proof – see next page…

Page 10: Irina Rish - Columbia University

Optimality of Bayes rule: proof

$$P(\text{error}\,|\,g^*, x) = P(g^*(x) \neq \omega\,|\,x) = P(g^*(x) = \omega_1,\, \omega = \omega_2\,|\,x) + P(g^*(x) = \omega_2,\, \omega = \omega_1\,|\,x)$$
$$= P(g^*(x) = \omega_1)\, P(\omega_2|x) + P(g^*(x) = \omega_2)\, P(\omega_1|x) = p_1 P(\omega_2|x) + p_2 P(\omega_1|x).$$

1) If $P(\omega_1|x) > P(\omega_2|x)$, then $g^*(x) = \omega_1$, so $p_1 = 1$, $p_2 = 0$, and $P(\text{error}\,|\,g^*, x) = P(\omega_2|x)$.

2) If $P(\omega_1|x) \leq P(\omega_2|x)$, then $g^*(x) = \omega_2$, so $p_1 = 0$, $p_2 = 1$, and $P(\text{error}\,|\,g^*, x) = P(\omega_1|x)$.

Thus, $P(\text{error}\,|\,g^*, x) = \min\{P(\omega_1|x),\, P(\omega_2|x)\} \leq P(\text{error}\,|\,g, x)$ for any $g(x): S \to \Omega$, and therefore

$$P(\text{error}\,|\,g^*) = \int_{-\infty}^{+\infty} P(\text{error}\,|\,g^*, x)\, p(x)\, dx \leq P(\text{error}\,|\,g).$$

Page 11: Irina Rish - Columbia University

General Bayesian Decision Theory

Given:

a set of available actions (decisions) $A = \{\alpha_1, \ldots, \alpha_a\}$;

a loss function $\lambda(\alpha_i|\omega_j)$ – the cost of decision $\alpha_i$ given state $\omega_j$;

the posteriors $P(\omega_j|x) = \dfrac{p(x|\omega_j)\, P(\omega_j)}{p(x)}$ (Bayes formula).

Find: a decision rule $\alpha(x): S \to A$ minimizing the total expected loss (risk)

$$R = \int R(\alpha(x)\,|\,x)\, p(x)\, dx, \quad \text{where } \ R(\alpha_i|x) = \sum_{j=1}^{n} \lambda(\alpha_i|\omega_j)\, P(\omega_j|x)$$

is the conditional risk of action $\alpha_i$ given $x$.

Bayes decision rule: always minimize the conditional risk:

$$\alpha^*(x) = \arg\min_{\alpha} R(\alpha\,|\,x) = \arg\min_{\alpha_i} \sum_{j=1}^{n} \lambda(\alpha_i|\omega_j)\, P(\omega_j|x).$$

This yields the minimum overall risk (called the Bayes risk):

$$R^* = \int \min_{\alpha} R(\alpha(x)\,|\,x)\, p(x)\, dx.$$

Page 12: Irina Rish - Columbia University

Zero-one loss classification

Let $\alpha_i$ = “$g(x) = \omega_i$” (action $i$ = classification into class $\omega_i$), $i = 1, \ldots, n$.

Zero-one loss: $\lambda(\alpha_i|\omega_j) = 0$ if $i = j$, and $\lambda(\alpha_i|\omega_j) = 1$ if $i \neq j$.

Then the conditional risk equals the classification error:

$$R(\alpha_i|x) = \sum_{j=1}^{n} \lambda(\alpha_i|\omega_j)\, P(\omega_j|x) = \sum_{j \neq i} P(\omega_j|x) = 1 - P(\omega_i|x) = P(\text{error}\,|\,x).$$

Bayes rule: $g^*(x) = \omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \neq i$.

The Bayes rule achieves the minimum error rate $R^* = \int \min_{\alpha} R(\alpha\,|\,x)\, p(x)\, dx$.

Page 13: Irina Rish - Columbia University

Different errors, different costs

Example from signal detection theory: $\omega_1$ – no signal (background noise), $\omega_2$ – signal present. A decision threshold $x^*$ defines four outcomes, each with its own cost:

Hit: $P(x > x^*\,|\,x \in \omega_2)$, cost $\lambda_{22}$

Miss: $P(x < x^*\,|\,x \in \omega_2)$, cost $\lambda_{12}$

False alarm: $P(x > x^*\,|\,x \in \omega_1)$, cost $\lambda_{21}$

Correct rejection: $P(x < x^*\,|\,x \in \omega_1)$, cost $\lambda_{11}$

Discriminability: $d' = \dfrac{|\mu_2 - \mu_1|}{\sigma}$

Page 14: Irina Rish - Columbia University

Receiver operating characteristic (ROC) curve

[Figure: ROC curve – hit rate $P(x > x^*\,|\,x \in \omega_2)$ plotted against false-alarm rate $P(x > x^*\,|\,x \in \omega_1)$ as the threshold $x^*$ varies.]

Error costs determine optimal decision (threshold x*)

Page 15: Irina Rish - Columbia University

Cost-based classification

Let $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$, $i, j = 1, 2$. Then the conditional risks are:

$$R(\alpha_1|x) = \lambda_{11} P(\omega_1|x) + \lambda_{12} P(\omega_2|x), \qquad R(\alpha_2|x) = \lambda_{21} P(\omega_1|x) + \lambda_{22} P(\omega_2|x).$$

Bayes rule: $g^*(x) = \omega_1$ if $R(\alpha_1|x) < R(\alpha_2|x)$, i.e. if $(\lambda_{21} - \lambda_{11})\, P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\, P(\omega_2|x)$.

Assuming $\lambda_{21} > \lambda_{11}$ and $\lambda_{12} > \lambda_{22}$ (errors cost more than correct decisions):

$$\frac{P(\omega_1|x)}{P(\omega_2|x)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}, \quad \text{or, equivalently,} \quad \frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} = \theta_\lambda.$$

Bayes rule: choose $\omega_1$ if the likelihood ratio exceeds the cost-dependent threshold $\theta_\lambda$.
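A minimal sketch of this cost-based rule: choose $\omega_1$ when the likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ exceeds the threshold $\theta_\lambda$. The loss values, priors, and Gaussian class-conditional parameters below are assumed for illustration only.

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# lam[(i, j)] = cost of deciding w_i when the true class is w_j (assumed values)
lam = {(1, 1): 0.0, (1, 2): 5.0,   # missing w2 is made expensive here
       (2, 1): 1.0, (2, 2): 0.0}
priors = {1: 2 / 3, 2: 1 / 3}
params = {1: (0.0, 1.0), 2: (2.0, 1.0)}   # assumed (mu, sigma) per class

# theta_lambda = (lam12 - lam22)/(lam21 - lam11) * P(w2)/P(w1)
theta = ((lam[(1, 2)] - lam[(2, 2)]) / (lam[(2, 1)] - lam[(1, 1)])) * (priors[2] / priors[1])

def decide(x):
    ratio = gauss_pdf(x, *params[1]) / gauss_pdf(x, *params[2])
    return 1 if ratio > theta else 2

print("theta_lambda =", theta, " decision at x = 1.0:", decide(1.0))
```

Raising the cost $\lambda_{12}$ of missing $\omega_2$ raises $\theta_\lambda$ and shifts the decision threshold, exactly as the next slide illustrates.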

Page 16: Irina Rish - Columbia University

Example: one-dimensional case

The decision threshold is determined by the likelihood-ratio condition

$$\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} = \theta_\lambda.$$

If misclassifying $\omega_2$ as $\omega_1$ becomes more expensive than the opposite error, i.e. $\lambda_{12} > \lambda_{21}$, then the threshold increases.

Page 17: Irina Rish - Columbia University

Discriminant functions and classification

Multi-category classification:

a set of discriminant functions $f_i(x)$, $i = 1, \ldots, n$ (e.g., $f_i(x) = P(\omega_i|x)$, or $f_i(x) = -R(\alpha_i|x)$ for the Bayes classifier);

classifier: $g(x) = \omega_i$ if $f_i(x) > f_j(x)$ for all $j \neq i$.

Two-category classification:

a single discriminant function $f(x) \equiv f_1(x) - f_2(x)$;

classifier: $g(x) = \omega_1$ if $f(x) > 0$.

[Figure: discriminant functions $f_1(x), f_2(x), \ldots, f_n(x)$ feeding a maximum selector.]

Page 18: Irina Rish - Columbia University

Bayes Discriminant Functions

Bayes discriminant functions for zero-one loss classification:

$$f_i^*(x) = -R(\alpha_i|x) = P(\omega_i|x) - 1.$$

Every $f_i(x)$ can be replaced by $h(f_i(x))$, where $h(\cdot)$ is monotonically increasing!

Multi-category:

$$f_i^*(x) = P(\omega_i|x) = \frac{p(x|\omega_i)\, P(\omega_i)}{p(x)}, \qquad f_i^*(x) = p(x|\omega_i)\, P(\omega_i), \qquad f_i^*(x) = \log p(x|\omega_i) + \log P(\omega_i).$$

Two-category:

$$f^*(x) \equiv f_1^*(x) - f_2^*(x) = P(\omega_1|x) - P(\omega_2|x), \qquad f^*(x) = \log \frac{p(x|\omega_1)}{p(x|\omega_2)} + \log \frac{P(\omega_1)}{P(\omega_2)}.$$

Decision regions and surfaces: the discriminant functions partition the feature space into decision regions, separated by decision surfaces where the largest discriminant functions are equal.

Page 19: Irina Rish - Columbia University

Discriminant functions: examples

Features – discriminant functions (decision surfaces):

Continuous features, multivariate Gaussian

$$p(x|\omega_i) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}\, (x - \mu_i)^t\, \Sigma_i^{-1}\, (x - \mu_i)\right]:$$

same covariance matrix $\Sigma$ for all classes – linear (hyperplanes);

general case (arbitrary $\Sigma_i$) – hyperquadrics (hyperellipsoids, hyperparaboloids, hyperhyperboloids).

Discrete features, binary and conditionally independent, $P(x_i, x_j|C) = P(x_i|C)\, P(x_j|C)$ – linear (hyperplanes).

Page 20: Irina Rish - Columbia University

Conditionally independent binary features – linear classifier (‘naïve Bayes’ – see later)

Let $x = (x_1, \ldots, x_n)$ be the features, $x_i \in \{0, 1\}$, and let $\omega \in \{\omega_1, \omega_2\}$ be the class. Let $p_i = P(x_i = 1\,|\,\omega_1)$ and $q_i = P(x_i = 1\,|\,\omega_2)$. Then

$$P(x|\omega_1) = \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}, \qquad P(x|\omega_2) = \prod_{i=1}^{n} q_i^{x_i} (1 - q_i)^{1 - x_i}.$$

The discriminant function

$$f(x) = \log \frac{P(x|\omega_1)\, P(\omega_1)}{P(x|\omega_2)\, P(\omega_2)} = \sum_{i=1}^{n} \left[ x_i \log \frac{p_i}{q_i} + (1 - x_i) \log \frac{1 - p_i}{1 - q_i} \right] + \log \frac{P(\omega_1)}{P(\omega_2)}$$

is linear in $x$:

$$f(x) = \sum_{i=1}^{n} w_i x_i + w_0, \quad \text{where} \quad w_i = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}, \qquad w_0 = \sum_{i=1}^{n} \log \frac{1 - p_i}{1 - q_i} + \log \frac{P(\omega_1)}{P(\omega_2)}.$$

Decision rule: if $f(x) > 0$, choose $\omega_1$.
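A minimal sketch of this linear form of naïve Bayes for binary features. The per-feature probabilities $p_i$, $q_i$ and the priors below are assumed values, chosen only to illustrate how the weights $w_i$ and $w_0$ are computed.

```python
import numpy as np

p = np.array([0.8, 0.6, 0.1])   # P(x_i = 1 | w1), assumed
q = np.array([0.3, 0.5, 0.7])   # P(x_i = 1 | w2), assumed
prior1, prior2 = 0.5, 0.5       # P(w1), P(w2), assumed

# Weights of the equivalent linear classifier f(x) = sum_i w_i x_i + w_0
w = np.log(p * (1 - q) / (q * (1 - p)))
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)

def classify(x):
    """Return 1 (class w1) if f(x) > 0, otherwise 2 (class w2)."""
    return 1 if np.dot(w, x) + w0 > 0 else 2

print(classify(np.array([1, 1, 0])))   # features typical of w1 under the assumed p, q
print(classify(np.array([0, 0, 1])))   # features typical of w2
```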

Page 21: Irina Rish - Columbia University

Case 1: independent Gaussian features with the same variance for all classes: $\Sigma_i = \sigma^2 I$

Note: linear separating surfaces!

Page 22: Irina Rish - Columbia University

Case 2: generalization to dependent features having the same covariance for all classes: $\Sigma_i = \Sigma$

Page 23: Irina Rish - Columbia University

Case 3: unequal covariance matrices. One dimension: multiply-connected decision regions

Page 24: Irina Rish - Columbia University

Case 3, many dimensions: hyperquadric surfaces

Page 25: Irina Rish - Columbia University

Summary

Bayesian decision theory: In theory, it tells you how to make optimal (minimum-risk) decisions. In practice: where do you get those probabilities from?

Expert knowledge + learning from data (see next; also, Chapter 3)

Note: some typos in Chapter 2:

Page 25, second line after equation 12: must be $R(\alpha_i(x)\,|\,x)$, not $R(\alpha_i(x))$.

Page 27, third line before section 2.3.1: switch $\omega_1$ and $\omega_2$, and reverse the inequality in $\lambda_{21} > \lambda_{12}$.

Pages 50 and 51, in Fig. 2.20 and Fig. 2.21: replace the x-axis label by $P(x > x^*\,|\,x \in \omega_1)$. Same replacement in problem 9, page 75.

Section 2.11 (Bayesian belief networks): contains several mistakes; ignore for now. Bayesian networks will be covered later.

Page 26: Irina Rish - Columbia University

Parameter Estimation

In general, we are given training data $D = \{y_1, \ldots, y_N\}$, where $y_j = (x_j, \omega_j)$, and we wish to find (estimate) $P(C)$ and $P(x|C)$ – e.g., a density estimation problem.

Usually, estimating $P(x|\omega)$ is hard, especially in high-dimensional feature spaces.

Solution: simplifying assumptions (e.g., a parametric form of $P(x|\omega_i)$, or feature independence, etc.).

We first consider the fixed parametric distribution approach (e.g., Gaussian, multinomial, etc.). Then learning = parameter estimation from data.

Example: assume $p(x|\omega_i)$ is Gaussian $N(\mu_i, \sigma_i)$; estimate $\mu_i$, $\sigma_i$.

Two major approaches: classical statistics (ML) and Bayesian (MAP).

Philosophical difference: is a parameter a ‘physical’ constant or a random variable?

Page 27: Irina Rish - Columbia University

Maximum likelihood (ML) and maximum a posteriori (MAP) estimates

Assume independent and identically distributed (i.i.d.) samples $D = \{y_1, \ldots, y_N\}$, where $y_j = (x_j, \omega_j)$.

Assume a parametric distribution $p(x\,|\,\omega_i, \theta_i)$, where $\theta_i$ is a parameter vector. Example: $\theta_i = (\mu_i, \sigma_i)$ for Gaussian $p(x|\omega_i)$.

We also assume that the $\theta_i$ for different classes are independent (so they can be estimated separately, in the same way).

Then the likelihood is $l(\theta) = p(D\,|\,\theta) = \prod_{j=1}^{N} p(x_j\,|\,\theta)$.

Maximum-likelihood estimate ($\theta$ is an unknown constant):

$$\hat{\theta} = \arg\max_{\theta}\, l(\theta) = \arg\max_{\theta}\, P(D\,|\,\theta).$$

Maximum a posteriori estimate ($\theta$ is an unknown random variable, with prior $P(\theta)$):

$$\hat{\theta} = \arg\max_{\theta}\, P(\theta\,|\,D) = \arg\max_{\theta}\, P(D\,|\,\theta)\, P(\theta).$$

Note that ML = MAP with a uniform prior.

Page 28: Irina Rish - Columbia University

ML estimate: Gaussian distribution

Known $\sigma$, estimate $\mu$:

$$\hat{\mu} = \frac{1}{N} \sum_{j=1}^{N} x_j.$$

Both $\mu$ and $\sigma^2$ unknown:

$$\hat{\mu} = \frac{1}{N} \sum_{j=1}^{N} x_j, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{j=1}^{N} (x_j - \hat{\mu})^2.$$

Note that $\hat{\sigma}^2$ is biased, i.e.

$$E\left[\frac{1}{N} \sum_{j=1}^{N} (x_j - \hat{\mu})^2\right] = \frac{N-1}{N}\, \sigma^2 \neq \sigma^2.$$

The unbiased estimate would be $\hat{\sigma}^2 = \dfrac{1}{N-1} \sum_{j=1}^{N} (x_j - \hat{\mu})^2$.
Page 29: Irina Rish - Columbia University
Page 30: Irina Rish - Columbia University

Bayesian (MAP) estimate with increasing sample size

Page 31: Irina Rish - Columbia University

Parameter estimation: discrete features

Multinomial $P(x|C)$: $\theta_{ik} = P(x = k\,|\,C = \omega_i)$.

ML-estimate: $\hat{\Theta} = \arg\max_{\Theta} \log P(D\,|\,\Theta)$, which gives the relative frequency counts:

$$\hat{\theta}_{ik} = \frac{N_{x = k,\, c_i}}{\sum_{k} N_{x = k,\, c_i}} \quad \text{(ML)}.$$

MAP-estimate: $\hat{\Theta} = \arg\max_{\Theta} \log \left[ P(D\,|\,\Theta)\, P(\Theta) \right]$, with the conjugate prior – Dirichlet $Dir(\theta\,|\,\alpha_1, \ldots, \alpha_m)$:

$$\hat{\theta}_{ik} = \frac{N_{x = k,\, c_i} + \alpha_k}{\sum_{k} \left( N_{x = k,\, c_i} + \alpha_k \right)} \quad \text{(MAP)}.$$

The hyperparameters $\alpha_k$ act as an equivalent sample size (prior knowledge).

[Figure: a two-node model $C \to X$.]
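A minimal sketch of the ML and MAP (Dirichlet) estimates above for a single discrete feature within one class; the data values and the Dirichlet pseudo-counts are assumed for illustration.

```python
from collections import Counter

data = [0, 1, 1, 2, 1, 0, 1, 2, 1, 1]   # observed values of X within one class (assumed)
m = 3                                    # number of possible values of X
alpha = [1.0] * m                        # Dirichlet pseudo-counts (add-one smoothing)

counts = Counter(data)
N = len(data)

theta_ml  = [counts[k] / N for k in range(m)]
theta_map = [(counts[k] + alpha[k]) / (N + sum(alpha)) for k in range(m)]

print("ML: ", theta_ml)    # pure frequencies; an unseen value would get probability 0
print("MAP:", theta_map)   # pseudo-counts keep every probability strictly positive
```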

Page 32: Irina Rish - Columbia University

An example: naïve Bayes classifier. Simplifying (“naïve”) assumption: feature independence given the class:

$$P(x_1, \ldots, x_n\,|\,C) = \prod_{i=1}^{n} P(x_i\,|\,C)$$

[Figure: class node $C$ with prior $P(C)$ and feature nodes $x_1, \ldots, x_n$, each with conditional probability $P(x_i|C)$.]

1. Bayes(-optimal) classifier: given an (unlabeled) instance $x = (x_1, \ldots, x_n)$, choose the most likely class:

$$BO(x) = \arg\max_{i} P(C_i\,|\,x).$$

2. Naïve Bayes classifier:

$$NB(x) = \arg\max_{i} P(C_i) \prod_{j=1}^{n} P(x_j\,|\,C_i),$$

by the Bayes rule $P(C_i|x) = P(x|C_i)\, P(C_i)\, /\, P(x)$ and by the independence assumption.
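A minimal end-to-end sketch of the naïve Bayes classifier just defined, for binary features; the toy training data and the smoothing constant are assumed for illustration.

```python
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([0, 0, 0, 1, 1])    # class labels for the toy data (assumed)

classes = np.unique(y)
alpha = 1.0                      # Laplace smoothing (MAP with a Dirichlet prior)

priors = {c: np.mean(y == c) for c in classes}
# Smoothed P(x_j = 1 | C = c) for each binary feature j
cond = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha) for c in classes}

def nb_predict(x):
    def log_score(c):
        p = cond[c]
        # log P(C=c) + sum_j log P(x_j | C=c)
        return np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(classes, key=log_score)

print(nb_predict(np.array([1, 0, 1])))   # expected: class 0
print(nb_predict(np.array([0, 1, 0])))   # expected: class 1
```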

Page 33: Irina Rish - Columbia University

State-of-the-art

Optimality results:

Linear decision surface for binary features (Minsky 61, Duda & Hart 73); polynomial for general nominal features (Duda & Hart 1973, Peot 96)

Optimality for OR and AND concepts (Domingos & Pazzani 97)

No XOR-containing concepts on nominal features (Zhang & Ling 01)

Algorithmic improvements:

Boosted NB (Elkan 97) is equivalent to a multilayer perceptron

Augmented NB (TAN, Bayes nets – e.g., Friedman et al. 97)

Other improvements: combining with decision trees (Kohavi), with error-correcting output coding (ECOC) (Ghani, ICML 2000), etc.

Still, open problems remain: NB error estimates/bounds based on domain properties

Page 34: Irina Rish - Columbia University

Why does naïve Bayes often work well (despite the independence assumption)?

Wrong P(C|x) estimates do not imply wrong classification!

Domingos & Pazzani 97, J. Friedman 97, etc.; “Statistical diagnosis based on conditional independence does not require it”, J. Hilden 84

[Figure: true posterior $P(\text{class}\,|\,f)$ vs. the NB estimate $\hat{P}(\text{class}\,|\,f)$; the two can differ while still agreeing on the optimal class.]

Bayes-optimal: $class_{opt} = \arg\max_i P(class_i\,|\,f)$

Naïve Bayes: $class_{NB} = \arg\max_i \hat{P}(class_i\,|\,f)$

Major questions remain: Which $P(c, x)$ are ‘good’ for NB?

What domain properties “predict” NB accuracy?

Page 35: Irina Rish - Columbia University

General question: characterizing distributions $P(X, C)$ and their approximations $Q(X, C)$ that can be ‘far’ from $P(X, C)$, but yield low classification error

[Figure: true $P(x|C)\, P(C)$ and an approximate $P$ for Class 1 and Class 2; both give the same optimal decision boundary.]

Note: one measure of ‘distance’ between distributions is relative entropy, or KL-divergence (see homework problem 11, chap. 3):

$$D(P\,\|\,Q) = \int P(z) \log \frac{P(z)}{Q(z)}\, dz$$
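A minimal sketch of this KL-divergence for discrete distributions; the two distributions below are assumed example values.

```python
import numpy as np

def kl_divergence(p, q):
    """D(P||Q) = sum_z P(z) * log(P(z)/Q(z)), with the convention 0 * log(0/q) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

P = [0.5, 0.3, 0.2]   # 'true' distribution (assumed)
Q = [0.4, 0.4, 0.2]   # approximation (assumed)

print(kl_divergence(P, Q))   # non-negative; larger means Q is a worse approximation of P
print(kl_divergence(P, P))   # 0.0
```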

Page 36: Irina Rish - Columbia University

Case study: using Naïve Bayes for the Transaction Recognition Problem

[Figure: a client workstation issues end-user transactions (EUTs) such as OpenDB, Search, SendMail over a session (connection) to a server (Web, DB, Lotus Notes); each transaction generates remote procedure calls (RPCs). Which RPCs belong to which transaction?]

Two problems: segmentation and labeling

[Figure: an unsegmented stream of RPCs (e.g., 1 2 1 3 4 1 2 3 …) must be segmented and labeled with transaction types Tx1, Tx2, Tx3, ….]

Page 37: Irina Rish - Columbia University

Representing transactions as feature vectors

Example: a transaction of type $i$ generating the RPC sequence R2 R1 R2 R2 R3 R5 R5 can be represented by

RPC occurrences: $f = (1, 1, 1, 0, 1, 0, \ldots)$

RPC counts: $f = (1, 3, 1, 0, 2, 0, \ldots)$

Models of $P(\text{features}\,|\,T_i)$ for transaction type $T_i$ and RPC $j$:

Bernoulli: $P(R_j = 1\,|\,T_i) = p_{ij}$

Multinomial: $P(n_{i1}, \ldots, n_{iM}\,|\,T_i) = n_i! \prod_{j=1}^{M} \dfrac{p_{ij}^{\,n_{ij}}}{n_{ij}!}$, where $n_i = \sum_j n_{ij}$

Geometric: $P(n_{ij}\,|\,T_i) = p_{ij} (1 - p_{ij})^{n_{ij}}$

Shifted geometric: $P(n_{ij}\,|\,T_i) = p_{ij} (1 - p_{ij})^{n_{ij} - s_{ij}}$

Best fit to the data ($\chi^2$): shifted geometric
Page 38: Irina Rish - Columbia University

Empirical results

Significant improvement over the baseline classifier (75%).

NB is simple, efficient, and comparable to state-of-the-art classifiers: SVM – 85–87%, Decision Tree – 90–92%.

Best-fit distribution (shifted geometric) – not necessarily the best classifier! (?)

Baseline classifier: always selects the most-frequent transaction.

[Figure: accuracy vs. training set size for ‘NB + Bernoulli, mult. or geom.’ and ‘NB + shifted geom.’; reported accuracies in the plot: 87 ± 2%, 79 ± 3%, ≈10 ± 1%.]

Page 39: Irina Rish - Columbia University

Next lecture on Bayesian topics

April 17, 2002 – lecture on recent ‘hot stuff’: Bayesian networks, HMMs, the EM algorithm

Page 40: Irina Rish - Columbia University

Short Preview: From Naïve Bayes to Bayesian Networks

Naïve Bayes model: independent features given the class. [Figure: class node with feature nodes $f_1, \ldots, f_n$ and conditional probabilities $P(f_i\,|\,\text{class})$.]

Bayesian network (BN) model: any joint probability distribution.

[Figure: a Bayesian network with nodes Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), and Dyspnoea (D), with conditional distributions P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B).]

$$P(S, C, B, X, D) = P(S)\, P(C|S)\, P(B|S)\, P(X|C,S)\, P(D|C,B)$$

CPD for P(D|C,B):

C B | D=0  D=1
0 0 | 0.1  0.9
0 1 | 0.7  0.3
1 0 | 0.8  0.2
1 1 | 0.9  0.1

Query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
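A minimal sketch of answering this query by brute-force enumeration over the factorization above. Only the CPD P(D|C,B) is taken from the slide; every other CPD value is an assumption made purely for illustration.

```python
import itertools

p_S = {0: 0.7, 1: 0.3}                           # assumed P(S)
p_C_given_S = {0: 0.01, 1: 0.10}                 # assumed P(C=1|S)
p_B_given_S = {0: 0.20, 1: 0.40}                 # assumed P(B=1|S)
p_X_given_CS = {(0, 0): 0.05, (0, 1): 0.05,      # assumed P(X=1|C,S)
                (1, 0): 0.90, (1, 1): 0.90}
p_D_given_CB = {(0, 0): 0.9, (0, 1): 0.3,        # P(D=1|C,B) from the slide's CPD table
                (1, 0): 0.2, (1, 1): 0.1}

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    p = p_S[s]
    p *= p_C_given_S[s] if c else 1 - p_C_given_S[s]
    p *= p_B_given_S[s] if b else 1 - p_B_given_S[s]
    p *= p_X_given_CS[(c, s)] if x else 1 - p_X_given_CS[(c, s)]
    p *= p_D_given_CB[(c, b)] if d else 1 - p_D_given_CB[(c, b)]
    return p

# P(C=1 | S=0, D=1): sum out the hidden variables B and X, then normalize
num = sum(joint(0, 1, b, x, 1) for b, x in itertools.product([0, 1], repeat=2))
den = sum(joint(0, c, b, x, 1) for c, b, x in itertools.product([0, 1], repeat=3))
print("P(lung cancer = yes | smoking = no, dyspnoea = yes) =", num / den)
```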