
Page 1: Irina Rish - Columbia University

Probabilistic Classifiers

Irina Rish, IBM T.J. Watson Research Center, [email protected], http://www.research.ibm.com/people/r/rish/

February 20, 2002

Page 2: Irina Rish - Columbia University

Probabilistic Classification

Bayesian Decision Theory

Bayes decision rule (revisited): Bayes risk, 0/1 loss, optimal classifier, discriminability

Probabilistic classifiers and their decision surfaces: Continuous features (Gaussian distribution)

Discrete (binary) features (+ class-conditional independence)

Parameter estimation

“Classical” statistics: maximum-likelihood (ML)

Bayesian statistics: maximum a posteriori (MAP)

Commonly used classifier: naïve Bayes. VERY simple: class-conditional feature independence

VERY efficient (empirically); why and when? – still an open question

Page 3: Irina Rish - Columbia University

Bayesian decision theory

Make a decision that minimizes the overall expected cost (loss)

Advantage: theoretically guaranteed optimal decisions

Drawback: probability distributions are assumed to be known (in practice, estimating those distributions from data can be a hard problem)

Classification problem as an example of a decision problem: given the observed properties (features) of an object, find its class. Examples:

Sorting fish by type (sea bass or salmon) given observed features such as lightness and length

Video character recognition

Face recognition

Document classification using word counts

Guessing a user’s intentions (potential buyer or not) from their web transactions

Intrusion detection

Page 4: Irina Rish - Columbia University

Notation and definitions

$\omega$ – state of nature (class): a random variable with distribution $P(\omega)$; $\Omega = \{\omega_1, \ldots, \omega_n\}$ is the set of possible states of nature (class labels).

$x = (x_1, \ldots, x_d)$ – feature vector in feature space $S$:

Continuous $X$: $x \in \mathbb{R}^d$, and $x$ has a probability density $p(x|\omega)$.

Discrete $X$: $x_i \in D_i$, $i = 1, \ldots, d$, and $x$ has a probability $P(x|\omega)$.

Decision rule $\alpha(x): S \to A$, where $A = \{\alpha_1, \ldots, \alpha_a\}$ is a set of actions (decisions). Example: $A = \{\omega_1, \ldots, \omega_n\}$, and $\alpha(x)$ is a classifier.

Loss function $\lambda(\alpha_i|\omega_j)$ – cost of decision $\alpha_i$ given state $\omega_j$. Example (0/1 loss): $\lambda_{ij} = 1$ if $i \neq j$ and $\lambda_{ij} = 0$ if $i = j$.

Conditional risk of action $\alpha_i$ given $x$: $R(\alpha_i|x) = \sum_{j=1}^{n} \lambda(\alpha_i|\omega_j)\, P(\omega_j|x)$.

Risk (total expected loss): $R = \int R(\alpha(x)|x)\, p(x)\, dx$.

Page 5: Irina Rish - Columbia University

Gaussian (normal) density

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$

Expected value (mean): $\mu = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx$

Variance: $\sigma^2 = E[(x-\mu)^2]$; $\sigma$ is the standard deviation

$|x-\mu|/\sigma$ is the Mahalanobis distance from $x$ to $\mu$

Interesting property (homework problem): $N(\mu, \sigma)$ has maximum entropy $H(p(x))$ among all $p(x)$ with the given mean and variance, where

$$H(p(x)) = -\int p(x) \ln p(x)\, dx \quad \text{(in nats)}$$

When learning from data, max-entropy distributions are the most reasonable choice, since they impose ‘no additional structure’ besides what is given as constraints.

Page 6: Irina Rish - Columbia University
Page 7: Irina Rish - Columbia University

Bayes rule for binary classification

Given only the priors $P(\omega_1)$ and $P(\omega_2)$: choose $\omega_1$ if $P(\omega_1) > P(\omega_2)$, otherwise choose $\omega_2$.

Given evidence $x$, update the posterior $P(\omega_i|x)$ using the Bayes formula:

$$P(\omega_i|x) = \frac{p(x|\omega_i)\, P(\omega_i)}{p(x)} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}, \quad \text{where } p(x) = \sum_{j=1}^{n} p(x|\omega_j)\, P(\omega_j).$$

Bayes decision rule: $g^*(x) = \omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$, otherwise $g^*(x) = \omega_2$.

Page 8: Irina Rish - Columbia University

Example: one-dimensional case

Class-conditional densities $p(x|\omega_i)$; priors: $P(\omega_1) = 2/3$, $P(\omega_2) = 1/3$.

Since $p(x)$ does not depend on $\omega_i$, substituting $P(\omega_i|x) = p(x|\omega_i)\, P(\omega_i)\, /\, p(x)$ into the Bayes rule, we get:

$$g^*(x) = \omega_1 \ \text{ if } \ \frac{P(\omega_1|x)}{P(\omega_2|x)} > 1, \ \text{ i.e. if } \ p(x|\omega_1)\, P(\omega_1) > p(x|\omega_2)\, P(\omega_2).$$

[Figure: class-conditional densities $p(x|\omega_i)$ and posteriors $P(\omega_i|x)$ for the two classes.]
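To make the rule above concrete, here is a minimal sketch in Python that computes the posteriors and the Bayes decision for this one-dimensional example. The priors 2/3 and 1/3 come from the slide; the Gaussian class-conditional parameters are assumed purely for illustration.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    # p(x) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x-mu)/sigma)**2), as on the Gaussian slide
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = {1: 2 / 3, 2: 1 / 3}              # P(w1), P(w2) from the slide
params = {1: (0.0, 1.0), 2: (2.0, 1.0)}    # assumed (mu, sigma) for each class

def posterior(x):
    # P(w_i|x) = p(x|w_i) P(w_i) / p(x), with p(x) = sum_j p(x|w_j) P(w_j)
    joint = {i: gauss_pdf(x, *params[i]) * priors[i] for i in priors}
    evidence = sum(joint.values())
    return {i: joint[i] / evidence for i in joint}

def bayes_decision(x):
    # g*(x) = argmax_i p(x|w_i) P(w_i); the evidence p(x) cancels in the comparison
    joint = {i: gauss_pdf(x, *params[i]) * priors[i] for i in priors}
    return max(joint, key=joint.get)

print(posterior(0.5), bayes_decision(0.5))
```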

Page 9: Irina Rish - Columbia University

Optimality of Bayes rule: idea

For any decision rule $g(x)$, the probability of error given $x$ is

$$P(\text{error}\,|\,g, x) = P(g(x) \neq \omega\,|\,x) = P(g(x) = \omega_1,\, \omega = \omega_2\,|\,x) + P(g(x) = \omega_2,\, \omega = \omega_1\,|\,x).$$

Bayes rule (for the one-dimensional example): $g^*(x) = \omega_1$ if $x < x^*$.

Other rule: $g(x) = \omega_1$ if $x < x_g$.

Claim: $P(\text{error}\,|\,g^*) \leq P(\text{error}\,|\,g)$ for any $g(x): S \to \Omega$.

Proof – see next page…

Page 10: Irina Rish - Columbia University

Optimality of Bayes rule: proof

$$P(\text{error}\,|\,g^*, x) = P(g^*(x) \neq \omega\,|\,x) = P(g^*(x) = \omega_1,\, \omega = \omega_2\,|\,x) + P(g^*(x) = \omega_2,\, \omega = \omega_1\,|\,x)$$
$$= P(g^*(x) = \omega_1)\, P(\omega_2|x) + P(g^*(x) = \omega_2)\, P(\omega_1|x) = p_1 P(\omega_2|x) + p_2 P(\omega_1|x).$$

1) If $P(\omega_1|x) > P(\omega_2|x)$, then $g^*(x) = \omega_1$, so $p_1 = 1$, $p_2 = 0$, and $P(\text{error}\,|\,g^*, x) = P(\omega_2|x)$.

2) If $P(\omega_1|x) \leq P(\omega_2|x)$, then $g^*(x) = \omega_2$, so $p_1 = 0$, $p_2 = 1$, and $P(\text{error}\,|\,g^*, x) = P(\omega_1|x)$.

Thus, $P(\text{error}\,|\,g^*, x) = \min\{P(\omega_1|x),\, P(\omega_2|x)\} \leq P(\text{error}\,|\,g, x)$ for any $g(x): S \to \Omega$, and therefore

$$P(\text{error}\,|\,g^*) = \int_{-\infty}^{+\infty} P(\text{error}\,|\,g^*, x)\, p(x)\, dx \leq P(\text{error}\,|\,g).$$

Page 11: Irina Rish - Columbia University

General Bayesian Decision Theory

Given:

a set of available actions (decisions) $A = \{\alpha_1, \ldots, \alpha_a\}$;

a loss function $\lambda(\alpha_i|\omega_j)$ – the cost of decision $\alpha_i$ given state $\omega_j$;

the posteriors $P(\omega_j|x) = \dfrac{p(x|\omega_j)\, P(\omega_j)}{p(x)}$ (Bayes formula).

Find: a decision rule $\alpha(x): S \to A$ minimizing the total expected loss (risk)

$$R = \int R(\alpha(x)\,|\,x)\, p(x)\, dx, \quad \text{where } \ R(\alpha_i|x) = \sum_{j=1}^{n} \lambda(\alpha_i|\omega_j)\, P(\omega_j|x)$$

is the conditional risk of action $\alpha_i$ given $x$.

Bayes decision rule: always minimize the conditional risk:

$$\alpha^*(x) = \arg\min_{\alpha} R(\alpha\,|\,x) = \arg\min_{\alpha_i} \sum_{j=1}^{n} \lambda(\alpha_i|\omega_j)\, P(\omega_j|x).$$

This yields the minimum overall risk (called the Bayes risk):

$$R^* = \int \min_{\alpha} R(\alpha(x)\,|\,x)\, p(x)\, dx.$$

Page 12: Irina Rish - Columbia University

Zero-one loss classification

Let $\alpha_i$ = “$g(x) = \omega_i$” (action $i$ = classification into class $\omega_i$), $i = 1, \ldots, n$.

Zero-one loss: $\lambda(\alpha_i|\omega_j) = 0$ if $i = j$, and $\lambda(\alpha_i|\omega_j) = 1$ if $i \neq j$.

Then the conditional risk equals the classification error:

$$R(\alpha_i|x) = \sum_{j=1}^{n} \lambda(\alpha_i|\omega_j)\, P(\omega_j|x) = \sum_{j \neq i} P(\omega_j|x) = 1 - P(\omega_i|x) = P(\text{error}\,|\,x).$$

Bayes rule: $g^*(x) = \omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \neq i$.

The Bayes rule achieves the minimum error rate $R^* = \int \min_{\alpha} R(\alpha\,|\,x)\, p(x)\, dx$.

Page 13: Irina Rish - Columbia University

Different errors, different costs

Example from signal detection theory: $\omega_1$ – no signal (background noise), $\omega_2$ – signal present. A decision threshold $x^*$ defines four outcomes, each with its own cost:

Hit: $P(x > x^*\,|\,x \in \omega_2)$, cost $\lambda_{22}$

Miss: $P(x < x^*\,|\,x \in \omega_2)$, cost $\lambda_{12}$

False alarm: $P(x > x^*\,|\,x \in \omega_1)$, cost $\lambda_{21}$

Correct rejection: $P(x < x^*\,|\,x \in \omega_1)$, cost $\lambda_{11}$

Discriminability: $d' = \dfrac{|\mu_2 - \mu_1|}{\sigma}$

Page 14: Irina Rish - Columbia University

Receiver operating characteristic (ROC) curve

[Figure: ROC curve – hit rate $P(x > x^*\,|\,x \in \omega_2)$ plotted against false-alarm rate $P(x > x^*\,|\,x \in \omega_1)$ as the threshold $x^*$ varies.]

Error costs determine optimal decision (threshold x*)

Page 15: Irina Rish - Columbia University

Cost-based classification

Let $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$, $i, j = 1, 2$. Then the conditional risks are:

$$R(\alpha_1|x) = \lambda_{11} P(\omega_1|x) + \lambda_{12} P(\omega_2|x), \qquad R(\alpha_2|x) = \lambda_{21} P(\omega_1|x) + \lambda_{22} P(\omega_2|x).$$

Bayes rule: $g^*(x) = \omega_1$ if $R(\alpha_1|x) < R(\alpha_2|x)$, i.e. if $(\lambda_{21} - \lambda_{11})\, P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\, P(\omega_2|x)$.

Assuming $\lambda_{21} > \lambda_{11}$ and $\lambda_{12} > \lambda_{22}$ (errors cost more than correct decisions):

$$\frac{P(\omega_1|x)}{P(\omega_2|x)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}, \quad \text{or, equivalently,} \quad \frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} = \theta_\lambda.$$

Bayes rule: choose $\omega_1$ if the likelihood ratio exceeds the cost-dependent threshold $\theta_\lambda$.
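A minimal sketch of this cost-based rule: choose $\omega_1$ when the likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ exceeds the threshold $\theta_\lambda$. The loss values, priors, and Gaussian class-conditional parameters below are assumed for illustration only.

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# lam[(i, j)] = cost of deciding w_i when the true class is w_j (assumed values)
lam = {(1, 1): 0.0, (1, 2): 5.0,   # missing w2 is made expensive here
       (2, 1): 1.0, (2, 2): 0.0}
priors = {1: 2 / 3, 2: 1 / 3}
params = {1: (0.0, 1.0), 2: (2.0, 1.0)}   # assumed (mu, sigma) per class

# theta_lambda = (lam12 - lam22)/(lam21 - lam11) * P(w2)/P(w1)
theta = ((lam[(1, 2)] - lam[(2, 2)]) / (lam[(2, 1)] - lam[(1, 1)])) * (priors[2] / priors[1])

def decide(x):
    ratio = gauss_pdf(x, *params[1]) / gauss_pdf(x, *params[2])
    return 1 if ratio > theta else 2

print("theta_lambda =", theta, " decision at x = 1.0:", decide(1.0))
```

Raising the cost $\lambda_{12}$ of missing $\omega_2$ raises $\theta_\lambda$ and shifts the decision threshold, exactly as the next slide illustrates.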

Page 16: Irina Rish - Columbia University

Example: one-dimensional case

The decision threshold is determined by the likelihood-ratio condition

$$\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} = \theta_\lambda.$$

If misclassifying $\omega_2$ as $\omega_1$ becomes more expensive than the opposite error, i.e. $\lambda_{12} > \lambda_{21}$, then the threshold increases.

Page 17: Irina Rish - Columbia University

Discriminant functions and classification

Multi-category classification:

a set of discriminant functions $f_i(x)$, $i = 1, \ldots, n$ (e.g., $f_i(x) = P(\omega_i|x)$, or $f_i(x) = -R(\alpha_i|x)$ for the Bayes classifier);

classifier: $g(x) = \omega_i$ if $f_i(x) > f_j(x)$ for all $j \neq i$.

Two-category classification:

a single discriminant function $f(x) \equiv f_1(x) - f_2(x)$;

classifier: $g(x) = \omega_1$ if $f(x) > 0$.

[Figure: discriminant functions $f_1(x), f_2(x), \ldots, f_n(x)$ feeding a maximum selector.]

Page 18: Irina Rish - Columbia University

Bayes Discriminant Functions

Bayes discriminant functions for zero-one loss classification:

$$f_i^*(x) = -R(\alpha_i|x) = P(\omega_i|x) - 1.$$

Every $f_i(x)$ can be replaced by $h(f_i(x))$, where $h(\cdot)$ is monotonically increasing!

Multi-category:

$$f_i^*(x) = P(\omega_i|x) = \frac{p(x|\omega_i)\, P(\omega_i)}{p(x)}, \qquad f_i^*(x) = p(x|\omega_i)\, P(\omega_i), \qquad f_i^*(x) = \log p(x|\omega_i) + \log P(\omega_i).$$

Two-category:

$$f^*(x) \equiv f_1^*(x) - f_2^*(x) = P(\omega_1|x) - P(\omega_2|x), \qquad f^*(x) = \log \frac{p(x|\omega_1)}{p(x|\omega_2)} + \log \frac{P(\omega_1)}{P(\omega_2)}.$$

Decision regions and surfaces: the discriminant functions partition the feature space into decision regions, separated by decision surfaces where the largest discriminant functions are equal.

Page 19: Irina Rish - Columbia University

Discriminant functions: examples

Features – discriminant functions (decision surfaces):

Continuous features, multivariate Gaussian

$$p(x|\omega_i) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}\, (x - \mu_i)^t\, \Sigma_i^{-1}\, (x - \mu_i)\right]:$$

same covariance matrix $\Sigma$ for all classes – linear (hyperplanes);

general case (arbitrary $\Sigma_i$) – hyperquadrics (hyperellipsoids, hyperparaboloids, hyperhyperboloids).

Discrete features, binary and conditionally independent, $P(x_i, x_j|C) = P(x_i|C)\, P(x_j|C)$ – linear (hyperplanes).

Page 20: Irina Rish - Columbia University

Conditionally independent binary features – linear classifier (‘naïve Bayes’ – see later)

Let $x = (x_1, \ldots, x_n)$ be the features, $x_i \in \{0, 1\}$, and let $\omega \in \{\omega_1, \omega_2\}$ be the class. Let $p_i = P(x_i = 1\,|\,\omega_1)$ and $q_i = P(x_i = 1\,|\,\omega_2)$. Then

$$P(x|\omega_1) = \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}, \qquad P(x|\omega_2) = \prod_{i=1}^{n} q_i^{x_i} (1 - q_i)^{1 - x_i}.$$

The discriminant function

$$f(x) = \log \frac{P(x|\omega_1)\, P(\omega_1)}{P(x|\omega_2)\, P(\omega_2)} = \sum_{i=1}^{n} \left[ x_i \log \frac{p_i}{q_i} + (1 - x_i) \log \frac{1 - p_i}{1 - q_i} \right] + \log \frac{P(\omega_1)}{P(\omega_2)}$$

is linear in $x$:

$$f(x) = \sum_{i=1}^{n} w_i x_i + w_0, \quad \text{where} \quad w_i = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}, \qquad w_0 = \sum_{i=1}^{n} \log \frac{1 - p_i}{1 - q_i} + \log \frac{P(\omega_1)}{P(\omega_2)}.$$

Decision rule: if $f(x) > 0$, choose $\omega_1$.
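A minimal sketch of this linear form of naïve Bayes for binary features. The per-feature probabilities $p_i$, $q_i$ and the priors below are assumed values, chosen only to illustrate how the weights $w_i$ and $w_0$ are computed.

```python
import numpy as np

p = np.array([0.8, 0.6, 0.1])   # P(x_i = 1 | w1), assumed
q = np.array([0.3, 0.5, 0.7])   # P(x_i = 1 | w2), assumed
prior1, prior2 = 0.5, 0.5       # P(w1), P(w2), assumed

# Weights of the equivalent linear classifier f(x) = sum_i w_i x_i + w_0
w = np.log(p * (1 - q) / (q * (1 - p)))
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)

def classify(x):
    """Return 1 (class w1) if f(x) > 0, otherwise 2 (class w2)."""
    return 1 if np.dot(w, x) + w0 > 0 else 2

print(classify(np.array([1, 1, 0])))   # features typical of w1 under the assumed p, q
print(classify(np.array([0, 0, 1])))   # features typical of w2
```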

Page 21: Irina Rish - Columbia University

Case 1: independent Gaussian features with the same variance for all classes: $\Sigma_i = \sigma^2 I$

Note: linear separating surfaces!

Page 22: Irina Rish - Columbia University

Case 2: generalization to dependent features having the same covariance for all classes: $\Sigma_i = \Sigma$

Page 23: Irina Rish - Columbia University

Case 3: unequal covariance matrices. One dimension: multiply-connected decision regions

Page 24: Irina Rish - Columbia University

Case 3, many dimensions: hyperquadric surfaces

Page 25: Irina Rish - Columbia University

Summary

Bayesian decision theory: In theory, it tells you how to make optimal (minimum-risk) decisions. In practice: where do you get those probabilities from?

Expert knowledge + learning from data (see next; also, Chapter 3)

Note: some typos in Chapter 2:

Page 25, second line after equation 12: must be $R(\alpha_i(x)\,|\,x)$, not $R(\alpha_i(x))$.

Page 27, third line before section 2.3.1: switch $\omega_1$ and $\omega_2$, and reverse the inequality in $\lambda_{21} > \lambda_{12}$.

Pages 50 and 51, in Fig. 2.20 and Fig. 2.21: replace the x-axis label by $P(x > x^*\,|\,x \in \omega_1)$. Same replacement in problem 9, page 75.

Section 2.11 (Bayesian belief networks): contains several mistakes; ignore for now. Bayesian networks will be covered later.

Page 26: Irina Rish - Columbia University

Parameter Estimation

In general, we are given training data $D = \{y_1, \ldots, y_N\}$, where $y_j = (x_j, \omega_j)$, and we wish to find (estimate) $P(C)$ and $P(x|C)$ – e.g., a density estimation problem.

Usually, estimating $P(x|\omega)$ is hard, especially in high-dimensional feature spaces.

Solution: simplifying assumptions (e.g., a parametric form of $P(x|\omega_i)$, or feature independence, etc.).

We first consider the fixed parametric distribution approach (e.g., Gaussian, multinomial, etc.). Then learning = parameter estimation from data.

Example: assume $p(x|\omega_i)$ is Gaussian $N(\mu_i, \sigma_i)$; estimate $\mu_i$, $\sigma_i$.

Two major approaches: classical statistics (ML) and Bayesian (MAP).

Philosophical difference: is a parameter a ‘physical’ constant or a random variable?

Page 27: Irina Rish - Columbia University

Maximum likelihood (ML) and maximum a posteriori (MAP) estimates

Assume independent and identically distributed (i.i.d.) samples $D = \{y_1, \ldots, y_N\}$, where $y_j = (x_j, \omega_j)$.

Assume a parametric distribution $p(x\,|\,\omega_i, \theta_i)$, where $\theta_i$ is a parameter vector. Example: $\theta_i = (\mu_i, \sigma_i)$ for Gaussian $p(x|\omega_i)$.

We also assume that the $\theta_i$ for different classes are independent (so they can be estimated separately, in the same way).

Then the likelihood is $l(\theta) = p(D\,|\,\theta) = \prod_{j=1}^{N} p(x_j\,|\,\theta)$.

Maximum-likelihood estimate ($\theta$ is an unknown constant):

$$\hat{\theta} = \arg\max_{\theta}\, l(\theta) = \arg\max_{\theta}\, P(D\,|\,\theta).$$

Maximum a posteriori estimate ($\theta$ is an unknown random variable, with prior $P(\theta)$):

$$\hat{\theta} = \arg\max_{\theta}\, P(\theta\,|\,D) = \arg\max_{\theta}\, P(D\,|\,\theta)\, P(\theta).$$

Note that ML = MAP with a uniform prior.

Page 28: Irina Rish - Columbia University

ML estimate: Gaussian distribution

Known $\sigma$, estimate $\mu$:

$$\hat{\mu} = \frac{1}{N} \sum_{j=1}^{N} x_j.$$

Both $\mu$ and $\sigma^2$ unknown:

$$\hat{\mu} = \frac{1}{N} \sum_{j=1}^{N} x_j, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{j=1}^{N} (x_j - \hat{\mu})^2.$$

Note that $\hat{\sigma}^2$ is biased, i.e.

$$E\left[\frac{1}{N} \sum_{j=1}^{N} (x_j - \hat{\mu})^2\right] = \frac{N-1}{N}\, \sigma^2 \neq \sigma^2.$$

The unbiased estimate would be $\hat{\sigma}^2 = \dfrac{1}{N-1} \sum_{j=1}^{N} (x_j - \hat{\mu})^2$.
Page 29: Irina Rish - Columbia University
Page 30: Irina Rish - Columbia University

Bayesian (MAP) estimate with increasing sample size

Page 31: Irina Rish - Columbia University

Parameter estimation: discrete features

Multinomial $P(x|C)$: $\theta_{ik} = P(x = k\,|\,C = \omega_i)$.

ML-estimate: $\hat{\Theta} = \arg\max_{\Theta} \log P(D\,|\,\Theta)$, which gives the relative frequency counts:

$$\hat{\theta}_{ik} = \frac{N_{x = k,\, c_i}}{\sum_{k} N_{x = k,\, c_i}} \quad \text{(ML)}.$$

MAP-estimate: $\hat{\Theta} = \arg\max_{\Theta} \log \left[ P(D\,|\,\Theta)\, P(\Theta) \right]$, with the conjugate prior – Dirichlet $Dir(\theta\,|\,\alpha_1, \ldots, \alpha_m)$:

$$\hat{\theta}_{ik} = \frac{N_{x = k,\, c_i} + \alpha_k}{\sum_{k} \left( N_{x = k,\, c_i} + \alpha_k \right)} \quad \text{(MAP)}.$$

The hyperparameters $\alpha_k$ act as an equivalent sample size (prior knowledge).

[Figure: a two-node model $C \to X$.]
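A minimal sketch of the ML and MAP (Dirichlet) estimates above for a single discrete feature within one class; the data values and the Dirichlet pseudo-counts are assumed for illustration.

```python
from collections import Counter

data = [0, 1, 1, 2, 1, 0, 1, 2, 1, 1]   # observed values of X within one class (assumed)
m = 3                                    # number of possible values of X
alpha = [1.0] * m                        # Dirichlet pseudo-counts (add-one smoothing)

counts = Counter(data)
N = len(data)

theta_ml  = [counts[k] / N for k in range(m)]
theta_map = [(counts[k] + alpha[k]) / (N + sum(alpha)) for k in range(m)]

print("ML: ", theta_ml)    # pure frequencies; an unseen value would get probability 0
print("MAP:", theta_map)   # pseudo-counts keep every probability strictly positive
```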

Page 32: Irina Rish - Columbia University

An example: naïve Bayes classifier. Simplifying (“naïve”) assumption: feature independence given the class:

$$P(x_1, \ldots, x_n\,|\,C) = \prod_{i=1}^{n} P(x_i\,|\,C)$$

[Figure: class node $C$ with prior $P(C)$ and feature nodes $x_1, \ldots, x_n$, each with conditional probability $P(x_i|C)$.]

1. Bayes(-optimal) classifier: given an (unlabeled) instance $x = (x_1, \ldots, x_n)$, choose the most likely class:

$$BO(x) = \arg\max_{i} P(C_i\,|\,x).$$

2. Naïve Bayes classifier:

$$NB(x) = \arg\max_{i} P(C_i) \prod_{j=1}^{n} P(x_j\,|\,C_i),$$

by the Bayes rule $P(C_i|x) = P(x|C_i)\, P(C_i)\, /\, P(x)$ and by the independence assumption.
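A minimal end-to-end sketch of the naïve Bayes classifier just defined, for binary features; the toy training data and the smoothing constant are assumed for illustration.

```python
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([0, 0, 0, 1, 1])    # class labels for the toy data (assumed)

classes = np.unique(y)
alpha = 1.0                      # Laplace smoothing (MAP with a Dirichlet prior)

priors = {c: np.mean(y == c) for c in classes}
# Smoothed P(x_j = 1 | C = c) for each binary feature j
cond = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha) for c in classes}

def nb_predict(x):
    def log_score(c):
        p = cond[c]
        # log P(C=c) + sum_j log P(x_j | C=c)
        return np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(classes, key=log_score)

print(nb_predict(np.array([1, 0, 1])))   # expected: class 0
print(nb_predict(np.array([0, 1, 0])))   # expected: class 1
```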

Page 33: Irina Rish - Columbia University

State-of-the-art

Optimality results:

Linear decision surface for binary features (Minsky 61, Duda & Hart 73); polynomial for general nominal features (Duda & Hart 1973, Peot 96)

Optimality for OR and AND concepts (Domingos & Pazzani 97)

No XOR-containing concepts on nominal features (Zhang & Ling 01)

Algorithmic improvements:

Boosted NB (Elkan 97) is equivalent to a multilayer perceptron

Augmented NB (TAN, Bayes nets – e.g., Friedman et al. 97)

Other improvements: combining with decision trees (Kohavi), with error-correcting output coding (ECOC) (Ghani, ICML 2000), etc.

Still, open problems remain: NB error estimates/bounds based on domain properties

Page 34: Irina Rish - Columbia University

Why does naïve Bayes often work well (despite the independence assumption)?

Wrong P(C|x) estimates do not imply wrong classification!

Domingos & Pazzani 97, J. Friedman 97, etc.; “Statistical diagnosis based on conditional independence does not require it”, J. Hilden 84

[Figure: true posterior $P(\text{class}\,|\,f)$ vs. the NB estimate $\hat{P}(\text{class}\,|\,f)$; the two can differ while still agreeing on the optimal class.]

Bayes-optimal: $class_{opt} = \arg\max_i P(class_i\,|\,f)$

Naïve Bayes: $class_{NB} = \arg\max_i \hat{P}(class_i\,|\,f)$

Major questions remain: Which $P(c, x)$ are ‘good’ for NB?

What domain properties “predict” NB accuracy?

Page 35: Irina Rish - Columbia University

General question: characterizing distributions $P(X, C)$ and their approximations $Q(X, C)$ that can be ‘far’ from $P(X, C)$, but yield low classification error

[Figure: true $P(x|C)\, P(C)$ and an approximate $P$ for Class 1 and Class 2; both give the same optimal decision boundary.]

Note: one measure of ‘distance’ between distributions is relative entropy, or KL-divergence (see homework problem 11, chap. 3):

$$D(P\,\|\,Q) = \int P(z) \log \frac{P(z)}{Q(z)}\, dz$$
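A minimal sketch of this KL-divergence for discrete distributions; the two distributions below are assumed example values.

```python
import numpy as np

def kl_divergence(p, q):
    """D(P||Q) = sum_z P(z) * log(P(z)/Q(z)), with the convention 0 * log(0/q) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

P = [0.5, 0.3, 0.2]   # 'true' distribution (assumed)
Q = [0.4, 0.4, 0.2]   # approximation (assumed)

print(kl_divergence(P, Q))   # non-negative; larger means Q is a worse approximation of P
print(kl_divergence(P, P))   # 0.0
```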

Page 36: Irina Rish - Columbia University

Case study: using Naïve Bayes for the Transaction Recognition Problem

[Figure: a client workstation issues end-user transactions (EUTs) such as OpenDB, Search, SendMail over a session (connection) to a server (Web, DB, Lotus Notes); each transaction generates remote procedure calls (RPCs). Which RPCs belong to which transaction?]

Two problems: segmentation and labeling

[Figure: an unsegmented stream of RPCs (e.g., 1 2 1 3 4 1 2 3 …) must be segmented and labeled with transaction types Tx1, Tx2, Tx3, ….]

Page 37: Irina Rish - Columbia University

Representing transactions as feature vectors

Example: a transaction of type $i$ generating the RPC sequence R2 R1 R2 R2 R3 R5 R5 can be represented by

RPC occurrences: $f = (1, 1, 1, 0, 1, 0, \ldots)$

RPC counts: $f = (1, 3, 1, 0, 2, 0, \ldots)$

Models of $P(\text{features}\,|\,T_i)$ for transaction type $T_i$ and RPC $j$:

Bernoulli: $P(R_j = 1\,|\,T_i) = p_{ij}$

Multinomial: $P(n_{i1}, \ldots, n_{iM}\,|\,T_i) = n_i! \prod_{j=1}^{M} \dfrac{p_{ij}^{\,n_{ij}}}{n_{ij}!}$, where $n_i = \sum_j n_{ij}$

Geometric: $P(n_{ij}\,|\,T_i) = p_{ij} (1 - p_{ij})^{n_{ij}}$

Shifted geometric: $P(n_{ij}\,|\,T_i) = p_{ij} (1 - p_{ij})^{n_{ij} - s_{ij}}$

Best fit to the data ($\chi^2$): shifted geometric
Page 38: Irina Rish - Columbia University

Empirical results

Significant improvement over the baseline classifier (75%).

NB is simple, efficient, and comparable to state-of-the-art classifiers: SVM – 85–87%, Decision Tree – 90–92%.

Best-fit distribution (shifted geometric) – not necessarily the best classifier! (?)

Baseline classifier: always selects the most-frequent transaction.

[Figure: accuracy vs. training set size for ‘NB + Bernoulli, mult. or geom.’ and ‘NB + shifted geom.’; reported accuracies in the plot: 87 ± 2%, 79 ± 3%, ≈10 ± 1%.]

Page 39: Irina Rish - Columbia University

Next lecture on Bayesian topics

April 17, 2002 – lecture on recent ‘hot stuff’: Bayesian networks, HMMs, the EM algorithm

Page 40: Irina Rish - Columbia University

Short Preview: From Naïve Bayes to Bayesian Networks

Naïve Bayes model: independent features given the class. [Figure: class node with feature nodes $f_1, \ldots, f_n$ and conditional probabilities $P(f_i\,|\,\text{class})$.]

Bayesian network (BN) model: any joint probability distribution.

[Figure: a Bayesian network with nodes Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), and Dyspnoea (D), with conditional distributions P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B).]

$$P(S, C, B, X, D) = P(S)\, P(C|S)\, P(B|S)\, P(X|C,S)\, P(D|C,B)$$

CPD for P(D|C,B):

C B | D=0  D=1
0 0 | 0.1  0.9
0 1 | 0.7  0.3
1 0 | 0.8  0.2
1 1 | 0.9  0.1

Query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
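A minimal sketch of answering this query by brute-force enumeration over the factorization above. Only the CPD P(D|C,B) is taken from the slide; every other CPD value is an assumption made purely for illustration.

```python
import itertools

p_S = {0: 0.7, 1: 0.3}                           # assumed P(S)
p_C_given_S = {0: 0.01, 1: 0.10}                 # assumed P(C=1|S)
p_B_given_S = {0: 0.20, 1: 0.40}                 # assumed P(B=1|S)
p_X_given_CS = {(0, 0): 0.05, (0, 1): 0.05,      # assumed P(X=1|C,S)
                (1, 0): 0.90, (1, 1): 0.90}
p_D_given_CB = {(0, 0): 0.9, (0, 1): 0.3,        # P(D=1|C,B) from the slide's CPD table
                (1, 0): 0.2, (1, 1): 0.1}

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    p = p_S[s]
    p *= p_C_given_S[s] if c else 1 - p_C_given_S[s]
    p *= p_B_given_S[s] if b else 1 - p_B_given_S[s]
    p *= p_X_given_CS[(c, s)] if x else 1 - p_X_given_CS[(c, s)]
    p *= p_D_given_CB[(c, b)] if d else 1 - p_D_given_CB[(c, b)]
    return p

# P(C=1 | S=0, D=1): sum out the hidden variables B and X, then normalize
num = sum(joint(0, 1, b, x, 1) for b, x in itertools.product([0, 1], repeat=2))
den = sum(joint(0, c, b, x, 1) for c, b, x in itertools.product([0, 1], repeat=3))
print("P(lung cancer = yes | smoking = no, dyspnoea = yes) =", num / den)
```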