Transcript of Irina Rish - Columbia University
Probabilistic Classifiers
Irina Rish IBM T.J. Watson Research [email protected], http://www.research.ibm.com/people/r/rish/
February 20, 2002
Probabilistic Classification: Outline
- Bayesian decision theory
  - Bayes decision rule (revisited): Bayes risk, 0/1 loss, optimal classifier, discriminability
- Probabilistic classifiers and their decision surfaces:
  - Continuous features (Gaussian distribution)
  - Discrete (binary) features (+ class-conditional independence)
- Parameter estimation
  - "Classical" statistics: maximum-likelihood (ML)
  - Bayesian statistics: maximum a posteriori (MAP)
- Commonly used classifier: naïve Bayes
  - VERY simple: class-conditional feature independence
  - VERY efficient (empirically); why and when? - still an open question
Bayesian decision theory
- Make a decision that minimizes the overall expected cost (loss)
- Advantage: theoretically guaranteed optimal decisions
- Drawback: probability distributions are assumed to be known (in practice, estimating those distributions from data can be a hard problem)

The classification problem is an example of a decision problem: given observed properties (features) of an object, find its class. Examples:
- Sorting fish by type (sea bass or salmon) given observed features such as lightness and length
- Video character recognition
- Face recognition
- Document classification using word counts
- Guessing a user's intentions (potential buyer or not) from their web transactions
- Intrusion detection
Notation and definitions
- ω: state of nature (class) - a random variable with distribution P(ω); Ω = {ω_1, ..., ω_n} is the set of possible states of nature (class labels)
- x = (x_1, ..., x_d): feature vector in feature space X
  - Continuous X: x ∈ ℝ^d, with probability density p(x|ω)
  - Discrete X: x_i ∈ S_i, i = 1, ..., d, X = S_1 × ... × S_d, with probability P(x|ω)
- A = {α_1, ..., α_a}: a set of actions (decisions). Example: A = {C_1, ..., C_k}, and a decision rule g(x): X → Ω is a classifier
- λ(α_i|ω_j): loss function - cost of decision α_i given state ω_j. Example (0/1 loss): λ_ij = 1 if i ≠ j, and λ_ij = 0 if i = j
- Conditional risk - expected loss of action α_i given x:
  R(α_i|x) = Σ_{j=1}^{n} λ(α_i|ω_j) P(ω_j|x)
- Total expected loss (risk) of a decision rule α(x):
  R = ∫ R(α(x)|x) p(x) dx
Gaussian (normal) density

p(x) = (1 / (√(2π) σ)) exp( -(1/2) ((x - µ)/σ)^2 )

- µ = E[x] = ∫ x p(x) dx : expected value (mean)
- σ^2 = E[(x - µ)^2] : variance (σ: standard deviation)
- r = |x - µ| / σ : Mahalanobis distance from x to µ

Interesting property (homework problem): N(µ, σ) has maximum entropy H(p(x)) among all p(x) with given mean and variance, where

H(p(x)) = -∫ p(x) ln p(x) dx   (in nats)

When learning from data, max-entropy distributions are most reasonable since they impose 'no additional structure' besides what is given as constraints.
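The density and the entropy formula above can be checked numerically. A minimal sketch (the grid bounds and step size are arbitrary choices, not from the lecture): the density integrates to ~1, and its numeric entropy matches the known closed form 0.5·ln(2πe·σ²) nats.

```python
import math

# Univariate Gaussian density p(x) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x-mu)/sigma)^2)
def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Crude Riemann sums over [mu - 10*sigma, mu + 10*sigma] (assumed grid):
mu, sigma, dx = 1.0, 2.0, 0.001
xs = [mu - 10 * sigma + i * dx for i in range(int(20 * sigma / dx))]

total = sum(gaussian_pdf(x, mu, sigma) * dx for x in xs)          # ~ 1
entropy = -sum(gaussian_pdf(x, mu, sigma) * math.log(gaussian_pdf(x, mu, sigma)) * dx
               for x in xs)                                       # in nats
closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)   # max-entropy value
```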
Bayes rule for binary classification
- Given only the priors P(ω_1) and P(ω_2): choose ω_1 if P(ω_1) > P(ω_2), choose ω_2 otherwise
- Given evidence x, update P(ω_i) using the Bayes formula:
  P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x),   i.e.  posterior = likelihood × prior / evidence,
  where p(x) = Σ_{j=1}^{n} p(x|ω_j) P(ω_j)
- Bayes decision rule:
  g*(x) = ω_1 if P(ω_1|x) > P(ω_2|x), g*(x) = ω_2 otherwise
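The binary Bayes rule can be sketched in a few lines. This is a toy illustration with assumed likelihood values, not numbers from the lecture; the priors 2/3 and 1/3 echo the one-dimensional example.

```python
# P(w_i|x) = p(x|w_i) P(w_i) / p(x); choose the class with the larger posterior.
def posterior(likelihoods, priors):
    """likelihoods[i] = p(x|w_i); returns the list of posteriors P(w_i|x)."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # p(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

def bayes_decide(likelihoods, priors):
    post = posterior(likelihoods, priors)
    return max(range(len(post)), key=lambda i: post[i])  # argmax_i P(w_i|x)

# Assumed example: p(x|w1)=0.2, p(x|w2)=0.5, priors 2/3 and 1/3.
# Posteriors are proportional to 0.2*(2/3)=0.133 vs 0.5*(1/3)=0.167, so w2 wins.
choice = bayes_decide([0.2, 0.5], [2 / 3, 1 / 3])
```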
Example: one-dimensional case
(Figure: class-conditional densities p(x|ω_i) and posteriors P(ω_i|x) plotted against x.)
Priors: P(ω_1) = 2/3, P(ω_2) = 1/3
g*(x) = ω_1 if P(ω_1|x) > P(ω_2|x); g*(x) = ω_2 if P(ω_1|x) ≤ P(ω_2|x)
Since p(x) in P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x) does not depend on i, we get the equivalent rule:
g*(x) = ω_1 if p(x|ω_1) P(ω_1) > p(x|ω_2) P(ω_2)
Optimality of Bayes rule: idea
P(error|g) = P(g(x) ≠ ω) = P(g(x) = ω_1, ω = ω_2) + P(g(x) = ω_2, ω = ω_1)
Claim: P(error|g*) ≤ P(error|g) for any g(x): S → Ω
(In the one-dimensional example: Bayes rule: g*(x) = ω_1 if x < x*; other rule: g(x) = ω_1 if x < x_g.)
Proof – see next page…
Optimality of Bayes rule: proof
For any rule g and any x:
P(error|g, x) = P(g(x) = ω_1, ω_2|x) + P(g(x) = ω_2, ω_1|x)
             = p_1 P(ω_2|x) + p_2 P(ω_1|x),
where p_1 = P(g(x) = ω_1|x), p_2 = P(g(x) = ω_2|x), and p_1 + p_2 = 1. For the Bayes rule:
1) if P(ω_1|x) > P(ω_2|x), then g*(x) = ω_1, i.e. p_1 = 1, p_2 = 0, and P(error|g*, x) = P(ω_2|x)
2) if P(ω_1|x) ≤ P(ω_2|x), then g*(x) = ω_2, i.e. p_1 = 0, p_2 = 1, and P(error|g*, x) = P(ω_1|x)
Thus P(error|g*, x) = min{P(ω_1|x), P(ω_2|x)} ≤ P(error|g, x) for any g: S → Ω, and
P(error|g*) = ∫ P(error|g*, x) p(x) dx ≤ ∫ P(error|g, x) p(x) dx = P(error|g)
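The inequality in the proof can be sanity-checked numerically. This sketch assumes two unit-variance Gaussian class-conditionals with means -1 and +1 and equal priors (made-up numbers): the Bayes error, i.e. the integral of the pointwise minimum, never exceeds the error of any fixed-threshold rule.

```python
import math

def pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

p1, p2 = 0.5, 0.5       # priors P(w1), P(w2)
mu1, mu2 = -1.0, 1.0    # class-conditional means
DX = 0.01
GRID = [-8.0 + i * DX for i in range(1600)]  # crude integration grid

# P(error|g*) integrates the pointwise minimum of the two joint densities.
bayes_error = sum(min(pdf(x, mu1) * p1, pdf(x, mu2) * p2) * DX for x in GRID)

def threshold_error(t):
    # Rule "choose w1 if x < t": error mass is w2 below t plus w1 above t.
    return sum((pdf(x, mu2) * p2 if x < t else pdf(x, mu1) * p1) * DX
               for x in GRID)
```

With equal priors and symmetric densities, the optimal threshold is 0, so `threshold_error(0.0)` coincides with `bayes_error`, while any other threshold does strictly worse.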
General Bayesian Decision Theory
- Given: a set of available actions A = {α_1, ..., α_a}; loss function λ_ij = λ(α_i|ω_j) (cost of decision α_i given ω_j)
- Find: a decision rule α(x): X → A minimizing the total expected loss (risk):
  R = ∫ R(α(x)|x) p(x) dx,
  where R(α_i|x) = Σ_{j=1}^{n} λ(α_i|ω_j) P(ω_j|x) is the conditional risk of action α_i given x,
  and P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x)   (Bayes formula)
- Bayes decision rule: always minimize the conditional risk:
  α*(x) = argmin_i R(α_i|x) = argmin_i Σ_{j=1}^{n} λ(α_i|ω_j) P(ω_j|x)
- This yields the minimum overall risk (called the Bayes risk):
  R* = ∫ min_i R(α_i|x) p(x) dx
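The general rule reduces to a small argmin over conditional risks. A sketch with assumed numbers: `lam[i][j]` is the loss λ(α_i|ω_j), and an asymmetric loss matrix can flip the decision away from the most probable class.

```python
# R(a_i|x) = sum_j lam[i][j] * P(w_j|x); Bayes rule picks argmin_i R(a_i|x).
def conditional_risk(lam, posteriors):
    return [sum(l_ij * p for l_ij, p in zip(row, posteriors)) for row in lam]

def bayes_action(lam, posteriors):
    risks = conditional_risk(lam, posteriors)
    return min(range(len(risks)), key=lambda i: risks[i])

post = [0.7, 0.3]                     # assumed posteriors P(w1|x), P(w2|x)

# 0/1 loss recovers "pick the most probable class": risks are 0.3 vs 0.7.
zero_one = [[0, 1], [1, 0]]
a = bayes_action(zero_one, post)      # action 0

# Asymmetric loss: mistaking w2 for w1 costs 10, so risks are 3.0 vs 0.7.
asym = [[0, 10], [1, 0]]
b = bayes_action(asym, post)          # action 1, despite P(w1|x) > P(w2|x)
```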
Zero-one loss classification
Let α_i = (g(x) = ω_i), i.e. action i is classification into class ω_i, i = 1, ..., n, and use the zero-one loss:
λ(α_i|ω_j) = 1 if i ≠ j; λ(α_i|ω_j) = 0 if i = j
Then the conditional risk is the classification error:
R(α_i|x) = Σ_{j=1}^{n} λ(α_i|ω_j) P(ω_j|x) = Σ_{j≠i} P(ω_j|x) = 1 - P(ω_i|x) = P(error|x)
Bayes rule: g*(x) = ω_i if P(ω_i|x) > P(ω_j|x) for all j ≠ i
The Bayes rule achieves the minimum error rate:
R* = ∫ min_i R(α_i|x) p(x) dx
Different errors, different costs
Example from signal detection theory: ω_1 - no signal (background noise), ω_2 - signal present; decide by a threshold x*:
- Hit: P(x > x* | x ∈ ω_2), cost λ_22
- Miss: P(x < x* | x ∈ ω_2), cost λ_12
- False alarm: P(x > x* | x ∈ ω_1), cost λ_21
- Correct rejection: P(x < x* | x ∈ ω_1), cost λ_11
Discriminability: d' = |µ_2 - µ_1| / σ

Receiver operating characteristic (ROC) curve
(Figure: ROC curve, plotting the hit rate against the false-alarm rate P(x > x* | x ∈ ω_1) as the threshold x* varies.)
Error costs determine the optimal decision (threshold x*).
Cost-based classification
Let λ_ij = λ(α_i|ω_j), i, j = 1, 2. The conditional risks are:
R(α_1|x) = λ_11 P(ω_1|x) + λ_12 P(ω_2|x)
R(α_2|x) = λ_21 P(ω_1|x) + λ_22 P(ω_2|x)
Bayes rule: g*(x) = ω_1 if R(α_1|x) < R(α_2|x), i.e. if
(λ_21 - λ_11) P(ω_1|x) > (λ_12 - λ_22) P(ω_2|x)
Assuming λ_21 > λ_11 and λ_12 > λ_22 (errors cost more than correct decisions), this becomes a likelihood-ratio test:
g*(x) = ω_1 if
p(x|ω_1) / p(x|ω_2) > [(λ_12 - λ_22) / (λ_21 - λ_11)] · [P(ω_2) / P(ω_1)] = θ_λ

Example: one-dimensional case
If misclassifying ω_2 as ω_1 becomes more expensive than the opposite error, i.e. λ_12 > λ_21, then the threshold increases.
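The likelihood-ratio threshold θ_λ can be sketched directly from the formula (the numeric losses below are assumed for illustration):

```python
# theta = (lam12 - lam22) / (lam21 - lam11) * P(w2) / P(w1);
# choose w1 iff the likelihood ratio p(x|w1)/p(x|w2) exceeds theta.
def cost_threshold(lam11, lam12, lam21, lam22, prior1, prior2):
    return (lam12 - lam22) / (lam21 - lam11) * prior2 / prior1

def choose_w1(lik1, lik2, theta):
    return lik1 / lik2 > theta

# 0/1 loss and equal priors: the threshold is 1 (plain maximum-posterior rule).
theta_equal = cost_threshold(0, 1, 1, 0, 0.5, 0.5)

# Missing w2 is 5x as costly (lam12 = 5): the threshold rises to 5, so a
# likelihood ratio of 2 in favor of w1 is no longer enough to choose w1.
theta_costly = cost_threshold(0, 5, 1, 0, 0.5, 0.5)
```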
Discriminant functions and classification
- Multi-category classification:
  - discriminant functions: f_i(x), i = 1, ..., n (e.g., f_i(x) = -R(α_i|x), or f_i(x) = P(ω_i|x) for the Bayes classifier)
  - classifier: g(x) = ω_i if f_i(x) > f_j(x) for all j ≠ i
- Two-category classification:
  - single discriminant function: f(x) ≡ f_1(x) - f_2(x)
  - classifier: g(x) = ω_1 if f(x) > 0
(Figure: discriminant functions f_1(x), f_2(x), ..., f_n(x).)
Bayes Discriminant Functions
Bayes discriminant functions for zero-one loss classification:
f_i*(x) = -R(α_i|x) = P(ω_i|x) - 1
Note: every f_i(x) can be replaced by h(f_i(x)), where h(·) is monotonically increasing!
Multi-category:
f_i*(x) = P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x)
f_i*(x) = p(x|ω_i) P(ω_i)
f_i*(x) = log p(x|ω_i) + log P(ω_i)
Two-category:
f*(x) ≡ f_1*(x) - f_2*(x) = P(ω_1|x) - P(ω_2|x)
f*(x) = log [ p(x|ω_1) / p(x|ω_2) ] + log [ P(ω_1) / P(ω_2) ]
Decision regions and surfaces:
Discriminant functions: examples

Features                                      | Discriminant functions (decision surfaces)
Continuous (Gaussian), same covariance        | linear (hyperplanes)
Continuous (Gaussian), general case           | hyperquadrics (hyperellipsoids, hyperparaboloids, hyperhyperboloids)
Binary, conditionally independent:            | linear (hyperplanes)
  P(x_i, x_j|C) = P(x_i|C) P(x_j|C)           |
Multivariate Gaussian:
p(x|ω_i) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) exp( -(1/2) (x - µ_i)^t Σ_i^{-1} (x - µ_i) )
- Same covariance matrix for all classes: Σ_i = Σ
- General case: arbitrary Σ_i
Conditionally independent binary features → linear classifier ('naïve Bayes' - see later)
Let x = (x_1, ..., x_n), x_i ∈ {0, 1} (features), c ∈ {ω_1, ω_2} (class), and
p_i = P(x_i = 1|ω_1), q_i = P(x_i = 1|ω_2). Then
P(x|ω_1) = Π_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i},  P(x|ω_2) = Π_{i=1}^{n} q_i^{x_i} (1 - q_i)^{1 - x_i}
f(x) = log [ P(x|ω_1) P(ω_1) / (P(x|ω_2) P(ω_2)) ]
     = Σ_{i=1}^{n} [ x_i log (p_i / q_i) + (1 - x_i) log ((1 - p_i) / (1 - q_i)) ] + log [ P(ω_1) / P(ω_2) ]
     = Σ_{i=1}^{n} w_i x_i + w_0,
where
w_i = log [ p_i (1 - q_i) / (q_i (1 - p_i)) ],
w_0 = Σ_{i=1}^{n} log [ (1 - p_i) / (1 - q_i) ] + log [ P(ω_1) / P(ω_2) ]
Discriminant function f(x) (decision rule: if f(x) > 0, choose ω_1) is linear in x!
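The linearity claim is easy to verify in code: compute the weights w_i and w_0 from the per-feature probabilities and check that the linear form equals the direct log-odds. The probabilities below are assumed toy values.

```python
import math

# w_i = log[p_i(1-q_i) / (q_i(1-p_i))];
# w0  = sum_i log[(1-p_i)/(1-q_i)] + log[P(w1)/P(w2)].
def linear_nb(ps, qs, prior1, prior2):
    w = [math.log(p * (1 - q) / (q * (1 - p))) for p, q in zip(ps, qs)]
    w0 = sum(math.log((1 - p) / (1 - q)) for p, q in zip(ps, qs)) \
         + math.log(prior1 / prior2)
    return w, w0

def f(x, w, w0):
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# Direct log-odds log[P(x|w1)P(w1) / (P(x|w2)P(w2))] for comparison:
def log_odds(x, ps, qs, prior1, prior2):
    def logjoint(probs, prior):
        return sum(math.log(p if xi else 1 - p) for p, xi in zip(probs, x)) \
               + math.log(prior)
    return logjoint(ps, prior1) - logjoint(qs, prior2)

ps, qs = [0.8, 0.3, 0.6], [0.2, 0.7, 0.5]   # assumed p_i, q_i
w, w0 = linear_nb(ps, qs, 0.5, 0.5)
x = [1, 0, 1]
```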
Case 1: independent Gaussian features with the same variance for all classes: Σ_i = σ^2 I
Note: linear separating surfaces!
Case 2: generalization to dependent features having the same covariances for all classes: Σ_i = Σ
Case 3: unequal covariance matrices. One dimension: multiply-connected decision regions
Case 3, many dimensions: hyperquadric surfaces
Summary
Bayesian decision theory:
- In theory: tells you how to make optimal (minimum-risk) decisions
- In practice: where do you get those probabilities from? Expert knowledge + learning from data (see next; also, Chapter 3)

Note: some typos in Chapter 2
- Page 25, second line after equation (12): must be R(α_i|x), not R(α_i)
- Page 27, third line before Section 2.3.1: switch ω_1 and ω_2, switch λ_12 and λ_21, and reverse the inequality in (12)
- Pages 50 and 51, Fig. 2.20 and Fig. 2.21: replace the x-axis label by P(x > x* | x ∈ ω_1); same replacement in Problem 9, page 75
- Section 2.11 (Bayesian belief networks): contains several mistakes; ignore for now. Bayesian networks will be covered later.
Parameter Estimation
variable?random aor constant physical'' aparameter is :difference calPhilosophi
(MAP)Bayesian and (ML) lstatistica classical :approachesmajor Two
, estimate ),,N(Gaussian is ),|p(x assume :Example
data from estimationparameter learningThen
etc.) l,multinomia Gaussian, , (e.g.approach on distributi parametric fixed afirst consider We
etc.) ce,independen featureor ),|P( of form parametric (e.g., sassumption gsimplifyin :Solution
spaces feature ldimensiona-highin especially hard, is )|P( estimating , Usually
problem estimationdensity e.g., - )|P( and )P(C (estimate) find wish to we
, ),( where,given general,In
iiiii
i
i
ii
jj
•
•
•
=•
•
•
•
=
==•
σµσµω
ω
ω
ωω
ω
x
x
x
xyy,...,yD data training jN1
Maximum likelihood (ML) and maximum a posteriori (MAP) estimates
- Assume independent and identically distributed (i.i.d.) samples: training data D = {y_1, ..., y_N}, where y_j = (x_j, ω_j)
- Assume a parametric distribution p(x|ω_i, θ_i), where θ_i is a parameter vector. Example: θ_i = (µ_i, σ_i) for Gaussian p(x|ω_i)
- We also assume that the θ_i for different classes are independent (can be estimated separately, in the same way)
- Then the likelihood is l(θ) = P(D|θ) = Π_{j=1}^{N} p(x_j|θ)
- Maximum-likelihood estimate (θ is an unknown constant):
  θ̂ = argmax_θ l(θ) = argmax_θ P(D|θ)
- Maximum a posteriori estimate (θ is an unknown random variable with prior P(θ)):
  θ̂ = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ)
- Note that ML = MAP with a uniform prior
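A minimal sketch of the ML-vs-MAP contrast on a Bernoulli parameter (numbers assumed; the Beta prior here plays the same conjugate role the Dirichlet plays later for discrete features): with k heads in n flips, ML gives k/n, while the MAP estimate under a Beta(a, b) prior is the posterior mode (k + a - 1)/(n + a + b - 2).

```python
# ML and MAP point estimates for a Bernoulli parameter theta.
def ml_bernoulli(k, n):
    return k / n

def map_bernoulli(k, n, a, b):
    # Mode of the Beta(a + k, b + n - k) posterior.
    return (k + a - 1) / (n + a + b - 2)

ml = ml_bernoulli(7, 10)                    # 0.7
map_uniform = map_bernoulli(7, 10, 1, 1)    # Beta(1,1) = uniform prior: same as ML
map_peaked = map_bernoulli(7, 10, 5, 5)     # prior pulls the estimate toward 0.5
```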
ML estimate: Gaussian distribution
- Known σ, estimate µ:
  µ̂ = (1/N) Σ_{j=1}^{N} x_j
- Both µ and σ unknown:
  µ̂ = (1/N) Σ_{j=1}^{N} x_j,   σ̂^2 = (1/N) Σ_{j=1}^{N} (x_j - µ̂)^2
- Note that σ̂^2 is biased, i.e.
  E[ (1/N) Σ_{j=1}^{N} (x_j - µ̂)^2 ] = ((N - 1)/N) σ^2 ≠ σ^2
- The unbiased estimate would be
  σ̂^2 = (1/(N - 1)) Σ_{j=1}^{N} (x_j - µ̂)^2
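The bias factor (N-1)/N can be observed on simulated data. A sketch with assumed settings (N = 5 samples from N(0, σ² = 4), averaged over many trials): the ML variance estimate averages to about 0.8 · 4 = 3.2 rather than 4.

```python
import random

random.seed(0)
N, trials, true_var = 5, 20000, 4.0

def ml_var(xs):
    # Biased ML estimate: divides by N, not N - 1.
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

avg_ml = sum(ml_var([random.gauss(0.0, 2.0) for _ in range(N)])
             for _ in range(trials)) / trials
# Expect avg_ml ~ (N-1)/N * true_var = 3.2, noticeably below 4.
```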
Bayesian (MAP) estimate with increasing sample size
Parameter estimation: discrete features
- ML-estimate: Θ̂ = argmax_Θ log P(D|Θ)
  For a multinomial P(x|C) with θ_ik = P(x = k | C = ω_i):
  θ̂_ik = N_{x=k, c_i} / Σ_k N_{x=k, c_i}   (ML; the N's are counts)
- MAP-estimate: Θ̂ = argmax_Θ log [ P(D|Θ) P(Θ) ]
  Conjugate priors - Dirichlet Dir(θ | α_1, ..., α_m):
  θ̂_ik = (N_{x=k, c_i} + α_k) / Σ_k (N_{x=k, c_i} + α_k)   (MAP)
  The α's act as an equivalent sample size (prior knowledge).
(Figure: a two-node model C → X.)
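Both count-based estimates are one-liners. A sketch with made-up counts showing why the Dirichlet pseudo-counts matter: ML assigns probability zero to any value never seen in a class, which would zero out the naïve Bayes product below, while the smoothed MAP estimate does not.

```python
# ML: theta_k = N_k / sum_k N_k; MAP: theta_k = (N_k + alpha_k) / sum_k (N_k + alpha_k).
def ml_theta(counts):
    n = sum(counts)
    return [c / n for c in counts]

def map_theta(counts, alphas):
    n = sum(counts) + sum(alphas)
    return [(c + a) / n for c, a in zip(counts, alphas)]

counts = [8, 2, 0]                   # value 3 never observed in this class
ml = ml_theta(counts)                # [0.8, 0.2, 0.0] -- a zero kills products
mp = map_theta(counts, [1, 1, 1])    # add-one smoothing: no zero estimates
```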
An example: naïve Bayes classifier
Simplifying ("naïve") assumption: feature independence given class:
P(x_1, ..., x_n|C) = Π_{i=1}^{n} P(x_i|C)
(Figure: class node C with P(C); feature nodes x_1, x_2, ..., x_n with P(x_1|C), P(x_2|C), ..., P(x_n|C).)
1. Bayes(-optimal) classifier: given an (unlabeled) instance x = (x_1, ..., x_n), choose the most likely class:
BO(x) = argmax_i P(C_i|x)
2. Naïve Bayes classifier:
NB(x) = argmax_i P(C_i) Π_{j=1}^{n} P(x_j|C_i)
(by Bayes rule, P(C_i|x) = P(x|C_i) P(C_i) / P(x), and by the independence assumption)
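The NB rule above fits in a few lines (log space avoids underflow for many features). The priors and conditional tables below are assumed toy values.

```python
import math

# NB(x) = argmax_i P(C_i) * prod_j P(x_j|C_i), computed as a sum of logs.
def nb_classify(x, priors, cond):
    """cond[i][j][v] = P(x_j = v | C_i)."""
    def score(i):
        return math.log(priors[i]) + sum(math.log(cond[i][j][v])
                                         for j, v in enumerate(x))
    return max(range(len(priors)), key=score)

priors = [0.6, 0.4]
cond = [
    [{0: 0.9, 1: 0.1}, {0: 0.7, 1: 0.3}],   # class 0: features mostly 0
    [{0: 0.2, 1: 0.8}, {0: 0.4, 1: 0.6}],   # class 1: features mostly 1
]
label = nb_classify([1, 1], priors, cond)    # 0.6*0.1*0.3 vs 0.4*0.8*0.6
```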
State-of-the-art
- Optimality results:
  - Linear decision surface for binary features (Minsky 61, Duda & Hart 73); polynomial for general nominal features (Duda & Hart 1973, Peot 96)
  - Optimality for OR and AND concepts (Domingos & Pazzani 97)
  - No XOR-containing concepts on nominal features (Zhang & Ling 01)
- Algorithmic improvements:
  - Boosted NB (Elkan 97) is equivalent to a multilayer perceptron
  - Augmented NB (TAN, Bayes nets - e.g., Friedman et al. 97)
  - Other improvements: combining with decision trees (Kohavi), with error-correcting output coding (ECOC) (Ghani, ICML 2000), etc.
- Still, open problems remain: NB error estimates/bounds based on domain properties
Why does naïve Bayes often work well (despite the independence assumption)?
Wrong P(C|x) estimates do not imply wrong classification!
(Domingos & Pazzani 97, J. Friedman 97, etc.; "Statistical diagnosis based on conditional independence does not require it", J. Hilden 84)
- Bayes-optimal: class_opt = argmax_i P(class_i|f)
- Naïve Bayes: class_NB = argmax_i P̂(class_i|f)
(Figure: the true P(class|f) vs. the NB estimate P̂(class|f); both can yield class_opt.)
Major questions remain: which P(C, x) are 'good' for NB? What domain properties "predict" NB accuracy?
General question: characterizing distributions P(X, C) and their approximations Q(X, C) that can be 'far' from P(X, C) but yield low classification error.
(Figure: class 1 and class 2 densities P(x|C) P(C), true P vs. approximate P, and the optimal decision boundary.)
Note: one measure of 'distance' between distributions is the relative entropy, or KL-divergence (see homework problem 11, Chapter 3):
D(P||Q) = ∫ P(z) log [ P(z) / Q(z) ] dz
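For discrete distributions the KL-divergence is a direct sum. A sketch with assumed distributions, illustrating the two basic facts used here: D(P||Q) ≥ 0, with equality iff P = Q.

```python
import math

# D(P||Q) = sum_z P(z) log(P(z)/Q(z)); terms with P(z) = 0 contribute 0.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
d_pq = kl(p, q)   # small but strictly positive: p != q
```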
Case study: using naïve Bayes for the transaction recognition problem
Setup: a client workstation issues end-user transactions (EUTs), e.g. OpenDB, Search, SendMail; each transaction generates remote procedure calls (RPCs) to a server (Web, DB, Lotus Notes) within a session (connection). The observed data are the RPCs.
Two problems: segmentation and labeling
(Figure: a stream of unsegmented RPCs, e.g. "1 2 1 3 4 1 2 3 ...", segmented into labeled transactions Tx1, Tx2, Tx3.)

Representing transactions as feature vectors
(T_i: transaction of type i; R_j: the j-th RPC type; n_ij: count of RPC j in a transaction of type i)
- RPC occurrences: f = (1, 1, 1, 0, 1, 0, ...)
  - Bernoulli: P(R_j = 1 | T_i) = p_ij
- RPC counts: f = (1, 3, 1, 0, 2, 0, ...)
  - Multinomial: P(n_i1, ..., n_iM | T_i) = n! Π_{j=1}^{M} p_ij^{n_ij} / Π_{j=1}^{M} n_ij!
  - Geometric: P(n_ij | T_i) = p_ij (1 - p_ij)^{n_ij}
  - Shifted geometric: P(n_ij | T_i) = p_ij (1 - p_ij)^{n_ij - s_ij}
Best fit to the data (χ²): shifted geometric
Empirical results
- Baseline classifier (always selects the most frequent transaction): ≈ 10 ± 1%
- NB + Bernoulli, multinomial, or geometric: 87 ± 2%
- NB + shifted geometric: 79 ± 3%
- Significant improvement over the baseline classifier (≈ 75%)
- NB is simple, efficient, and comparable to state-of-the-art classifiers: SVM - 85-87%, decision tree - 90-92%
- The best-fit distribution (shifted geometric) is not necessarily the best classifier! (?)
(Figure: accuracy vs. training set size.)
Next lecture on Bayesian topics
April 17, 2002 - lecture on recent 'hot stuff': Bayesian networks, HMMs, EM algorithm

Short preview: from naïve Bayes to Bayesian networks
- Naïve Bayes model: independent features given class
  (Figure: class node with P(C); feature nodes f_1, f_2, ..., f_n with P(f_1|class), P(f_2|class), ..., P(f_n|class).)
- Bayesian network (BN) model: any joint probability distribution
  Example (nodes: Smoking S, lung Cancer C, Bronchitis B, X-ray X, Dyspnoea D):
  P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C, S) P(D|C, B)
  Conditional probability distribution (CPD) for D:
  C B | D=0 D=1
  0 0 | 0.1 0.9
  0 1 | 0.7 0.3
  1 0 | 0.8 0.2
  1 1 | 0.9 0.1
  Query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
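The preview's query can be answered by brute-force enumeration over the factorization. Note: only the P(D|C,B) table comes from the slide; the other CPTs below are made-up numbers purely for illustration.

```python
# Assumed CPTs (hypothetical, except P_D1_CB which is the slide's CPD):
P_S1 = 0.3                                   # assumed P(S=1)
P_C1_S = {1: 0.1, 0: 0.02}                   # assumed P(C=1|S)
P_B1_S = {1: 0.4, 0: 0.2}                    # assumed P(B=1|S)
P_X1_CS = {(1, 1): 0.95, (1, 0): 0.9,
           (0, 1): 0.3, (0, 0): 0.1}         # assumed P(X=1|C,S)
P_D1_CB = {(0, 0): 0.9, (0, 1): 0.3,
           (1, 0): 0.2, (1, 1): 0.1}         # from the slide's CPD table

def bern(p1, v):
    return p1 if v == 1 else 1.0 - p1

def joint(s, c, b, x, d):
    # P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
    return (bern(P_S1, s) * bern(P_C1_S[s], c) * bern(P_B1_S[s], b)
            * bern(P_X1_CS[(c, s)], x) * bern(P_D1_CB[(c, b)], d))

total = sum(joint(s, c, b, x, d)
            for s in (0, 1) for c in (0, 1) for b in (0, 1)
            for x in (0, 1) for d in (0, 1))   # sanity: sums to 1

# P(C=c | S=0, D=1): sum out the unobserved B and X, then normalize.
def evidence_mass(c):
    return sum(joint(0, c, b, x, 1) for b in (0, 1) for x in (0, 1))

answer = evidence_mass(1) / (evidence_mass(0) + evidence_mass(1))
```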