Page 1

Bayesian Decision Theory

Hong Chang

Institute of Computing Technology, Chinese Academy of Sciences

Machine Learning Methods (Fall 2012)

Page 2

Outline

1 Introduction

2 Losses and Risks

3 Discriminant Functions

4 Bayesian Networks

Page 3

Coin Tossing Example

Outcome of tossing a coin ∈ {head, tail}. Random variable X:

$$X = \begin{cases} 1 & \text{if outcome is head} \\ 0 & \text{if outcome is tail} \end{cases}$$

X is Bernoulli-distributed:

$$P(X) = p_0^X (1 - p_0)^{1-X}$$

where the parameter $p_0$ is the probability that the outcome is head, i.e., $p_0 = P(X = 1)$.

Page 4

Estimation and Prediction

Estimation of parameter $p_0$ from sample $\mathcal{X} = \{x^{(i)}\}_{i=1}^N$:

$$\hat{p}_0 = \frac{\#\text{heads}}{\#\text{tosses}} = \frac{\sum_{i=1}^N x^{(i)}}{N}$$

Prediction of outcome of next toss:

$$\text{Predicted outcome} = \begin{cases} \text{head} & \text{if } p_0 > 1/2 \\ \text{tail} & \text{otherwise} \end{cases}$$

By choosing the more probable outcome, we minimize the probability of error, which equals 1 minus the probability of the chosen outcome.
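A minimal Python sketch of this estimate-then-predict procedure; the simulated sample, the seed, and the true $p_0 = 0.6$ are invented for illustration:

```python
import numpy as np

# Simulated sample of N = 100 tosses (1 = head, 0 = tail); the true
# p0 = 0.6 is an invented value, unknown to the estimator.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.6, size=100)

p0_hat = x.mean()                                  # estimate: #heads / #tosses

prediction = "head" if p0_hat > 0.5 else "tail"    # more probable outcome
p_error = 1 - max(p0_hat, 1 - p0_hat)              # probability of error
```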

Page 5

Classification as Bayesian Decision

Credit scoring example:

Inputs: income and savings, or $\mathbf{x} = (x_1, x_2)^T$
Output: risk ∈ {low, high}, or $C \in \{0, 1\}$

Prediction:

$$\text{Choose} \begin{cases} C = 1 & \text{if } P(C = 1|\mathbf{x}) > 0.5 \\ C = 0 & \text{otherwise} \end{cases}$$

or equivalently

$$\text{Choose} \begin{cases} C = 1 & \text{if } P(C = 1|\mathbf{x}) > P(C = 0|\mathbf{x}) \\ C = 0 & \text{otherwise} \end{cases}$$

Probability of error:

$$1 - \max\big(P(C = 1|\mathbf{x}),\, P(C = 0|\mathbf{x})\big)$$
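As a quick sketch of this two-class rule (the posterior value 0.8 is invented; in practice $P(C = 1|\mathbf{x})$ would come from a fitted model):

```python
# Hypothetical posterior P(C=1|x); in practice it comes from a model.
def choose(p_c1):
    c = 1 if p_c1 > 0.5 else 0              # Bayes' decision rule
    p_error = 1 - max(p_c1, 1 - p_c1)       # probability of error
    return c, p_error

print(choose(0.8))   # -> (1, 0.2): choose C = 1, error probability 0.2
```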

Page 6

Bayes’ Rule

Bayes’ rule:

$$\text{posterior } P(C|\mathbf{x}) = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} = \frac{p(\mathbf{x}|C)\,P(C)}{p(\mathbf{x})}$$

Some useful properties:

$P(C = 1) + P(C = 0) = 1$
$p(\mathbf{x}) = p(\mathbf{x}|C = 1)\,P(C = 1) + p(\mathbf{x}|C = 0)\,P(C = 0)$
$P(C = 0|\mathbf{x}) + P(C = 1|\mathbf{x}) = 1$

Page 7

Bayes’ Rule for K > 2 Classes

Bayes’ rule for the general case (K mutually exclusive and exhaustive classes):

$$P(C_i|\mathbf{x}) = \frac{p(\mathbf{x}|C_i)\,P(C_i)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|C_i)\,P(C_i)}{\sum_{k=1}^{K} p(\mathbf{x}|C_k)\,P(C_k)}$$

Optimal decision rule for Bayes’ classifier:

$$\text{Choose } C_i \text{ if } P(C_i|\mathbf{x}) = \max_k P(C_k|\mathbf{x})$$
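A small numeric sketch of this rule; the priors and likelihood values are invented:

```python
import numpy as np

# Invented numbers: priors P(C_k) and likelihoods p(x|C_k) at one input x.
priors = np.array([0.5, 0.3, 0.2])
likelihoods = np.array([0.10, 0.40, 0.25])

joint = likelihoods * priors           # p(x|C_k) P(C_k)
posteriors = joint / joint.sum()       # normalize by the evidence p(x)
choice = int(np.argmax(posteriors))    # choose C_i with highest posterior -> 1
```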

Page 8

Losses and Risks

Different decisions or actions may not be equally good or costly.

Action $\alpha_i$: decision to assign the input $\mathbf{x}$ to class $C_i$
Loss $\lambda_{ik}$: loss incurred for taking action $\alpha_i$ when the actual state is $C_k$

Expected risk for taking action $\alpha_i$:

$$R(\alpha_i|\mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k|\mathbf{x})$$

Optimal decision rule with minimum expected risk:

$$\text{Choose } \alpha_i \text{ if } R(\alpha_i|\mathbf{x}) = \min_k R(\alpha_k|\mathbf{x})$$
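A sketch of risk minimization with an asymmetric loss matrix; all numbers are invented. Note how the costly error flips the decision away from the more probable class:

```python
import numpy as np

# lam[i, k]: loss of taking action alpha_i when the true class is C_k.
lam = np.array([[0.0, 10.0],     # calling it C_1 when it is C_2 costs 10
                [1.0,  0.0]])    # calling it C_2 when it is C_1 costs 1
posteriors = np.array([0.7, 0.3])    # P(C_k|x), invented

risks = lam @ posteriors             # R(alpha_i|x) = sum_k lam_ik P(C_k|x)
action = int(np.argmin(risks))       # -> 1: choose C_2 despite P(C_1|x) = 0.7
```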

Page 9

0-1 Loss

All correct decisions have no loss and all errors have unit cost:

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \neq k \end{cases}$$

Expected risk:

$$R(\alpha_i|\mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k|\mathbf{x}) = \sum_{k \neq i} P(C_k|\mathbf{x}) = 1 - P(C_i|\mathbf{x})$$

Optimal decision rule with minimum expected risk (or, equivalently, highest posterior probability):

$$\text{Choose } \alpha_i \text{ if } P(C_i|\mathbf{x}) = \max_k P(C_k|\mathbf{x})$$

Page 10

Reject Option

If the certainty of a decision is low but misclassification has a very high cost, the reject action ($\alpha_{K+1}$) is preferred.

Loss function:

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda & \text{if } i = K + 1 \\ 1 & \text{otherwise} \end{cases}$$

where $0 < \lambda < 1$ is the loss incurred for choosing the action of reject.

Expected risk:

$$R(\alpha_i|\mathbf{x}) = \begin{cases} \sum_{k=1}^{K} \lambda\, P(C_k|\mathbf{x}) = \lambda & \text{if } i = K + 1 \\ \sum_{k \neq i} P(C_k|\mathbf{x}) = 1 - P(C_i|\mathbf{x}) & \text{if } i \in \{1, \ldots, K\} \end{cases}$$

Page 11

Reject Option (2)

Optimal decision rule:

$$\begin{cases} \text{Choose } C_i & \text{if } R(\alpha_i|\mathbf{x}) = \min_{1 \le k \le K} R(\alpha_k|\mathbf{x}) < R(\alpha_{K+1}|\mathbf{x}) \\ \text{Reject} & \text{otherwise} \end{cases}$$

Equivalent form of the optimal decision rule:

$$\begin{cases} \text{Choose } C_i & \text{if } P(C_i|\mathbf{x}) = \max_{1 \le k \le K} P(C_k|\mathbf{x}) > 1 - \lambda \\ \text{Reject} & \text{otherwise} \end{cases}$$

This approach is meaningful only if $0 < \lambda < 1$:

If $\lambda = 0$, we always reject (a reject is as good as a correct classification).
If $\lambda \geq 1$, we never reject (a reject is as costly as or more costly than a misclassification).
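A sketch of the equivalent-form rule; the posteriors and the value of $\lambda$ are invented:

```python
import numpy as np

def decide_with_reject(posteriors, lam):
    # Return the class with the highest posterior, or reject (0 < lam < 1).
    i = int(np.argmax(posteriors))
    # Reject when even the best class fails P(C_i|x) > 1 - lambda.
    return i if posteriors[i] > 1 - lam else "reject"

print(decide_with_reject(np.array([0.55, 0.45]), lam=0.3))   # -> 'reject'
print(decide_with_reject(np.array([0.95, 0.05]), lam=0.3))   # -> 0
```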

Page 12

Discriminant Functions

One way of performing classification is through a set of discriminant functions.

Classification rule:

$$\text{Choose } C_i \text{ if } g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})$$

Different ways of defining the discriminant functions:

$g_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x})$
$g_i(\mathbf{x}) = P(C_i|\mathbf{x})$
$g_i(\mathbf{x}) = p(\mathbf{x}|C_i)\,P(C_i)$

For the two-class case, we may define a single discriminant function:

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$

with the following classification rule:

$$\text{Choose} \begin{cases} C_1 & \text{if } g(\mathbf{x}) > 0 \\ C_2 & \text{otherwise} \end{cases}$$
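A sketch of a two-class dichotomizer built from the third definition above, via the monotone transform $g_i(\mathbf{x}) = \log p(\mathbf{x}|C_i) + \log P(C_i)$; the univariate Gaussian likelihoods and the priors are invented:

```python
import math

def log_gaussian(x, mu, sigma):
    # Log-density of a univariate Gaussian.
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def g(x):
    # g(x) = g1(x) - g2(x) with g_i(x) = log p(x|C_i) + log P(C_i).
    g1 = log_gaussian(x, mu=0.0, sigma=1.0) + math.log(0.6)   # class C_1
    g2 = log_gaussian(x, mu=2.0, sigma=1.0) + math.log(0.4)   # class C_2
    return g1 - g2

label = "C1" if g(0.5) > 0 else "C2"   # -> "C1"
```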

Page 13

Decision Regions

The feature space is divided into K decision regions $\mathcal{R}_1, \ldots, \mathcal{R}_K$, where

$$\mathcal{R}_i = \{\mathbf{x} \mid g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})\}$$

The decision regions are separated by decision boundaries, where ties occur among the largest discriminant functions.

Page 14

Bayesian Networks

A.k.a. belief networks, probabilistic networks, or more generally graphical models.

Node (or vertex): random variable
Edge (or arc or link): direct influence between variables
Structure: directed acyclic graph (DAG) formed by nodes and edges
Parameters: probabilities and conditional probabilities

Page 15

Causal Graph and Diagnostic Inference

Causal graph: rain is the cause of wet grass.

Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?

Bayes’ rule:

$$P(R|W) = \frac{P(W|R)\,P(R)}{P(W)} = \frac{0.9 \times 0.4}{0.9 \times 0.4 + 0.2 \times 0.6} = 0.75 > P(R) = 0.4$$
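The same computation in Python, with the numbers read off the slide:

```python
# From the slide: P(W|R) = 0.9, P(W|~R) = 0.2, P(R) = 0.4.
p_w_given_r, p_w_given_nr, p_r = 0.9, 0.2, 0.4

p_w = p_w_given_r * p_r + p_w_given_nr * (1 - p_r)   # evidence P(W) = 0.48
p_r_given_w = p_w_given_r * p_r / p_w                # P(R|W) = 0.75 > P(R) = 0.4
```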

Page 16

Two Causes: Causal Inference

Causal or predictive inference: if the sprinkler is on, what is the probability that the grass is wet?

$$\begin{aligned} P(W|S) &= P(W|R,S)\,P(R|S) + P(W|\sim R,S)\,P(\sim R|S) \\ &= P(W|R,S)\,P(R) + P(W|\sim R,S)\,P(\sim R) \\ &= 0.95 \times 0.4 + 0.9 \times 0.6 = 0.92 \end{aligned}$$

Page 17

Two Causes: Diagnostic Inference

Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?

$$P(S|W) = \frac{P(W|S)\,P(S)}{P(W)}$$

where

$$\begin{aligned} P(W) &= P(W|R,S)\,P(R,S) + P(W|\sim R,S)\,P(\sim R,S) \\ &\quad + P(W|R,\sim S)\,P(R,\sim S) + P(W|\sim R,\sim S)\,P(\sim R,\sim S) \\ &= P(W|R,S)\,P(R)\,P(S) + P(W|\sim R,S)\,P(\sim R)\,P(S) \\ &\quad + P(W|R,\sim S)\,P(R)\,P(\sim S) + P(W|\sim R,\sim S)\,P(\sim R)\,P(\sim S) \\ &= 0.52 \end{aligned}$$

so

$$P(S|W) = \frac{0.92 \times 0.2}{0.52} = 0.35 > P(S) = 0.2$$

Page 18

Two Causes: Explaining Away

Bayes’ rule:

$$P(S|R,W) = \frac{P(W|R,S)\,P(S|R)}{P(W|R)} = \frac{P(W|R,S)\,P(S)}{P(W|R)} = 0.21$$

Explaining away:

$$0.21 = P(S|R,W) < P(S|W) = 0.35$$

Knowing that it has rained decreases the probability that the sprinkler is on. Knowing that the grass is wet, rain and sprinkler become dependent:

$$P(S|R,W) \neq P(S|W)$$
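A sketch reproducing the three computations on this two-cause network. $P(R) = 0.4$, $P(S) = 0.2$, $P(W|R,S) = 0.95$, and $P(W|\sim R,S) = 0.90$ appear on the slides; $P(W|R,\sim S) = 0.90$ and $P(W|\sim R,\sim S) = 0.10$ are assumed values, needed to reproduce $P(W) = 0.52$ but not printed above:

```python
# P(W | R=r, S=s); the (1, 0) and (0, 0) entries are assumed values.
p_w = {(1, 1): 0.95, (0, 1): 0.90, (1, 0): 0.90, (0, 0): 0.10}
p_r, p_s = 0.4, 0.2

# Causal inference: P(W|S) = 0.92 (R and S are independent a priori).
p_w_s = p_w[1, 1] * p_r + p_w[0, 1] * (1 - p_r)

# Evidence: P(W) = 0.52, marginalizing over both causes.
p_w_total = sum(p_w[r, s] * (p_r if r else 1 - p_r) * (p_s if s else 1 - p_s)
                for r in (0, 1) for s in (0, 1))

# Diagnostic inference: P(S|W) = 0.92 * 0.2 / 0.52 ~ 0.35.
p_s_w = p_w_s * p_s / p_w_total

# Explaining away: P(W|R) = 0.91, so P(S|R,W) = 0.95 * 0.2 / 0.91 ~ 0.21.
p_w_r = p_w[1, 1] * p_s + p_w[1, 0] * (1 - p_s)
p_s_rw = p_w[1, 1] * p_s / p_w_r
```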

Page 19

Dependent Causes

Causal inference:

$$\begin{aligned} P(W|C) &= P(W|R,S,C)\,P(R,S|C) + P(W|\sim R,S,C)\,P(\sim R,S|C) \\ &\quad + P(W|R,\sim S,C)\,P(R,\sim S|C) + P(W|\sim R,\sim S,C)\,P(\sim R,\sim S|C) \\ &= P(W|R,S)\,P(R|C)\,P(S|C) + P(W|\sim R,S)\,P(\sim R|C)\,P(S|C) \\ &\quad + P(W|R,\sim S)\,P(R|C)\,P(\sim S|C) + P(W|\sim R,\sim S)\,P(\sim R|C)\,P(\sim S|C) \end{aligned}$$

Independence: W and C are independent given R and S; R and S are independent given C.

Page 20

Local Structures

The network represents conditional independence statements.

The joint distribution can be broken down into local structures:

$$P(C,S,R,W,F) = P(C)\,P(S|C)\,P(R|C)\,P(W|S,R)\,P(F|R)$$

In general,

$$P(X_1, \ldots, X_d) = \prod_{i=1}^{d} P(X_i \mid \text{parents}(X_i))$$

where $X_i$ is either continuous or discrete with ≥ 2 possible values.
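A minimal sketch of this factorization for the binary wet-grass network; every CPT entry below is an invented placeholder:

```python
# P(C=1), P(S=1|C=c), P(R=1|C=c), P(W=1|S=s,R=r), P(F=1|R=r); all invented.
p_c = 0.5
p_s_c = {1: 0.1, 0: 0.5}
p_r_c = {1: 0.8, 0: 0.1}
p_w_sr = {(1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.10}
p_f_r = {1: 0.7, 0: 0.05}

def bern(p, v):
    # P(X = v) for a binary variable with P(X = 1) = p.
    return p if v else 1 - p

def joint(c, s, r, w, f):
    # P(C) P(S|C) P(R|C) P(W|S,R) P(F|R)
    return (bern(p_c, c) * bern(p_s_c[c], s) * bern(p_r_c[c], r)
            * bern(p_w_sr[s, r], w) * bern(p_f_r[r], f))

# Sanity check: the 32 joint probabilities sum to 1.
total = sum(joint(c, s, r, w, f)
            for c in (0, 1) for s in (0, 1) for r in (0, 1)
            for w in (0, 1) for f in (0, 1))
assert abs(total - 1.0) < 1e-12
```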

Page 21

Bayesian Network for Classification

Bayes’ rule inverts the edge:

$$P(C|\mathbf{x}) = \frac{p(\mathbf{x}|C)\,P(C)}{p(\mathbf{x})}$$

Classification as diagnostic inference.

Page 22

Naive Bayes’ Classifier

Given C, the input variables $x_j$ are independent:

$$p(\mathbf{x}|C) = \prod_{j=1}^{d} p(x_j|C)$$

The Naive Bayes’ classifier ignores possible dependencies among the input variables and reduces a multivariate problem to a group of univariate problems.
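A minimal Gaussian naive Bayes sketch; all parameters are invented. Each class likelihood is a product of per-dimension univariate Gaussians, so only d means and d standard deviations are needed per class:

```python
import numpy as np

# Invented per-class parameters (means, stds) for d = 2 inputs, plus priors.
params = [(np.array([0.0, 0.0]), np.array([1.0, 1.0])),   # class 0
          (np.array([2.0, 2.0]), np.array([1.0, 1.0]))]   # class 1
priors = [0.5, 0.5]

def log_likelihood(x, means, stds):
    # Independence assumption: sum_j log p(x_j|C) over input dimensions.
    return np.sum(-0.5 * np.log(2 * np.pi * stds ** 2)
                  - (x - means) ** 2 / (2 * stds ** 2))

def predict(x):
    # Choose the class maximizing log p(x|C_i) + log P(C_i).
    scores = [log_likelihood(x, m, s) + np.log(p)
              for (m, s), p in zip(params, priors)]
    return int(np.argmax(scores))

print(predict(np.array([1.8, 2.2])))   # -> 1
```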
