Bayesian Decision Theory
Hong Chang
Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning Methods (Fall 2012)
Hong Chang (ICT, CAS) Bayesian Decision Theory
Outline
1 Introduction
2 Losses and Risks
3 Discriminant Functions
4 Bayesian Networks
Coin Tossing Example
Outcome of tossing a coin ∈ {head, tail}.
Random variable X:

X = 1 if the outcome is head, X = 0 if the outcome is tail.

X is Bernoulli-distributed:

P(X) = p0^X (1 − p0)^(1−X)

where the parameter p0 is the probability that the outcome is head, i.e., p0 = P(X = 1).
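The Bernoulli pmf above can be sketched directly in code; the value p0 = 0.6 here is an assumed example, not from the slides:

```python
# Bernoulli pmf: P(X = x) = p0^x * (1 - p0)^(1 - x), for x in {0, 1}
def bernoulli_pmf(x, p0):
    return p0 ** x * (1 - p0) ** (1 - x)

p0 = 0.6  # assumed probability of head
print(bernoulli_pmf(1, p0))  # P(X = 1) = p0
print(bernoulli_pmf(0, p0))  # P(X = 0) = 1 - p0
```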
Estimation and Prediction
Estimation of parameter p0 from sample X = {x^(i)}, i = 1, ..., N:

p̂0 = #heads / #tosses = (Σ_{i=1}^N x^(i)) / N

Prediction of outcome of next toss:

Predicted outcome = head if p̂0 > 1/2, tail otherwise

by choosing the more probable outcome, which minimizes the probability of error (= 1 − probability of the predicted outcome).
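A minimal sketch of the estimate-then-predict step; the sample of tosses is hypothetical:

```python
# Hypothetical sample of tosses (1 = head, 0 = tail)
sample = [1, 0, 1, 1, 0, 1]

# Maximum-likelihood estimate: p0_hat = #heads / #tosses
p0_hat = sum(sample) / len(sample)

# Predict the more probable outcome of the next toss
prediction = "head" if p0_hat > 0.5 else "tail"
print(p0_hat, prediction)
```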
Classification as Bayesian Decision
Credit scoring example:
Inputs: income and savings, or x = (x1, x2)^T
Output: risk ∈ {low, high}, or C ∈ {0, 1}
Prediction:

Choose C = 1 if P(C = 1|x) > 0.5, C = 0 otherwise

or equivalently

Choose C = 1 if P(C = 1|x) > P(C = 0|x), C = 0 otherwise

Probability of error:

1 − max(P(C = 1|x), P(C = 0|x))
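The two-class rule and its error probability can be sketched as follows; the function name `decide` and the example posterior 0.8 are illustrative:

```python
def decide(p1):
    # Choose C = 1 if P(C = 1|x) > 0.5, else C = 0;
    # probability of error = 1 - max(P(C = 1|x), P(C = 0|x))
    c = 1 if p1 > 0.5 else 0
    p_error = 1 - max(p1, 1 - p1)
    return c, p_error

print(decide(0.8))  # class 1, error probability about 0.2
```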
Bayes’ Rule
Bayes’ rule:

Posterior P(C|x) = (likelihood × prior) / evidence = p(x|C)P(C) / p(x)

Some useful properties:
P(C = 1) + P(C = 0) = 1
p(x) = p(x|C = 1)P(C = 1) + p(x|C = 0)P(C = 0)
P(C = 0|x) + P(C = 1|x) = 1
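Bayes’ rule with the evidence expanded by the second property above can be sketched as a small function; the likelihood and prior values in the example call are assumed, not from the slides:

```python
def posterior_c1(lik1, lik0, prior1):
    # P(C = 1|x) = p(x|C = 1)P(C = 1) / p(x),
    # where p(x) = p(x|C = 1)P(C = 1) + p(x|C = 0)P(C = 0)
    evidence = lik1 * prior1 + lik0 * (1 - prior1)
    return lik1 * prior1 / evidence

# Assumed values: p(x|C=1) = 0.9, p(x|C=0) = 0.2, P(C=1) = 0.4
print(posterior_c1(0.9, 0.2, 0.4))
```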
Bayes’ Rule for K > 2 Classes
Bayes’ rule for the general case (K mutually exclusive and exhaustive classes):

P(Ci|x) = p(x|Ci)P(Ci) / p(x) = p(x|Ci)P(Ci) / Σ_{k=1}^K p(x|Ck)P(Ck)

Optimal decision rule for the Bayes’ classifier:

Choose Ci if P(Ci|x) = max_k P(Ck|x)
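A sketch of the K-class Bayes’ classifier; the three likelihoods and priors in the example are hypothetical:

```python
def bayes_classify(likelihoods, priors):
    # P(C_i|x) = p(x|C_i)P(C_i) / sum_k p(x|C_k)P(C_k);
    # choose the class with the largest posterior
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)
    posteriors = [j / evidence for j in joint]
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return best, posteriors

# Hypothetical values for K = 3 classes
best, post = bayes_classify([0.5, 0.2, 0.1], [0.3, 0.3, 0.4])
print(best, post)
```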
Losses and Risks
Different decisions or actions may not be equally good or costly.
Action αi: decision to assign the input x to class Ci
Loss λik: loss incurred for taking action αi when the actual state is Ck
Expected risk for taking action αi:

R(αi|x) = Σ_{k=1}^K λik P(Ck|x)

Optimal decision rule with minimum expected risk:

Choose αi if R(αi|x) = min_k R(αk|x)
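The minimum-risk rule can be sketched with an explicit loss matrix; the matrix and posteriors below are hypothetical, chosen so that the asymmetric loss overrides the more probable class:

```python
def min_risk_action(loss, posteriors):
    # loss[i][k] = lambda_ik; R(alpha_i|x) = sum_k lambda_ik * P(C_k|x)
    risks = [sum(l * p for l, p in zip(row, posteriors)) for row in loss]
    best = min(range(len(risks)), key=lambda i: risks[i])
    return best, risks

# Hypothetical: misclassifying class 2 as class 1 costs 10, the reverse costs 1
loss = [[0, 10],
        [1, 0]]
best, risks = min_risk_action(loss, [0.7, 0.3])
print(best, risks)  # action 1 wins despite P(C_1|x) = 0.7
```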
0-1 Loss
All correct decisions have no loss and all errors have unit cost:

λik = 0 if i = k, 1 if i ≠ k

Expected risk:

R(αi|x) = Σ_{k=1}^K λik P(Ck|x) = Σ_{k≠i} P(Ck|x) = 1 − P(Ci|x)

Optimal decision rule with minimum expected risk (or, equivalently, highest posterior probability):

Choose αi if P(Ci|x) = max_k P(Ck|x)
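A quick numerical check of the identity above, using hypothetical posteriors: under 0-1 loss, each risk equals 1 − P(Ci|x), so the minimum-risk action is the maximum-posterior class:

```python
posteriors = [0.2, 0.5, 0.3]  # hypothetical P(C_k|x)
K = len(posteriors)

# 0-1 loss: lambda_ik = 0 if i == k, else 1
loss01 = [[0 if i == k else 1 for k in range(K)] for i in range(K)]

# R(alpha_i|x) = sum_k lambda_ik * P(C_k|x) = 1 - P(C_i|x)
risks = [sum(loss01[i][k] * posteriors[k] for k in range(K)) for i in range(K)]
best = min(range(K), key=lambda i: risks[i])
print(risks, best)
```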
Reject Option
If the certainty of a decision is low but misclassification has very high cost, the action of reject (αK+1) is preferred.
Loss function:

λik = 0 if i = k; λ if i = K + 1; 1 otherwise

where 0 < λ < 1 is the loss incurred for choosing the action of reject.
Expected risk:

R(αi|x) = Σ_{k=1}^K λ P(Ck|x) = λ, if i = K + 1
R(αi|x) = Σ_{k≠i} P(Ck|x) = 1 − P(Ci|x), if i ∈ {1, ..., K}
Reject Option (2)
Optimal decision rule:

Choose Ci if R(αi|x) = min_{1≤k≤K} R(αk|x) < R(αK+1|x); reject otherwise

Equivalent form of the optimal decision rule:

Choose Ci if P(Ci|x) = max_{1≤k≤K} P(Ck|x) > 1 − λ; reject otherwise

This approach is meaningful only if 0 < λ < 1:
If λ = 0, we always reject (a reject is as good as a correct classification).
If λ ≥ 1, we never reject (a reject is as costly as or more costly than a misclassification).
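The equivalent form of the rule above can be sketched directly; the posteriors and λ = 0.3 in the examples are assumed values:

```python
def decide_with_reject(posteriors, lam):
    # Choose the max-posterior class if its posterior exceeds 1 - lambda;
    # otherwise the reject action alpha_{K+1} has lower risk
    i = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return i if posteriors[i] > 1 - lam else "reject"

print(decide_with_reject([0.9, 0.1], 0.3))  # confident: choose class 0
print(decide_with_reject([0.6, 0.4], 0.3))  # uncertain: reject
```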
Discriminant Functions
One way of performing classification is through a set of discriminant functions.
Classification rule:

Choose Ci if gi(x) = max_k gk(x)

Different ways of defining the discriminant functions:
gi(x) = −R(αi|x)
gi(x) = P(Ci|x)
gi(x) = p(x|Ci)P(Ci)

For the two-class case, we may define a single discriminant function

g(x) = g1(x) − g2(x)

with the following classification rule:

Choose C1 if g(x) > 0, C2 otherwise
Decision Regions
The feature space is divided into K decision regions R1, ..., RK, where

Ri = {x | gi(x) = max_k gk(x)}

The decision regions are separated by decision boundaries, where ties occur among the largest discriminant functions.
Bayesian Networks
A.k.a. belief networks, probabilistic networks, or more generally graphical models.
Node (or vertex): random variable
Edge (or arc or link): direct influence between variables
Structure: directed acyclic graph (DAG) formed by nodes and edges
Parameters: probabilities and conditional probabilities
Causal Graph and Diagnostic Inference
Causal graph: rain is the cause of wet grass.
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
Bayes’ rule:

P(R|W) = P(W|R)P(R) / P(W) = (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6) = 0.75 > P(R) = 0.4
Two Causes: Causal Inference
Causal or predictive inference: if the sprinkler is on, what is the probability that the grass is wet?

P(W|S) = P(W|R, S)P(R|S) + P(W|∼R, S)P(∼R|S)
       = P(W|R, S)P(R) + P(W|∼R, S)P(∼R)
       = 0.95 × 0.4 + 0.9 × 0.6 = 0.92
Two Causes: Diagnostic Inference
Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?

P(S|W) = P(W|S)P(S) / P(W)

where

P(W) = P(W|R, S)P(R, S) + P(W|∼R, S)P(∼R, S) + P(W|R, ∼S)P(R, ∼S) + P(W|∼R, ∼S)P(∼R, ∼S)
     = P(W|R, S)P(R)P(S) + P(W|∼R, S)P(∼R)P(S) + P(W|R, ∼S)P(R)P(∼S) + P(W|∼R, ∼S)P(∼R)P(∼S)
     = 0.52

so

P(S|W) = (0.92 × 0.2) / 0.52 = 0.35 > P(S) = 0.2
Two Causes: Explaining Away
Bayes’ rule:

P(S|R, W) = P(W|R, S)P(S|R) / P(W|R) = P(W|R, S)P(S) / P(W|R) = 0.21

Explaining away:

0.21 = P(S|R, W) < P(S|W) = 0.35

Knowing that it has rained decreases the probability that the sprinkler is on.
Knowing that the grass is wet, rain and sprinkler become dependent:

P(S|R, W) ≠ P(S|W)
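The three sprinkler computations above (causal, diagnostic, explaining away) can be reproduced in a few lines. The CPT values are read off the slides, except P(W|∼R, ∼S) = 0.1, which is inferred here so that P(W) = 0.52 as computed above:

```python
P_R, P_S = 0.4, 0.2
# P(W | rain, sprinkler), keyed by (rain, sprinkler)
P_W = {(True, True): 0.95, (False, True): 0.9,
       (True, False): 0.9, (False, False): 0.1}  # 0.1 inferred from P(W) = 0.52

def p(b, pr):  # P(var = b) for a binary variable with P(True) = pr
    return pr if b else 1 - pr

# Causal: P(W|S) = sum_r P(W|r, S) P(r)
P_W_given_S = sum(P_W[(r, True)] * p(r, P_R) for r in (True, False))

# Evidence: P(W) = sum over r, s of P(W|r, s) P(r) P(s)
P_W_marg = sum(P_W[(r, s)] * p(r, P_R) * p(s, P_S)
               for r in (True, False) for s in (True, False))

# Diagnostic: P(S|W) = P(W|S) P(S) / P(W)
P_S_given_W = P_W_given_S * P_S / P_W_marg

# Explaining away: P(S|R, W) = P(W|R, S) P(S) / P(W|R)
P_W_given_R = sum(P_W[(True, s)] * p(s, P_S) for s in (True, False))
P_S_given_RW = P_W[(True, True)] * P_S / P_W_given_R

print(P_W_given_S, P_W_marg, P_S_given_W, P_S_given_RW)
```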
Dependent Causes
Causal inference:

P(W|C) = P(W|R, S, C)P(R, S|C) + P(W|∼R, S, C)P(∼R, S|C) + P(W|R, ∼S, C)P(R, ∼S|C) + P(W|∼R, ∼S, C)P(∼R, ∼S|C)
       = P(W|R, S)P(R|C)P(S|C) + P(W|∼R, S)P(∼R|C)P(S|C) + P(W|R, ∼S)P(R|C)P(∼S|C) + P(W|∼R, ∼S)P(∼R|C)P(∼S|C)

Independence: W and C are independent given R and S; R and S are independent given C.
Local Structures
The network represents conditional independence statements.
The joint distribution can be broken down into local structures:

P(C, S, R, W, F) = P(C)P(S|C)P(R|C)P(W|S, R)P(F|R)

In general,

P(X1, ..., Xd) = Π_{i=1}^d P(Xi | parents(Xi))

where Xi is either continuous or discrete with ≥ 2 possible values.
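The five-node factorization above can be sketched with local CPTs; all numeric CPT values here are hypothetical, only the factorization P(C)P(S|C)P(R|C)P(W|S, R)P(F|R) is from the slides. Summing the factored joint over all assignments recovers 1:

```python
# Hypothetical CPTs for the five binary nodes C, S, R, W, F
P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}    # P(S = 1 | C = c)
P_R_given_C = {True: 0.8, False: 0.1}    # P(R = 1 | C = c)
P_W_given_SR = {(True, True): 0.95, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.1}
P_F_given_R = {True: 0.7, False: 0.05}   # P(F = 1 | R = r)

def bern(b, pr):  # P(var = b) for a binary variable with P(True) = pr
    return pr if b else 1 - pr

def joint(c, s, r, w, f):
    # P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|S,R) P(F|R)
    return (bern(c, P_C) * bern(s, P_S_given_C[c]) * bern(r, P_R_given_C[c])
            * bern(w, P_W_given_SR[(s, r)]) * bern(f, P_F_given_R[r]))

# The factored joint sums to 1 over all 2**5 assignments
total = sum(joint(c, s, r, w, f)
            for c in (True, False) for s in (True, False)
            for r in (True, False) for w in (True, False) for f in (True, False))
print(total)
```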
Bayesian Network for Classification
Bayes’ rule inverts the edge:

P(C|x) = P(x|C)P(C) / P(x)

Classification as diagnostic inference.
Naive Bayes’ Classifier
Given C, the input variables xj are independent:

p(x|C) = Π_{j=1}^d p(xj|C)

The naive Bayes’ classifier ignores possible dependencies among the input variables and reduces a multivariate problem to a group of univariate problems.
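A minimal discrete naive Bayes sketch: the class-conditional likelihood is the product of per-feature terms. The priors, per-class feature probabilities, and the two binary features are all hypothetical:

```python
def naive_bayes_posteriors(x, priors, cond):
    # cond[c][j] = p(x_j = 1 | C = c); x is a list of 0/1 features.
    # p(x|C) = prod_j p(x_j|C), then normalize p(x|C)P(C) over classes.
    scores = []
    for c, prior in enumerate(priors):
        s = prior
        for j, xj in enumerate(x):
            pj = cond[c][j]
            s *= pj if xj == 1 else 1 - pj
        scores.append(s)
    total = sum(scores)
    return [s / total for s in scores]

priors = [0.6, 0.4]                 # hypothetical P(C = 0), P(C = 1)
cond = [[0.8, 0.3], [0.2, 0.7]]     # hypothetical p(x_j = 1 | C = c)
post = naive_bayes_posteriors([1, 0], priors, cond)
print(post)
```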