02 Bayesian Decision Theory
Institute of Space Technology (IST)
Bayesian Decision Theory
Dr Khurram Khurshid
Pattern Recognition
Probability Theory
• A random experiment is one whose outcome is not predictable with certainty in advance
• Sample Space (S): The set of all possible outcomes
• A sample space is discrete if it consists of a finite (or countably infinite) set of outcomes; otherwise it is continuous
Probability Theory
• Any subset (say A) of S is an event.
• Events are sets so we can consider their complements, intersection, union and so forth
Probability Theory
• Example: Consider the experiment where two coins are simultaneously tossed. The elementary events are S = {HH, HT, TH, TT}
• The subset A = {HH, HT, TH} is the same as “Head occurred at least once” and qualifies as an event
• The subset B = {HH} is the same as “Both heads occur simultaneously” and qualifies as an event
Probability Theory
• Union: “Does an outcome belong to A or B?”  A ∪ B = {HH, HT, TH}
• Intersection: “Does an outcome belong to A and B?”  A ∩ B = AB = {HH}
• Complement: “Does an outcome fall outside A?”  Ā = {TT}
Probability: Definition
• Classical definition: The probability of an event E is defined a priori, without actual experimentation, as
P(E) = (Number of outcomes favorable to E) / (Total number of possible outcomes)
provided all these outcomes are equally likely.
• Relative frequency definition: The probability of an event E is defined as
P(E) = lim (N→∞) N_E / N
where N_E is the number of occurrences of E and N is the total number of trials
Probability: Example
• Two coin example:
Probability that at least one head occurs = P(A) = 3/4
Probability that two heads occur simultaneously = P(B) = 1/4
Probability that two tails occur simultaneously = 1/4
Note that P(S) = 1
• Consider a box with n white and m red balls. In this case, there are two elementary outcomes: white ball or red ball.
• Probability of “selecting a white ball” = n / (n + m)
Probability
• If A1 is an event that cannot possibly occur then P(A1) = 0. If A2 is sure to occur, P(A2) = 1. Also, probability is a non-negative number:
0 ≤ P(A) ≤ 1
• If A and B are mutually exclusive events then the probability of their union is the sum of their probabilities:
If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
Probability
• If A and B are not mutually exclusive then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• Conditional Probability
P(A | B) = P(AB) / P(B)
P(A | B) = Probability of “the event A given that B has occurred”
Example
• A box contains 6 white and 4 black balls. Remove two balls at random without replacement. What is the probability that the first one is white and the second one is black?
Let W1 = “first ball removed is white”, B2 = “second ball removed is black”
P(W1 ∩ B2) = ?
Example
• Using the conditional probability rule
P(W1 ∩ B2) = P(B2 ∩ W1) = P(B2 | W1) P(W1)
P(W1) = 6 / (6 + 4) = 6/10 = 3/5, P(B2 | W1) = 4 / (5 + 4) = 4/9
Hence, P(W1 ∩ B2) = (6/10) · (4/9) = 4/15 ≈ 0.266
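As a quick check on the arithmetic above, this Python sketch (not from the slides) computes the exact probability with fractions and verifies it with a small Monte Carlo simulation; the seed and trial count are arbitrary choices:

```python
import random
from fractions import Fraction

# Conditional probability rule: P(W1 and B2) = P(B2 | W1) * P(W1)
p_w1 = Fraction(6, 10)          # 6 of the 10 balls are white
p_b2_given_w1 = Fraction(4, 9)  # 4 black balls among the 9 that remain
p_exact = p_b2_given_w1 * p_w1  # = 4/15

# Monte Carlo check: draw two balls without replacement many times.
random.seed(0)
trials = 100_000
hits = 0
for _ in range(trials):
    box = ["W"] * 6 + ["B"] * 4
    random.shuffle(box)
    hits += box[0] == "W" and box[1] == "B"
p_sim = hits / trials

print(p_exact, p_sim)  # 4/15, and a simulated value near 0.267
```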
Independence of events
• A and B are said to be independent events if
P(AB) = P(A) · P(B)
• Suppose A and B are independent, then
P(A | B) = P(AB) / P(B) = P(A) P(B) / P(B) = P(A)
Thus if A and B are independent, the event that B has occurred does not shed any more light on the event A. It makes no difference to A whether B has occurred or not.
Mutually Exclusive and Independent events?
• If two events A and B are independent:
– P(A and B) = P(A)P(B)
• If A and B are mutually exclusive:
– P(A and B) = 0
Clearly, if A and B are nontrivial events (P(A) and P(B) are nonzero), then they cannot be both independent and mutually exclusive.
• Consider a fair coin and a fair six-sided die. – Let event A be obtaining heads – Let event B be rolling a 6
• We can reasonably assume that events A and B are independent, because the outcome of one does not affect the outcome of the other– P(A and B) = (1/2)(1/6) = 1/12
• Since this value is not zero, then events A and B cannot be mutually exclusive.
Mutually Exclusive and Independent events?
• Consider a fair six-sided die as before, only in addition to the numbers 1
through 6 on each face, we have the property that the even-numbered
faces are colored red, and the odd-numbered faces are colored green.
– Let event A be rolling a green face
– Let event B be rolling a 6
– P(A) = 1/2 and P(B) = 1/6
• Events A and B cannot simultaneously occur, since rolling a 6 means
the face is red, and rolling a green face means the number showing is
odd
• A mutually exclusive pair of nontrivial events is also necessarily a pair of dependent events
Mutually Exclusive and Independent events?
• If A and B are mutually exclusive, then if A
occurs, then B cannot also occur; and vice
versa.
• This stands in contrast to saying the outcome of
A does not affect the outcome of B, which is
independence of events.
Bayes’ theorem
P(A | B) = P(AB) / P(B)  ⟹  P(AB) = P(A | B) P(B)
P(B | A) = P(AB) / P(A)  ⟹  P(AB) = P(B | A) P(A)
Thus
P(A | B) P(B) = P(B | A) P(A)
P(A | B) = P(B | A) P(A) / P(B)
Bayes’ theorem
• The general form of Bayes’ theorem is
P(Ai | B) = P(B | Ai) P(Ai) / P(B) = P(B | Ai) P(Ai) / Σ_{i=1}^{n} P(B | Ai) P(Ai)
Bayes’ Theorem: Example
• Two boxes B1 and B2 contain 100 and 200 light bulbs
respectively. The first box (B1) has 15 defective bulbs and the
second has 5 defective bulbs. Suppose a box is selected at
random and one bulb is picked out.
a) What is the probability it is defective?
b) Suppose the bulb we tested was defective. What is the probability it came from box 1?
Bayes’ Theorem: Example
• Solution: Part (a)
Note that box B1 has 85 good and 15 defective bulbs. Similarly box B2
has 195 good and 5 defective bulbs.
B1 and B2 form a partition of the sample space. Let D = “Defective bulb is picked out”.
P(D) = P(D | B1) P(B1) + P(D | B2) P(B2)
Bayes’ Theorem: Example
Since the box is selected at random, they are equally likely
The probability of event D is:
Thus, there is about 9% probability that a bulb picked at random is defective
P(B1) = P(B2) = 1/2
P(D | B1) = 15/100 = 0.15, P(D | B2) = 5/200 = 0.025
P(D) = P(D | B1) P(B1) + P(D | B2) P(B2)
P(D) = 0.15 · (1/2) + 0.025 · (1/2) = 0.0875
Notice that initially P(B1) = P(B2) = 0.5, then we picked out a box at random and tested a bulb that turned out to be defective. Can this information shed some light about the fact that we might have picked up box 1?
Bayes’ Theorem: Example
• Part (b): Suppose the bulb we tested was defective. What is the probability it came from box 1?
• Solution:
P(B1 | D) = ?
P(B1 | D) = P(D | B1) P(B1) / P(D) = (0.15 · 1/2) / 0.0875 ≈ 0.8571
Bayes’ Theorem: Example
Indeed, it is now more likely that we chose box 1 rather than box 2. (Recall box 1 has three times as many defective bulbs as box 2.)
P(B1 | D) = 0.857 > 0.5
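Both parts of this example can be reproduced in a few lines of Python (a sketch of the computation, not code from the slides):

```python
# Priors: the box is selected at random.
p_b1 = p_b2 = 0.5

# Class-conditional probabilities of picking a defective bulb.
p_d_given_b1 = 15 / 100   # box 1: 15 defective out of 100
p_d_given_b2 = 5 / 200    # box 2: 5 defective out of 200

# Part (a): total probability of a defective bulb.
p_d = p_d_given_b1 * p_b1 + p_d_given_b2 * p_b2

# Part (b): posterior probability the defective bulb came from box 1.
p_b1_given_d = p_d_given_b1 * p_b1 / p_d

print(p_d)           # ~0.0875
print(p_b1_given_d)  # ~0.8571
```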
Random Variable
When the value of a variable is the outcome of a statistical experiment, that variable is a random variable
Statistical Experiment: • The experiment can have more than one possible outcome • Each possible outcome can be specified in advance. • The outcome of the experiment depends on chance
Random variables
Discrete Random Variables
Discrete random variables take on a finite or countably infinite set of values (e.g. integers)
Tossing a die: {1, 2, 3, 4, 5, 6}
Continuous Random Variables
Continuous random variables can take on any value within a
range of values
We flip a coin many times and compute the average number
of heads
Probability distribution
A probability distribution is a table or an equation that links each possible value that a random variable can assume with its probability of occurrence
Discrete Probability Distribution
Flip a coin two times – Four possible outcomes
HH, HT, TH, TT
Let variable X represent the number of heads that result
from the coin flips
X can take on the values 0, 1, or 2; and X is a discrete
random variable.
Discrete Probability Distribution
Number of heads x   Probability P(x)
0                   0.25
1                   0.50
2                   0.25
[Bar chart of P(x) against the number of heads]
The probability distribution for a discrete random variable is called the Probability Mass Function (PMF)
p(xi) = P(X = xi)
Probability mass function
The probability distribution or probability mass function
(PMF) of a discrete random variable X is a function that gives
the probability p(xi) that the random variable equals xi, for each
value xi:
It satisfies the following conditions:
0 ≤ p(xi) ≤ 1
Σi p(xi) = 1
p(xi) = P(X = xi)
Cumulative distribution function
P(X ≤ x) = Σ_{xi ≤ x} p(xi)
What is the probability of getting 1 or fewer heads?
Number of heads x   Probability P(x)
0                   0.25
1                   0.50
2                   0.25
P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.25 + 0.5 = 0.75
Example
Student ID 1 2 3 4 5 6 7 8 9 10
Grade 3 2 3 1 2 3 1 3 2 2
Random Variable: Grades of the students
p(1) = P(X = 1) = 2/10 = 0.2
p(2) = P(X = 2) = 4/10 = 0.4
p(3) = P(X = 3) = 4/10 = 0.4
[Bar chart: Probability Mass Function (PMF) over grades]
Example
Student ID 1 2 3 4 5 6 7 8 9 10
Grade 3 2 3 1 2 3 1 3 2 2
Probability Mass Function property: Σi p(xi) = p(1) + p(2) + p(3) = 1
Cumulative Distribution Function
P(X ≤ x) = Σ_{xi ≤ x} p(xi)
P(X ≤ 2) = Σ_{xi ≤ 2} p(xi) = p(1) + p(2) = 0.2 + 0.4 = 0.6
P(X ≤ 3) = Σ_{xi ≤ 3} p(xi) = p(1) + p(2) + p(3) = 1
[Plot: CDF over grades]
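The PMF and CDF in this example can be computed directly from the grade list; this sketch uses only the Python standard library:

```python
from collections import Counter
from itertools import accumulate

grades = [3, 2, 3, 1, 2, 3, 1, 3, 2, 2]   # grades of the 10 students

n = len(grades)
counts = Counter(grades)
values = sorted(counts)                    # possible values of X: 1, 2, 3

pmf = {x: counts[x] / n for x in values}   # p(x) = P(X = x)
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))  # P(X <= x)

print(pmf)                                        # {1: 0.2, 2: 0.4, 3: 0.4}
print({x: round(c, 3) for x, c in cdf.items()})   # {1: 0.2, 2: 0.6, 3: 1.0}
```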
Continuous Probability Distributions
The probability distribution of a continuous random variable is represented by an equation, called the probability density function (pdf).
Probability Density Function
• The probability that a continuous random variable
will assume a particular value is always zero
• The probability that a continuous random variable
falls in the interval between a and b is equal to
the area under the pdf curve between a and b
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Example
• Suppose that the random variable X is the diameter
of a randomly chosen cylinder manufactured by the
company.
• It can take any value between 49.5 and 50.5
• It is a continuous random variable
Example
• Suppose that the diameter of a metal cylinder has a p.d.f.
f(x) = 1.5 − 6(x − 50.0)², for 49.5 ≤ x ≤ 50.5
f(x) = 0, elsewhere
[Plot of f(x) over 49.5 ≤ x ≤ 50.5]
Example
The probability that a metal cylinder has a diameter between 49.8 and 50.1 mm can be calculated to be
P(49.8 ≤ X ≤ 50.1) = ∫_{49.8}^{50.1} (1.5 − 6(x − 50.0)²) dx
= [1.5x − 2(x − 50.0)³] evaluated from 49.8 to 50.1
= [1.5 · 50.1 − 2(50.1 − 50.0)³] − [1.5 · 49.8 − 2(49.8 − 50.0)³]
= 75.148 − 74.716 = 0.432
[Plot of f(x) with the area between 49.8 and 50.1 shaded]
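Because the antiderivative is available in closed form, the probability above can be checked with a few lines of Python (a sketch, not slide code):

```python
# f(x) = 1.5 - 6*(x - 50.0)**2 on [49.5, 50.5], with antiderivative
# F(x) = 1.5*x - 2*(x - 50.0)**3, so P(a <= X <= b) = F(b) - F(a).
def F(x):
    return 1.5 * x - 2 * (x - 50.0) ** 3

total = F(50.5) - F(49.5)   # should be 1: f integrates to 1 over its support
p = F(50.1) - F(49.8)       # P(49.8 <= X <= 50.1)

print(total)   # 1.0
print(p)       # ~0.432
```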
Cumulative Distribution Function
The Cumulative Distribution Function (CDF) is a function
giving the probability that the random variable X is less than or
equal to x
Formally
The cumulative distribution function F(x) is defined to be:
F(x) = P(X ≤ x), −∞ < x < ∞
Cumulative Distribution Function
For a discrete random variable, the cumulative distribution function is found by summing up the probabilities as in the example below.
For a continuous random variable, the cumulative distribution function is the integral of its probability density function f(x).
F(x) = P(X ≤ x) = Σ_{xi ≤ x} p(xi)  (discrete case)
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt  (continuous case)
Discrete vs. Continuous RVs

Discrete Random Variable:
• Finite (or countably infinite) sample space, e.g. {0, 1, 2, 3}
• Probability Mass Function (PMF): p(xi) = P(X = xi)
1. p(xi) ≥ 0 for all i
2. Σi p(xi) = 1
• Cumulative Distribution Function (CDF): P(X ≤ x) = Σ_{xi ≤ x} p(xi)

Continuous Random Variable:
• Infinite sample space, e.g. [0, 1], [2.1, 5.3]
• Probability Density Function (PDF): f(x)
1. f(x) ≥ 0 for all x in R_X
2. ∫_{R_X} f(x) dx = 1
3. f(x) = 0 if x is not in R_X
• Cumulative Distribution Function (CDF): P(X ≤ x) = ∫_{−∞}^{x} f(t) dt
• P(a ≤ X ≤ b) = ∫_a^b f(x) dx
What is Bayesian decision theory?
• A mathematical foundation for decision making.
• It uses a probabilistic approach to help make decisions (classifications) so as to minimize the risk (cost).
• The decision problem is viewed in terms of probabilities, and it is assumed that all of the relevant probabilities are known
• It makes a compromise between the decisions and their costs and finds an optimal solution
Fish sorting - revisited
How to design it?
Notations
{ω1, ω2, …, ωc}: states of nature
P(ωi): prior probability
x: feature vector
p(x | ωi): class-conditional density
P(ωi | x): posterior probability
Decision rule
• In our example, the fish on the conveyer belt may be salmon or sea bass and we are not certain about it. Thus it can be described in terms of probability as the outcome of the experiment is not certain
• So, we assign a random variable ω to describe the type of fish
ω = ω1 for sea bass
ω = ω2 for salmon
State of nature
Decision Rule based on Prior Information
• Prior probability: It is based on our knowledge without doing any experimentation
• For example: if fishermen catch as much sea bass as salmon in a season then the catch and their appearance on conveyer belt is equiprobable:
• In general, sea bass and salmon may appear with any non-zero probabilities
P(ω1) = P(ω2) = 0.5 and P(ω1) + P(ω2) = 1
P(ω1) > 0 and P(ω2) > 0, but P(ω1) + P(ω2) = 1
Decision Rule based on Prior Information
• An optimal decision rule based on only prior probabilities (without seeing the fish) is
• This rule makes sense if we wish to judge one fish but it is odd if we are judging many fish
• If P(ω 1) >> P(ω 2) then our decision will be right most of the time
• The probability of error is the minimum of both probabilities
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
P(error) = min[P(ω1), P(ω2)]
Decision Rule based on Data
• Data can be images or features extracted (e.g. lightness of fish) from images
• In our case, different fish yield different lightness readings, and we express this variability in probabilistic terms
• It is expressed as a conditional probability also called class-conditional probability function i.e.
p(x | ω1): the pdf of x given that the state of nature is ω1
p(x | ω2): the pdf of x given that the state of nature is ω2
Decision Rule based on Data
• We can integrate the new knowledge, i.e. the class-conditional probability function, with the prior knowledge to compute the a posteriori probability (or posterior) using Bayes’ rule
P(ωi | x) = p(x | ωi) P(ωi) / p(x)
where p(x) = Σ_{i=1}^{2} p(x | ωi) P(ωi)
Decision Rule based on Data
• Bayes’ rule can also be expressed as
• Evidence can be considered as a scale factor; it is unimportant in making the decision
posterior = (likelihood × prior) / evidence
P(ωi | x) = p(x | ωi) P(ωi) / p(x)
Decision Rule based on Data
• An optimal decision rule based on posterior probabilities is
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
Since
P(ω1 | x) = p(x | ω1) P(ω1) / p(x), P(ω2 | x) = p(x | ω2) P(ω2) / p(x)
the decision rule becomes
p(x | ω1) P(ω1) / p(x) > p(x | ω2) P(ω2) / p(x)
i.e. p(x | ω1) P(ω1) > p(x | ω2) P(ω2)
Decision Rule based on Data
Decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2
Special cases:
1. P(ω1) = P(ω2): Decide ω1 if p(x | ω1) > p(x | ω2); otherwise decide ω2
2. p(x | ω1) = p(x | ω2): Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
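The rule "decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2)" is straightforward to implement once class-conditional densities are chosen. The sketch below assumes Gaussian class-conditionals for the lightness feature; the means, variances, and priors are made-up illustration values, not numbers from the slides:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density, used here as the class-conditional p(x|w_i)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def decide(x, priors, params):
    """Return the class (1-based) maximizing p(x|w_i) * P(w_i)."""
    scores = [gaussian_pdf(x, mu, s) * p for (mu, s), p in zip(params, priors)]
    return scores.index(max(scores)) + 1

# Hypothetical classes: sea bass (w1) lightness ~ N(4, 1), salmon (w2) ~ N(7, 1).
priors = [2 / 3, 1 / 3]
params = [(4.0, 1.0), (7.0, 1.0)]

print(decide(4.5, priors, params))   # 1 (sea bass)
print(decide(7.5, priors, params))   # 2 (salmon)
```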
Decision Rule based on Data
P(ωi | x) = p(x | ωi) P(ωi) / p(x)
Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
Decide ωi if p(x | ωi) P(ωi) > p(x | ωj) P(ωj) for all j ≠ i
Special cases:
• P(ω1) = P(ω2) = ⋯ = P(ωc)
• p(x | ω1) = p(x | ω2) = ⋯ = p(x | ωc)
Example
P(ω1) = P(ω2)
[Plot of the class-conditional densities with decision regions R1 and R2]
Example
P(ω1) = 2/3, P(ω2) = 1/3
Decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2
[Plot of p(x | ωi) P(ωi) with decision regions R1 and R2]
Classification error
The probability of error is the minimum of both probabilities
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
p(error | x) = P(ω2 | x) if we decide ω1
p(error | x) = P(ω1 | x) if we decide ω2
p(error | x) = min[P(ω1 | x), P(ω2 | x)]
Confusion Matrix
• To evaluate the classifier, we make a table of the following type

                                Predicted Class
Ground truth (state of nature)  Sea bass   Salmon
Sea bass                        N1         N2
Salmon                          N3         N4

N_seabass = N1 + N2, N_salmon = N3 + N4, N = N_seabass + N_salmon

Classification rate = Correct classifications / Total number of examples = (N1 + N4) / N
Classification error = Incorrect classifications / Total number of examples = (N2 + N3) / N
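Given the table counts, the two rates follow directly; the counts in this sketch are hypothetical numbers chosen only for illustration:

```python
# Confusion matrix counts, laid out as in the table above:
# rows = ground truth (sea bass, salmon), columns = predicted class.
N1, N2, N3, N4 = 45, 5, 10, 40   # hypothetical example counts

total = N1 + N2 + N3 + N4
classification_rate = (N1 + N4) / total    # correct: diagonal entries
classification_error = (N2 + N3) / total   # incorrect: off-diagonal entries

print(classification_rate, classification_error)   # 0.85 0.15
```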
Bayesian Decision theory
• The previous discussion can be extended to multidimensional features and multiclass problems
• Mathematical formulation
x ∈ ℝᵈ: d-dimensional feature space
ω1, ω2, …, ωc: represent states of nature
α1, α2, …, αa: represent associated actions
Bayesian Decision theory
• Bayes’ rule becomes
P(ωj | x) = p(x | ωj) P(ωj) / p(x), where p(x) = Σ_{j=1}^{c} p(x | ωj) P(ωj)
• The optimal decision rule becomes
Decide ωj if P(ωj | x) > P(ωi | x) for all i ≠ j
Conditional risk
• A loss function states exactly how costly each action is, and can be expressed as:
λ(αi | ωj) = λij: the loss incurred for taking action αi when the true state of nature is ωj.
We want to minimize the expected loss in making a decision.
Risk
Conditional risk
R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
Given x, this is the expected loss (risk) associated with taking action αi.
Loss Function
• How to define it?
• Example: 0-1 loss is defined as
λ(αi | ωj) = 0 if i = j, 1 if i ≠ j, for i, j = 1, 2, …, c
(zero loss for a correct decision, unit loss for any error)
Loss Function
Conditional risk based on 0-1 loss can be determined as
R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) = Σ_{j ≠ i} P(ωj | x) = 1 − P(ωi | x)
Bayesian Decision Rule:
α(x) = argmin_i R(αi | x)  (equivalently, argmax_i P(ωi | x))
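The reduction above can be checked numerically: under 0-1 loss the conditional risk of each action is one minus the corresponding posterior, so minimizing risk picks the class with the largest posterior. The posterior values in this sketch are made up for illustration:

```python
def conditional_risk(loss_row, posteriors):
    """R(a_i|x) = sum_j lambda(a_i|w_j) * P(w_j|x)."""
    return sum(l * p for l, p in zip(loss_row, posteriors))

c = 3
zero_one_loss = [[0 if i == j else 1 for j in range(c)] for i in range(c)]
posteriors = [0.2, 0.5, 0.3]     # hypothetical P(w_j|x), summing to 1

risks = [conditional_risk(row, posteriors) for row in zero_one_loss]
best = risks.index(min(risks))   # argmin_i R(a_i|x), 0-based index

print([round(r, 3) for r in risks])   # [0.8, 0.5, 0.7]
print(best)                           # 1, the index of the largest posterior
```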
Two category classification
• Mathematical formulation
– α1 = action corresponding to deciding that the true state of nature is ω1
– α2 = action corresponding to deciding that the true state of nature is ω2
– λij = λ(αi | ωj) = loss incurred for deciding ωi when the true state of nature is ωj
Two category classification
Actions: {α1, α2}; states of nature: {ω1, ω2}

          State of Nature
Action    ω1      ω2
α1        λ11     λ12
α2        λ21     λ22

Loss Function
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
Two category classification
• The optimal decision rule can be described as
• In terms of posteriori probabilities
Decide ω1 if R(α1 | x) < R(α2 | x); otherwise decide ω2
(λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)
Decide ω1 if (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)
or
Decide ω1 if (λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2)
x x
Two category classification
• Alternatively
Decide ω1 if
p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
The left-hand side is the Likelihood Ratio.
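The likelihood-ratio form of the rule can be sketched in code. As before, the Gaussian class-conditionals and all numeric values are illustrative assumptions, not from the slides; with 0-1 loss and equal priors the threshold reduces to 1:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio_decide(x, priors, params, loss):
    """Decide w1 iff p(x|w1)/p(x|w2) exceeds the loss- and prior-based threshold."""
    (l11, l12), (l21, l22) = loss   # loss[i][j] = lambda(a_i | w_j)
    ratio = gaussian_pdf(x, *params[0]) / gaussian_pdf(x, *params[1])
    threshold = ((l12 - l22) / (l21 - l11)) * (priors[1] / priors[0])
    return 1 if ratio > threshold else 2

priors = [0.5, 0.5]                  # equal priors
params = [(4.0, 1.0), (7.0, 1.0)]    # hypothetical Gaussian classes
zero_one = [[0, 1], [1, 0]]          # 0-1 loss => threshold = P(w2)/P(w1) = 1

print(likelihood_ratio_decide(4.0, priors, params, zero_one))   # 1
print(likelihood_ratio_decide(7.0, priors, params, zero_one))   # 2
```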
Decision boundaries