02 Bayesian Decision Theory
Institute of Space Technology (IST)
Bayesian Decision Theory
Dr Khurram Khurshid
Pattern Recognition
Probability Theory
• A random experiment is one whose outcome is not predictable with certainty in advance
• Sample Space (S): The set of all possible outcomes
• A sample space is discrete if it consists of a finite (or countably infinite) set of outcomes; otherwise it is continuous
Probability Theory
• Any subset (say A) of S is an event.
• Events are sets so we can consider their complements, intersection, union and so forth
Probability Theory
• Example: Consider the experiment where two coins are simultaneously tossed. The elementary events are S = {HH, HT, TH, TT}
• The subset A = {HH, HT, TH} is the same as “Head occurred at least once” and qualifies as an event
• The subset B = {HH} is the same as “Both heads occur simultaneously” and qualifies as an event
Probability Theory
• Union: “Does an outcome belong to A or B?”  A ∪ B = {HH, HT, TH}
• Intersection: “Does an outcome belong to A and B?”  A ∩ B = AB = {HH}
• Complement: “Does an outcome fall outside A?”  Ā = {TT}
Probability: Definition
• Classical definition: The probability of an event E is defined a priori, without actual experimentation, as
P(E) = (Number of outcomes favorable to E) / (Total number of possible outcomes)
provided all these outcomes are equally likely.
• Relative frequency definition: The probability of an event E is defined as
P(E) = lim (N→∞) N_E / N
where N_E is the number of occurrences of E and N is the total number of trials
Probability: Example
• Two coin example:
Probability that at least one head occurs = P(A) = 3/4
Probability that two heads occur simultaneously = P(B) = 1/4
Probability that two tails occur simultaneously = 1/4
Note that P(S) = 1
• Consider a box with n white and m red balls. In this case, there are two elementary outcomes: white ball or red ball.
• Probability of “selecting a white ball” = n / (n + m)
Probability
• If A1 is an event that cannot possibly occur then P(A1) = 0. If A2 is sure to occur, P(A2) = 1. Also, probability is a non-negative number:
0 ≤ P(A) ≤ 1
• If A and B are mutually exclusive events then the probability of their union is the sum of their probabilities:
If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
Probability
• If A and B are not mutually exclusive then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• Conditional Probability
P(A | B) = P(AB) / P(B)
P(A | B) = Probability of “the event A given that B has occurred”
Example
• A box contains 6 white and 4 black balls. Remove two balls at random without replacement. What is the probability that the first one is white and the second one is black?
Let W1 = “first ball removed is white”, B2 = “second ball removed is black”
P(W1 ∩ B2) = ?
Example
• Using the conditional probability rule
P(W1 ∩ B2) = P(B2 ∩ W1) = P(B2 | W1) P(W1)
P(W1) = 6 / (6 + 4) = 6/10 = 3/5, P(B2 | W1) = 4 / (5 + 4) = 4/9
Hence, P(W1 ∩ B2) = (6/10) · (4/9) = 4/15 ≈ 0.266
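As a quick check on the arithmetic above, this Python sketch (not from the slides) computes the exact probability with fractions and verifies it with a small Monte Carlo simulation; the seed and trial count are arbitrary choices:

```python
import random
from fractions import Fraction

# Conditional probability rule: P(W1 and B2) = P(B2 | W1) * P(W1)
p_w1 = Fraction(6, 10)          # 6 of the 10 balls are white
p_b2_given_w1 = Fraction(4, 9)  # 4 black balls among the 9 that remain
p_exact = p_b2_given_w1 * p_w1  # = 4/15

# Monte Carlo check: draw two balls without replacement many times.
random.seed(0)
trials = 100_000
hits = 0
for _ in range(trials):
    box = ["W"] * 6 + ["B"] * 4
    random.shuffle(box)
    hits += box[0] == "W" and box[1] == "B"
p_sim = hits / trials

print(p_exact, p_sim)  # 4/15, and a simulated value near 0.267
```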
Independence of events
• A and B are said to be independent events if
P(AB) = P(A) · P(B)
• Suppose A and B are independent, then
P(A | B) = P(AB) / P(B) = P(A) P(B) / P(B) = P(A)
Thus if A and B are independent, the event that B has occurred does not shed any more light on the event A. It makes no difference to A whether B has occurred or not.
Mutually Exclusive and Independent events?
• If two events A and B are independent:
– P(A and B) = P(A)P(B)
• If A and B are mutually exclusive:
– P(A and B) = 0
Clearly, if A and B are nontrivial events (P(A) and P(B) are nonzero), then they cannot be both independent and mutually exclusive.
• Consider a fair coin and a fair six-sided die. – Let event A be obtaining heads – Let event B be rolling a 6
• We can reasonably assume that events A and B are independent, because the outcome of one does not affect the outcome of the other– P(A and B) = (1/2)(1/6) = 1/12
• Since this value is not zero, then events A and B cannot be mutually exclusive.
Mutually Exclusive and Independent events?
• Consider a fair six-sided die as before, only in addition to the numbers 1
through 6 on each face, we have the property that the even-numbered
faces are colored red, and the odd-numbered faces are colored green.
– Let event A be rolling a green face
– Let event B be rolling a 6
– P(A) = 1/2 and P(B) = 1/6
• Events A and B cannot simultaneously occur, since rolling a 6 means
the face is red, and rolling a green face means the number showing is
odd
• A mutually exclusive pair of nontrivial events is also necessarily a pair of dependent events
Mutually Exclusive and Independent events?
• If A and B are mutually exclusive, then if A
occurs, then B cannot also occur; and vice
versa.
• This stands in contrast to saying the outcome of
A does not affect the outcome of B, which is
independence of events.
Bayes’ theorem
P(A | B) = P(AB) / P(B)  ⟹  P(AB) = P(A | B) P(B)
P(B | A) = P(AB) / P(A)  ⟹  P(AB) = P(B | A) P(A)
Thus
P(A | B) P(B) = P(B | A) P(A)
P(A | B) = P(B | A) P(A) / P(B)
Bayes’ theorem
• The general form of Bayes’ theorem is
P(Ai | B) = P(B | Ai) P(Ai) / P(B) = P(B | Ai) P(Ai) / Σ_{i=1}^{n} P(B | Ai) P(Ai)
Bayes’ Theorem: Example
• Two boxes B1 and B2 contain 100 and 200 light bulbs
respectively. The first box (B1) has 15 defective bulbs and the
second has 5 defective bulbs. Suppose a box is selected at
random and one bulb is picked out.
a) What is the probability it is defective?
b) Suppose the bulb we tested was defective. What is the probability it came from box 1?
Bayes’ Theorem: Example
• Solution: Part (a)
Note that box B1 has 85 good and 15 defective bulbs. Similarly box B2
has 195 good and 5 defective bulbs.
B1 and B2 form a partition of the sample space. Let D = “Defective bulb is picked out”.
P(D) = P(D | B1) P(B1) + P(D | B2) P(B2)
Bayes’ Theorem: Example
Since the box is selected at random, they are equally likely
The probability of event D is:
Thus, there is about 9% probability that a bulb picked at random is defective
P(B1) = P(B2) = 1/2
P(D | B1) = 15/100 = 0.15, P(D | B2) = 5/200 = 0.025
P(D) = P(D | B1) P(B1) + P(D | B2) P(B2)
P(D) = 0.15 · (1/2) + 0.025 · (1/2) = 0.0875
Notice that initially P(B1) = P(B2) = 0.5, then we picked out a box at random and tested a bulb that turned out to be defective. Can this information shed some light about the fact that we might have picked up box 1?
Bayes’ Theorem: Example
• Part (b): Suppose the bulb we tested was defective. What is the probability it came from box 1?
• Solution:
P(B1 | D) = ?
P(B1 | D) = P(D | B1) P(B1) / P(D) = (0.15 · 1/2) / 0.0875 ≈ 0.8571
Bayes’ Theorem: Example
Indeed, it is now more likely that we chose box 1 rather than box 2. (Recall box 1 has three times as many defective bulbs as box 2.)
P(B1 | D) = 0.857 > 0.5
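Both parts of this example can be reproduced in a few lines of Python (a sketch of the computation, not code from the slides):

```python
# Priors: the box is selected at random.
p_b1 = p_b2 = 0.5

# Class-conditional probabilities of picking a defective bulb.
p_d_given_b1 = 15 / 100   # box 1: 15 defective out of 100
p_d_given_b2 = 5 / 200    # box 2: 5 defective out of 200

# Part (a): total probability of a defective bulb.
p_d = p_d_given_b1 * p_b1 + p_d_given_b2 * p_b2

# Part (b): posterior probability the defective bulb came from box 1.
p_b1_given_d = p_d_given_b1 * p_b1 / p_d

print(p_d)           # ~0.0875
print(p_b1_given_d)  # ~0.8571
```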
Random Variable
When the value of a variable is the outcome of a statistical experiment, that variable is a random variable
Statistical Experiment: • The experiment can have more than one possible outcome • Each possible outcome can be specified in advance. • The outcome of the experiment depends on chance
Random variables
Discrete Random Variables
Discrete random variables take on a finite or countably infinite set of values (e.g. integers)
Tossing a die: {1, 2, 3, 4, 5, 6}
Continuous Random Variables
Continuous random variables can take on any value within a
range of values
We flip a coin many times and compute the average number
of heads
Probability distribution
A probability distribution is a table or an equation that links each possible value that a random variable can assume with its probability of occurrence
Discrete Probability Distribution
Flip a coin two times – Four possible outcomes
HH, HT, TH, TT
Let variable X represent the number of heads that result
from the coin flips
X can take on the values 0, 1, or 2; and X is a discrete
random variable.
Discrete Probability Distribution
Number of heads x   Probability P(x)
0                   0.25
1                   0.50
2                   0.25
[Bar chart of P(x) against the number of heads]
The probability distribution for a discrete random variable is called the Probability Mass Function (PMF)
p(xi) = P(X = xi)
Probability mass function
The probability distribution or probability mass function
(PMF) of a discrete random variable X is a function that gives
the probability p(xi) that the random variable equals xi, for each
value xi:
It satisfies the following conditions:
0 ≤ p(xi) ≤ 1
Σi p(xi) = 1
p(xi) = P(X = xi)
Cumulative distribution function
P(X ≤ x) = Σ_{xi ≤ x} p(xi)
What is the probability of getting 1 or fewer heads?
Number of heads x   Probability P(x)
0                   0.25
1                   0.50
2                   0.25
P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.25 + 0.5 = 0.75
Example
Student ID 1 2 3 4 5 6 7 8 9 10
Grade 3 2 3 1 2 3 1 3 2 2
Random Variable: Grades of the students
p(1) = P(X = 1) = 2/10 = 0.2
p(2) = P(X = 2) = 4/10 = 0.4
p(3) = P(X = 3) = 4/10 = 0.4
[Bar chart: Probability Mass Function (PMF) over grades]
Example
Student ID 1 2 3 4 5 6 7 8 9 10
Grade 3 2 3 1 2 3 1 3 2 2
Probability Mass Function property: Σi p(xi) = p(1) + p(2) + p(3) = 1
Cumulative Distribution Function
P(X ≤ x) = Σ_{xi ≤ x} p(xi)
P(X ≤ 2) = Σ_{xi ≤ 2} p(xi) = p(1) + p(2) = 0.2 + 0.4 = 0.6
P(X ≤ 3) = Σ_{xi ≤ 3} p(xi) = p(1) + p(2) + p(3) = 1
[Plot: CDF over grades]
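The PMF and CDF in this example can be computed directly from the grade list; this sketch uses only the Python standard library:

```python
from collections import Counter
from itertools import accumulate

grades = [3, 2, 3, 1, 2, 3, 1, 3, 2, 2]   # grades of the 10 students

n = len(grades)
counts = Counter(grades)
values = sorted(counts)                    # possible values of X: 1, 2, 3

pmf = {x: counts[x] / n for x in values}   # p(x) = P(X = x)
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))  # P(X <= x)

print(pmf)                                        # {1: 0.2, 2: 0.4, 3: 0.4}
print({x: round(c, 3) for x, c in cdf.items()})   # {1: 0.2, 2: 0.6, 3: 1.0}
```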
Continuous Probability Distributions
The probability distribution of a continuous random variable is represented by an equation, called the probability density function (pdf).
Probability Density Function
• The probability that a continuous random variable
will assume a particular value is always zero
• The probability that a continuous random variable
falls in the interval between a and b is equal to
the area under the pdf curve between a and b
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Example
• Suppose that the random variable X is the diameter
of a randomly chosen cylinder manufactured by the
company.
• It can take any value between 49.5 and 50.5
• It is a continuous random variable
Example
• Suppose that the diameter of a metal cylinder has a p.d.f.
f(x) = 1.5 − 6(x − 50.0)², for 49.5 ≤ x ≤ 50.5
f(x) = 0, elsewhere
[Plot of f(x) over 49.5 ≤ x ≤ 50.5]
Example
The probability that a metal cylinder has a diameter between 49.8 and 50.1 mm can be calculated to be
P(49.8 ≤ X ≤ 50.1) = ∫_{49.8}^{50.1} (1.5 − 6(x − 50.0)²) dx
= [1.5x − 2(x − 50.0)³] evaluated from 49.8 to 50.1
= [1.5 · 50.1 − 2(50.1 − 50.0)³] − [1.5 · 49.8 − 2(49.8 − 50.0)³]
= 75.148 − 74.716 = 0.432
[Plot of f(x) with the area between 49.8 and 50.1 shaded]
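Because the antiderivative is available in closed form, the probability above can be checked with a few lines of Python (a sketch, not slide code):

```python
# f(x) = 1.5 - 6*(x - 50.0)**2 on [49.5, 50.5], with antiderivative
# F(x) = 1.5*x - 2*(x - 50.0)**3, so P(a <= X <= b) = F(b) - F(a).
def F(x):
    return 1.5 * x - 2 * (x - 50.0) ** 3

total = F(50.5) - F(49.5)   # should be 1: f integrates to 1 over its support
p = F(50.1) - F(49.8)       # P(49.8 <= X <= 50.1)

print(total)   # 1.0
print(p)       # ~0.432
```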
Cumulative Distribution Function
The Cumulative Distribution Function (CDF) is a function
giving the probability that the random variable X is less than or
equal to x
Formally
The cumulative distribution function F(x) is defined to be:
F(x) = P(X ≤ x), −∞ < x < ∞
Cumulative Distribution Function
For a discrete random variable, the cumulative distribution function is found by summing up the probabilities as in the example below.
For a continuous random variable, the cumulative distribution function is the integral of its probability density function f(x).
F(x) = P(X ≤ x) = Σ_{xi ≤ x} p(xi)  (discrete case)
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt  (continuous case)
Discrete vs. Continuous RVs

Discrete Random Variable:
• Finite (or countably infinite) sample space, e.g. {0, 1, 2, 3}
• Probability Mass Function (PMF): p(xi) = P(X = xi)
1. p(xi) ≥ 0 for all i
2. Σi p(xi) = 1
• Cumulative Distribution Function (CDF): P(X ≤ x) = Σ_{xi ≤ x} p(xi)

Continuous Random Variable:
• Infinite sample space, e.g. [0, 1], [2.1, 5.3]
• Probability Density Function (PDF): f(x)
1. f(x) ≥ 0 for all x in R_X
2. ∫_{R_X} f(x) dx = 1
3. f(x) = 0 if x is not in R_X
• Cumulative Distribution Function (CDF): P(X ≤ x) = ∫_{−∞}^{x} f(t) dt
• P(a ≤ X ≤ b) = ∫_a^b f(x) dx
What is Bayesian decision theory?
• A mathematical foundation for decision making.
• It uses a probabilistic approach to help make decisions (classifications) so as to minimize the risk (cost).
• The decision problem is viewed in terms of probabilities, and it is assumed that all of the relevant probabilities are known
• It makes a compromise between the decisions and their costs and finds an optimal solution
Fish sorting - revisited
How to design it?
Notations
{ω1, ω2, …, ωc}: states of nature
P(ωi): prior probability
x: feature vector
p(x | ωi): class-conditional density
P(ωi | x): posterior probability
Decision rule
• In our example, the fish on the conveyer belt may be salmon or sea bass and we are not certain about it. Thus it can be described in terms of probability as the outcome of the experiment is not certain
• So, we assign a random variable ω to describe the type of fish
ω = ω1 for sea bass
ω = ω2 for salmon
State of nature
Decision Rule based on Prior Information
• Prior probability: It is based on our knowledge without doing any experimentation
• For example: if fishermen catch as much sea bass as salmon in a season then the catch and their appearance on conveyer belt is equiprobable:
• In general, sea bass and salmon may appear with any non-zero probabilities
P(ω1) = P(ω2) = 0.5 and P(ω1) + P(ω2) = 1
P(ω1) > 0 and P(ω2) > 0, but P(ω1) + P(ω2) = 1
Decision Rule based on Prior Information
• An optimal decision rule based on only prior probabilities (without seeing the fish) is
• This rule makes sense if we wish to judge one fish but it is odd if we are judging many fish
• If P(ω 1) >> P(ω 2) then our decision will be right most of the time
• The probability of error is the minimum of both probabilities
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
P(error) = min[P(ω1), P(ω2)]
Decision Rule based on Data
• Data can be images or features extracted (e.g. lightness of fish) from images
• In our case, different fish yield different lightness readings, and we express this variability in probabilistic terms
• It is expressed as a conditional probability also called class-conditional probability function i.e.
p(x | ω1): the pdf of x given that the state of nature is ω1
p(x | ω2): the pdf of x given that the state of nature is ω2
Decision Rule based on Data
• We can integrate the new knowledge, i.e. the class-conditional probability function, with the prior knowledge to compute the a posteriori probability (or posterior) using Bayes’ rule
P(ωi | x) = p(x | ωi) P(ωi) / p(x)
where p(x) = Σ_{i=1}^{2} p(x | ωi) P(ωi)
Decision Rule based on Data
• Bayes’ rule can also be expressed as
• Evidence can be considered as a scale factor; it is unimportant in making the decision
posterior = (likelihood × prior) / evidence
P(ωi | x) = p(x | ωi) P(ωi) / p(x)
Decision Rule based on Data
• An optimal decision rule based on posterior probabilities is
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
Since
P(ω1 | x) = p(x | ω1) P(ω1) / p(x), P(ω2 | x) = p(x | ω2) P(ω2) / p(x)
the decision rule becomes
p(x | ω1) P(ω1) / p(x) > p(x | ω2) P(ω2) / p(x)
i.e. p(x | ω1) P(ω1) > p(x | ω2) P(ω2)
Decision Rule based on Data
Decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2
Special cases:
1. P(ω1) = P(ω2): Decide ω1 if p(x | ω1) > p(x | ω2); otherwise decide ω2
2. p(x | ω1) = p(x | ω2): Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
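The rule "decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2)" is straightforward to implement once class-conditional densities are chosen. The sketch below assumes Gaussian class-conditionals for the lightness feature; the means, variances, and priors are made-up illustration values, not numbers from the slides:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density, used here as the class-conditional p(x|w_i)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def decide(x, priors, params):
    """Return the class (1-based) maximizing p(x|w_i) * P(w_i)."""
    scores = [gaussian_pdf(x, mu, s) * p for (mu, s), p in zip(params, priors)]
    return scores.index(max(scores)) + 1

# Hypothetical classes: sea bass (w1) lightness ~ N(4, 1), salmon (w2) ~ N(7, 1).
priors = [2 / 3, 1 / 3]
params = [(4.0, 1.0), (7.0, 1.0)]

print(decide(4.5, priors, params))   # 1 (sea bass)
print(decide(7.5, priors, params))   # 2 (salmon)
```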
Decision Rule based on Data
P(ωi | x) = p(x | ωi) P(ωi) / p(x)
Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
Decide ωi if p(x | ωi) P(ωi) > p(x | ωj) P(ωj) for all j ≠ i
Special cases:
• P(ω1) = P(ω2) = ⋯ = P(ωc)
• p(x | ω1) = p(x | ω2) = ⋯ = p(x | ωc)
Example
P(ω1) = P(ω2)
[Plot of the class-conditional densities with decision regions R1 and R2]
Example
P(ω1) = 2/3, P(ω2) = 1/3
Decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2
[Plot of p(x | ωi) P(ωi) with decision regions R1 and R2]
Classification error
The probability of error is the minimum of both probabilities
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
p(error | x) = P(ω2 | x) if we decide ω1
p(error | x) = P(ω1 | x) if we decide ω2
p(error | x) = min[P(ω1 | x), P(ω2 | x)]
Confusion Matrix
• To evaluate the classifier, we make a table of the following type

                                Predicted Class
Ground truth (state of nature)  Sea bass   Salmon
Sea bass                        N1         N2
Salmon                          N3         N4

N_seabass = N1 + N2, N_salmon = N3 + N4, N = N_seabass + N_salmon

Classification rate = Correct classifications / Total number of examples = (N1 + N4) / N
Classification error = Incorrect classifications / Total number of examples = (N2 + N3) / N
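Given the table counts, the two rates follow directly; the counts in this sketch are hypothetical numbers chosen only for illustration:

```python
# Confusion matrix counts, laid out as in the table above:
# rows = ground truth (sea bass, salmon), columns = predicted class.
N1, N2, N3, N4 = 45, 5, 10, 40   # hypothetical example counts

total = N1 + N2 + N3 + N4
classification_rate = (N1 + N4) / total    # correct: diagonal entries
classification_error = (N2 + N3) / total   # incorrect: off-diagonal entries

print(classification_rate, classification_error)   # 0.85 0.15
```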
Bayesian Decision theory
• The previous discussion can be extended to multidimensional features and multiclass problems
• Mathematical formulation
x ∈ ℝᵈ: d-dimensional feature space
ω1, ω2, …, ωc: represent states of nature
α1, α2, …, αa: represent associated actions
Bayesian Decision theory
• Bayes’ rule becomes
P(ωj | x) = p(x | ωj) P(ωj) / p(x), where p(x) = Σ_{j=1}^{c} p(x | ωj) P(ωj)
• The optimal decision rule becomes
Decide ωj if P(ωj | x) > P(ωi | x) for all i ≠ j
Conditional risk
• A loss function states exactly how costly each action is, and can be expressed as:
λ(αi | ωj) = λij: the loss incurred for taking action αi when the true state of nature is ωj.
We want to minimize the expected loss in making a decision.
Risk
Conditional risk
R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
Given x, this is the expected loss (risk) associated with taking action αi.
Loss Function
• How to define it?
• Example: 0-1 loss is defined as
λ(αi | ωj) = 0 if i = j, 1 if i ≠ j, for i, j = 1, 2, …, c
(zero loss for a correct decision, unit loss for any error)
Loss Function
Conditional risk based on 0-1 loss can be determined as
R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) = Σ_{j ≠ i} P(ωj | x) = 1 − P(ωi | x)
Bayesian Decision Rule:
α(x) = argmin_i R(αi | x)  (equivalently, argmax_i P(ωi | x))
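The reduction above can be checked numerically: under 0-1 loss the conditional risk of each action is one minus the corresponding posterior, so minimizing risk picks the class with the largest posterior. The posterior values in this sketch are made up for illustration:

```python
def conditional_risk(loss_row, posteriors):
    """R(a_i|x) = sum_j lambda(a_i|w_j) * P(w_j|x)."""
    return sum(l * p for l, p in zip(loss_row, posteriors))

c = 3
zero_one_loss = [[0 if i == j else 1 for j in range(c)] for i in range(c)]
posteriors = [0.2, 0.5, 0.3]     # hypothetical P(w_j|x), summing to 1

risks = [conditional_risk(row, posteriors) for row in zero_one_loss]
best = risks.index(min(risks))   # argmin_i R(a_i|x), 0-based index

print([round(r, 3) for r in risks])   # [0.8, 0.5, 0.7]
print(best)                           # 1, the index of the largest posterior
```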
Two category classification
• Mathematical formulation
– α1 = action corresponding to deciding that the true state of nature is ω1
– α2 = action corresponding to deciding that the true state of nature is ω2
– λij = λ(αi | ωj) = loss incurred for deciding ωi when the true state of nature is ωj
Two category classification
Actions: {α1, α2}; states of nature: {ω1, ω2}

          State of Nature
Action    ω1      ω2
α1        λ11     λ12
α2        λ21     λ22

Loss Function
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
Two category classification
• The optimal decision rule can be described as
• In terms of posteriori probabilities
Decide ω1 if R(α1 | x) < R(α2 | x); otherwise decide ω2
(λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)
Decide ω1 if (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)
or
Decide ω1 if (λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2)
x x
Two category classification
• Alternatively
Decide ω1 if
p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
The left-hand side is the Likelihood Ratio.
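The likelihood-ratio form of the rule can be sketched in code. As before, the Gaussian class-conditionals and all numeric values are illustrative assumptions, not from the slides; with 0-1 loss and equal priors the threshold reduces to 1:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio_decide(x, priors, params, loss):
    """Decide w1 iff p(x|w1)/p(x|w2) exceeds the loss- and prior-based threshold."""
    (l11, l12), (l21, l22) = loss   # loss[i][j] = lambda(a_i | w_j)
    ratio = gaussian_pdf(x, *params[0]) / gaussian_pdf(x, *params[1])
    threshold = ((l12 - l22) / (l21 - l11)) * (priors[1] / priors[0])
    return 1 if ratio > threshold else 2

priors = [0.5, 0.5]                  # equal priors
params = [(4.0, 1.0), (7.0, 1.0)]    # hypothetical Gaussian classes
zero_one = [[0, 1], [1, 0]]          # 0-1 loss => threshold = P(w2)/P(w1) = 1

print(likelihood_ratio_decide(4.0, priors, params, zero_one))   # 1
print(likelihood_ratio_decide(7.0, priors, params, zero_one))   # 2
```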
Decision boundaries