02 Bayesian Decision Theory


Transcript of 02 bayesian decision theory

Page 1: 02 bayesian decision theory

Bayesian Decision Theory

Dr Khurram Khurshid

Pattern Recognition

Page 4: 02 bayesian decision theory

Probability Theory

• A random experiment is one whose outcome is not predictable with certainty in advance

• Sample Space (S): the set of all possible outcomes

• A sample space is discrete if it consists of a finite (or countably infinite) set of outcomes; otherwise it is continuous

Page 5: 02 bayesian decision theory

Probability Theory

• Any subset (say A) of S is an event.

• Events are sets, so we can consider their complements, intersections, unions, and so forth

Page 6: 02 bayesian decision theory

Probability Theory

• Example: Consider the experiment where two coins are simultaneously tossed. The elementary events form the sample space $S = \{HH, HT, TH, TT\}$

• The subset $A = \{HH, HT, TH\}$ is the same as “Head occurred at least once” and qualifies as an event

• The subset $B = \{HH\}$ is the same as “Both heads occur simultaneously” and qualifies as an event

Page 7: 02 bayesian decision theory

Probability Theory

• Union: “Does an outcome belong to A or B?” $A \cup B = \{HH, HT, TH\}$

• Intersection: “Does an outcome belong to A and B?” $A \cap B = AB = \{HH\}$

• Complement: “Does an outcome fall outside A?” $\bar{A} = \{TT\}$

Page 8: 02 bayesian decision theory

Probability: Definition

• Classical definition: the probability of an event E is defined a priori, without actual experimentation, as

$$P(E) = \frac{\text{Number of outcomes favorable to } E}{\text{Total number of possible outcomes}}$$

provided all these outcomes are equally likely.

• Relative frequency definition: the probability of an event E is defined as

$$P(E) = \lim_{N \to \infty} \frac{N_E}{N}$$

where $N_E$ is the number of occurrences of E and N is the total number of trials.

Page 9: 02 bayesian decision theory

Probability: Example

• Two-coin example:
Probability that at least one head occurs = P(A) = 3/4
Probability that two heads occur simultaneously = P(B) = 1/4
Probability that two tails occur simultaneously = 1/4
Note that P(S) = 1

• Consider a box with n white and m red balls. In this case, there are two elementary outcomes: white ball or red ball.

• Probability of “selecting a white ball” $= \dfrac{n}{n+m}$
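
The relative frequency definition above suggests a quick sanity check by simulation. A minimal Python sketch (the helper names are ours, not the slides'); as the number of trials grows, the estimate approaches the classical value P(A) = 3/4:

```python
import random

def estimate_probability(event, trials=100_000):
    """Relative-frequency estimate: count occurrences of `event` over many trials."""
    hits = sum(event() for _ in range(trials))
    return hits / trials

def at_least_one_head():
    # Toss two fair coins; True if at least one lands heads.
    return random.random() < 0.5 or random.random() < 0.5

print(estimate_probability(at_least_one_head))  # approaches P(A) = 3/4
```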

Page 10: 02 bayesian decision theory

Probability

• If A1 is an event that cannot possibly occur then P(A1) = 0. If A2 is sure to occur, P(A2) = 1. Probability is a non-negative number:

$$0 \le P(A) \le 1$$

• If A and B are mutually exclusive events then the probability of their union is the sum of their probabilities:

$$\text{If } A \cap B = \emptyset, \text{ then } P(A \cup B) = P(A) + P(B)$$

Page 11: 02 bayesian decision theory

Probability

• If A and B are not mutually exclusive then

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

• Conditional Probability

$$P(A \mid B) = \frac{P(AB)}{P(B)}$$

P(A | B) = probability of “the event A given that B has occurred”

Page 12: 02 bayesian decision theory

Example

• A box contains 6 white and 4 black balls. Remove two balls at random without replacement. What is the probability that the first one is white and the second one is black?

Let W1 = “first ball removed is white” and B2 = “second ball removed is black”

$$P(W_1 \cap B_2) = ?$$

Page 13: 02 bayesian decision theory

Example

• Using the conditional probability rule

$$P(W_1 \cap B_2) = P(B_2 W_1) = P(B_2 \mid W_1)\,P(W_1)$$

$$P(W_1) = \frac{6}{6+4} = \frac{6}{10} = \frac{3}{5}, \qquad P(B_2 \mid W_1) = \frac{4}{5+4} = \frac{4}{9}$$

$$\text{Hence, } P(W_1 \cap B_2) = \frac{6}{10} \cdot \frac{4}{9} = \frac{4}{15} \approx 0.266$$
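
A quick cross-check of this computation with exact fractions (a sketch; the variable names are ours):

```python
from fractions import Fraction

# P(W1): 6 white out of 10 balls; P(B2 | W1): 4 black out of the 9 remaining.
p_w1 = Fraction(6, 10)
p_b2_given_w1 = Fraction(4, 9)
p_joint = p_w1 * p_b2_given_w1
print(p_joint, float(p_joint))  # 4/15 ≈ 0.2667
```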

Page 14: 02 bayesian decision theory

Independence of events

• A and B are said to be independent events if

$$P(AB) = P(A)\,P(B)$$

• Suppose A and B are independent; then

$$P(A \mid B) = \frac{P(AB)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A)$$

Thus if A and B are independent, the event that B has occurred does not shed any more light on the event A. It makes no difference to A whether B has occurred or not.

Page 15: 02 bayesian decision theory

Mutually Exclusive and Independent events?

• If two events A and B are independent: P(A and B) = P(A)P(B)

• If A and B are mutually exclusive: P(A and B) = 0

Clearly, if A and B are nontrivial events (P(A) and P(B) are nonzero), then they cannot be both independent and mutually exclusive.

Page 16: 02 bayesian decision theory

Mutually Exclusive and Independent events?

• Consider a fair coin and a fair six-sided die.
– Let event A be obtaining heads
– Let event B be rolling a 6

• We can reasonably assume that events A and B are independent, because the outcome of one does not affect the outcome of the other:
– P(A and B) = (1/2)(1/6) = 1/12

• Since this value is not zero, events A and B cannot be mutually exclusive.

Page 17: 02 bayesian decision theory

Mutually Exclusive and Independent events?

• Consider a fair six-sided die as before, only in addition to the numbers 1 through 6 on each face, we have the property that the even-numbered faces are colored red, and the odd-numbered faces are colored green.
– Let event A be rolling a green face
– Let event B be rolling a 6
– P(A) = 1/2 and P(B) = 1/6

• Events A and B cannot simultaneously occur, since rolling a 6 means the face is red, and rolling a green face means the number showing is odd

• A mutually exclusive pair of nontrivial events is therefore necessarily a pair of dependent events

Page 18: 02 bayesian decision theory

Mutually Exclusive and Independent events?

• If A and B are mutually exclusive, then if A occurs, B cannot also occur; and vice versa.

• This stands in contrast to saying the outcome of A does not affect the outcome of B, which is independence of events.

Page 19: 02 bayesian decision theory

Bayes’ theorem

$$P(A \mid B) = \frac{P(AB)}{P(B)} \quad\Rightarrow\quad P(AB) = P(A \mid B)\,P(B)$$

$$P(B \mid A) = \frac{P(AB)}{P(A)} \quad\Rightarrow\quad P(AB) = P(B \mid A)\,P(A)$$

Thus

$$P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Page 20: 02 bayesian decision theory

Bayes’ theorem

• The general form of Bayes’ theorem is

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{P(B)} = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)}$$
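
A small sketch of the general form in Python, assuming priors and likelihoods are given as plain lists (the function name is ours); the numbers are those of the two-box bulb example that follows:

```python
def posterior(priors, likelihoods):
    """General Bayes' theorem: posterior_i ∝ likelihood_i * prior_i,
    normalized by the total probability of the observed event B."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # P(B)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Two-box bulb example: P(B1) = P(B2) = 0.5, P(D|B1) = 0.15, P(D|B2) = 0.025
print(posterior([0.5, 0.5], [0.15, 0.025]))  # [0.857..., 0.142...]
```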

Page 21: 02 bayesian decision theory

Bayes’ Theorem: Example

• Two boxes B1 and B2 contain 100 and 200 light bulbs respectively. The first box (B1) has 15 defective bulbs and the second has 5 defective bulbs. Suppose a box is selected at random and one bulb is picked out.

a) What is the probability it is defective?

b) Suppose the bulb we tested was defective. What is the probability it came from box 1?

Page 22: 02 bayesian decision theory

Bayes’ Theorem: Example

• Solution: Part (a)

Note that box B1 has 85 good and 15 defective bulbs. Similarly, box B2 has 195 good and 5 defective bulbs. B1 and B2 partition the possibilities, since exactly one box is selected. Let D = “Defective bulb is picked out”.

$$P(D) = P(D \mid B_1)\,P(B_1) + P(D \mid B_2)\,P(B_2)$$

Page 23: 02 bayesian decision theory

Bayes’ Theorem: Example

Since the box is selected at random, the two boxes are equally likely:

$$P(B_1) = P(B_2) = \frac{1}{2}$$

The conditional probabilities of picking a defective bulb are

$$P(D \mid B_1) = \frac{15}{100} = 0.15, \qquad P(D \mid B_2) = \frac{5}{200} = 0.025$$

The probability of event D is

$$P(D) = P(D \mid B_1)\,P(B_1) + P(D \mid B_2)\,P(B_2) = 0.15 \cdot \frac{1}{2} + 0.025 \cdot \frac{1}{2} = 0.0875$$

Thus, there is about a 9% probability that a bulb picked at random is defective.

Page 24: 02 bayesian decision theory

Bayes’ Theorem: Example

• Part (b): Suppose the bulb we tested was defective. What is the probability it came from box 1?

• Notice that initially P(B1) = P(B2) = 0.5; then we picked a box at random and tested a bulb that turned out to be defective. Can this information shed some light on which box we picked?

• Solution:

$$P(B_1 \mid D) = \frac{P(D \mid B_1)\,P(B_1)}{P(D)} = \frac{0.15 \cdot 1/2}{0.0875} \approx 0.8571$$

Page 25: 02 bayesian decision theory

Bayes’ Theorem: Example

Indeed, it is now more likely that we chose box 1 rather than box 2. (Recall that box B1 has three times more defective bulbs than box B2.)

$$P(B_1 \mid D) = 0.857 > 0.5 = P(B_1)$$
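
The whole example can be reproduced in a few lines; this sketch simply restates the slide's numbers (variable names are ours):

```python
# Two-box bulb example, worked end to end.
p_b = [0.5, 0.5]                    # priors: box chosen at random
p_d_given_b = [15 / 100, 5 / 200]   # defect rates per box

# Part (a): total probability of drawing a defective bulb
p_d = sum(l * p for l, p in zip(p_d_given_b, p_b))
print(p_d)  # 0.0875

# Part (b): posterior probability the bulb came from box 1
print(p_d_given_b[0] * p_b[0] / p_d)  # 0.857...
```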

Page 26: 02 bayesian decision theory

Random Variable

When the value of a variable is the outcome of a statistical experiment, that variable is a random variable


Statistical Experiment:
• The experiment can have more than one possible outcome
• Each possible outcome can be specified in advance
• The outcome of the experiment depends on chance

Page 27: 02 bayesian decision theory

Random variables

Discrete Random Variables

Discrete random variables take on values from a countable set, e.g. the integers

Example: tossing a die, with outcomes {1, 2, 3, 4, 5, 6}

Continuous Random Variables

Continuous random variables can take on any value within a range of values

Example: we flip a coin many times and compute the average number of heads

Page 28: 02 bayesian decision theory

Probability distribution

A probability distribution is a table or an equation that links each possible value that a random variable can assume with its probability of occurrence


Page 29: 02 bayesian decision theory

Discrete Probability Distribution

Flip a coin two times – Four possible outcomes

HH, HT, TH, TT

Let variable X represent the number of heads that result from the coin flips

X can take on the values 0, 1, or 2; X is a discrete random variable

Page 30: 02 bayesian decision theory

Discrete Probability Distribution

Number of heads x    Probability P(x)
0                    0.25
1                    0.5
2                    0.25

[Bar chart: P(x) against the number of heads]

The probability distribution for a discrete random variable is called the Probability Mass Function (PMF):

$$p(x_i) = P(X = x_i)$$

Page 31: 02 bayesian decision theory

Probability mass function

The probability distribution or probability mass function (PMF) of a discrete random variable X is a function that gives the probability p(xi) that the random variable equals xi, for each value xi:

$$p(x_i) = P(X = x_i)$$

It satisfies the following conditions:

$$0 \le p(x_i) \le 1, \qquad \sum_i p(x_i) = 1$$
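
A minimal sketch of a PMF as a Python dictionary, using the two-coin distribution from the earlier slide and checking both conditions:

```python
# PMF for X = number of heads in two fair coin flips.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

# Both PMF conditions hold: each p(x) is in [0, 1] and they sum to 1.
assert all(0 <= p <= 1 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12
print(pmf[1])  # P(X = 1) = 0.5
```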

Page 32: 02 bayesian decision theory

Cumulative distribution function

$$P(X \le x) = \sum_{x_i \le x} p(x_i)$$

What is the probability of getting 1 or fewer heads?

$$P(X \le 1) = P(X = 0) + P(X = 1) = 0.25 + 0.5 = 0.75$$

Number of heads x    Probability P(x)
0                    0.25
1                    0.5
2                    0.25

Page 33: 02 bayesian decision theory

Example

Student ID 1 2 3 4 5 6 7 8 9 10

Grade 3 2 3 1 2 3 1 3 2 2

Random Variable: the grades of the students

$$p(1) = P(X = 1) = \frac{2}{10} = 0.2$$

$$p(2) = P(X = 2) = \frac{4}{10} = 0.4$$

$$p(3) = P(X = 3) = \frac{4}{10} = 0.4$$

[Bar chart: Probability Mass Function over the grade values]

Page 34: 02 bayesian decision theory

Example

Student ID 1 2 3 4 5 6 7 8 9 10

Grade 3 2 3 1 2 3 1 3 2 2

Probability Mass Function property:

$$\sum_i p(x_i) = p(1) + p(2) + p(3) = 1$$

Cumulative Distribution Function:

$$P(X \le x) = \sum_{x_i \le x} p(x_i)$$

$$P(X \le 2) = \sum_{x_i \le 2} p(x_i) = p(1) + p(2) = 0.2 + 0.4 = 0.6$$

$$P(X \le 3) = \sum_{x_i \le 3} p(x_i) = p(1) + p(2) + p(3) = 1$$

[Plot: CDF over the grade values]
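
A sketch of the same PMF and CDF computation for the grades data (helper names are ours):

```python
# Empirical PMF for the grades example, then the CDF as a cumulative sum.
grades = [3, 2, 3, 1, 2, 3, 1, 3, 2, 2]
values = sorted(set(grades))
pmf = {v: grades.count(v) / len(grades) for v in values}  # {1: 0.2, 2: 0.4, 3: 0.4}

def cdf(x):
    """P(X <= x): sum the PMF over all values not exceeding x."""
    return sum(p for v, p in pmf.items() if v <= x)

print(cdf(2))  # 0.6
```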

Page 35: 02 bayesian decision theory

Continuous Probability Distributions

The probability distribution of a continuous random variable is represented by an equation, called the probability density function (pdf).


Page 36: 02 bayesian decision theory

Probability Density Function

• The probability that a continuous random variable will assume a particular value is always zero

• The probability that a continuous random variable falls in the interval between a and b is equal to the area under the pdf curve between a and b:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

Page 37: 02 bayesian decision theory

Example

• Suppose that the random variable X is the diameter of a randomly chosen cylinder manufactured by the company

• It can take any value between 49.5 and 50.5

• It is a continuous random variable

Page 38: 02 bayesian decision theory

Example

• Suppose that the diameter of a metal cylinder has the p.d.f.

$$f(x) = 1.5 - 6(x - 50.0)^2 \quad \text{for } 49.5 \le x \le 50.5, \qquad f(x) = 0 \ \text{elsewhere}$$

[Plot: f(x) over 49.5 ≤ x ≤ 50.5]

Page 39: 02 bayesian decision theory

Example

The probability that a metal cylinder has a diameter between 49.8 and 50.1 mm can be calculated to be

$$\int_{49.8}^{50.1} \big(1.5 - 6(x - 50.0)^2\big)\,dx = \big[1.5x - 2(x - 50.0)^3\big]_{49.8}^{50.1}$$

$$= \big[1.5 \times 50.1 - 2(50.1 - 50.0)^3\big] - \big[1.5 \times 49.8 - 2(49.8 - 50.0)^3\big]$$

$$= 75.148 - 74.716 = 0.432$$

[Plot: f(x) with the area between 49.8 and 50.1 shaded]
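
A numerical cross-check of this integral, using a simple midpoint rule rather than any particular library:

```python
# Midpoint-rule check of the diameter example.
def f(x):
    return 1.5 - 6 * (x - 50.0) ** 2 if 49.5 <= x <= 50.5 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the area under f between a and b."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

print(integrate(f, 49.8, 50.1))  # ≈ 0.432
```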

Page 40: 02 bayesian decision theory

Cumulative Distribution Function

The Cumulative Distribution Function (CDF) is a function giving the probability that the random variable X is less than or equal to x.

Formally, the cumulative distribution function F(x) is defined to be

$$F(x) = P(X \le x), \qquad -\infty < x < \infty$$

Page 41: 02 bayesian decision theory

Cumulative Distribution Function

For a discrete random variable, the cumulative distribution function is found by summing up the probabilities:

$$F(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i) = \sum_{x_i \le x} p(x_i)$$

For a continuous random variable, the cumulative distribution function is the integral of its probability density function f(x):

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$

Page 42: 02 bayesian decision theory

Discrete vs. Continuous RVs

Discrete Random Variable
• Finite sample space, e.g. {0, 1, 2, 3}
• Probability Mass Function (PMF): $p(x_i) = P(X = x_i)$
  1. $p(x_i) \ge 0$ for all i
  2. $\sum_i p(x_i) = 1$
• CDF: $P(X \le x) = \sum_{x_i \le x} p(x_i)$

Continuous Random Variable
• Infinite sample space, e.g. [0, 1], [2.1, 5.3]
• Probability Density Function (PDF): $f(x)$
  1. $f(x) \ge 0$ for all x in $R_X$
  2. $\int_{R_X} f(x)\,dx = 1$
  3. $f(x) = 0$ if x is not in $R_X$
• CDF: $P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$
• $P(a \le X \le b) = \int_a^b f(x)\,dx$

Page 43: 02 bayesian decision theory

What is Bayesian decision theory?

• Mathematical foundation for decision making

• Uses a probabilistic approach to help make decisions (classification) so as to minimize the risk (cost)

• The decision problem is viewed in terms of probabilities, and it is assumed that all of the relevant probabilities are known

• It makes a compromise between the decisions and their costs and finds an optimal solution

Page 44: 02 bayesian decision theory

Fish sorting - revisited

How to design it?

Page 45: 02 bayesian decision theory

Notations

$\omega \in \{\omega_1, \omega_2, \ldots, \omega_c\}$ : a state of nature

$P(\omega_i)$ : prior probability

$\mathbf{x}$ : feature vector

$p(\mathbf{x} \mid \omega_i)$ : class-conditional density

$P(\omega_i \mid \mathbf{x})$ : posterior probability

Page 46: 02 bayesian decision theory

Decision rule

• In our example, the fish on the conveyer belt may be salmon or sea bass and we are not certain about it. Thus it can be described in terms of probability as the outcome of the experiment is not certain

• So, we assign a random variable ω to describe the type of fish

$\omega = \omega_1$ for sea bass
$\omega = \omega_2$ for salmon

State of nature

Page 47: 02 bayesian decision theory

Decision Rule based on Prior Information

• Prior probability: it is based on our knowledge without doing any experimentation

• For example: if fishermen catch as much sea bass as salmon in a season, then the two appear on the conveyer belt equiprobably:

$$P(\omega_1) = P(\omega_2) = 0.5 \quad \text{and} \quad P(\omega_1) + P(\omega_2) = 1$$

• In general, sea bass and salmon may appear with any non-zero probabilities:

$$P(\omega_1) > 0 \ \text{and} \ P(\omega_2) > 0, \quad \text{but still} \ P(\omega_1) + P(\omega_2) = 1$$

Page 48: 02 bayesian decision theory

Decision Rule based on Prior Information

• An optimal decision rule based on only the prior probabilities (without seeing the fish) is

$$\text{Decide } \omega_1 \text{ if } P(\omega_1) > P(\omega_2); \ \text{otherwise decide } \omega_2$$

• This rule makes sense if we wish to judge one fish, but it is odd if we are judging many fish: we would always make the same decision

• If $P(\omega_1) \gg P(\omega_2)$ then our decision will be right most of the time

• The probability of error is the minimum of the two priors:

$$P(\text{error}) = \min\big(P(\omega_1),\, P(\omega_2)\big)$$
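
A sketch of the prior-only rule in Python; the prior values here are hypothetical, not from the slides:

```python
# Prior-only decision rule: always pick the most probable class.
priors = {"sea bass": 0.6, "salmon": 0.4}  # hypothetical priors

decision = max(priors, key=priors.get)  # decide w1 if P(w1) > P(w2)
p_error = min(priors.values())          # P(error) = min(P(w1), P(w2))
print(decision, p_error)                # sea bass 0.4
```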

Page 49: 02 bayesian decision theory

Decision Rule based on Data

• Data can be images or features extracted (e.g. lightness of fish) from images

• In our case, different fish yield different lightness readings, and we express this variability in probabilistic terms

• It is expressed as a conditional probability, also called the class-conditional probability function:

$p(x \mid \omega_1)$ : the pdf of x given that the state of nature is $\omega_1$

$p(x \mid \omega_2)$ : the pdf of x given that the state of nature is $\omega_2$

Page 50: 02 bayesian decision theory

Decision Rule based on Data

Page 51: 02 bayesian decision theory

Decision Rule based on Data

• We can combine the new knowledge (the class-conditional probability function) with the prior knowledge to compute the posterior probability using Bayes’ rule:

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}, \qquad \text{where } p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\,P(\omega_i)$$

Page 52: 02 bayesian decision theory

Decision Rule based on Data

• Bayes’ rule can also be expressed as

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{p(\mathbf{x})}$$

• The evidence $p(\mathbf{x})$ can be considered a scale factor; it is unimportant in making the decision

Page 53: 02 bayesian decision theory

Decision Rule based on Data

• An optimal decision rule based on posterior probabilities is

Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise decide $\omega_2$

$$P(\omega_1 \mid x) = \frac{p(x \mid \omega_1)\,P(\omega_1)}{p(x)}, \qquad P(\omega_2 \mid x) = \frac{p(x \mid \omega_2)\,P(\omega_2)}{p(x)}$$

The decision rule becomes

$$\frac{p(x \mid \omega_1)\,P(\omega_1)}{p(x)} > \frac{p(x \mid \omega_2)\,P(\omega_2)}{p(x)}$$

$$p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2)$$

Page 54: 02 bayesian decision theory

Decision Rule based on Data

Decide $\omega_1$ if $p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2)$; otherwise decide $\omega_2$

Special cases:
1. $P(\omega_1) = P(\omega_2)$: decide $\omega_1$ if $p(x \mid \omega_1) > p(x \mid \omega_2)$; otherwise decide $\omega_2$
2. $p(x \mid \omega_1) = p(x \mid \omega_2)$: decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$
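
A sketch of this decision rule with assumed Gaussian class-conditional densities; the means, spreads, and priors below are made up for illustration (the slides do not specify a model):

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density, used as an assumed class-conditional model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Hypothetical lightness models for the two fish classes.
classes = {
    "sea bass": {"prior": 0.5, "mean": 4.0, "std": 1.0},
    "salmon":   {"prior": 0.5, "mean": 7.0, "std": 1.2},
}

def decide(x):
    # Pick the class with the largest p(x|w)P(w); the evidence p(x) is a
    # common scale factor and can be ignored.
    return max(classes,
               key=lambda c: gaussian_pdf(x, classes[c]["mean"],
                                          classes[c]["std"]) * classes[c]["prior"])

print(decide(5.0))  # "sea bass": 5.0 is much closer to its class mean
```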

Page 55: 02 bayesian decision theory

Decision Rule based on Data

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{p(\mathbf{x})}$$

Decide $\omega_i$ if $P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x})$ for all $j \ne i$

Decide $\omega_i$ if $p(\mathbf{x} \mid \omega_i)\,P(\omega_i) > p(\mathbf{x} \mid \omega_j)\,P(\omega_j)$ for all $j \ne i$

Special cases:
• $P(\omega_1) = P(\omega_2) = \cdots = P(\omega_c)$
• $p(\mathbf{x} \mid \omega_1) = p(\mathbf{x} \mid \omega_2) = \cdots = p(\mathbf{x} \mid \omega_c)$

Page 56: 02 bayesian decision theory

Example

$P(\omega_1) = P(\omega_2)$

[Plot: class-conditional densities with decision regions R2 and R1]

Special case 1 applies: decide $\omega_1$ if $p(x \mid \omega_1) > p(x \mid \omega_2)$; otherwise decide $\omega_2$

Page 57: 02 bayesian decision theory

Example

$P(\omega_1) = 2/3, \quad P(\omega_2) = 1/3$

Decide $\omega_1$ if $p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2)$; otherwise decide $\omega_2$

Page 58: 02 bayesian decision theory

Decision Rule based on Data

Page 59: 02 bayesian decision theory

Example

[Plot: posterior-weighted densities with decision regions R1 and R2]

$P(\omega_1) = 2/3, \quad P(\omega_2) = 1/3$

Decide $\omega_1$ if $p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2)$; otherwise decide $\omega_2$

Page 60: 02 bayesian decision theory

Classification error

The decision rule

$$\text{Decide } \omega_1 \text{ if } P(\omega_1 \mid x) > P(\omega_2 \mid x); \ \text{otherwise decide } \omega_2$$

has error probability

$$p(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$$

so the probability of error is the minimum of the two posteriors:

$$p(\text{error} \mid x) = \min\big(P(\omega_1 \mid x),\, P(\omega_2 \mid x)\big)$$

Page 61: 02 bayesian decision theory

Confusion Matrix

• To evaluate the classifier, we make a table of the following type:

                                Predicted Class
                                Sea bass    Salmon
Ground truth    Sea bass        N1          N2
(state of       Salmon          N3          N4
nature)

$$N_{\text{seabass}} = N_1 + N_2, \qquad N_{\text{salmon}} = N_3 + N_4, \qquad N = N_{\text{seabass}} + N_{\text{salmon}}$$

$$\text{Classification rate} = \frac{\text{Correct classifications}}{\text{Total number of examples}} = \frac{N_1 + N_4}{N}$$

$$\text{Classification error} = \frac{\text{Incorrect classifications}}{\text{Total number of examples}} = \frac{N_2 + N_3}{N}$$
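
A sketch computing both metrics from hypothetical counts:

```python
# Confusion-matrix metrics (counts are made up for illustration).
n1, n2 = 40, 10   # true sea bass: correctly / incorrectly classified
n3, n4 = 5, 45    # true salmon: incorrectly / correctly classified

total = n1 + n2 + n3 + n4
classification_rate = (n1 + n4) / total   # correct predictions on the diagonal
classification_error = (n2 + n3) / total  # off-diagonal mistakes
print(classification_rate, classification_error)  # 0.85 0.15
```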

Page 62: 02 bayesian decision theory

Bayesian Decision theory

• The previous discussion can be extended to multidimensional features and multiclass problems

• Mathematical formulation:

$\mathbf{x} \in \mathbb{R}^d$ : d-dimensional feature space

$\omega_1, \omega_2, \ldots, \omega_c$ : the c states of nature

$\alpha_1, \alpha_2, \ldots, \alpha_a$ : the a associated actions

Page 63: 02 bayesian decision theory

Bayesian Decision theory

• Bayes’ rule becomes

$$P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\,P(\omega_j)}{p(\mathbf{x})}, \qquad \text{where } p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\,P(\omega_j)$$

• The optimal decision rule becomes

$$\text{Decide } \omega_j \text{ if } P(\omega_j \mid \mathbf{x}) > P(\omega_i \mid \mathbf{x}) \ \text{for all } i \ne j$$

Page 64: 02 bayesian decision theory

Conditional risk

• A loss function states exactly how costly each action is, and can be expressed as

$\lambda(\alpha_i \mid \omega_j)$ : the loss incurred for taking action $\alpha_i$ when the true state of nature is $\omega_j$

• We want to minimize the expected loss (the risk) in making a decision

Page 65: 02 bayesian decision theory

Conditional risk

Given $\mathbf{x}$, the expected loss (risk) associated with taking action $\alpha_i$ is

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid \mathbf{x})$$

Page 66: 02 bayesian decision theory

Loss Function

• How to define it?
• Example: the 0-1 loss is defined as

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \ne j \end{cases} \qquad i, j = 1, 2, \ldots, c$$

i.e. zero loss for a correct decision (deciding $\alpha_i$ when the true state is $\omega_i$) and unit loss otherwise.

Page 67: 02 bayesian decision theory

Loss Function

The conditional risk based on the 0-1 loss is

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid \mathbf{x}) = \sum_{j \ne i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})$$

Bayesian Decision Rule:

$$\alpha(\mathbf{x}) = \arg\min_i R(\alpha_i \mid \mathbf{x})$$
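
A sketch of the minimum-risk rule for a general loss matrix; with the 0-1 loss used here it reduces to picking the class with the maximum posterior:

```python
# Rows of `loss` are actions, columns are states of nature:
# loss[i][j] = λ(α_i | ω_j). Here: 0-1 loss for 3 classes.
loss = [[0, 1, 1],
        [1, 0, 1],
        [1, 1, 0]]

def bayes_action(posteriors):
    """Return the action index minimizing R(α_i|x) = Σ_j λ(α_i|ω_j) P(ω_j|x)."""
    risks = [sum(l * p for l, p in zip(row, posteriors)) for row in loss]
    return min(range(len(risks)), key=risks.__getitem__)

print(bayes_action([0.2, 0.5, 0.3]))  # 1: the class with the largest posterior
```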

Page 68: 02 bayesian decision theory

Two category classification

• Mathematical formulation:
– $\alpha_1$ = action corresponding to deciding that the true state of nature is $\omega_1$
– $\alpha_2$ = action corresponding to deciding that the true state of nature is $\omega_2$
– $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$ = loss incurred for deciding $\alpha_i$ when the true state of nature is $\omega_j$

Page 69: 02 bayesian decision theory

Two category classification

$\omega \in \{\omega_1, \omega_2\}$, $\alpha \in \{\alpha_1, \alpha_2\}$

Loss Function:

                  State of Nature
Action            $\omega_1$        $\omega_2$
$\alpha_1$        $\lambda_{11}$    $\lambda_{12}$
$\alpha_2$        $\lambda_{21}$    $\lambda_{22}$

$$R(\alpha_1 \mid \mathbf{x}) = \lambda_{11}\,P(\omega_1 \mid \mathbf{x}) + \lambda_{12}\,P(\omega_2 \mid \mathbf{x})$$

$$R(\alpha_2 \mid \mathbf{x}) = \lambda_{21}\,P(\omega_1 \mid \mathbf{x}) + \lambda_{22}\,P(\omega_2 \mid \mathbf{x})$$

Page 70: 02 bayesian decision theory

Two category classification

• The optimal decision rule can be described as

$$\text{Decide } \omega_1 \text{ if } R(\alpha_1 \mid \mathbf{x}) < R(\alpha_2 \mid \mathbf{x}); \ \text{otherwise decide } \omega_2$$

• In terms of posterior probabilities:

$$\text{Decide } \omega_1 \text{ if } (\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid \mathbf{x}) > (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid \mathbf{x})$$

or

$$\text{Decide } \omega_1 \text{ if } (\lambda_{21} - \lambda_{11})\,p(\mathbf{x} \mid \omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(\mathbf{x} \mid \omega_2)\,P(\omega_2)$$

Page 71: 02 bayesian decision theory

Two category classification

• Alternatively,

$$\text{Decide } \omega_1 \text{ if } \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}$$

The left-hand side is called the Likelihood Ratio.
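
A sketch of the likelihood-ratio test with hypothetical losses and priors (all names and values are ours):

```python
# Likelihood-ratio test for the two-category case.
lam = [[0.0, 2.0],   # λ11, λ12: losses for action α1 under ω1, ω2
       [1.0, 0.0]]   # λ21, λ22: losses for action α2 under ω1, ω2
priors = [0.5, 0.5]

def decide(px_w1, px_w2):
    """Decide ω1 when the likelihood ratio exceeds the loss-weighted threshold."""
    threshold = ((lam[0][1] - lam[1][1]) * priors[1] /
                 ((lam[1][0] - lam[0][0]) * priors[0]))
    return "ω1" if px_w1 / px_w2 > threshold else "ω2"

print(decide(0.3, 0.1))  # ratio 3.0 > threshold 2.0, so ω1
```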

Page 72: 02 bayesian decision theory

Decision boundaries