Page 1.

2. Bayes Decision Theory

Prof. A.L. Yuille

Stat 231. Fall 2004.

Page 2.

Decisions with Uncertainty

• Bayes Decision Theory is a theory for how to make decisions in the presence of uncertainty.

• Input data x.

• Salmon: y = +1; Sea Bass: y = -1.

• Learn a decision rule f(x) taking values in {+1, -1}.

Page 3.

Decision Rule for Fish.

• Classify fish as Salmon or Sea Bass by decision rule f(x).

Page 4.

Basic Ingredients.

• Assume there are probability distributions for generating the data.

• P(x|y=1) and P(x|y=-1).

• Loss function L(f(x),y) specifies the loss of making decision f(x) when true state is y.

• Distribution P(y): the prior probability on y.

• Joint Distribution P(x,y) = P(x|y) P(y).
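To make these ingredients concrete, here is a minimal Python sketch of the fish problem; all numerical values (means, spreads, priors) are invented for illustration, not taken from the lecture.

```python
from scipy.stats import norm

# Toy model of the fish problem (all numbers invented for illustration):
# x is fish length; suppose salmon (y=+1) run shorter than sea bass (y=-1).
p_x_given_y = {+1: norm(loc=40.0, scale=5.0),   # P(x | y=+1)
               -1: norm(loc=60.0, scale=8.0)}   # P(x | y=-1)
prior = {+1: 0.6, -1: 0.4}                      # P(y)

def loss(decision, y):
    """Loss L(f(x), y): here 1 for a misclassification, 0 otherwise."""
    return 0.0 if decision == y else 1.0

def joint(x, y):
    """Joint distribution P(x, y) = P(x|y) P(y)."""
    return p_x_given_y[y].pdf(x) * prior[y]
```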

Page 5.

Minimize the Risk

• The risk of a decision rule f(x) is the expected loss:
R(f) = Σ_y ∫ L(f(x), y) P(x, y) dx.

• The Bayes Decision Rule f*(x) is the rule that minimizes the risk:
f* = arg min_f R(f).

• The Bayes Risk is the minimal risk R(f*): no decision rule can do better.

Page 6.

Minimize the Risk.

• Write P(x,y) = P(y|x) P(x).

• Then we can write the Risk as:
R(f) = ∫ P(x) Σ_y L(f(x), y) P(y|x) dx.

• The best decision for input x minimizes the expected loss under the posterior:
f*(x) = arg min_a Σ_y L(a, y) P(y|x).
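A sketch of this pointwise minimization, continuing the invented fish model above (the numbers are illustrative, not from the lecture):

```python
from scipy.stats import norm

# Continuing the invented fish model: compute the posterior P(y|x), then the
# pointwise Bayes rule f*(x) = argmin_a sum_y L(a, y) P(y|x).
densities = {+1: norm(40.0, 5.0), -1: norm(60.0, 8.0)}  # P(x|y), invented
prior = {+1: 0.6, -1: 0.4}                              # P(y), invented
loss = lambda a, y: 0.0 if a == y else 1.0              # 0-1 loss

def posterior(x):
    unnorm = {y: densities[y].pdf(x) * prior[y] for y in (+1, -1)}
    z = sum(unnorm.values())                            # the evidence P(x)
    return {y: p / z for y, p in unnorm.items()}

def bayes_rule(x):
    post = posterior(x)
    # Expected loss of each candidate decision a, given this x:
    exp_loss = {a: sum(loss(a, y) * post[y] for y in (+1, -1)) for a in (+1, -1)}
    return min(exp_loss, key=exp_loss.get)

print(bayes_rule(45.0), bayes_rule(58.0))   # short fish -> salmon, long -> sea bass
```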

Page 7.

Bayes Rule.

• Posterior distribution P(y|x), obtained from Bayes Rule (a small numeric example follows below):
P(y|x) = P(x|y) P(y) / P(x), where P(x) = Σ_y P(x|y) P(y).

• Likelihood function P(x|y).

• Prior P(y).

• Bayes Rule has been controversial (historically) because of the Prior P(y) (subjective?).

• But in Bayes Decision Theory, everything starts from the joint distribution P(x,y).
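A small numeric illustration of Bayes Rule with made-up likelihood and prior values:

```python
# Numeric illustration of Bayes Rule (all numbers invented): suppose that at
# the observed x we have P(x|y=+1) = 0.05, P(x|y=-1) = 0.01, and P(y=+1) = 0.6.
lik = {+1: 0.05, -1: 0.01}
prior = {+1: 0.6, -1: 0.4}
evidence = sum(lik[y] * prior[y] for y in (+1, -1))        # P(x) = 0.034
post = {y: lik[y] * prior[y] / evidence for y in (+1, -1)}
print(post)   # {1: 0.882..., -1: 0.117...}
```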

Page 8.

Risk.

• The Risk is based on averaging over all possible x & y. Average Loss.

• Alternatively, can try to minimize the worst risk over x & y. Minimax Criterion.

• This course uses the Risk, or average loss.

Page 9.

Generative & Discriminative.

• Generative methods aim to determine probability models P(x|y) & P(y).

• Discriminative methods aim directly at estimating the decision rule f(x).

• Vapnik argues for Discriminative Methods: don't solve a harder problem than you need to; only the probabilities near the decision boundary matter.

Page 10.

Discriminant Functions.

• For the two-category case, the Bayes decision rule depends on the discriminant function, the log posterior ratio:
g(x) = log [ P(y=1|x) / P(y=-1|x) ].

• The Bayes decision rule is of the form:
f(x) = +1 if g(x) > T, and f(x) = -1 otherwise.

• Where T is a threshold, which is determined by the loss function (see the sketch below).
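A hedged sketch of this thresholded discriminant; the class models and the two error costs are invented for illustration:

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch (class models and losses invented): the discriminant is the
# log posterior ratio g(x), thresholded at a T fixed by the two error costs.
p_pos, p_neg = norm(40.0, 5.0), norm(60.0, 8.0)   # P(x|y=+1), P(x|y=-1)
prior_pos, prior_neg = 0.6, 0.4
lam_fp, lam_fn = 1.0, 2.0    # cost of a false positive / a false negative

def g(x):   # log [P(y=+1|x) / P(y=-1|x)]; the evidence P(x) cancels
    return (p_pos.logpdf(x) + np.log(prior_pos)) - (p_neg.logpdf(x) + np.log(prior_neg))

# Deciding +1 iff lam_fn * P(+1|x) > lam_fp * P(-1|x) gives the threshold:
T = np.log(lam_fp / lam_fn)

def f(x):
    return +1 if g(x) > T else -1

print(f(45.0), f(58.0))
```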

Page 11.

Two-State Case

• Detect “target” or “non-target”.

• Let loss function pay a penalty of 1 for misclassification, 0 otherwise.

• Risk becomes Error. Bayes Risk becomes Bayes Error.

• The Error is the sum of false positives F+ (non-targets classified as targets) and false negatives F- (targets classified as non-targets), as illustrated below.
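A numerical illustration: on the invented 1-D fish model, under 0-1 loss, the Bayes error can be approximated by integrating the two mistake regions on a grid.

```python
import numpy as np
from scipy.stats import norm

# Error = F+ + F-, illustrated numerically on the invented 1-D fish model
# with 0-1 loss (so the Bayes rule picks the larger prior-weighted density).
p_pos, p_neg = norm(40.0, 5.0), norm(60.0, 8.0)   # P(x|y=+1), P(x|y=-1)
prior_pos, prior_neg = 0.6, 0.4

xs = np.linspace(0.0, 100.0, 100001)
decide_pos = p_pos.pdf(xs) * prior_pos > p_neg.pdf(xs) * prior_neg
dx = xs[1] - xs[0]

# F+ : non-targets (y=-1) that fall in the "target" region; F- : the converse.
F_plus = np.sum(p_neg.pdf(xs)[decide_pos]) * dx * prior_neg
F_minus = np.sum(p_pos.pdf(xs)[~decide_pos]) * dx * prior_pos
print(F_plus + F_minus)   # the Bayes error for this toy model
```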

Page 12.

Gaussian Example: 1

• Is a bright light flashing?

• n is the number of photons emitted by the dim or the bright light.

Page 13.

Gaussian Example: 2

• P(n|dim) and P(n|bright) are Gaussians with means μ_dim, μ_bright and s.d. σ_dim, σ_bright.

• The Bayes decision rule selects "dim" if P(n|dim) P(dim) > P(n|bright) P(bright).

• Errors: the tail of each Gaussian that falls in the other decision region (see the sketch below).
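A sketch with invented parameters, taking equal priors and equal standard deviations so that the decision threshold is simply the midpoint of the two means:

```python
from scipy.stats import norm

# Sketch of the photon-count example (all parameters invented): P(n|dim) and
# P(n|bright) Gaussian with means mu_d < mu_b and, for simplicity, equal s.d.
mu_d, mu_b, sigma = 50.0, 70.0, 10.0
prior_d = prior_b = 0.5

# With equal priors, 0-1 loss and equal s.d., the Bayes rule thresholds n at
# the midpoint of the two means: select "dim" if n < n_star.
n_star = (mu_d + mu_b) / 2.0

# Errors are the Gaussian tail probabilities on the wrong side of n_star:
F_dim_called_bright = prior_d * norm(mu_d, sigma).sf(n_star)   # dim tail above n*
F_bright_called_dim = prior_b * norm(mu_b, sigma).cdf(n_star)  # bright tail below n*
print(F_dim_called_bright + F_bright_called_dim)
```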

Page 14.

Example: Multidimensional Gaussian Distributions.

• Suppose the two classes have Gaussian distributions for P(x|y).

• Different means μ_1, μ_2 but the same covariance Σ.

• The discriminant function is then linear in x, so the decision boundary is a plane:
g(x) = (μ_1 - μ_2)·Σ^{-1} x + constant (see the sketch below).

• Alternatively, seek a planar decision rule without attempting to model the distributions.

• Only care about the data near the decision boundary.
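A minimal sketch of the equal-covariance case, with all parameters invented: the quadratic terms cancel, leaving a planar discriminant.

```python
import numpy as np

# Sketch: equal-covariance Gaussian classes give a linear (planar) discriminant
#   g(x) = w.x + b,  with  w = Sigma^{-1}(mu1 - mu2)   (parameters invented).
mu1 = np.array([40.0, 5.0])      # class y=+1 mean (e.g. length, brightness)
mu2 = np.array([60.0, 8.0])      # class y=-1 mean
Sigma = np.array([[25.0, 3.0],
                  [3.0, 1.0]])   # shared covariance

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
# b collects the quadratic terms, which reduce to a constant when the
# covariances match (equal priors assumed here):
b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu2 @ Sigma_inv @ mu2)

def g(x):
    return w @ x + b   # decide y=+1 if g(x) > 0

print(g(np.array([45.0, 6.0])))
```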

Page 15.

Generative vs. Discriminative.

• The Generative approach will attempt to estimate the Gaussian distributions from data – and then derive the decision rule.

• The Discriminative approach will seek to estimate the decision rule directly, by learning the discriminant plane.

• In practice, we will not know the form of the distributions or the form of the discriminant.

Page 16.

Gaussian.

• Gaussian case with unequal covariances: the discriminant is then quadratic in x, so the decision boundary is a curved (conic) surface rather than a plane.

Page 17.

Discriminative Models & Features.

• In practice, Discriminative methods are usually defined in terms of features extracted from the data (e.g. the length and brightness of a fish).

• Calculate features z=h(x).

• Bayes Decision Theory says that this throws away information.

• Restrict to a sub-class of possible decision rules: those that can be expressed in terms of features z = h(x) (see the sketch below).
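A hypothetical feature extractor in this spirit; the particular features (length and brightness of a fish image) and the crude way of computing them are invented for illustration:

```python
import numpy as np

# Hypothetical sketch of z = h(x): instead of classifying the raw image x,
# classify an extracted (length, brightness) pair, accepting that h may
# throw away information.
def h(x_image):
    """Invented feature extractor: returns z = (length, brightness)."""
    # Count image columns containing any fish pixels: a crude length proxy.
    length = float(np.count_nonzero(x_image.any(axis=0)))
    brightness = float(x_image.mean())
    return np.array([length, brightness])

# Any decision rule in the restricted sub-class is then f(x) = g(h(x))
# for some classifier g defined on the feature vector z.
```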

Page 18.

Bayes Decision Rule and Learning.

• Bayes Decision Theory assumes that we know, or can learn, the distributions P(x|y).

• This is often not practical, or extremely difficult.

• In real problems, you have a set of classified data.

• You can attempt to learn P(x|y=+1) & P(x|y=-1) from these (next few lectures).

• Parametric & Non-parametric approaches.

• Question: when do you have enough data to learn these probabilities accurately?

• Depends on the complexity of the model.

Page 19.

Machine Learning.

• Replace the Risk by the Empirical Risk on the N training samples (sketched below):
R_emp(f) = (1/N) Σ_{i=1}^{N} L(f(x_i), y_i).

• How does minimizing the empirical risk relate to minimizing the true risk?

• Key Issue: when can we generalize? That is, can we be confident that the decision rule we have learnt on the training data will yield good results on unseen data?
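A sketch of the empirical risk on synthetic data; the data generator and the simple threshold rule are invented for illustration.

```python
import numpy as np

# Empirical risk: replace the expectation over P(x,y) by an average over the
# N labeled training examples (data here is randomly generated to illustrate).
def empirical_risk(f, xs, ys, loss=lambda a, y: float(a != y)):
    return np.mean([loss(f(x), y) for x, y in zip(xs, ys)])

rng = np.random.default_rng(0)
ys = rng.choice([-1, +1], size=200)
xs = rng.normal(loc=np.where(ys == +1, 40.0, 60.0), scale=5.0)

f = lambda x: +1 if x < 50.0 else -1          # a simple thresholding rule
print(empirical_risk(f, xs, ys))              # fraction of training mistakes
```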

Page 20.

Machine Learning

• Vapnik’s theory gives a mathematically elegant way of addressing these questions.

• It assumes that the data is sampled from an unknown distribution.

• Vapnik’s theory gives bounds for when we can generalize.

• Unfortunately these bounds are very conservative.

• In practice, we train on part of the dataset and test on the other part(s), as sketched below.
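A sketch of that recipe on synthetic data (everything here is invented for illustration): fit a simple threshold on a training split, then estimate the risk on the held-out split.

```python
import numpy as np

# Hold out part of the data to estimate how well the learned rule generalizes
# (the data generator and the learning rule are invented for illustration).
rng = np.random.default_rng(1)
ys = rng.choice([-1, +1], size=400)
xs = rng.normal(np.where(ys == +1, 40.0, 60.0), 5.0)

perm = rng.permutation(len(xs))
train, test = perm[:300], perm[300:]

# "Learn" a threshold on the training part (midpoint of the two class means)...
t = 0.5 * (xs[train][ys[train] == +1].mean() + xs[train][ys[train] == -1].mean())
f = lambda x: +1 if x < t else -1

# ...and report the error rate on the held-out part:
test_err = np.mean([f(x) != y for x, y in zip(xs[test], ys[test])])
print(test_err)
```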

Page 21.

Extensions to Multiple Classes

The decision rule partitions the feature space Ω into k subspaces:

Ω = ∪_{i=1}^{k} Ω_i, with Ω_i ∩ Ω_j = ∅ for i ≠ j.

[Figure: a feature space divided into five numbered decision regions.]

Conceptually straightforward – see Duda, Hart & Stork.
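A sketch of the k-class rule with invented one-dimensional class models: under 0-1 loss the Bayes rule assigns x to the class with the largest posterior, and the resulting arg-max regions are exactly the partition above.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the k-class extension (class models invented): assign x to the
# class with the largest posterior, which partitions the space into k regions.
models = [norm(30.0, 5.0), norm(50.0, 5.0), norm(70.0, 5.0)]   # P(x|y=i)
priors = np.array([0.3, 0.4, 0.3])                             # P(y=i)

def f(x):
    post = np.array([m.pdf(x) for m in models]) * priors   # unnormalized P(y|x)
    return int(np.argmax(post))   # 0-1 loss => pick the largest posterior

print([f(x) for x in (25.0, 52.0, 80.0)])   # falls in regions 0, 1, 2
```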