Pattern Recognition, Chapter 2: Bayesian Decision Theory


1 Pattern Recognition, Chapter 2: Bayesian Decision Theory

2 Bayesian Decision Theory
– Fundamental statistical approach to the problem of classification.
– Quantifies the tradeoffs between various classification decisions using probabilities and the costs associated with such decisions.
– Each action is associated with a cost or risk; the simplest risk is the classification error.
– Design classifiers to recommend actions that minimize some total expected risk.

3 Terminology (using the sea bass / salmon classification example)
– State of nature ω (random variable): ω1 for sea bass, ω2 for salmon.
– Priors P(ω1) and P(ω2): prior knowledge of how likely it is to get a sea bass or a salmon.
– Probability density function p(x) (evidence): how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement). Note: if x and y are different measurements, p(x) and p(y) correspond to different pdfs, pX(x) and pY(y).

4 Terminology (cont'd)
– Class-conditional probability density p(x/ωj) (likelihood): how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj (e.g., the lightness distributions of the salmon and sea-bass populations).

5 Terminology (cont'd)
– Conditional probability P(ωj/x) (posterior): the probability that the fish belongs to class ωj given the measurement x.
– Note: an uppercase P(.) denotes a probability mass function (pmf); a lowercase p(.) denotes a probability density function (pdf).
6 Decision Rule Using Priors Only
– Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
– P(error) = min[P(ω1), P(ω2)]
– Favors the most likely class (optimal if no other information is available).
– This rule would make the same decision every time, so it only makes sense for judging a single fish.

7 Decision Rule Using Conditional pdfs
– Using Bayes' rule, the posterior probability of category ωj given measurement x is:
  P(ωj/x) = p(x/ωj) P(ωj) / p(x)
  where p(x) = p(x/ω1)P(ω1) + p(x/ω2)P(ω2) is a scale factor that makes the posteriors sum to 1.
– Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2. Equivalently, decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2.

8 Decision Rule Using Conditional pdfs (cont'd)

9 Probability of Error
– The probability of error is defined as:
  P(error/x) = P(ω1/x) if we decide ω2, P(ω2/x) if we decide ω1.
– The average probability of error is given by:
  P(error) = ∫ P(error/x) p(x) dx
– The Bayes rule is optimal, i.e., it minimizes the average probability of error, since for every x:
  P(error/x) = min[P(ω1/x), P(ω2/x)]

10 Where Do Probabilities Come From?
– The Bayesian rule is optimal only if the pmf or pdf is known. There are two competing answers to the question of where the probabilities come from:
  (1) The relative-frequency (objective) approach: probabilities can only come from experiments.
  (2) The Bayesian (subjective) approach: probabilities may reflect degrees of belief and can be based on opinion as well as experiments.
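The decision rule on slide 7 can be sketched in a few lines of Python. The lightness distributions and priors below are hypothetical stand-ins for the sea bass / salmon example, not numbers from the slides:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x, priors, likelihoods):
    """Decide omega_1 if p(x/w1)P(w1) > p(x/w2)P(w2); also return posteriors."""
    scores = [lik(x) * p for lik, p in zip(likelihoods, priors)]
    evidence = sum(scores)                       # p(x), the scale factor
    posteriors = [s / evidence for s in scores]
    return scores.index(max(scores)), posteriors

# Hypothetical lightness models: sea bass (omega_1), salmon (omega_2)
priors = [0.6, 0.4]                              # P(w1), P(w2)
likelihoods = [lambda x: gaussian_pdf(x, 4.0, 1.0),
               lambda x: gaussian_pdf(x, 7.0, 1.5)]

label, post = classify(5.0, priors, likelihoods)
# P(error/x) = min(post), so this rule is optimal for every x
```

Note that the evidence p(x) only rescales both scores, so the decision itself needs just the products of likelihood and prior.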
11 Example
– Classify cars on the UNR campus as costing more or less than $50K:
  C1: price > $50K; C2: price < $50K.
– Feature x: height of the car.
– From Bayes' rule we know how to compute the posterior probabilities:
  P(Ci/x) = p(x/Ci) P(Ci) / p(x)
– We need to estimate p(x/C1), p(x/C2), P(C1), and P(C2).

12 Example (cont'd)
– Determine the prior probabilities by collecting data: ask drivers how much their car cost and measure its height.
– E.g., from 1209 samples with #C1 = 221 and #C2 = 988:
  P(C1) = 221/1209 ≈ 0.18, P(C2) = 988/1209 ≈ 0.82

13 Example (cont'd)
– Determine the class-conditional probabilities (likelihoods): discretize the car height into bins and use a normalized histogram.

14 Example (cont'd)
– Calculate the posterior probability for each bin using Bayes' rule.

15 A More General Theory
– Use more than one feature.
– Allow more than two categories.
– Allow actions other than classifying the input into one of the possible categories (e.g., rejection).
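The car example on slides 12-14 can be sketched with a normalized histogram. Only the totals (221 expensive cars and 988 cheaper ones out of 1209 samples) come from the slides; the bin labels and per-bin counts below are made up for illustration:

```python
# Counts per height bin (hypothetical; only the totals 221/988 match the slides)
bins = ["<1.4m", "1.4-1.6m", ">1.6m"]
counts_c1 = [30, 70, 121]     # expensive cars (C1), total 221
counts_c2 = [400, 488, 100]   # cheaper cars (C2), total 988

n1, n2 = sum(counts_c1), sum(counts_c2)
p_c1, p_c2 = n1 / (n1 + n2), n2 / (n1 + n2)    # priors 221/1209 and 988/1209

for b, k1, k2 in zip(bins, counts_c1, counts_c2):
    lik1, lik2 = k1 / n1, k2 / n2              # normalized-histogram likelihoods
    evidence = lik1 * p_c1 + lik2 * p_c2       # p(x in bin)
    post1 = lik1 * p_c1 / evidence             # P(C1 / x in bin)
    print(b, round(post1, 3))
```

With priors estimated from the same sample, the posterior for a bin reduces to the fraction of that bin's cars that are expensive, k1/(k1+k2).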
– Introduce a more general error function: a loss function, i.e., associate costs with actions.

16 Terminology
– Features form a vector x.
– A finite set of c categories ω1, ω2, ..., ωc.
– A finite set of l actions α1, α2, ..., αl.
– A loss function λ(αi/ωj): the loss incurred for taking action αi when the true category is ωj.
– Bayes' rule using vector notation:
  P(ωj/x) = p(x/ωj) P(ωj) / p(x)

17 Expected Loss (Conditional Risk)
– The expected loss (conditional risk) of taking action αi is:
  R(αi/x) = Σj λ(αi/ωj) P(ωj/x)
– The expected loss can be minimized by selecting the action that minimizes the conditional risk.

18 Overall Risk
– The overall risk is:
  R = ∫ R(α(x)/x) p(x) dx
  where the decision rule α(x) determines which action α1, α2, ..., αl to take for every x.
– To minimize R, find a decision rule α(x) that chooses the action with the minimum conditional risk R(αi/x) for every x. This rule yields optimal performance.

19 Bayes Decision Rule
– The Bayes decision rule minimizes R by computing R(αi/x) for every αi given an x and choosing the action αi with the minimum R(αi/x).
– The Bayes risk (the resulting minimum) is the best performance that can be achieved.
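The conditional-risk minimization of slides 17-19 reduces to a few lines once the posteriors are known. The loss matrix below, including a third "reject" action, is hypothetical:

```python
def bayes_action(posteriors, loss):
    """Pick the action a_i minimizing R(a_i/x) = sum_j loss[i][j] * P(w_j/x)."""
    risks = [sum(l * p for l, p in zip(row, posteriors)) for row in loss]
    best = min(range(len(risks)), key=lambda i: risks[i])
    return best, risks

# Hypothetical loss matrix: rows are actions (decide w1, decide w2, reject)
loss = [[0.0, 2.0],    # calling w1 when the truth is w2 costs 2
        [1.0, 0.0],    # calling w2 when the truth is w1 costs 1
        [0.25, 0.25]]  # rejecting always costs 0.25

action, risks = bayes_action([0.8, 0.2], loss)   # rejection wins here
```

When the posterior is not decisive, the cheap reject action has the lowest conditional risk; with a very confident posterior, one of the classification actions wins.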
20 Example: Two-Category Classification
– Two possible actions: α1 corresponds to deciding ω1, α2 to deciding ω2.
– Notation: λij = λ(αi/ωj).
– The conditional risks are:
  R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
  R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)

21 Example: Two-Category Classification (cont'd)
– Decision rule: decide ω1 if
  (λ21 − λ11) P(ω1/x) > (λ12 − λ22) P(ω2/x)
  or, using the likelihood ratio, decide ω1 if
  p(x/ω1) / p(x/ω2) > [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1)
  otherwise decide ω2.

22 Special Case: Zero-One Loss Function
– It assigns no loss to correct decisions and the same loss to all errors:
  λ(αi/ωj) = 0 if i = j, 1 otherwise.
– The conditional risk corresponding to this loss function is:
  R(αi/x) = Σ(j≠i) P(ωj/x) = 1 − P(ωi/x)

23 Special Case: Zero-One Loss Function (cont'd)
– The decision rule becomes: decide ω1 if P(ω1/x) > P(ω2/x), or, using the likelihood ratio, decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1).
– The overall risk in this case is the average probability of error.

24 Example
– Threshold θa was determined assuming the zero-one loss function; θb was determined assuming an asymmetric loss, λ12 ≠ λ21 (decision regions).

25 Discriminant Functions
– Functional structure of a general statistical classifier: compute the discriminant functions g1(x), ..., gc(x) and pick the maximum.
– Assign x to ωi if gi(x) > gj(x) for all j ≠ i.

26 Discriminants for the Bayes Classifier
– Using risks: gj(x) = −R(αj/x).
– Using the zero-one loss function (i.e., minimum error rate): gj(x) = P(ωj/x).
– The choice of gj is not unique: replacing gj(x) with f(gj(x)), where f(.) is monotonically increasing, does not change the classification results.

27 Decision Regions and Boundaries
– Decision rules divide the feature space into decision regions R1, R2, ..., Rc.
– The boundaries of the decision regions are the decision boundaries.
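The likelihood-ratio rule of slide 21 can be sketched directly. The likelihood values, priors, and loss matrices below are illustrative, not from the slides:

```python
def likelihood_ratio_decide(px_w1, px_w2, priors, loss):
    """Decide w1 iff p(x/w1)/p(x/w2) > [(l12 - l22)/(l21 - l11)] * P(w2)/P(w1)."""
    (l11, l12), (l21, l22) = loss
    theta = (l12 - l22) / (l21 - l11) * priors[1] / priors[0]
    return 1 if px_w1 / px_w2 > theta else 2

# With the zero-one loss the threshold reduces to P(w2)/P(w1)
zero_one = [[0.0, 1.0], [1.0, 0.0]]
decision = likelihood_ratio_decide(0.3, 0.2, [0.5, 0.5], zero_one)
```

Changing the priors (or the losses) only moves the threshold theta; the likelihood ratio itself is fixed by the class-conditional densities.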
– g1(x) = g2(x) at the decision boundaries.

28 Case of Two Categories
– It is more common to use a single discriminant function (a dichotomizer) instead of two:
  g(x) = g1(x) − g2(x), and decide ω1 if g(x) > 0.
– Examples of dichotomizers:
  g(x) = P(ω1/x) − P(ω2/x)
  g(x) = ln [p(x/ω1) / p(x/ω2)] + ln [P(ω1) / P(ω2)]

29 Discriminant Functions for the Multivariate Gaussian
– Assume p(x/ωi) ~ N(μi, Σi) and the discriminant function gi(x) = ln p(x/ωi) + ln P(ωi), which gives:
  gi(x) = −(1/2)(x − μi)ᵀ Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)

30 Multivariate Gaussian Density: Case I
– Assumption: Σi = σ²I (features are statistically independent and each feature has the same variance). The discriminant reduces to:
  gi(x) = −||x − μi||² / (2σ²) + ln P(ωi)
  where the second term favors the a priori more likely category.

31 Multivariate Gaussian Density: Case I (cont'd)
– Expanding gives a linear discriminant gi(x) = wiᵀx + wi0, with wi = μi/σ² and threshold (bias) wi0 = −μiᵀμi/(2σ²) + ln P(ωi).

32 Multivariate Gaussian Density: Case I (cont'd)
– Comments about this hyperplane:
  – It passes through x0 and is orthogonal to the line linking the means.
  – If P(ωi) = P(ωj), x0 is the midpoint between the means; if P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean.
  – If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).

33 Multivariate Gaussian Density: Case I (cont'd)

34 Multivariate Gaussian Density: Case I (cont'd)
– Minimum-distance classifier: when P(ωi) is the same for each of the c classes, assign x to the class with the nearest mean (Euclidean distance).

35 Multivariate Gaussian Density: Case II
– Assumption: Σi = Σ (all classes share the same, arbitrary, covariance matrix):
  gi(x) = −(1/2)(x − μi)ᵀ Σ⁻¹ (x − μi) + ln P(ωi)

36 Multivariate Gaussian Density: Case II (cont'd)
– Comments about this hyperplane:
  – It passes through x0 but is NOT necessarily orthogonal to the line linking the means.
  – If P(ωi) = P(ωj), x0 is the midpoint between the means; if P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean.
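The minimum-distance classifier of slide 34 (Case I with equal priors) is simple to sketch; the 2-D class means below are hypothetical:

```python
def min_distance_classify(x, means):
    """Case I with equal priors: assign x to the class with the nearest mean."""
    def sq_dist(a, b):
        # squared Euclidean distance ||a - b||^2
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    dists = [sq_dist(x, mu) for mu in means]
    return dists.index(min(dists))

# Hypothetical 2-D class means for three classes
means = [(0.0, 0.0), (3.0, 3.0), (0.0, 5.0)]
label = min_distance_classify((2.5, 2.0), means)
```

Because ln P(ωi) is the same for every class, maximizing gi(x) is the same as minimizing ||x − μi||², so no square root is needed.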
37 Multivariate Gaussian Density: Case II (cont'd)

38 Multivariate Gaussian Density: Case II (cont'd)
– Mahalanobis-distance classifier: when P(ωi) is the same for each of the c classes, assign x to the class whose mean minimizes the Mahalanobis distance (x − μi)ᵀ Σ⁻¹ (x − μi).

39 Multivariate Gaussian Density: Case III
– Assumption: Σi = arbitrary. The discriminants are quadratic, so the decision boundaries are hyperquadrics: e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.

40 Multivariate Gaussian Density: Case III (cont'd)
– Disconnected decision regions can arise (example with P(ω1) = P(ω2)).

41 Multivariate Gaussian Density: Case III (cont'd)
– Disconnected decision regions and non-linear decision boundaries.

42 Multivariate Gaussian Density: Case III (cont'd)
– More examples (Σi arbitrary).

43 Multivariate Gaussian Density: Case III (cont'd)
– A four-category example.

44 Example: Case III
– With P(ω1) = P(ω2), the decision boundary need not pass through the midpoint of μ1 and μ2 when the covariances differ.

45 Missing Features
– Suppose x = (x1, x2) is a feature vector. What can we do when x1 is missing during classification?
– Maybe use the mean value of all x1 measurements? But this can pick the wrong class: the class-conditional likelihood at the observed x2 may be largest for a different category.

46 Missing Features (cont'd)
– Suppose x = [xg, xb] (xg: good features, xb: bad/missing features).
– Derive the Bayes rule using the good features by marginalizing over the bad ones:
  P(ωi/xg) = ∫ P(ωi/xg, xb) p(xg, xb) dxb / ∫ p(xg, xb) dxb

47 Noisy Features
– Suppose x = [xg, xb] (xg: good features, xb: noisy features).
– Suppose the noise is statistically independent and we know the noise model p(xb/xt), where xb are the observed feature values and xt the true feature values.
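The Mahalanobis-distance classifier of slide 38 can be sketched for the 2-D case. The shared (diagonal) covariance and the class means below are hypothetical, and the inverse covariance is hard-coded rather than computed:

```python
def mahalanobis_sq(x, mu, cov_inv):
    """(x - mu)^T Sigma^{-1} (x - mu) for 2-D vectors."""
    d = (x[0] - mu[0], x[1] - mu[1])
    return (d[0] * (cov_inv[0][0] * d[0] + cov_inv[0][1] * d[1])
            + d[1] * (cov_inv[1][0] * d[0] + cov_inv[1][1] * d[1]))

def mahalanobis_classify(x, means, cov_inv):
    """Case II with equal priors: smallest Mahalanobis distance wins."""
    dists = [mahalanobis_sq(x, mu, cov_inv) for mu in means]
    return dists.index(min(dists))

# Hypothetical shared covariance Sigma = diag(2, 0.5), so Sigma^{-1} = diag(0.5, 2)
cov_inv = [[0.5, 0.0], [0.0, 2.0]]
means = [(0.0, 0.0), (4.0, 0.0)]
label = mahalanobis_classify((3.5, 0.2), means, cov_inv)
```

With Σ = σ²I this reduces to the Euclidean minimum-distance classifier of Case I; a non-identity Σ stretches the distance along the low-variance directions.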
– Assume statistically independent noise: if xt were known, xb would be independent of xg and ωi. Then:
  P(ωi/xg, xb) = ∫ P(ωi/xg, xt) p(xg, xt) p(xb/xt) dxt / ∫ p(xg, xt) p(xb/xt) dxt

48 Noisy Features (cont'd)
– The derivation uses the independence assumption.
– When p(xb/xt) is uniform, the rule reduces to the missing-features case.

49 Receiver Operating Characteristic (ROC) Curve
– Every classifier employs some kind of threshold value, and changing the threshold affects the performance of the system.
– ROC curves can help us distinguish between discriminability and decision bias (i.e., the choice of threshold).

50 Example: Person Authentication
– Authenticate a person using biometrics (e.g., a face image).
– There are two possible distributions: authentic (A) and impostor (I). Depending on where the threshold falls, a decision is a correct acceptance, a false positive, a correct rejection, or a false rejection.

51 Example: Person Authentication (cont'd)
– Possible cases:
  (1) Correct acceptance (true positive): X belongs to A, and we decide A.
  (2) Incorrect acceptance (false positive): X belongs to I, and we decide A.
  (3) Correct rejection (true negative): X belongs to I, and we decide I.
  (4) Incorrect rejection (false negative): X belongs to A, and we decide I.

52 Error vs Threshold
– Moving the threshold x* trades one type of error against the other.

53 False Negatives vs Positives

54 Statistical Dependences Between Variables
– Many times, the only knowledge we have about a distribution is which variables are or are not dependent.
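One way to trace the ROC curve of slides 49-53 is to sweep the threshold over assumed Gaussian authentic/impostor score distributions; the N(2, 1) and N(0, 1) models below are made up for illustration:

```python
import math

def roc_point(threshold, mu_a, mu_i, sigma):
    """Accept when the score exceeds the threshold, with Gaussian score models
    N(mu_a, sigma^2) for authentics and N(mu_i, sigma^2) for impostors.
    Returns (false-positive rate, true-positive rate)."""
    def tail(mu):
        # P(score > threshold) under N(mu, sigma^2), via the complementary error function
        return 0.5 * math.erfc((threshold - mu) / (sigma * math.sqrt(2.0)))
    return tail(mu_i), tail(mu_a)

# Sweep the threshold from -3 to 5 to trace the curve
curve = [roc_point(t / 2.0, 2.0, 0.0, 1.0) for t in range(-6, 11)]
```

Each threshold choice (the decision bias) picks one point on the curve; the separation of the two means (the discriminability) fixes the shape of the curve itself.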
– Such dependencies can be represented graphically using a Bayesian belief network (or belief net).
– In essence, Bayesian nets allow us to represent a joint probability density p(x, y, z, ...) efficiently using dependency relationships.
– p(x, y, z, ...) could be either discrete or continuous.

55 Example of Dependencies
– State of an automobile: engine temperature, brake fluid pressure, tire air pressure, wire voltages, etc.
– Variables that are NOT causally related: engine oil pressure and tire air pressure.
– Causally related variables: coolant temperature and engine temperature.

56 Definitions and Notation
– A belief net is usually a directed acyclic graph (DAG).
– Each node represents one of the system variables.
– Each variable can assume certain values (i.e., states), and each state is associated with a probability (discrete or continuous).

57 Relationships Between Nodes
– A link joining two nodes is directional and represents a causal influence (e.g., X depends on A, or A influences X).
– Influences can be direct or indirect (e.g., A influences X directly and influences C indirectly through X).
58 Parent/Child Nodes
– Parent nodes P of X: the nodes before X (connected to X).
– Child nodes C of X: the nodes after X (X is connected to them).

59 Conditional Probability Tables
– Every node is associated with a set of weights representing the prior/conditional probabilities (e.g., P(xi/aj), i = 1, 2, j = 1, 2, 3, 4).
– The probabilities at each node sum to 1.

60 Learning
– There exist algorithms for learning these probabilities from data.

61 Computing Joint Probabilities
– We can compute the probability of any configuration of variables in the joint density distribution, e.g.:
  P(a3, b1, x2, c3, d2) = P(a3) P(b1) P(x2/a3, b1) P(c3/x2) P(d2/x2) = 0.25 × 0.6 × 0.4 × 0.5 × 0.4 = 0.012

62 Computing the Probability at a Node
– E.g., determine the probability at D by marginalizing over the states of its parents.

63 Computing the Probability at a Node (cont'd)
– E.g., determine the probability at H.

64 Computing Probability Given Evidence (Bayesian Inference)
– Determine the probability of some particular configuration of variables given the values of some other variables (the evidence), e.g., compute P(b1/a2, x1, c1).

65 Computing Probability Given Evidence (Bayesian Inference) (cont'd)
– In general, if X denotes the query variables and e denotes the evidence, then
  P(X/e) = α P(X, e)
  where α = 1/P(e) is a constant of proportionality.
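The factorized joint probability of slide 61 can be reproduced directly. The network structure (A and B are parents of X; C and D are children of X) follows from the factorization on the slide, and only the table entries that appear there are filled in; the rest of the CPTs are omitted:

```python
# CPT entries taken from the slide's example (all other entries omitted)
P_a = {"a3": 0.25}
P_b = {"b1": 0.6}
P_x_given_ab = {("x2", "a3", "b1"): 0.4}
P_c_given_x = {("c3", "x2"): 0.5}
P_d_given_x = {("d2", "x2"): 0.4}

def joint(a, b, x, c, d):
    """P(a, b, x, c, d) = P(a) P(b) P(x/a,b) P(c/x) P(d/x)."""
    return (P_a[a] * P_b[b] * P_x_given_ab[(x, a, b)]
            * P_c_given_x[(c, x)] * P_d_given_x[(d, x)])

value = joint("a3", "b1", "x2", "c3", "d2")   # matches the slide's 0.012
```

The factorization is exactly why belief nets are efficient: the full joint over five variables never has to be stored, only the per-node tables.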
66 An Example
– Classify a fish given only the evidence that the fish is light (c1) and was caught in the south Atlantic (b2); there is no evidence about the time of year the fish was caught or its thickness.

67 An Example (cont'd)

68 An Example (cont'd)

69 An Example (cont'd)
– Similarly, compute P(x2/c1, b2).
– Normalize the probabilities (not strictly necessary for classification):
  P(x1/c1, b2) + P(x2/c1, b2) = 1 (α = 1/0.18)
  P(x1/c1, b2) = 0.63, P(x2/c1, b2) = 0.27
– The larger posterior wins: classify the fish as salmon.

70 Another Example: Medical Diagnosis
– Uppermost nodes: biological agents (bacteria, virus); intermediate nodes: diseases; lowermost nodes: symptoms.
– Given some evidence (biological agents, symptoms), find the most likely disease.

71 Naive Bayes Rule
– When the dependency relationships among the features are unknown, we can assume that the features are conditionally independent given the category: P(a, b/x) = P(a/x) P(b/x).
– Naive Bayes rule: choose the category that maximizes the product of the prior and the individual feature likelihoods.
– A simple assumption, but it usually works well in practice.
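The naive Bayes rule of slide 71 can be sketched with simple counts. The toy fish data below (lightness and catch location) is hypothetical and only echoes the running example:

```python
from collections import Counter

def naive_bayes_train(samples):
    """samples: list of (feature tuple, label). Returns a classifier that scores
    each label by P(label) * prod_k P(feature_k = v / label), i.e., it uses the
    conditional-independence assumption P(a, b / w) = P(a / w) P(b / w)."""
    label_counts = Counter(lbl for _, lbl in samples)
    cond = Counter()   # counts of (feature index, value, label)
    for feats, lbl in samples:
        for k, v in enumerate(feats):
            cond[(k, v, lbl)] += 1
    n = len(samples)
    priors = {lbl: c / n for lbl, c in label_counts.items()}

    def score(feats, lbl):
        s = priors[lbl]
        for k, v in enumerate(feats):
            s *= cond[(k, v, lbl)] / label_counts[lbl]
        return s

    def classify(feats):
        return max(priors, key=lambda lbl: score(feats, lbl))
    return classify

# Toy fish data: (lightness, location) -> species (hypothetical)
data = [(("light", "south"), "salmon"), (("light", "north"), "salmon"),
        (("dark", "south"), "sea bass"), (("dark", "north"), "sea bass"),
        (("light", "south"), "salmon"), (("dark", "south"), "sea bass")]
classify = naive_bayes_train(data)
prediction = classify(("light", "south"))
```

A real implementation would smooth the zero counts (e.g., Laplace smoothing), but this bare version is enough to show the factorized posterior at work.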