Chapter ML:IV (continued)

IV. Statistical Learning
- Probability Basics
- Bayes Classification
- MAP versus ML Hypotheses

(Statistical Learning, © STEIN 2021)

Bayes Classification: Generative Approach to Classification Problems

Setting:

- X is a set of feature vectors.
- C is a set of classes.
- c : X → C is the (unknown) ideal classifier for X.
- D = {(x1, c(x1)), . . . , (xn, c(xn))} ⊆ X × C is a set of examples.

Todo:

- Approximate c(x), which is implicitly given via D, by estimating the underlying joint probability P(x, c(x)).


Bayes Classification: Bayes' Theorem

Theorem 12 (Bayes [1701-1761])

Let (Ω, P(Ω), P) be a probability space, and let A1, . . . , Ak be mutually exclusive events with Ω = A1 ∪ . . . ∪ Ak and P(Ai) > 0, i = 1, . . . , k. Then for an event B ∈ P(Ω) with P(B) > 0 holds:

    P(Ai | B) = P(Ai) · P(B | Ai) / Σ_{j=1}^{k} P(Aj) · P(B | Aj)

P(Ai) is called the prior probability of Ai.

P(Ai | B) is called the posterior probability of Ai.



Bayes Classification: Bayes' Theorem (continued)

Proof (Bayes' Theorem)

From the definition of conditional probability, applied to P(B | Ai) and P(Ai | B), it follows:

    P(Ai | B) = P(B ∩ Ai) / P(B) = P(Ai) · P(B | Ai) / P(B)

Applying the theorem of total probability to P(B),

    P(B) = Σ_{i=1}^{k} P(Ai) · P(B | Ai),

yields the claim.


Bayes Classification: Example: Reasoning About a Disease [Kirchgessner 2009]

1. A1 : Aids, with P(A1) = 0.001 (prior knowledge about the population)

   ⇒ A2 : no_Aids, with P(A2) = 1 − P(A1) = 0.999

   B : test_pos ⇒ P(B) = Σ_{i=1}^{2} P(Ai) · P(B | Ai) = 0.031

2. B | A1 : test_pos | Aids, with P(B | A1) = 0.98 (result from clinical trials)

3. B | A2 : test_pos | no_Aids, with P(B | A2) = 0.03 (result from clinical trials)

Simple Bayes formula:

    P(Aids | test_pos) = P(A1 | B) = P(A1) · P(B | A1) / P(B) = (0.001 · 0.98) / 0.031 ≈ 0.032 = 3.2%

Illustration with a sample of 100 000 people:

Disease statistics (prior knowledge): 100 (Aids), 99 900 (no_Aids)
Diagnosis (= hypothesis selection): of the 100 Aids cases, 98 (test_pos) and 2 (test_neg); of the 99 900 no_Aids cases, 2 997 (test_pos) and 96 903 (test_neg).

Of the 98 + 2 997 = 3 095 positive tests, only 98 stem from actual Aids cases, which again gives 98/3 095 ≈ 3.2%.
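The posterior can be checked mechanically. Below is a minimal sketch in plain Python (the variable names are mine, not from the slides) that reproduces the numbers above:

```python
# Reproduces the disease example: prior, likelihoods, total probability,
# and the posterior P(Aids | test_pos) via Bayes' theorem.
p_aids = 0.001               # P(A1), prior from population statistics
p_no_aids = 1 - p_aids       # P(A2)
p_pos_given_aids = 0.98      # P(B | A1), from clinical trials
p_pos_given_no_aids = 0.03   # P(B | A2), from clinical trials

# Theorem of total probability: P(B) = sum_i P(Ai) * P(B | Ai)
p_pos = p_aids * p_pos_given_aids + p_no_aids * p_pos_given_no_aids

# Bayes' theorem: P(A1 | B) = P(A1) * P(B | A1) / P(B)
p_aids_given_pos = p_aids * p_pos_given_aids / p_pos

print(f"P(test_pos)        = {p_pos:.3f}")             # ~0.031
print(f"P(Aids | test_pos) = {p_aids_given_pos:.3f}")  # ~0.032, i.e. ~3.2%
```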

Bayes Classification: Combined Conditional Events: P(Ai | B1, . . . , Bp)

Let P(Ai | B1, . . . , Bp) denote the probability of the occurrence of event Ai given that the events (conditions) B1, . . . , Bp are known to have occurred.

Applied to a classification problem:

- Ai corresponds to an event of the kind "class = ci"; the Bj, j = 1, . . . , p, correspond to p events of the kind "attribute_j = xj".
- Observable connection (in the regular situation): B1, . . . , Bp | Ai
- Reversed connection (in a diagnosis situation): Ai | B1, . . . , Bp

If sufficient data for estimating P(Ai) and P(B1, . . . , Bp | Ai) is provided, then P(Ai | B1, . . . , Bp) can be computed with the Theorem of Bayes:

    P(Ai | B1, . . . , Bp) = P(Ai) · P(B1, . . . , Bp | Ai) / P(B1, . . . , Bp)    (∗)

Remarks [information gain for classification]:

- How probability theory is applied to classification problem solving:

  – Classes and attribute-value pairs are interpreted as events. The relation to an underlying sample space Ω = {ω1, . . . , ωn}, of which the events are subsets, is not considered.

  – Observable or measurable and possibly causal connection: it is (or was in the past) regularly observed that in situation Ai (e.g., a disease) the symptoms B1, . . . , Bp occur. One may denote this as the forward connection.

  – Reversed connection, typically an analysis or diagnosis situation: the symptoms B1, . . . , Bp are observed, and one is interested in the probability that Ai is given or has occurred.

  – Based on the prior probabilities of the classes (aka class priors), P(class = ci), and the class-conditional probabilities of the observable connections (aka likelihoods), P(attribute_1 = x1, . . . , attribute_p = xp | class = ci), the conditional class probabilities in an analysis situation, P(class = ci | attribute_1 = x1, . . . , attribute_p = xp), can be computed with the Theorem of Bayes.

- Note that a class-conditional event "attribute_j = xj | class = ci" does not necessarily model a cause-effect relation: the event "class = ci" may cause, but does not need to cause, the event "attribute_j = xj".

Remarks (continued):

- Recap. Alternative and semantically equivalent notations of P(Ai | B1, . . . , Bp):

  1. P(Ai | B1, . . . , Bp)
  2. P(Ai | B1 ∧ . . . ∧ Bp)
  3. P(Ai | B1 ∩ . . . ∩ Bp)

Bayes Classification: Naive Bayes

The compilation of a database from which reliable values for the P(B1, . . . , Bp | Ai) can be obtained is often infeasible. The way out:

(a) Naive Bayes Assumption: "Given condition Ai, the B1, . . . , Bp are statistically independent" (aka: the Bj are conditionally independent). Notation:

    P(B1, . . . , Bp | Ai) =_{NB} ∏_{j=1}^{p} P(Bj | Ai)

(b) Given a set A1, . . . , Ak of alternative events (causes or classes), the most probable event under the Naive Bayes Assumption, ANB, can be computed with the Theorem of Bayes (∗):

    ANB = argmax_{Ai ∈ {A1, . . . , Ak}} [ P(Ai) · P(B1, . . . , Bp | Ai) / P(B1, . . . , Bp) ]
        =_{NB} argmax_{Ai ∈ {A1, . . . , Ak}} P(Ai) · ∏_{j=1}^{p} P(Bj | Ai)
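As a toy illustration of (b), the following sketch picks the maximizing event from made-up priors and likelihoods (all numbers are invented for illustration). Note that the denominator P(B1, . . . , Bp) is omitted, since it is the same for every Ai:

```python
# Argmax under the Naive Bayes Assumption: pick the Ai maximizing
# P(Ai) * prod_j P(Bj | Ai). The denominator is dropped.
from math import prod

priors = {"A1": 0.7, "A2": 0.3}      # P(Ai), invented
likelihoods = {                       # P(Bj | Ai) for observed B1, B2, B3, invented
    "A1": [0.10, 0.40, 0.50],
    "A2": [0.60, 0.30, 0.20],
}

scores = {a: priors[a] * prod(likelihoods[a]) for a in priors}
a_nb = max(scores, key=scores.get)
print(scores)   # {'A1': 0.014, 'A2': 0.0108}
print(a_nb)     # 'A1'
```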

Remarks:

- Rationale for the Naive Bayes Assumption. Usually the probability P(B1, . . . , Bp | Ai) cannot be estimated: suppose that we are given p attributes (features) and that the domains of the attributes contain at least l values each.

  Then, for as many as l^p different feature vectors the probabilities P(B1 = x1, . . . , Bp = xp | Ai) are required, where Bj = xj encodes the event that attribute j has the value xj. Moreover, in order to provide reliable estimates, each possible p-dimensional feature vector (x1, . . . , xp) has to occur in the database sufficiently often.

  By contrast, the estimation of the probabilities under the Naive Bayes Assumption, P(Bj = xj | Ai), can be derived from a significantly smaller database, since only p · l different "attribute_j = xj" events Bj = xj are distinguished altogether. (See the sketch after this list for concrete numbers.)

- If the Naive Bayes Assumption applies, then the event ANB will also maximize the posterior probability P(Ai | B1, . . . , Bp) as defined by the Theorem of Bayes.

- To identify the most probable event, the denominator in the argmax term, P(B1, . . . , Bp), need not be estimated, since it is constant and cannot influence the ranking among the A1, . . . , Ak.
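To make the l^p versus p · l contrast concrete, a tiny illustration (the values chosen for p and l are arbitrary):

```python
# A full joint table needs l**p entries per class; Naive Bayes needs only p*l.
p, l = 10, 4
print(l ** p)   # 1048576 probabilities P(B1=x1,...,Bp=xp | Ai) per class
print(p * l)    # 40 probabilities P(Bj=xj | Ai) per class
```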

Remarks (continued):

- Given a set of examples D, "learning" or "training" a classifier using Naive Bayes means to estimate the prior probabilities (class priors) P(Ai), with Ai = c(x), (x, c(x)) ∈ D, as well as the probabilities of the observable connections P(Bj = xj | Ai), xj ∈ x, j = 1, . . . , p, (x, c(x)) ∈ D, Ai = c(x).

  The obtained probabilities are used in the argmax term for ANB, which hence encodes the learned hypothesis and functions as a classifier for new feature vectors.

- The hypothesis space H comprises all combinations that can be formed from all values that can be chosen for P(Ai) and P(Bj = xj | Ai). When building a Naive Bayes classifier, the hypothesis space H is not explored; instead, the sought hypothesis is directly calculated via a data analysis of D.

- In general the Naive Bayes classifier is not linear, but if the likelihoods P(attribute_1 = x1, . . . , attribute_p = xp | class = ci) are from exponential families, the Naive Bayes classifier corresponds to a linear classifier in a particular feature space. [stackexchange]

- The Naive Bayes classifier belongs to the class of generative models, which model conditional density functions. By contrast, discriminative models attempt to maximize the quality of the output on a training set. [Wikipedia]

Bayes Classification: Naive Bayes (continued)

In addition to the Naive Bayes Assumption, let the following conditions apply:

(c) The set of the k classes is complete: Σ_{i=1}^{k} P(Ai) = 1, Ai ∈ {c(x) | (x, c(x)) ∈ D}

(d) The Ai are mutually exclusive: P(Ai, Aι) = 0, 1 ≤ i, ι ≤ k, i ≠ ι

Then holds:

    P(B1, . . . , Bp) =_{c,d} Σ_{i=1}^{k} P(Ai) · P(B1, . . . , Bp | Ai)     (theorem of total probability)
                      =_{NB}  Σ_{i=1}^{k} P(Ai) · ∏_{j=1}^{p} P(Bj | Ai)    (Naive Bayes Assumption)

With the Theorem of Bayes (∗) it follows for the conditional probabilities:

    P(Ai | B1, . . . , Bp) = P(Ai) · P(B1, . . . , Bp | Ai) / P(B1, . . . , Bp)
                           =_{c,d,NB} ( P(Ai) · ∏_{j=1}^{p} P(Bj | Ai) ) / ( Σ_{ι=1}^{k} P(Aι) · ∏_{j=1}^{p} P(Bj | Aι) )

Remarks:

- A ranking of the A1, . . . , Ak can be computed via argmax_{Ai ∈ {A1,...,Ak}} P(Ai) · ∏_{j=1}^{p} P(Bj | Ai).

- If both (c) completeness and (d) mutual exclusiveness of the Ai can be presumed, the total of all posterior probabilities must add up to one: Σ_{i=1}^{k} P(Ai | B1, . . . , Bp) = 1.

  As a consequence, P(B1, . . . , Bp) can be estimated and the rank order values for the Ai can be "converted" into the respective posterior probabilities P(Ai | B1, . . . , Bp). The normalization is obtained by dividing a rank order value by the total of the rank order values, Σ_{i=1}^{k} P(Ai) · ∏_{j=1}^{p} P(Bj | Ai). (A short sketch follows below.)

- The derivation above will in fact yield the true posterior probabilities P(Ai | B1, . . . , Bp) if the Naive Bayes Assumption along with the completeness and exclusiveness of the Ai hold.
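A short sketch of this conversion; the two rank order values are borrowed from the EnjoySport example later in this chapter:

```python
# Normalization: divide each rank order value P(Ai) * prod_j P(Bj | Ai)
# by the total, so that the resulting posteriors sum to one.
rank_values = {"no": 0.0206, "yes": 0.0053}
total = sum(rank_values.values())                    # estimates P(B1,...,Bp)
posteriors = {a: v / total for a, v in rank_values.items()}
print(posteriors)   # {'no': ~0.795, 'yes': ~0.205}
```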

Bayes Classification: Naive Bayes: Classifier Construction Summary

Let X be a set of feature vectors, C a set of k classes, and D ⊆ X × C a set of examples. Then the k classes correspond to the events A1, . . . , Ak, and the p feature values of some x ∈ X correspond to the events B1 = x1, . . . , Bp = xp.

Construction and application of a Naive Bayes classifier:

1. Estimation of the P(Ai), with Ai = c(x), (x, c(x)) ∈ D.

2. Estimation of the P(Bj = xj | Ai), xj ∈ x, j = 1, . . . , p, (x, c(x)) ∈ D, Ai = c(x).

3. Classification of a feature vector x as ANB, iff

       ANB = argmax_{Ai ∈ {A1,...,Ak}} P(Ai) · ∏_{j=1}^{p} P(Bj = xj | Ai)

4. Given the conditions (c) and (d), computation of the posterior probability for ANB as normalization of P(ANB) · ∏_{j=1}^{p} P(Bj = xj | ANB).
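A compact sketch of steps 1 to 4 for categorical features. The function names (train_naive_bayes, classify) and the representation of D are my own choices; the slides do not prescribe an implementation:

```python
# Naive Bayes from counts: priors and per-feature likelihoods are the
# relative frequencies in D; classification is the argmax of step 3,
# followed by the normalization of step 4.
from collections import Counter, defaultdict
from math import prod

def train_naive_bayes(D):
    """D: list of (x, c) pairs, x a tuple of p categorical feature values."""
    n = len(D)
    class_counts = Counter(c for _, c in D)
    priors = {c: k / n for c, k in class_counts.items()}   # step 1: P(Ai)
    cond_counts = defaultdict(Counter)                     # step 2: counts for P(Bj=xj | Ai)
    for x, c in D:
        for j, xj in enumerate(x):
            cond_counts[c][(j, xj)] += 1
    likelihoods = {c: {e: k / class_counts[c] for e, k in cnt.items()}
                   for c, cnt in cond_counts.items()}
    return priors, likelihoods

def classify(priors, likelihoods, x):
    """Steps 3 and 4: return (A_NB, posteriors) for feature vector x."""
    scores = {c: priors[c] * prod(likelihoods[c].get((j, xj), 0.0)
                                  for j, xj in enumerate(x))
              for c in priors}
    total = sum(scores.values())                           # step 4: normalization
    posteriors = {c: s / total for c, s in scores.items()} if total > 0 else scores
    return max(scores, key=scores.get), posteriors
```

Note a design choice in this sketch: feature values unseen for a class receive probability zero, exactly as plain relative-frequency estimation prescribes; practical implementations usually smooth these estimates.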

Remarks:

- There are at most p · l different events Bj = xj, if l is an upper bound for the size of the p feature domains.

- The probabilities, denoted as P(·), are unknown and are estimated by the relative frequencies, denoted as P̂(·).

- The Naive Bayes approach is adequate for example sets D of medium size up to very large sizes.

- Strictly speaking, the Naive Bayes approach presumes that the feature values in D are "statistically independent given the classes of the target concept". However, experience in the field of text classification shows that convincing classification results are achieved even if the Naive Bayes Assumption does not hold.

- If, in addition to the rank order values, posterior probabilities shall also be computed, both the completeness (c) and the exclusiveness (d) of the target concept classes are required.

  – Requirement (c) is also called "Closed World Assumption".
  – Requirement (d) is also called "Single Fault Assumption".

Bayes Classification: Naive Bayes: Example

     Outlook   Temperature  Humidity  Wind    EnjoySport
 1   sunny     hot          high      weak    no
 2   sunny     hot          high      strong  no
 3   overcast  hot          high      weak    yes
 4   rain      mild         high      weak    yes
 5   rain      cold         normal    weak    yes
 6   rain      cold         normal    strong  no
 7   overcast  cold         normal    strong  yes
 8   sunny     mild         high      weak    no
 9   sunny     cold         normal    weak    yes
10   rain      mild         normal    weak    yes
11   sunny     mild         normal    strong  yes
12   overcast  mild         high      strong  yes
13   overcast  hot          normal    weak    yes
14   rain      mild         high      strong  no

Task: Compute the class c(x) of feature vector x = (sunny, cold, high, strong).

Bayes Classification: Naive Bayes: Example (continued)

Let "Bj = xj" denote the event that feature j has value xj. Then the feature vector x = (sunny, cold, high, strong) gives rise to the following four events:

B1 = x1 : Outlook = sunny
B2 = x2 : Temperature = cold
B3 = x3 : Humidity = high
B4 = x4 : Wind = strong

Computation of ANB for x:

    ANB = argmax_{Ai ∈ {yes, no}} P(Ai) · ∏_{j=1}^{4} P(Bj = xj | Ai)

        = argmax_{Ai ∈ {yes, no}} P(Ai) · P(Outlook = sunny | Ai) · P(Temperature = cold | Ai) ·
                                  P(Humidity = high | Ai) · P(Wind = strong | Ai)

Bayes Classification: Naive Bayes: Example (continued)

To classify x, altogether 2 + 4 · 2 = 10 probabilities are estimated from the data:

- P(EnjoySport = yes) = 9/14 ≈ 0.64
- P(EnjoySport = no) = 5/14 ≈ 0.36
- P(Wind = strong | EnjoySport = yes) = 3/9 ≈ 0.33
- . . .

⇒ Ranking:

1. P(EnjoySport = no) · ∏_{xj ∈ x} P(Bj = xj | EnjoySport = no) = 0.0206
2. P(EnjoySport = yes) · ∏_{xj ∈ x} P(Bj = xj | EnjoySport = yes) = 0.0053

⇒ Probabilities (subject to conditions (c) and (d)):

1. P(EnjoySport = no | x) = 0.0206 / (0.0053 + 0.0206) ≈ 80%
2. P(EnjoySport = yes | x) = 0.0053 / (0.0053 + 0.0206) ≈ 20%
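Assuming the hypothetical train_naive_bayes and classify helpers sketched after the construction summary above, the whole example can be reproduced end to end:

```python
# The 14 training examples from the EnjoySport table.
D = [
    (("sunny",    "hot",  "high",   "weak"),   "no"),
    (("sunny",    "hot",  "high",   "strong"), "no"),
    (("overcast", "hot",  "high",   "weak"),   "yes"),
    (("rain",     "mild", "high",   "weak"),   "yes"),
    (("rain",     "cold", "normal", "weak"),   "yes"),
    (("rain",     "cold", "normal", "strong"), "no"),
    (("overcast", "cold", "normal", "strong"), "yes"),
    (("sunny",    "mild", "high",   "weak"),   "no"),
    (("sunny",    "cold", "normal", "weak"),   "yes"),
    (("rain",     "mild", "normal", "weak"),   "yes"),
    (("sunny",    "mild", "normal", "strong"), "yes"),
    (("overcast", "mild", "high",   "strong"), "yes"),
    (("overcast", "hot",  "normal", "weak"),   "yes"),
    (("rain",     "mild", "high",   "strong"), "no"),
]

priors, likelihoods = train_naive_bayes(D)
label, posteriors = classify(priors, likelihoods, ("sunny", "cold", "high", "strong"))
print(label)       # 'no'
print(posteriors)  # {'no': ~0.795, 'yes': ~0.205}, i.e. ~80% vs ~20%
```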