
  • Bayesian Classification

    0.

  • A simple application of

    Bayes’ formula

    CMU, 2006 fall, Tom Mitchell, Eric Xing, midterm, pr. 1.3

    1.

  • Suppose that in answering a question in a multiple choice test, an examinee either knows the answer, with probability p, or he guesses, with probability 1 − p. Assume that the probability of answering a question correctly is 1 for an examinee who knows the answer and 1/m for the examinee who guesses, where m is the number of multiple choice alternatives.

    What is the probability that an examinee knew the answer to [such] a question, given that he has correctly answered it?

    Answer:

    P(knew | correct) = P(correct | knew) · P(knew) / P(correct)

    = P(correct | knew) · P(knew) / [P(correct | knew) · P(knew) + P(correct | guessed) · P(guessed)].

    Notice that in the denominator we had to consider the two ways (s)he can get a question correct: by knowing, or by guessing. Plugging in, we get:

    p / (p + (1/m) · (1 − p)) = mp / (mp + 1 − p).
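    As a quick numerical check, here is a minimal Python sketch of the formula above; the values p = 0.6 and m = 4 are hypothetical, chosen only for illustration.

```python
def p_knew_given_correct(p, m):
    """Posterior probability that the examinee knew the answer,
    given a correct answer (Bayes' formula)."""
    num = 1.0 * p                          # P(correct | knew) * P(knew)
    den = num + (1.0 / m) * (1.0 - p)      # add the 'guessed' path
    return num / den

p, m = 0.6, 4                              # hypothetical values, for illustration only
print(p_knew_given_correct(p, m))          # 0.857...
print(m * p / (m * p + 1 - p))             # same value, via the simplified form mp/(mp+1-p)
```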

    2.

  • Maximum A Posteriori Probability (MAP) Hypotheses

    3.

  • Exemplifying

    • the application of Bayes’ theorem

    • the notion of MAP (Maximum A posteriori Probability) hypotheses

    • the computation of expected values for discrete random variables

    and

    • the [use of] sensitivity and specificity of a test in a real-world application

    CMU, 2009 fall, Geoff Gordon, HW1, pr. 2

    4.

  • There is a disease which affects 1 in 500 people. A 100.00 dollar blood test can help reveal whether a person has the disease. A positive outcome indicates that the person may have the disease.

    The test has perfect sensitivity (true positive rate), i.e., a person who has the disease tests positive 100% of the time. However, the test has 99% specificity (true negative rate), i.e., a healthy person tests positive 1% of the time.

    [Figure: a diagram of the healthy (h) and diseased (c) populations, partitioned into the tp, fp, fn and tn regions.]

    sensitivity (or: recall) = tp / (tp + fn)

    specificity = tn / (tn + fp)

    5.

  • a. A randomly selected individual is tested and the result is positive.

    What is the probability of the individual having the disease?

    b. There is a second, more expensive test which costs 10,000.00 dollars but is exact, with 100% sensitivity and specificity.

    If we require all people who test positive with the less expensive test to be tested with the more expensive test, what is the expected cost to check whether an individual has the disease?

    c. A pharmaceutical company is attempting to decrease the cost of the second (perfect) test.

    How much would it have to make the second test cost, so that the first test is no longer needed? That is, at what cost is it cheaper simply to use the perfect test alone, instead of screening with the cheaper test as described in part b?

    6.

  • Answer:

    Let’s define the following random variables:

    B: 1/true for persons affected by that disease, 0/false otherwise;

    T1: the result of the first test: + (in case of disease) or − (otherwise);
    T2: the result of the second test: again + or −.

    Known facts:

    P(B) = 1/500
    P(T1 = + | B) = 1, P(T1 = + | B̄) = 1/100
    P(T2 = + | B) = 1, P(T2 = + | B̄) = 0


    a.

    P(B | T1 = +) = (Bayes' theorem) P(T1 = + | B) · P(B) / [P(T1 = + | B) · P(B) + P(T1 = + | B̄) · P(B̄)]

    = (1 · 1/500) / (1 · 1/500 + 1/100 · 499/500) = 100/599 ≈ 0.1669 ⇒

    P(B̄ | T1 = +) ≈ 0.8331 > P(B | T1 = +). Therefore, B̄ is the MAP hypothesis.

    7.

  • b.

    Let’s consider a new random variable:

    C = c1 if the person only takes the first test; c1 + c2 if the person takes the two tests

    ⇒ P(C = c1) = P(T1 = −) and P(C = c1 + c2) = P(T1 = +)

    ⇒ E[C] = c1 · (1 − P(T1 = +)) + (c1 + c2) · P(T1 = +)
    = c1 − c1 · P(T1 = +) + c1 · P(T1 = +) + c2 · P(T1 = +)
    = c1 + c2 · P(T1 = +)
    = 100 + 10000 · 599/50000
    = 219.8 ≈ 220 dollars.

    Note: Here above we used the total probability formula:

    P(T1 = +) = P(T1 = + | B) · P(B) + P(T1 = + | B̄) · P(B̄) = 1 · 1/500 + 1/100 · 499/500 = 599/50000 = 0.01198

    8.

  • c.

    Let cn (notation) be the new price of the second test (T2′).

    The first test is no longer needed as soon as

    cn ≤ E[C′] = c1 · P(C = c1) + (c1 + cn) · P(C = c1 + cn) = c1 + cn · P(T1 = +) = 100 + cn · 599/50000.

    At the break-even point, cn = 100 + cn · 0.01198 ⇒ cn ≈ 101.2125.
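    A minimal Python sketch that reproduces parts a–c; the prices (100 and 10000 dollars) and the test characteristics are taken from the problem statement.

```python
# Part a: posterior probability of disease given a positive first test
p_disease = 1 / 500
p_pos_given_disease = 1.0        # perfect sensitivity
p_pos_given_healthy = 0.01       # 99% specificity -> 1% false positive rate

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
print(p_pos_given_disease * p_disease / p_pos)   # 100/599 ~ 0.1669

# Part b: expected cost of the two-stage screening procedure
c1, c2 = 100.0, 10_000.0
print(c1 + c2 * p_pos)           # everyone pays c1, positives also pay c2 -> 219.8

# Part c: break-even price cn of the perfect test, solving cn = c1 + cn * P(T1 = +)
print(c1 / (1 - p_pos))          # ~101.21
```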

    9.

  • The “Monty’s haunted house” problem

    • Exemplifying the application of Bayes’ theorem

    • the notion of MAP (Maximum A posteriori Probability) hypotheses

    CMU, 2009 fall, Geoff Gordon, HW1, pr. 1

    10.

  • You are in a haunted house and you are stuck in front of three doors. A ghost appears and tells you: “Your hope is behind one of these doors. There is only one door that opens to the outside and the two other doors have deadly monsters behind them. You must choose one door!” You choose the first door. The ghost tells you: “Wait! I will give you some more information.” The ghost opens the second door and shows you that there was a horrible monster behind it, then asks you: “Would you like to change your mind and take the third door instead?”

    What’s better: to stick with the first door, or to change to the third door? For each of the following strategies used by the ghost, determine the probabilities that the exit is behind the first and the third door, given that the ghost opened the second door.

    a. The ghost always opens a door you have not picked with a monster behind it. If both of the unopened doors hide monsters, he picks each of them with equal probability.

    b. The ghost has a slightly different strategy. If both of the unopened doors hide monsters, he always picks the second door.

    c. Finally, suppose that if both of the unopened doors hide monsters, the ghost always picks the third door.

    11.

  • Answer

    What we know (O denotes the door hiding the exit, G the door opened by the ghost):

    P(O = 1) = P(O = 2) = P(O = 3) = 1/3    (1)

    G  O | P(G | O), variant a | variant b | variant c
    2  1 |        1/2          |     1     |     0
    3  1 |        1/2          |     0     |     1
    2  2 |         0           |     0     |     0
    3  2 |         1           |     1     |     1
    2  3 |         1           |     1     |     1
    3  3 |         0           |     0     |     0

    What we must compute:

    P(O = 1 | G = 2) = (Bayes) P(G = 2 | O = 1) · P(O = 1) / [P(G = 2 | O = 1) · P(O = 1) + P(G = 2 | O = 3) · P(O = 3)]

    P(O = 3 | G = 2) = (Bayes) P(G = 2 | O = 3) · P(O = 3) / [P(G = 2 | O = 3) · P(O = 3) + P(G = 2 | O = 1) · P(O = 1)]

    Notice that we should have added P(G = 2 | O = 2) · P(O = 2) to the denominator, but we know that it is 0, since P(G = 2 | O = 2) = 0. Therefore, P(O = 3 | G = 2) = 1 − P(O = 1 | G = 2).

    12.

  • Variant a:

    P(O = 1 | G = 2) = (1/2 · 1/3) / (1/2 · 1/3 + 1 · 1/3) = 1/3
    P(O = 3 | G = 2) = 1 − 1/3 = 2/3

    Therefore, we should choose door 3.

    Variant b:

    P(O = 1 | G = 2) = (1 · 1/3) / (1 · 1/3 + 1 · 1/3) = 1/2
    P(O = 3 | G = 2) = 1 − 1/2 = 1/2

    Therefore, we can choose either door 1 or door 3.

    Variant c:

    P(O = 1 | G = 2) = 0    P(O = 3 | G = 2) = 1 − 0 = 1

    Therefore, we should choose door 3.
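    A short Python sketch (a direct enumeration, not a simulation) recomputes the three posteriors from the conditional-probability table above.

```python
# P(G = 2 | O = o) for each ghost strategy; the player always picks door 1
strategies = {
    "a": {1: 0.5, 2: 0.0, 3: 1.0},   # ties broken uniformly
    "b": {1: 1.0, 2: 0.0, 3: 1.0},   # ties: always open door 2
    "c": {1: 0.0, 2: 0.0, 3: 1.0},   # ties: always open door 3
}
prior = 1 / 3                        # P(O = o) is uniform over the three doors

for name, p_g2_given_o in strategies.items():
    joint = {o: p * prior for o, p in p_g2_given_o.items()}   # P(G = 2, O = o)
    evidence = sum(joint.values())                            # P(G = 2)
    posterior = {o: joint[o] / evidence for o in (1, 3)}
    print(name, {o: round(p, 3) for o, p in posterior.items()})
# a -> {1: 0.333, 3: 0.667};  b -> {1: 0.5, 3: 0.5};  c -> {1: 0.0, 3: 1.0}
```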

    13.

  • Alternative Solution (1)

    Equivalently, in the case of variant a, in order to determine the maximum of P(O = 1 | G = 2) and P(O = 3 | G = 2) it would have been enough, according to Bayes’ formula, to compare

    P(G = 2 | O = 1) · P(O = 1) and P(G = 2 | O = 3) · P(O = 3).

    Moreover, taking relation (1) into account, this amounts to comparing

    P(G = 2 | O = 1) and P(G = 2 | O = 3).

    The answer can be read off immediately from the table above (see the first and the next-to-last line): O = 3 is the alternative for which the maximum [a posteriori] probability is obtained.

    In other words, O = 3 is the maximum a posteriori probability (MAP) hypothesis.

    Variants b and c can be treated in exactly the same way.

    Remark: In cases like the one above (P(O = 1) = P(O = 2) = P(O = 3) = 1/3), the MAP hypothesis coincides with the maximum likelihood (ML) hypothesis.

    14.

  • Alternative Solution (2)

    Equivalently, for variants b and c we can answer the question with the following reasoning, without using Bayes’ formula: the exit can be behind any one of the three doors (see the nearby figure, which depicts the three possible placements of the monsters behind doors 1, 2 and 3). Since the ghost has already opened door number 2, one of these situations (namely, the second one in the figure) is eliminated, because behind that door there is a monster. We can then reason as follows:

    [Figure: the three possible configurations of monsters (M) behind doors 1, 2, 3.]

    Variant b: Since the ghost chooses door 2 with probability 1, we can state (in the absence of any other information) that both alternatives, 1 and 3, have equal probabilities, namely 1/2.

    Indeed,

    − either door 1, the one I chose, opens to the outside, and then the ghost must, according to principle P2, choose door 2;

    − or door 3 opens to the outside, and then, again according to principle P2, the ghost must choose door 2;

    − according to principle P1, there is no third possibility;

    − I have no other information to decide between the two situations above.

    Variant c: Knowing that the ghost did not open door 3 (which would have been the choice corresponding to principle P2) but chose door 2 instead means that it could not do otherwise, so door 3 is the exit.

    15.

  • Exemplifying

    • the application of Bayes’ theorem

    • the notion of MAP (Maximum A posteriori Probability) hypotheses

    CMU, 2012 spring, Ziv Bar-Joseph, HW1, pr. 1.5

    16.

  • Mickey tosses a die multiple times, hoping for a 6. The sequence of his 10 tosses is 1, 3, 4, 2, 3, 3, 2, 5, 1, 6. Mickey is suspicious whether the die is biased towards 3 (or fair).

    Conduct a simple analysis based on the Bayes theorem to inform Mickey — to what degree is the die biased? Explain your reasoning.

    Assume in general that every 100 dice contain 5 unfair dice that are biased towards 3, with the probability distribution of the six faces (1, 2, 3, 4, 5, 6) being P = [0.1, 0.1, 0.5, 0.1, 0.1, 0.1].

    17.

  • Solution

    Definition of the Maximum A Posteriori Probability hypothesis:

    hMAP (def.) = argmax_{h∈H} P(h|D) = (Bayes) argmax_{h∈H} P(D|h) · P(h) / P(D) = argmax_{h∈H} P(D|h) · P(h).

    Let us denote:

    • D = {1, 3, 4, 2, 3, 3, 2, 5, 1, 6} = {x1, x2, . . . , x10}
    • H = {FD, LD}, where FD is the fair die and LD is the loaded die.

    18.

  • P(D|FD) · P(FD) = P(x1, x2, . . . , x10|FD) · P(FD) = (i.i.d.) (∏_{i=1}^{10} P(xi|FD)) · P(FD) = (1/6)^10 · 95/100 = 1/(2^10 · 3^10) · 19/20

    P(D|LD) · P(LD) = P(x1, x2, . . . , x10|LD) · P(LD) = (i.i.d.) (∏_{i=1}^{10} P(xi|LD)) · P(LD)
    = (1/10 · 1/2 · 1/10 · 1/10 · 1/2 · 1/2 · 1/10 · 1/10 · 1/10 · 1/10) · 5/100
    = 1/(10^7 · 2^3) · 1/20 = 1/(2^10 · 5^7) · 1/20.

    In order to compare P(D|FD) · P(FD) and P(D|LD) · P(LD), it is easier to first apply the ln:

    ln [P(D|FD) · P(FD)] > ln [P(D|LD) · P(LD)] ⇔ ln (19/3^10) > ln (1/5^7) ⇔ ln 19 − 10 ln 3 > −7 ln 5 ⇔ −8.0417 > −11.2661

    Note: We could have directly computed the so-called log-odds ratio:

    ln [P(LD|D)/P(FD|D)] = ln [P(D|LD) · P(LD) / (P(D|FD) · P(FD))] = . . . = −3.2244 < 0, so we have to choose FD.
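    The comparison can be checked with a few lines of Python; the priors 0.95/0.05 and the loaded-die distribution come from the problem statement.

```python
import math

tosses = [1, 3, 4, 2, 3, 3, 2, 5, 1, 6]
p_fair = {f: 1 / 6 for f in range(1, 7)}
p_loaded = {1: 0.1, 2: 0.1, 3: 0.5, 4: 0.1, 5: 0.1, 6: 0.1}

def log_joint(face_probs, prior):
    """ln [ P(D | h) * P(h) ] under the i.i.d. assumption."""
    return sum(math.log(face_probs[x]) for x in tosses) + math.log(prior)

log_fd = log_joint(p_fair, 0.95)
log_ld = log_joint(p_loaded, 0.05)
print(log_ld - log_fd)       # log-odds ratio ~ -3.22 < 0, so the MAP hypothesis is FD
```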

    19.

  • Exemplifying

    ML hypotheses and MAP hypotheses

    using decision trees

    CMU, 2009 spring, Tom Mitchell, midterm, pr. 2.3-4

    20.

  • [Figure: five points on the real-valued X axis, at X = 1.5, 2, 3, 3.5 and 3.75; the points at 1.5, 2 and 3.75 have label Y = 1, the points at 3 and 3.5 have label Y = 0.]

    Let’s consider the 1-dimensional data set shown above, based on the single real-valued attribute X. Notice there are two classes (values of Y), and five data points.

    Consider a special type of decision trees where leaves have probabilistic labels. Each leaf node gives the probability of each possible label, where the probability is the fraction of points at that leaf node with that label.

    For example, a decision tree learned from the data set above with zero splits would say P(Y = 1) = 3/5 and P(Y = 0) = 2/5. A decision tree with one split (at X = 2.5) would say P(Y = 1) = 1 if X < 2.5, and P(Y = 1) = 1/3 if X ≥ 2.5.

    a. For the above data set, draw a tree that maximizes the likelihood of the data.

    TML = argmax_T PT(D), where PT(D) (def.) = P(D|T) = (i.i.d.) ∏_{i=1}^{5} P(Y = yi|X = xi, T),

    where yi is the label/class of the instance xi (x1 = 1.5, x2 = 2, x3 = 3, x4 = 3.5, x5 = 3.75).

    Solution:

    The ML tree first tests X > 2.5 (No → leaf with P(Y = 1) = 1); on the Yes branch it then tests X > 3.625 (No → leaf with P(Y = 1) = 0; Yes → leaf with P(Y = 1) = 1).

    21.

  • b. Consider a prior probability distribution P(T) over trees that penalizes the number of splits in the tree:

    P(T) ∝ (1/4)^(splits(T)^2)

    where T is a tree, splits(T) is the number of splits in T, and ∝ means “is proportional to”.

    For the same data set, give the MAP tree when using this prior, P(T), over trees.

    Solution:

    0 splits (a single leaf with P(Y = 1) = 3/5):

    P(T0 | D) ∝ (3/5)^3 · (2/5)^2 · (1/4)^0 = (3^3 · 2^2)/5^5 = 108/3125 ≈ 0.0346

    1 split (X > 2.5; No → P(Y = 1) = 1, Yes → P(Y = 1) = 1/3):

    P(T1 | D) ∝ 1^2 · (2/3)^2 · (1/3) · (1/4)^1 = 1/27 ≈ 0.037

    2 splits (the ML tree from part a):

    P(T2) ∝ (1/4)^4 ⇒ P(T2 | D) ∝ 1 · (1/4)^4 = 1/256 ≈ 0.0039 ⇒ the MAP tree is T1.
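    A quick Python check of the three unnormalized posterior scores P(D | T) · P(T):

```python
def score(leaf_probs, splits):
    """Product of the per-point leaf probabilities times the prior (1/4)^(splits^2)."""
    lik = 1.0
    for p in leaf_probs:
        lik *= p
    return lik * (1 / 4) ** (splits ** 2)

t0 = score([3/5, 3/5, 3/5, 2/5, 2/5], splits=0)   # single leaf, P(Y=1) = 3/5
t1 = score([1, 1, 2/3, 2/3, 1/3],     splits=1)   # split at X = 2.5
t2 = score([1, 1, 1, 1, 1],           splits=2)   # splits at X = 2.5 and X = 3.625
print(t0, t1, t2)    # ~0.0346, ~0.0370, ~0.0039 -> the MAP tree is T1
```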

    22.

  • The Naive Bayes and Joint Bayes Algorithms

    23.

  • Exemplifying the application of

    Naive Bayes and Joint Bayes algorithms;

    the minimum number of parameters

    to be estimated for NB and respectively JB

    CMU, 2008 fall, Eric Xing, HW1 pr. 2

    24.

  • Consider a classification problem, the table of observations for which is given nearby. X1 and X2 are two binary random variables which are the observed variables. Y is the class label, which is observed for the training data given below. We will use the Naive Bayes classifier and the Joint Bayes classifier to classify a new instance after training on the data below, and compare the results.

    X1  X2  Y  Counts
     0   0  0    2
     0   0  1   18
     1   0  0    4
     1   0  1    1
     0   1  0    4
     0   1  1    1
     1   1  0    2
     1   1  1   18

    a. Construct the Naive Bayes classifier given the data above. Use it to classify the instance X1 = 0, X2 = 0.

    b. Construct the Joint Bayes classifier given the data above. Use it to classify the instance X1 = 0, X2 = 0.

    25.

  • c. Compute the probabilities P(Y = 1|X1 = 0, X2 = 0) for the Naive Bayes classifier (let’s denote it PNB(Y = 1|X1 = 0, X2 = 0)) and for the Joint Bayes classifier (PJB(Y = 1|X1 = 0, X2 = 0)). Why is PNB(Y = 1|X1 = 0, X2 = 0) different from PJB(Y = 1|X1 = 0, X2 = 0)? (Hint: Compute P(X1, X2|Y).)

    d. What happens to the difference between PNB(Y = 1|X1 = 0, X2 = 0) and PJB(Y = 1|X1 = 0, X2 = 0) if the table entries are changed to the nearby table?

    (Hint: Will the Naive Bayes assumption be more violated or less violated compared to the previous situation?)

    X1  X2  Y  Counts
     0   0  0    3
     0   0  1    9
     1   0  0    3
     1   0  1    9
     0   1  0    3
     0   1  1    9
     1   1  0    3
     1   1  1    9

    26.

  • e. Compare the number of independent parameters in the two classifiers. Instead of just two observed data variables, if there were n random binary observed variables {X1, . . . , Xn}, what would be the number of parameters required for both classifiers?

    From this, what can you comment about the rate of growth of the number of parameters for both models as n → ∞?

    27.

  • Answer

    a.

    ŷNB(X1 = 0, X2 = 0) (def.) = argmax_{y∈{0,1}} P(X1 = 0|Y = y) · P(X2 = 0|Y = y) · P(Y = y)

    p0 (not.) = P(X1 = 0|Y = 0) · P(X2 = 0|Y = 0) · P(Y = 0) = (MLE) 6/12 · 6/12 · 12/50 = 3/50 = 6/100

    p1 (not.) = P(X1 = 0|Y = 1) · P(X2 = 0|Y = 1) · P(Y = 1) = (MLE) 19/38 · 19/38 · 38/50 = 19/100

    p0 < p1 ⇒ ŷNB(X1 = 0, X2 = 0) = 1.

    28.

  • b.

    ŷJB(X1 = 0, X2 = 0) (def.) = argmax_{y∈{0,1}} P(X1 = 0, X2 = 0|Y = y) · P(Y = y)

    p′0 (not.) = P(X1 = 0, X2 = 0 | Y = 0) · P(Y = 0) = (MLE) 2/12 · 12/50 = 2/50

    p′1 (not.) = P(X1 = 0, X2 = 0 | Y = 1) · P(Y = 1) = (MLE) 18/38 · 38/50 = 18/50

    p′0 < p′1 ⇒ ŷJB(X1 = 0, X2 = 0) = 1.

    29.

  • c.

    PNB (not.) = P(Y = 1 | X1 = 0, X2 = 0)
    = (Bayes) P(X1 = 0, X2 = 0 | Y = 1) · P(Y = 1) / [P(X1 = 0, X2 = 0 | Y = 1) P(Y = 1) + P(X1 = 0, X2 = 0 | Y = 0) P(Y = 0)]
    = (cond. indep.) P(X1 = 0|Y = 1) · P(X2 = 0|Y = 1) · P(Y = 1) / [P(X1 = 0|Y = 0) · P(X2 = 0|Y = 0) · P(Y = 0) + P(X1 = 0|Y = 1) · P(X2 = 0|Y = 1) · P(Y = 1)]
    = p1 / (p0 + p1) = (19/100) / (6/100 + 19/100) = 19/25

    PJB (not.) = P(Y = 1 | X1 = 0, X2 = 0)
    = (Bayes) P(X1 = 0, X2 = 0 | Y = 1) · P(Y = 1) / [P(X1 = 0, X2 = 0 | Y = 1) P(Y = 1) + P(X1 = 0, X2 = 0 | Y = 0) P(Y = 0)]
    = p′1 / (p′0 + p′1) = (18/50) / (2/50 + 18/50) = 18/20

    30.

  • PNB ≠ PJB because [in this case] the conditional independence assumption doesn’t hold. Indeed,

    P(X1 = 0, X2 = 0 | Y = 0) = (MLE) 2/12

    P(X1 = 0 | Y = 0) · P(X2 = 0 | Y = 0) = (MLE) 6/12 · 6/12 = 1/4

    ⇒ P(X1 = 0, X2 = 0 | Y = 0) ≠ P(X1 = 0 | Y = 0) · P(X2 = 0 | Y = 0) ⇒ P(X1, X2 | Y) ≠ P(X1 | Y) · P(X2 | Y)

    Therefore X1 and X2 are not conditionally independent w.r.t. Y.
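    The Naive Bayes and Joint Bayes posteriors for parts a–c can be reproduced directly from the counts table with a short Python sketch.

```python
from collections import Counter

# Training counts (X1, X2, Y) -> number of observations, from the first table
counts = Counter({(0,0,0): 2, (0,0,1): 18, (1,0,0): 4, (1,0,1): 1,
                  (0,1,0): 4, (0,1,1): 1, (1,1,0): 2, (1,1,1): 18})
n = sum(counts.values())

def p_y(y):
    return sum(c for k, c in counts.items() if k[2] == y) / n

def p_xi_given_y(i, v, y):
    """MLE of P(X_i = v | Y = y); i is 0 for X1 and 1 for X2."""
    den = sum(c for k, c in counts.items() if k[2] == y)
    return sum(c for k, c in counts.items() if k[i] == v and k[2] == y) / den

def p_x1x2_given_y(x1, x2, y):
    """MLE of P(X1 = x1, X2 = x2 | Y = y)."""
    return counts[(x1, x2, y)] / sum(c for k, c in counts.items() if k[2] == y)

nb = {y: p_xi_given_y(0, 0, y) * p_xi_given_y(1, 0, y) * p_y(y) for y in (0, 1)}
jb = {y: p_x1x2_given_y(0, 0, y) * p_y(y) for y in (0, 1)}
print(nb[1] / sum(nb.values()))   # P_NB(Y=1 | 0,0) = 19/25 = 0.76
print(jb[1] / sum(jb.values()))   # P_JB(Y=1 | 0,0) = 18/20 = 0.90
```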

    31.

  • d.

    PNB = P(Y = 1 | X1 = 0, X2 = 0)
    = P(X1 = 0, X2 = 0 | Y = 1) · P(Y = 1) / [P(X1 = 0, X2 = 0 | Y = 1) P(Y = 1) + P(X1 = 0, X2 = 0 | Y = 0) P(Y = 0)]
    = P(X1 = 0|Y = 1) · P(X2 = 0|Y = 1) · P(Y = 1) / [P(X1 = 0|Y = 0) · P(X2 = 0|Y = 0) · P(Y = 0) + P(X1 = 0|Y = 1) · P(X2 = 0|Y = 1) · P(Y = 1)]
    = (18/36 · 18/36 · 36/48) / (6/12 · 6/12 · 12/48 + 18/36 · 18/36 · 36/48)
    = (9/48) / (3/48 + 9/48) = 9/12 = 3/4

    PJB = P(Y = 1 | X1 = 0, X2 = 0)
    = P(X1 = 0, X2 = 0 | Y = 1) · P(Y = 1) / [P(X1 = 0, X2 = 0 | Y = 1) P(Y = 1) + P(X1 = 0, X2 = 0 | Y = 0) P(Y = 0)]
    = (9/36 · 36/48) / (3/12 · 12/48 + 9/36 · 36/48) = (9/48) / (3/48 + 9/48) = 9/12 = 3/4.

    Therefore, in this case PNB = PJB.

    32.

  • In fact, it can be easily shown that for the newly given distribution the conditional independence assumption (for X1 and X2 w.r.t. Y) holds. Therefore, the predictions made by Naive Bayes and Joint Bayes will coincide.

    33.

  • e. For our dataset, Naive Bayes needs to compute the following probabilities:

    P(Y = 0) ⇒ P(Y = 1) = 1 − P(Y = 0)
    P(X1 = 0 | Y = 0) ⇒ P(X1 = 1 | Y = 0) = 1 − P(X1 = 0 | Y = 0)
    P(X1 = 0 | Y = 1) ⇒ P(X1 = 1 | Y = 1) = 1 − P(X1 = 0 | Y = 1)
    P(X2 = 0 | Y = 0) ⇒ P(X2 = 1 | Y = 0) = 1 − P(X2 = 0 | Y = 0)
    P(X2 = 0 | Y = 1) ⇒ P(X2 = 1 | Y = 1) = 1 − P(X2 = 0 | Y = 1)

    Therefore, we will need only 5 values in order to fully construct the Naive Bayes classifier.

    In the general case, when n input attributes / variables are given, we need to estimate P(Y), P(Xi | Y) and P(Xi | ¬Y) for i = 1, . . . , n, therefore 2n + 1 values / parameters.

    34.

  • For the Joint Bayes classifier, we need to estimate:

    P(Y = 0) ⇒ P(Y = 1) = 1 − P(Y = 0)

    P(X1 = 0, X2 = 0 | Y = 0)
    P(X1 = 0, X2 = 1 | Y = 0)
    P(X1 = 1, X2 = 0 | Y = 0)
    ⇒ P(X1 = 1, X2 = 1 | Y = 0) = 1 − (P(X1 = 0, X2 = 0 | Y = 0) + P(X1 = 0, X2 = 1 | Y = 0) + P(X1 = 1, X2 = 0 | Y = 0))

    P(X1 = 0, X2 = 0 | Y = 1)
    P(X1 = 0, X2 = 1 | Y = 1)
    P(X1 = 1, X2 = 0 | Y = 1)
    ⇒ P(X1 = 1, X2 = 1 | Y = 1) = 1 − (P(X1 = 0, X2 = 0 | Y = 1) + P(X1 = 0, X2 = 1 | Y = 1) + P(X1 = 1, X2 = 0 | Y = 1)).

    For the general case, when n input variables are given, we will need to estimate P(Y), P(X̃1, · · · , X̃n | Y) and P(X̃1, · · · , X̃n | ¬Y), where X̃i ∈ {Xi, ¬Xi} for all i ∈ 1, n and (X̃1, · · · , X̃n) ≠ (¬X1, · · · , ¬Xn).

    Therefore, 2(2^n − 1) + 1 = 2^(n+1) − 1 values / parameters. It can be seen that Naive Bayes uses a linear number of parameters (w.r.t. n, the number of input attributes), while Joint Bayes uses an exponential number of parameters (w.r.t. the same n).

    35.

  • Exemplifying

    spam filtering using the Naive Bayes algorithm

    CMU, 2009 spring, Ziv Bar-Joseph, midterm, pr. 2

    36.

  • About 2/3 of your email is spam, so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the following regular and spam mails to train the classifier, and only three words are informative for this classification, i.e., each email is represented as a 3-dimensional binary vector whose components indicate whether the respective word is contained in the email.

    ‘study’  ‘free’  ‘money’  Category  count
       0       0       1      Regular     1
       1       0       0      Regular     2
       1       1       0      Regular     1
       0       1       0      Spam        4
       0       1       1      Spam        4

    37.

  • a. You find that the spam filter uses a prior P(spam) = 0.1. Explain (in one sentence) why this might be sensible.

    b. Compute the Naive Bayes parameters, using Maximum Likelihood Estimation (MLE) and applying Laplace’s rule (“add-one”).

    c. Based on the prior and conditional probabilities above, give the model probability P(spam | s) that the sentence

    s = “money for psychology study”

    is spam.

    38.

  • Answer:

    a. It is worse for regular emails to be classified as spam than it is for spam email to be classified as regular email.

    b. When estimating the Naive Bayes parameters from the training data using only the MLE (maximum likelihood estimation) method, we would have:

    P(study|spam) = 0/8 = 0        P(study|regular) = 3/4
    P(free|spam) = 8/8 = 1         P(free|regular) = 1/4
    P(money|spam) = 4/8 = 1/2      P(money|regular) = 1/4

    By applying Laplace’s rule (“add-one”) we get:

    P(study|spam) = (0 + 1)/(8 + 2) = 1/10      P(study|regular) = (3 + 1)/(4 + 2) = 2/3
    P(free|spam) = (8 + 1)/(8 + 2) = 9/10       P(free|regular) = (1 + 1)/(4 + 2) = 1/3
    P(money|spam) = (4 + 1)/(8 + 2) = 1/2       P(money|regular) = (1 + 1)/(4 + 2) = 1/3

    Notice that the 2’s occurring in the denominators correspond to the number of values of each of the attributes used to describe the training instances.

    39.

  • c. Classification of the message

    s = “money for psychology study”,

    using the prior probability P(spam) = 0.1:

    P(spam | s) = P(spam | study, ¬free, money)
    = (Bayes) P(study, ¬free, money | spam) · P(spam) / [P(study, ¬free, money | spam) P(spam) + P(study, ¬free, money | reg) P(reg)]

    P(study, ¬free, money | spam) = (cond. indep.) P(study|spam) · P(¬free|spam) · P(money|spam) = 1/10 · 1/10 · 1/2 = 1/200

    P(study, ¬free, money | reg) = (cond. indep.) P(study|reg) · P(¬free|reg) · P(money|reg) = 2/3 · 2/3 · 1/3 = 4/27

    Therefore,

    P(spam | s) = (1/200 · 1/10) / (1/200 · 1/10 + 4/27 · 9/10) ≈ 0.0037

    Notice that this is a small probability. However, without using Laplace’s rule, it would be 0, due to the fact that the word ‘study’ did not appear in any of the spam emails in the training data.
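    A minimal Python sketch of this spam filter (counts and prior taken from the exercise); it reproduces the Laplace-smoothed parameters and the posterior ≈ 0.0037.

```python
# Training counts: (study, free, money) -> number of e-mails of that type
regular = {(0, 0, 1): 1, (1, 0, 0): 2, (1, 1, 0): 1}
spam    = {(0, 1, 0): 4, (0, 1, 1): 4}

def word_probs(table):
    """Laplace-smoothed P(word present | class) for the three words."""
    n = sum(table.values())
    return [(sum(c for k, c in table.items() if k[i] == 1) + 1) / (n + 2)
            for i in range(3)]

theta_reg, theta_spam = word_probs(regular), word_probs(spam)

def likelihood(theta, x):
    """P(x | class) under the conditional independence assumption."""
    p = 1.0
    for t, xi in zip(theta, x):
        p *= t if xi else (1 - t)
    return p

x = (1, 0, 1)                     # 'study' and 'money' present, 'free' absent
prior_spam = 0.1
num = likelihood(theta_spam, x) * prior_spam
print(num / (num + likelihood(theta_reg, x) * (1 - prior_spam)))   # ~0.0037
```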

    40.

  • Naive Bayes and Joint Bayes:

    application when a joint probabilistic distribution

    (on input + output variables) is provided

    CMU, 2010 spring, T. Mitchell, E. Xing, A. Singh, midterm pr. 2.1

    41.

  • Consider the joint probability distribution over 3 boolean random variables x1, x2, and y given in the nearby table.

    x1  x2  y  P(x1, x2, y)
     0   0  0     0.15
     0   0  1     0.25
     0   1  0     0.05
     0   1  1     0.08
     1   0  0     0.10
     1   0  1     0.02
     1   1  0     0.20
     1   1  1     0.15

    a. Express P (y = 0 | x1, x2) in terms of P (x1, x2, y = 0) andP (x1, x2, y = 1).

    42.

  • b. Compute the marginal probabilities which will be used by a Naive Bayes classifier. Fill in the following tables.

    y     | P(y)
    y = 0 |
    y = 1 |

    P(x1 | y) | x1 = 0 | x1 = 1
    y = 0     |        |
    y = 1     |        |

    P(x2 | y) | x2 = 0 | x2 = 1
    y = 0     |        |
    y = 1     |        |

    c. Write out an expression for the value of P(y = 1|x1 = 1, x2 = 0) predicted by the Naive Bayes classifier.

    d. Write out an expression for the value of P(y = 1|x1 = 1, x2 = 0) predicted by the Joint Bayes classifier.

    e. The expressions you wrote down for parts (c) and (d) should be unequal. Explain why.

    43.

  • Answer

    a. Using the definition of conditional probability and then the total probability formula, we get:

    P(y = 0 | x1, x2) = P(x1, x2, y = 0) / P(x1, x2) = P(x1, x2, y = 0) / [P(x1, x2, y = 0) + P(x1, x2, y = 1)]

    b.

    y     | P(y)
    y = 0 | 0.5
    y = 1 | 0.5

    P(x1 | y) | x1 = 0 | x1 = 1
    y = 0     |  0.40  |  0.60
    y = 1     |  0.66  |  0.34

    P(x2 | y) | x2 = 0 | x2 = 1
    y = 0     |  0.50  |  0.50
    y = 1     |  0.54  |  0.46

    Explanations:

    i. P(y) was computed as a marginal probability, starting from the joint probability P(x1, x2, y):

    P(y = 0) = P(x1 = 0, x2 = 0, y = 0) + P(x1 = 0, x2 = 1, y = 0) + P(x1 = 1, x2 = 0, y = 0) + P(x1 = 1, x2 = 1, y = 0) = 0.15 + 0.05 + 0.1 + 0.2 = 0.5

    P(y = 1) = 1 − P(y = 0) = 0.5

    44.

  • ii. P(x1 | y) was computed using again the definition of conditional probability and then the total probability formula:

    P(x1 = 0 | y = 0) = P(x1 = 0, y = 0) / P(y = 0) = P(x1 = 0, y = 0) / [P(x1 = 0, y = 0) + P(x1 = 1, y = 0)]

    P(x1 = 0, y = 0) = P(x1 = 0, x2 = 0, y = 0) + P(x1 = 0, x2 = 1, y = 0) = 0.15 + 0.05 = 0.2

    P(x1 = 1, y = 0) = P(x1 = 1, x2 = 0, y = 0) + P(x1 = 1, x2 = 1, y = 0) = 0.1 + 0.2 = 0.3

    Therefore

    P(x1 = 0 | y = 0) = 0.2 / (0.2 + 0.3) = 0.4, and P(x1 = 1 | y = 0) = 1 − P(x1 = 0 | y = 0) = 0.6

    Similarly,

    P(x1 = 0 | y = 1) = P(x1 = 0, y = 1) / P(y = 1) = (0.25 + 0.08) / 0.5 = 0.66 ⇒ P(x1 = 1 | y = 1) = 1 − P(x1 = 0 | y = 1) = 0.34

    P(x2 | y) was calculated in a similar way.

    45.

  • c. The Naive Bayes classifier uses the conditional independence assumption. Therefore:

    P(y = 1 | x1 = 1, x2 = 0) = (Bayes) P(x1 = 1, x2 = 0 | y = 1) · P(y = 1) / P(x1 = 1, x2 = 0)
    = P(x1 = 1, x2 = 0 | y = 1) · P(y = 1) / [P(x1 = 1, x2 = 0 | y = 1) P(y = 1) + P(x1 = 1, x2 = 0 | y = 0) P(y = 0)]
    = (cond. indep.) P(x1 = 1 | y = 1) · P(x2 = 0 | y = 1) · P(y = 1) / [P(x1 = 1|y = 1) P(x2 = 0|y = 1) P(y = 1) + P(x1 = 1|y = 0) P(x2 = 0|y = 0) P(y = 0)]
    = 0.34 · 0.54 · 0.5 / (0.34 · 0.54 · 0.5 + 0.6 · 0.5 · 0.5) ≈ 0.38

    d. The Joint Bayes classifier doesn’t use the conditional independence assumption. Therefore:

    P(y = 1 | x1 = 1, x2 = 0) = P(x1 = 1, x2 = 0, y = 1) / [P(x1 = 1, x2 = 0, y = 1) + P(x1 = 1, x2 = 0, y = 0)] = 0.02 / (0.02 + 0.1) ≈ 0.167

    46.

  • e. The values calculated by the Naive Bayes and respectively the Joint Bayes classifiers for P(y = 1 | x1 = 1, x2 = 0) are different because the conditional independence assumption does not hold.

    Indeed,

    P(x1 = 0, x2 = 0 | y = 0) = P(x1 = 0, x2 = 0, y = 0) / P(y = 0) = 0.15 / 0.5 = 0.3

    while P(x1 = 0 | y = 0) · P(x2 = 0 | y = 0) = 0.4 · 0.5 = 0.2 ≠ 0.3.

    Therefore, x1 and x2 are not conditionally independent w.r.t. y.
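    A short Python sketch that recomputes both posteriors directly from the given joint distribution.

```python
# The given joint distribution P(x1, x2, y)
joint = {(0,0,0): 0.15, (0,0,1): 0.25, (0,1,0): 0.05, (0,1,1): 0.08,
         (1,0,0): 0.10, (1,0,1): 0.02, (1,1,0): 0.20, (1,1,1): 0.15}

def marg(**fixed):
    """Marginal probability of an assignment to a subset of {x1, x2, y}."""
    names = ("x1", "x2", "y")
    return sum(p for k, p in joint.items()
               if all(k[names.index(n)] == v for n, v in fixed.items()))

# Joint Bayes posterior: read it off the joint distribution
jb = marg(x1=1, x2=0, y=1) / marg(x1=1, x2=0)

# Naive Bayes posterior: use the factorization P(x1|y) P(x2|y) P(y), which is
# incorrect here because x1 and x2 are not conditionally independent given y
def nb_score(y):
    return (marg(x1=1, y=y) / marg(y=y)) * (marg(x2=0, y=y) / marg(y=y)) * marg(y=y)

nb = nb_score(1) / (nb_score(0) + nb_score(1))
print(round(nb, 3), round(jb, 3))   # ~0.38 vs ~0.167
```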

    47.

  • Exemplifying

    The computation of the error rate for the Naive Bayes algorithm

    CMU, 2010 fall, Aarti Singh, HW1, pr. 4.2

    48.

  • Consider a simple learning problem of determining whether Alice and Bob from CA will go hiking or not, Y: Hike ∈ {T, F}, given the weather conditions X1: Sunny ∈ {T, F} and X2: Windy ∈ {T, F}, by a Naive Bayes classifier. Using training data, we estimated the parameters

    P(Hike) = 0.5
    P(Sunny | Hike) = 0.8, P(Sunny | ¬Hike) = 0.7
    P(Windy | Hike) = 0.4, P(Windy | ¬Hike) = 0.5

    Assume that the true distribution of X1, X2, and Y satisfies the Naive Bayes assumption of conditional independence with the above parameters.

    a. What is the joint probability that Alice and Bob go hiking and the weather is sunny and windy, that is, P(Sunny, Windy, Hike)?

    Solution:

    P(Sunny, Windy, Hike) = (cond. indep.) P(Sunny|Hike) · P(Windy|Hike) · P(Hike) = 0.8 · 0.4 · 0.5 = 0.16.

    49.

  • b. What is the expected error rate of the Naive Bayes classifier? (Informally, the expected error rate is the probability that an “observation”/instance randomly generated according to the true probabilistic distribution of the data is incorrectly classified by the Naive Bayes algorithm.)

    Solution:

    X1 X2 Y | P(X1, X2, Y) = P(X1|Y) · P(X2|Y) · P(Y) | YNB(X1, X2) | PNB(Y | X1, X2)
    F  F  F |  0.3 · 0.5 · 0.5 = 0.075                |      F      |   0.555556
    F  F  T |  0.2 · 0.6 · 0.5 = 0.060                |      F      |   0.444444
    F  T  F |  0.3 · 0.5 · 0.5 = 0.075                |      F      |   0.652174
    F  T  T |  0.2 · 0.4 · 0.5 = 0.040                |      F      |   0.347826
    T  F  F |  0.7 · 0.5 · 0.5 = 0.175                |      T      |   0.421686
    T  F  T |  0.8 · 0.6 · 0.5 = 0.240                |      T      |   0.578314
    T  T  F |  0.7 · 0.5 · 0.5 = 0.175                |      F      |   0.522388
    T  T  T |  0.8 · 0.4 · 0.5 = 0.160                |      F      |   0.477612

    Note: the joint probabilities corresponding to incorrect predictions (the rows F F T, F T T, T F F and T T T) are the ones summed below.

    error (def.) = E_P[1{YNB(X1,X2) ≠ Y}] = Σ_{X1,X2,Y} 1{YNB(X1,X2) ≠ Y} · P(X1, X2, Y) = 0.060 + 0.040 + 0.175 + 0.160 = 0.435

    Note: 1{·} is the indicator function; its value is 1 whenever the associated condition (in our case, YNB(X1, X2) ≠ Y) is true, and 0 otherwise.
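    A compact Python sketch that enumerates the eight (X1, X2, Y) combinations and reproduces the 0.435 expected error rate.

```python
from itertools import product

p_hike = 0.5
p_sunny = {True: 0.8, False: 0.7}    # P(Sunny | Hike) and P(Sunny | not Hike)
p_windy = {True: 0.4, False: 0.5}

def joint(x1, x2, y):
    """True joint P(X1, X2, Y); by assumption it satisfies the NB factorization."""
    p = p_hike if y else 1 - p_hike
    p *= p_sunny[y] if x1 else 1 - p_sunny[y]
    p *= p_windy[y] if x2 else 1 - p_windy[y]
    return p

error = 0.0
for x1, x2 in product([False, True], repeat=2):
    scores = {y: joint(x1, x2, y) for y in (False, True)}
    y_nb = max(scores, key=scores.get)                 # the NB prediction
    error += sum(p for y, p in scores.items() if y != y_nb)
print(error)                                           # 0.435
```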

    50.

  • Next, suppose that we gather more information about weather conditions and introduce a new feature denoting whether the weather is X3: Rainy or not. Assume that each day the weather in CA can be either Rainy or Sunny. That is, it cannot be both Sunny and Rainy. (Similarly, it cannot be ¬Sunny and ¬Rainy.)

    c. In the above new case, are any of the Naive Bayes assumptions violated? Why (not)? What is the joint probability that Alice and Bob go hiking and the weather is sunny, windy and not rainy, that is, P(Sunny, Windy, ¬Rainy, Hike)?

    Solution:

    The Naive Bayes assumption that the variables are conditionally independent given the class label is violated. Indeed, knowing whether the weather is Sunny completely determines whether it is Rainy or not. Therefore, Sunny and Rainy are clearly NOT conditionally independent given Hike.

    P(Sunny, Windy, ¬Rainy, Hike) = P(¬Rainy | Hike, Sunny, Windy) [which equals 1] · P(Sunny, Windy | Hike) · P(Hike)
    = (cond. indep.) P(Sunny|Hike) · P(Windy|Hike) · P(Hike) = 0.8 · 0.4 · 0.5 = 0.16.

    51.

  • d. What is the expected error rate when the Naive Bayes classifier uses all three attributes? Does the performance of Naive Bayes improve by observing the new attribute Rainy? Explain why.

    Solution:

    PNB(X1, X2, X3, Y) = P(X3|Y) · P(X1|Y) · P(X2|Y) · P(Y):

    X1 X2 X3 Y | P(X1, X2, X3, Y) [true] | P(X3|Y) · P(X1|Y) · P(X2|Y) · P(Y) | YNB(X1, X2, X3) | PNB(Y | X1, X2, X3)
    F  F  F  F |        0                |  0.075 · 0.7 = 0.0525              |        F        |   0.522388
    F  F  F  T |        0                |  0.060 · 0.8 = 0.0480              |        F        |   0.477612
    F  F  T  F |        0.075            |  0.075 · 0.3 = 0.0225              |        F        |   0.652174
    F  F  T  T |        0.060            |  0.060 · 0.2 = 0.0120              |        F        |   0.347826
    F  T  F  F |        0                |  0.075 · 0.7 = 0.0525              |        F        |   0.621302
    F  T  F  T |        0                |  0.040 · 0.8 = 0.0320              |        F        |   0.378698
    F  T  T  F |        0.075            |  0.075 · 0.3 = 0.0225              |        F        |   0.737705
    F  T  T  T |        0.040            |  0.040 · 0.2 = 0.0080              |        F        |   0.262295
    T  F  F  F |        0.175            |  0.175 · 0.7 = 0.1225              |        T        |   0.389507
    T  F  F  T |        0.240            |  0.240 · 0.8 = 0.1920              |        T        |   0.610493
    T  F  T  F |        0                |  0.175 · 0.3 = 0.0525              |        F        |   0.522388
    T  F  T  T |        0                |  0.240 · 0.2 = 0.0480              |        F        |   0.477612
    T  T  F  F |        0.175            |  0.175 · 0.7 = 0.1225              |        T        |   0.489022
    T  T  F  T |        0.160            |  0.160 · 0.8 = 0.1280              |        T        |   0.510978
    T  T  T  F |        0                |  0.175 · 0.3 = 0.0525              |        F        |   0.621302
    T  T  T  T |        0                |  0.160 · 0.2 = 0.0320              |        F        |   0.378698

    (Here the first probability column is the true joint distribution, which is 0 whenever X3 = Rainy contradicts X1 = Sunny; the second column is the factorized distribution used by Naive Bayes.)

    52.

  • The new error rate is: E_P[1{YNB(X1,X2,X3) ≠ Y}] = 0.060 + 0.040 + 0.175 + 0.175 = 0.45 > 0.435 (see question b).

    Important Remark:

    Please notice that we always compute the error rate with respect to P, the true distribution, and not PNB, which is the distribution computed by Naive Bayes by using the conditional independence assumption.

    Here above, the performance of the Naive Bayes classifier drops because the conditional independence assumptions do not hold for the correlated features.

    53.

  • How bad/naive is Naive Bayes?

    CMU, 2010 spring, E. Xing, T. Mitchell, A. Singh, midterm, pr. 2.1

    54.

  • Clearly Naive Bayes makes what, in many cases, are overly strong assumptions. But even if those assumptions aren’t true, is it possible that Naive Bayes is still pretty good? Here we will use a simple example to explore the limitations of Naive Bayes.

    Let X1 and X2 be i.i.d. Bernoulli(0.5) random variables, and let Y ∈ {1, 2} be some deterministic function of the values of X1 and X2.

    a. Find a function Y for which the Naive Bayes classifier has a 50% error rate. Given the value of Y, how are X1 and X2 correlated?

    b. Show that for every function Y, the Naive Bayes classifier will perform no worse than the one above. Hint: there are many Y functions, but because of symmetries in the problem you only need to analyze a few of them.

    55.

  • Answer

    a. Consider Y defined according to the nearby table.

    X1  X2  Y
     0   0  1
     0   1  2
     1   0  2
     1   1  1

    Remark: If the value of Y is considered fixed (either 1 or 2), then we can state a rule such that if we know X1 we can determine X2 (and vice versa).(a) In other words, X1 is uniquely determined by X2 (and vice versa) given a fixed value of Y. So the conditional independence condition is violated. Moreover, in this case we have the maximum possible amount of “dependence” between the two variables (with respect to Y).

    On the next slide we will compute the error rate obtained by the Naive Bayes algorithm on the data in the table above.

    (a) For Y = 1, the rule is: X2 has the same value as X1. For Y = 2, the rule is: X1 and X2 have complementary values.

    56.

  • Naive Bayes estimates the value of Y as follows:

    ŷ = argmax_{y∈{1,2}} P(X1 | Y = y) P(X2 | Y = y) P(Y = y)

    For X1 = 0, X2 = 0, the algorithm compares the following two values:

    p1 = P(X1 = 0 | Y = 1) P(X2 = 0 | Y = 1) P(Y = 1) = 1/2 · 1/2 · 1/2 = 1/8
    p2 = P(X1 = 0 | Y = 2) P(X2 = 0 | Y = 2) P(Y = 2) = 1/2 · 1/2 · 1/2 = 1/8

    Since p1 = p2, the algorithm will choose one of them with probability 0.5. Because the value of Y in the table is 1, the algorithm will choose wrongly in 50% of the cases.

    For the other 3 cases, (X1 = 0, X2 = 1), (X1 = 1, X2 = 0) and (X1 = 1, X2 = 1), it is easy to see that equal values are obtained as well, and the algorithm will choose one of the values 1 or 2 for Y with probability 0.5.

    So for this definition of Y the error rate is 50%.

    57.

  • b. We will compute the error rate for each of the 3 ways of defining Y that have not been studied yet.

    Case 1:

    X1  X2  Y
     0   0  1
     0   1  1
     1   0  1
     1   1  1

    It is similar to the case Y = (2, 2, 2, 2).

    • For X1 = 0, X2 = 0, the algorithm compares:

    p1 = P(X1 = 0 | Y = 1) P(X2 = 0 | Y = 1) P(Y = 1) = 2/4 · 2/4 · 1 = 1/4
    p2 = P(X1 = 0 | Y = 2) P(X2 = 0 | Y = 2) P(Y = 2) = 0 · 0 · 0 = 0

    Since p1 > p2, the algorithm chooses the value 1 for Y, which is correct.

    • For the other 3 cases, (X1 = 0, X2 = 1), (X1 = 1, X2 = 0) and (X1 = 1, X2 = 1), the same values of p1 and p2 as above are obtained, so the algorithm chooses (correctly) the value 1 for Y.

    Hence, the error rate is 0 in this case.

    58.

  • Case 2:

    X1  X2  Y
     0   0  1
     0   1  1
     1   0  1
     1   1  2

    Similar cases: Y = (1,1,2,1), (1,2,1,1), (2,1,1,1), (2,2,2,1), (2,2,1,2), (2,1,2,2), (1,2,2,2).

    • For X1 = 0, X2 = 0:

    p1 = P(X1 = 0 | Y = 1) P(X2 = 0 | Y = 1) P(Y = 1) = 2/3 · 2/3 · 3/4 = 1/3
    p2 = P(X1 = 0 | Y = 2) P(X2 = 0 | Y = 2) P(Y = 2) = 0 · 0 · 1/4 = 0
    ⇒ ŷ = 1

    • For X1 = 0, X2 = 1:

    p1 = P(X1 = 0 | Y = 1) P(X2 = 1 | Y = 1) P(Y = 1) = 2/3 · 1/3 · 3/4 = 1/6
    p2 = P(X1 = 0 | Y = 2) P(X2 = 1 | Y = 2) P(Y = 2) = 0 · 1 · 1/4 = 0
    ⇒ ŷ = 1

    59.

  • • For X1 = 1, X2 = 0:

    p1 = P(X1 = 1 | Y = 1) P(X2 = 0 | Y = 1) P(Y = 1) = 1/3 · 2/3 · 3/4 = 1/6
    p2 = P(X1 = 1 | Y = 2) P(X2 = 0 | Y = 2) P(Y = 2) = 1 · 0 · 1/4 = 0
    ⇒ ŷ = 1

    • For X1 = 1, X2 = 1:

    p1 = P(X1 = 1 | Y = 1) P(X2 = 1 | Y = 1) P(Y = 1) = 1/3 · 1/3 · 3/4 = 1/12
    p2 = P(X1 = 1 | Y = 2) P(X2 = 1 | Y = 2) P(Y = 2) = 1 · 1 · 1/4 = 1/4
    ⇒ ŷ = 2

    So the error rate is 0 for this definition of Y as well.

    60.

  • Case 3:

    X1  X2  Y
     0   0  1
     0   1  2
     1   0  1
     1   1  2

    Similar cases: Y = (2,1,2,1), (1,1,2,2), (2,2,1,1).

    • For X1 = 0, X2 = 0:
    p1 = 1/2 · 1 · 1/2 = 1/4, p2 = 1/2 · 0 · 1/2 = 0 ⇒ p1 > p2 ⇒ ŷ = 1 (correct)

    • For X1 = 0, X2 = 1:
    p1 = 1/2 · 0 · 1/2 = 0, p2 = 1/2 · 1 · 1/2 = 1/4 ⇒ p1 < p2 ⇒ ŷ = 2 (correct)

    • For X1 = 1, X2 = 0:
    p1 = 1/2 · 1 · 1/2 = 1/4, p2 = 1/2 · 0 · 1/2 = 0 ⇒ p1 > p2 ⇒ ŷ = 1 (correct)

    • For X1 = 1, X2 = 1:
    p1 = 1/2 · 0 · 1/2 = 0, p2 = 1/2 · 1 · 1/2 = 1/4 ⇒ p1 < p2 ⇒ ŷ = 2 (correct)

    Consequently, the error rate is 0 in this case as well.

    61.

  • Case 4 (the one from part a):

    X1  X2  Y
     0   0  1
     0   1  2
     1   0  2
     1   1  1

    It is similar to the case Y = (2,1,1,2).

    Conclusion: Only for 2 ways of defining Y (case 4) is the error rate 50%; for the other 14 ways (cases 1, 2, 3) the error rate is 0.
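    A brute-force Python sketch over all 16 labelings Y: {0,1}² → {1,2} confirms the conclusion: the expected error of Naive Bayes is 0.5 only for the two XOR-like labelings, and 0 for the other 14.

```python
from itertools import product

inputs = list(product([0, 1], repeat=2))     # the four equally likely (X1, X2) pairs

def nb_error_rate(labels):
    """Expected error of Naive Bayes for Y given by `labels`; ties count as 1/2."""
    data = dict(zip(inputs, labels))
    err = 0.0
    for x in inputs:
        scores = {}
        for y in (1, 2):
            rows = [z for z in inputs if data[z] == y]
            if not rows:
                scores[y] = 0.0
                continue
            p1 = sum(z[0] == x[0] for z in rows) / len(rows)   # P(X1 = x1 | Y = y)
            p2 = sum(z[1] == x[1] for z in rows) / len(rows)   # P(X2 = x2 | Y = y)
            scores[y] = p1 * p2 * len(rows) / 4                # times the prior P(Y = y)
        if scores[1] == scores[2]:
            err += 0.5 * 0.25            # tie: wrong half of the time on this input
        elif max(scores, key=scores.get) != data[x]:
            err += 0.25
    return err

for labels in product([1, 2], repeat=4):
    print(labels, nb_error_rate(labels))
```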

    62.

  • Unlike Naive Bayes, the Joint Bayes classifier has a 0 training error rate for all boolean functions (and even all mathematically defined functions on categorical attributes);

    “in-between Naive and Joint” Bayesian classifiers

    CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW3, pr. 1.2

    63.

  • Suppose we have a function Y = (A ∧ B) ∨ ¬(B ∨ C), where A, B, C are independent binary random variables, each having a 50% chance of being 0.

    a. How many parameters does a Naive Bayes classifier need to estimate (without counting P(¬x) as a parameter if P(x) is already counted as an estimated parameter)? What will be the error rate of the Naive Bayes classifier (assuming infinite training data)?

    b. How many parameters does the Joint Bayes classifier need to estimate? What will be the error rate of the Joint Bayes classifier (assuming infinite training data)?

    64.

  • Answer

    yNB = argmax_{y∈Val(Y)} P(Y = y|A = a, B = b, C = c)
    = argmax_{y∈Val(Y)} P(A = a, B = b, C = c|Y = y) · P(Y = y) / P(A = a, B = b, C = c)
    = argmax_{y∈Val(Y)} P(A = a, B = b, C = c|Y = y) · P(Y = y)
    = argmax_{y∈Val(Y)} P(A = a|Y = y) · P(B = b|Y = y) · P(C = c|Y = y) · P(Y = y)

    Naive Bayes will need to estimate:

    P(Y = 0) −→ P(Y = 1) = 1 − P(Y = 0)
    P(A = 0|Y = 0) −→ P(A = 1|Y = 0) = 1 − P(A = 0|Y = 0)
    P(A = 0|Y = 1) −→ P(A = 1|Y = 1) = 1 − P(A = 0|Y = 1)
    P(B = 0|Y = 0) −→ P(B = 1|Y = 0) = 1 − P(B = 0|Y = 0)
    P(B = 0|Y = 1) −→ P(B = 1|Y = 1) = 1 − P(B = 0|Y = 1)
    P(C = 0|Y = 0) −→ P(C = 1|Y = 0) = 1 − P(C = 0|Y = 0)
    P(C = 0|Y = 1) −→ P(C = 1|Y = 1) = 1 − P(C = 0|Y = 1)

    Altogether there are 7 parameters. In general, for n input binary variables it would have been 2n + 1 parameters.

    65.

  • To compute the error rate we can construct a Boolean table of the function and use it to estimate probabilities (since we assume infinite training data).

    A B C | A ∧ B | B ∨ C | ¬(B ∨ C) | Y
    0 0 0 |   0   |   0   |    1     | 1
    0 0 1 |   0   |   1   |    0     | 0
    0 1 0 |   0   |   1   |    0     | 0
    0 1 1 |   0   |   1   |    0     | 0
    1 0 0 |   0   |   0   |    1     | 1
    1 0 1 |   0   |   1   |    0     | 0
    1 1 0 |   1   |   1   |    0     | 1
    1 1 1 |   1   |   1   |    0     | 1

    The estimated parameters are:

    P(Y = 0) = 1/2 −→ P(Y = 1) = 1 − P(Y = 0) = 1/2

    P(A|Y) | A = 0 | A = 1
    Y = 0  |  3/4  |  1/4
    Y = 1  |  1/4  |  3/4

    P(B|Y) | B = 0 | B = 1
    Y = 0  |  2/4  |  2/4
    Y = 1  |  2/4  |  2/4

    P(C|Y) | C = 0 | C = 1
    Y = 0  |  1/4  |  3/4
    Y = 1  |  3/4  |  1/4

    66.

  • The predictions of the Naive Bayes classifier are then as follows (assuming that in case of a tie it always predicts 1):

    A B C | PNB(A, B, C, Y = 0)       | PNB(A, B, C, Y = 1)       | YNB | err.
    0 0 0 | 3/4 · 1/2 · 1/4 · 1/2     | 1/4 · 1/2 · 3/4 · 1/2     |  1  | no
    0 0 1 | 3/4 · . . . · 3/4 · . . . | 1/4 · . . . · 1/4 · . . . |  0  | no
    0 1 0 | 3/4 · . . . · 1/4 · . . . | 1/4 · . . . · 3/4 · . . . |  1  | yes
    0 1 1 | 3/4 · . . . · 3/4 · . . . | 1/4 · . . . · 1/4 · . . . |  0  | no
    1 0 0 | 1/4 · . . . · 1/4 · . . . | 3/4 · . . . · 3/4 · . . . |  1  | no
    1 0 1 | 1/4 · . . . · 3/4 · . . . | 3/4 · . . . · 1/4 · . . . |  1  | yes
    1 1 0 | 1/4 · . . . · 1/4 · . . . | 3/4 · . . . · 3/4 · . . . |  1  | no
    1 1 1 | 1/4 · . . . · 3/4 · . . . | 3/4 · . . . · 1/4 · . . . |  1  | no

    From this, we can compute the error rate as the number of mistakes over the number of possible inputs (since each input is equally likely): error rate = 2/8 = 0.25.
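    The 0.25 error rate can be re-derived with a short Python sketch that builds the truth table, estimates the NB parameters from it and then classifies every input (ties broken in favor of 1, as above).

```python
from itertools import product

def f(a, b, c):
    return int((a and b) or not (b or c))          # Y = (A ∧ B) ∨ ¬(B ∨ C)

rows = [(a, b, c, f(a, b, c)) for a, b, c in product([0, 1], repeat=3)]

def cond(idx, val, label):
    """P(X_idx = val | Y = label), estimated from the 8 equally likely rows."""
    match = [r for r in rows if r[3] == label]
    return sum(r[idx] == val for r in match) / len(match)

prior = {lab: sum(r[3] == lab for r in rows) / 8 for lab in (0, 1)}

errors = 0
for a, b, c, true_y in rows:
    score = {lab: cond(0, a, lab) * cond(1, b, lab) * cond(2, c, lab) * prior[lab]
             for lab in (0, 1)}
    pred = 1 if score[1] >= score[0] else 0        # tie -> predict 1
    errors += pred != true_y
print(errors / 8)                                   # 0.25
```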

    67.

  • yJB = argmax_{y∈Val(Y)} P(Y = y|A = a, B = b, C = c)
    = argmax_{y∈Val(Y)} P(A = a, B = b, C = c|Y = y) · P(Y = y) / P(A = a, B = b, C = c)
    = argmax_{y∈Val(Y)} P(A = a, B = b, C = c|Y = y) · P(Y = y)

    Joint Bayes will need to estimate:

    P(Y = 0) −→ P(Y = 1) = 1 − P(Y = 0)

    P(A = 0, B = 0, C = 0|Y = 0)
    P(A = 0, B = 0, C = 1|Y = 0)
    . . .
    P(A = 1, B = 1, C = 0|Y = 0)
    −→ P(A = 1, B = 1, C = 1|Y = 0) = 1 − . . .

    P(A = 0, B = 0, C = 0|Y = 1)
    P(A = 0, B = 0, C = 1|Y = 1)
    . . .
    P(A = 1, B = 1, C = 0|Y = 1)
    −→ P(A = 1, B = 1, C = 1|Y = 1) = 1 − . . .

    Thus, there are 1 + 2 · (2^3 − 1) = 15 parameters. In general, for n input binary random variables it would have been 1 + 2 · (2^n − 1) = 2^(n+1) − 1.

    The error rate of the Joint Bayes classifier is zero (assuming infinite training data) since it can model an arbitrarily complex Boolean function.

    68.

  • c. Consider a Bayes classifier that assumes that A is independent of C when conditioned on B and on Y (unlike a Naive Bayes classifier, which assumes that A, B, C are all independent of each other when conditioned on Y).

    Show that this Bayes classifier will need to estimate fewer parameters than the Joint Bayes classifier, but will still have the same error rate (assuming infinite training data). Compute the error rate of this classifier.

    Answer

    Starting from Y = (A ∧ B) ∨ ¬(B ∨ C) = (A ∧ B) ∨ ((¬B) ∧ (¬C)), one can infer that A is independent of C when conditioned on B and on Y.

    LC: Just to convince yourself... By using the truth table which we have already written for Y = (A ∧ B) ∨ ¬(B ∨ C), you can easily check the equalities

    P(A = a, C = c|B = b, Y = y) = P(A = a|B = b, Y = y) · P(C = c|B = b, Y = y)

    if you analyse — for each of the four combinations of values for b and y ∈ {0, 1}, separately — the probabilities P(A = a, C = c| . . .), P(A = a| . . .), P(C = c| . . .).

    69.

  • Therefore P(A, B, C|Y) = P(A, C|B, Y) · P(B|Y) = P(A|B, Y) · P(C|B, Y) · P(B|Y). For any input A = a, B = b, C = c, our new Bayes classifier will predict

    ynewBayes = argmax_{y∈{0,1}} P(A = a|B = b, Y = y) · P(C = c|B = b, Y = y) · P(B = b|Y = y) · P(Y = y),

    which, due to the above equality, will be exactly the same as the output of Joint Bayes. Therefore, the error rate of the new Bayes classifier will be zero.

    The parameters it needs to estimate are:

    P(Y = 0),
    P(B = 0|Y = 0), P(B = 0|Y = 1),
    P(A = 0|B = 0, Y = 0), P(A = 0|B = 0, Y = 1), P(A = 0|B = 1, Y = 0), P(A = 0|B = 1, Y = 1),
    P(C = 0|B = 0, Y = 0), P(C = 0|B = 0, Y = 1), P(C = 0|B = 1, Y = 0), P(C = 0|B = 1, Y = 1).

    Thus, there are 11 parameters to estimate, which is more than what’s required by Naive Bayes (7), but less than Joint Bayes (15).

    70.

  • LC: Just to convince yourself...

    Using the Boolean table of Y = (A ∧ B) ∨ ¬(B ∨ C), the estimates of these parameters are:

    P(Y = 0) = 1/2,
    P(B = 0|Y = 0) = 1/2, P(B = 0|Y = 1) = 1/2,
    P(A = 0|B = 0, Y = 0) = 1/2, P(A = 0|B = 0, Y = 1) = 1/2,
    P(A = 0|B = 1, Y = 0) = 1, P(A = 0|B = 1, Y = 1) = 0,
    P(C = 0|B = 0, Y = 0) = 0, P(C = 0|B = 0, Y = 1) = 1,
    P(C = 0|B = 1, Y = 0) = 1/2, P(C = 0|B = 1, Y = 1) = 1/2.

    The predictions of the classifier are then as follows (there are no ties!):

    A B C | P(A|B,Y=0) · P(C|B,Y=0) · P(B|Y=0) · P(Y=0) | P(A|B,Y=1) · P(C|B,Y=1) · P(B|Y=1) · P(Y=1) | YnB | err.
    0 0 0 | 1/2 · 0 · 1/2 · 1/2                         | 1/2 · 1 · 1/2 · 1/2                         |  1  | no
    0 0 1 | 1/2 · 1 · . . . · . . .                     | 1/2 · 0 · . . . · . . .                     |  0  | no
    0 1 0 | 1 · 1/2 · . . . · . . .                     | 0 · 1/2 · . . . · . . .                     |  0  | no
    0 1 1 | 1 · 1/2 · . . . · . . .                     | 0 · 1/2 · . . . · . . .                     |  0  | no
    1 0 0 | 1/2 · 0 · . . . · . . .                     | 1/2 · 1 · . . . · . . .                     |  1  | no
    1 0 1 | 1/2 · 1 · . . . · . . .                     | 1/2 · 0 · . . . · . . .                     |  0  | no
    1 1 0 | 0 · 1/2 · . . . · . . .                     | 1 · 1/2 · . . . · . . .                     |  1  | no
    1 1 1 | 0 · 1/2 · . . . · . . .                     | 1 · 1/2 · . . . · . . .                     |  1  | no

    This corresponds to a zero error rate.

    71.

  • Computing

    the sample complexity of the Naive Bayes and Joint Bayes classifiers

    CMU, 2010 spring, Eric Xing, Tom Mitchell, Aarti Singh, HW2, pr. 1.1

    72.

  • A big reason we use the Naive Bayes classifier is that it requires less training data than the Joint Bayes classifier. This exercise should give you a “feeling” for how great the disparity really is.

    Imagine that each instance is an independent “observation” of the multivariate random variable X̄ = (X1, ..., Xd), where the Xi are i.i.d. and Bernoulli of parameter p = 0.5.

    To train the Joint Bayes classifier, we need to see every value of X̄ “enough” times; training the Naive Bayes classifier only requires seeing both values of each Xi “enough” times.

    Main Question: How many “observations”/instances are needed until, with probability 1 − ε, we have seen every variable we need to see at least once?
    Note: To train the classifiers well would require more than this, but for this problem we only require one observation.

    Hint: You may want to use the following inequalities:

    • For any k ≥ 1, (1 − 1/k)^k ≤ e^(−1)
    • For any events E1, ..., Ek, Pr(E1 ∪ . . . ∪ Ek) ≤ Σ_{i=1}^{k} Pr(Ei). (This is called the “union bound” property.)

    73.

  • Consider the Naive Bayes classifier.

    a. Show that if N observations have been made, the probability that a given value of Xi (either 0 or 1) has not been seen is 1/2^(N−1).

    b. Show that if more than NNB = 1 + log2(d/ε) observations have been made, then the probability that any Xi has not been observed in both states is ≤ ε.

    Solution:

    a. P(component Xi not seen in both states) = (1/2)^N + (1/2)^N = 2/2^N = 1/2^(N−1)

    b. P(any component not seen in both states)
    ≤ Σ_{i=1}^{d} P(component Xi not seen in both states)
    = Σ_{i=1}^{d} 1/2^(NNB−1) = d · 1/2^(NNB−1) = d · 1/2^(1 + log2(d/ε) − 1) = d · 1/2^(log2(d/ε)) = d · 1/(d/ε) = d · ε/d = ε

    74.

  • Consider the Joint Bayes classifier.

    c. Let x̄ be a particular value of X̄. Show that after N observations, the probability that we have never seen x̄ is ≤ e^(−N/2^d).

    d. Using the “union bound” property, show that if more than NJB = 2^d ln(2^d/ε) observations have been made, then the probability that any value of X̄ has not been seen is ≤ ε.

    Solution:

    c. P(x̄ not seen in N observations) = (1 − 1/2^d)^N = [(1 − 1/2^d)^(2^d)]^(N/2^d) ≤ (1/e)^(N/2^d) = e^(−N/2^d)

    d. P(any x̄ not seen in NJB observations)
    ≤ Σ_{x̄} P(x̄ not seen in NJB observations)
    ≤ Σ_{x̄} e^(−NJB/2^d) = 2^d · e^(−NJB/2^d) = 2^d · e^(−ln(2^d/ε)) = 2^d · 1/(2^d/ε) = ε

    75.

  • e. Let d = 2 and ε = 0.1. What are the values of NNB and NJB? What about d = 5? And d = 10?

    Solution:

    ε = 0.1, d = 2 ⇒ NNB = 1 + log2(2/0.1) = 1 + log2 20 ≈ 5.32;  NJB = 2^2 · ln(2^2/0.1) = 4 · ln 40 ≈ 14.75

    ε = 0.1, d = 5 ⇒ NNB = 1 + log2(5/0.1) = 1 + log2 50 ≈ 6.64;  NJB = 2^5 · ln(2^5/0.1) = 32 · ln 320 ≈ 184.58

    ε = 0.1, d = 10 ⇒ NNB = 1 + log2(10/0.1) = 1 + log2 100 ≈ 7.64;  NJB = 2^10 · ln(2^10/0.1) = 1024 · ln 10240 ≈ 9455.67
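    The same numbers can be obtained with a small Python helper, which also makes it easy to see how quickly NJB blows up with d.

```python
import math

def n_nb(d, eps):
    """Naive Bayes bound: N_NB = 1 + log2(d / eps)."""
    return 1 + math.log2(d / eps)

def n_jb(d, eps):
    """Joint Bayes bound: N_JB = 2^d * ln(2^d / eps)."""
    return 2 ** d * math.log(2 ** d / eps)

for d in (2, 5, 10):
    print(d, round(n_nb(d, 0.1), 2), round(n_jb(d, 0.1), 2))
# 2 5.32 14.76 ; 5 6.64 184.59 ; 10 7.64 9455.67
```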

    76.

  • The relationship between [the decision rules of]

    Naive Bayes and Logistic Regression;

    the case of Boolean input variables

    CMU, 2005 fall, Tom Mitchell, Andrew Moore, HW2, pr. 2

    CMU, 2009 fall, Carlos Guestrin, HW1, pr. 4.1.2

    CMU, 2009 fall, Geoff Gordon, HW4, pr. 1.2-3

    CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, HW2, pr. 3.a

    77.

  • a. [NB and LR: the relationship between the decision rules]

    In Tom’s draft chapter (Generative and discriminative classifiers: Naive Bayes and logistic regression) it has been proved that when Y follows a Bernoulli distribution and X = (X1, . . . , Xd) is a vector of Gaussian variables, then under certain assumptions the Gaussian Naive Bayes classifier implies that P(Y|X) is given by the logistic function with appropriate parameters w. So,

    P(Y = 1|X) = 1 / (1 + exp(w0 + Σ_{i=1}^{d} wi Xi)),

    and therefore,

    P(Y = 0|X) = exp(w0 + Σ_{i=1}^{d} wi Xi) / (1 + exp(w0 + Σ_{i=1}^{d} wi Xi)).

    Consider instead the case where Y is Boolean (more generally, Bernoulli) and X = (X1, . . . , Xd) is a vector of Boolean variables. Prove for this case also that P(Y|X) follows this same form and hence that Logistic Regression is also the discriminative counterpart to a Naive Bayes generative classifier over Boolean features.

    Note: Discriminative classifiers learn the parameters of P(Y|X) directly, whereas generative classifiers instead learn the parameters of P(X|Y) and P(Y).

    78.

    Hints:

    1. Simple notation will help. Since the Xi’s are Boolean variables, you need only one parameter to define P(Xi|Y = yk) for each i = 1, . . . , d. Define θi1 = P(Xi = 1|Y = 1), in which case P(Xi = 0|Y = 1) = 1 − θi1. Similarly, use θi0 to denote P(Xi = 1|Y = 0).

    2. Notice that with the above notation you can represent P(Xi|Y = 1) as follows:

    P(Xi|Y = 1) = θi1^Xi · (1 − θi1)^(1−Xi),

    except for the cases when θi1 = 0 and Xi = 0, respectively θi1 = 1 and Xi = 1. Note that when Xi = 1 the second term is equal to 1 because its exponent is zero. Similarly, when Xi = 0 the first term is equal to 1 because its exponent is zero.

    79.

  • Solution

    P(Y = 1|X = x) = (Bayes) P(X = x|Y = 1) P(Y = 1) / Σ_{y′∈{0,1}} P(X = x|Y = y′) P(Y = y′)

    = 1 / (1 + P(X = x|Y = 0) P(Y = 0) / (P(X = x|Y = 1) P(Y = 1)))

    = 1 / (1 + exp(ln [P(X = x|Y = 0) P(Y = 0) / (P(X = x|Y = 1) P(Y = 1))]))

    = 1 / (1 + exp(ln [P(X1 = x1, . . . , Xd = xd|Y = 0) P(Y = 0) / (P(X1 = x1, . . . , Xd = xd|Y = 1) P(Y = 1))]))

    = (cond. indep.) 1 / (1 + exp(ln [P(Y = 0)/P(Y = 1)] + Σ_{i=1}^{d} ln [P(Xi = xi|Y = 0)/P(Xi = xi|Y = 1)]))

    Conditions:

    1. P(X = x|Y = 1) P(Y = 1) ≠ 0;
    2. P(X = x|Y = 0) P(Y = 0) ≠ 0;
    3. P(Xi = xi|Y = 0) ≠ 0 and P(Xi = xi|Y = 1) ≠ 0.

    80.

  • Prior probabilities are: P(Y = 1) = π and P(Y = 0) = 1 − π. Also, each Xi follows a Bernoulli distribution:

    P(Xi|Y = 1) = θi1^Xi (1 − θi1)^(1−Xi), and P(Xi|Y = 0) = θi0^Xi (1 − θi0)^(1−Xi).

    So,

    P(Y = 1|X = x) = 1 / (1 + exp(ln[(1 − π)/π] + Σ_{i=1}^{d} ln [θi0^Xi (1 − θi0)^(1−Xi) / (θi1^Xi (1 − θi1)^(1−Xi))]))

    = 1 / (1 + exp(ln[(1 − π)/π] + Σ_{i=1}^{d} (Xi ln(θi0/θi1) + (1 − Xi) ln[(1 − θi0)/(1 − θi1)])))

    = 1 / (1 + exp(ln[(1 − π)/π] + Σ_{i=1}^{d} ln[(1 − θi0)/(1 − θi1)] + Σ_{i=1}^{d} Xi (ln(θi0/θi1) − ln[(1 − θi0)/(1 − θi1)])))

    Therefore, in order to reach P(Y = 1|X = x) = 1/(1 + exp(w0 + Σ_{i=1}^{d} wi Xi)), we can set

    w0 = ln[(1 − π)/π] + Σ_{i=1}^{d} ln[(1 − θi0)/(1 − θi1)]   and   wi = ln(θi0/θi1) − ln[(1 − θi0)/(1 − θi1)]   for i = 1, . . . , d.
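    A numerical sanity check of this mapping, in Python; the Naive Bayes parameters below (π and the θ’s, for d = 3) are hypothetical values chosen only for illustration.

```python
import math, itertools

pi = 0.3                         # P(Y = 1)                        (hypothetical)
theta1 = [0.8, 0.4, 0.6]         # theta_i1 = P(X_i = 1 | Y = 1)   (hypothetical)
theta0 = [0.2, 0.5, 0.7]         # theta_i0 = P(X_i = 1 | Y = 0)   (hypothetical)

# Logistic-regression weights obtained from the formulas derived above
w0 = math.log((1 - pi) / pi) + sum(math.log((1 - t0) / (1 - t1))
                                   for t0, t1 in zip(theta0, theta1))
w = [math.log(t0 / t1) - math.log((1 - t0) / (1 - t1))
     for t0, t1 in zip(theta0, theta1)]

def p_y1_nb(x):
    """P(Y = 1 | X = x) computed directly from the generative NB model."""
    def joint(theta, prior):
        p = prior
        for t, xi in zip(theta, x):
            p *= t if xi else (1 - t)
        return p
    a, b = joint(theta1, pi), joint(theta0, 1 - pi)
    return a / (a + b)

def p_y1_lr(x):
    """The same posterior via the logistic form 1 / (1 + exp(w0 + sum_i w_i x_i))."""
    return 1 / (1 + math.exp(w0 + sum(wi * xi for wi, xi in zip(w, x))))

for x in itertools.product([0, 1], repeat=3):
    assert abs(p_y1_nb(x) - p_y1_lr(x)) < 1e-12    # the two forms agree on every input
```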

    81.

  • Note

    Although here we derived for P (Y |X) a form which is specific toLogistic Regression starting from the decision rule (better said,from the expression of P (Y = 1|X)) used by Naive Bayes, thisdoes not mean that Logistic Regression itself uses the conditionalindependence assumption.

    82.

  • b. [Relaxing the conditional independence assumption]

    To capture interactions between features, the Logistic Regression model can be supplemented with extra terms. For example, a term can be added to capture a dependency between X1 and X2:

    P(Y = 1|X) = 1 / (1 + exp(w0 + w1,2 X1 X2 + Σ_{i=1}^{d} wi Xi))

    Similarly, the conditional independence assumptions made by Naive Bayes can be relaxed so that X1 and X2 are not assumed to be conditionally independent. In this case, we can write:

    P(Y|X) = P(Y) P(X1, X2|Y) ∏_{i=3}^{d} P(Xi|Y) / P(X)

    Prove that in this case P(Y|X) follows the same form as the logistic regression model supplemented with the extra term that captures the dependency between X1 and X2 (and hence that the supplemented Logistic Regression model is the discriminative counterpart to this generative classifier).

    83.

    Hints:

    1. Using simple notation will help here as well. You need more parameters than before to describe the joint behaviour of X1, X2 and Y. So let’s define βijk = P(X1 = i, X2 = j | Y = k), for each i, j and k.

    2. The above notation can be used to represent P(X1, X2|Y = k) as follows:

    P(X1, X2|Y = k) = (β11k)^(X1X2) (β10k)^(X1(1−X2)) (β01k)^((1−X1)X2) (β00k)^((1−X1)(1−X2))

    for k ∈ {0, 1}, except for the cases when β11k = 0 and X1X2 = 0, or β10k = 0 and X1(1−X2) = 0, or β01k = 0 and (1−X1)X2 = 0, or β00k = 0 and (1−X1)(1−X2) = 0.

    84.

  • Solution

    P(Y = 1|X) = (Bayes) P(X|Y = 1) P(Y = 1) / [P(X|Y = 1) P(Y = 1) + P(X|Y = 0) P(Y = 0)]

    = 1 / (1 + P(X|Y = 0) P(Y = 0) / (P(X|Y = 1) P(Y = 1)))

    = 1 / (1 + exp(ln [P(X|Y = 0) P(Y = 0) / (P(X|Y = 1) P(Y = 1))]))

    = (cond. indep.) 1 / (1 + exp(ln [P(X1, X2|Y = 0) ∏_{i=3}^{d} P(Xi|Y = 0) P(Y = 0) / (P(X1, X2|Y = 1) ∏_{i=3}^{d} P(Xi|Y = 1) P(Y = 1))]))

    = 1 / (1 + exp(ln[(1 − π)/π] + Σ_{i=3}^{d} ln [P(Xi|Y = 0)/P(Xi|Y = 1)] + ln [P(X1, X2|Y = 0)/P(X1, X2|Y = 1)]))

    = 1 / (1 + exp(ln[(1 − π)/π] + Σ_{i=3}^{d} ln [θi0^Xi (1 − θi0)^(1−Xi) / (θi1^Xi (1 − θi1)^(1−Xi))] + ln [(β110)^(X1X2) (β100)^(X1(1−X2)) (β010)^((1−X1)X2) (β000)^((1−X1)(1−X2)) / ((β111)^(X1X2) (β101)^(X1(1−X2)) (β011)^((1−X1)X2) (β001)^((1−X1)(1−X2)))]))

    = 1 / (1 + exp(ln[(1 − π)/π] + Σ_{i=3}^{d} (Xi (ln(θi0/θi1) − ln[(1 − θi0)/(1 − θi1)]) + ln[(1 − θi0)/(1 − θi1)]) + ln(β000/β001) + w1 X1 + w2 X2 + w1,2 X1 X2))

    Conditions:

    1. P(X = x|Y = 1) P(Y = 1) ≠ 0;
    2. P(X = x|Y = 0) P(Y = 0) ≠ 0;
    3. P(X1, X2|Y = 0) ≠ 0 and P(X1, X2|Y = 1) ≠ 0; P(Xi|Y = 0) ≠ 0 and P(Xi|Y = 1) ≠ 0.

    85.

  • with

    w0 = ln[(1 − π)/π] + Σ_{i=3}^{d} ln[(1 − θi0)/(1 − θi1)] + ln(β000/β001)

    w1 = ln(β100/β101) + ln(β001/β000)

    w2 = ln(β010/β011) + ln(β001/β000)

    w1,2 = ln(β110/β111) + ln(β101/β100) + ln(β011/β010) + ln(β000/β001)

    wi = ln(θi0/θi1) − ln[(1 − θi0)/(1 − θi1)]   for i = 3, . . . , d.

    86.

  • Gaussian Bayesian Classification

    87.

  • Exemplifying the Gaussian [Naive] Bayes algorithm on data from R

    CMU, 2001 fall, Andrew Moore, midterm, pr. 3.a

    88.

  • Suppose you have the nearby training set with one real-valued input X and a categorical output Y that has two values.

    X  Y
    0  A
    2  A
    3  B
    4  B
    5  B
    6  B
    7  B

    a. You must learn from this data the parameters of the Gaussian Bayes classifier. Write your answer in the following table.

    µA =        σ²A =        P(Y = A) =
    µB =        σ²B =        P(Y = B) =

    89.

  • Solution

    To estimate the means µA and µB, we use the formula µMLE = (Σ_{i=1}^{n} xi)/n, where n is the number of training instances. Thus, µA = (Σ_{i=1}^{2} Xi)/2 = (0 + 2)/2 = 1, and µB = (Σ_{i=3}^{7} Xi)/5 = (3 + 4 + 5 + 6 + 7)/5 = 5.

    Similarly, to compute the variances σ²A and σ²B, we use the formula σ²MLE = (Σ_{i=1}^{n} (xi − µMLE)²)/n. Thus, σ²A = (1/2)[(0 − 1)² + (2 − 1)²] = 1, and σ²B = (1/5)[(3 − 5)² + (4 − 5)² + 0² + (6 − 5)² + (7 − 5)²] = (1/5) · 2 · [4 + 1] = 2.

    For the probabilities P(Y = A) and P(Y = B), we take into account the fact that Y is a Bernoulli-type variable. Thus, P(Y = A) = 2/7 and P(Y = B) = 5/7.

    Putting these estimates together, we obtain:

    µA = 1    σ²A = 1    P(Y = A) = 2/7
    µB = 5    σ²B = 2    P(Y = B) = 5/7
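    The same estimates (and the classification of X = 2 from part b below) can be checked with a short Python sketch.

```python
import math

data = [(0, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "B"), (6, "B"), (7, "B")]

def mle_params(label):
    xs = [x for x, y in data if y == label]
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)   # MLE variance (divide by n)
    prior = len(xs) / len(data)
    return mu, var, prior

print(mle_params("A"))   # (1.0, 1.0, 2/7)
print(mle_params("B"))   # (5.0, 2.0, 5/7)

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Classify X = 2 by comparing the class-weighted densities p(x | Y) * P(Y)
scores = {lab: gaussian_pdf(2, *mle_params(lab)[:2]) * mle_params(lab)[2]
          for lab in ("A", "B")}
print(max(scores, key=scores.get))   # 'A'
```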

    90.

  • b. Using the notation α = p(X = 2|Y = A) and β = p(X = 2|Y = B),

    − What is p(X = 2, Y = A)? (Answer in terms of α.)
    − What is p(X = 2, Y = B)? (Answer in terms of β.)
    − What is p(X = 2)? (Answer in terms of α and β.)
    − What is p(Y = A|X = 2)? (Answer in terms of α and β.)
    − How would the point X = 2 be classified by the Gaussian Bayes algorithm? (Answer in terms of α and β.)

    Solution

    p(X = 2, Y = A) = p(X = 2|Y = A) · P(Y = A) = 2α/7
    p(X = 2, Y = B) = p(X = 2|Y = B) · P(Y = B) = 5β/7
    p(X = 2) = p(X = 2|Y = A) · P(Y = A) + p(X = 2|Y = B) · P(Y = B) = (1/7)(2α + 5β)
    p(Y = A|X = 2) = p(Y = A, X = 2) / p(X = 2) = 2α / (2α + 5β).

    91.

  • The Gaussian [Naive] Bayes algorithm will assign the label Y = A to the point X = 2 if p(Y = A|X = 2) ≥ p(Y = B|X = 2) ⇔ 2α ≥ 5β ⇔ α ≥ (5/2)β.

    Using the values estimated for the parameters µ_A, µ_B, σ_A and σ_B at part a, we can write:

    α = (1/√(2π)) e^{−(2 − 1)²/2} = (1/√(2π)) e^{−1/2} and β = (1/(√(2π) · √2)) e^{−(2 − 5)²/(2 · 2)} = (1/(2√π)) e^{−9/4}.

    Therefore,

    α ≥ (5/2)β ⇔ (1/√(2π)) e^{−1/2} ≥ (5/(4√π)) e^{−9/4} ⇔ e^{7/4} ≥ 5/(2√2) ⇔ 7/4 ≥ ln(5/(2√2)) ⇔ 1.75 ≥ ln 5 − (3/2) ln 2 ⇔ 1.75 ≥ 0.5697 (true).

    Hence the Gaussian [Naive] Bayes algorithm assigns the label Y = A to the point X = 2.
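
    The same computation can be done numerically; below is a minimal sketch (mine) that evaluates α, β and the posterior p(Y = A|X = 2) with the parameters estimated at part a.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate normal density N(x; mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu_A, var_A, prior_A = 1.0, 1.0, 2 / 7     # estimates from part a
mu_B, var_B, prior_B = 5.0, 2.0, 5 / 7

x = 2.0
alpha = gaussian_pdf(x, mu_A, var_A)       # p(X = 2 | Y = A) ~ 0.2420
beta = gaussian_pdf(x, mu_B, var_B)        # p(X = 2 | Y = B) ~ 0.0297

post_A = prior_A * alpha / (prior_A * alpha + prior_B * beta)
print(post_A)                              # ~ 0.765 > 1/2, so X = 2 is labeled A
```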

    92.

  • Graphical representations [made by Sebastian Ciobanu]

    [Figure: two plots of p(x) versus x on the range −2 to 8, each with one curve for Y = A and one for Y = B. The left panel shows the class-conditional p.d.f.'s without multiplying them by the selection probabilities; the right panel shows them after multiplying the p.d.f.'s by the selection probabilities.]
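
    A short matplotlib sketch (my reconstruction, not the original figure code) that reproduces the two panels, using red for class A and blue for class B as in the remark on the next slide.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

xs = np.linspace(-2, 8, 400)
pdf_A = norm.pdf(xs, loc=1, scale=1)             # N(mu_A = 1, sigma_A = 1)
pdf_B = norm.pdf(xs, loc=5, scale=np.sqrt(2))    # N(mu_B = 5, sigma_B = sqrt(2))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ax1.plot(xs, pdf_A, 'r', label='Y = A')          # raw class-conditional densities
ax1.plot(xs, pdf_B, 'b', label='Y = B')
ax1.set_title("without the selection probabilities")
ax2.plot(xs, 2 / 7 * pdf_A, 'r', label='Y = A')  # densities scaled by the priors 2/7 and 5/7
ax2.plot(xs, 5 / 7 * pdf_B, 'b', label='Y = B')
ax2.set_title("after multiplying by the selection probabilities")
for ax in (ax1, ax2):
    ax.set_xlabel('x')
    ax.legend()
ax1.set_ylabel('p(x)')
plt.show()
```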

    93.

  • Remark

    One can check fairly easily that there are two intersection points (x_1 = −8.451 and x_2 = 2.451) between the graphs of the functions (2/7) N(µ_A, σ²_A) and (5/7) N(µ_B, σ²_B).

    All test instances x located between these intersection points (x_1 < x < x_2) will be assigned to class A (there, the red curve lies above the blue one).

    The instances located either to the left of x_1 or to the right of x_2 will be assigned to class B (there, the blue curve lies above the red one).

    The decision boundary is of quadratic type, consisting of the points x_1 and x_2.

    LC: Thanks to the MSc student Dinu Sergiu for this remark.
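
    The intersection points can be recovered by equating (2/7) N(x; 1, 1) with (5/7) N(x; 5, 2) and taking logarithms, which leaves a quadratic equation in x; a tiny NumPy sketch (mine):

```python
import numpy as np

# (2/7) N(x; 1, 1) = (5/7) N(x; 5, 2)  =>  (taking logs)  -x^2 - 6x + 23 = 4 ln(5 / (2 sqrt(2))),
# i.e.  x^2 + 6x + (4 ln(5 / (2 sqrt(2))) - 23) = 0
c = 4 * np.log(5 / (2 * np.sqrt(2))) - 23
x1, x2 = sorted(np.roots([1, 6, c]))
print(x1, x2)      # ~ -8.451 and 2.451, as stated above
```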

    94.

  • Exemplifying the Gaussian [Naive] Bayes algorithm on data from R^2

    CMU, 2014 fall, W. Cohen, Z. Bar-Joseph, HW2, pr. 5.c

    95.

  • In a two-dimensional case, we can visualize how Gaussian Naive Bayes behaves when the input features are correlated. A data set is shown in Figure (A), where red points are in Class 0 and blue points are in Class 1. The conditional distributions are two-dimensional Gaussians. In (B), (C) and (D), the ellipses represent conditional distributions for each class. The centers of the ellipses show the means, and the contours show the boundary of two standard deviations.

    a. Which of them is most likely to be the true conditional distribution?

    b. Which of them is most likely to be estimated by a Gaussian Naive Bayes model?

    c. If we assume the prior probabilities for both classes are equal, which model will achieve a higher accuracy on the training data?

    96.

  • [Figure: panel (A) shows the data set; panels (B), (C) and (D) show three candidate pairs of class-conditional Gaussian contour ellipses.]

    97.

  • Solution:

    a. (C) is the truth.

b. (B) corresponds to the Gaussian Naive Bayes estimates. [LC: Here follows the explanation:] Because the Gaussian Naive Bayes model assumes independence of the two features conditioned on the class label, the estimated model should be aligned with the axes. Both (B) and (D) satisfy this, but only in (B) do the width and height of the oval, which are proportional to the standard deviation along each axis, match the data.

    c. (C) gives the lowest training error.
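
    To illustrate part c, here is a small NumPy simulation (my own, with made-up means, covariance and sample size) comparing the training accuracy of an axis-aligned (naive) Gaussian classifier against a full-covariance one on correlated data; with equal priors, the class with the higher class-conditional likelihood wins.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])                  # correlated features
X0 = rng.multivariate_normal([0.0, 0.0], cov, 200)        # Class 0
X1 = rng.multivariate_normal([1.5, 1.5], cov, 200)        # Class 1
X = np.vstack([X0, X1])
y = np.r_[np.zeros(200), np.ones(200)]

def log_density(X, mu, Sigma):
    """Log of N(x; mu, Sigma) evaluated at every row of X."""
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * quad - 0.5 * np.log(np.linalg.det(Sigma)) - np.log(2 * np.pi)

def fit_predict(X, y, naive):
    scores = []
    for c in (0, 1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = np.cov(Xc.T, bias=True)                   # MLE covariance
        if naive:
            Sigma = np.diag(np.diag(Sigma))               # GNB keeps only the diagonal
        scores.append(log_density(X, mu, Sigma))          # equal priors -> compare likelihoods
    return np.argmax(np.vstack(scores), axis=0)

for naive in (True, False):
    acc = np.mean(fit_predict(X, y, naive) == y)
    print('naive' if naive else 'full', 'training accuracy:', round(acc, 3))
# the full-covariance model (the analogue of (C)) typically fits the training set at least as well
```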

    98.

  • Estimating the parameters for

Gaussian Naive Bayes and Full/Joint Gaussian Bayes algorithms

    CMU, 2014 fall, W. Cohen, Z. Bar-Joseph, HW2, pr. 5.ab

Let Y ∈ {0, 1} be class labels, and let X ∈ R^d denote a d-dimensional feature. We will first consider a Gaussian Naive Bayes model, where the conditional distribution of each feature is a one-dimensional Gaussian, X^{(j)}|Y ∼ N(µ_Y^{(j)}, (σ_Y^{(j)})²), for j = 1, . . . , d.

    a. Given n independent training data points, {(X^{(1)}, Y^{(1)}), . . . , (X^{(n)}, Y^{(n)})}, give maximum-likelihood estimates (MLE) for the parameters of the probabilistic distribution of X^{(j)}|Y, for j = 1, . . . , d.

    99.

  • Solution:

    The likelihood of the samples in Class 0 is

L(X_{i,0}^{(j)} | µ_0^{(j)}, (σ_0^{(j)})²) = ∏_{i=1}^{n_0} (1/(√(2π) σ_0^{(j)})) exp( −(X_{i,0}^{(j)} − µ_0^{(j)})² / (2(σ_0^{(j)})²) )
    = (1/(√(2π) σ_0^{(j)}))^{n_0} exp( −∑_{i=1}^{n_0} (X_{i,0}^{(j)} − µ_0^{(j)})² / (2(σ_0^{(j)})²) )

    and the log-likelihood is

    ln L = −n_0 ln σ_0^{(j)} − (1/(2(σ_0^{(j)})²)) ∑_{i=1}^{n_0} (X_{i,0}^{(j)} − µ_0^{(j)})² + constant

    Taking the partial derivatives of the log-likelihood, we have

    ∂ ln L / ∂µ_0^{(j)} = 0 ⇔ ∑_{i=1}^{n_0} (X_{i,0}^{(j)} − µ_0^{(j)}) = 0 ⇔ µ̂_0^{(j)} = (1/n_0) ∑_{i=1}^{n_0} X_{i,0}^{(j)}

    ∂ ln L / ∂σ_0^{(j)} = 0 ⇔ −n_0/σ_0^{(j)} + (1/(σ_0^{(j)})³) ∑_{i=1}^{n_0} (X_{i,0}^{(j)} − µ_0^{(j)})² = 0 ⇔ (σ̂_0^{(j)})² = (1/n_0) ∑_{i=1}^{n_0} (X_{i,0}^{(j)} − µ̂_0^{(j)})²

    Similarly, one can derive the MLE for the parameters in Class 1.
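
    These estimators are exactly the per-class sample mean and the 1/n_c-normalized sample variance of each feature; a compact NumPy sketch (mine) on synthetic data:

```python
import numpy as np

def gnb_mle(X, y):
    """Per-class, per-feature MLEs for Gaussian Naive Bayes.

    X: (n, d) real-valued features; y: (n,) labels in {0, 1}.
    Returns mu[c] and var[c], each a length-d array (var uses the 1/n_c MLE normalization).
    """
    mu, var = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        mu[c] = Xc.mean(axis=0)    # hat{mu}_c^{(j)}
        var[c] = Xc.var(axis=0)    # (hat{sigma}_c^{(j)})^2, ddof=0 by default
    return mu, var

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 10000)
X = rng.normal(loc=2.0 * y[:, None], scale=1.0 + y[:, None], size=(10000, 3))
mu, var = gnb_mle(X, y)
print(mu[1], var[1])   # should be close to [2, 2, 2] and [4, 4, 4]
```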

    100.

  • b. Suppose the prior of Y is already given. How many parameters do you need to estimate in the Gaussian Naive Bayes model?

    Solution:

    For each class, there are 2 parameters (the mean and the variance) for each feature; therefore there are 2 · 2d = 4d parameters for all features in the two classes.

    c. In a full/Joint Gaussian Bayes model, we assume that the conditional distribution Pr(X|Y) is a multidimensional Gaussian, X|Y ∼ N(µ_Y, Σ_Y), where µ_Y ∈ R^d is the mean vector and Σ_Y ∈ R^{d×d} is the covariance matrix. Again, suppose the prior of Y is already given. How many parameters do you need to estimate in a full/Joint Gaussian Bayes model?

    Solution:

    For each class, there are d parameters for the mean and d(d + 1)/2 parameters for the covariance matrix, because the covariance matrix is symmetric. Therefore, the number of parameters is 2 · (d + d(d + 1)/2) = d(d + 3) in total for the two classes.
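
    A tiny sketch (mine) contrasting the two counts as d grows:

```python
def n_params(d, full):
    """Number of Gaussian parameters (priors excluded) for the two-class model."""
    if full:
        return 2 * (d + d * (d + 1) // 2)   # per class: a d-dim mean + a symmetric d x d covariance
    return 2 * 2 * d                        # per class: a mean and a variance per feature

for d in (2, 10, 100):
    print(d, n_params(d, full=False), n_params(d, full=True))
# for d = 100: 400 parameters for Gaussian Naive Bayes vs 10300 = d(d + 3) for the full model
```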

    101.

  • Proving the relationship between the decision rules for

    Gaussian Naive Bayes and the Logistic Regression algorithm

    when the covariance matrices are diagonal and identical,

    i.e., σ²_{i0} = σ²_{i1} for i = 1, . . . , d

    CMU, 2009 spring, Ziv Bar-Joseph, HW2, pr. 2

    102.

  • Assume a two-class (Y ∈ {0, 1}) Naive Bayes model over the d-dimensional real-valued input space R^d, where the input variables X|Y = 0 ∈ R^d are distributed as

    Gaussian(µ_0 = ⟨µ_{10}, . . . , µ_{d0}⟩, σ = ⟨σ_1, . . . , σ_d⟩)

    and X|Y = 1 ∈ R^d as

    Gaussian(µ_1 = ⟨µ_{11}, . . . , µ_{d1}⟩, σ = ⟨σ_1, . . . , σ_d⟩)

    i.e., the inputs given the class have different means but identical variances for both classes.

    103.

  • Prove that, given the conditions stated above, the conditional probability P(Y = 1|X = x), where X = (X_1, . . . , X_d) and x = (x_1, . . . , x_d), can be written in a form similar to Logistic Regression:

    1 / (1 + exp(w_0 + w · x))

    with the parameters w_0 ∈ R and w = (w_1, . . . , w_d) ∈ R^d chosen in a suitable way.

    As a consequence, the decision rule for the Gaussian Bayes classifier supported by this model has a linear form.

    104.

  • Solution

P(Y = 1|X = x) = (by Bayes' formula) P(X = x|Y = 1) P(Y = 1) / ∑_{y′ ∈ {0,1}} P(X = x|Y = y′) P(Y = y′)

    = 1 / ( 1 + P(X = x|Y = 0) P(Y = 0) / (P(X = x|Y = 1) P(Y = 1)) )

    = 1 / ( 1 + exp( ln [ P(X = x|Y = 0) P(Y = 0) / (P(X = x|Y = 1) P(Y = 1)) ] ) )

    = 1 / ( 1 + exp( ln [ P(X_1 = x_1, . . . , X_d = x_d|Y = 0) P(Y = 0) / (P(X_1 = x_1, . . . , X_d = x_d|Y = 1) P(Y = 1)) ] ) )

    (we denote the argument of exp(·) in the last expression by "exponent")

    105.

  • exponent = (by conditional independence) ln [P(Y = 0)/P(Y = 1)] + ∑_{i=1}^d ln [ P(X_i = x_i|Y = 0) / P(X_i = x_i|Y = 1) ]

    = ln [P(Y = 0)/P(Y = 1)] + ∑_{i=1}^d ln [ (1/(√(2π) σ_i)) exp(−(x_i − µ_{i0})²/(2σ²_i)) / ( (1/(√(2π) σ_i)) exp(−(x_i − µ_{i1})²/(2σ²_i)) ) ]

    = ln [P(Y = 0)/P(Y = 1)] + ∑_{i=1}^d ( (x_i − µ_{i1})²/(2σ²_i) − (x_i − µ_{i0})²/(2σ²_i) )

    = ln [P(Y = 0)/P(Y = 1)] + ∑_{i=1}^d ( 2x_i(µ_{i0} − µ_{i1}) + (µ²_{i1} − µ²_{i0}) ) / (2σ²_i)

    = ln [P(Y = 0)/P(Y = 1)] + ∑_{i=1}^d ( x_i(µ_{i0} − µ_{i1})/σ²_i + (µ²_{i1} − µ²_{i0})/(2σ²_i) )

    = ln [P(Y = 0)/P(Y = 1)] + ∑_{i=1}^d (µ²_{i1} − µ²_{i0})/(2σ²_i)   [this is w_0]
      + ∑_{i=1}^d ( (µ_{i0} − µ_{i1})/σ²_i ) x_i   [the coefficients of the x_i are the w_i]

    106.

  • In conclusion,

P(Y = 1|X = x) = 1 / (1 + e^{w·x + w_0})

    with

    w_0 = ln [P(Y = 0)/P(Y = 1)] + ∑_{i=1}^d (µ²_{i1} − µ²_{i0})/(2σ²_i) and w_i = (µ_{i0} − µ_{i1})/σ²_i, i = 1, . . . , d

    Note that

    P(Y = 0|X = x) = e^{w·x + w_0} / (1 + e^{w·x + w_0})

    and P(Y = 1|X = x) > P(Y = 0|X = x) ⇔ w · x + w_0 < 0
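
    A minimal NumPy check (mine, with random parameters) that the sigmoid with these w_0 and w_i reproduces the exact Gaussian Naive Bayes posterior when the per-feature variances are shared between the classes:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
mu0, mu1 = rng.normal(size=d), rng.normal(size=d)   # class-conditional means
sigma2 = rng.uniform(0.5, 2.0, d)                   # shared per-feature variances sigma_i^2
p1 = 0.4                                            # P(Y = 1)

w0 = np.log((1 - p1) / p1) + np.sum((mu1 ** 2 - mu0 ** 2) / (2 * sigma2))
w = (mu0 - mu1) / sigma2

def posterior(x):
    """Exact P(Y = 1 | X = x); the shared Gaussian normalizers cancel in the ratio."""
    loglik = lambda mu: np.sum(-0.5 * (x - mu) ** 2 / sigma2)
    a = np.exp(loglik(mu1)) * p1
    b = np.exp(loglik(mu0)) * (1 - p1)
    return a / (a + b)

for _ in range(5):
    x = rng.normal(size=d)
    assert np.isclose(posterior(x), 1 / (1 + np.exp(w0 + w @ x)))
```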

    107.

  • Since the coefficients w_i for i = 1, . . . , d do not depend on x_i, it follows that this decision rule of Gaussian Naive Bayes [under the conditions stated at the beginning of this problem] is a linear rule, as in Logistic Regression.

    However, this relationship does not mean that there is a one-to-one correspondence between the parameters w_i of Gaussian Naive Bayes (GNB) and the parameters w_i of logistic regression (LR), because LR is discriminative and therefore doesn't model P(X), while GNB does model P(X).

    To be more specific, note that the coefficients w_i in the GNB decision rule would have to be divided by P(x_1, . . . , x_d) in order to correspond to P(Y = 1|X = x), which means that they would then no longer be independent of x_i, as the LR coefficients are.

    108.

  • Proving the relationship between

    The full Gaussian Bayes algorithm and Logistic Regression

    when Σ0 = Σ1

    CMU, 2011 spring, Tom Mitchell, HW2, pr. 2.2

    109.

  • Let’s make the following assumptions:

1. Y is a boolean variable following a Bernoulli distribution, with parameter π = P(Y = 1) and thus P(Y = 0) = 1 − π.

    2. X = (X_1, X_2, . . . , X_d)^⊤ is a vector of random variables which are not conditionally independent given Y, and P(X|Y = k) follows a multivariate normal distribution N(µ_k, Σ).

    Note that µ_k is the d × 1 mean vector depending on the value of Y, and Σ is the d × d covariance matrix, which does not depend on Y. We will write/use the density of the multivariate normal distribution in vector/matrix notation:

    N(x; µ, Σ) = (1/((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2)(x − µ)^⊤ Σ^{−1} (x − µ) )

    Is the form of P(Y|X) implied by this [not-so-naive] Gaussian Bayes classifier [LC: similar to] the form used by logistic regression? Derive the form of P(Y|X) to prove your answer.

    110.

  • We start with:

P(Y = 1|X) = P(X|Y = 1) P(Y = 1) / ( P(X|Y = 1) P(Y = 1) + P(X|Y = 0) P(Y = 0) )

    = 1 / ( 1 + P(Y = 0) P(X|Y = 0) / (P(Y = 1) P(X|Y = 1)) )

    = 1 / ( 1 + exp( ln [ P(Y = 0) P(X|Y = 0) / (P(Y = 1) P(X|Y = 1)) ] ) )

    = 1 / ( 1 + exp( ln [P(Y = 0)/P(Y = 1)] + ln [P(X|Y = 0)/P(X|Y = 1)] ) )

    Next we will focus on the term ln [P(X|Y = 0)/P(X|Y = 1)]:

    ln [P(X|Y = 0)/P(X|Y = 1)] = ln [ (1/((2π)^{d/2} |Σ|^{1/2})) / (1/((2π)^{d/2} |Σ|^{1/2})) ] + ln exp[(⋆)] = ln exp[(⋆)] = (⋆)

    where (⋆) is the expression obtained as the difference between the exponential parts of the two multivariate Gaussian densities P(X|Y = 0) and P(X|Y = 1).

    111.

  • (⋆) = (1/2) [ (X − µ_1)^⊤ Σ^{−1} (X − µ_1) − (X − µ_0)^⊤ Σ^{−1} (X − µ_0) ]

    = (µ_0^⊤ − µ_1^⊤) Σ^{−1} X + (1/2) µ_1^⊤ Σ^{−1} µ_1 − (1/2) µ_0^⊤ Σ^{−1} µ_0

    As a result, we have:

    P(Y = 1|X) = 1 / ( 1 + exp( ln((1 − π)/π) + (1/2) µ_1^⊤ Σ^{−1} µ_1 − (1/2) µ_0^⊤ Σ^{−1} µ_0 + (µ_0^⊤ − µ_1^⊤) Σ^{−1} X ) )

    = 1 / (1 + exp(w_0 + w^⊤ X))

    where w_0 = ln((1 − π)/π) + (1/2) µ_1^⊤ Σ^{−1} µ_1 − (1/2) µ_0^⊤ Σ^{−1} µ_0 is a scalar,

    and w = Σ^{−1}(µ_0 − µ_1) is a d × 1 parameter vector.

    Note that ((µ_0^⊤ − µ_1^⊤) Σ^{−1})^⊤ = ((µ_0 − µ_1)^⊤ Σ^{−1})^⊤ = (Σ^{−1})^⊤ ((µ_0 − µ_1)^⊤)^⊤ = Σ^{−1}(µ_0 − µ_1), because Σ^{−1} is symmetric. (Σ is symmetric because it is a covariance matrix, and therefore Σ^{−1} is also symmetric.)

    In conclusion, P(Y|X) has the form of the logistic regression (in vector and matrix notation).
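
    The same identity can be verified numerically; a short sketch (mine, with an arbitrary positive-definite shared Σ) using scipy's multivariate normal density:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
d = 3
mu0, mu1 = rng.normal(size=d), rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)          # a random symmetric positive-definite shared covariance
pi = 0.35                                # P(Y = 1)

Sinv = np.linalg.inv(Sigma)
w0 = np.log((1 - pi) / pi) + 0.5 * mu1 @ Sinv @ mu1 - 0.5 * mu0 @ Sinv @ mu0
w = Sinv @ (mu0 - mu1)

for _ in range(5):
    x = rng.normal(size=d)
    num = multivariate_normal.pdf(x, mean=mu1, cov=Sigma) * pi
    den = num + multivariate_normal.pdf(x, mean=mu0, cov=Sigma) * (1 - pi)
    assert np.isclose(num / den, 1 / (1 + np.exp(w0 + w @ x)))
```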

    112.

  • The quadratic nature of the decision boundary for

    the Gaussian Joint Bayes classifier

when Σ_0 ≠ Σ_1

    Stanford, 2014 fall, Andrew Ng, midterm, pr. 2.b

    113.

  • The probabilistic distributions learned by the Gaussian Joint Bayes (GJB) classifier can be written as:

    p(y) = φ^y (1 − φ)^{1−y}, where φ = p(y = 1)

    p(x|y = 0) = (1/((2π)^{d/2} |Σ_0|^{1/2})) exp( −(1/2)(x − µ_0)^⊤ Σ_0^{−1} (x − µ_0) )

    p(x|y = 1) = (1/((2π)^{d/2} |Σ_1|^{1/2})) exp( −(1/2)(x − µ_1)^⊤ Σ_1^{−1} (x − µ_1) )

    The decision rule of GJB predicts

    y = 1 if p(y = 1|x) ≥ p(y = 0|x), and y = 0 otherwise.

    Show that if Σ_0 ≠ Σ_1, then the separating boundary is quadratic in x. That is, simplify the decision rule p(y = 1|x) ≥ p(y = 0|x) to the form

    x^⊤ A x + B^⊤ x + C ≥ 0,

    for some A ∈ R^{d×d}, B ∈ R^d, C ∈ R, and A ≠ 0. Please clearly state your values for A, B and C.

    114.

  • Examining the log-probabilities yields:

ln p(y = 1|x) ≥ ln p(y = 0|x) ⇔ ln p(y = 1|x) − ln p(y = 0|x) ≥ 0 ⇔ ln [p(y = 1|x)/p(y = 0|x)] ≥ 0

    ⇔ (by Bayes' formula) ln [ p(x|y = 1) p(y = 1) / (p(x|y = 0) p(y = 0)) ] ≥ 0

    ⇔ ln(φ/(1 − φ)) − ln(|Σ_1|^{1/2}/|Σ_0|^{1/2}) − (1/2) [ (x − µ_1)^⊤ Σ_1^{−1} (x − µ_1) − (x − µ_0)^⊤ Σ_0^{−1} (x − µ_0) ] ≥ 0

    ⇔ −(1/2) [ x^⊤ (Σ_1^{−1} − Σ_0^{−1}) x − 2 (µ_1^⊤ Σ_1^{−1} − µ_0^⊤ Σ_0^{−1}) x + µ_1^⊤ Σ_1^{−1} µ_1 − µ_0^⊤ Σ_0^{−1} µ_0 ] + ln(φ/(1 − φ)) − ln(|Σ_1|^{1/2}/|Σ_0|^{1/2}) ≥ 0

    ⇔ x^⊤ ( (1/2)(Σ_0^{−1} − Σ_1^{−1}) ) x + (µ_1^⊤ Σ_1^{−1} − µ_0^⊤ Σ_0^{−1}) x + ln(φ/(1 − φ)) + ln(|Σ_0|^{1/2}/|Σ_1|^{1/2}) + (1/2) (µ_0^⊤ Σ_0^{−1} µ_0 − µ_1^⊤ Σ_1^{−1} µ_1) ≥ 0.

    From the above, we see that A = (1/2)(Σ_0^{−1} − Σ_1^{−1}), B^⊤ = µ_1^⊤ Σ_1^{−1} − µ_0^⊤ Σ_0^{−1}, and C = ln(φ/(1 − φ)) + ln(|Σ_0|^{1/2}/|Σ_1|^{1/2}) + (1/2)(µ_0^⊤ Σ_0^{−1} µ_0 − µ_1^⊤ Σ_1^{−1} µ_1). Furthermore, A ≠ 0, since Σ_0 ≠ Σ_1 implies that Σ_0^{−1} ≠ Σ_1^{−1}. Therefore, the decision boundary is quadratic.
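
    As a numerical check (mine, with made-up parameters), the quadratic form x^⊤ A x + B^⊤ x + C equals the log posterior ratio exactly, so its sign reproduces the GJB decision:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
S0 = np.array([[2.0, 0.3], [0.3, 1.0]])          # distinct covariance matrices
S1 = np.array([[0.5, -0.2], [-0.2, 1.5]])
phi = 0.6                                        # p(y = 1)

S0inv, S1inv = np.linalg.inv(S0), np.linalg.inv(S1)
A = 0.5 * (S0inv - S1inv)
B = S1inv @ mu1 - S0inv @ mu0                    # so that B^T = mu1^T S1^{-1} - mu0^T S0^{-1}
C = np.log(phi / (1 - phi)) + 0.5 * np.log(np.linalg.det(S0) / np.linalg.det(S1)) \
    + 0.5 * (mu0 @ S0inv @ mu0 - mu1 @ S1inv @ mu1)

for _ in range(5):
    x = rng.normal(size=2)
    quad = x @ A @ x + B @ x + C
    p1 = multivariate_normal.pdf(x, mean=mu1, cov=S1) * phi
    p0 = multivariate_normal.pdf(x, mean=mu0, cov=S0) * (1 - phi)
    assert np.isclose(quad, np.log(p1 / p0))     # same sign => same predicted label
```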

    115.