
Solutions to Selected Problems In:

Pattern Classification

by Duda, Hart, Stork

John L. Weatherwax∗

February 24, 2008

Problem Solutions

Chapter 2 (Bayesian Decision Theory)

Problem 11 (randomized rules)

Part (a): Let R(x) be the average risk given measurement x. Since this risk depends on the action and, correspondingly, on the probability of taking that action, it can be expressed as

R(x) = \sum_{i=1}^{m} R(a_i|x) P(a_i|x)

Then the total risk (over all possible x) is given by the average of R(x) or

R = \int_{\Omega_x} \left( \sum_{i=1}^{m} R(a_i|x) P(a_i|x) \right) p(x)\,dx

Part (b): One would pick

P(a_i|x) = \begin{cases} 1 & i = \arg\min_j R(a_j|x) \\ 0 & \text{otherwise} \end{cases}

Thus for every x we pick the action that minimizes the local risk. This amounts to a deterministic rule, i.e. there is no benefit to using randomness.

[email protected]


Part (c): Yes, since in a deterministic decision rule setting, randomizing some decisions might further improve performance. But how this would be done would be difficult to say.

Problem 13

Let

λ(α_i|ω_j) = \begin{cases} 0 & i = j \\ λ_r & i = c + 1 \\ λ_s & \text{otherwise} \end{cases}

Then since the definition of risk is expected loss we have

R(α_i|x) = \sum_{j=1}^{c} λ(α_i|ω_j) P(ω_j|x), \quad i = 1, 2, \ldots, c \quad (1)

R(α_{c+1}|x) = λ_r \quad (2)

Now for i = 1, 2, . . . , c the risk can be simplified as

R(α_i|x) = λ_s \sum_{j=1,\, j ≠ i}^{c} P(ω_j|x) = λ_s (1 − P(ω_i|x))

Our decision function is to choose the action with the smallest risk; this means that we pick the reject option if and only if

λ_r < λ_s (1 − P(ω_i|x)) \quad ∀\, i = 1, 2, \ldots, c

This is equivalent to

P(ω_i|x) < 1 − \frac{λ_r}{λ_s} \quad ∀\, i = 1, 2, \ldots, c

This can be interpreted as: if all posterior probabilities are below the threshold 1 − λ_r/λ_s, we should reject. Otherwise we pick the action i such that

λ_s (1 − P(ω_i|x)) < λ_s (1 − P(ω_j|x)) \quad ∀\, j = 1, 2, \ldots, c,\; j ≠ i

This is equivalent to

P(ω_i|x) > P(ω_j|x) \quad ∀\, j ≠ i

This can be interpreted as selecting the class with the largest posterior probability. In summary, we should reject if and only if

P(ω_i|x) < 1 − \frac{λ_r}{λ_s} \quad ∀\, i = 1, 2, \ldots, c \quad (3)

Note that if λ_r = 0 there is no loss for rejecting (equivalent to the benefit/reward for correctly classifying) and we always reject. If λ_r > λ_s, the loss from rejecting is greater than the loss from misclassifying and we should never reject. This can be seen from the above, since λ_r/λ_s > 1 implies 1 − λ_r/λ_s < 0, so equation (3) will never be satisfied.
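As a concrete illustration, here is a small Python sketch of this reject rule. The function name, the example posteriors, and the loss values are all hypothetical choices for the demo:

import numpy as np

def classify_with_reject(posteriors, lam_r, lam_s):
    """Return a class index, or the string "reject", per equation (3)."""
    posteriors = np.asarray(posteriors)
    # Reject iff every posterior is below the threshold 1 - lam_r / lam_s.
    if posteriors.max() < 1.0 - lam_r / lam_s:
        return "reject"
    # Otherwise pick the class with the largest posterior probability.
    return int(posteriors.argmax())

print(classify_with_reject([0.4, 0.35, 0.25], lam_r=0.1, lam_s=1.0))   # reject
print(classify_with_reject([0.95, 0.03, 0.02], lam_r=0.1, lam_s=1.0))  # 0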


Problem 31 (the probability of error)

Part (a): We can assume without loss of generality that µ_1 < µ_2. If this is not the case initially, exchange the labels of class one and class two so that it is. We begin by recalling the probability of error P_e

P_e = P(\hat{H} = H_1|H_2) P(H_2) + P(\hat{H} = H_2|H_1) P(H_1)
    = \left( \int_{-∞}^{ξ} p(x|H_2)\,dx \right) P(H_2) + \left( \int_{ξ}^{∞} p(x|H_1)\,dx \right) P(H_1)\,.

Here ξ is the decision boundary between the two classes, i.e. we classify x as class H_1 when x < ξ and as class H_2 when x > ξ, and P(H_i) are the priors for each of the two classes. We begin by finding the Bayes decision boundary ξ under the minimum error criterion. From the discussion in the book, the decision boundary is given by a likelihood ratio test. That is, we decide class H_2 if

\frac{p(x|H_2)}{p(x|H_1)} ≥ \frac{P(H_1)}{P(H_2)} \left( \frac{C_{21} − C_{11}}{C_{12} − C_{22}} \right)\,,

and classify the point as a member of H_1 otherwise. The decision boundary ξ is the point at which this inequality is an equality. If we use a minimum probability of error criterion, the costs C_{ij} are given by C_{ij} = 1 − δ_{ij}, and when P(H_1) = P(H_2) = 1/2 the decision boundary condition reduces to

p(ξ|H_1) = p(ξ|H_2)\,.

If each conditional distribution is Gaussian, p(x|H_i) = N(µ_i, σ_i^2), then the above expression becomes

\frac{1}{\sqrt{2π}\,σ_1} \exp\left( −\frac{1}{2} \frac{(ξ − µ_1)^2}{σ_1^2} \right) = \frac{1}{\sqrt{2π}\,σ_2} \exp\left( −\frac{1}{2} \frac{(ξ − µ_2)^2}{σ_2^2} \right)\,.

If we assume (for simplicity) that σ_1 = σ_2 = σ, the above expression simplifies to (ξ − µ_1)^2 = (ξ − µ_2)^2, or on taking the square root of both sides, ξ − µ_1 = ±(ξ − µ_2). If we take the plus sign in this equation we get the contradiction µ_1 = µ_2. Taking the minus sign and solving for ξ we find our decision threshold

ξ = \frac{1}{2}(µ_1 + µ_2)\,.

Again, since we have uniform priors (P(H_1) = P(H_2) = 1/2), we have an error probability given by

2P_e = \int_{-∞}^{ξ} \frac{1}{\sqrt{2π}\,σ} \exp\left( −\frac{1}{2} \frac{(x − µ_2)^2}{σ^2} \right) dx + \int_{ξ}^{∞} \frac{1}{\sqrt{2π}\,σ} \exp\left( −\frac{1}{2} \frac{(x − µ_1)^2}{σ^2} \right) dx\,.

or

2\sqrt{2π}\,σ P_e = \int_{-∞}^{ξ} \exp\left( −\frac{1}{2} \frac{(x − µ_2)^2}{σ^2} \right) dx + \int_{ξ}^{∞} \exp\left( −\frac{1}{2} \frac{(x − µ_1)^2}{σ^2} \right) dx\,.

To evaluate the first integral use the substitution v = \frac{x − µ_2}{\sqrt{2}\,σ} so dv = \frac{dx}{\sqrt{2}\,σ}, and in the second use v = \frac{x − µ_1}{\sqrt{2}\,σ} so dv = \frac{dx}{\sqrt{2}\,σ}. This then gives

2\sqrt{2π}\,σ P_e = \sqrt{2}\,σ \int_{-∞}^{\frac{µ_1 − µ_2}{2\sqrt{2}\,σ}} e^{−v^2}\,dv + \sqrt{2}\,σ \int_{−\frac{µ_1 − µ_2}{2\sqrt{2}\,σ}}^{∞} e^{−v^2}\,dv\,.


or

P_e = \frac{1}{2\sqrt{π}} \int_{-∞}^{\frac{µ_1 − µ_2}{2\sqrt{2}\,σ}} e^{−v^2}\,dv + \frac{1}{2\sqrt{π}} \int_{−\frac{µ_1 − µ_2}{2\sqrt{2}\,σ}}^{∞} e^{−v^2}\,dv\,.

Now since e^{−v^2} is an even function, we can make the substitution u = −v and transform the second integral into an integral that is the same as the first. Doing so we find that

P_e = \frac{1}{\sqrt{π}} \int_{-∞}^{\frac{µ_1 − µ_2}{2\sqrt{2}\,σ}} e^{−v^2}\,dv\,.

Doing the same transformation we can convert the first integral into the second, and we have the combined representation for P_e of

P_e = \frac{1}{\sqrt{π}} \int_{-∞}^{\frac{µ_1 − µ_2}{2\sqrt{2}\,σ}} e^{−v^2}\,dv = \frac{1}{\sqrt{π}} \int_{−\frac{µ_1 − µ_2}{2\sqrt{2}\,σ}}^{∞} e^{−v^2}\,dv\,.

Since we have that µ_1 < µ_2, we see that the lower limit on the second integral is positive. In the case when µ_1 > µ_2 we would switch the names of the two classes and still arrive at the above. In the general case our probability of error is given by

P_e = \frac{1}{\sqrt{π}} \int_{a/\sqrt{2}}^{∞} e^{−v^2}\,dv\,,

with a defined as a = \frac{|µ_1 − µ_2|}{2σ}. To match the book exactly, let v = \frac{u}{\sqrt{2}} so that dv = \frac{du}{\sqrt{2}}, and the above integral becomes

P_e = \frac{1}{\sqrt{2π}} \int_{a}^{∞} e^{−u^2/2}\,du\,.

In terms of the error function (which is defined as \mathrm{erf}(x) = \frac{2}{\sqrt{π}} \int_0^x e^{−t^2}\,dt), we have that

P_e = \frac{1}{2} \left[ \frac{2}{\sqrt{π}} \int_{a/\sqrt{2}}^{∞} e^{−v^2}\,dv \right]
    = \frac{1}{2} \left[ \frac{2}{\sqrt{π}} \int_{0}^{∞} e^{−v^2}\,dv − \frac{2}{\sqrt{π}} \int_{0}^{a/\sqrt{2}} e^{−v^2}\,dv \right]
    = \frac{1}{2} \left[ 1 − \frac{2}{\sqrt{π}} \int_{0}^{a/\sqrt{2}} e^{−v^2}\,dv \right]
    = \frac{1}{2} − \frac{1}{2}\,\mathrm{erf}\left( \frac{|µ_1 − µ_2|}{2\sqrt{2}\,σ} \right)\,.

Since the error function is programmed into many mathematical libraries, this form is easily evaluated.
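For example, here is a short Python check of this closed form; the parameter values and the Monte Carlo comparison are illustrative additions, not from the text:

import math
import numpy as np

mu1, mu2, sigma, n = 0.0, 2.0, 1.0, 10**6   # illustrative parameters

# Closed form: Pe = 1/2 - 1/2 erf(|mu1 - mu2| / (2 sqrt(2) sigma)).
pe_exact = 0.5 - 0.5 * math.erf(abs(mu1 - mu2) / (2 * math.sqrt(2) * sigma))

# Monte Carlo estimate with the midpoint threshold xi = (mu1 + mu2) / 2.
rng = np.random.default_rng(1)
xi = 0.5 * (mu1 + mu2)
pe_mc = 0.5 * (rng.normal(mu1, sigma, n) > xi).mean() \
      + 0.5 * (rng.normal(mu2, sigma, n) < xi).mean()

print(pe_exact, pe_mc)   # both close to 0.1587 for these parameters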

Part (b): From the given inequality

P_e = \frac{1}{\sqrt{2π}} \int_{a}^{∞} e^{−t^2/2}\,dt ≤ \frac{1}{\sqrt{2π}\,a} e^{−a^2/2}\,,

and the definition of a as the lower limit of integration in Part (a) above, we have that

P_e ≤ \frac{1}{\sqrt{2π}\,a} e^{−a^2/2} = \frac{2σ}{\sqrt{2π}\,|µ_1 − µ_2|} \exp\left( −\frac{1}{8} \frac{(µ_1 − µ_2)^2}{σ^2} \right) → 0\,,


as \frac{|µ_1 − µ_2|}{σ} → +∞. This can happen if µ_1 and µ_2 get far apart or σ shrinks to zero. In each case the classification problem gets easier.
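A quick numeric sketch of this bound, reusing the illustrative parameters from the previous snippet:

import math

mu1, mu2, sigma = 0.0, 2.0, 1.0
a = abs(mu1 - mu2) / (2 * sigma)

pe_exact = 0.5 - 0.5 * math.erf(a / math.sqrt(2))
bound = math.exp(-a**2 / 2) / (math.sqrt(2 * math.pi) * a)
print(pe_exact, bound)   # the bound (~0.242) indeed dominates Pe (~0.159)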

Problem 32 (probability of error in higher dimensions)

As in Problem 31, the classification decision boundary is now the set of points x that satisfy

‖x − µ_1‖^2 = ‖x − µ_2‖^2\,.

Expanding each side of this expression we find

‖x‖^2 − 2x^t µ_1 + ‖µ_1‖^2 = ‖x‖^2 − 2x^t µ_2 + ‖µ_2‖^2\,.

Canceling the common ‖x‖^2 from each side and grouping the terms linear in x, we have

x^t (µ_1 − µ_2) = \frac{1}{2} \left( ‖µ_1‖^2 − ‖µ_2‖^2 \right)\,,

which is the equation of a hyperplane.
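A small Python check that this hyperplane rule agrees with nearest-mean classification; the means and test points are arbitrary random examples:

import numpy as np

rng = np.random.default_rng(2)
mu1, mu2 = rng.normal(size=3), rng.normal(size=3)   # arbitrary class means
x = rng.normal(size=(1000, 3))                      # arbitrary test points

# Hyperplane rule: class 1 iff x^t (mu1 - mu2) > (||mu1||^2 - ||mu2||^2) / 2.
hyperplane_class1 = x @ (mu1 - mu2) > 0.5 * (mu1 @ mu1 - mu2 @ mu2)

# Nearest-mean rule: class 1 iff ||x - mu1|| < ||x - mu2||.
nearest_class1 = (np.linalg.norm(x - mu1, axis=1)
                  < np.linalg.norm(x - mu2, axis=1))

assert (hyperplane_class1 == nearest_class1).all()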

Problem 34 (error bounds on non-Gaussian data)

Part (a): Since the Bhattacharyya bound is a specialization of the Chernoff bound, we begin by considering the Chernoff bound. Specifically, using the bound \min(a, b) ≤ a^β b^{1−β}, for a > 0, b > 0, and 0 ≤ β ≤ 1, one can show that

P(\text{error}) ≤ P(ω_1)^β P(ω_2)^{1−β} e^{−k(β)} \quad \text{for } 0 ≤ β ≤ 1\,.

Here the expression k(β) is given by

k(β) = \frac{β(1 − β)}{2} (µ_2 − µ_1)^t \left[ βΣ_1 + (1 − β)Σ_2 \right]^{−1} (µ_2 − µ_1) + \frac{1}{2} \ln\left( \frac{|βΣ_1 + (1 − β)Σ_2|}{|Σ_1|^β |Σ_2|^{1−β}} \right)\,.

When we desire to compute the Bhattacharyya bound we assign β = 1/2. For a two-class scalar problem with symmetric means and equal variances, that is, where µ_1 = −µ, µ_2 = +µ, and σ_1 = σ_2 = σ, the expression for k(β) above becomes

k(β) = \frac{β(1 − β)}{2}(2µ) \left[ βσ^2 + (1 − β)σ^2 \right]^{−1} (2µ) + \frac{1}{2} \ln\left( \frac{|βσ^2 + (1 − β)σ^2|}{σ^{2β} σ^{2(1−β)}} \right) = \frac{2β(1 − β)µ^2}{σ^2}\,.

Since the above expression is valid for all β in the range 0 ≤ β ≤ 1, we can obtain the tightest possible bound by computing the maximum of k(β) (since the dominant β-dependent factor is e^{−k(β)}). To evaluate this maximum we take the derivative of k(β) with respect to β and set the result equal to zero, obtaining

\frac{d}{dβ} \left( \frac{2β(1 − β)µ^2}{σ^2} \right) = 0\,,


which gives 1 − 2β = 0, or β = 1/2. Thus in this case the Chernoff bound equals the Bhattacharyya bound. When β = 1/2 we have the following bound on our error probability

P(\text{error}) ≤ \sqrt{P(ω_1) P(ω_2)}\; e^{−\frac{µ^2}{2σ^2}}\,.

To further simplify things, if we assume that our variance is such that σ^2 = µ^2 and that our class priors are uniform (P(ω_1) = P(ω_2) = 1/2), the above becomes

P(\text{error}) ≤ \frac{1}{2} e^{−1/2} ≈ 0.3033\,.
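A short numeric sketch of this bound in Python; the choice σ = µ = 1 mirrors the assumption σ^2 = µ^2 above, and the comparison with the exact Bayes error is an illustrative addition:

import math

mu, sigma = 1.0, 1.0   # so that sigma^2 = mu^2, as assumed in the text
bound = 0.5 * math.exp(-mu**2 / (2 * sigma**2))

# For comparison, the exact Bayes error for means -mu, +mu and equal
# variances is Pe = 1/2 - 1/2 erf(mu / (sqrt(2) sigma)).
pe_exact = 0.5 - 0.5 * math.erf(mu / (math.sqrt(2) * sigma))

print(bound, pe_exact)   # ~0.3033 versus ~0.1587: the bound holds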


Chapter 3 (Maximum Likelihood and Bayesian Estimation)

Problem 34

In the problem statement we are told that operating on data of “size” n requires f(n) numeric calculations. Assuming that each calculation requires 10^{−9} seconds to compute, we can process a data structure of size n in 10^{−9} f(n) seconds. If we want our results finished by a time T, then the largest data structure that we can process must satisfy

10^{−9} f(n) ≤ T \quad \text{or} \quad n ≤ f^{−1}(10^9 T)\,.

Converting the ranges of times into seconds we find that

T = 1 sec, 1 hour, 1 day, 1 year = 1 sec, 3600 sec, 86400 sec, 31536000 sec .

Using a very naive algorithm we can compute the largest n such that 10^{−9} f(n) ≤ T by simply incrementing n repeatedly. This is done in the Matlab script prob 34 chap 3.m and the results are displayed in Table XXX.
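Since the Matlab script itself is not reproduced here, the following Python sketch shows the same naive search; the complexity function f(n) = 2^n used below is only an illustrative stand-in for the problem's actual list of functions:

import math

def largest_n(f, T, ops_per_sec=1e9):
    """Largest n with f(n) / ops_per_sec <= T, found by incrementing n."""
    n = 1
    while f(n + 1) <= T * ops_per_sec:
        n += 1
    return n

f = lambda n: 2.0**n                      # hypothetical complexity function
for T in (1, 3600, 86400, 31536000):      # 1 sec, 1 hour, 1 day, 1 year
    print(T, largest_n(f, T))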

Problem 35 (recursive vs. direct calculation of means and covariances)

Part (a): To compute µ_n we have to add n d-dimensional vectors, resulting in nd computations. We then have to divide each component by the scalar n, resulting in another n computations. Thus in total we have

nd + n = (1 + d)n ,

calculations.

To compute the sample covariance matrix C_n, we first have to compute the differences x_k − µ_n, requiring d subtractions for each vector x_k. Since we must do this for each of the n vectors x_k, we have a total of nd calculations for all of the x_k − µ_n difference vectors. We then have to compute the outer products (x_k − µ_n)(x_k − µ_n)′, each of which requires d^2 calculations (in a naive implementation). Since we have n such outer products, computing them all requires nd^2 operations. Next we have to add these n outer products together, resulting in another (n − 1)d^2 operations. Finally, dividing by the scalar n − 1 requires another d^2 operations. Thus in total we have

nd + nd^2 + (n − 1)d^2 + d^2 = 2nd^2 + nd\,,

calculations.

Part (b): We can derive recursive formulas for µ_n and C_n in the following way. First, for µ_n we note that

µ_{n+1} = \frac{1}{n+1} \sum_{k=1}^{n+1} x_k = \frac{x_{n+1}}{n+1} + \left( \frac{n}{n+1} \right) \frac{1}{n} \sum_{k=1}^{n} x_k = \frac{1}{n+1} x_{n+1} + \left( \frac{n}{n+1} \right) µ_n\,.


Writing \frac{n}{n+1} as \frac{n+1−1}{n+1} = 1 − \frac{1}{n+1}, the above becomes

µ_{n+1} = \frac{1}{n+1} x_{n+1} + µ_n − \frac{1}{n+1} µ_n = µ_n + \left( \frac{1}{n+1} \right) (x_{n+1} − µ_n)\,,

as expected.

To derive the recursive update formula for the covariance matrix we begin by recalling the definition of C_{n+1},

C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} (x_k − µ_{n+1})(x_k − µ_{n+1})′\,.

Now introducing the recursive definition of the mean µn+1 as computed above we have that

C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} \left( x_k − µ_n − \frac{1}{n+1}(x_{n+1} − µ_n) \right) \left( x_k − µ_n − \frac{1}{n+1}(x_{n+1} − µ_n) \right)′
    = \frac{1}{n} \sum_{k=1}^{n+1} (x_k − µ_n)(x_k − µ_n)′ − \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_k − µ_n)(x_{n+1} − µ_n)′
    − \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_{n+1} − µ_n)(x_k − µ_n)′ + \frac{1}{n(n+1)^2} \sum_{k=1}^{n+1} (x_{n+1} − µ_n)(x_{n+1} − µ_n)′\,.

Since µ_n = \frac{1}{n} \sum_{k=1}^{n} x_k we see that

µ_n − \frac{1}{n} \sum_{k=1}^{n} x_k = 0 \quad \text{or} \quad \frac{1}{n} \sum_{k=1}^{n} (µ_n − x_k) = 0\,.

Using this fact, the second and third terms above can be greatly simplified. For example, we have that

\sum_{k=1}^{n+1} (x_k − µ_n) = \sum_{k=1}^{n} (x_k − µ_n) + (x_{n+1} − µ_n) = x_{n+1} − µ_n\,.

Using this identity, and the fact that the fourth term has a summand that does not depend on k, we have C_{n+1} given by

C_{n+1} = \frac{n−1}{n} \left( \frac{1}{n−1} \sum_{k=1}^{n} (x_k − µ_n)(x_k − µ_n)′ + \frac{1}{n−1} (x_{n+1} − µ_n)(x_{n+1} − µ_n)′ \right)
    − \frac{2}{n(n+1)} (x_{n+1} − µ_n)(x_{n+1} − µ_n)′ + \frac{1}{n(n+1)} (x_{n+1} − µ_n)(x_{n+1} − µ_n)′
    = \left( 1 − \frac{1}{n} \right) C_n + \left( \frac{1}{n} − \frac{1}{n(n+1)} \right) (x_{n+1} − µ_n)(x_{n+1} − µ_n)′
    = C_n − \frac{1}{n} C_n + \frac{1}{n} \left( \frac{n+1−1}{n+1} \right) (x_{n+1} − µ_n)(x_{n+1} − µ_n)′
    = \left( \frac{n−1}{n} \right) C_n + \frac{1}{n+1} (x_{n+1} − µ_n)(x_{n+1} − µ_n)′\,,


which is the result in the book.
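The recursions are easy to verify numerically. Below is a small Python sketch under arbitrary random data, using numpy's batch estimates as the reference:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))    # n = 50 arbitrary samples in d = 4 dimensions

mu = X[0].copy()                # mu_1
C = np.zeros((4, 4))            # C_1 (its coefficient (n-1)/n vanishes at n = 1)
for n in range(1, len(X)):      # fold in x_{n+1} = X[n]
    d = X[n] - mu
    C = ((n - 1) / n) * C + np.outer(d, d) / (n + 1)
    mu = mu + d / (n + 1)

assert np.allclose(mu, X.mean(axis=0))
assert np.allclose(C, np.cov(X, rowvar=False))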

Part (c): To compute µ_n using the recurrence formulation we have d subtractions to compute x_n − µ_{n−1}, the scalar division requires another d operations, and finally the addition of µ_{n−1} requires another d operations, giving in total 3d operations. This is to be compared with the (d + 1)n operations in the non-recursive case.

To count the operations required to compute C_n recursively, we have d computations for the difference x_n − µ_{n−1}, then d^2 operations for the outer product (x_n − µ_{n−1})(x_n − µ_{n−1})′, another d^2 operations to scale this outer product, another d^2 operations to multiply C_{n−1} by \left( \frac{n−2}{n−1} \right), and finally d^2 operations to add the two products. Thus in total we have

d + d^2 + d^2 + d^2 + d^2 = 4d^2 + d\,.

This is to be compared with 2nd^2 + nd in the non-recursive case. For large n the non-recursive calculations are much more computationally intensive.


Chapter 6 (Multilayer Neural Networks)

Problem 39 (derivatives of matrix inner products)

Part (a): We present a slightly different method to prove this here. We desire the gradient (or derivative) of the function

φ ≡ x^T K x\,.

The i-th component of this derivative is given by

\frac{∂φ}{∂x_i} = \frac{∂}{∂x_i} \left( x^T K x \right) = e_i^T K x + x^T K e_i\,,

where e_i is the i-th elementary basis vector of R^n, i.e. it has a 1 in the i-th position and zeros everywhere else. Now, since x^T K e_i is a scalar it equals its own transpose,

x^T K e_i = (x^T K e_i)^T = e_i^T K^T x\,,

and the above becomes

\frac{∂φ}{∂x_i} = e_i^T K x + e_i^T K^T x = e_i^T \left( (K + K^T) x \right)\,.

Since multiplying by e_i^T on the left selects the i-th row of the expression to its right, we see that the full gradient is given by

∇φ = (K + K^T) x\,,

as requested in the text. Note that this expression can also be proved easily by writing each term in component notation, as suggested in the text.

Part (b): If K is symmetric, then since K^T = K the above expression simplifies to

∇φ = 2Kx ,

as claimed in the book.
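Both claims can be checked with finite differences; the following Python sketch uses arbitrary random K and x:

import numpy as np

rng = np.random.default_rng(4)
K = rng.normal(size=(5, 5))     # a general, non-symmetric matrix
x = rng.normal(size=5)

phi = lambda v: v @ K @ v
grad_exact = (K + K.T) @ x

# Central finite differences along each elementary basis vector e_i.
eps = 1e-6
grad_fd = np.array([(phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
                    for e in np.eye(5)])

assert np.allclose(grad_exact, grad_fd, atol=1e-5)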

Chapter 9 (Algorithm-Independent Machine Learning)

Problem 9 (sums of binomial coefficients)

Part (a): We first recall the binomial theorem which is

(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n−k}\,.

If we let x = 1 and y = 1 then x + y = 2 and the sum above becomes

2^n = \sum_{k=0}^{n} \binom{n}{k}\,,


which is the stated identity. As an aside, note also that if we let x = −1 and y = 1 then x + y = 0 and we obtain another interesting identity involving the binomial coefficients

0 = \sum_{k=0}^{n} \binom{n}{k} (−1)^k\,.
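Both identities are easy to confirm numerically, e.g. with a short Python check:

from math import comb

for n in range(1, 10):
    assert sum(comb(n, k) for k in range(n + 1)) == 2**n
    assert sum((-1)**k * comb(n, k) for k in range(n + 1)) == 0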

Problem 23 (the average of the leave one out means)

We define the leave one out mean µ(i) as

µ_{(i)} = \frac{1}{n−1} \sum_{j ≠ i} x_j\,.

This can obviously be computed for every i. Now we want to consider the mean of the leave one out means. To do so, first notice that µ_{(i)} can be written as

µ_{(i)} = \frac{1}{n−1} \sum_{j ≠ i} x_j = \frac{1}{n−1} \left( \sum_{j=1}^{n} x_j − x_i \right) = \frac{1}{n−1} \left( n \left( \frac{1}{n} \right) \sum_{j=1}^{n} x_j − x_i \right) = \frac{n}{n−1} µ − \frac{x_i}{n−1}\,.

With this we see that the mean of all of the leave one out means is given by

\frac{1}{n} \sum_{i=1}^{n} µ_{(i)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{n}{n−1} µ − \frac{x_i}{n−1} \right) = \frac{n}{n−1} µ − \frac{1}{n−1} \left( \frac{1}{n} \right) \sum_{i=1}^{n} x_i = \frac{n}{n−1} µ − \frac{µ}{n−1} = µ\,,

as expected.
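A quick numeric confirmation in Python (the data are arbitrary):

import numpy as np

x = np.random.default_rng(5).normal(size=20)   # arbitrary data
n = len(x)
loo_means = np.array([(x.sum() - x[i]) / (n - 1) for i in range(n)])
assert np.isclose(loo_means.mean(), x.mean())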

Problem 40 (maximum likelihood estimation with a binomial random variable)

Since k has a binomial distribution with parameters (n′, p) we have that

P(k) = \binom{n′}{k} p^k (1 − p)^{n′−k}\,.


Then the p that maximizes this expression is given by taking the derivative of the above with respect to p, setting the resulting expression equal to zero, and solving for p. We find that this derivative is given by

\frac{d}{dp} P(k) = \binom{n′}{k} k p^{k−1} (1 − p)^{n′−k} + \binom{n′}{k} p^k (n′ − k)(1 − p)^{n′−k−1} (−1)\,.

Setting this equal to zero and solving for p, we find p = \frac{k}{n′}, the empirical counting estimate of the probability of success, as we were asked to show.
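As a sanity check, a brute-force grid maximization of the likelihood in Python recovers k/n′; the values n′ = 20 and k = 7 are illustrative:

from math import comb

n_prime, k = 20, 7   # illustrative values
likelihood = lambda p: comb(n_prime, k) * p**k * (1 - p)**(n_prime - k)

# Maximize over a fine grid and compare with k / n'.
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=likelihood)
print(p_hat, k / n_prime)   # both 0.35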