
Foundations of Statistical Inference

Julien Berestycki

Department of Statistics, University of Oxford

MT 2016


Lecture 10: Decision theory


Decision theory

Example You have been exposed to a deadly virus. About 1/3 of people who are exposed to the virus are infected by it, and all those infected by it die unless they receive a vaccine. By the time any symptoms of the virus show up, it is too late for the vaccine to work. You are offered a vaccine for £500. Do you take it or not?

The most likely scenario is that you don’t have the virus, but basing a decision on this ignores the costs (or loss) associated with the decisions. We would put a very high loss on dying!
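
To make the trade-off concrete, a minimal worked computation with a hypothetical value L for the loss of dying (L is an illustration, not from the slides). Declining the vaccine has expected loss (1/3)·L + (2/3)·0 = L/3, while taking it costs 500 with certainty. Taking the vaccine minimizes expected loss whenever L/3 > 500, i.e. whenever L > 1500: the decision hinges on the loss attached to dying, not only on the most likely state.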



Decision theory

Example You are in a gun fight with a trigger-happy tough cop. Unfortunately, he has his gun pointed at you and you must decide whether he still has a bullet in it. You estimate the probability that he does at less than 5%. You must decide whether or not to test this by reaching for your gun.



Decision Theory

Decision Theory sits ‘above’ Bayesian and classical statistical inference and gives us a basis to compare different approaches to statistical inference.

We make decisions by applying rules to data.

Decisions are subject to risk.

A risk function specifies the expected loss which follows from the application of a given rule, and this is a basis for comparing rules.

We may choose a rule to minimize the maximum risk, or we may choose a rule to minimize the average risk.


Terminology

θ is the ‘true state of nature’; X ∼ f(x; θ) is the data.

The decision rule is δ. If X = x, adopt the action δ(x) given by the rule.

Example A single parameter θ is estimated from X = x by δ(x) = θ̂(x).

The rule θ̂ is the functional form of the estimator. The action is the value of the estimator.

The loss function LS(θ, δ(x)) measures the loss from action δ(x) when θ holds.


Example

LS(θ, θ̂(x)) is a loss function which increases as θ̂(x) moves away from θ. Here are three common loss functions.

Zero-one loss

LS(θ, θ̂(x)) = 0 if |θ̂(x) − θ| < b, and a if |θ̂(x) − θ| ≥ b,

where a, b are constants.

Absolute error loss

LS(θ, θ̂(x)) = a|θ̂(x) − θ|

where a > 0.

Quadratic loss

LS(θ, θ̂(x)) = (θ̂(x) − θ)².
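
As a rough illustration (not part of the slides), a minimal Python sketch of the three loss functions; the values of a and b here are arbitrary choices:

```python
import numpy as np

def zero_one_loss(theta, theta_hat, a=1.0, b=0.5):
    # 0 if the estimate is within b of theta, a otherwise
    return np.where(np.abs(theta_hat - theta) < b, 0.0, a)

def absolute_error_loss(theta, theta_hat, a=1.0):
    # loss grows linearly with the estimation error
    return a * np.abs(theta_hat - theta)

def quadratic_loss(theta, theta_hat):
    # loss grows with the squared estimation error
    return (theta_hat - theta) ** 2

theta_hat = np.linspace(-2, 2, 5)                  # [-2, -1, 0, 1, 2]
print(zero_one_loss(0.0, theta_hat))               # [1. 1. 0. 1. 1.]
print(absolute_error_loss(0.0, theta_hat))         # [2. 1. 0. 1. 2.]
print(quadratic_loss(0.0, theta_hat))              # [4. 1. 0. 1. 4.]
```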


Example

[Figure: graphs of the zero-one loss (jump of height a at |θ̂ − θ| = b), the absolute error loss, and the quadratic loss, each plotted against θ̂ around θ.]


Risk function

Definition The risk function R(θ, δ) is defined as

R(θ, δ) = ∫ LS(θ, δ(x)) f(x; θ) dx.

This is the expected value of the loss (aka expected loss).

Example In the context of point estimation, with quadratic loss, the risk function is the mean square error,

R(θ, θ̂) = E[(θ̂(X) − θ)²].
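
As an illustration (not from the slides), a minimal Monte Carlo sketch of this definition, assuming n i.i.d. N(θ, 1) observations and the rule δ(x) = x̄, whose risk under quadratic loss is 1/n:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 200_000

# Draw many datasets, apply delta(x) = mean(x) to each, and average the
# quadratic loss over datasets: R(theta, delta) ~ E[(delta(X) - theta)^2]
x = rng.normal(theta, 1.0, size=(reps, n))
delta_x = x.mean(axis=1)
risk_mc = np.mean((delta_x - theta) ** 2)

print(risk_mc)   # close to the analytic risk 1/n = 0.1
```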


Admissible and inadmissible rules.

Definition A procedure δ1 is inadmissible if there exists another procedure δ2 such that

R(θ, δ1) ≥ R(θ, δ2) for all θ ∈ Θ,

with R(θ, δ1) > R(θ, δ2) for at least one θ.

A procedure which is not inadmissible is admissible.


Example

Suppose X ∼ U(0, θ). Consider estimators of the form θ̂(x) = ax (this is a family of decision rules indexed by a).

Show that a = 3/2 is a necessary condition for the rule θ̂ to be admissible under quadratic loss.

R(θ, θ̂) = ∫_0^θ (ax − θ)² (1/θ) dx = (a²/3 − a + 1)θ²,

and R is minimized at a = 3/2.

This does not show that θ̂(x) = 3x/2 is admissible here.

It only shows that all estimators with a ≠ 3/2 are inadmissible. The estimator θ̂(x) = 3x/2 may be inadmissible relative to other estimators not in this class.
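
A minimal Monte Carlo check of this calculation (an illustration, with θ = 5 chosen arbitrarily): the simulated risk of θ̂(x) = ax should track (a²/3 − a + 1)θ² and be smallest near a = 3/2.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, reps = 5.0, 500_000
x = rng.uniform(0.0, theta, size=reps)   # X ~ U(0, theta)

for a in [1.0, 1.25, 1.5, 1.75, 2.0]:
    risk_mc = np.mean((a * x - theta) ** 2)        # Monte Carlo risk
    risk_exact = (a**2 / 3 - a + 1) * theta**2     # analytic risk
    print(a, round(risk_mc, 3), round(risk_exact, 3))
# both columns are minimized at a = 1.5, where R = theta^2 / 4 = 6.25
```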


Minimax rules

Definition A rule δ is a minimax rule if maxθ R(θ, δ) ≤ maxθ R(θ, δ′) for any other rule δ′. It minimizes the maximum risk.

Since minimax minimizes the maximum risk (i.e. the loss averaged over all possible data X ∼ f, then maximized over θ), the choice of rule is not influenced by the actual data X = x (though given the rule δ, the action δ(x) is data-dependent).

It makes sense when the maximum loss scenario must be avoided, but it can lead to poor performance on average.


Bayes risk, Bayes rule, expected posterior loss

Definition Suppose we have a prior probability π = π(θ) for θ. Denote by

r(π, δ) = ∫ R(θ, δ) π(θ) dθ

the Bayes risk of rule δ. A Bayes rule is a rule that minimizes the Bayes risk. A Bayes rule is sometimes called a Bayes procedure.

Let π(θ|x) = L(x; θ) π(θ) / h(x) denote the posterior following from likelihood L and prior π.

Definition The expected posterior loss is defined as

∫ LS(θ, δ(x)) π(θ|x) dθ.


Lemma A Bayes rule minimizes the expected posterior loss.

Proof

∫ R(θ, δ) π(θ) dθ = ∫∫ LS(θ, δ(x)) L(x; θ) π(θ) dx dθ

= ∫∫ LS(θ, δ(x)) π(θ|x) h(x) dx dθ

= ∫ h(x) ( ∫ LS(θ, δ(x)) π(θ|x) dθ ) dx.

That is, for each x we choose δ(x) to minimize the inner integral ∫ LS(θ, δ(x)) π(θ|x) dθ.


Bayes rules for point estimation

Zero-one loss

LS(θ, θ̂(x)) = 0 if |θ̂(x) − θ| < b, and a if |θ̂(x) − θ| ≥ b,

where a, b are constants.

We need to minimize

∫_{−∞}^{∞} π(θ|x) LS(θ, θ̂) dθ = a ∫_{θ̂+b}^{∞} π(θ|x) dθ + a ∫_{−∞}^{θ̂−b} π(θ|x) dθ

∝ 1 − ∫_{θ̂−b}^{θ̂+b} π(θ|x) dθ.

That is, we want to maximize ∫_{θ̂−b}^{θ̂+b} π(θ|x) dθ.


Zero-one loss

If π(θ|x) is unimodal, the maximum is attained by choosing θ̂ to be the mid-point of the interval of length 2b for which π(θ|x) has the same value at both ends.

[Figure: a unimodal posterior density with the interval of length 2b whose endpoints have equal density; θ̂ is its mid-point.]

As b → 0, θ̂ → the global mode of the posterior distribution. If π(θ|x) is unimodal and symmetric, the optimal θ̂ is the median (equal to the mean and mode) of the posterior distribution.


Absolute error loss

LS(θ, θ̂(x)) = a|θ̂(x) − θ|.

We need to minimise

∫ |θ̂ − θ| π(θ|x) dθ = ∫_{−∞}^{θ̂} (θ̂ − θ) π(θ|x) dθ + ∫_{θ̂}^{∞} (θ − θ̂) π(θ|x) dθ.

Differentiate with respect to θ̂ and equate to zero:

∫_{−∞}^{θ̂} π(θ|x) dθ − ∫_{θ̂}^{∞} π(θ|x) dθ = 0.

That is, the optimal θ̂ is the median of the posterior distribution.


Quadratic loss

Minimize

Eθ|x[(θ̂ − θ)²] = Eθ|x[(θ̂ − θ̄ + θ̄ − θ)²]

= (θ̂ − θ̄)² + 2(θ̂ − θ̄) Eθ|x(θ − θ̄) + Eθ|x[(θ − θ̄)²],

where θ̄ is the posterior mean of θ.

Note θ̂ and θ̄ are constants in the posterior distribution of θ, so that (θ̂ − θ̄) Eθ|x(θ − θ̄) = 0. So

Eθ|x[(θ̂ − θ)²] = (θ̂ − θ̄)² + Vθ|x(θ).

The quadratic loss function is minimized when θ̂ = θ̄, the posterior mean.


Summary

The form of the Bayes rule depends upon the loss function in the following way:

Zero-one loss (as b → 0) leads to the posterior mode.

Absolute error loss leads to the posterior median.

Quadratic loss leads to the posterior mean.

Note These are not the only loss functions one could use in a given situation, and other loss functions will lead to different Bayes rules.


Example

X ∼ Binomial(n, θ), and the prior π(θ) is a Beta(α, β) distribution.

E(θ) = α / (α + β).

The Beta(α, β) distribution is unimodal if α, β > 1, with mode (α − 1)/(α + β − 2).

The posterior distribution of θ | x is Beta(α + x, β + n − x).

With zero-one loss and b → 0, the Bayes estimator is (α + x − 1)/(α + β + n − 2).

For a quadratic loss function, the Bayes estimator is (α + x)/(α + β + n).

For an absolute error loss function, the Bayes estimator is the median of the posterior.
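
A minimal numeric sketch of these three estimators (the prior Beta(2, 2) and the data n = 10, x = 7 are illustrative choices, not from the slides); it assumes scipy is available:

```python
from scipy.stats import beta

a0, b0, n, x = 2, 2, 10, 7            # illustrative prior Beta(2, 2) and data
ap, bp = a0 + x, b0 + n - x           # posterior Beta(a0 + x, b0 + n - x) = Beta(9, 5)

post_mode = (ap - 1) / (ap + bp - 2)  # zero-one loss, b -> 0: (alpha + x - 1)/(alpha + beta + n - 2)
post_median = beta.ppf(0.5, ap, bp)   # absolute error loss: posterior median
post_mean = ap / (ap + bp)            # quadratic loss: (alpha + x)/(alpha + beta + n)

print(post_mode, post_median, post_mean)  # 0.667, ~0.65, 0.643
```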


Randomized decision rules

Suppose we have a collection of l decision rules d1, . . . , dl. For probability weights p1, . . . , pl, define d∗ = Σ pi di to be the rule ‘select rule di with probability pi and apply it’.

Definition d∗ is a randomized decision rule.

The risk function of a randomized decision rule is

R(θ, d∗) = Σ_{i=1}^{l} pi R(θ, di).

Minimax rules may be of this form.
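
A small numeric illustration with made-up risk values: two rules whose maximum risks are equal, and whose equal-weight mixture has strictly smaller maximum risk, which is why a minimax rule may need to randomize.

```python
import numpy as np

# Hypothetical risk table: rows are rules d1, d2; columns are states theta1, theta2
R = np.array([[0.0, 4.0],
              [4.0, 0.0]])

p = np.array([0.5, 0.5])        # mixture weights p1, p2
R_star = p @ R                  # R(theta, d*) = sum_i p_i R(theta, d_i)

print(R.max(axis=1))            # max risk of each pure rule: [4. 4.]
print(R_star, R_star.max())     # mixture risks [2. 2.], max risk 2.0
```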


Finite decision problem

Definition A decision problem is said to be finite when the parameter space Θ = {θ1, . . . , θk} is finite.

In this case the notions of admissible, minimax and Bayes rules can be given geometric interpretations.

Definition The risk set S ⊂ R^k is the set of points (R(θ1, d), . . . , R(θk, d)) for some decision rule d.

Lemma S is a convex set. (Randomized rules realize every convex combination of risk points, since R(θ, d∗) = Σ pi R(θ, di).)


Finite decision problem

[Figure: the risk set S in the (R1, R2) plane.]

1 Extreme points = non-randomized rules.

2 Lower thick line = admissible rules.

3 Minimax is the intersection with R1 = R2; depending on the shape of S this point may correspond to a randomized or a non-randomized rule.


Finite decision problem

To find the Bayes rule, suppose the prior is (π1, π2). For any c, the line π1R1 + π2R2 = c represents a class of decision rules with the same Bayes risk c; the Bayes rule is found where the line with the smallest c still touches S.

[Figures: three cases.] If the line touches S at a single extreme point, the Bayes rule is unique and therefore non-random. If it touches S along an edge, the Bayes rule is not unique but can be chosen non-random. The slope of the line, set by the prior, determines which point of S is selected: this is how the prior influences the Bayes rule.
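
A minimal numeric sketch of this picture, with made-up risk points: since π1R1 + π2R2 is linear in (R1, R2), its minimum over the convex risk set is attained at an extreme point, so it suffices to scan the non-randomized rules.

```python
import numpy as np

# Hypothetical risk points (R(theta1, d), R(theta2, d)) for non-randomized rules
vertices = np.array([[0.0, 4.0], [1.0, 1.5], [3.0, 0.5], [4.0, 0.0]])
prior = np.array([0.7, 0.3])    # (pi1, pi2)

bayes_risk = vertices @ prior   # pi1*R1 + pi2*R2 for each rule
best = np.argmin(bayes_risk)

print(bayes_risk)               # [1.2  1.15 2.25 2.8 ]
print(vertices[best])           # the Bayes rule's risk point: [1.  1.5]
```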
