Evaluating decadal hindcasts: why and how?


Page 1: Evaluating decadal hindcasts: why and how?

Evaluating decadal hindcasts: why and how?

Chris Ferro (University of Exeter), T. Fricker, F. Otto, D. Stephenson, E. Suckling

CliMathNet Conference (3 July 2013, Exeter, UK)


Page 4: Evaluating decadal hindcasts: why and how?

Evaluating ensemble forecasts

Multiple predictions, e.g. model simulations from several initial conditions.

Want scores that favour ensembles whose members behave as if they and the observation are drawn from the same probability distribution.

Page 5: Evaluating decadal hindcasts: why and how?

Current practice is unfair

Current practice evaluates a proper scoring rule for the empirical distribution function of the ensemble.

A scoring rule, s(p,y), for a probability forecast, p, and an observation, y, is proper if, whenever y ~ q, the expected score, Ey{s(p,y)}, is optimized over all forecasts p by issuing p = q.

Proper scoring rules favour probability forecasts that behave as if the observations are randomly sampled from the forecast distributions.
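As an aside (not part of the original slides), propriety can be checked numerically. For instance, for the Brier score s(p,y) = (p – y)² introduced on the next page, the expected score with Pr(y = 1) = q is q(p – 1)² + (1 – q)p², which is minimized at p = q. A minimal Python sketch, with q chosen arbitrarily:

    import numpy as np

    def expected_brier(p, q):
        """Expected Brier score E_y[(p - y)^2] when Pr(y = 1) = q."""
        return q * (p - 1.0) ** 2 + (1.0 - q) * p ** 2

    q = 0.3                               # assumed true event probability
    p_grid = np.linspace(0.0, 1.0, 1001)  # candidate forecast probabilities
    print(p_grid[np.argmin(expected_brier(p_grid, q))])  # 0.3: optimized at p = q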

Page 6: Evaluating decadal hindcasts: why and how?

Examples of proper scoring rules

Brier score: s(p,y) = (p – y)² for observation y = 0 or 1, and probability forecast 0 ≤ p ≤ 1.

Ensemble Brier score: s(x,y) = (i/n – y)² where i of the n ensemble members predict the event {y = 1}.

CRPS: for real y and forecast p(t) = Pr(y ≤ t), with I the indicator function,

s(p,y) = ∫ {p(t) – I(y ≤ t)}² dt.

Ensemble CRPS: where i(t) of the n ensemble members predict the event {y ≤ t},

s(x,y) = ∫ {i(t)/n – I(y ≤ t)}² dt.
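A minimal Python sketch of these two ensemble scores (the function names are my own, not from the talk; the ensemble CRPS uses the standard identity ∫ {i(t)/n – I(y ≤ t)}² dt = (1/n)Σ|x_i – y| – (1/(2n²))Σ|x_i – x_j|):

    import numpy as np

    def ensemble_brier(event_forecasts, y):
        """(i/n - y)^2, where i of the n members forecast the event and y is 0 or 1."""
        i = np.sum(event_forecasts)   # event_forecasts: array of 0/1 member forecasts
        n = len(event_forecasts)
        return (i / n - y) ** 2

    def ensemble_crps(members, y):
        """Integral of {i(t)/n - I(y <= t)}^2 dt for the ensemble (x_1, ..., x_n)."""
        x = np.asarray(members, dtype=float)
        n = len(x)
        pairwise = np.abs(x[:, None] - x[None, :]).sum()
        return np.mean(np.abs(x - y)) - pairwise / (2.0 * n ** 2)

    print(ensemble_crps([0.1, -0.4, 0.7, 0.2], y=0.3))   # example with made-up values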

Page 7: Evaluating decadal hindcasts: why and how?

Example: ensemble CRPS

Observations y ~ N(0,1) and n ensemble members x_i ~ N(0,σ²) for i = 1, ..., n.

Plot expected value of the ensemble CRPS against σ.

The ensemble CRPS is optimized when the ensemble is under-dispersed (σ < 1).

[Figure: expected ensemble CRPS as a function of σ, with curves for n = 2, 4 and 8.]
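The curves can be approximated by a quick Monte Carlo experiment. The sketch below (my own, using a vectorised version of the ensemble CRPS from the previous page) estimates the expected score for a few values of σ with n = 4:

    import numpy as np

    rng = np.random.default_rng(0)

    def ensemble_crps(x, y):
        """Ensemble CRPS for ensembles x (shape reps x n) and observations y (shape reps)."""
        n = x.shape[-1]
        pairwise = np.abs(x[..., :, None] - x[..., None, :]).sum(axis=(-2, -1))
        return np.mean(np.abs(x - y[..., None]), axis=-1) - pairwise / (2.0 * n ** 2)

    n, reps = 4, 200_000
    for sigma in [0.6, 0.8, 1.0, 1.2]:
        x = rng.normal(0.0, sigma, size=(reps, n))   # ensembles from N(0, sigma^2)
        y = rng.normal(0.0, 1.0, size=reps)          # observations from N(0, 1)
        print(sigma, ensemble_crps(x, y).mean())
    # the mean score is smallest for sigma < 1: under-dispersed ensembles are rewarded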

Page 8: Evaluating decadal hindcasts: why and how?

Fair scoring rules for ensembles

Interpret the ensemble as a random sample.

Fair scoring rules favour ensembles whose members behave as if they and the observations are sampled from the same distribution.

A scoring rule, s(x,y), for an ensemble forecast, x, sampled from p, and an observation, y, is fair if, whenever y ~ q, the expected score, Ex,y{s(x,y)}, is optimized over all choices of p by sampling the ensemble from p = q.

Fricker, Ferro, Stephenson (2013) Three recommendations for evaluating climate predictions. Meteorological Applications, 20, 246-255 (open access)

Page 9: Evaluating decadal hindcasts: why and how?

Characterization: binary case

Let y = 1 if an event occurs, and let y = 0 otherwise.

Let s_{i,y} be the (finite) score when i of the n ensemble members forecast the event and the observation is y.

The (negatively oriented) score is fair if

(n – i)(s_{i+1,0} – s_{i,0}) = i(s_{i–1,1} – s_{i,1})

for i = 0, 1, ..., n and s_{i+1,0} ≥ s_{i,0} for i = 0, 1, ..., n – 1.

Ferro (2013) Fair scores for ensemble forecasts. Submitted.
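This condition can be checked numerically. For example, the fair Brier score defined on the next page has s_{i,0} = (i/n)² – i(n – i)/{n²(n – 1)} and s_{i,1} = (i/n – 1)² – i(n – i)/{n²(n – 1)}; the sketch below (my own illustration, checking the interior values i = 1, ..., n – 1) confirms that it satisfies the characterization:

    import numpy as np

    n = 8
    i = np.arange(n + 1)
    penalty = i * (n - i) / (n**2 * (n - 1))
    s0 = (i / n) ** 2 - penalty          # scores s_{i,0} (observation y = 0)
    s1 = (i / n - 1.0) ** 2 - penalty    # scores s_{i,1} (observation y = 1)

    k = np.arange(1, n)                  # interior values of i
    lhs = (n - k) * (s0[k + 1] - s0[k])
    rhs = k * (s1[k - 1] - s1[k])
    print(np.allclose(lhs, rhs))         # True: the fairness condition holds
    print(np.all(np.diff(s0) >= 0))      # True: s_{i+1,0} >= s_{i,0}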

Page 10: Evaluating decadal hindcasts: why and how?

Examples of fair scoring rules

Ensemble Brier score: s(x,y) = (i/n – y)² where i of the n ensemble members predict the event {y = 1}.

Fair Brier score: s(x,y) = (i/n – y)² – i(n – i)/{n²(n – 1)}.

Ensemble CRPS: where i(t) of the n ensemble members predict the event {y ≤ t},

s(x,y) = ∫ {i(t)/n – I(y ≤ t)}² dt.

Fair CRPS: if (x_1, ..., x_n) are the n ensemble members,

s(x,y) = ∫ {i(t)/n – I(y ≤ t)}² dt – Σ_{i,j} |x_i – x_j| / {2n²(n – 1)}, with the sum over all pairs (i, j).
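A Python sketch of both fair scores (function names are my own; for the fair CRPS, the integral-plus-correction form rearranges to (1/n)Σ|x_i – y| – Σ|x_i – x_j|/{2n(n – 1)}, a standard simplification):

    import numpy as np

    def fair_brier(event_forecasts, y):
        """(i/n - y)^2 - i(n - i)/{n^2 (n - 1)}."""
        i = np.sum(event_forecasts)   # event_forecasts: array of 0/1 member forecasts
        n = len(event_forecasts)
        return (i / n - y) ** 2 - i * (n - i) / (n**2 * (n - 1))

    def fair_crps(members, y):
        """Ensemble CRPS minus the pairwise correction term."""
        x = np.asarray(members, dtype=float)
        n = len(x)
        pairwise = np.abs(x[:, None] - x[None, :]).sum()
        return np.mean(np.abs(x - y)) - pairwise / (2.0 * n * (n - 1))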

Page 11: Evaluating decadal hindcasts: why and how?

Example: ensemble CRPS

Observations y ~ N(0,1) and n ensemble members x_i ~ N(0,σ²) for i = 1, ..., n.

Plot expected value of the fair CRPS against σ.

The fair CRPS is always optimized when the ensemble is well dispersed (σ = 1).

[Figure: expected values of the unfair and fair CRPS as functions of σ, with unfair-score curves for n = 2, 4 and 8 and a single fair-score curve for all n.]
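Repeating the Monte Carlo sketch from page 7 with the fair CRPS (again my own illustration) shows the change in behaviour:

    import numpy as np

    rng = np.random.default_rng(0)

    def fair_crps(x, y):
        """Fair CRPS for ensembles x (shape reps x n) and observations y (shape reps)."""
        n = x.shape[-1]
        pairwise = np.abs(x[..., :, None] - x[..., None, :]).sum(axis=(-2, -1))
        return np.mean(np.abs(x - y[..., None]), axis=-1) - pairwise / (2.0 * n * (n - 1))

    n, reps = 4, 200_000
    for sigma in [0.6, 0.8, 1.0, 1.2]:
        x = rng.normal(0.0, sigma, size=(reps, n))
        y = rng.normal(0.0, 1.0, size=reps)
        print(sigma, fair_crps(x, y).mean())
    # the smallest mean score now occurs at (or very close to) sigma = 1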

Page 12: Evaluating decadal hindcasts: why and how?

How good are climate predictions?

Justify and quantify our judgments about the credibility of climate predictions, i.e. predictions of performance.

Extrapolating past performance has little justification.

Measure the performance of available experiments and judge whether they pose harder or easier problems than the climate predictions.

Ensure beliefs agree with these performance bounds.

Otto, Ferro, Fricker, Suckling (2013) On judging the credibility of climate predictions. Climatic Change, on-line (open access)

Page 13: Evaluating decadal hindcasts: why and how?

Summary

Use existing data explicitly to justify quantitative predictions of the performance of climate predictions.

Evaluate ensemble forecasts (not only probability forecasts) to learn about the ensemble prediction system.

Use fair scoring rules to favour ensembles whose members behave as if they and the observation are drawn from the same probability distribution.

Page 14: Evaluating decadal hindcasts: why and how?

References

Ferro CAT (2013) Fair scores for ensemble forecasts. Submitted.

Fricker TE, Ferro CAT, Stephenson DB (2013) Three recommendations for evaluating climate predictions. Meteorological Applications, 20, 246-255 (open access).

Goddard L, and co-authors (2013) A verification framework for interannual-to-decadal predictions experiments. Climate Dynamics, 40, 245-272.

Otto FEL, Ferro CAT, Fricker TE, Suckling EB (2013) On judging the credibility of climate predictions. Climatic Change, online (open access).

Page 15: Evaluating decadal hindcasts: why and how?
Page 16: Evaluating decadal hindcasts: why and how?

Evaluating climate predictions

1. Large trends over the verification period can spuriously inflate the value of some verification measures, e.g. correlation.

Scores, which measure the performance of each forecast separately before averaging, are immune to this spurious skill.

[Figure: example series with correlations of 0.06 and 0.84.]
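A small synthetic illustration of this point (my own, with an arbitrary trend size): adding a shared warming trend to otherwise unrelated forecast and observation series raises the correlation sharply, while a per-forecast score such as the mean absolute error is unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    years = np.arange(41)
    fcst_anom = rng.normal(0.0, 0.15, size=years.size)  # skill-free forecast anomalies
    obs_anom = rng.normal(0.0, 0.15, size=years.size)   # independent observed anomalies
    trend = 0.03 * years                                 # shared trend (assumed size)

    for label, f, o in [("detrended", fcst_anom, obs_anom),
                        ("with trend", fcst_anom + trend, obs_anom + trend)]:
        print(label, np.corrcoef(f, o)[0, 1], np.mean(np.abs(f - o)))
    # the correlation increases markedly once the shared trend is added, while the
    # mean absolute error (an average of per-forecast scores) is identical in both cases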

Page 17: Evaluating decadal hindcasts: why and how?

Evaluating climate predictions

2. Long-range predictions of short-lived quantities (e.g. daily temperatures) can be well calibrated, and may exhibit resolution.

Evaluate predictions for relevant quantities, not only multi-year means.

Page 18: Evaluating decadal hindcasts: why and how?

Evaluating climate predictions

3. Scores should favour ensembles whose members behave as if they and the observation are sampled from the same distribution. ‘Fair’ scores do this; traditional scores do not.

Figure: The unfair continuous ranked probability score is optimized by under-dispersed ensembles of size n (curves for n = 2, 4 and 8, with the fair score shown for comparison).

Page 19: Evaluating decadal hindcasts: why and how?

Summary

Use existing data explicitly to justify quantitative predictions of the performance of climate predictions.

Be aware that some measures of performance may be inflated spuriously by climate trends.

Consider climate predictions of more decision-relevant quantities, not only multi-year means.

Use fair scores to evaluate ensemble forecasts.

Page 20: Evaluating decadal hindcasts: why and how?

Credibility and performance

Many factors may influence credibility judgments, but should do so if and only if they affect our expectations about the performance of the predictions.

Identify credibility with predicted performance.

We must be able to justify and quantify (roughly) our predictions of performance if they are to be useful.

Page 21: Evaluating decadal hindcasts: why and how?

Performance-based arguments

Extrapolate past performance on the basis of knowledge of the climate model and the real climate (Parker 2010).

Define a reference class of predictions (including the prediction in question) whose performances you cannot reasonably order in advance, measure the performance of some members of the class, and infer the performance of the prediction in question.

Popular for weather forecasts (many similar forecasts) but of less use for climate predictions (Frame et al. 2007).

Page 22: Evaluating decadal hindcasts: why and how?

Climate predictions

Few past predictions are similar to future predictions, so performance-based arguments are weak for climate.

Other data may still be useful: short-range predictions, in-sample hindcasts, imperfect model experiments etc.

These data are used by climate scientists, but typically to make qualitative judgments about performance.

We propose to use these data explicitly to make quantitative judgments about future performance.

Page 23: Evaluating decadal hindcasts: why and how?

Bounding arguments

1. Form a reference class of predictions that does not contain the prediction in question.

2. Judge if the prediction in question is a harder or easier problem than those in the reference class.

3. Measure the performance of some members of the reference class.

This provides a bound for your expectations about the performance of the prediction in question.

Page 24: Evaluating decadal hindcasts: why and how?

Bounding arguments

S = performance of a prediction from reference class C

S′ = performance of the prediction in question, from class C′

Let performance be positive, with smaller values better.

Infer probabilities Pr(S > s) from a sample from class C.

If C′ is harder than C then Pr(S′ > s) > Pr(S > s) for all s.

If C′ is easier than C then Pr(S′ > s) < Pr(S > s) for all s.
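A minimal sketch of this bounding calculation (the reference-class scores below are hypothetical numbers, purely for illustration): estimate Pr(S > s) empirically from the sample, then read it as a bound on Pr(S′ > s) according to whether C′ is judged harder or easier than C.

    import numpy as np

    # hypothetical scores (e.g. absolute errors) measured on the reference class C
    reference_scores = np.array([0.11, 0.05, 0.23, 0.08, 0.17, 0.30, 0.09, 0.14])

    def exceedance(scores, s):
        """Empirical estimate of Pr(S > s) from the reference-class sample."""
        return np.mean(np.asarray(scores) > s)

    s = 0.2
    p = exceedance(reference_scores, s)
    print(p)   # if C' is judged harder than C, Pr(S' > 0.2) is at least about p;
               # if C' is judged easier, it is at most about p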

Page 25: Evaluating decadal hindcasts: why and how?

Hindcast example

Global-mean, annual-mean surface air temperature anomalies relative to the mean over the previous 20 years. Initial-condition ensembles of HadCM3 launched every year from 1960 to 2000. Measure performance by absolute errors and consider a lead time of 9 years (a sketch of this set-up follows the list below).

1. Perfect model: predict another HadCM3 member

2. Imperfect model: predict a MIROC5 member

3. Reality: predict HadCRUT4 observations
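A sketch of the scoring set-up as I read it (the data structures, the convention that the lead-9 verification year is the ninth year after launch, and the use of the verifying series for the baseline are all assumptions for illustration, not details from the talk):

    import numpy as np

    def lead_abs_errors(hindcasts, series, start_years, lead=9, base_years=20):
        """Absolute errors of lead-`lead` anomalies, each taken relative to the
        mean of the `base_years` years preceding the launch.

        hindcasts: dict mapping start year -> array of annual means for leads 1, 2, ...
        series:    dict mapping year -> verifying annual mean (model or observed)
        """
        errors = []
        for y0 in start_years:
            baseline = np.mean([series[y] for y in range(y0 - base_years, y0)])
            fcst_anom = hindcasts[y0][lead - 1] - baseline
            verif_anom = series[y0 + lead - 1] - baseline
            errors.append(abs(fcst_anom - verif_anom))
        return np.array(errors)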

Page 26: Evaluating decadal hindcasts: why and how?

Hindcast example

Page 27: Evaluating decadal hindcasts: why and how?

1. Errors when predicting HadCM3

Page 28: Evaluating decadal hindcasts: why and how?

2. Errors when predicting MIROC5

Page 29: Evaluating decadal hindcasts: why and how?

3. Errors when predicting reality

Page 30: Evaluating decadal hindcasts: why and how?

Recommendations

Use existing data explicitly to justify quantitative predictions of the performance of climate predictions.

Collect data on more predictions, covering a range of physical processes and conditions, to tighten bounds.

Design hindcasts and imperfect model experiments to be as similar as possible to future prediction problems.

Train ourselves to be better judges of relative performance, especially to avoid over-confidence.

Page 31: Evaluating decadal hindcasts: why and how?

Future developments

Bounding arguments may help us to form fully probabilistic judgments about performance.

Let s = (s_1, ..., s_n) be a sample from S ~ F(·|p).

Let S′ ~ F(·|cp) with priors p ~ g(·) and c ~ h(·). Then Pr(S′ ≤ s | s) = ∫∫ F(s|cp) h(c) g(p|s) dc dp.

Bounding arguments refer to prior beliefs about S′ directly rather than indirectly through beliefs about c.
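A Monte Carlo sketch of this double integral under a toy model (every distributional choice here, including the exponential scores, the conjugate gamma prior and the log-normal h, is an illustrative assumption, not from the talk):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy model: reference-class scores S are Exponential with mean p, with a
    # conjugate Gamma(a0, b0) prior on the rate 1/p; the prediction in question
    # has S' Exponential with mean c*p, where c ~ h encodes how much harder
    # (c > 1) or easier (c < 1) we judge the new problem to be.
    s_sample = np.array([0.11, 0.05, 0.23, 0.08, 0.17, 0.30, 0.09, 0.14])  # hypothetical
    a0, b0 = 2.0, 0.2
    a_n, b_n = a0 + s_sample.size, b0 + s_sample.sum()   # posterior for the rate 1/p

    draws = 100_000
    p = 1.0 / rng.gamma(a_n, 1.0 / b_n, size=draws)      # p ~ g(p | s)
    c = rng.lognormal(np.log(1.5), 0.3, size=draws)      # c ~ h(c), "harder" on average

    s_threshold = 0.2
    print(np.mean(1.0 - np.exp(-s_threshold / (c * p))))  # Pr(S' <= s | s)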

Page 32: Evaluating decadal hindcasts: why and how?

Predicting performance

We might try to predict performance by forming our own prediction of the predictand.

If we incorporate information about the prediction in question then we must already have judged its credibility; if not then we ignore relevant information.

Consider predicting a coin toss. Our own prediction is Pr(head) = 0.5. Then our prediction of the performance of another prediction is bound to be Pr(correct) = 0.5 regardless of other information about that prediction.