An Introduction to Forecast Verification - HZGCOSMO/CLM/ART Training Forecast Verification Felix...

COSMO/CLM/ART Training Forecast Verification Felix Fundel

An Introduction to Forecast Verification

Felix Fundel

Deutscher Wetterdienst

FE 15 – Vorhersagbarkeit & Verifikation

Phone: +49 (69) 8062 2422

Email: [email protected]


Outline

Basics
• What is verification?
• What is a good forecast?
• Reasons for NWP verification
• Why is NWP erroneous?
• Answers verification can give
• Types of forecasts
• Types of observations
• Forecast properties

Methods
1) Deterministic Forecasts: Continuous, Categorical
2) Ensemble Prediction Systems: Ensemble, Probabilistic
3) Spatial: Fuzzy, Object based

Final Remarks
• Take special care of
• Verification guideline
• Active fields of research
• Further reading


What is Verification?

Comparison of prediction (forecast) and truth (observation or analysis)

Usually the “truth” is not known, so “evaluation” or “validation” would be more adequate terms than “verification”

Infer the goodness (quality and value) of the forecast, qualitatively or quantitatively

I Basics


What is a good forecast?

Deterministic point forecast

Straightforward error characterization (accuracy & correlation)

However:
• How much error do I allow for?
• What would be an adequate reference forecast?
• Do I demand the same quality for day 1 as for day 7?
• Is the observation really this exact?



Deterministic spatial forecast

Should the forecast be evaluated point by point?

Should I allow for some spatial inaccuracy?

Is the forecast equally good on all spatial scales?

How would a forecaster perceive the forecast quality?



Ensemble forecast

Is the observation within the ensemble range?

Would outside be ok as well?

How much ensemble spread is good?

Can I say anything about forecast quality from just one realization?


Reasons for NWP verification

Monitoring and quantification of errors → for communication to the public or customers

Unravel systematic forecast errors → for developers and forecasters

Compare different models / experiments → for decision making, developers


Why is NWP erroneous?

Deficiencies in the model
• Coarse grid resolution (vertical & horizontal)
• Coarse temporal resolution
• Parameterization of physical processes (e.g. radiation, precipitation)
• Numerical approximations
• Errors from model boundaries (regional models)
• Coding errors

Inaccurate initial conditions
• Too few observations (spatial & temporal)
• Uncertain observations (instrument error & representativity)
• Errors in data assimilation

Even a perfect model with perfect initial conditions has limited predictability (butterfly effect)


Some answers verification can deliver

The forecast is x % better/worse than a reference forecast (e.g. climatology, persistence, other model)

The forecast is valuable up to a lead time of x days

Forecast quality depends on time of the day, region, season, meteorological conditions…

Will I have an economic benefit from using the forecasts for decision making?

The forecast is/is not calibrated

The forecast is capable of representing the location, timing, shape, magnitude of objects (e.g. rain cells)


Types of forecasts

Deterministic model
• Decide on an initial state and start a single integration

Ensemble prediction
• Make many integrations (e.g. from different initial states, or multi-model or analogue ensembles…)


Types of observations

Point based
• SYNOP
• Ships, buoys
• TEMP
• Satellites
• Airplanes
• Observers

Spatial
• Rain radar
• Satellites
• Model analysis


Observations are not the truth!

• Each observation is a model itself (exception: counting events)

• Observations should come with an uncertainty which, if possible, should be considered in verification (but this is rarely done)

Verification against observations is (almost) always flawed

• A model value is a grid box average, and observations within a grid box might vary strongly. Many ways exist to match an observation to a grid point

• Is the observed value really what you want your model to predict?

• Gridded observations or analyses might be a workaround, but they again rely on models (statistical or physical) and might not be independent of the forecast.

Caution!


Forecast properties

Deterministic prediction system
• Bias - mean error
• Association - e.g. correlation

Ensemble prediction system
• Reliability - conditional bias over several categories (usually forecast probabilities)
• Resolution - ability to resolve events in different subsets
• Sharpness - spread of the forecast distribution
• Uncertainty - observation variability

Both
• Skill - value w.r.t. a reference forecast (e.g. persistence or climatology)
• Value - is the forecast helpful for decision making?


Types of verification

Continuous
• For deterministic predictions as time series, spatial data or both combined
• Example: temperature, pressure, upper-air variables

Dichotomous (binary, 2 categories, yes/no; a special case of multiple categories)
• For deterministic predictions as time series, spatial data or both combined
• Example: rain yes or no? Cloud amount category, wind speed, warnings

Ensemble
• For ensemble models, considering the forecast distribution
• Example: Does the ensemble spread capture the forecast uncertainty?

Probabilistic
• For probabilities derived from ensemble models
• Example: probability of exceeding a wind speed of 10 m/s?

Spatial
• Mostly deterministic models
• Example: Are objects predicted at the correct location?

II Verification Methods


Deterministic Forecasts

BIAS (mean error)

Shows the average direction of the error (positive or negative)

In NWP defined as positive (negative) if model forecast quantities are larger (smaller) than observed

Does not indicate the magnitude of the error, as positive and negative values might cancel each other out

Same unit as variable



MAE (mean absolute error)

Average error magnitude

Does not indicate the direction of the error



RMSE (root mean squared error)

Average error magnitude with quadratic weight

Sensitive to large errors, never smaller than MAE (MSE is the sum of squared bias and error variance)

RMSE > MAE indicates variation in the error magnitudes
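The three measures above can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the function name `continuous_scores` and the sample temperatures are hypothetical. The error variance is returned as well, so the relation MSE = bias² + error variance can be checked.

```python
import math

def continuous_scores(forecasts, observations):
    """Return bias, MAE, RMSE, MSE and error variance for paired values."""
    errors = [f - o for f, o in zip(forecasts, observations)]
    n = len(errors)
    bias = sum(errors) / n                    # mean error, sign gives direction
    mae = sum(abs(e) for e in errors) / n     # average error magnitude
    mse = sum(e * e for e in errors) / n      # quadratic weighting of errors
    error_variance = sum((e - bias) ** 2 for e in errors) / n
    return {
        "bias": bias,
        "mae": mae,
        "rmse": math.sqrt(mse),
        "mse": mse,
        "error_variance": error_variance,
    }

# Hypothetical 2 m temperature forecasts and observations in degrees Celsius
fcst = [12.0, 15.5, 9.0, 20.0]
obs = [11.0, 16.5, 9.0, 17.0]
scores = continuous_scores(fcst, obs)
print(scores)
```

In this sample the single 3 K error dominates the RMSE, which is why RMSE exceeds MAE.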




Correlation coefficient & anomaly correlation

Correspondence between forecast and observations

Measures linear association and phase errors.

Independent of biases.

Can give misleading results if the verification sample is inhomogeneous (e.g. temperature correlation with day and night values in one sample)

(Anomaly correlation should be used to reduce the effects of inhomogeneity)
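A small sketch of why the anomaly correlation helps with inhomogeneous samples. The data are hypothetical day/night temperatures, the function names are illustrative, and the centered form of the anomaly correlation (Pearson correlation of the anomalies) is assumed here; an uncentered variant also exists.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def anomaly_correlation(forecasts, observations, climatology):
    """Correlate departures from climatology instead of raw values."""
    fcst_anom = [f - c for f, c in zip(forecasts, climatology)]
    obs_anom = [o - c for o, c in zip(observations, climatology)]
    return pearson(fcst_anom, obs_anom)

# Alternating night/day values: the diurnal cycle dominates the raw correlation
fcst = [6.0, 14.0, 5.0, 17.0]
obs = [5.0, 15.0, 6.0, 16.0]
clim = [5.5, 15.5, 5.5, 15.5]

raw = pearson(fcst, obs)
acc = anomaly_correlation(fcst, obs, clim)
biased = pearson([f + 10.0 for f in fcst], obs)  # a constant bias changes nothing
print(raw, acc, biased)
```

The raw correlation is close to 1 simply because both series follow the diurnal cycle, while the anomaly correlation is clearly lower.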



Example: deterministic TEMP verification of geopotential over Europe


Categorical Forecasts

                         Observed
                 yes             no                   Total
Forecast  yes    hits            false alarms         forecast yes
          no     misses          correct negatives    forecast no
Total            observed yes    observed no          total

• Used for binary data
• If data is not binary, decide on a threshold (e.g. precipitation > 10 mm/h) and make your data binary
• Sum up the cases falling into each entry of the contingency table


Categorical Forecasts

!!CAUTION!!

Famous example: Tornado forecast verification

Collection of tornado forecasts (yes/no) and outcomes

                         Observed
                 yes     no      Total
Forecast  yes    28      72      100
          no     23      2680    2703
Total            51      2752    2803

Accuracy = (28+2680)/2803 = 96.6% (published in Americ. Meteorol. Journal 1884)

If no tornados were forecast at all: Accuracy = (0+2752)/2803 = 98.2%

It is advisable to use more measures than just accuracy…


Categorical Forecasts

fraction of correct forecasts (best=1)

under or over forecasting (best=1)

correctly forecast events (best=1)

wrongly forecast events (best=0)

Can forecast separate yes from no events (best=1)

How much more often is an event forecast correctly than incorrectly (best=Inf)

Many more exist, see http://www.cawcr.gov.au/projects/verification/#Methods_for_dichotomous_forecasts
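As a sketch, the descriptions above can be matched to commonly used dichotomous scores and evaluated for the tornado contingency table. The score names used here (accuracy, frequency bias, POD, false alarm ratio, Hanssen-Kuipers score, odds ratio) are an assumed mapping, since the formulas themselves are not preserved in this transcript; the function names are hypothetical.

```python
def _ratio(numerator, denominator):
    """Divide, returning NaN where a score is undefined (zero denominator)."""
    return numerator / denominator if denominator else float("nan")

def dichotomous_scores(hits, false_alarms, misses, correct_negatives):
    """Scores from the four entries of a 2x2 contingency table."""
    total = hits + false_alarms + misses + correct_negatives
    pod = _ratio(hits, hits + misses)
    pofd = _ratio(false_alarms, false_alarms + correct_negatives)
    return {
        "accuracy": _ratio(hits + correct_negatives, total),            # best=1
        "frequency_bias": _ratio(hits + false_alarms, hits + misses),   # best=1
        "pod": pod,                                                     # best=1
        "far": _ratio(false_alarms, hits + false_alarms),               # best=0
        "hanssen_kuipers": pod - pofd,                                  # best=1
        "odds_ratio": _ratio(hits * correct_negatives,
                             false_alarms * misses),                    # best=Inf
    }

tornado = dichotomous_scores(28, 72, 23, 2680)
never = dichotomous_scores(0, 0, 51, 2752)   # "never forecast a tornado"
print(tornado)
print(never)
```

The "never forecast" strategy wins on accuracy but has a POD and a Hanssen-Kuipers score of zero, which is the point of looking at more than one measure.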


Multi-categorical Forecasts

                          Observed category
            i,j      1           2           ...    K           Total
            1        n(F1,O1)    n(F1,O2)    ...    n(F1,OK)    N(F1)
Forecast    2        n(F2,O1)    n(F2,O2)    ...    n(F2,OK)    N(F2)
category    ...      ...         ...         ...    ...         ...
            K        n(FK,O1)    n(FK,O2)    ...    n(FK,OK)    N(FK)
            Total    N(O1)       N(O2)       ...    N(OK)       N


Ensemble Forecasts (taking into account all members)

• An EPS provides a range of forecasts

• By comparing just a single EPS forecast to an observation, nothing can be said about the forecast quality!

• Even an observation outside the EPS range does not mean the forecast is wrong

• Evaluating an EPS requires the collection of many cases

• This allows inference on the statistical correctness of the forecast distribution given by the EPS


Ensemble Forecasts

Reliability: measures average agreement between forecasts and observations

Resolution: measures ability to discriminate different events (figure labels: Event 1, Event 2)


Ensemble Forecasts

Talagrand diagram (a.k.a. rank histogram)

• Count the number of cases in which an observation falls in each of the bins given by the ensemble forecast (e.g. a 20-member EPS has 21 bins)

• As each member should be equally likely, the Talagrand diagram should be flat (a necessary, not sufficient, criterion for reliability)

• Tails in the diagram indicate overall biases
• Peaks on the left and right indicate too little spread
• Hill shape indicates too much spread
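The counting step can be sketched as follows. The function name and sample values are hypothetical, and ties between members and the observation are not treated specially here; for variables with many identical values (e.g. zero precipitation) ties would need random assignment, which this sketch omits.

```python
def rank_histogram(ensembles, observations):
    """Count how often the observation falls into each of the n+1 bins
    delimited by the n sorted ensemble members."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensembles, observations):
        # The rank equals the number of members below the observation
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts

# Three hypothetical cases of a 3-member ensemble (4 bins)
ensembles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.0, 1.0, 5.0]]
observations = [0.5, 7.0, 3.0]
histogram = rank_histogram(ensembles, observations)
print(histogram)
```

With many cases, a roughly flat list of counts is what a reliable ensemble would produce.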


Example: COSMO-DE-EPS, hourly precipitation, summer 12


Ensemble Forecasts

Spread/Skill behavior

• The spread (width) of an ensemble forecast should be related to the uncertainty of the forecast

• It is desirable to have growing spread when the forecast error grows
• A common measure for spread is the average of the standard deviation over a set of ensemble forecasts
• A common measure for skill is the RMSE of the ensemble mean
• In case of no bias, spread and skill should give the same value
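A minimal sketch of the two measures named above: spread as the average ensemble standard deviation and skill as the RMSE of the ensemble mean. The function name and data are hypothetical, and the sample standard deviation (divisor n-1) is an assumption of this sketch.

```python
import math

def spread_and_skill(ensembles, observations):
    """Return (spread, skill) over a set of ensemble forecasts."""
    stdevs, squared_errors = [], []
    for members, obs in zip(ensembles, observations):
        n = len(members)
        mean = sum(members) / n
        variance = sum((m - mean) ** 2 for m in members) / (n - 1)
        stdevs.append(math.sqrt(variance))
        squared_errors.append((mean - obs) ** 2)
    spread = sum(stdevs) / len(stdevs)                           # mean std dev
    skill = math.sqrt(sum(squared_errors) / len(squared_errors)) # RMSE of mean
    return spread, skill

# Two hypothetical cases of a 3-member ensemble
ensembles = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
observations = [2.5, 5.0]
spread, skill = spread_and_skill(ensembles, observations)
print(spread, skill)
```

A spread much larger than the RMSE, as in this toy sample, would point to an overdispersive ensemble; a realistic comparison needs many more cases.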



Figure: RMSE (skill) and STDEV (spread), VMAX_10M, July 2012


Ensemble Forecasts

CRPS (continuous ranked probability score)

• Like MSE for ensemble predictions (reduces to the MAE for a deterministic forecast)
• Observation and forecast are expressed as cumulative distribution functions and the integrated squared difference between the two is calculated
• Can be decomposed into reliability, resolution and uncertainty components
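For a finite ensemble, the integrated squared difference between the empirical forecast CDF and the step function of the observation can be computed without explicit integration. The sketch below uses the equivalent "mean absolute error minus half the mean member-to-member distance" form; the function name and numbers are hypothetical.

```python
def ensemble_crps(members, observation):
    """CRPS of one ensemble forecast, using the empirical CDF of the members."""
    n = len(members)
    # Mean absolute distance between members and the observation
    error_term = sum(abs(m - observation) for m in members) / n
    # Mean absolute distance between all pairs of members
    spread_term = sum(abs(a - b) for a in members for b in members) / (n * n)
    return error_term - 0.5 * spread_term

crps_ens = ensemble_crps([1.0, 2.0, 3.0], 2.0)
crps_det = ensemble_crps([3.0], 5.0)   # one member: plain absolute error
print(crps_ens, crps_det)
```

In practice the score is averaged over many forecast cases; lower values are better and 0 is perfect.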


Ensemble Forecasts

Other measures

• Check the number of outliers, i.e. observations falling outside the ensemble range; a fraction of 2/(n+1) is expected for a perfect EPS with n members!

• The ensemble mean is often verified in a deterministic manner

• Each member can be verified in a deterministic manner


Reliability diagram

• Visualization of conditional (on forecast probability) biases
• Decide on a threshold (e.g. a temperature value in °C), convert forecasts to probabilities of exceeding this threshold and convert observations to binary according to the threshold
• Plot the observed frequency for each forecast probability class
• Binning requires a lot of data
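The binning step behind a reliability diagram can be sketched as below. Forecast probabilities and binary outcomes are assumed to be available already; the function name, the equal-width bins and the sample data are choices made for this illustration only.

```python
def reliability_curve(probabilities, outcomes, n_bins=5):
    """Return (mean forecast probability, observed frequency, count)
    for every non-empty probability bin."""
    prob_sums = [0.0] * n_bins
    event_counts = [0] * n_bins
    case_counts = [0] * n_bins
    for p, o in zip(probabilities, outcomes):
        k = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes into the last bin
        prob_sums[k] += p
        event_counts[k] += o
        case_counts[k] += 1
    return [
        (prob_sums[k] / case_counts[k], event_counts[k] / case_counts[k], case_counts[k])
        for k in range(n_bins)
        if case_counts[k] > 0
    ]

# Hypothetical forecast probabilities and observed events (1 = event occurred)
probabilities = [0.1, 0.1, 0.9, 0.9, 0.5]
outcomes = [0, 0, 1, 1, 1]
curve = reliability_curve(probabilities, outcomes)
print(curve)
```

Points on the diagonal (observed frequency equal to forecast probability) indicate a reliable forecast; the case counts show how well each bin is populated.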

Probabilistic Forecasts (transform ensemble to probability)

Figure annotations: overconfident forecast (not enough spread); biased forecast


Brier Score

• Like MSE for probabilistic forecasts (magnitude of error between forecast probability [0%-100%] and observed probability [0% or 100%])

• Decide on a threshold (e.g. a temperature value in °C), convert forecasts to probabilities of exceeding this threshold and convert observations to binary according to the threshold

• Can be decomposed into resolution, reliability and uncertainty components
• Perfect score = 0
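The steps above (threshold, ensemble to probability, observation to binary, mean squared difference) can be sketched as follows. Function names, the threshold and the data are hypothetical, and the probability is taken simply as the fraction of members at or above the threshold.

```python
def exceedance_probability(members, threshold):
    """Fraction of ensemble members at or above the threshold."""
    return sum(1 for m in members if m >= threshold) / len(members)

def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and binary outcome."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

threshold = 2.5
ensembles = [[1.0, 3.0, 5.0, 7.0], [0.0, 1.0, 2.0, 3.0]]
observations = [4.0, 1.0]

probabilities = [exceedance_probability(m, threshold) for m in ensembles]
outcomes = [1 if obs >= threshold else 0 for obs in observations]
bs = brier_score(probabilities, outcomes)
print(probabilities, outcomes, bs)
```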

Probabilistic Forecasts


ROC (relative operating characteristic)

• Decide on an event threshold
• Calculate contingency table entries for a set of probability thresholds
• Plot POD (hit rate) against the false alarm rate (probability of false detection)
• Perfect when area under ROC curve = 1
• Measures forecast resolution (forecasts can discriminate events)
• If the line falls under the diagonal, the forecast is worse than a random guess
• Can be used to compare with a deterministic forecast
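A sketch of the ROC construction: for each probability threshold a contingency table is counted, giving one point of hit rate against false alarm rate (probability of false detection, the usual ROC convention). Function names and data are hypothetical, the area uses the trapezoidal rule, and the sample is assumed to contain both events and non-events.

```python
def roc_points(probabilities, outcomes, thresholds):
    """Return (false alarm rate, hit rate) for each probability threshold."""
    points = []
    for t in thresholds:
        hits = misses = false_alarms = correct_negatives = 0
        for p, o in zip(probabilities, outcomes):
            warned = p >= t            # forecast "yes" at this threshold
            if warned and o:
                hits += 1
            elif warned:
                false_alarms += 1
            elif o:
                misses += 1
            else:
                correct_negatives += 1
        points.append((false_alarms / (false_alarms + correct_negatives),
                       hits / (hits + misses)))
    return points

def roc_area(points):
    """Trapezoidal area under the curve, closed with (0,0) and (1,1)."""
    curve = sorted(points + [(0.0, 0.0), (1.0, 1.0)])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

good = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0], [0.25, 0.5, 0.75])
bad = roc_points([0.9, 0.1], [0, 1], [0.5])
print(roc_area(good), roc_area(bad))
```

The first sample separates events from non-events perfectly, the second gets every case wrong and ends up below the diagonal.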




Relative (economic) value score

C = cost of taking preventive action
L = loss if no preventive action was taken

• For each possible forecast probability calculate contingency table entries
• Calculate VS for a number of cost/loss values [0-1]
• Quantifies the relative monetary value of an EPS for a decision making problem
• Considers the costs of a forecast user linked to the forecast event
• Can give an indication of the best probability on which a forecast user should base her decision
• For calibrated (unbiased) forecast systems the best probability equals C/L
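The value score formula itself is not preserved in this transcript. The sketch below assumes the common cost/loss formulation, in which mean expenses are expressed in units of the loss L and the forecast is compared with the cheaper of "always protect" and "never protect" (climatology) and with a perfect forecast. All names are hypothetical.

```python
def relative_value(hit_rate, false_alarm_rate, base_rate, cost_loss_ratio):
    """Relative economic value under an assumed cost/loss model.

    hit_rate and false_alarm_rate come from the contingency table at one
    probability threshold, base_rate is the observed event frequency,
    cost_loss_ratio is C/L. Undefined for base rates or ratios of 0 or 1.
    """
    a, s = cost_loss_ratio, base_rate
    expense_climate = min(a, s)       # always protect (a) or never protect (s)
    expense_perfect = s * a           # protect only when the event occurs
    expense_forecast = (false_alarm_rate * a * (1.0 - s)   # unnecessary protection
                        + hit_rate * a * s                 # justified protection
                        + (1.0 - hit_rate) * s)            # unprotected losses
    return (expense_climate - expense_forecast) / (expense_climate - expense_perfect)

perfect = relative_value(1.0, 0.0, base_rate=0.3, cost_loss_ratio=0.2)
always = relative_value(1.0, 1.0, base_rate=0.3, cost_loss_ratio=0.2)
never = relative_value(0.0, 0.0, base_rate=0.3, cost_loss_ratio=0.5)
print(perfect, always, never)
```

Evaluating this for every probability threshold and a range of C/L values between 0 and 1 gives the value curves described above.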


Deterministic & Probabilistic

Skill Score (reduction of error variance)

• Can be applied to any score
• Reference could e.g. be a climatological forecast, a persistence forecast or another model
• Perfect score = 1, 0 indicates no improvement over the reference
• Result is the % improvement in score compared to the reference forecast
• Ultimate answer to the question “how good is the forecast?”
• Allows comparison of scores for different events (e.g. easy and hard to predict)
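The skill score formula is not preserved in this transcript. A minimal sketch, assuming the generic form (score - reference) / (perfect - reference), which reproduces the properties listed above (1 = perfect, 0 = no improvement over the reference); the function name and numbers are hypothetical.

```python
def skill_score(score, reference_score, perfect_score=0.0):
    """Generic skill score relative to a reference forecast.

    perfect_score is the value a perfect forecast would reach
    (0 for error measures such as RMSE, Brier score or CRPS).
    """
    return (score - reference_score) / (perfect_score - reference_score)

# Hypothetical RMSE of a model and of a persistence forecast
ss = skill_score(score=2.0, reference_score=2.5)
print(ss)   # 0.2, i.e. a 20 % improvement over the reference
```

Negative values mean the forecast scores worse than the reference.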



Figure: hourly precipitation, summer 12; reference: determ. COSMO


Spatial Verification

Problems of traditional (point-to-point) verification methods

Double Penalty
• Location or timing errors in the forecast are penalized twice
• E.g. rain is forecast where no rain is observed, and no rain is forecast where it actually was observed
• Increasingly problematic with increasing resolution of forecast models

Forecast of objects

• Properties of objects like rain cells or clouds are important aspects of a forecast and are not captured by point-to-point verification



FUZZY (neighborhood)

• Decide on a set of thresholds
• Gradually smooth the forecast and/or observed fields (various smoothing functions are possible)
• Decide on a verification measure, e.g. Fractions Skill Score

• Useful if forecast and observation are available as grids (e.g. observations from rain radar)

• Shows useful scales of predictability
• Popular for comparing models with different horizontal resolution
• Reduced double penalty effect with larger scales
• Many dichotomous or probabilistic scores can be used for analysis

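$$\mathrm{FSS} = 1 - \frac{\frac{1}{N}\sum_{i=1}^{N}\left(P_{fcst,i} - P_{obs,i}\right)^{2}}{\frac{1}{N}\sum_{i=1}^{N}P_{fcst,i}^{2} + \frac{1}{N}\sum_{i=1}^{N}P_{obs,i}^{2}}$$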
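A small pure-Python sketch of the Fractions Skill Score, where P_fcst and P_obs are taken as the fraction of grid points exceeding the threshold within a square neighborhood around each grid point. The function names, the truncation of the neighborhood at the domain edge and the toy fields are assumptions of this illustration.

```python
def neighborhood_fractions(field, threshold, half_width):
    """Fraction of points >= threshold in a square window around each point."""
    rows, cols = len(field), len(field[0])
    binary = [[1 if value >= threshold else 0 for value in row] for row in field]
    fractions = []
    for r in range(rows):
        row_out = []
        for c in range(cols):
            # Window is truncated at the domain boundary
            r0, r1 = max(0, r - half_width), min(rows, r + half_width + 1)
            c0, c1 = max(0, c - half_width), min(cols, c + half_width + 1)
            window = [binary[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            row_out.append(sum(window) / len(window))
        fractions.append(row_out)
    return fractions

def fss(forecast, observed, threshold, half_width):
    """Fractions Skill Score; the common factor 1/N cancels in the ratio."""
    p_fcst = neighborhood_fractions(forecast, threshold, half_width)
    p_obs = neighborhood_fractions(observed, threshold, half_width)
    pairs = [(f, o) for rf, ro in zip(p_fcst, p_obs) for f, o in zip(rf, ro)]
    numerator = sum((f - o) ** 2 for f, o in pairs)
    denominator = sum(f * f for f, _ in pairs) + sum(o * o for _, o in pairs)
    return 1.0 - numerator / denominator if denominator > 0 else float("nan")

# A rain cell forecast one grid point west of where it was observed
forecast = [[0, 0, 0, 0], [0, 5, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
observed = [[0, 0, 0, 0], [0, 0, 5, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
scores = [fss(forecast, observed, threshold=1, half_width=w) for w in (0, 1, 3)]
print(scores)
```

At grid scale the displaced cell is penalized twice and the score is 0; with growing neighborhood size the score rises towards 1, which illustrates the reduced double penalty at larger scales.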



Object-based

• Extract objects from forecast/observed fields (e.g. rain cells, clouds,…)
• Make statistics on properties of those objects (e.g. size, location, magnitude,…)
• Some object based methods use object matching and tracking in time
• Object based methods try to mimic the forecast user's perception of the forecast quality
• Double penalty effects can be avoided


Take special care of

• Calculating scores is easy, preparing data for the verification task is usually most time consuming

• Data quality (eliminate erroneous measurements)

• Use an appropriate score (e.g. no RMSE to verify precipitation)

• Use a homogeneous data set (stratify as much as possible)

• Stratify by observations or external factors (rather than forecast values)

• Use as much data as possible (aggregate if possible)

• Try to implement error bars (and avoid dependent observations)

• Keep in mind that your observations and verification are usually imperfect (don’t expect perfect results)

III Final Remarks


Verification Guideline

o Who is the user? (forecaster, developer, administrative, decision maker,…)

o What forecast aspects are relevant to this user? (parameter, domain, warnings, scenarios,…)

o What observations are available? (point, gridded, analysis, quality,…)

o What methods are possible with the given data? (deterministic, ensemble, probabilistic, fuzzy,…)

o What score(s) give(s) the right information? (bias, association, skill scores, economic value,…)

o What is an appropriate reference to compare the forecast to? (other model, climatology, persistence, randomness)

o How should the result be visualized best? (text, case studies, graphs, error bars,…)

o Look at your data and don’t rely on scores only! (make scatterplots and look at individual forecasts)


Active fields of research

• Use of observation uncertainty in verification

• Spatial verification methods applied to ensemble prediction systems

• Accounting for timing errors in verification (avoiding double-penalty effects in time)

• Verifying forecast scenarios

• Verification of extreme events

• Multivariate verification


Further reading

Collection of methods & scores (lots of further links)
http://www.cawcr.gov.au/projects/verification

Short Introduction
http://www.dwd.de/DE/forschung/wettervorhersage/num_modellierung/05_verifikation/verifikation_node.html

Collection of monitoring & verification products (only at DWD)
http://oflxs04.dwd.de/~mkoehler/plot-catalog/index.php

WV verification report (only at DWD)
http://intranet.res.bund.de/downloads/VB52_Gesamtdokument.pdf

Books/Papers
• Forecast Verification (Jolliffe & Stephenson)
• Fuzzy Methods Review (Ebert)
• WMO guidelines and meetings

Tools
• R and the packages “verification” and “SpatialVx”
• R package Rfdbk for using feedback files
• COSMO Common Verification VERSUS
• COSMO Spatial Verification VAST
