[email protected] 15.4.2005 NOMEK - Verification Training - OSLO / 1 Pertti Nurmi ( Finnish...
-
Upload
mariah-thomas -
Category
Documents
-
view
226 -
download
0
Transcript of [email protected] 15.4.2005 NOMEK - Verification Training - OSLO / 1 Pertti Nurmi ( Finnish...
15.4.2005 NOMEK - Verification Training - OSLO / [email protected]
Pertti Nurmi (Finnish Meteorological Institute)
General Guide to Forecast Verification (Methodology)
NOMEK - Oslo, 15.-16.4.2005
[Figure: time series 1980-2000 of verification scores for Tmax D+2, Tmean D+1-5 and Tmean D+6-10]
[Figure: "T2m; ME & MAE; ECMWF & LAM; Average over 30 stations; Winter 2003" - curves MAE_ECMWF, MAE_LAM, ME_ECMWF, ME_LAM; y-axis (C), x-axis (hrs)]
A glimpse at verification history: USA tornadoes, 1884 (the Finlay case)

( 2680 + 30 ) / 2800 = 96.8 %

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                 30     70     100
No                  20   2680    2700
obs Σ               50   2750    2800
A glimpse at history from the Wild West, cont'd...

( 2750 + 0 ) / 2800 = 98.2 %   ( vs. 96.8 % )

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                 30     70     100
No                  20   2680    2700
obs Σ               50   2750    2800

"Never forecast a tornado":

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                  0      0       0
No                  50   2750    2800
obs Σ               50   2750    2800
Another interpretation:

Back to the original results ( PC = 96.8 % ):

POD = 30 / 50 = 60 %     Probability Of Detection
FAR = 70 / 100 = 70 %    False Alarm Ratio
B = FBI = 100 / 50 = 2   (Frequency) Bias

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                 30     70     100
No                  20   2680    2700
obs Σ               50   2750    2800

"Never forecast a tornado" ( PC = 98.2 % ):

POD = FAR = B = 0 % !

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                  0      0       0
No                  50   2750    2800
obs Σ               50   2750    2800
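As a minimal sketch (not part of the original lecture), the measures above can be computed directly from the 2*2 table; `scores_2x2` is a hypothetical helper name:

```python
def scores_2x2(a, b, c, d):
    """Basic measures from a 2x2 contingency table:
    a = hits, b = false alarms, c = misses, d = correct rejections."""
    n = a + b + c + d
    pc = (a + d) / n                       # Proportion Correct
    pod = a / (a + c) if a + c else 0.0    # Probability Of Detection
    far = b / (a + b) if a + b else 0.0    # False Alarm Ratio
    bias = (a + b) / (a + c)               # (Frequency) Bias
    return pc, pod, far, bias

# Finlay's original tornado forecasts
pc, pod, far, bias = scores_2x2(30, 70, 20, 2680)
# "Never forecast a tornado"
pc0, pod0, far0, bias0 = scores_2x2(0, 0, 50, 2750)
```

Running this reproduces the slide values: PC rises from 96.8 % to 98.2 % while POD, FAR and B all collapse to zero.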
First reminder on verification:
• An act ("art"?) of countless methods and measures
• An essential daily real-time practice in the operational forecasting environment
• An active feedback and dialogue process is a necessity
• A fundamental means to improve weather forecasts and services
Outline:
1. Introduction - History
2. Goals and general guidelines
3. Continuous variables
4. Categorical events
   • Binary (dichotomous; yes/no) forecasts
   • Multi-category forecasts
5. Probability forecasts
6. Forecast value (NOT covered in these lectures)
References
   • Literature
   • Websites
…You heard it already
Acknowledgement: Laurie Wilson (CMC, Canada)
<= Break?
Outline:
1. Introduction - History
2. Goals and general guidelines
Goals of *THIS* Training:
• Understand the basic properties of, and relationships among, common verification measures
• Learn to extract useful information from (graphical) verification results
• Increase interest in forecast verification and its methods - apply them in everyday forecasting practice
• Emphasis is on verification of weather elements rather than, e.g., NWP fields
Goals of (objective) Verification:
• "Administrative"
  - Feedback process to operational forecasters => Example follows !
  - Monitor the quality of forecasts and potential trends in quality
  - Justify the cost of provision of weather services
  - Justify acquisition of additional or new models, equipment, …
• "Scientific"
  - Identify strengths and weaknesses of a forecast product, leading to improvements, i.e. provide information to direct R&D
• "Value" (NOT covered explicitly in these lectures)
  - Determine the (economic) value of the forecasts to users
  - Requires quantitative information on users' economic sensitivity to weather
Personal scoring (example) ...
[Figure: personal scoring example - panels A, B, C]
Principles of (objective) Verification:
• Verification activity has value only if the information generated leads to a decision about the forecast itself, or about the forecast system being verified
  - The user of the information must be identified
  - The purpose of the verification must be known in advance
• No single verification measure can provide complete information about forecast quality
• Forecasts should be formulated in a verifiable form
Operational Verification - "State-of-the-Art"
• Comprehensive comparison of forecast(er)s vs. observations
• Stratification and aggregation (pooling) of results
• Statistics of guidance forecasts (e.g. NWP, MOS)
• Instant feedback to forecasters
• Statistics of individual forecasters - e.g. personal biases
• Comprehensive set of tailored verification measures
• Simplified measures for laymen
• Continuity into history
Allan Murphy's (RIP) "Goodness":
• Consistency: Forecasts agree with the forecaster's true belief about the future weather [ strictly proper ]; cf. hedging
• Quality: Correspondence between observations and forecasts [ verification ]
• Value: Increase or decrease in economic or other kind of value to someone as a result of using the forecast [ decision theory ]
Verification Procedure… Define predictand types:
• Continuous: Forecast is a specific value of the variable
  - Temperature; fixed time (e.g. noon), Tmin, Tmax, time-averaged (e.g. 5-day)
  - Wind speed and direction; fixed time, time-averaged
• Categorical - Probabilistic: Forecast is the probability of occurrence of ranges of values of the variable (categories)
  - Precipitation (vs. no precipitation) - POP; with various rainfall thresholds
  - Precipitation type
  - Cloud amount
  - Strong winds (vs. no strong wind); with various wind force thresholds
  - Night frost (vs. no frost)
  - Fog (vs. no fog)
Verification Procedure, cont'd…
• Define the purpose of the verification
  - Scientific vs. administrative
  - Define the questions to be answered
• Assemble the dataset of matched observation and forecast pairs
• Dataset stratification (from "pooled" data)
  - "External" stratification by time of day, season, forecast lead time, etc.
  - "Internal" stratification, e.g. to separate extreme events
    · According to forecast
    · According to observation
  - Maintain a sufficient sample size
Outline:
1. Introduction - History
2. Goals and general guidelines
3. Continuous variables
3. Continuous Variables: First explore the data
• Scatterplots of forecasts vs. observations
  - Visual relation between forecast and observed distributions
  - Distinguish outliers in forecast and/or observation datasets
  - Accurate forecasts have points on the 45-degree diagonal
• Additional scatterplots
  - Observations vs. [ forecast - observation ] difference
  - Forecasts vs. [ forecast - observation ] difference
  - Behaviour of forecast errors with respect to the observed or forecast distributions - potential clustering or curvature in their relationships
• Time series plots of forecasts vs. observations (or forecast error)
  - Potential outliers in either forecast or observation datasets
  - Trends and time-dependent relationships
• Neither scatterplots nor time series plots provide any concrete measures of accuracy
Continuous Variables - Example 1; Exploring the data
[Figure] Scatterplot of one year of ECMWF three-day T2m forecasts (left) and forecast errors (right) versus observations at a single location. Red, yellow and green dots separate the errors into three categories.
Mean Error aka Bias

ME = ( 1/n ) Σ ( f_i - o_i )          Range: -∞ to ∞; Perfect score = 0

- Average error in a given set of forecasts
- Simple and informative score on the behaviour of a given weather element
- With ME > 0 ( < 0 ), the system exhibits over- (under-) forecasting
- Not an accuracy measure; does not provide information on the magnitude of errors
- Should be viewed in comparison to climatology

Mean Absolute Error

MAE = ( 1/n ) Σ | f_i - o_i |         Range: 0 to ∞; Perfect score = 0

- Average magnitude of errors in a given set of forecasts
- Linear measure of accuracy
- Does not distinguish between positive and negative forecast errors
- Negatively oriented, i.e. smaller is better
- Illustrative => recommended to view ME and MAE simultaneously => Examples follow !
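The two definitions above can be sketched side by side; this is a minimal illustration with hypothetical forecast/observation data, not material from the lecture:

```python
def me(fc, obs):
    """Mean Error (bias): average of forecast-minus-observation differences."""
    return sum(f - o for f, o in zip(fc, obs)) / len(fc)

def mae(fc, obs):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(f - o) for f, o in zip(fc, obs)) / len(fc)

# Hypothetical T2m forecasts and observations (deg C); errors are +1, +1, -1, +1
fc  = [2.0, -1.0, 0.5, 3.0]
obs = [1.0, -2.0, 1.5, 2.0]
```

Here ME = 0.5 while MAE = 1.0: the positive and negative errors partly cancel in ME, which is exactly why the lecture recommends viewing both together.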
Mean Squared Error

MSE = ( 1/n ) Σ ( f_i - o_i )²        Range: 0 to ∞; Perfect score = 0

or its square root, RMSE, which has the same unit as the forecast parameter

- Negatively oriented, i.e. smaller is better
- A quadratic scoring rule; very sensitive to large forecast errors !!!
  · Harmful in the presence of potential outliers in the dataset
  · Care must be taken with limited datasets
  · Fear of high penalties easily leads to conservative forecasting
- RMSE is always >= MAE
- Comparison of MAE and RMSE indicates the error variance
- The MSE - RMSE decomposition is not dealt with here: acknowledge Anders Persson (yesterday)
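A short sketch (hypothetical data, not from the lecture) of the outlier sensitivity and of the RMSE >= MAE property:

```python
import math

def mse(fc, obs):
    """Mean Squared Error."""
    return sum((f - o) ** 2 for f, o in zip(fc, obs)) / len(fc)

# Hypothetical errors: a single 5-degree outlier dominates the quadratic score
fc  = [1.0, 0.0, 0.0, 5.0]
obs = [0.0, 0.0, 0.0, 0.0]
rmse = math.sqrt(mse(fc, obs))                               # sqrt(26/4) ~ 2.55
mae_ = sum(abs(f - o) for f, o in zip(fc, obs)) / len(fc)    # 6/4 = 1.5
assert rmse >= mae_   # holds for any dataset
```

The gap between RMSE (2.55) and MAE (1.5) signals a large error variance, i.e. the presence of occasional big misses.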
Continuous Variables - Example 1, cont'd…
[Figure] Scatterplot of one year of ECMWF three-day T2m forecasts (left) and forecast errors (right) versus observations at a single location. Red, yellow and green dots separate the errors into three categories. Some basic statistics, like ME, MAE and MSE, are also shown. The plots reveal the dependence of model behaviour on temperature range, i.e. over- (under-) forecasting in the cold (warm) tails of the distribution.
Continuous Variables – Example 2
[Figure, left: "T2m; ME & MAE; ECMWF & LAM; Average over 30 stations; Winter 2003" - curves MAE_ECMWF, MAE_LAM, ME_ECMWF, ME_LAM; y-axis (C), x-axis (hrs)]
[Figure, right: "T2m; ME & MAE; ECMWF & PPP; Average over 30 stations; Winter 2003" - curves MAE_ECMWF, MAE_PPP, ME_ECMWF, ME_PPP; y-axis (C), x-axis (hrs)]
Temperature bias and MAE comparison between ECMWF and a Limited Area Model (LAM) (left), and an experimental post-processing scheme (PPP) (right), aggregated over 30 stations and one winter season. In spite of the ECMWF warm bias and diurnal cycle, it has a slightly lower MAE level than the LAM (left). The applied experimental "perfect prog" scheme does not manage to dispose of the model bias and exhibits larger absolute errors than the originating model - this example clearly demonstrates the importance of thorough verification prior to implementing a potential post-processing scheme in operational use.
Continuous Variables: Aggregation (pooling) vs. Stratification
[Figure: MOS vs. EP MAE - aggregate of 6 months (Jan – June), 3 lead times (+12, +24, +48 hr), 4 stations in Finland]
Continuous Variables: Aggregation (pooling) vs. Stratification
[Figures: MAE stratified by lead time, by month, and by station]
Continuous Variables: Aggregation (pooling) vs. Stratification
[Figure: MOS vs. EP Bias - aggregate of 6 months (Jan – June), 3 lead times (+12, +24, +48 hr), 4 stations in Finland]
Continuous Variables: Aggregation (pooling) vs. Stratification
[Figures: Bias stratified by lead time, by month, and by station]
General Skill Score

SS = ( A - A_ref ) / ( A_perf - A_ref )

where A is the applied measure of accuracy, subscript "ref" refers to some reference forecast, and "perf" to a perfect forecast.

For negatively oriented accuracy measures like MAE or MSE ( A_perf = 0 ):

SS = [ 1 - A / A_ref ] * 100          Range: -∞ to 100; Perfect score = 100

i.e. the relative accuracy, as the % improvement over a reference system

- Reference is typically climatology or persistence => Apply both; Examples follow !
- If negative, the reference (climatology or persistence) is better

MAE_SS = [ 1 - MAE / MAE_ref ] * 100
MSE_SS = [ 1 - MSE / MSE_ref ] * 100

- The latter is also known as Reduction of Variance, RV
- SS can be unstable for small sample sizes, especially MSE_SS
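A one-function sketch of the general skill score (hypothetical MAE values, not lecture data):

```python
def skill_score(a, a_ref, a_perf=0.0):
    """General skill score, in %: improvement of accuracy measure A over a
    reference. For negatively oriented measures (MAE, MSE) A_perf = 0, so
    this reduces to (1 - A / A_ref) * 100."""
    return (a - a_ref) / (a_perf - a_ref) * 100

# Hypothetical MAEs: forecast system 1.2 C, climatology reference 2.0 C
mae_ss = skill_score(1.2, 2.0)   # 40 % improvement over climatology
```

A forecast worse than the reference (e.g. MAE 2.5 C vs. 2.0 C) would give a negative score, matching the interpretation on the slide.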
Continuous Variables – Example 3
[Figure, left: "T2m; MAE; Average over 3 stations & forecast ranges +12-120 hrs" - MAE (C) of the End Product and the "Better of ECMWF / LAM", per season from Winter 2001 to Winter 2003, plus time-average]
[Figure, right: "T2m; Skill of End Product over 'Better of ECMWF / LAM'" - Skill (%) per season, plus time-average]
Mean Absolute Errors of End Product and DMO temperature forecasts (left), and Skill of the End Products over model output (right). The better of either ECMWF or the local LAM is chosen up to the +48-hour forecast range (hindcast); thereafter ECMWF is used. The figure is an example of both aggregation (3 stations, several forecast ranges, two models, time-average) and stratification (seasons).
Linear Error in Probability Space

LEPS = ( 1/n ) Σ | CDFo( f_i ) - CDFo( o_i ) |      Range: 0 to 1; Perfect score = 0

where CDFo is the cumulative distribution function of the observations, determined from a relevant climatology.

- Corresponds to MAE transformed from measurement space into probability space
- Does not depend on the scale of the variable
- Takes into account the variability of the weather element
- Can be used to evaluate forecasts at different locations
- Computation requires definition of cumulative climatological distributions at each location
- Encourages forecasting in the extreme tails of the climate distributions: errors there are penalized less than similar-sized errors in a more probable region of the distribution, i.e. the opposite of MSE => Examples follow !

Skill Score

LEPS_SS = [ 1 - LEPS / LEPS_ref ] * 100             Range: -∞ to 100; Perfect score = 100
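The transformation into probability space can be sketched with an empirical CDF; the climatological sample below is hypothetical (a peaked distribution centred near 15 "units", loosely in the spirit of the following slide):

```python
def empirical_cdf(sample):
    """CDF of the observations, estimated from a climatological sample."""
    xs = sorted(sample)
    n = len(xs)
    def cdf(v):
        return sum(1 for x in xs if x <= v) / n   # fraction of values <= v
    return cdf

def leps(fc, obs, cdf):
    """Linear Error in Probability Space: MAE after mapping both forecasts
    and observations through the climatological CDF."""
    return sum(abs(cdf(f) - cdf(o)) for f, o in zip(fc, obs)) / len(fc)

# Hypothetical, peaked climatology
clim = [10, 12, 13, 14, 14, 15, 15, 15, 16, 16, 17, 18, 20, 23, 28]
cdf = empirical_cdf(clim)
mid_error  = leps([13], [15], cdf)   # 2-unit error near the median
tail_error = leps([21], [23], cdf)   # same-sized error in the tail
```

The same 2-unit measurement-space error costs far more near the median (~0.33) than in the tail (~0.07), which is exactly the behaviour described above.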
[Figure: "Hypothetical Climatological Distribution" (left) and "Hypothetical Cumulative Density Function" (right), values 1-29]
LEPS for a hypothetical distribution and location: The climatological frequency distribution (left) is transformed into a cumulative probability distribution (right). A 2-"unit" forecast error around the median, 13 vs. 15 "units" (red arrows), would yield a LEPS value of c. 0.2 in probability space ( | 0.5 - 0.3 |, red arrows). An equal error in measurement space close to the tail of the distribution, 21 vs. 23 "units" (blue arrows), would result in a LEPS value of c. 0.05 ( | 0.95 - 0.9 |, blue arrows) => Forecast errors of rare events are much less penalized using LEPS !
Skill comparison (example A) ... [Figure]
Skill comparison (example B) ... [Figure]
Skill comparison (example C) ... [Figure]
Continuous variables - Summary:
• Verify a comprehensive set of local weather elements
• Produce scatterplots & time series plots, including forecasts and/or observations against their difference
• "Stratify & Aggregate" + compute ME, MAE, MAE_SS
• Additionally, compute LEPS, LEPS_SS, MSE, MSE_SS
Examples 1 - 4 in the General Guide to Verification (NOMEK Training)
Examples:
- Temperature: fixed time (e.g. noon, midnight), Tmin, Tmax, time-averaged (e.g. 5-day)
- Wind speed and direction: fixed time, time-averaged
- Accumulated precipitation: time-integrated (e.g. 6, 12, 24 hours)
- Cloudiness: fixed time, time-averaged; however, typically categorized
Outline:
1. Introduction - History
2. Goals and general guidelines
3. Continuous variables
4. Categorical events
   • Binary (dichotomous; yes/no) forecasts
   • Multi-category forecasts
4. Categorical Events

               Event observed
Event forecast   Yes       No                Marginal total
Yes              Hit       False alarm       Fc Yes
No               Miss      Corr. rejection   Fc No
Marginal total   Obs Yes   Obs No            Sum total

               Event observed
Event forecast   Yes     No      Marginal total
Yes              a       b       a + b
No               c       d       c + d
Marginal total   a + c   b + d   a + b + c + d = n

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                 30     70     100
No                  20   2680    2700
obs Σ               50   2750    2800
Bias aka Frequency Bias Index

B = FBI = ( a + b ) / ( a + c )   [ ~ Fc Yes / Obs Yes ]      Range: 0 to ∞; Perfect score = 1

- With B > 1, the system exhibits over-forecasting
- With B < 1, the system exhibits under-forecasting

Proportion Correct

PC = ( a + d ) / n   [ ~ ( Hits + Correct rejections ) / Sum total ]      Range: 0 to 1; Perfect score = 1

- The simplest and most intuitive performance measure
- Usually very misleading, because it rewards correct "Yes" and "No" forecasts equally
- Can be maximized by forecasting the most common category all the time
- Strongly influenced by the more common category
- Never use for extreme event verification !!!

Tornado example: B = 2.00, PC = 0.97
Probability Of Detection, Hit Rate ( H ), Prefigurance

POD = a / ( a + c )   [ ~ Hits / Obs Yes ]      Range: 0 to 1; Perfect score = 1

- Sensitive to misses only, not to false alarms
- Can be artificially improved by over-forecasting (rare events)
- Complement score: Miss Rate, MR = 1 - H = c / ( a + c )
- Must be examined together with …

False Alarm Ratio

FAR = b / ( a + b )   [ ~ False alarms / Fc Yes ]      Range: 0 to 1; Perfect score = 0

- Sensitive to false alarms only, not to misses
- Can be artificially improved by under-forecasting (rare events)
- An increase of POD can be achieved by increasing FAR, and vice versa

Tornado example: POD = 0.60, FAR = 0.70
Post Agreement

PAG = a / ( a + b )   [ ~ Hits / Fc Yes ]      Range: 0 to 1; Perfect score = 1

- Complement of FAR (i.e. PAG = 1 - FAR)
- Sensitive to false alarms, not to misses

False Alarm Rate, Probability Of False Detection ( POFD )

F = b / ( b + d )   [ ~ False alarms / Obs No ]      Range: 0 to 1; Perfect score = 0

- False alarms, given that the event did not occur (Obs No)
- Sensitive to false alarms only, not to misses
- Can be artificially improved by under-forecasting (rare events) - cf. the Tornado case
- Generally used with POD (or H) to produce the ROC score for probability forecasts; otherwise rarely used

Tornado example: PAG = 0.30, F = 0.03
Hanssen & Kuipers' Skill Score, True Skill Statistic

KSS = TSS = POD - F = ( ad - bc ) / [ ( a + c )( b + d ) ]      Range: -1 to 1; Perfect score = 1; No-skill level = 0

- Popular skill score combining POD and F
- Measures the ability to separate "Yes" cases (POD) from "No" cases (F)
- For rare events the d cell is large => F is small => KSS is close to POD

Threat Score, Critical Success Index

TS = CSI = a / ( a + b + c )      Range: 0 to 1; Perfect score = 1; No-skill level = 0

- Simple, popular measure for rare events; sensitive to hits, false alarms and misses
- Measures the forecast after removing correct (simple) "No" forecasts from consideration
- Sensitive to the climatological frequency of the event
- More balanced than POD or FAR

Tornado example: KSS = 0.57, TS = 0.25
Equitable Threat Score

ETS = ( a - a_r ) / ( a + b + c - a_r ) , where a_r = ( a + b )( a + c ) / n      Range: -1/3 to 1; Perfect score = 1; No-skill level = 0

- a_r is the number of hits expected from random forecasts
- The simple TS may include hits due to random chance; ETS corrects for this

Heidke Skill Score

HSS = 2 ( ad - bc ) / [ ( a + c )( c + d ) + ( a + b )( b + d ) ]      Range: -∞ to 1; Perfect score = 1; No-skill level = 0

- One of the most popular skill measures for categorical forecasts
- Measures skill against random chance

Tornado example: ETS = 0.24, HSS = 0.39
Odds Ratio

OR = ad / bc      Range: 0 to ∞; Perfect score = ∞; No-skill level = 1

Measures the forecast system's odds of scoring a hit (H) as compared to making a false alarm (F):

OR = [ H / ( 1 - H ) ] / [ F / ( 1 - F ) ]

- Independent of potential biases between observations and forecasts

Transformation into a skill score, ranging from -1 to +1:

ORSS = ( ad - bc ) / ( ad + bc ) = ( OR - 1 ) / ( OR + 1 )

- Typically produces very high absolute skill values, by definition
- Practically never used in meteorological forecast verification

Tornado example: OR = 57.43, ORSS = 0.97
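All of the 2*2 measures defined in this section can be gathered into one sketch; `categorical_scores` is a hypothetical helper, applied here to the Finlay tornado table to reproduce the values quoted on the slides:

```python
def categorical_scores(a, b, c, d):
    """2x2 contingency-table measures from this lecture.
    a = hits, b = false alarms, c = misses, d = correct rejections."""
    n = a + b + c + d
    ar = (a + b) * (a + c) / n          # hits expected from random forecasts
    return {
        "B":    (a + b) / (a + c),      # (Frequency) Bias
        "PC":   (a + d) / n,            # Proportion Correct
        "POD":  a / (a + c),            # Probability Of Detection
        "FAR":  b / (a + b),            # False Alarm Ratio
        "PAG":  a / (a + b),            # Post Agreement
        "F":    b / (b + d),            # False Alarm Rate (POFD)
        "KSS":  a / (a + c) - b / (b + d),   # POD - F
        "TS":   a / (a + b + c),        # Threat Score (CSI)
        "ETS":  (a - ar) / (a + b + c - ar),
        "HSS":  2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d)),
        "OR":   a * d / (b * c),
        "ORSS": (a * d - b * c) / (a * d + b * c),
    }

s = categorical_scores(30, 70, 20, 2680)   # the Finlay tornado table
```

Every entry matches the running tornado example: B = 2.00, PC = 0.97, POD = 0.60, FAR = 0.70, PAG = 0.30, F = 0.03, KSS = 0.57, TS = 0.25, ETS = 0.24, HSS = 0.39, OR = 57.43, ORSS = 0.97.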
Categorical Events – Example 5: Precipitation in Finland

              Rain observed
Rain forecast   Yes    No   fc Σ
Yes              52    45     97
No               22   227    249
obs Σ            74   272    346

B = 1.31    PC = 0.81    POD = 0.70    FAR = 0.46    PAG = 0.54    F = 0.17
TS = 0.44   ETS = 0.32   KSS = 0.53    HSS = 0.48    OR = 11.92    ORSS = 0.85

Contingency table of one year (with 19 missing cases) of categorical rain vs. no-rain forecasts, with the resulting statistics. Rainfall is a relatively rare event at this particular location, occurring in only c. 20 % (74/346) of the cases. Because of this, PC is quite high at 0.81. The relatively high rain detection rate (0.70) is "balanced" by a high number of false alarms (0.46), with almost every other rain forecast having been superfluous. This is also seen as biased over-forecasting of the event (B = 1.31). Due to the scarcity of the event, the false alarm rate is quite low (0.17) - used alone, this measure would give a very misleading picture of forecast quality. The Odds Ratio shows that it was 12 times more probable to make a correct (rain or no rain) forecast than an incorrect one. Note that the resulting skill score (0.85) is much higher than the other skill scores - this is a typical feature of the ORSS, due to its definition.
Multi-category Events
• Extension of the 2*2 table to several (k) mutually exclusive and exhaustive categories
  - Rain type: rain / snow / freezing rain ( k = 3 )
  - Wind warnings: strong gale / gale / no gale ( k = 3 )
• Only PC (Proportion Correct) can be directly generalized
• Other verification measures need to be converted into a series of 2*2 tables, each "forecast event" distinct from the "non-forecast event"

Generalizations of KSS and HSS - measures of improvement over random forecasts:

KSS = { Σ p( f_i , o_i ) - Σ p( f_i ) p( o_i ) } / { 1 - Σ ( p( o_i ) )² }
HSS = { Σ p( f_i , o_i ) - Σ p( f_i ) p( o_i ) } / { 1 - Σ p( f_i ) p( o_i ) }

For a 3*3 table with cells r, s, t / u, v, w / x, y, z (forecast categories f1-f3 in rows, observed categories o1-o3 in columns), each category reduces to its own 2*2 table; e.g. for f1: a = r, b = s + t, c = u + x, d = v + w + y + z.
Multi-category Events – Example 6: Cloudiness in Finland

                Clouds observed
Clouds forecast   0 - 2   3 - 5   6 - 8   fc Σ
0 - 2                65      10      21     96
3 - 5                29      17      48     94
6 - 8                18      10     128    156
obs Σ               112      37     197    346

        No clouds (0-2)   Partly cloudy (3-5)   Cloudy (6-8)
B                  0.86                  2.54           0.79
POD                0.58                  0.46           0.65
FAR                0.32                  0.82           0.18
F                  0.13                  0.25           0.19
TS                 0.45                  0.15           0.57

Overall: PC = 0.61, KSS = 0.41, HSS = 0.37

Multi-category contingency table of one year (with 19 missing cases) of cloudiness forecasts, with the resulting statistics. Results are shown for forecasts of each cloud category separately, together with the overall PC, KSS and HSS scores. The most marked feature is the very strong over-forecasting of the "partly cloudy" category, leading to numerous false alarms (B = 2.5, FAR = 0.8) and, despite this, poor detection (POD = 0.46). The forecasts cannot reflect the observed U-shaped distribution of cloudiness at all. Regardless of this inferiority, both overall skill scores are relatively high (c. 0.4), following from the fact that most of the cases (90 %) fall in either the "no clouds" or the "cloudy" category - neither of these scores takes the relative sample probabilities into account, but weights all correct forecasts equally.
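The generalized formulas can be sketched and checked against this cloudiness table; `multicat_scores` is a hypothetical helper name:

```python
def multicat_scores(table):
    """PC and the generalized KSS / HSS for a k*k contingency table
    (rows = forecast category, columns = observed category)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    fc_marg  = [sum(row) / n for row in table]                             # p(f_i)
    obs_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # p(o_i)
    diag = sum(table[i][i] for i in range(k)) / n      # Σ p(f_i, o_i)
    rand = sum(f * o for f, o in zip(fc_marg, obs_marg))   # Σ p(f_i) p(o_i)
    pc  = diag
    kss = (diag - rand) / (1 - sum(o * o for o in obs_marg))
    hss = (diag - rand) / (1 - rand)
    return pc, kss, hss

# The Cloudiness-in-Finland table from Example 6
pc, kss, hss = multicat_scores([[65, 10, 21], [29, 17, 48], [18, 10, 128]])
```

This reproduces the overall scores on the slide: PC = 0.61, KSS = 0.41, HSS = 0.37.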
Multi-category Events – Example 6, cont'd…
[Figure: hit/miss bar charts of the previous data, given the observations (left; category totals 112, 37, 197) and given the forecasts (right; totals 96, 94, 156)]
The previous data transformed into hit/miss bar charts, either given the observations (left) or given the forecasts (right). The green, yellow and red bars denote correct forecasts and one- and two-category errors, respectively. The U-shape in the observations is clearly visible (left), whereas there is no hint of it in the forecast distribution (right).
Multi-category Events
Example from Finland, again !
[Figure]
Categorical (binary, multi-category) events - Summary:
• Verify a comprehensive set of categorical local weather events
• Compile relevant contingency tables
• Include multi-category events
• Focus on adverse and/or extreme local weather
• "Stratify & Aggregate" + compute B, (PC), POD & FAR, (F), (PAG), KSS, TS, ETS, HSS
• Additionally, compute OR, ORSS, ROC
Examples 5 - 6 in the General Guide to Verification (NOMEK Training)
Examples:
- Rain (vs. no rain); with various rainfall thresholds
- Snowfall; with various thresholds
- Strong winds (vs. no strong wind); with various wind force thresholds
- Night frost (vs. no frost)
- Fog (vs. no fog)
Outline:
1. Introduction - History
2. Goals and general guidelines
3. Continuous variables
4. Categorical events
   • Binary (dichotomous; yes/no) forecasts
   • Multi-category forecasts
5. Probability forecasts
Why Probability Forecasts ?
"… the widespread practice of ignoring uncertainty when formulating and communicating forecasts represents an extreme form of inconsistency and generally results in the largest possible reductions in quality and value."
- Allan Murphy (1993)
( A sophisticated, indirect phrase to emphasize the importance of addressing uncertainty… )
Why Probability Forecasts ?
"… Go look at the weather, I believe it's gonna rain"
- Legendary Chicago blues artist Muddy Waters (early 1960s), singing "Clouds in My Heart"
( A simple, direct phrase to emphasize uncertainty in everyday life… )
Probability Forecasts
• All forecasting involves some level of uncertainty
• Deterministic forecasts cannot address the inherent uncertainty of the weather parameter or event
• Probabilities of the expected event (with values between 0 % and 100 %, or 0 and 1) take into account the underlying joint distribution { p( f, x ) } between forecasts and observations
• Conversion of probability forecasts to categorical events is simple (but not necessarily advisable) by defining an "on/off" threshold; the reverse is not straightforward
• Verification is somewhat laborious => large datasets are required to obtain any significant information
Reliability Diagram: Preparation
• Stratify probability forecasts and observations into deciles
• For each decile, record the observed frequency of the event
• Keep track of the number of cases in each decile "bin"
• Plot observed relative frequency against forecast probability on a diagram
• Plot an additional histogram of the number of forecast cases in each bin => Sharpness Diagram

FC Probability   FCs      Events   Non-Events   Obs. Relative
Bin              in Bin   in Bin   in Bin       Frequency (%)
  0                65        2        63            3
 10                23        3        20           13
 20                26        6        20           23
 30                25       10        15           40
 40                25       10        15           40
 50                20       10        10           50
 60                35       20        15           57
 70                30       20        10           67
 80                35       25       10            71
 90                12       10         2           83
100                 4        4         0          100
Total             300      120       180

[Figure: reliability diagram - observed relative frequency (%) vs. forecast probability (%)]
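The binning step above can be sketched as follows (hypothetical data; `reliability_table` is an assumed helper name):

```python
def reliability_table(probs, outcomes, bin_width=10):
    """Bin probability forecasts (in %) into deciles and tabulate, per bin,
    the number of forecasts, observed events, and observed relative frequency."""
    bins = {}
    for p, o in zip(probs, outcomes):
        b = bin_width * round(p / bin_width)   # nearest bin centre
        cases, events = bins.get(b, (0, 0))
        bins[b] = (cases + 1, events + o)
    return {b: (cases, events, 100.0 * events / cases)
            for b, (cases, events) in sorted(bins.items())}

# Hypothetical forecast probabilities (%) and binary outcomes (1 = event observed)
table = reliability_table([0, 10, 10, 90, 90, 90, 100], [0, 0, 1, 1, 1, 0, 1])
```

Plotting each bin's observed relative frequency (third element) against the bin centre, with the case counts as a histogram, gives the reliability and sharpness diagrams described above.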
Reliability Diagram: Interpretation

Reliability
- Curve above the 45-degree line => underforecasting
- Curve below the 45-degree line => overforecasting
- Analogous to bias
- One of the components of the Brier score (see later)

Sharpness (Resolution)
- U-shaped histogram is best
- Gaussian-like distribution is worst
- A measure of spread (variance) in the distribution of forecasts

[Figure: reliability diagram - observed relative frequency (%) vs. forecast probability (%)]
[Figure (from Wilks, 1995): schematic reliability diagrams illustrating - climatology; minimal resolution; underforecasting; good resolution at the expense of reliability; reliable forecasts of a rare event; small sample size]
Probability Forecasts: Measures

Brier Score

BS = ( 1/n ) Σ ( p_i - o_i )²      Range: 0 to 1; Perfect score = 0

- The most common accuracy measure for probability forecasts; note: o_i is binary (0 or 1) !!!
- Analogous to MSE in probability space; negatively oriented, i.e. smaller is better
- A quadratic scoring rule; very sensitive to large forecast errors !!! Be careful with limited datasets
- For two categories only; for multiple categories, see RPS …
- Strongly influenced by the climatological frequency of the verification sample => different samples are not to be compared
- BS can be algebraically decomposed into Reliability, Resolution and Uncertainty

Brier Skill Score

BSS = [ 1 - BS / BS_ref ] * 100      Range: -∞ to 100; Perfect score = 100

BS_ref = ( 1/n ) Σ ( ref_i - o_i )²

where ref_i is either the climatological relative frequency of the event, or persistence.
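A minimal sketch of BS and BSS with hypothetical rain probabilities (not lecture data), using a constant climatological frequency of 0.3 as the reference:

```python
def brier(probs, outcomes):
    """Brier Score: mean squared error of probability forecasts (0..1)
    against binary outcomes (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill(probs, outcomes, ref_probs):
    """Brier Skill Score (%) against a reference forecast, e.g. climatology."""
    return (1 - brier(probs, outcomes) / brier(ref_probs, outcomes)) * 100

# Hypothetical rain probabilities vs. outcomes; climatology = 0.3 every day
fc   = [0.9, 0.1, 0.8, 0.2, 0.7]
obs  = [1, 0, 1, 0, 0]
clim = [0.3] * 5
bs  = brier(fc, obs)                   # (0.01 + 0.01 + 0.04 + 0.04 + 0.49) / 5
bss = brier_skill(fc, obs, clim)
```

Note how the single bad case (0.7 issued, no rain) contributes 0.49 of the total 0.59 squared error: the quadratic penalty dominates, as warned above.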
15.4.2005NOMEK - Verification Training - OSLO / [email protected]
Probability Forecasts: Measures

Ranked Probability Score

RPS = ( 1/(k–1) ) Σ_m [ ( Σ_{i=1..m} p_i ) – ( Σ_{i=1..m} o_i ) ]² ,  m = 1 … k

where k is the number of probability categories

Range: 0 to 1; perfect score = 0

Ranked Probability Skill Score

RPSS = [ 1 – RPS / RPS_ref ] * 100

– A vector generalization of BS and BSS to multi-event or multi-category situations
– Measures the sum of squared differences in cumulative probability space
– A quadratic score – penalizes most severely when forecast probabilities are far from the actually observed distributions
– As with BSS, RPSS is very sensitive to the size of the dataset

Range: –∞ to 100; perfect score = 100
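The cumulative-probability definition of the RPS translates directly into code. A minimal sketch with hypothetical three-category data (not from the training material), assuming NumPy:

```python
# Sketch of the ranked probability score over k ordered categories.
import numpy as np

def rps(p, o):
    """RPS = (1/(k-1)) * sum_m [ (sum_{i<=m} p_i) - (sum_{i<=m} o_i) ]^2

    p : forecast probabilities over k ordered categories (sums to 1)
    o : one-hot observation vector over the same k categories
    """
    p, o = np.asarray(p, float), np.asarray(o, float)
    k = p.size
    # Differences are taken in cumulative probability space.
    return float(np.sum((np.cumsum(p) - np.cumsum(o)) ** 2) / (k - 1))

# Three precipitation categories; the event fell in the middle one.
forecast = np.array([0.2, 0.5, 0.3])
observed = np.array([0.0, 1.0, 0.0])
score = rps(forecast, observed)   # 0 would be perfect
```

For k = 2 this reduces to the Brier score, which is the sense in which RPS generalizes BS.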
Probability Forecasts – Signal Detection Theory:
ROC – Relative Operating Characteristic

• Determines the ability of a forecasting system to separate situations where a signal is present (e.g. occurrence of rain) from those where it is absent (noise)
• In other words: assesses the performance of a forecasting system in discriminating between occurrence (Yes) and non-occurrence (No) of an event
• Tests e.g. model performance relative to a specific threshold
• Applicable to two-category probability forecasts and also to categorical deterministic forecasts, i.e. allows their comparison
• Has gained increasing popularity in meteorological forecast verification in recent years
( Known as Receiver Operating Characteristic in the medical sciences )
Signal Detection Theory: ROC Curve

• A graphical representation, in a unit square, of the Hit Rate (H, y-axis) against the False Alarm Rate (F, x-axis) for different potential decision thresholds
• The curve is plotted from a "binned" set of probability forecasts by stepping (or sliding) a decision threshold (e.g. at 10% probability intervals) through the forecasts, each probability decision threshold generating a separate 2*2 contingency table
– The probability forecast is thereby transformed into a set of categorical "yes/no" forecasts
– A set of value pairs of H and F is obtained, forming the curve
• It is desirable that H be high and F low, i.e. the closer a point is to the upper left-hand corner, the better the forecast
• A perfect forecast system, with only correct forecasts and no false alarms (regardless of the threshold chosen), has a "curve" that rises from (0,0) (H=F=0) along the y-axis to the upper left-hand corner (0,1) (H=1, F=0) and then runs straight to (1,1) (H=F=1)

                    Event observed
Forecast          Yes      No       Marginal total
Yes                a        b        a + b
No                 c        d        c + d
Marginal total   a + c    b + d     a + b + c + d = n

H = a / ( a + c )
F = b / ( b + d )
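The threshold-sliding procedure described above can be sketched as follows: each probability threshold turns the forecasts into yes/no forecasts, yielding one 2*2 table and hence one (F, H) pair. The data are illustrative, not from the training material:

```python
# Sketch: generate ROC points from probability forecasts by sliding a
# decision threshold; H = a/(a+c), F = b/(b+d) per the contingency table.
import numpy as np

def roc_points(probs, obs, thresholds):
    """Return a list of (F, H) pairs, one per decision threshold."""
    probs = np.asarray(probs, float)
    obs = np.asarray(obs, int)
    pts = []
    for t in thresholds:
        fc_yes = probs >= t
        a = np.sum(fc_yes & (obs == 1))    # hits
        b = np.sum(fc_yes & (obs == 0))    # false alarms
        c = np.sum(~fc_yes & (obs == 1))   # misses
        d = np.sum(~fc_yes & (obs == 0))   # correct rejections
        pts.append((b / (b + d), a / (a + c)))
    return pts

# Toy forecasts with real discriminating ability.
rng = np.random.default_rng(1)
p = rng.uniform(size=500)
y = (rng.uniform(size=500) < p).astype(int)
points = roc_points(p, y, thresholds=np.arange(0.1, 1.0, 0.1))
```

As the threshold rises, both H and F can only fall, which is why the points trace a curve from near (1,1) towards (0,0).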
[Figure: ROC curve – H (y-axis) vs F (x-axis), with points labelled by probability decision thresholds 10% … 90%.]
ROC Curve Generation

                    Event observed
Forecast          Yes      No       Marginal total
Yes                a        b        a + b
No                 c        d        c + d
Marginal total   a + c    b + d     a + b + c + d = n
[Figure: ROC curve – H vs F, with points labelled by probability decision thresholds 10% … 90%.]

To learn more about ROC and Signal Detection Theory, check:
http://wise.cgu.edu/
H = a / ( a + c )
F = b / ( b + d )
Example

Probability    # of           Cumulative     # of non-      Cumulative non-    H     F
threshold      occurrences    occurrences    occurrences    occurrences       (%)   (%)
                  (a)            (Σa)           (b)            (Σb)
 0 –  9           43            1920           613            5351            100   100
10 – 19          172            1877          1389            4738             98    89
20 – 29          283            1705          1183            3349             89    63
30 – 39          350            1422           936            2166             74    40
40 – 49          323            1072           602            1230             56    23
50 – 59          287             749           327             628             39    12
60 – 69          169             462           151             301             24     6
70 – 79          163             293            88             150             15     3
80 – 89           89             130            40              62              7     1
90 – 99           41              41            22              22              2     0

a + c = 1920     b + d = 5351
Signal Detection Theory: ROC Area (ROCA)

• The area under the ROC curve
• A relative index and a widely used summary measure
• Decreases from 1 as the curve moves down from the ideal top-left corner
• A useless forecast system lies along the diagonal, where H = F and the area = 0.5; such a system cannot discriminate between occurrences and non-occurrences of the event

Range: 0 to 1; perfect system = 1

ROCA-based skill score:

ROC_SS = 2 * ROCA – 1

• Negative below the diagonal
• At its minimum, ROC_SS = –1, when ROCA = 0

Range: –1 to 1; perfect score = 1

• ROC is also applicable to deterministic categorical forecasts
– ROC_SS then translates into the KSS (= H – F)
– With only one single decision threshold, only a single ROC point results; typically this lies "inside" the ROC area, i.e. indicates worse quality
• ROC, ROCA and ROC_SS are directly related to a decision-theoretic approach
– Can be related to the economic value of probability forecasts to end users
– Allows for assessment of the costs of false alarms
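ROCA can be approximated by trapezoidal integration of the (F, H) points, with ROC_SS = 2 * ROCA – 1 following directly. A minimal sketch with hypothetical curve points (not from the training material):

```python
# Sketch: area under a ROC curve by trapezoidal integration of (F, H)
# points, with the curve anchored at (0,0) and (1,1).
import numpy as np

def roc_area(F, H):
    """Trapezoidal ROCA estimate from (F, H) value pairs."""
    F = np.concatenate(([0.0], np.asarray(F, float), [1.0]))
    H = np.concatenate(([0.0], np.asarray(H, float), [1.0]))
    order = np.argsort(F)          # integrate left to right along F
    F, H = F[order], H[order]
    return float(np.sum(0.5 * (H[1:] + H[:-1]) * np.diff(F)))

# Illustrative (F, H) pairs from three decision thresholds.
F = [0.1, 0.3, 0.6]
H = [0.5, 0.8, 0.95]
roca = roc_area(F, H)
roc_ss = 2.0 * roca - 1.0   # 0 on the no-skill diagonal, 1 for perfection
```

A single categorical forecast contributes just one interior point, and the trapezoid through it reproduces ROC_SS = H – F, i.e. the KSS noted above.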
Probability Forecasts – Example 7

Reliability (left) and ROC (right) diagrams of one year of PoP forecasts. (The data are the same as earlier, where PoPs were transformed into categorical yes/no forecasts by using 50% as the "on/off" threshold.) The inset box in the reliability diagram shows the frequency of use of the various forecast probabilities (sharpness), and the horizontal dotted line the climatological event probability. The reliability curve (open circles) indicates a strong over-forecasting bias throughout the probability range. This appears to be a common feature at this particular location, as indicated by the qualitatively similar 10-year average reliability curve (dashed line). Brier skill scores (BSS) are computed against two reference forecast systems; of these, climatology proves a much stronger "no-skill opponent" than persistence. The ROC curve (right) is constructed from the forecast and observed probabilities, the different potential decision thresholds yielding the respective value pairs of H and F. ROCA and ROC_SS values are also shown. The black dot represents the single-value ROC point from the categorical binary case of Example 5 (Slide #39) (H = POD = 0.7; F = 0.17).
ROC Curve: Probability FC Scheme, Vers. 1

[Figure: ROC curve (H vs F) for version 1 of the probability forecast scheme, with decision thresholds 10% … 90% and a point labelled D.]
ROC Curve: Probability FC Scheme, Vers. 2

[Figure: ROC curve (H vs F) for version 2 of the probability forecast scheme, with decision thresholds 5% … 90% and a point labelled D.]
Probability Forecasts – Summary:

• Verify a comprehensive set of probability forecasts, focusing on adverse and/or extreme weather
• Produce reliability diagrams, including the sharpness distribution
• Compute BS, BSS; for multi-category events RPS, RPSS
• Produce ROC diagrams, ROCA, ROC_SS
• See Example 7 in the General Guide to Verification (NOMEK Training)

BS:
• Based on squared error
• Decompositions provide insight into several performance attributes (not discussed here)
• Dependent on the frequency of occurrence of the event

ROC:
• Considers the forecasts' ability to discriminate between Yes and No events
• Provides verification information for individual decision thresholds
• Less dependent on the frequency of occurrence of the event
FINAL reminder on verification:
An act (“art”?) of countless methods and measures
An essential daily real-time practice in the operational forecasting environment
An active feedback and dialogue process is a necessity
A fundamental means to improve weather forecasts and services
References: Literature

Bougeault, P., 2003: WGNE recommendations on verification methods for numerical prediction of weather elements and severe weather events. CAS/JSC WGNE Report No. 18.
Cherubini, T., A. Ghelli and F. Lalaurette, 2001: Verification of precipitation forecasts over the Alpine region using a high-density observing network. ECMWF Tech. Mem. 340, 18 pp.
Grazzini, F. and A. Persson, 2003: User Guide to ECMWF Forecast Products. ECMWF Met. Bull., M3.2.
Jolliffe, I.T. and D.B. Stephenson, 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley.
Murphy, A.H. and R.L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.
Stanski, H.R., L.J. Wilson and W.R. Burrows, 1989: Survey of Common Verification Methods in Meteorology. WMO Research Report No. 89-5.
Stephenson, D.B., 2000: Use of the "odds ratio" for diagnosing forecast skill. Weather and Forecasting, 15, 221-232.
Thornes, J.E. and D.B. Stephenson, 2001: How to judge the quality and value of weather forecast products. Meteorol. Appl., 8, 307-314.
Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction (Chapter 7: Forecast Verification). Academic Press.
Proceedings, Making Verification More Meaningful, Boulder, 30 July - 1 August 2002.
Proceedings, SRNWP Mesoscale Verification Workshop, De Bilt, 2001.
Proceedings, WMO/WWRP International Conference on Quantitative Precipitation Forecasting, Vols. 1 and 2, Reading, 2 - 6 September 2002.
References: Websites

http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.html
http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/jwgv/jwgv.html - WMO/WWRP/WGNE Working Group on Verification websites
http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/Workshop2004/home.html - International Verification Methods Workshop (Montreal, 2004)
http://www.rap.ucar.edu/research/verification/ver_wkshp1.html - Making Verification More Meaningful Workshop (Boulder, 2002)
http://www.chmi.cz/meteo/ov/wmo - WMO/WWRP Workshop on the Verification of QPF (Prague, 2001)
http://hirlam.knmi.nl/open/srnwp/ - EUMETNET/SRNWP Mesoscale Verification Workshops (De Bilt, 2004)
http://www.sec.noaa.gov/forecast_verification/Glossary.html - NOAA/SEC Glossary of verification terms
http://nws.noaa.gov/tdl/verif - NOAA MOS verification website
http://wwwt.emc.ncep.noaa.gov/gmb/ens/verif.html - NOAA EPS Verification website
http://www.wmo.ch/web/www/DPS/SVS-for-LRF.html - WMO/CBS Standardised Verification System for Long-Range Forecasts
http://www.ecmwf.int/products/forecasts/d/charts/medium/verification - Verification of ECMWF Forecasting System
http://www.ecmwf.int/products/forecasts/guide - User Guide to ECMWF Forecast Products