[email protected] 15.4.2005 NOMEK - Verification Training - OSLO / 1 Pertti Nurmi ( Finnish...
-
Upload
mariah-thomas -
Category
Documents
-
view
226 -
download
0
Transcript of [email protected] 15.4.2005 NOMEK - Verification Training - OSLO / 1 Pertti Nurmi ( Finnish...
15.4.2005 NOMEK - Verification Training - OSLO / [email protected]
Pertti Nurmi (Finnish Meteorological Institute)
General Guide to Forecast Verification (Methodology)
NOMEK - Oslo, 15.-16.4.2005
[Figure: time series 1980-2000 of verification scores for Tmax D+2, Tmean D+1-5 and Tmean D+6-10]
[Figure: "T2m; ME & MAE; ECMWF & LAM; Average over 30 stations; Winter 2003" - curves MAE_ECMWF, MAE_LAM, ME_ECMWF, ME_LAM; y-axis (C), x-axis (hrs)]
A glimpse at verification history: USA tornadoes, 1884 (the Finlay case)

( 2680 + 30 ) / 2800 = 96.8 %

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                 30     70     100
No                  20   2680    2700
obs Σ               50   2750    2800
A glimpse at history from the Wild West, cont'd...

( 2750 + 0 ) / 2800 = 98.2 %   ( vs. 96.8 % )

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                 30     70     100
No                  20   2680    2700
obs Σ               50   2750    2800

"Never forecast a tornado":

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                  0      0       0
No                  50   2750    2800
obs Σ               50   2750    2800
Another interpretation:

Back to the original results ( PC = 96.8 % ):

POD = 30 / 50 = 60 %     Probability Of Detection
FAR = 70 / 100 = 70 %    False Alarm Ratio
B = FBI = 100 / 50 = 2   (Frequency) Bias

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                 30     70     100
No                  20   2680    2700
obs Σ               50   2750    2800

"Never forecast a tornado" ( PC = 98.2 % ):

POD = FAR = B = 0 % !

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                  0      0       0
No                  50   2750    2800
obs Σ               50   2750    2800
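As a minimal sketch (not part of the original lecture), the measures above can be computed directly from the 2*2 table; `scores_2x2` is a hypothetical helper name:

```python
def scores_2x2(a, b, c, d):
    """Basic measures from a 2x2 contingency table:
    a = hits, b = false alarms, c = misses, d = correct rejections."""
    n = a + b + c + d
    pc = (a + d) / n                       # Proportion Correct
    pod = a / (a + c) if a + c else 0.0    # Probability Of Detection
    far = b / (a + b) if a + b else 0.0    # False Alarm Ratio
    bias = (a + b) / (a + c)               # (Frequency) Bias
    return pc, pod, far, bias

# Finlay's original tornado forecasts
pc, pod, far, bias = scores_2x2(30, 70, 20, 2680)
# "Never forecast a tornado"
pc0, pod0, far0, bias0 = scores_2x2(0, 0, 50, 2750)
```

Running this reproduces the slide values: PC rises from 96.8 % to 98.2 % while POD, FAR and B all collapse to zero.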
First reminder on verification:
• An act ("art"?) of countless methods and measures
• An essential daily real-time practice in the operational forecasting environment
• An active feedback and dialogue process is a necessity
• A fundamental means to improve weather forecasts and services
Outline:
1. Introduction - History
2. Goals and general guidelines
3. Continuous variables
4. Categorical events
   • Binary (dichotomous; yes/no) forecasts
   • Multi-category forecasts
5. Probability forecasts
6. Forecast value (NOT covered in these lectures)
References
   • Literature
   • Websites
…You heard it already
Acknowledgement: Laurie Wilson (CMC, Canada)
<= Break?
Outline:
1. Introduction - History
2. Goals and general guidelines
Goals of *THIS* Training:
• Understand the basic properties of, and relationships among, common verification measures
• Learn to extract useful information from (graphical) verification results
• Increase interest in forecast verification and its methods - apply them in everyday forecasting practice
• Emphasis is on verification of weather elements rather than, e.g., NWP fields
Goals of (objective) Verification:
• "Administrative"
  - Feedback process to operational forecasters => Example follows !
  - Monitor the quality of forecasts and potential trends in quality
  - Justify the cost of provision of weather services
  - Justify acquisition of additional or new models, equipment, …
• "Scientific"
  - Identify strengths and weaknesses of a forecast product, leading to improvements, i.e. provide information to direct R&D
• "Value" (NOT covered explicitly in these lectures)
  - Determine the (economic) value of the forecasts to users
  - Requires quantitative information on users' economic sensitivity to weather
Personal scoring (example) ...
[Figure: personal scoring example - panels A, B, C]
Principles of (objective) Verification:
• Verification activity has value only if the information generated leads to a decision about the forecast itself, or about the forecast system being verified
  - The user of the information must be identified
  - The purpose of the verification must be known in advance
• No single verification measure can provide complete information about forecast quality
• Forecasts should be formulated in a verifiable form
Operational Verification - "State-of-the-Art"
• Comprehensive comparison of forecast(er)s vs. observations
• Stratification and aggregation (pooling) of results
• Statistics of guidance forecasts (e.g. NWP, MOS)
• Instant feedback to forecasters
• Statistics of individual forecasters - e.g. personal biases
• Comprehensive set of tailored verification measures
• Simplified measures for laymen
• Continuity into history
Allan Murphy's (RIP) "Goodness":
• Consistency: Forecasts agree with the forecaster's true belief about the future weather [ strictly proper ]; cf. hedging
• Quality: Correspondence between observations and forecasts [ verification ]
• Value: Increase or decrease in economic or other kind of value to someone as a result of using the forecast [ decision theory ]
Verification Procedure… Define predictand types:
• Continuous: Forecast is a specific value of the variable
  - Temperature; fixed time (e.g. noon), Tmin, Tmax, time-averaged (e.g. 5-day)
  - Wind speed and direction; fixed time, time-averaged
• Categorical - Probabilistic: Forecast is the probability of occurrence of ranges of values of the variable (categories)
  - Precipitation (vs. no precipitation) - POP; with various rainfall thresholds
  - Precipitation type
  - Cloud amount
  - Strong winds (vs. no strong wind); with various wind force thresholds
  - Night frost (vs. no frost)
  - Fog (vs. no fog)
Verification Procedure, cont'd…
• Define the purpose of the verification
  - Scientific vs. administrative
  - Define the questions to be answered
• Assemble the dataset of matched observation and forecast pairs
• Dataset stratification (from "pooled" data)
  - "External" stratification by time of day, season, forecast lead time, etc.
  - "Internal" stratification, e.g. to separate extreme events
    · According to forecast
    · According to observation
  - Maintain a sufficient sample size
Outline:
1. Introduction - History
2. Goals and general guidelines
3. Continuous variables
3. Continuous Variables: First explore the data
• Scatterplots of forecasts vs. observations
  - Visual relation between forecast and observed distributions
  - Distinguish outliers in forecast and/or observation datasets
  - Accurate forecasts have points on the 45-degree diagonal
• Additional scatterplots
  - Observations vs. [ forecast - observation ] difference
  - Forecasts vs. [ forecast - observation ] difference
  - Behaviour of forecast errors with respect to the observed or forecast distributions - potential clustering or curvature in their relationships
• Time series plots of forecasts vs. observations (or forecast error)
  - Potential outliers in either forecast or observation datasets
  - Trends and time-dependent relationships
• Neither scatterplots nor time series plots provide any concrete measures of accuracy
Continuous Variables - Example 1; Exploring the data
[Figure] Scatterplot of one year of ECMWF three-day T2m forecasts (left) and forecast errors (right) versus observations at a single location. Red, yellow and green dots separate the errors into three categories.
Mean Error aka Bias

ME = ( 1/n ) Σ ( f_i - o_i )          Range: -∞ to ∞; Perfect score = 0

- Average error in a given set of forecasts
- Simple and informative score on the behaviour of a given weather element
- With ME > 0 ( < 0 ), the system exhibits over- (under-) forecasting
- Not an accuracy measure; does not provide information on the magnitude of errors
- Should be viewed in comparison to climatology

Mean Absolute Error

MAE = ( 1/n ) Σ | f_i - o_i |         Range: 0 to ∞; Perfect score = 0

- Average magnitude of errors in a given set of forecasts
- Linear measure of accuracy
- Does not distinguish between positive and negative forecast errors
- Negatively oriented, i.e. smaller is better
- Illustrative => recommended to view ME and MAE simultaneously => Examples follow !
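The two definitions above can be sketched side by side; this is a minimal illustration with hypothetical forecast/observation data, not material from the lecture:

```python
def me(fc, obs):
    """Mean Error (bias): average of forecast-minus-observation differences."""
    return sum(f - o for f, o in zip(fc, obs)) / len(fc)

def mae(fc, obs):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(f - o) for f, o in zip(fc, obs)) / len(fc)

# Hypothetical T2m forecasts and observations (deg C); errors are +1, +1, -1, +1
fc  = [2.0, -1.0, 0.5, 3.0]
obs = [1.0, -2.0, 1.5, 2.0]
```

Here ME = 0.5 while MAE = 1.0: the positive and negative errors partly cancel in ME, which is exactly why the lecture recommends viewing both together.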
Mean Squared Error

MSE = ( 1/n ) Σ ( f_i - o_i )²        Range: 0 to ∞; Perfect score = 0

or its square root, RMSE, which has the same unit as the forecast parameter

- Negatively oriented, i.e. smaller is better
- A quadratic scoring rule; very sensitive to large forecast errors !!!
  · Harmful in the presence of potential outliers in the dataset
  · Care must be taken with limited datasets
  · Fear of high penalties easily leads to conservative forecasting
- RMSE is always >= MAE
- Comparison of MAE and RMSE indicates the error variance
- The MSE - RMSE decomposition is not dealt with here: acknowledge Anders Persson (yesterday)
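A short sketch (hypothetical data, not from the lecture) of the outlier sensitivity and of the RMSE >= MAE property:

```python
import math

def mse(fc, obs):
    """Mean Squared Error."""
    return sum((f - o) ** 2 for f, o in zip(fc, obs)) / len(fc)

# Hypothetical errors: a single 5-degree outlier dominates the quadratic score
fc  = [1.0, 0.0, 0.0, 5.0]
obs = [0.0, 0.0, 0.0, 0.0]
rmse = math.sqrt(mse(fc, obs))                               # sqrt(26/4) ~ 2.55
mae_ = sum(abs(f - o) for f, o in zip(fc, obs)) / len(fc)    # 6/4 = 1.5
assert rmse >= mae_   # holds for any dataset
```

The gap between RMSE (2.55) and MAE (1.5) signals a large error variance, i.e. the presence of occasional big misses.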
Continuous Variables - Example 1, cont'd…
[Figure] Scatterplot of one year of ECMWF three-day T2m forecasts (left) and forecast errors (right) versus observations at a single location. Red, yellow and green dots separate the errors into three categories. Some basic statistics, like ME, MAE and MSE, are also shown. The plots reveal the dependence of model behaviour on temperature range, i.e. over- (under-) forecasting in the cold (warm) tails of the distribution.
Continuous Variables – Example 2
[Figure, left: "T2m; ME & MAE; ECMWF & LAM; Average over 30 stations; Winter 2003" - curves MAE_ECMWF, MAE_LAM, ME_ECMWF, ME_LAM; y-axis (C), x-axis (hrs)]
[Figure, right: "T2m; ME & MAE; ECMWF & PPP; Average over 30 stations; Winter 2003" - curves MAE_ECMWF, MAE_PPP, ME_ECMWF, ME_PPP; y-axis (C), x-axis (hrs)]
Temperature bias and MAE comparison between ECMWF and a Limited Area Model (LAM) (left), and an experimental post-processing scheme (PPP) (right), aggregated over 30 stations and one winter season. In spite of the ECMWF warm bias and diurnal cycle, it has a slightly lower MAE level than the LAM (left). The applied experimental "perfect prog" scheme does not manage to dispose of the model bias and exhibits larger absolute errors than the originating model - this example clearly demonstrates the importance of thorough verification prior to implementing a potential post-processing scheme in operational use.
Continuous Variables: Aggregation (pooling) vs. Stratification
[Figure: MOS vs. EP MAE - aggregate of 6 months (Jan – June), 3 lead times (+12, +24, +48 hr), 4 stations in Finland]
Continuous Variables: Aggregation (pooling) vs. Stratification
[Figures: MAE stratified by lead time, by month, and by station]
Continuous Variables: Aggregation (pooling) vs. Stratification
[Figure: MOS vs. EP Bias - aggregate of 6 months (Jan – June), 3 lead times (+12, +24, +48 hr), 4 stations in Finland]
Continuous Variables: Aggregation (pooling) vs. Stratification
[Figures: Bias stratified by lead time, by month, and by station]
General Skill Score

SS = ( A - A_ref ) / ( A_perf - A_ref )

where A is the applied measure of accuracy, subscript "ref" refers to some reference forecast, and "perf" to a perfect forecast.

For negatively oriented accuracy measures like MAE or MSE ( A_perf = 0 ):

SS = [ 1 - A / A_ref ] * 100          Range: -∞ to 100; Perfect score = 100

i.e. the relative accuracy, as the % improvement over a reference system

- Reference is typically climatology or persistence => Apply both; Examples follow !
- If negative, the reference (climatology or persistence) is better

MAE_SS = [ 1 - MAE / MAE_ref ] * 100
MSE_SS = [ 1 - MSE / MSE_ref ] * 100

- The latter is also known as Reduction of Variance, RV
- SS can be unstable for small sample sizes, especially MSE_SS
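A one-function sketch of the general skill score (hypothetical MAE values, not lecture data):

```python
def skill_score(a, a_ref, a_perf=0.0):
    """General skill score, in %: improvement of accuracy measure A over a
    reference. For negatively oriented measures (MAE, MSE) A_perf = 0, so
    this reduces to (1 - A / A_ref) * 100."""
    return (a - a_ref) / (a_perf - a_ref) * 100

# Hypothetical MAEs: forecast system 1.2 C, climatology reference 2.0 C
mae_ss = skill_score(1.2, 2.0)   # 40 % improvement over climatology
```

A forecast worse than the reference (e.g. MAE 2.5 C vs. 2.0 C) would give a negative score, matching the interpretation on the slide.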
Continuous Variables – Example 3
[Figure, left: "T2m; MAE; Average over 3 stations & forecast ranges +12-120 hrs" - MAE (C) of the End Product and the "Better of ECMWF / LAM", per season from Winter 2001 to Winter 2003, plus time-average]
[Figure, right: "T2m; Skill of End Product over 'Better of ECMWF / LAM'" - Skill (%) per season, plus time-average]
Mean Absolute Errors of End Product and DMO temperature forecasts (left), and Skill of the End Products over model output (right). The better of either ECMWF or the local LAM is chosen up to the +48-hour forecast range (hindcast); thereafter ECMWF is used. The figure is an example of both aggregation (3 stations, several forecast ranges, two models, time-average) and stratification (seasons).
Linear Error in Probability Space

LEPS = ( 1/n ) Σ | CDFo( f_i ) - CDFo( o_i ) |      Range: 0 to 1; Perfect score = 0

where CDFo is the cumulative distribution function of the observations, determined from a relevant climatology.

- Corresponds to MAE transformed from measurement space into probability space
- Does not depend on the scale of the variable
- Takes into account the variability of the weather element
- Can be used to evaluate forecasts at different locations
- Computation requires definition of cumulative climatological distributions at each location
- Encourages forecasting in the extreme tails of the climate distributions: errors there are penalized less than similar-sized errors in a more probable region of the distribution, i.e. the opposite of MSE => Examples follow !

Skill Score

LEPS_SS = [ 1 - LEPS / LEPS_ref ] * 100             Range: -∞ to 100; Perfect score = 100
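The transformation into probability space can be sketched with an empirical CDF; the climatological sample below is hypothetical (a peaked distribution centred near 15 "units", loosely in the spirit of the following slide):

```python
def empirical_cdf(sample):
    """CDF of the observations, estimated from a climatological sample."""
    xs = sorted(sample)
    n = len(xs)
    def cdf(v):
        return sum(1 for x in xs if x <= v) / n   # fraction of values <= v
    return cdf

def leps(fc, obs, cdf):
    """Linear Error in Probability Space: MAE after mapping both forecasts
    and observations through the climatological CDF."""
    return sum(abs(cdf(f) - cdf(o)) for f, o in zip(fc, obs)) / len(fc)

# Hypothetical, peaked climatology
clim = [10, 12, 13, 14, 14, 15, 15, 15, 16, 16, 17, 18, 20, 23, 28]
cdf = empirical_cdf(clim)
mid_error  = leps([13], [15], cdf)   # 2-unit error near the median
tail_error = leps([21], [23], cdf)   # same-sized error in the tail
```

The same 2-unit measurement-space error costs far more near the median (~0.33) than in the tail (~0.07), which is exactly the behaviour described above.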
[Figure: "Hypothetical Climatological Distribution" (left) and "Hypothetical Cumulative Density Function" (right), values 1-29]
LEPS for a hypothetical distribution and location: The climatological frequency distribution (left) is transformed into a cumulative probability distribution (right). A 2-"unit" forecast error around the median, 13 vs. 15 "units" (red arrows), would yield a LEPS value of c. 0.2 in probability space ( | 0.5 - 0.3 |, red arrows). An equal error in measurement space close to the tail of the distribution, 21 vs. 23 "units" (blue arrows), would result in a LEPS value of c. 0.05 ( | 0.95 - 0.9 |, blue arrows) => Forecast errors of rare events are much less penalized using LEPS !
Skill comparison (example A) ... [Figure]
Skill comparison (example B) ... [Figure]
Skill comparison (example C) ... [Figure]
Continuous variables - Summary:
• Verify a comprehensive set of local weather elements
• Produce scatterplots & time series plots, including forecasts and/or observations against their difference
• "Stratify & Aggregate" + compute ME, MAE, MAE_SS
• Additionally, compute LEPS, LEPS_SS, MSE, MSE_SS
Examples 1 - 4 in the General Guide to Verification (NOMEK Training)
Examples:
- Temperature: fixed time (e.g. noon, midnight), Tmin, Tmax, time-averaged (e.g. 5-day)
- Wind speed and direction: fixed time, time-averaged
- Accumulated precipitation: time-integrated (e.g. 6, 12, 24 hours)
- Cloudiness: fixed time, time-averaged; however, typically categorized
Outline:
1. Introduction - History
2. Goals and general guidelines
3. Continuous variables
4. Categorical events
   • Binary (dichotomous; yes/no) forecasts
   • Multi-category forecasts
4. Categorical Events

               Event observed
Event forecast   Yes       No                Marginal total
Yes              Hit       False alarm       Fc Yes
No               Miss      Corr. rejection   Fc No
Marginal total   Obs Yes   Obs No            Sum total

               Event observed
Event forecast   Yes     No      Marginal total
Yes              a       b       a + b
No               c       d       c + d
Marginal total   a + c   b + d   a + b + c + d = n

                 Tornado observed
Tornado forecast   Yes     No    fc Σ
Yes                 30     70     100
No                  20   2680    2700
obs Σ               50   2750    2800
Bias aka Frequency Bias Index

B = FBI = ( a + b ) / ( a + c )   [ ~ Fc Yes / Obs Yes ]      Range: 0 to ∞; Perfect score = 1

- With B > 1, the system exhibits over-forecasting
- With B < 1, the system exhibits under-forecasting

Proportion Correct

PC = ( a + d ) / n   [ ~ ( Hits + Correct rejections ) / Sum total ]      Range: 0 to 1; Perfect score = 1

- The simplest and most intuitive performance measure
- Usually very misleading, because it rewards correct "Yes" and "No" forecasts equally
- Can be maximized by forecasting the most common category all the time
- Strongly influenced by the more common category
- Never use for extreme event verification !!!

Tornado example: B = 2.00, PC = 0.97
Probability Of Detection, Hit Rate ( H ), Prefigurance

POD = a / ( a + c )   [ ~ Hits / Obs Yes ]      Range: 0 to 1; Perfect score = 1

- Sensitive to misses only, not to false alarms
- Can be artificially improved by over-forecasting (rare events)
- Complement score: Miss Rate, MR = 1 - H = c / ( a + c )
- Must be examined together with …

False Alarm Ratio

FAR = b / ( a + b )   [ ~ False alarms / Fc Yes ]      Range: 0 to 1; Perfect score = 0

- Sensitive to false alarms only, not to misses
- Can be artificially improved by under-forecasting (rare events)
- An increase of POD can be achieved by increasing FAR, and vice versa

Tornado example: POD = 0.60, FAR = 0.70
Post Agreement

PAG = a / ( a + b )   [ ~ Hits / Fc Yes ]      Range: 0 to 1; Perfect score = 1

- Complement of FAR (i.e. PAG = 1 - FAR)
- Sensitive to false alarms, not to misses

False Alarm Rate, Probability Of False Detection ( POFD )

F = b / ( b + d )   [ ~ False alarms / Obs No ]      Range: 0 to 1; Perfect score = 0

- False alarms, given that the event did not occur (Obs No)
- Sensitive to false alarms only, not to misses
- Can be artificially improved by under-forecasting (rare events) - cf. the Tornado case
- Generally used with POD (or H) to produce the ROC score for probability forecasts; otherwise rarely used

Tornado example: PAG = 0.30, F = 0.03
Hanssen & Kuipers' Skill Score, True Skill Statistic

KSS = TSS = POD - F = ( ad - bc ) / [ ( a + c )( b + d ) ]      Range: -1 to 1; Perfect score = 1; No-skill level = 0

- Popular skill score combining POD and F
- Measures the ability to separate "Yes" cases (POD) from "No" cases (F)
- For rare events the d cell is large => F is small => KSS is close to POD

Threat Score, Critical Success Index

TS = CSI = a / ( a + b + c )      Range: 0 to 1; Perfect score = 1; No-skill level = 0

- Simple, popular measure for rare events; sensitive to hits, false alarms and misses
- Measures the forecast after removing correct (simple) "No" forecasts from consideration
- Sensitive to the climatological frequency of the event
- More balanced than POD or FAR

Tornado example: KSS = 0.57, TS = 0.25
Equitable Threat Score

ETS = ( a - a_r ) / ( a + b + c - a_r ) , where a_r = ( a + b )( a + c ) / n      Range: -1/3 to 1; Perfect score = 1; No-skill level = 0

- a_r is the number of hits expected from random forecasts
- The simple TS may include hits due to random chance; ETS corrects for this

Heidke Skill Score

HSS = 2 ( ad - bc ) / [ ( a + c )( c + d ) + ( a + b )( b + d ) ]      Range: -∞ to 1; Perfect score = 1; No-skill level = 0

- One of the most popular skill measures for categorical forecasts
- Measures skill against random chance

Tornado example: ETS = 0.24, HSS = 0.39
Odds Ratio

OR = ad / bc      Range: 0 to ∞; Perfect score = ∞; No-skill level = 1

Measures the forecast system's odds of scoring a hit (H) as compared to making a false alarm (F):

OR = [ H / ( 1 - H ) ] / [ F / ( 1 - F ) ]

- Independent of potential biases between observations and forecasts

Transformation into a skill score, ranging from -1 to +1:

ORSS = ( ad - bc ) / ( ad + bc ) = ( OR - 1 ) / ( OR + 1 )

- Typically produces very high absolute skill values, by definition
- Practically never used in meteorological forecast verification

Tornado example: OR = 57.43, ORSS = 0.97
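All of the 2*2 measures defined in this section can be gathered into one sketch; `categorical_scores` is a hypothetical helper, applied here to the Finlay tornado table to reproduce the values quoted on the slides:

```python
def categorical_scores(a, b, c, d):
    """2x2 contingency-table measures from this lecture.
    a = hits, b = false alarms, c = misses, d = correct rejections."""
    n = a + b + c + d
    ar = (a + b) * (a + c) / n          # hits expected from random forecasts
    return {
        "B":    (a + b) / (a + c),      # (Frequency) Bias
        "PC":   (a + d) / n,            # Proportion Correct
        "POD":  a / (a + c),            # Probability Of Detection
        "FAR":  b / (a + b),            # False Alarm Ratio
        "PAG":  a / (a + b),            # Post Agreement
        "F":    b / (b + d),            # False Alarm Rate (POFD)
        "KSS":  a / (a + c) - b / (b + d),   # POD - F
        "TS":   a / (a + b + c),        # Threat Score (CSI)
        "ETS":  (a - ar) / (a + b + c - ar),
        "HSS":  2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d)),
        "OR":   a * d / (b * c),
        "ORSS": (a * d - b * c) / (a * d + b * c),
    }

s = categorical_scores(30, 70, 20, 2680)   # the Finlay tornado table
```

Every entry matches the running tornado example: B = 2.00, PC = 0.97, POD = 0.60, FAR = 0.70, PAG = 0.30, F = 0.03, KSS = 0.57, TS = 0.25, ETS = 0.24, HSS = 0.39, OR = 57.43, ORSS = 0.97.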
Categorical Events – Example 5: Precipitation in Finland

              Rain observed
Rain forecast   Yes    No   fc Σ
Yes              52    45     97
No               22   227    249
obs Σ            74   272    346

B = 1.31    PC = 0.81    POD = 0.70    FAR = 0.46    PAG = 0.54    F = 0.17
TS = 0.44   ETS = 0.32   KSS = 0.53    HSS = 0.48    OR = 11.92    ORSS = 0.85

Contingency table of one year (with 19 missing cases) of categorical rain vs. no-rain forecasts, with the resulting statistics. Rainfall is a relatively rare event at this particular location, occurring in only c. 20 % (74/346) of the cases. Because of this, PC is quite high at 0.81. The relatively high rain detection rate (0.70) is "balanced" by a high number of false alarms (0.46), with almost every other rain forecast having been superfluous. This is also seen as biased over-forecasting of the event (B = 1.31). Due to the scarcity of the event, the false alarm rate is quite low (0.17) - used alone, this measure would give a very misleading picture of forecast quality. The Odds Ratio shows that it was 12 times more probable to make a correct (rain or no rain) forecast than an incorrect one. Note that the resulting skill score (0.85) is much higher than the other skill scores - this is a typical feature of the ORSS, due to its definition.
Multi-category Events
• Extension of the 2*2 table to several (k) mutually exclusive and exhaustive categories
  - Rain type: rain / snow / freezing rain ( k = 3 )
  - Wind warnings: strong gale / gale / no gale ( k = 3 )
• Only PC (Proportion Correct) can be directly generalized
• Other verification measures need to be converted into a series of 2*2 tables, each "forecast event" distinct from the "non-forecast event"

Generalizations of KSS and HSS - measures of improvement over random forecasts:

KSS = { Σ p( f_i , o_i ) - Σ p( f_i ) p( o_i ) } / { 1 - Σ ( p( o_i ) )² }
HSS = { Σ p( f_i , o_i ) - Σ p( f_i ) p( o_i ) } / { 1 - Σ p( f_i ) p( o_i ) }

For a 3*3 table with cells r, s, t / u, v, w / x, y, z (forecast categories f1-f3 in rows, observed categories o1-o3 in columns), each category reduces to its own 2*2 table; e.g. for f1: a = r, b = s + t, c = u + x, d = v + w + y + z.
Multi-category Events – Example 6: Cloudiness in Finland

                Clouds observed
Clouds forecast   0 - 2   3 - 5   6 - 8   fc Σ
0 - 2                65      10      21     96
3 - 5                29      17      48     94
6 - 8                18      10     128    156
obs Σ               112      37     197    346

        No clouds (0-2)   Partly cloudy (3-5)   Cloudy (6-8)
B                  0.86                  2.54           0.79
POD                0.58                  0.46           0.65
FAR                0.32                  0.82           0.18
F                  0.13                  0.25           0.19
TS                 0.45                  0.15           0.57

Overall: PC = 0.61, KSS = 0.41, HSS = 0.37

Multi-category contingency table of one year (with 19 missing cases) of cloudiness forecasts, with the resulting statistics. Results are shown for forecasts of each cloud category separately, together with the overall PC, KSS and HSS scores. The most marked feature is the very strong over-forecasting of the "partly cloudy" category, leading to numerous false alarms (B = 2.5, FAR = 0.8) and, despite this, poor detection (POD = 0.46). The forecasts cannot reflect the observed U-shaped distribution of cloudiness at all. Regardless of this inferiority, both overall skill scores are relatively high (c. 0.4), following from the fact that most of the cases (90 %) fall in either the "no clouds" or the "cloudy" category - neither of these scores takes the relative sample probabilities into account, but weights all correct forecasts equally.
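The generalized formulas can be sketched and checked against this cloudiness table; `multicat_scores` is a hypothetical helper name:

```python
def multicat_scores(table):
    """PC and the generalized KSS / HSS for a k*k contingency table
    (rows = forecast category, columns = observed category)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    fc_marg  = [sum(row) / n for row in table]                             # p(f_i)
    obs_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # p(o_i)
    diag = sum(table[i][i] for i in range(k)) / n      # Σ p(f_i, o_i)
    rand = sum(f * o for f, o in zip(fc_marg, obs_marg))   # Σ p(f_i) p(o_i)
    pc  = diag
    kss = (diag - rand) / (1 - sum(o * o for o in obs_marg))
    hss = (diag - rand) / (1 - rand)
    return pc, kss, hss

# The Cloudiness-in-Finland table from Example 6
pc, kss, hss = multicat_scores([[65, 10, 21], [29, 17, 48], [18, 10, 128]])
```

This reproduces the overall scores on the slide: PC = 0.61, KSS = 0.41, HSS = 0.37.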
Multi-category Events – Example 6, cont'd…
[Figure: hit/miss bar charts of the previous data, given the observations (left; category totals 112, 37, 197) and given the forecasts (right; totals 96, 94, 156)]
The previous data transformed into hit/miss bar charts, either given the observations (left) or given the forecasts (right). The green, yellow and red bars denote correct forecasts and one- and two-category errors, respectively. The U-shape in the observations is clearly visible (left), whereas there is no hint of it in the forecast distribution (right).
Multi-category Events
Example from Finland, again !
[Figure]
Categorical (binary, multi-category) events - Summary:
• Verify a comprehensive set of categorical local weather events
• Compile relevant contingency tables
• Include multi-category events
• Focus on adverse and/or extreme local weather
• "Stratify & Aggregate" + compute B, (PC), POD & FAR, (F), (PAG), KSS, TS, ETS, HSS
• Additionally, compute OR, ORSS, ROC
Examples 5 - 6 in the General Guide to Verification (NOMEK Training)
Examples:
- Rain (vs. no rain); with various rainfall thresholds
- Snowfall; with various thresholds
- Strong winds (vs. no strong wind); with various wind force thresholds
- Night frost (vs. no frost)
- Fog (vs. no fog)
Outline:
1. Introduction - History
2. Goals and general guidelines
3. Continuous variables
4. Categorical events
   • Binary (dichotomous; yes/no) forecasts
   • Multi-category forecasts
5. Probability forecasts
Why Probability Forecasts ?
"… the widespread practice of ignoring uncertainty when formulating and communicating forecasts represents an extreme form of inconsistency and generally results in the largest possible reductions in quality and value."
- Allan Murphy (1993)
( A sophisticated, indirect phrase to emphasize the importance of addressing uncertainty… )
Why Probability Forecasts ?
"… Go look at the weather, I believe it's gonna rain"
- Legendary Chicago blues artist Muddy Waters (early 1960s), singing "Clouds in My Heart"
( A simple, direct phrase to emphasize uncertainty in everyday life… )
Probability Forecasts
• All forecasting involves some level of uncertainty
• Deterministic forecasts cannot address the inherent uncertainty of the weather parameter or event
• Probabilities of the expected event (with values between 0 % and 100 %, or 0 and 1) take into account the underlying joint distribution { p( f, x ) } between forecasts and observations
• Conversion of probability forecasts to categorical events is simple (but not necessarily advisable) by defining an "on/off" threshold; the reverse is not straightforward
• Verification is somewhat laborious => large datasets are required to obtain any significant information
Reliability Diagram: Preparation
• Stratify probability forecasts and observations into deciles
• For each decile, record the observed frequency of the event
• Keep track of the number of cases in each decile "bin"
• Plot observed relative frequency against forecast probability on a diagram
• Plot an additional histogram of the number of forecast cases in each bin => Sharpness Diagram

FC Probability   FCs      Events   Non-Events   Obs. Relative
Bin              in Bin   in Bin   in Bin       Frequency (%)
  0                65        2        63            3
 10                23        3        20           13
 20                26        6        20           23
 30                25       10        15           40
 40                25       10        15           40
 50                20       10        10           50
 60                35       20        15           57
 70                30       20        10           67
 80                35       25       10            71
 90                12       10         2           83
100                 4        4         0          100
Total             300      120       180

[Figure: reliability diagram - observed relative frequency (%) vs. forecast probability (%)]
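The binning step above can be sketched as follows (hypothetical data; `reliability_table` is an assumed helper name):

```python
def reliability_table(probs, outcomes, bin_width=10):
    """Bin probability forecasts (in %) into deciles and tabulate, per bin,
    the number of forecasts, observed events, and observed relative frequency."""
    bins = {}
    for p, o in zip(probs, outcomes):
        b = bin_width * round(p / bin_width)   # nearest bin centre
        cases, events = bins.get(b, (0, 0))
        bins[b] = (cases + 1, events + o)
    return {b: (cases, events, 100.0 * events / cases)
            for b, (cases, events) in sorted(bins.items())}

# Hypothetical forecast probabilities (%) and binary outcomes (1 = event observed)
table = reliability_table([0, 10, 10, 90, 90, 90, 100], [0, 0, 1, 1, 1, 0, 1])
```

Plotting each bin's observed relative frequency (third element) against the bin centre, with the case counts as a histogram, gives the reliability and sharpness diagrams described above.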
Reliability Diagram: Interpretation

Reliability
- Curve above the 45-degree line => underforecasting
- Curve below the 45-degree line => overforecasting
- Analogous to bias
- One of the components of the Brier score (see later)

Sharpness (Resolution)
- U-shaped histogram is best
- Gaussian-like distribution is worst
- A measure of spread (variance) in the distribution of forecasts

[Figure: reliability diagram - observed relative frequency (%) vs. forecast probability (%)]
[Figure (from Wilks, 1995): schematic reliability diagrams illustrating - climatology; minimal resolution; underforecasting; good resolution at the expense of reliability; reliable forecasts of a rare event; small sample size]
Probability Forecasts: Measures

Brier Score

BS = ( 1/n ) Σ ( p_i - o_i )²      Range: 0 to 1; Perfect score = 0

- The most common accuracy measure for probability forecasts; note: o_i is binary (0 or 1) !!!
- Analogous to MSE in probability space; negatively oriented, i.e. smaller is better
- A quadratic scoring rule; very sensitive to large forecast errors !!! Be careful with limited datasets
- For two categories only; for multiple categories, see RPS …
- Strongly influenced by the climatological frequency of the verification sample => different samples are not to be compared
- BS can be algebraically decomposed into Reliability, Resolution and Uncertainty

Brier Skill Score

BSS = [ 1 - BS / BS_ref ] * 100      Range: -∞ to 100; Perfect score = 100

BS_ref = ( 1/n ) Σ ( ref_i - o_i )²

where ref_i is either the climatological relative frequency of the event, or persistence.
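A minimal sketch of BS and BSS with hypothetical rain probabilities (not lecture data), using a constant climatological frequency of 0.3 as the reference:

```python
def brier(probs, outcomes):
    """Brier Score: mean squared error of probability forecasts (0..1)
    against binary outcomes (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill(probs, outcomes, ref_probs):
    """Brier Skill Score (%) against a reference forecast, e.g. climatology."""
    return (1 - brier(probs, outcomes) / brier(ref_probs, outcomes)) * 100

# Hypothetical rain probabilities vs. outcomes; climatology = 0.3 every day
fc   = [0.9, 0.1, 0.8, 0.2, 0.7]
obs  = [1, 0, 1, 0, 0]
clim = [0.3] * 5
bs  = brier(fc, obs)                   # (0.01 + 0.01 + 0.04 + 0.04 + 0.49) / 5
bss = brier_skill(fc, obs, clim)
```

Note how the single bad case (0.7 issued, no rain) contributes 0.49 of the total 0.59 squared error: the quadratic penalty dominates, as warned above.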
15.4.2005NOMEK - Verification Training - OSLO / [email protected]
Probability Forecasts: Measures

Ranked Probability Score

RPS = ( 1/(k–1) ) Σ_m [ ( Σ_{i=1..m} p_i ) – ( Σ_{i=1..m} o_i ) ]² ,  m = 1 … k

where k is the number of probability categories

Range: 0 to 1; perfect score = 0

Ranked Probability Skill Score

RPSS = [ 1 – RPS / RPS_ref ] * 100

– A vector generalization of BS and BSS to multi-event or multi-category situations
– Measures the sum of squared differences in cumulative probability space
– A quadratic score – penalizes most severely when forecast probabilities are far from the actually observed distributions
– As with BSS, RPSS is very sensitive to the size of the dataset

Range: –∞ to 100; perfect score = 100
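The cumulative-probability definition of the RPS translates directly into code. A minimal sketch with hypothetical three-category data (not from the training material), assuming NumPy:

```python
# Sketch of the ranked probability score over k ordered categories.
import numpy as np

def rps(p, o):
    """RPS = (1/(k-1)) * sum_m [ (sum_{i<=m} p_i) - (sum_{i<=m} o_i) ]^2

    p : forecast probabilities over k ordered categories (sums to 1)
    o : one-hot observation vector over the same k categories
    """
    p, o = np.asarray(p, float), np.asarray(o, float)
    k = p.size
    # Differences are taken in cumulative probability space.
    return float(np.sum((np.cumsum(p) - np.cumsum(o)) ** 2) / (k - 1))

# Three precipitation categories; the event fell in the middle one.
forecast = np.array([0.2, 0.5, 0.3])
observed = np.array([0.0, 1.0, 0.0])
score = rps(forecast, observed)   # 0 would be perfect
```

For k = 2 this reduces to the Brier score, which is the sense in which RPS generalizes BS.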
Probability Forecasts – Signal Detection Theory:
ROC – Relative Operating Characteristic

• Determines the ability of a forecasting system to separate situations where a signal is present (e.g. occurrence of rain) from those where it is absent (noise)
• In other words: assesses the performance of a forecasting system in discriminating between occurrence (Yes) and non-occurrence (No) of an event
• Tests e.g. model performance relative to a specific threshold
• Applicable to two-category probability forecasts and also to categorical deterministic forecasts, i.e. allows their comparison
• Has gained increasing popularity in meteorological forecast verification in recent years
( Known as Receiver Operating Characteristic in the medical sciences )
Signal Detection Theory: ROC Curve

• A graphical representation, in a unit square, of the Hit Rate (H, y-axis) against the False Alarm Rate (F, x-axis) for different potential decision thresholds
• The curve is plotted from a "binned" set of probability forecasts by stepping (or sliding) a decision threshold (e.g. at 10% probability intervals) through the forecasts, each probability decision threshold generating a separate 2*2 contingency table
– The probability forecast is thereby transformed into a set of categorical "yes/no" forecasts
– A set of value pairs of H and F is obtained, forming the curve
• It is desirable that H be high and F low, i.e. the closer a point is to the upper left-hand corner, the better the forecast
• A perfect forecast system, with only correct forecasts and no false alarms (regardless of the threshold chosen), has a "curve" that rises from (0,0) (H=F=0) along the y-axis to the upper left-hand corner (0,1) (H=1, F=0) and then runs straight to (1,1) (H=F=1)

                    Event observed
Forecast          Yes      No       Marginal total
Yes                a        b        a + b
No                 c        d        c + d
Marginal total   a + c    b + d     a + b + c + d = n

H = a / ( a + c )
F = b / ( b + d )
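The threshold-sliding procedure described above can be sketched as follows: each probability threshold turns the forecasts into yes/no forecasts, yielding one 2*2 table and hence one (F, H) pair. The data are illustrative, not from the training material:

```python
# Sketch: generate ROC points from probability forecasts by sliding a
# decision threshold; H = a/(a+c), F = b/(b+d) per the contingency table.
import numpy as np

def roc_points(probs, obs, thresholds):
    """Return a list of (F, H) pairs, one per decision threshold."""
    probs = np.asarray(probs, float)
    obs = np.asarray(obs, int)
    pts = []
    for t in thresholds:
        fc_yes = probs >= t
        a = np.sum(fc_yes & (obs == 1))    # hits
        b = np.sum(fc_yes & (obs == 0))    # false alarms
        c = np.sum(~fc_yes & (obs == 1))   # misses
        d = np.sum(~fc_yes & (obs == 0))   # correct rejections
        pts.append((b / (b + d), a / (a + c)))
    return pts

# Toy forecasts with real discriminating ability.
rng = np.random.default_rng(1)
p = rng.uniform(size=500)
y = (rng.uniform(size=500) < p).astype(int)
points = roc_points(p, y, thresholds=np.arange(0.1, 1.0, 0.1))
```

As the threshold rises, both H and F can only fall, which is why the points trace a curve from near (1,1) towards (0,0).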
[Figure: ROC curve – H (y-axis) vs F (x-axis), with points labelled by probability decision thresholds 10% … 90%.]
ROC Curve Generation

                    Event observed
Forecast          Yes      No       Marginal total
Yes                a        b        a + b
No                 c        d        c + d
Marginal total   a + c    b + d     a + b + c + d = n
[Figure: ROC curve – H vs F, with points labelled by probability decision thresholds 10% … 90%.]

To learn more about ROC and Signal Detection Theory, check:
http://wise.cgu.edu/
H = a / ( a + c )
F = b / ( b + d )
Example

Probability    # of           Cumulative     # of non-      Cumulative non-    H     F
threshold      occurrences    occurrences    occurrences    occurrences       (%)   (%)
                  (a)            (Σa)           (b)            (Σb)
 0 –  9           43            1920           613            5351            100   100
10 – 19          172            1877          1389            4738             98    89
20 – 29          283            1705          1183            3349             89    63
30 – 39          350            1422           936            2166             74    40
40 – 49          323            1072           602            1230             56    23
50 – 59          287             749           327             628             39    12
60 – 69          169             462           151             301             24     6
70 – 79          163             293            88             150             15     3
80 – 89           89             130            40              62              7     1
90 – 99           41              41            22              22              2     0

a + c = 1920     b + d = 5351
Signal Detection Theory: ROC Area (ROCA)

• The area under the ROC curve
• A relative index and a widely used summary measure
• Decreases from 1 as the curve moves down from the ideal top-left corner
• A useless forecast system lies along the diagonal, where H = F and the area = 0.5; such a system cannot discriminate between occurrences and non-occurrences of the event

Range: 0 to 1; perfect system = 1

ROCA-based skill score:

ROC_SS = 2 * ROCA – 1

• Negative below the diagonal
• At its minimum, ROC_SS = –1, when ROCA = 0

Range: –1 to 1; perfect score = 1

• ROC is also applicable to deterministic categorical forecasts
– ROC_SS then translates into the KSS (= H – F)
– With only one single decision threshold, only a single ROC point results; typically this lies "inside" the ROC area, i.e. indicates worse quality
• ROC, ROCA and ROC_SS are directly related to a decision-theoretic approach
– Can be related to the economic value of probability forecasts to end users
– Allows for assessment of the costs of false alarms
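ROCA can be approximated by trapezoidal integration of the (F, H) points, with ROC_SS = 2 * ROCA – 1 following directly. A minimal sketch with hypothetical curve points (not from the training material):

```python
# Sketch: area under a ROC curve by trapezoidal integration of (F, H)
# points, with the curve anchored at (0,0) and (1,1).
import numpy as np

def roc_area(F, H):
    """Trapezoidal ROCA estimate from (F, H) value pairs."""
    F = np.concatenate(([0.0], np.asarray(F, float), [1.0]))
    H = np.concatenate(([0.0], np.asarray(H, float), [1.0]))
    order = np.argsort(F)          # integrate left to right along F
    F, H = F[order], H[order]
    return float(np.sum(0.5 * (H[1:] + H[:-1]) * np.diff(F)))

# Illustrative (F, H) pairs from three decision thresholds.
F = [0.1, 0.3, 0.6]
H = [0.5, 0.8, 0.95]
roca = roc_area(F, H)
roc_ss = 2.0 * roca - 1.0   # 0 on the no-skill diagonal, 1 for perfection
```

A single categorical forecast contributes just one interior point, and the trapezoid through it reproduces ROC_SS = H – F, i.e. the KSS noted above.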
Probability Forecasts – Example 7

Reliability (left) and ROC (right) diagrams of one year of PoP forecasts. (The data are the same as earlier, where PoPs were transformed into categorical yes/no forecasts by using 50% as the "on/off" threshold.) The inset box in the reliability diagram shows the frequency of use of the various forecast probabilities (sharpness), and the horizontal dotted line the climatological event probability. The reliability curve (open circles) indicates a strong over-forecasting bias throughout the probability range. This appears to be a common feature at this particular location, as indicated by the qualitatively similar 10-year average reliability curve (dashed line). Brier skill scores (BSS) are computed against two reference forecast systems; of these, climatology proves a much stronger "no-skill opponent" than persistence. The ROC curve (right) is constructed from the forecast and observed probabilities, the different potential decision thresholds yielding the respective value pairs of H and F. ROCA and ROC_SS values are also shown. The black dot represents the single-value ROC point from the categorical binary case of Example 5 (Slide #39) (H = POD = 0.7; F = 0.17).
ROC Curve: Probability FC Scheme, Vers. 1

[Figure: ROC curve (H vs F) for version 1 of the probability forecast scheme, with decision thresholds 10% … 90% and a point labelled D.]
ROC Curve: Probability FC Scheme, Vers. 2

[Figure: ROC curve (H vs F) for version 2 of the probability forecast scheme, with decision thresholds 5% … 90% and a point labelled D.]
Probability Forecasts – Summary:

• Verify a comprehensive set of probability forecasts, focusing on adverse and/or extreme weather
• Produce reliability diagrams, including the sharpness distribution
• Compute BS, BSS; for multi-category events RPS, RPSS
• Produce ROC diagrams, ROCA, ROC_SS
• See Example 7 in the General Guide to Verification (NOMEK Training)

BS:
• Based on squared error
• Decompositions provide insight into several performance attributes (not discussed here)
• Dependent on the frequency of occurrence of the event

ROC:
• Considers the forecasts' ability to discriminate between Yes and No events
• Provides verification information for individual decision thresholds
• Less dependent on the frequency of occurrence of the event
FINAL reminder on verification:
An act (“art”?) of countless methods and measures
An essential daily real-time practice in the operational forecasting environment
An active feedback and dialogue process is a necessity
A fundamental means to improve weather forecasts and services
References: Literature

Bougeault, P., 2003: WGNE recommendations on verification methods for numerical prediction of weather elements and severe weather events. CAS/JSC WGNE Report No. 18.
Cherubini, T., A. Ghelli and F. Lalaurette, 2001: Verification of precipitation forecasts over the Alpine region using a high-density observing network. ECMWF Tech. Mem. 340, 18 pp.
Grazzini, F. and A. Persson, 2003: User Guide to ECMWF Forecast Products. ECMWF Met. Bull., M3.2.
Jolliffe, I.T. and D.B. Stephenson, 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley.
Murphy, A.H. and R.L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.
Stanski, H.R., L.J. Wilson and W.R. Burrows, 1989: Survey of Common Verification Methods in Meteorology. WMO Research Report No. 89-5.
Stephenson, D.B., 2000: Use of the "odds ratio" for diagnosing forecast skill. Weather and Forecasting, 15, 221-232.
Thornes, J.E. and D.B. Stephenson, 2001: How to judge the quality and value of weather forecast products. Meteorol. Appl., 8, 307-314.
Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction (Chapter 7: Forecast Verification). Academic Press.
Proceedings, Making Verification More Meaningful, Boulder, 30 July - 1 August 2002.
Proceedings, SRNWP Mesoscale Verification Workshop, De Bilt, 2001.
Proceedings, WMO/WWRP International Conference on Quantitative Precipitation Forecasting, Vols. 1 and 2, Reading, 2 - 6 September 2002.
References: Websites

http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.html
http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/jwgv/jwgv.html - WMO/WWRP/WGNE Working Group on Verification websites
http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/Workshop2004/home.html - International Verification Methods Workshop (Montreal, 2004)
http://www.rap.ucar.edu/research/verification/ver_wkshp1.html - Making Verification More Meaningful Workshop (Boulder, 2002)
http://www.chmi.cz/meteo/ov/wmo - WMO/WWRP Workshop on the Verification of QPF (Prague, 2001)
http://hirlam.knmi.nl/open/srnwp/ - EUMETNET/SRNWP Mesoscale Verification Workshops (De Bilt, 2004)
http://www.sec.noaa.gov/forecast_verification/Glossary.html - NOAA/SEC Glossary of verification terms
http://nws.noaa.gov/tdl/verif - NOAA MOS verification website
http://wwwt.emc.ncep.noaa.gov/gmb/ens/verif.html - NOAA EPS Verification website
http://www.wmo.ch/web/www/DPS/SVS-for-LRF.html - WMO/CBS Standardised Verification System for Long-Range Forecasts
http://www.ecmwf.int/products/forecasts/d/charts/medium/verification - Verification of ECMWF Forecasting System
http://www.ecmwf.int/products/forecasts/guide - User Guide to ECMWF Forecast Products