Evaluation of Weather and Climate Forecasts: A 2016 perspective
Barbara Brown and Eric Gilleland
NCAR, Boulder, Colorado, USA
AMS Annual Meeting, 14 January 2016
Goals
To understand where we are going, it's helpful to understand where we have been…
• Examine the evolution of verification over the last few decades
§ Where have we come from?
§ Where are we going?
§ Raise questions regarding verification practices along the way…
• What is left to do?
§ What are the opportunities?
§ What are the problems/impediments?
Early verification
• The Finley period… (see Murphy's paper on "The Finley Affair"; WAF, 11, 1996)
• Focused on contingency table statistics
• Development of many of the common measures still used today (a computational sketch follows the table below):
§ Gilbert (ETS)
§ Peirce (Hanssen-Kuipers)
§ Heidke
§ Etc…
• These methods are still the backbone of many verification efforts – for warnings and events
§ Are they used too often?
§ Should everything be categorized? What is lost by categorizing things into Yes/No?
§ How can this categorization be made more user-friendly/relevant?
                  Observed
Forecast      Yes             No
Yes           hits            false alarms
No            misses          correct negatives
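A minimal sketch (not from the slides) of how these classic contingency-table measures are computed, in Python, using as sample input Finley's 1884 tornado forecasts discussed in Murphy's "Finley Affair" paper:

```python
def contingency_scores(hits, false_alarms, misses, correct_negatives):
    """Classic verification measures from a 2x2 contingency table."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    n = a + b + c + d
    pod = a / (a + c)                        # probability of detection (hit rate)
    far = b / (a + b)                        # false alarm ratio
    freq_bias = (a + b) / (a + c)            # frequency bias
    peirce = a / (a + c) - b / (b + d)       # Peirce / Hanssen-Kuipers skill score
    heidke = 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
    a_ref = (a + b) * (a + c) / n            # hits expected by chance
    gilbert = (a - a_ref) / (a + b + c - a_ref)  # Gilbert skill score (ETS)
    return {"POD": pod, "FAR": far, "freq bias": freq_bias,
            "Peirce": peirce, "Heidke": heidke, "Gilbert (ETS)": gilbert}

# Finley's tornado forecasts (Murphy 1996, WAF)
print(contingency_scores(hits=28, false_alarms=72, misses=23, correct_negatives=2680))
```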
Early years continued: Continuous measures
• Focus on squared-error statistics (a short computational sketch follows this list)
§ Mean-squared error
§ Correlation
§ Bias
§ Note: Little recognition before Murphy of the non-independence of these measures
• Extension to probabilistic forecasts
§ Brier Score (1950) – well before the prevalence of probability forecasts!
• Development of "NWP" measures
§ S1 score
§ Anomaly correlation
§ Still relied on for monitoring and comparing performance of NWP systems (Are these still the best measures for this purpose?)
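A minimal sketch (illustrative only, with made-up data) of the basic continuous measures and the Brier score:

```python
import numpy as np

def basic_scores(forecast, observed):
    """Bias (mean error), MSE, and correlation for a continuous forecast."""
    err = forecast - observed
    return {"bias (ME)": err.mean(),
            "MSE": (err ** 2).mean(),
            "correlation": np.corrcoef(forecast, observed)[0, 1]}

def brier_score(prob_forecast, event_occurred):
    """Brier (1950) score for probability forecasts of a binary event."""
    o = np.asarray(event_occurred, dtype=float)   # 1 if the event occurred, else 0
    return ((np.asarray(prob_forecast) - o) ** 2).mean()

rng = np.random.default_rng(0)
obs = rng.normal(10.0, 3.0, 500)
fcst = obs + rng.normal(0.5, 2.0, 500)            # noisy forecast with +0.5 bias
print(basic_scores(fcst, obs))
print(brier_score([0.9, 0.1, 0.7, 0.3], [1, 0, 1, 1]))
```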
The "Renaissance": The Allan Murphy era
• Expanded methods for probabilistic forecasts
§ Decompositions of scores led to more meaningful interpretations of verification results
§ Attributes diagram
• Initiation of ideas of meta-verification: Equitability, Propriety
• Statistical framework for forecast verification
§ Clarified multiple perspectives for evaluating forecasts (joint distribution factorizations; written out below)
§ Placed verification in a statistical context, where it belongs
• Connections between forecast "quality" and "value"
§ Evaluation of cost-loss decision-making situations in the context of improved forecast quality
§ Non-linear nature of quality-value relationships
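For reference (standard results from Murphy's work, not spelled out on the slides): the joint distribution of forecast f and observation x factors in two ways (Murphy and Winkler 1987), and the Brier score decomposes into reliability, resolution, and uncertainty terms (Murphy 1973), the basis of the attributes diagram:

```latex
p(f, x) = p(x \mid f)\, p(f)   % calibration-refinement factorization
        = p(f \mid x)\, p(x)   % likelihood-base rate factorization

% For K distinct forecast probabilities p_k, each issued n_k times, with
% conditional event frequency \bar{o}_k and overall base rate \bar{o}:
\mathrm{BS} \;=\; \underbrace{\tfrac{1}{N}\textstyle\sum_{k=1}^{K} n_k (p_k - \bar{o}_k)^2}_{\text{reliability}}
\;-\; \underbrace{\tfrac{1}{N}\textstyle\sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2}_{\text{resolution}}
\;+\; \underbrace{\bar{o}(1 - \bar{o})}_{\text{uncertainty}}
```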
"Forecasts contain no intrinsic value. They acquire value through their ability to influence the decisions made by users of the forecasts."

"Forecast quality is inherently multifaceted in nature… however, forecast verification has tended to focus on one or two aspects of overall forecasting performance such as accuracy and skill."

Allan H. Murphy, Weather and Forecasting, 8, 1993: "What is a good forecast? An essay on the nature of goodness in forecasting"
Murphy era cont.: Development of the idea of "diagnostic" verification
• Also called "distribution-oriented" verification
• Focus on measuring or representing attributes of performance rather than relying on summary measures
• A revolutionary idea: Instead of relying on a single measure of "overall" performance, ask questions about performance and measure attributes that are able to answer those questions
• Example: Use of conditional quantile plots to examine conditional biases in forecasts (a sketch of how such a plot is built follows)
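A minimal sketch (assumed, not from the talk) of the computation behind a conditional quantile plot: bin the forecasts, then compute quantiles of the observations conditional on each bin; for an unbiased forecast the conditional median tracks the 1:1 line.

```python
import numpy as np

def conditional_quantiles(forecast, observed, n_bins=10, qs=(0.25, 0.5, 0.75)):
    """Quantiles of the observations, conditional on binned forecast values."""
    edges = np.quantile(forecast, np.linspace(0.0, 1.0, n_bins + 1))
    centers, rows = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (forecast >= lo) & (forecast <= hi)
        if in_bin.sum() < 5:                 # skip sparsely populated bins
            continue
        centers.append(0.5 * (lo + hi))
        rows.append(np.quantile(observed[in_bin], qs))
    return np.array(centers), np.array(rows)

rng = np.random.default_rng(1)
fcst = rng.gamma(2.0, 2.0, 2000)
obs = fcst + rng.normal(0.0, 1.0, 2000)      # synthetic forecast/obs pairs
centers, cond_q = conditional_quantiles(fcst, obs)
# Plot each column of cond_q against centers to build the diagram.
```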
The "Modern" era
• New focus on evaluation of ensemble forecasts
§ Development of new methods specific to ensembles (rank histogram, CRPS); a rank-histogram sketch follows this list
• Greater understanding of limitations of methods
§ "Meta" verification
• Evaluation of sampling uncertainty in verification measures
• Approaches to evaluate multiple attributes simultaneously (note: this is actually an extension of Murphy's attributes diagram idea to other types of measures)
§ Ex: Performance diagrams, Taylor diagrams
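A minimal sketch (illustrative, not from the slides) of the rank histogram: for each case, count how many ensemble members fall below the verifying observation; over many cases a statistically reliable ensemble yields a flat histogram, while a U shape indicates under-dispersion.

```python
import numpy as np

def rank_histogram(ensemble, observed):
    """Counts of the observation's rank among the ensemble members.

    ensemble: shape (n_cases, n_members); observed: shape (n_cases,)
    """
    _, n_members = ensemble.shape
    ranks = (ensemble < observed[:, None]).sum(axis=1)   # 0 .. n_members
    return np.bincount(ranks, minlength=n_members + 1)

rng = np.random.default_rng(2)
ens = rng.normal(0.0, 0.5, (1000, 20))   # under-dispersive: spread too small
obs = rng.normal(0.0, 1.0, 1000)
print(rank_histogram(ens, obs))          # expect counts piled up at both ends
```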
The "Modern" era cont.
• Development of an international verification community
§ Workshops, textbooks…
• Evaluation approaches for special kinds of forecasts
§ Extreme events (Extremal Dependency Scores; one such index is written out below)
§ "NWP" measures
• Extension of diagnostic verification ideas
§ Spatial verification methods
§ Feature-based evaluations (e.g., of time series)
• Movement toward "user-relevant" approaches
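For reference, one member of this family, the extremal dependence index (EDI) of Ferro and Stephenson (2011), written here (my transcription of their definition) in terms of the hit rate H and false alarm rate F from the contingency table above:

```latex
H = \frac{\text{hits}}{\text{hits} + \text{misses}}, \qquad
F = \frac{\text{false alarms}}{\text{false alarms} + \text{correct negatives}}, \qquad
\mathrm{EDI} = \frac{\ln F - \ln H}{\ln F + \ln H}
```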
WMO Joint Working Group on Forecast Verification Research
[Figure from Ferro and Stephenson 2011 (Weather and Forecasting)]
Spatial verification methods
Inspired by the limited diagnostic information available from traditional approaches for evaluating NWP predictions:
• Difficult to distinguish differences between forecasts
• The double-penalty problem
§ Forecasts that appear good by the eye test fail by traditional measures… often due to small offsets in spatial location
§ Smoother forecasts often "win" even if less useful
• Traditional scores don't say what went wrong or was good about a forecast
• Many new approaches developed over the last 15 years
• Starting to also be applied in climate model evaluation
New spatial verification approaches
• Neighborhood: successive smoothing of forecasts/obs; gives credit to "close" forecasts (a sketch of one such score follows this list)
• Scale separation: measure scale-dependent error
• Field deformation: measure distortion and displacement (phase error) for the whole field – how should the forecast be adjusted to make the best match with the observed field?
• Object- and feature-based: evaluate properties of identifiable features
http://www.ral.ucar.edu/projects/icp/
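As a concrete neighborhood example (my choice; the slide names no specific score): the Fractions Skill Score of Roberts and Lean (2008), which compares event fractions computed over square neighborhoods of increasing size, so a displaced-but-realistic forecast recovers skill at scales larger than its displacement error.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(forecast, observed, threshold, window):
    """Fractions Skill Score over square neighborhoods of `window` grid points."""
    fbin = (forecast >= threshold).astype(float)
    obin = (observed >= threshold).astype(float)
    pf = uniform_filter(fbin, size=window, mode="constant")  # event fractions
    po = uniform_filter(obin, size=window, mode="constant")
    mse = np.mean((pf - po) ** 2)
    mse_ref = np.mean(pf ** 2) + np.mean(po ** 2)            # worst-case reference
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan

rng = np.random.default_rng(3)
field = rng.gamma(0.5, 2.0, (200, 200))
shifted = np.roll(field, 10, axis=1)    # identical field, displaced 10 points
print([round(fss(shifted, field, 2.0, w), 3) for w in (1, 11, 31)])
```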
The Spatial Prediction Comparison Test (SPCT)
[Figure: maps of "Freq. CAPE > 1000 m/s | q75 > 90th percentile over time (m/s)" from NARR versus the CRCM-ccsm and CRCM-cgcm3 regional climate runs over North America]
[Figure: loss differential field, histogram of the loss differential field with its mean, and empirical vs. model variograms (by separation distance and by direction) for absolute-error loss, CRCM-ccsm vs. CRCM-cgcm3 verified against NARR]
Gilleland (2013, MWR, 141(1), 340–355); Hering and Genton (2011, Technometrics, 53(4), 414–425)
Supplementary material for manuscript in preparation (G. et al.)
(A simplified sketch of the idea behind the test follows.)
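A heavily simplified sketch of the SPCT idea (Hering and Genton 2011; Gilleland 2013): form a loss differential field between two competing forecasts and test whether its mean differs from zero. The real test estimates the variance of that mean from a variogram fitted to the spatially correlated differential field; the naive statistic below assumes independent grid points and is for orientation only.

```python
import numpy as np

def loss_differential(f1, f2, obs, loss=np.abs):
    """D(s) = g(f1(s) - obs(s)) - g(f2(s) - obs(s)); negative values favor f1."""
    return loss(f1 - obs) - loss(f2 - obs)

def naive_mean_test(d):
    """Mean loss differential and a naive z-statistic.

    WARNING: treats grid points as independent; the actual SPCT accounts
    for spatial correlation via a fitted variogram, which typically makes
    the variance of the mean much larger than this estimate.
    """
    d = d.ravel()
    return d.mean(), d.mean() / (d.std(ddof=1) / np.sqrt(d.size))
```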
Meta-evaluation of spatial methods: What are the capabilities of the new methods?
• ICP: http://www.ral.ucar.edu/projects/icp
• MesoVICT incorporates more variables and more realistic forecast verification cases
§ precipitation, wind, pressure, etc.
§ complex terrain
§ cases over multiple time points (and lead times)
§ forecast ensembles
§ ensembles of observations
§ point observations and analysis products
• New spatial verification methods since the ICP
• Evaluation of climate models
§ lack of temporal compatibility in most cases, but not a problem (actually makes things simpler)
§ especially well suited to spatial verification methods
User-relevant verification: The Weather Information Value Chain
[Diagram, courtesy of Jeff Lazo: Monitoring/Observation → Modeling/Forecasting → Dissemination/Communication → Perception/Interpretation → Use/Decision Making → Economic Values, with Metrics/Evaluation applied along the chain]
In recent years, there has been a new focus on user-relevant verification – moving toward Murphy's vision of diagnostic verification and the connections between forecast quality and value.
User-relevant verification: Levels of user relevance
1. Making traditional verification methods useful for a range of users (e.g., a variety of thresholds)
2. Developing and applying specific methods for particular users [Ex: particular statistics; user-relevant variables]
3. Applying meaningful diagnostic (e.g., spatial) methods that are relevant for a particular user's question
4. Connecting economic and other value directly with forecast performance

Most verification studies are at Levels 1 and 2, with some approaching 3 and very few actually at Level 4.
Some examples….
Solar power forecasts: Adapting standard metrics to meet user needs
[Chart: bias and mean error for Comp1–Comp6 and SunCast]

Sum of errors over the forecast period (each entry appears to be the net error, with its negative and positive components in parentheses):

Model          | Intra-Hour        | Day-Ahead
Comp1          | 960 (−76 / 1036)  | 858 (−79 / 937)
Comp2          | 1418 (−43 / 1461) | 1295 (−43 / 1339)
Comp3          | 1433 (−33 / 1466) | 1106 (−5 / 1111)
Comp4          | 510 (−84 / 594)   | n/a
Comp5          | 1748 (−12 / 1760) | n/a
Comp6          | 1004 (−40 / 1044) | n/a
Blended Model  | 226 (−92 / 318)   | 178 (−110 / 288)

User: Energy trader – the sum of errors over the entire day is more informative.
Preferences for end users: raw values (non-normalized); normalized by capacity/clear sky; normalized by actuals.
Courtesy T. Jensen
Capturing Energy Ramps
[Figure, credit T. Jensen: frequency bias of up-ramp events at thresholds from 100 Up to 350 Up, for the baseline, one component of the system, and the blended system, separated into over-forecast and under-forecast cases]

Frequency Bias = (# of forecast events) / (# of observed events)
Applications of object-based approaches
• Example: Evaluation of jet cores, highs, and lows (using the MODE object-based approach) for model acceptance testing
Courtesy Marion Mittermaier, UK Met Office
Comments on user-relevant verification
• Moving toward user-relevant verification will increase both the usefulness and quality of forecasts, and will benefit developers as well as users
• Many of the steps toward user relevance (e.g., user-specified stratifications & thresholds) are easy to achieve
§ Others require major multi-disciplinary efforts
• Verification practitioners – the people who do verification – should endeavor as much as possible to understand the needs of the forecast users
• Much is left to be explored by creative minds!
Challenge: Develop the best new user-relevant verification method
• Sponsored by WMO/WWRP
§ Joint Working Group on Forecast Verification Research
§ High Impact Weather (HIW), Sub-seasonal to Seasonal (S2S), and Polar Prediction (PPP) projects
• Focus
§ All applications of meteorological and hydrological forecasts (including TCs, seasonal and sub-seasonal, polar, …)
§ Metrics can be quantitative scores or diagnostics (e.g., diagrams)
• Criteria for being selected as "best"
§ Originality, user relevance, intuitiveness, simplicity and ease of computing, robustness, and resistance to hedging
§ Desirable characteristics: (i) clear statistical foundation; (ii) applicability to a broader set of problems
Challenge: Develop the best new user-relevant verification method (cont.)
• Dates:
§ Formal announcement: Sept 2015
§ Deadline for submission: 31 Oct 2016
• Prize: Invited keynote talk at the 7th International Verification Methods Workshop in 2017
• Contact [email protected] for more information
• See the website at http://www.wmo.int/pages/prog/arep/wwrp/new/FcstVerChallenge.html
Summary
• Much progress has been made in the last few decades, advancing the capabilities and potential impacts of forecast evaluation
• Many new approaches have been developed, examined, and applied, and are providing opportunities for more meaningful evaluations of both weather and climate forecasts
• But still more challenges ahead…
Remaining challenges (some examples)
• Expansion of user-relevant metrics – moving to Levels 3 and 4, expanding Levels 1 and 2
§ Providing a breadth of information to users
• Sorting out how to incorporate uncertainty appropriately
§ Spatial / temporal
§ Measurement / observation
§ Sampling
• Improving communication
§ How do we communicate forecast quality information to the general public? To specific users?