Evaluation of Weather and Climate Forecasts: A 2016 perspective
Barbara Brown and Eric Gilleland
NCAR, Boulder, Colorado, USA
AMS Annual Meeting, 14 January 2016
Goals
To understand where we are going, it's helpful to understand where we have been…
• Examine the evolution of verification over the last few decades
§ Where have we come from?
§ Where are we going?
§ Raise questions regarding verification practices along the way…
• What is left to do?
§ What are the opportunities?
§ What are the problems/impediments?
Early verification
• The Finley period… (see Murphy's paper on "The Finley Affair"; WAF, 11, 1996)
• Focused on contingency table statistics
• Development of many of the common measures still used today (a computational sketch follows the table below):
§ Gilbert (ETS)
§ Peirce (Hanssen-Kuipers)
§ Heidke
§ Etc…
• These methods are still the backbone of many verification efforts – for warnings and events
§ Are they used too often?
§ Should everything be categorized? What is lost by categorizing things into Yes/No?
§ How can this categorization be made more user-friendly/relevant?
                  Observed
Forecast      Yes             No
Yes           hits            false alarms
No            misses          correct negatives
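A minimal sketch (not from the slides) of how these classic contingency-table measures are computed, in Python, using as sample input Finley's 1884 tornado forecasts discussed in Murphy's "Finley Affair" paper:

```python
def contingency_scores(hits, false_alarms, misses, correct_negatives):
    """Classic verification measures from a 2x2 contingency table."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    n = a + b + c + d
    pod = a / (a + c)                        # probability of detection (hit rate)
    far = b / (a + b)                        # false alarm ratio
    freq_bias = (a + b) / (a + c)            # frequency bias
    peirce = a / (a + c) - b / (b + d)       # Peirce / Hanssen-Kuipers skill score
    heidke = 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
    a_ref = (a + b) * (a + c) / n            # hits expected by chance
    gilbert = (a - a_ref) / (a + b + c - a_ref)  # Gilbert skill score (ETS)
    return {"POD": pod, "FAR": far, "freq bias": freq_bias,
            "Peirce": peirce, "Heidke": heidke, "Gilbert (ETS)": gilbert}

# Finley's tornado forecasts (Murphy 1996, WAF)
print(contingency_scores(hits=28, false_alarms=72, misses=23, correct_negatives=2680))
```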
Early years continued: Continuous measures
• Focus on squared-error statistics (a short computational sketch follows this list)
§ Mean-squared error
§ Correlation
§ Bias
§ Note: Little recognition before Murphy of the non-independence of these measures
• Extension to probabilistic forecasts
§ Brier Score (1950) – well before the prevalence of probability forecasts!
• Development of "NWP" measures
§ S1 score
§ Anomaly correlation
§ Still relied on for monitoring and comparing performance of NWP systems (Are these still the best measures for this purpose?)
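A minimal sketch (illustrative only, with made-up data) of the basic continuous measures and the Brier score:

```python
import numpy as np

def basic_scores(forecast, observed):
    """Bias (mean error), MSE, and correlation for a continuous forecast."""
    err = forecast - observed
    return {"bias (ME)": err.mean(),
            "MSE": (err ** 2).mean(),
            "correlation": np.corrcoef(forecast, observed)[0, 1]}

def brier_score(prob_forecast, event_occurred):
    """Brier (1950) score for probability forecasts of a binary event."""
    o = np.asarray(event_occurred, dtype=float)   # 1 if the event occurred, else 0
    return ((np.asarray(prob_forecast) - o) ** 2).mean()

rng = np.random.default_rng(0)
obs = rng.normal(10.0, 3.0, 500)
fcst = obs + rng.normal(0.5, 2.0, 500)            # noisy forecast with +0.5 bias
print(basic_scores(fcst, obs))
print(brier_score([0.9, 0.1, 0.7, 0.3], [1, 0, 1, 1]))
```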
The "Renaissance": The Allan Murphy era
• Expanded methods for probabilistic forecasts
§ Decompositions of scores led to more meaningful interpretations of verification results
§ Attributes diagram
• Initiation of ideas of meta-verification: Equitability, Propriety
• Statistical framework for forecast verification
§ Clarified multiple perspectives for evaluating forecasts (joint distribution factorizations; written out below)
§ Placed verification in a statistical context, where it belongs
• Connections between forecast "quality" and "value"
§ Evaluation of cost-loss decision-making situations in the context of improved forecast quality
§ Non-linear nature of quality-value relationships
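For reference (standard results from Murphy's work, not spelled out on the slides): the joint distribution of forecast f and observation x factors in two ways (Murphy and Winkler 1987), and the Brier score decomposes into reliability, resolution, and uncertainty terms (Murphy 1973), the basis of the attributes diagram:

```latex
p(f, x) = p(x \mid f)\, p(f)   % calibration-refinement factorization
        = p(f \mid x)\, p(x)   % likelihood-base rate factorization

% For K distinct forecast probabilities p_k, each issued n_k times, with
% conditional event frequency \bar{o}_k and overall base rate \bar{o}:
\mathrm{BS} \;=\; \underbrace{\tfrac{1}{N}\textstyle\sum_{k=1}^{K} n_k (p_k - \bar{o}_k)^2}_{\text{reliability}}
\;-\; \underbrace{\tfrac{1}{N}\textstyle\sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2}_{\text{resolution}}
\;+\; \underbrace{\bar{o}(1 - \bar{o})}_{\text{uncertainty}}
```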
"Forecasts contain no intrinsic value. They acquire value through their ability to influence the decisions made by users of the forecasts."

"Forecast quality is inherently multifaceted in nature… however, forecast verification has tended to focus on one or two aspects of overall forecasting performance such as accuracy and skill."

Allan H. Murphy, Weather and Forecasting, 8, 1993: "What is a good forecast? An essay on the nature of goodness in forecasting"
Murphy era cont.: Development of the idea of "diagnostic" verification
• Also called "distribution-oriented" verification
• Focus on measuring or representing attributes of performance rather than relying on summary measures
• A revolutionary idea: Instead of relying on a single measure of "overall" performance, ask questions about performance and measure attributes that are able to answer those questions
• Example: Use of conditional quantile plots to examine conditional biases in forecasts (a sketch of how such a plot is built follows)
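A minimal sketch (assumed, not from the talk) of the computation behind a conditional quantile plot: bin the forecasts, then compute quantiles of the observations conditional on each bin; for an unbiased forecast the conditional median tracks the 1:1 line.

```python
import numpy as np

def conditional_quantiles(forecast, observed, n_bins=10, qs=(0.25, 0.5, 0.75)):
    """Quantiles of the observations, conditional on binned forecast values."""
    edges = np.quantile(forecast, np.linspace(0.0, 1.0, n_bins + 1))
    centers, rows = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (forecast >= lo) & (forecast <= hi)
        if in_bin.sum() < 5:                 # skip sparsely populated bins
            continue
        centers.append(0.5 * (lo + hi))
        rows.append(np.quantile(observed[in_bin], qs))
    return np.array(centers), np.array(rows)

rng = np.random.default_rng(1)
fcst = rng.gamma(2.0, 2.0, 2000)
obs = fcst + rng.normal(0.0, 1.0, 2000)      # synthetic forecast/obs pairs
centers, cond_q = conditional_quantiles(fcst, obs)
# Plot each column of cond_q against centers to build the diagram.
```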
The "Modern" era
• New focus on evaluation of ensemble forecasts
§ Development of new methods specific to ensembles (rank histogram, CRPS); a rank-histogram sketch follows this list
• Greater understanding of limitations of methods
§ "Meta" verification
• Evaluation of sampling uncertainty in verification measures
• Approaches to evaluate multiple attributes simultaneously (note: this is actually an extension of Murphy's attributes diagram idea to other types of measures)
§ Ex: Performance diagrams, Taylor diagrams
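A minimal sketch (illustrative, not from the slides) of the rank histogram: for each case, count how many ensemble members fall below the verifying observation; over many cases a statistically reliable ensemble yields a flat histogram, while a U shape indicates under-dispersion.

```python
import numpy as np

def rank_histogram(ensemble, observed):
    """Counts of the observation's rank among the ensemble members.

    ensemble: shape (n_cases, n_members); observed: shape (n_cases,)
    """
    _, n_members = ensemble.shape
    ranks = (ensemble < observed[:, None]).sum(axis=1)   # 0 .. n_members
    return np.bincount(ranks, minlength=n_members + 1)

rng = np.random.default_rng(2)
ens = rng.normal(0.0, 0.5, (1000, 20))   # under-dispersive: spread too small
obs = rng.normal(0.0, 1.0, 1000)
print(rank_histogram(ens, obs))          # expect counts piled up at both ends
```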
The "Modern" era cont.
• Development of an international verification community
§ Workshops, textbooks…
• Evaluation approaches for special kinds of forecasts
§ Extreme events (Extremal Dependency Scores; one such index is written out below)
§ "NWP" measures
• Extension of diagnostic verification ideas
§ Spatial verification methods
§ Feature-based evaluations (e.g., of time series)
• Movement toward "user-relevant" approaches
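For reference, one member of this family, the extremal dependence index (EDI) of Ferro and Stephenson (2011), written here (my transcription of their definition) in terms of the hit rate H and false alarm rate F from the contingency table above:

```latex
H = \frac{\text{hits}}{\text{hits} + \text{misses}}, \qquad
F = \frac{\text{false alarms}}{\text{false alarms} + \text{correct negatives}}, \qquad
\mathrm{EDI} = \frac{\ln F - \ln H}{\ln F + \ln H}
```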
WMO Joint Working Group on Forecast Verification Research
[Figure from Ferro and Stephenson 2011 (Weather and Forecasting)]
Spatial verification methods
Inspired by the limited diagnostic information available from traditional approaches for evaluating NWP predictions:
• Difficult to distinguish differences between forecasts
• The double-penalty problem
§ Forecasts that appear good by the eye test fail by traditional measures… often due to small offsets in spatial location
§ Smoother forecasts often "win" even if less useful
• Traditional scores don't say what went wrong or was good about a forecast
• Many new approaches developed over the last 15 years
• Starting to also be applied in climate model evaluation
New spatial verification approaches
• Neighborhood: successive smoothing of forecasts/obs; gives credit to "close" forecasts (a sketch of one such score follows this list)
• Scale separation: measure scale-dependent error
• Field deformation: measure distortion and displacement (phase error) for the whole field – how should the forecast be adjusted to make the best match with the observed field?
• Object- and feature-based: evaluate properties of identifiable features
http://www.ral.ucar.edu/projects/icp/
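As a concrete neighborhood example (my choice; the slide names no specific score): the Fractions Skill Score of Roberts and Lean (2008), which compares event fractions computed over square neighborhoods of increasing size, so a displaced-but-realistic forecast recovers skill at scales larger than its displacement error.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(forecast, observed, threshold, window):
    """Fractions Skill Score over square neighborhoods of `window` grid points."""
    fbin = (forecast >= threshold).astype(float)
    obin = (observed >= threshold).astype(float)
    pf = uniform_filter(fbin, size=window, mode="constant")  # event fractions
    po = uniform_filter(obin, size=window, mode="constant")
    mse = np.mean((pf - po) ** 2)
    mse_ref = np.mean(pf ** 2) + np.mean(po ** 2)            # worst-case reference
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan

rng = np.random.default_rng(3)
field = rng.gamma(0.5, 2.0, (200, 200))
shifted = np.roll(field, 10, axis=1)    # identical field, displaced 10 points
print([round(fss(shifted, field, 2.0, w), 3) for w in (1, 11, 31)])
```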
The Spatial Prediction Comparison Test (SPCT)
[Figure: maps of "Freq. CAPE > 1000 m/s | q75 > 90th percentile over time (m/s)" from NARR versus the CRCM-ccsm and CRCM-cgcm3 regional climate runs over North America]
[Figure: loss differential field, histogram of the loss differential field with its mean, and empirical vs. model variograms (by separation distance and by direction) for absolute-error loss, CRCM-ccsm vs. CRCM-cgcm3 verified against NARR]
Gilleland (2013, MWR, 141(1), 340–355); Hering and Genton (2011, Technometrics, 53(4), 414–425)
Supplementary material for manuscript in preparation (G. et al.)
(A simplified sketch of the idea behind the test follows.)
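A heavily simplified sketch of the SPCT idea (Hering and Genton 2011; Gilleland 2013): form a loss differential field between two competing forecasts and test whether its mean differs from zero. The real test estimates the variance of that mean from a variogram fitted to the spatially correlated differential field; the naive statistic below assumes independent grid points and is for orientation only.

```python
import numpy as np

def loss_differential(f1, f2, obs, loss=np.abs):
    """D(s) = g(f1(s) - obs(s)) - g(f2(s) - obs(s)); negative values favor f1."""
    return loss(f1 - obs) - loss(f2 - obs)

def naive_mean_test(d):
    """Mean loss differential and a naive z-statistic.

    WARNING: treats grid points as independent; the actual SPCT accounts
    for spatial correlation via a fitted variogram, which typically makes
    the variance of the mean much larger than this estimate.
    """
    d = d.ravel()
    return d.mean(), d.mean() / (d.std(ddof=1) / np.sqrt(d.size))
```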
Meta-evaluation of spatial methods: What are the capabilities of the new methods?
• ICP: http://www.ral.ucar.edu/projects/icp
• MesoVICT incorporates more variables and more realistic forecast verification cases
§ precipitation, wind, pressure, etc.
§ complex terrain
§ cases over multiple time points (and lead times)
§ forecast ensembles
§ ensembles of observations
§ point observations and analysis products
• New spatial verification methods since the ICP
• Evaluation of climate models
§ lack of temporal compatibility in most cases, but not a problem (actually makes things simpler)
§ especially well suited to spatial verification methods
User-relevant verification: The Weather Information Value Chain
[Diagram, courtesy of Jeff Lazo: Monitoring/Observation → Modeling/Forecasting → Dissemination/Communication → Perception/Interpretation → Use/Decision Making → Economic Values, with Metrics/Evaluation applied along the chain]
In recent years, there has been a new focus on user-relevant verification – moving toward Murphy's vision of diagnostic verification and the connections between forecast quality and value.
User-relevant verification: Levels of user relevance
1. Making traditional verification methods useful for a range of users (e.g., a variety of thresholds)
2. Developing and applying specific methods for particular users [Ex: particular statistics; user-relevant variables]
3. Applying meaningful diagnostic (e.g., spatial) methods that are relevant for a particular user's question
4. Connecting economic and other value directly with forecast performance

Most verification studies are at Levels 1 and 2, with some approaching 3 and very few actually at Level 4.
Some examples….
Solar power forecasts: Adapting standard metrics to meet user needs
[Chart: bias and mean error for Comp1–Comp6 and SunCast]

Sum of errors over the forecast period (each entry appears to be the net error, with its negative and positive components in parentheses):

Model          | Intra-Hour        | Day-Ahead
Comp1          | 960 (−76 / 1036)  | 858 (−79 / 937)
Comp2          | 1418 (−43 / 1461) | 1295 (−43 / 1339)
Comp3          | 1433 (−33 / 1466) | 1106 (−5 / 1111)
Comp4          | 510 (−84 / 594)   | n/a
Comp5          | 1748 (−12 / 1760) | n/a
Comp6          | 1004 (−40 / 1044) | n/a
Blended Model  | 226 (−92 / 318)   | 178 (−110 / 288)

User: Energy trader – the sum of errors over the entire day is more informative.
Preferences for end users: raw values (non-normalized); normalized by capacity/clear sky; normalized by actuals.
Courtesy T. Jensen
Capturing Energy Ramps
[Figure, credit T. Jensen: frequency bias of up-ramp events at thresholds from 100 Up to 350 Up, for the baseline, one component of the system, and the blended system, separated into over-forecast and under-forecast cases]

Frequency Bias = (# of forecast events) / (# of observed events)
Applications of object-based approaches
• Example: Evaluation of jet cores, highs, and lows (using the MODE object-based approach) for model acceptance testing
Courtesy Marion Mittermaier, UK Met Office
Comments on user-relevant verification
• Moving toward user-relevant verification will increase both the usefulness and quality of forecasts, and will benefit developers as well as users
• Many of the steps toward user relevance (e.g., user-specified stratifications & thresholds) are easy to achieve
§ Others require major multi-disciplinary efforts
• Verification practitioners – the people who do verification – should endeavor as much as possible to understand the needs of the forecast users
• Much is left to be explored by creative minds!
Challenge: Develop the best new user-relevant verification method
• Sponsored by WMO/WWRP
§ Joint Working Group on Forecast Verification Research
§ High Impact Weather (HIW), Sub-seasonal to Seasonal (S2S), and Polar Prediction (PPP) projects
• Focus
§ All applications of meteorological and hydrological forecasts (including TCs, seasonal and sub-seasonal, polar, …)
§ Metrics can be quantitative scores or diagnostics (e.g., diagrams)
• Criteria for being selected as "best"
§ Originality, user relevance, intuitiveness, simplicity and ease of computing, robustness, and resistance to hedging
§ Desirable characteristics: (i) clear statistical foundation; (ii) applicability to a broader set of problems
Challenge: Develop the best new user-relevant verification method (cont.)
• Dates:
§ Formal announcement: Sept 2015
§ Deadline for submission: 31 Oct 2016
• Prize: Invited keynote talk at the 7th International Verification Methods Workshop in 2017
• Contact [email protected] for more information
• See the website at http://www.wmo.int/pages/prog/arep/wwrp/new/FcstVerChallenge.html
Summary
• Much progress has been made in the last few decades, advancing the capabilities and potential impacts of forecast evaluation
• Many new approaches have been developed, examined, and applied, and are providing opportunities for more meaningful evaluations of both weather and climate forecasts
• But still more challenges ahead…
Remaining challenges (some examples)
• Expansion of user-relevant metrics – moving to Levels 3 and 4, expanding Levels 1 and 2
§ Providing a breadth of information to users
• Sorting out how to incorporate uncertainty appropriately
§ Spatial / temporal
§ Measurement / observation
§ Sampling
• Improving communication
§ How do we communicate forecast quality information to the general public? To specific users?