
Climate Science and its Global Data Infrastructure

Improving Data Mobility and Management for International Climate Science

Cross Connects Workshop Series

Boulder CO

V. Balaji

NOAA/GFDL and Princeton University

15 July 2014


Outline

1 Introduction

2 Climate science is big science
    Multi-model experiments
    Model diversity
    Emergent constraints

3 Global data infrastructure

4 What can we expect from CMIP6?
    Experimental design
    Computational constraints
    Bringing analysis to data

5 Summary


Data consumers

Scientists perform sequences of computations (e.g. “poleward heat transport”, “length of growing season”) on datasets. Typically this is scripted in some data analysis language, and ideally it should be possible to apply the script to diverse datasets.
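As a minimal sketch of what "applying the same script to diverse datasets" could look like, the Python below runs one analysis function over several named time series. The growing-season definition (days above a fixed threshold), the dataset names and all values are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List


def growing_season_length(daily_temp_c: List[float], threshold_c: float = 5.0) -> int:
    """Count days whose mean temperature exceeds a threshold.

    This is a deliberately simplified, assumed definition of
    "length of growing season", not a community standard.
    """
    return sum(1 for t in daily_temp_c if t > threshold_c)


def apply_to_datasets(
    analysis: Callable[[List[float]], int],
    datasets: Dict[str, List[float]],
) -> Dict[str, int]:
    """Run the same analysis over every dataset, keyed by dataset name."""
    return {name: analysis(series) for name, series in datasets.items()}


# Hypothetical daily-mean temperatures (deg C) from two different sources.
datasets = {
    "model_a": [2.0, 4.5, 6.1, 8.3, 7.7, 4.9],
    "obs_b": [5.5, 6.0, 3.2, 9.1, 10.0],
}

season_lengths = apply_to_datasets(growing_season_length, datasets)
```

The analysis function never needs to know where a series came from, which is only possible if every source has first been brought into a common form.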


Data producers

Observational and model output data in the climate-ocean-weather (COW) community is initially generated in some “native” non-standard format, and any subsequent comparative analysis requires considerable effort to systematise. Issues include moving and transient data sources, lossy data formats, curvilinear and other “exotic” coordinates.


Data organizers

Data organizers are the community within this ecosystem that facilitates the transformation of source-dependent data to a neutral and readily consumable form. They maintain the standards for describing data in a manner that permits these transformations, and develop tools to perform them.


Multi-model ensembles for climate projection

Figure SPM.7 from the IPCC AR5 Report.


Sources of uncertainty in Earth System Modeling

We are modeling an inherently chaotic non-linear dynamical coupled system, with sensitive dependence on initial conditions (treated using initial condition ensembles and data assimilation).
Some empirical inputs to models are poorly constrained by data (perturbed physics ensembles, e.g. climateprediction.net).
There are many climate components and feedbacks we still don’t know (or agree) how to represent (structural uncertainty and multi-model ensembles).


Multi-model ensembles to overcome “structural uncertainty”

Reichler and Kim (2008), Fig. 1: compare models’ ability to simulate 20th century climate, over 3 generations of models.

Models are getting better over time.
The ensemble average is better than any individual model.
Improvements in understanding percolate quickly across the community.
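One way to see why an ensemble average can outperform its members is error cancellation. The sketch below uses entirely synthetic numbers (not taken from Reichler and Kim) in which three hypothetical models have partly offsetting errors. In general the root-mean-square error (RMSE) of the ensemble mean is guaranteed only to be no larger than the average member RMSE; beating every single member, as happens here, depends on how the errors are distributed.

```python
from math import sqrt
from typing import Dict, List


def rmse(simulated: List[float], observed: List[float]) -> float:
    """Root-mean-square error between a simulated and an observed series."""
    return sqrt(sum((s - o) ** 2 for s, o in zip(simulated, observed)) / len(observed))


def ensemble_mean(members: Dict[str, List[float]]) -> List[float]:
    """Point-by-point average over all ensemble members."""
    n = len(members)
    return [sum(values) / n for values in zip(*members.values())]


# Synthetic "observations" and three hypothetical models with different errors.
observed = [1.0, 2.0, 3.0, 4.0]
members = {
    "model_1": [1.5, 2.5, 3.5, 4.5],  # constant warm bias
    "model_2": [0.4, 1.4, 2.4, 3.4],  # constant cold bias
    "model_3": [1.3, 1.7, 3.3, 3.7],  # alternating error
}

member_rmse = {name: rmse(series, observed) for name, series in members.items()}
mean_rmse = rmse(ensemble_mean(members), observed)
```

With these made-up values the ensemble-mean RMSE is smaller than that of the best individual model, because the biases of opposite sign largely cancel.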


Genealogy of climate models

There is a close link between “genetic distance” and “phenotypic distance” across climate models (Fig. 1 from Knutti et al., GRL, 2013).


Interannual variability of hurricane frequency

Interannual variability of W. Atlantic hurricane number from 1981-2005 in a 50 km model. (Figure 7 from Zhao and Held 2009).


A simple predictor of hurricane counts?

The difference between Atlantic surface temperature $T_A$ and mid-tropospheric global temperature $T_G$ determines the hurricane generation rate. Figure 16 from Zhao et al. (2009).


Emergent constraints

Sherwood et al. (Nature, 2014). Observational constraints on model spread.


The global data infrastructure underpinning MIPs

Model intercomparison projects (MIPs), and in general any science involving cross-model comparisons, critically depend on the global data infrastructure (GDI) – the “vast machine” (Edwards 2010) – making this sort of data-sharing possible.
Infrastructure should not be a research project.
Infrastructure should be treated as such by the national and international research agencies, but it is instead funded piecemeal, as a soft-money afterthought. This places the system at risk (NRC 2012: “A National Strategy for Advancing Climate Modeling”; ISENES-2 Infrastructure Strategy document, 2012).


The Earth System Grid Federation

The Earth System Grid Federation (ESGF; comprising large funded efforts at PCMDI, BADC, DKRZ, NCDC, and many modeling centers) designs, operates and maintains server software and hardware for the distribution of model data (and model-related observations including reanalysis).
Software allows for archiving, browsing, cataloguing and discovering datasets.
Services include search, download, replication, versioning, server-side analysis.
Critically depends on standards!

See next talk by Dean Williams!


Standards underpinning the GDI

Data formats (netCDF) conforming to both the general Climate-Forecast (CF) conventions, and specific conventions such as the CMIP5 standards (satisfied using CMOR);
URL and catalog standards such as OPeNDAP and THREDDS, making data accessible to remote locations regardless of local storage format;
ESGF software: custom data publication, node management and data harvesting protocols developed by the ESG and the ESG Federation;
the CMIP5 Data Reference Syntax (DRS) allowing for creation of a uniform URL namespace for CMIP5 data, and
the Common Information Model (CIM) for the description of models and simulations. (Includes Gridspec, but not much used.)
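To illustrate how a reference syntax can yield a uniform URL namespace, the sketch below joins a fixed, ordered set of facets into one path. The facet list and its order are a simplified assumption made for this example, and the root URL, institute and model names are placeholders; the DRS document itself is the authority on the actual components.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class DatasetFacets:
    """Ordered facets describing one dataset (simplified, assumed ordering)."""
    activity: str
    institute: str
    model: str
    experiment: str
    frequency: str
    realm: str
    ensemble_member: str
    variable: str


def drs_style_url(facets: DatasetFacets, root: str = "https://example.org/thredds") -> str:
    """Build a predictable URL by joining the facets in their declared order."""
    parts = [str(value) for value in astuple(facets)]
    if any(not part or "/" in part for part in parts):
        raise ValueError("facet values must be non-empty and contain no '/'")
    return "/".join([root.rstrip("/")] + parts)


# Placeholder institute and model names; the root URL is hypothetical.
example = DatasetFacets(
    activity="cmip5",
    institute="EXAMPLE-INST",
    model="EXAMPLE-MODEL",
    experiment="historical",
    frequency="mon",
    realm="atmos",
    ensemble_member="r1i1p1",
    variable="tas",
)
url = drs_style_url(example)
```

Because every data node derives paths from the same facets in the same order, a client can construct the location of a dataset from its description alone, without consulting a node-specific catalogue.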

Overseen by piecemeal volunteer efforts such as ESGF, GO-ESSP, CF Conventions and Variables Committees, ES-DOC, .... Standard extensions (e.g. downscaling) may not have been adequately reviewed.


Grid diversity may increase in CMIP6

Downstream communities may not wish to deal with novel grids, but specialist communities are likely to insist on them for their own research.


Cubed-sphere grid with nests


Data provenance and citation

Datasets are quite often used without proper acknowledgement or record of provenance.
The effort to issue Digital Object Identifiers (DOIs) in CMIP5 was immature, and probably requires some extra steps.
Quality control (QC-L2) was designed to provide several levels of data and metadata quality checking in CMIP5, but in addition proper peer review of metadata is needed.
There are journals (e.g. ESSD) that provide a mechanism for citable entities around data, and can be used as a vehicle for quality control, peer review, and credit.
A record of data provenance is likely to become a requirement at NOAA; many journals are starting to require a permanent record of datasets and methods.


Requirements for a robust and agile MIP infrastructure

We expect more specialized MIPs, see Meehl et al. (Eos, 2014) (and maybe ARs... see Nature editorial 18 September 2013, “The final assessment”). Current approach is not scalable!

Recognition by funding agencies that the science critically depends on a GDI currently financed and operated on a risky ad-hoc basis.
ESGF servers to be continually available and operated (new data will not appear at 6-yearly intervals).
Modeling centers will be unable to comply unless all MIPs follow consistent standards established by a WGCM Infrastructure Panel. (COOKIE is a good example to follow...)
WGCM Infrastructure Panel to act as data quality review body for new experiments.


Role of WGCM and its infrastructure panel

Provide scientific guidance and requirements for the GDI; exert greater influence over its design and features.
Provide standards governance allowing for orderly evolution of standards.
Provide design templates (e.g. CMOR extensions) for groups designing MIPs and work to ensure their conformance to standards.
Work with academies and publishers to require adequate data citation and recognition for data providers.
Intercede with national agencies to provision data infrastructure with adequate and stable long-term funding.

We expect this to be a non-trivial commitment of time and effort by Panel members.
Acknowledgements: Proposal initially prepared by V. Balaji and Karl Taylor, with input and revisions made by co-authors Eric Guilyardi, Michael Lautenschlager, Bryan Lawrence, and Dean Williams.


The WGCM Infrastructure Panel

Chaired by V. Balaji (Princeton/GFDL) and K. Taylor (PCMDI).
Strategy to develop a series of "position papers" on global data infrastructure and its interaction with the scientific design of experiments. These will be presented to the WGCM annual meeting.

    projected data volumes for CMIP6, strategies for managing the growth path
    data access policies: would open access simplify the technical design of the infrastructure?
    data citations: developing and promoting a path to data citations using DOIs and the emerging data journals, such as ESSD, Nature Scientific Data.
    protocol document for the "endorsed MIPs".

Infrastructure issues that impinge on science design for CMIP6 will be handled through close involvement of the WIP and CMIP panel (e.g. joint papers).
Interest from other WCRP working groups! (WGSIP, WGNE)


CMIP6 Experimental design

DECK experiments form the core; many specialized MIPs for smaller communities, some of which will be endorsed by CMIP panel. Figure courtesy Meehl et al. (Eos 2014).


CMIP6: data explosion?

Overpeck et al. (Science, 2011) forecast a 100-fold increase in data volume over 10 years.


Climate modeling, a computational profile

Intrinsic variability at all timescales from minutes to millennia; distinguishing natural from forced variability is a key challenge.
Coupled multi-scale multi-physics modeling;
Physics components have predictable data dependencies associated with grids;
Adding processes and components improves scientific understanding;
New physics and higher process fidelity at higher resolution;
Algorithms generally possess weak scalability.

In sum, climate modeling requires long-term integrations of weakly-scaling I/O and memory-bound models of enormous complexity.


Complexity, resolution, ensemble size

Computational increases can be applied along 3 axes: resolution, complexity, ensemble size.

resolution: where an $N^3$ growth in computing is applied to $(x, y, t)$, leading to only $N^2$ growth in archive in $(x, y)$: thus $A \sim C^{2/3}$ (archive volume $A$, computing $C$)!
complexity, as new subsystems and feedbacks are added to comprehensive earth system models;
UQ (uncertainty quantification), as we build ensembles of simulations to sample uncertainty, both in our knowledge and representation, and of that inherent in the chaotic system. In particular, we are interested in characterizing the "tail" of the PDF (weather extremes) where a lot of climate risk resides.
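The resolution scaling can be checked with a few lines of arithmetic. This is a minimal sketch under the assumptions stated on the slide: refining the grid by a factor N multiplies computing by N cubed (two horizontal dimensions plus the time step) while the archive grows only by N squared, because output is not saved at every time step. The function name is hypothetical.

```python
def archive_growth(compute_growth: float) -> float:
    """Archive growth factor implied by A ~ C**(2/3)."""
    return compute_growth ** (2.0 / 3.0)


# Halving the grid spacing corresponds to a refinement factor N = 2.
refinement = 2
compute_factor = refinement ** 3   # cost grows in x, y and t
archive_factor = refinement ** 2   # stored fields grow in x and y only

# A 1000-fold increase in computing spent purely on resolution (N = 10)
# would imply roughly a 100-fold increase in archive volume.
archive_for_1000x_compute = archive_growth(1000.0)
```

Under this assumption the archive grows more slowly than the computing that produces it, but only if the extra computing goes into resolution rather than into complexity or ensemble size.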


Hurricane statistics: a CMIP6 advance?

Interannual variability of W. Atlantic hurricane number from 1981-2005 in a 50 km model. (Figure 7 from Zhao and Held 2009).
CMIP5 median resolution: ∼100 km; CMIP6: 50 or 25 km.


Complexity in ESMs

Note slope is “piecewise-constant” over 1-2 CMIP cycles!

Figure courtesy UCAR Climate FAQ.


ExArch: Climate analytics on distributed exascale data archives

Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner, D. Waliser, S. Pascoe, A. Stephens, P. Kershaw, F. Laliberte, J. Kim, S. Fiore


The G8 Exascale Research Initiative

A joint initiative by research councils of Canada, France, Germany, Japan, Russia, UK, and USA;
Research into exploitation of exascale computational resources;
Focus on 10-year time horizon;
A unique opportunity for funded international collaboration;
But with restrictions – only funding to 7 participating countries; restricted eligibility within participating countries.


Bringing analysis to data: what’s involved

The projected frequency of intense tropical cyclones in some region of the globe for input into an impacts model?

Query execution:
    evaluation of provenance and quality control meta-data to determine which datasets to include;
    dispatch of queries to processing nodes, negotiating authentication and access control layers;
    key issue is the use of user-developed analytic scripts;
    collection of results from the processing nodes, evaluation of return codes for fault detection;
    further calculations to combine collected results;
    archive results for re-use; delivery of processed results to the end-user, perhaps in deferred fashion if the associated computation needs to be scheduled on a "cloud".

Prototype development in Phase I – still waiting to see if Phase II will be funded.
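The query-execution steps above can be sketched in a few dozen lines. This is not the ExArch implementation: remote nodes are simulated by a local function, authentication, access control and scheduling are omitted, and every name and value below is hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Dataset(NamedTuple):
    dataset_id: str
    node: str            # processing node that hosts the data
    passed_qc: bool      # outcome of quality control / provenance checks
    values: List[float]  # stand-in for the data held at the node


def run_on_node(script: Callable[[List[float]], float], dataset: Dataset) -> Tuple[int, float]:
    """Stand-in for remote execution: returns (return_code, result)."""
    try:
        return 0, script(dataset.values)
    except Exception:
        return 1, float("nan")


def execute_query(
    key: str,
    script: Callable[[List[float]], float],
    catalogue: List[Dataset],
    cache: Dict[str, float],
) -> float:
    """Select datasets, dispatch the user script, check return codes, combine, archive."""
    if key in cache:                                     # re-use an archived result
        return cache[key]
    selected = [d for d in catalogue if d.passed_qc]     # metadata evaluation
    results = []
    for dataset in selected:                             # dispatch to each node
        code, value = run_on_node(script, dataset)
        if code == 0:                                    # fault detection
            results.append(value)
    if not results:
        raise RuntimeError("no node returned a usable result")
    combined = sum(results) / len(results)               # combine collected results
    cache[key] = combined                                # archive for re-use
    return combined


def intense_fraction(wind_speeds: List[float]) -> float:
    """User-developed script: fraction of events at or above 50 m/s (assumed threshold)."""
    return sum(1 for w in wind_speeds if w >= 50.0) / len(wind_speeds)


catalogue = [
    Dataset("ds1", "node_a", True, [30.0, 55.0, 60.0, 40.0]),
    Dataset("ds2", "node_b", False, [70.0, 80.0]),           # excluded by QC
    Dataset("ds3", "node_b", True, [52.0, 51.0, 50.0, 60.0]),
    Dataset("ds4", "node_c", True, []),                      # script fails on empty data
]
cache: Dict[str, float] = {}
answer = execute_query("intense-tc-fraction", intense_fraction, catalogue, cache)
```

Only the small per-node results travel back to the user; the datasets themselves never leave their nodes, which is the point of bringing the analysis to the data.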


Uncertainty: large ensembles?

Extreme value analysis can require very large ensembles (N ∼ 1000) but CMIP6 design uses comprehensive ESMs and N ∼ 3–10.
Figure courtesy climateprediction.net.
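A back-of-the-envelope calculation shows why the tail needs large ensembles. The sketch assumes independent ensemble members that each contribute one sample of an event with a fixed probability; real ensembles are longer and correlated, so this only illustrates the order of magnitude. Function names are hypothetical.

```python
def prob_at_least_one(p_event: float, n_members: int) -> float:
    """Probability that at least one of n independent members samples the event."""
    return 1.0 - (1.0 - p_event) ** n_members


def expected_count(p_event: float, n_members: int) -> float:
    """Expected number of members that sample the event."""
    return p_event * n_members


p_rare = 0.01  # a 1-in-100 event (assumed for illustration)

small = prob_at_least_one(p_rare, 10)     # about 0.096: usually no sample at all
large = prob_at_least_one(p_rare, 1000)   # very close to 1
samples_small = expected_count(p_rare, 10)     # 0.1 expected samples
samples_large = expected_count(p_rare, 1000)   # 10 expected samples
```

With 10 members the event is most likely never sampled, whereas 1000 members give around ten samples, which is the minimum one would want before estimating the shape of the tail.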


Pattern scaling

$$T(t, x, y, s) \approx T(t, s)\, p(x, y) \qquad (1)$$

From Tebaldi and Arblaster (2014). Pattern scaling can be used to fill gaps in the coverage of models/scenarios.
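Equation (1) can be exercised on a toy example. The sketch assumes that $T(t, s)$ is the global-mean temperature change for scenario $s$ and that the pattern $p(x, y)$ is estimated by a least-squares fit through the origin; both are illustrative choices rather than the specific procedure of Tebaldi and Arblaster. Grid cells are flattened into a list and all numbers are synthetic.

```python
from typing import List


def fit_pattern(local: List[List[float]], global_mean: List[float]) -> List[float]:
    """Least-squares pattern p[k] minimising sum_t (local[t][k] - global_mean[t] * p[k])**2."""
    denom = sum(g * g for g in global_mean)
    n_cells = len(local[0])
    return [
        sum(local[t][k] * global_mean[t] for t in range(len(global_mean))) / denom
        for k in range(n_cells)
    ]


def reconstruct(global_mean: List[float], pattern: List[float]) -> List[List[float]]:
    """Apply equation (1): local change = global-mean change times the fixed pattern."""
    return [[g * p for p in pattern] for g in global_mean]


# Scenario A: synthetic global-mean warming (K) and local fields on two grid cells.
global_a = [0.5, 1.0, 2.0]
true_pattern = [1.5, 0.8]                  # e.g. a fast-warming and a slow-warming cell
local_a = reconstruct(global_a, true_pattern)

# Fit the pattern on scenario A, then fill in scenario B from its global mean alone.
pattern = fit_pattern(local_a, global_a)
global_b = [3.0]
local_b = reconstruct(global_b, pattern)
```

Here scenario B was never "run" locally: its regional field is inferred from one global number per time step, which is how pattern scaling fills gaps in model and scenario coverage.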


Knutti et al, revisited

“Genetic health” in the modeling ecosystem? NRC Report: maintain diversity for structural uncertainty, reduce elsewhere.


Summary

CMIP6 will probably see an uptick in number of models and resolution; it may be flat in complexity and ensemble size.
Sharing infrastructure is a hard problem, and not cheap. It should be done with a purpose, such as:

    minimizing source points of structural uncertainty;
    scientific reproducibility of simulations;
    making the process of setting up a MIP lightweight.

Standards are less sexy than scale, but probably more critical;
Interoperability and shared infrastructure have many aspects: common experimental protocols, common analytic methods, common documentation standards for data and data provenance, shared workflow, shared model components, shared technical layers. (ESDOC, ESGF, ESMF, ...)
