Anatomy of a Climate Science-centric Workflow

Harinarayan Krishnan, CAlibrated and Systematic Characterization, Attribution, and Detection of Extremes (CASCADE Team)

Kevin Bensema, Surendra Byna, Soyoung Jeon, Karthik Kashinath, Burlen Loring, Pardeep Pall, Prabhat, Alexandru Romosan, Oliver Ruebel, Daithi Stone, Travis O'Brien, Christopher Paciorek, Michael Wehner, Wes Bethel, William Collins


Page 2:

Challenges

• Data volumes are already at the terabyte scale and will only grow.

• Data at three- to six-hour intervals must be processed frequently.

• The focus is now on high resolution (1/4 to 1/8 degree), extensible to higher resolutions.

• High-resolution, high-frequency analysis adds several orders of magnitude to the data volume.
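To give a feel for the scaling in the bullets above, here is a back-of-envelope sketch. It assumes a global regular latitude-longitude grid, 4-byte values, a 365-day year, and a single 2-D field; none of these figures come from the slides, and the function names are purely illustrative.

```python
# Back-of-envelope size of ONE 2-D float32 field on a global lat-lon grid.
# Assumptions (not from the slides): regular grid, 4-byte values, 365-day year.

def points(resolution_deg):
    """Number of grid points on a global regular lat-lon grid."""
    n_lon = round(360 / resolution_deg)
    n_lat = round(180 / resolution_deg)
    return n_lon * n_lat

def steps_per_year(interval_hours):
    """Number of output time steps in a 365-day year."""
    return 365 * 24 // interval_hours

def gigabytes_per_year(resolution_deg, interval_hours, bytes_per_value=4):
    """Yearly volume of a single 2-D field, in GB (10**9 bytes)."""
    n_values = points(resolution_deg) * steps_per_year(interval_hours)
    return n_values * bytes_per_value / 1e9

for res in (1.0, 0.25, 0.125):
    for hours in (6, 3):
        size = gigabytes_per_year(res, hours)
        print(f"{res:>5} deg, {hours}-hourly: {size:8.2f} GB/year")
```

Under these assumptions, moving from 1 degree at 6-hourly output to 1/8 degree at 3-hourly output multiplies the volume of each field by 64 x 2 = 128, before counting additional variables, vertical levels, simulated years, or ensemble members.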

Page 3:

Proposed Strategy

• Identification of use cases, extraction of common computational algorithms, scaling & optimization of current work.

• Template workflow configurations of common use cases.

• Abstraction of services to HPC environments.

• Easy-to-use archiving, distribution, and verification strategies.

• Standardization of the parallel work environment.

Page 4:

What it is/What it is not

• What it is not
- Not a general workflow
- Not a general infrastructure – balancing between performance & exploratory science.

• What it is
- …

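For example:

t = cascade.Teca()                       # TECA analysis module
t['filename'] = 'myfile'                 # input file for the TECA task
writer = cascade.Writer(cascade.ESGF)    # writer that targets ESGF
writer['input'] = t['out']               # link the TECA output to the writer input
n = workflow.NERSC(<resources>, writer)  # <resources> is a placeholder for the resource request
n.execute()

Note: this is active, ongoing work in progress.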

[Figure: workflow diagram. Boxes: Start (Stage Data); Schedule Job (Time|Log); five "Load & Run Teca task (Time|Log)" boxes, each followed by "Verify & Validate"; Write & Distribute to ESGF; Publish using ESGF node; Record Workflow.]
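The "(Time|Log)" and "Verify & Validate" steps attached to each task in the diagram could look roughly like the following. This is a minimal sketch using only the Python standard library; `run_step`, `fake_teca_task`, and the sample values are hypothetical and are not part of the CASCADE or TECA code.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("workflow-sketch")

def run_step(name, task, verify):
    """Run task(), log its wall-clock time, then verify its result.

    `task` and `verify` are plain callables; nothing here is CASCADE API.
    """
    start = time.perf_counter()
    result = task()
    elapsed = time.perf_counter() - start
    log.info("%s finished in %.3f s", name, elapsed)
    if not verify(result):
        raise RuntimeError(f"{name}: verification failed")
    return result, elapsed

# Stand-in for "Load & Run Teca task": count values above a threshold.
def fake_teca_task():
    wind_speeds = [12.0, 35.5, 41.2, 8.9, 50.1]  # made-up values
    return sum(1 for w in wind_speeds if w > 33.0)

count, seconds = run_step("teca-task-1", fake_teca_task, verify=lambda n: n >= 0)
print(count)
```

Keeping timing, logging, and verification in one wrapper means every task reports the same information, regardless of which analysis it runs.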

Page 7:

What it is/What it is not

• What it is not
- Not a general workflow
- Not a general infrastructure – balancing between performance & exploratory science.

• What it is
- A highly customized climate-centric API (Zonal Mean Averages, GEV, etc.)
- Workflow – Verification/Validations, Job scheduling, Staging, Deployment, etc.
- Modules – Performance & Timing Support, Calendar Support, etc.
- Template workflows
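As an illustration of the kind of operation such a climate-centric API exposes, a zonal mean average reduces a field over longitude and leaves one value per latitude. The sketch below uses plain Python lists and invented values; the `zonal_mean` name and the `[lat][lon]` layout are assumptions, not the CASCADE interface.

```python
from statistics import fmean

def zonal_mean(field):
    """Average a 2-D field over longitude, one value per latitude row.

    `field` is a list of rows indexed [lat][lon]; a hypothetical layout,
    real model output would normally be read into arrays from NetCDF files.
    """
    return [fmean(row) for row in field]

# Tiny 3-latitude x 4-longitude temperature field (made-up values, in K).
temperature = [
    [250.0, 252.0, 254.0, 256.0],
    [280.0, 282.0, 284.0, 286.0],
    [300.0, 300.0, 302.0, 302.0],
]
print(zonal_mean(temperature))
```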

Page 8:

Climate Science-centric Workflow

• Workspace – A collaboration environment to share, track documents, visualize status, update issues.

• One-on-one – Identify use cases that require implementing new features or scaling & performance optimization of existing ones.

• Software tools – Development and Deployment of algorithms & software packages as well as building & maintaining packages for target environments.

• Workflow components – Connecting it all together.

Page 9:

Communication Infrastructure

Page 10:

Quick Note: Software Environment

• Infrastructure - cascade.lbl.gov/esg02.nersc.gov

• Confluence – Portal to publish and collaborate with team members

• Jira – Bug & Issue tracking portal.

• CDash/Jenkins – Infrastructure to report status of software build & regression tests.

• BitBucket – Main software repository.

• ESGF service – Service for distribution of data generated by CASCADE.

Page 11:

CASCADE Team

• Detection & Attribution Team – Characterization, detection, and attribution of simulated and observed extremes in a variety of different contexts -- Analysis Algorithms

• Model Fidelity – Evaluation and improvement of model fidelity in simulating extremes

• Statistics – Development of statistical frameworks for extremes analysis, uncertainty quantification, and model evaluation

• Formulation of highly parallel software for analysis and uncertainty quantification of extremes

Page 12:

Analysis Infrastructure Tasks

• Development of new climate-centric algorithms and evaluation of current ones. Implement scalable, parallel versions as needed.

• Performance analysis and data management.

• Deployment and Maintenance on HPC environments.

• Creating a standardized environment – Provide the same execution environment on all deployed platforms and seamlessly bridge different technologies (Python <-> R).

• User Support.

Page 13:

Detection & Attribution

• Single Program Multiple Data (SPMD) scripts – Refactoring current algorithms to work in parallel.

• Distribution/Staging – Functionality to distribute generated data through ESGF and to stage data at NERSC.

• TECA – Active development of the parallel Toolkit for Extreme Climate Analysis.

• Teleconnections – Ensemble analysis & software solutions to investigate the frequency of teleconnection events.
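In an SPMD script every process runs the same program and picks its own share of the work from its rank. The sketch below shows one common way to split time steps into contiguous blocks. The `local_range` helper is hypothetical, and the loop over ranks only emulates what separate MPI processes would each compute on their own.

```python
def local_range(rank, size, n_items):
    """Return the half-open [start, stop) block of items owned by `rank`.

    Items are split as evenly as possible; the first `n_items % size`
    ranks get one extra item.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# With MPI, rank and size would come from the communicator (for example
# mpi4py's comm.Get_rank() and comm.Get_size()); here we loop to emulate it.
n_timesteps = 10
size = 4
for rank in range(size):
    print(rank, local_range(rank, size, n_timesteps))
```

Because each rank derives its block from `rank` and `size` alone, no communication is needed to agree on who processes which time steps.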

Page 14:

Model Fidelity

Page 15:

Model Fidelity

• ILIAD workflow
- The parallelization of the generation of initial conditions.
- Dynamic Building, Compilation & Execution of CESM.
- Module verification – monitor execution status & successful completion.
- Module for automation of archiving of output (initial conditions, namelist files, CESM output).

• DepCache – External tool for speeding up execution of Python libraries.

Page 16:

Statistics

• Integration of Statistical Algorithms – Working to deploy relevant statistical algorithms within CASCADE framework.

• Parallelization – Scaling statistics scripts to work in a parallel environment.

• llex Installation – Generalized Extreme Value Analysis & Peaks Over Threshold statistical analysis algorithms (Developed by Stats team members)
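For readers unfamiliar with Generalized Extreme Value (GEV) analysis, the sketch below evaluates the standard GEV distribution function and the return level derived from it. It is written from the textbook formulas and is not the interface of the package named above; the parameter values are invented for illustration.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV cumulative distribution (location mu, scale sigma, shape xi)."""
    if xi == 0.0:
        return math.exp(-math.exp(-(x - mu) / sigma))
    t = 1.0 + xi * (x - mu) / sigma
    if t <= 0.0:
        # Outside the support: below the lower bound (xi > 0)
        # or above the upper bound (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def return_level(period, mu, sigma, xi):
    """Level exceeded on average once every `period` blocks (e.g. years)."""
    y = -math.log(1.0 - 1.0 / period)
    if xi == 0.0:
        return mu - sigma * math.log(y)
    return mu - (sigma / xi) * (1.0 - y ** (-xi))

# Made-up parameters for annual-maximum daily precipitation (mm).
mu, sigma, xi = 40.0, 10.0, 0.1
z20 = return_level(20, mu, sigma, xi)
print(round(z20, 2), round(gev_cdf(z20, mu, sigma, xi), 4))
```

The two functions are inverses of each other in the sense that the distribution function evaluated at the T-block return level equals 1 - 1/T. In practice the three parameters would be estimated from data, which is the part a statistical package handles.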

Page 17:

Software Suite

• Python environment
- IPython, mpi4py, numpy, …
- CDAT-Core (cdms2, cdtime, …)
- Rpy2 (Python-R bridge)

• R environment
- extRemes, ismev
- Llex – GEV & POT (Dr. Chris Paciorek's package)
- pbdR – pbdMPI, pbdSLAP, pbdPROF, pbdNCDF (ORNL)

• TECA – parallel toolkit developed at LBNL (TC, ETC, AR)

- Prototype deployment at NERSC (module load cascade)

- Transitioning maintenance of NERSC ESGF Node to CASCADE analysis group.

Page 18:

Workflow Infrastructure

• Unified Workflow Service
- Load balanced services that handle job Scheduling, Validation & Verification, Fault Tolerance

• Core Modules
- Calendar support
- Data Reduction Operations (Sum, Max, Min, Average, etc.)
- I/O services (Parallel Read/Write)
- Threading/MPI wrapping (Map|Foreach)
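A Map|Foreach wrapper combined with a reduction operation might be sketched as below. This version uses threads from the Python standard library rather than MPI, and `map_reduce`, `REDUCERS`, and the sample chunks are hypothetical names and data, not the CASCADE core modules.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Reduction operators named on the slide; each combines two partial results.
REDUCERS = {
    "sum": lambda a, b: a + b,
    "max": max,
    "min": min,
}

def map_reduce(func, chunks, reducer, workers=4):
    """Apply `func` to every chunk in parallel threads, then reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(func, chunks))
    return reduce(REDUCERS[reducer], partials)

# Four chunks standing in for four files or time slices (made-up values).
chunks = [[1.0, 5.0], [7.5, 2.0], [3.0, 3.0], [0.5, 9.0]]

total = map_reduce(sum, chunks, "sum")
peak = map_reduce(max, chunks, "max")
count = map_reduce(len, chunks, "sum")
print(total, peak, total / count)
```

The average is computed from two reductions (a sum and a count) because averaging per-chunk averages would give the wrong answer when chunks differ in length.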

Page 19:

Additional Services

• MPO – A tool for recording scientific workflows, Developed by General Atomics & LBNL.

• Tigres – Template Interfaces for Agile Parallel Data-Intensive Science, Developed by Advanced Computing for Science Group at LBNL.

• ESGF – Support for automated distribution through ESGF installation.

Page 20:

Modules & API

• CoreModule
- Timing, Logging
- Standard definition of parameter inputs & outputs
- All modules are inherently Workflows of one.
- Implicit connectivity of workflow

• BaseAPI (Pythonic)
- __getitem__, __setitem__: param["input"] = val
- cascade_static_{param|output}_spec: {name, value, type, user_defined}
- cascade_execute – core execution function
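One way the dictionary-style access and implicit connectivity described above could fit together is sketched here. Everything in it is an assumption made for illustration: the `Link` class, the `param_spec`/`output_spec` attributes, and the `Scale` module are invented, and only the names `CascadeBase`, `execute`, and `cascade_execute` echo the slides.

```python
class Link:
    """Reference to a named output of another module."""
    def __init__(self, module, name):
        self.module = module
        self.name = name

class CascadeBase:
    """Hypothetical base class: parameters are set and read with []."""
    param_spec = {}     # parameter name -> default value
    output_spec = ()    # names of the outputs a module produces

    def __init__(self):
        self.params = dict(self.param_spec)
        self.outputs = {}

    def __setitem__(self, name, value):
        if name not in self.param_spec:
            raise KeyError(f"unknown parameter: {name}")
        self.params[name] = value

    def __getitem__(self, name):
        if name in self.output_spec:
            return Link(self, name)  # evaluated lazily, at execute time
        return self.params[name]

    def execute(self):
        # Run upstream modules first, then replace links by their values.
        resolved = {}
        for name, value in self.params.items():
            if isinstance(value, Link):
                if value.name not in value.module.outputs:
                    value.module.execute()
                value = value.module.outputs[value.name]
            resolved[name] = value
        self.outputs = self.cascade_execute(resolved)
        return self.outputs

    def cascade_execute(self, params):
        raise NotImplementedError

class Scale(CascadeBase):
    """Toy module: multiplies its input list by a factor."""
    param_spec = {"inputdata": None, "factor": 1.0}
    output_spec = ("outputdata",)

    def cascade_execute(self, params):
        return {"outputdata": [params["factor"] * x for x in params["inputdata"]]}

a, b = Scale(), Scale()
a["inputdata"], a["factor"] = [1.0, 2.0], 2.0
b["inputdata"], b["factor"] = a["outputdata"], 10.0  # a link, not a copy
print(b.execute()["outputdata"])
```

Because reading an output returns a link rather than data, assigning it to another module's input is enough to connect the two, and executing the downstream module pulls in whatever it depends on.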

Page 21:

Example Workflow

• Example use case: Running a single module

^^^^^^^^^^^^^^

t = Teca()  # Teca is a derived class of CascadeBase
filename = 'myfile'
t['filename'] = filename
t.execute()

^^^^^^^^^^^^^^^^^

t1 = Teca()          # Teca is a derived class of CascadeBase
t2 = TecaAnalysis()  # TecaAnalysis is a derived class of CascadeBase
t2['inputdata'] = t1['outputdata']  # note: this establishes a link between the two modules
t2.execute()

Page 22:

Proposed Workflow

t1 = Teca()
t2 = TecaAnalysis()
t3 = TecaAnalysis()
s = Diff()
t2['inputdata'] = t1['outputdata']  # t2 and t3 both read the output of t1
t3['inputdata'] = t1['outputdata']
s['inputdata1'] = t2['outputdata']  # s takes both analysis results as inputs
s['inputdata2'] = t3['outputdata']
s.write('prefix', 'file')
s.execute()

[Figure: workflow graph. t1 feeds t2 and t3; t2 and t3 feed S; S produces the output. Step labels: Schedule Job, Stage Data, Perform MPI-based task, Validate/Verify.]
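The links in the proposed workflow form a small dependency graph, and a workflow service has to decide what can run when. The sketch below derives an execution order for that graph with the standard-library `graphlib` module (Python 3.9 or later). It illustrates the scheduling idea only and makes no claim about how the CASCADE workflow service is implemented.

```python
from graphlib import TopologicalSorter

# Each key depends on the modules in its set, mirroring the links in the
# proposed workflow: t2 and t3 read from t1, and s reads from t2 and t3.
dependencies = {
    "t1": set(),
    "t2": {"t1"},
    "t3": {"t1"},
    "s": {"t2", "t3"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # modules whose inputs are all available
    stages.append(ready)                # these could run concurrently
    sorter.done(*ready)

print(stages)
```

Grouping modules into stages makes the available parallelism explicit: t2 and t3 have no dependency on each other, so they can be scheduled at the same time once t1 has finished.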

Page 23:

Recap: Anatomy of Climate Science-centric Workflow

• Software Environment – Development, Deployment, and Maintenance

• Custom Use Case Support for D&A, Model Fidelity, and Statistics team needs.

• Software Suite – Scaling, Parallelism, Performance Management, Software Services (Python, R, TECA)

• Workflow Development – Thin Client & Workflow service, Module development, Optimization (Data Movement, Workflow execution), Provenance.

Page 24:

Thanks

• Questions?