Satellite Multi-Source Data Processing
Vardan Gyurjyan 1, Constantine Lukashin 2, Paul Stackhouse 2
1. Jefferson Lab ([email protected]), 2. NASA LaRC
Outline
Scientific data processing challenges
World digital data 3V expansion
JLAB developed technology description
Technology adoption: NASA LaRC – DOE JLAB inter-agency collaborations
From Earthrise…
To NASA Earth Observing System
With our own eyes…
$60K question
Are these environmental changes reversible?
To what extent do these observed changes reflect long-term climate change?
To answer, we need long, high-quality, global satellite records, monitoring the Earth's surface and atmosphere continuously.
EOS data challenges
Climate Model Data
Expected volume of CM outputs may exceed an Exabyte in 10-15 years.
Data is distributed nationally (NASA, NOAA, DOE) and internationally (UK).
Multiple Climate & Earth System models (30-50).
Perturbed Physics Ensemble (PPE) for a single model: about 5 Petabytes.
Climate Model output format is standard NetCDF.
Earth Observation Data
Expected volume of used data products: exceeds 100 Petabytes in 10-15 years.
Data is distributed nationally (NASA, NOAA) and internationally (ESA, etc.).
NASA observational data format is standard HDF (Hierarchical Data Format).
NOAA observational data format is standard NetCDF (Network Common Data Form).
Observations from different sensors are NOT synched/merged.
Each sensor/mission has a separate data product line.
Our own challenges
HEP and NP experiments generate substantial amounts of data.
The largest experimental instruments ever built.
~100M sensors per experiment-specific detector.
Data needs to be processed fast.
E.g. LHC
40M events/sec
Select on the fly (level 1, 2 triggers) and store only the interesting events.
Reconstruct data and prepare for physics analysis.
Data volumes (4 experiments x 15 years)
Raw data: 1.6MB/sec, 3.2PB/year
Reconstructed data: 1.0MB/sec, 2.0PB/year
E.g. JLAB
CLAS raw data: 700MB/sec
GlueX raw data: 2GB/sec
SKA Radio Telescope: 700TB/sec
Global Digital Data Demography
Unprecedented 3V expansion of data:
Volumes
Velocities
Varieties
[Chart: Global Digital Data, subset of data producers in 2016 — projected future scientific data volumes in Exabytes for EOS, LHC, Facebook (FB), and Twitter (TW); scientific vs. non-scientific.]
Existing scientific data processing architectures will have difficulties handling future data volumes and data distribution.
DOE Exascale Initiative
Commercial data processing engines are well advanced (Apache Hadoop, Spark, Storm, etc.). Can we adopt them for our needs?
Data processing challenge

Data/software characteristics | Science data   | Social media data
Origin                        | Distributed    | Localized
Format                        | Diverse        | Uniform
Storage                       | Long term      | Short term
Processing unit               | File           | File
Processing steps              | Multiple       | Limited (map, reduce)
Language/technology           | Old/trusted    | Modern
Heritage code                 | Yes            | No
Yes, but…
To store, access and process data:
Optimize data migration
Bring the application to the data (as much as possible)
Data location and format agnosticism
Data in-stream processing
File-based processing
In-stream processing
We need a new approach…
Say NO to data-format racism!
We are all bytes.
CS architectures to help
Micro-services architecture
Flow-based reactive programming (FBP)
Micro-services: Advantages
An application is made of components that communicate data:
Small, simple and independent
Easier to understand and develop
Fewer dependencies
Faster to build and deploy
Reduced develop-deploy-debug cycle
Easy to migrate to data
Scales independently
Independent optimizations
Improves fault isolation
Eliminates long-term commitment to a single technology stack
Easy to embrace new technologies
Flow-based programming (FBP)
The glue between micro-services: how they connect to achieve computational goals. Makes the application truly distributed.
A → B
{ A calls B } vs. { A messages B }
• B reacts on message
• Named inputs
• Sync/async message passing
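The call-vs-message distinction above can be sketched in a few lines. This is an illustrative toy, not CLARA's actual API: the `Service` class, its `post`/`run_once` methods, and the engine lambdas are all hypothetical names.

```python
import queue

# Direct coupling: A must know B's interface and blocks on the call.
def b_process(data):
    return data.upper()

def a_calls_b(data):
    return b_process(data)  # { A calls B }

# FBP-style: B reacts to messages arriving on a named input port.
# A only knows the port name, not B's implementation.
class Service:
    def __init__(self, name, engine):
        self.name = name            # named input
        self.inbox = queue.Queue()  # queue allows sync or async delivery
        self.engine = engine

    def post(self, msg):            # A sends a message and moves on
        self.inbox.put(msg)

    def run_once(self):
        msg = self.inbox.get()      # B reacts on message
        return self.engine(msg)

b = Service("B", lambda m: m.upper())
b.post("hello")                     # { A messages B }
print(b.run_once())                 # -> HELLO
```

The message-passing version decouples the sender from the receiver's lifetime and language, which is the property CLARA exploits below.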
CLARA implements micro-services and FBP
An application is defined as a network of loosely coupled processes, called services.
Services exchange data across predefined connections by message passing, where connections are specified externally to the services.
Services can be requested from different data processing applications.
Loose coupling of services makes polyglot data access and processing solutions possible (C++, Java, Python, Fortran).
Services communicate with each other by exchanging data quanta; a shared understanding of the transient data is thus the only coupling between services.
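Since the transient data is the only coupling between services, it helps to picture a data quantum as a self-describing envelope. The `DataQuantum` class and the engine below are a hypothetical sketch of this idea, not the actual xMsg/CLARA types.

```python
from dataclasses import dataclass

# Hypothetical data quantum: the only contract services share is this
# self-describing envelope (a mime-type plus raw bytes), never each
# other's code -- which is what allows C++, Java, Python and Fortran
# engines to interoperate.
@dataclass
class DataQuantum:
    mime_type: str   # e.g. "binary/data-hdf", "application/json"
    payload: bytes

def reader_engine(dq: DataQuantum) -> DataQuantum:
    # A reader service: consumes raw bytes, emits a derived product.
    assert dq.mime_type == "binary/data-hdf"
    decoded = dq.payload.decode("ascii")   # stand-in for real HDF decoding
    return DataQuantum("text/plain", decoded.upper().encode("ascii"))

out = reader_engine(DataQuantum("binary/data-hdf", b"radiance"))
print(out.mime_type, out.payload)          # -> text/plain b'RADIANCE'
```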
CLARA architecture
[Diagram: an orchestration layer and front end (FE) sit on top of a service bus (xMsg); the service layer consists of Data Processing Environments (DPEs), each hosting service containers (SC) with services (S) and a local registration; a gateway provides security, and the FE keeps the global registration.]
CLARA Service
A service engine (SE) deployed in a service container.
A service container may handle one request at a time, or multiple simultaneous requests via multiple engine instances in one container.
C++ engine: https://claraweb.jlab.org/clara//docs/quickstart/cpp.html
Java engine: https://claraweb.jlab.org/clara//docs/quickstart/java.html
Python engine: https://claraweb.jlab.org/clara//docs/quickstart/python.html
Data Unit (event) Stream Processing
Data-driven, data-centric design.
The focus is on transient data (event) modifications. The advantage over algorithm-driven design is that much greater ignorance of the data processing code is allowed (loose coupling).
Design = service composition + event routing.
Self-routing (no routing scheduler)
The event routing graph defines the application algorithm
Algorithm Examples
S1 + S2 + S3 + S4;
S3 + S5 + S6;
[Diagram: a chain S1 → S2 → S3 → S4, with S3 also branching to S5 → S6.]

S1 + S2 + S3;
while (S3 == "needs calibration") { F1 + F2 + S3; }
S3 + S4 + S6;
[Diagram: a chain S1 → S2 → S3 → S4 → S6, with a calibration loop S3 → F1 → F2 → S3.]
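A linear composition string like "S1 + S2 + S3 + S4" can be read as events flowing left to right through the named services. The toy interpreter below illustrates that reading; the `compose` function, the registry, and the service behaviors are all hypothetical, not CLARA's actual composition engine.

```python
# Toy interpreter for a linear service composition such as "S1 + S2 + S3":
# each service is a function over an event, and the event routes itself
# through the chain (no central routing scheduler).
def compose(spec, registry):
    chain = [registry[name.strip()] for name in spec.split("+")]
    def route(event):
        for service in chain:
            event = service(event)   # each hop hands the event to the next
        return event
    return route

# Illustrative service engines: each appends its processing step.
registry = {
    "S1": lambda e: e + ["decoded"],
    "S2": lambda e: e + ["calibrated"],
    "S3": lambda e: e + ["reconstructed"],
    "S4": lambda e: e + ["written"],
}

pipeline = compose("S1 + S2 + S3 + S4", registry)
print(pipeline([]))   # -> ['decoded', 'calibrated', 'reconstructed', 'written']
```

Branches and loops (the "S3 + S5 + S6" and calibration-loop examples above) would extend this with conditional routing rather than a fixed chain.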
CLARA Cloud
[Diagram: a user front-end connecting to Data Centers 1…N (e.g. JLAB); each data center exposes data buses (e.g. raw data, EBCMD grid) and a service bus of CLARA services.]
The CLARA FE hosts:
• Meta-data Database
• Data Server
• Application Server
• Web Server (JNLP support)
• Data Visualization Server

CLARA Distributed Data Provisioning
[Diagram: a distributed client layer of user front-ends spanning Data Buses 1…N, the service bus, and Data Centers 1…N.]
Inter-agency collaboration: NASA Information And Data System (NAIADS) for Earth Science Data Fusion and Analytics
Dr. Constantine Lukashin, Principal Investigator
NASA Langley Research Center, Hampton, VA
Aron Bartle, Co-Investigator
Mechdyne, Virginia Beach, VA
Dr. Vardan Gyurjyan, Co-Investigator
Thomas Jefferson National Accelerator Facility, Newport News, VA
Dr. Carlos Roithmayr, Co-Investigator
NASA Langley Research Center, Hampton, VA
Dr. Jun Wang, Collaborator
University of Nebraska, Lincoln, NE
Chris Currey, Collaborator
NASA Langley Research Center, Hampton, VA
Dr. Gagik Gavalian, Collaborator
Thomas Jefferson National Accelerator Facility, Newport News, VA
John Kusterer, Collaborator
NASA Langley Research Center, Hampton, VA
Proposal submitted in response to NASA Research Announcement NNH14ZDA001N – AIST Research Opportunities in
Space and Earth Sciences (ROSES) 2014, A41: Advanced Information Systems Technology
July 9, 2014
NASA Langley Interest in JLAB CLARA framework
V. Gyurjyan, et al., "CLARA: A Contemporary Approach to Physics Data Processing", Journal of Physics: Conference Series 331 (2011) 03201
Meetings at the administrative and technical levels
Created inter-agency collaboration
Submitted the proposal in response to NASA Research Announcement NNH14ZDA001N – AIST, Research Opportunities in Space and Earth Sciences (ROSES) 2014, A41: Advanced Information Systems Technology
The NAIADS proposal was one of 24 projects selected for federal funding, with a $1M budget for a 2-year period.
• NASA Science Mission Directorate Research Opportunities in Space and Earth Sciences – 2014 NNH14ZDA001N-AIST
A.41 Advanced Information Systems Technology (AIST)
• NASA's Science Mission Directorate, NASA Headquarters, Washington, DC, has selected proposals for the Advanced Information Systems Technology Program (AIST- 14) in support of the Earth Science Division (ESD). The AIST-14 will provide technologies to reduce the risk and cost of evolving NASA information systems to support future Earth observation and to transform those observations into Earth information.
• Through ESD’s Earth Science Technology Office a total of 24 proposals will be awarded over a 2-year period. The total amount of all the awards is roughly $25M.
• Increase the accessibility and utility of Earth science data and models, and
• Enable new Earth observation measurements and information products.
• A total of 124 proposals were evaluated of which 24 have been selected for award.
NAIADS Data fusion and processing
• OBJECTIVES:
• To demonstrate the NAIADS approach and full functionality using existing data; to benchmark NAIADS performance. Available data: 9 years of near-coincident measurements from SCIAMACHY and MODIS. Create a new fused SCIAMACHY/MODIS/ECMWF data product (requested by a number of projects).
• SCIAMACHY Level-1 Data:
• Spectral measurement for every footprint: 30 km x 230 km; swath 950 km (4 footprints) from a 10 AM Sun-synch orbit.
• ECMWF Data (re-analysis):
• Gridded (0.125°); 6 weather parameters; map every 6 hours.
• MODIS/Terra Level-2 Data:
• Level-2 cloud and aerosol data; spatial scale: 1 / 5 km and 10 km; swath 2300 km (global coverage daily); 10:30 AM Sun-synch orbit.
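At its simplest, fusing SCIAMACHY footprints with the gridded ECMWF fields means mapping each footprint center to a grid cell. The sketch below assumes a global regular 0.125° lat/lon grid starting at (-90°, 0°); the actual NAIADS gridding convention and collocation logic are far more involved, and the function name is hypothetical.

```python
def ecmwf_grid_index(lat, lon, resolution=0.125):
    """Map a footprint center (deg) to the nearest cell of a regular
    global lat/lon grid at the given resolution.

    Assumes the grid origin is at (-90 deg, 0 deg); the real NAIADS
    convention may differ.
    """
    row = round((lat + 90.0) / resolution)
    col = round((lon % 360.0) / resolution)
    return row, col

# A SCIAMACHY footprint center falls into one weather grid cell:
print(ecmwf_grid_index(36.9, 283.7))   # -> (1015, 2270)
```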
NAIADS deployment on AWS
AWS c4.8xlarge instances: 36 vCPUs ~= 18 physical cores
Staging data from AWS S3
Data processing rate based on average workflow execution over 10 SCIAMACHY files
Vertical scaling up to 1.4 kHz on a single AWS node
Continuous web monitoring of data processing
9 years of data have been processed; more data is currently being processed.
New inter-agency collaboration
A new interagency agreement between NASA LaRC and DOE JLAB to use CLARA for multi-satellite data processing.
Total budget: $250K for the period from July 1st 2017 to July 1st 2018.
Project PI: Paul W. Stackhouse
Project goal
Demonstrate Data Production Improvement for NASA's Surface Radiation Budget (SRB) Project.
Successful demonstration of CLARA for SRB will open opportunities for NASA missions and R&A to utilize the framework to move to cloud computing environments.
SRB is a general data fusion project and is thus similar to other global data processing projects where data sets from more than a single instrument are processed (i.e., CERES, DSCOVR, etc.).
Requirements:
SRB is planning to process 34+ years of data for scientific analysis and societal benefits
Multi-satellite fusion and increasing spatial resolution
Become "operational" for regular processing of new observations to lengthen the record
Benefits:
Improved data production capability to enable efficient production (at least a factor of 10 speed-up)
Support additional fusion data sets
Support higher data resolution
Support faster reprocessing with improved inputs/algorithms
The project also demonstrates utilization/adaptation of older vintage codes (i.e., Fortran 77) for the cloud environment.
SRB Data Production Flow
ISCCP = International Satellite Cloud Climatology Project. ISCCP processes all the world's geosynchronous and NOAA/MODIS polar orbiting data; SRB reads and processes these data in 2 stages.

ISCCP HXS Files: 70 GB/month
Stage 1 Fortran code: 9 hours/month (1°x1°), 14 hours/month (0.5°x0.5°)
Stage 1 Files: 114 GB/month (1°), 114 GB/month (0.5°)
Stage 2 Fortran code: 5 hours/month (1°), 7 hours/month (0.5°)
Stage 2 Files: 2.6 GB/month (1°), 10 GB/month (0.5°)

Total Time/Storage for 34 years (1983-2016):
• ISCCP HXS: 29TB
• Stage 1 processing: 153 days (1°), 238 days (0.5°). 2200 lines of Fortran code.
• Stage 1 files: 47TB each at 1° and 0.5°
• Stage 2 processing: 85 days (1°), 119 days (0.5°). 1700 lines of Fortran code.
• Stage 2 files: 1.1 TB (1°), 4.1 TB (0.5°)
21 hours to process 1 month of data
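The 34-year totals follow directly from the per-month Fortran timings, as a quick arithmetic check confirms:

```python
months = 34 * 12                     # 1983-2016: 408 monthly files

# Stage 1 Fortran: 9 h/month (1 deg), 14 h/month (0.5 deg)
assert months * 9 / 24 == 153.0      # days at 1 deg, as quoted
assert months * 14 / 24 == 238.0     # days at 0.5 deg

# Stage 2 Fortran: 5 h/month (1 deg), 7 h/month (0.5 deg)
assert months * 5 / 24 == 85.0
assert months * 7 / 24 == 119.0

# One month at 0.5 deg through both stages: 14 + 7 = 21 hours,
# matching the quoted single-node, single-threaded figure.
# The CLARA flow quoted later does the same month in 45 minutes:
print(21 * 60 / 45)                  # -> 28.0x speed-up
```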
NAIADS-SRB CLARA Data Flow Diagram
[Diagram: File → Event Builder CLARA Service (C++) → Grid-pixel Dispatcher (C++) → Stage 1 (Fortran) → Grid-pixel Container (C++) → Stage 2 CLARA Service (Fortran) → Writer CLARA Service (C++) → File; when the grid container is empty, the flow advances to the next month/day/hour grid.]
45 minutes to process 1 month of data
• Single node
• Single-threaded Stage 1
Conclusions
CLARA: a novel approach to contemporary (big, distributed) scientific data processing.
Based on micro-services technology, implementing a subset of the FIPA specifications.
Capable of preserving and using decades of experience and heritage code.
Makes possible the transition of monolithic software applications to a micro-services architecture, capable of addressing modern data processing imperatives (speed, agility, scalability) on contemporary hardware infrastructures (cloud computing, vertical and horizontal scaling, etc.)
Makes inter-disciplinary data correlation studies a reality.
Successful inter-agency (DOE JLAB and NASA LaRC) collaborations.
Ready to try?
https://claraweb.jlab.org/clara