Satellite Multi-Source Data Processing
Vardan Gyurjyan 1, Constantine Lukashin 2, Paul Stackhouse 2
1. Jefferson Lab ([email protected]), 2. NASA LaRC
Outline
Scientific data processing challenges
World digital data 3V expansion
JLAB developed technology description
Technology adoption: NASA LaRC – DOE JLAB inter-agency collaborations
From Earthrise…
To NASA Earth Observing System
With our own eyes…
$60K question
Are these environmental changes reversible?
To what extent do these observed changes reflect long-term climate change?
To answer, we need long, high-quality, global satellite records, monitoring the Earth's surface and atmosphere continuously.
EOS data challenges
Climate Model Data
Expected volume of CM outputs may exceed an Exabyte in 10-15 years.
Data is distributed nationally (NASA, NOAA, DOE) and internationally (UK).
Multiple Climate & Earth System models (30-50).
Perturbed Physics Ensemble (PPE) for a single model: about 5 Petabytes.
Climate Model output format is standard NetCDF.
Earth Observation Data
Expected volume of used data products: exceeds 100 Petabytes in 10-15 years.
Data is distributed nationally (NASA, NOAA) and internationally (ESA, etc.).
NASA observational data format is standard HDF (Hierarchical Data Format).
NOAA observational data format is standard NetCDF (Network Common Data Form).
Observations from different sensors are NOT synched/merged.
Each sensor/mission has a separate data product line.
Our own challenges
HEP and NP experiments generate substantial amounts of data.
The largest experimental instruments ever built.
~100M sensors per experiment-specific detector.
Data needs to be processed fast.
E.g. LHC
40M events/sec
Select on the fly (level 1, 2 triggers) and store only the interesting events.
Reconstruct data and prepare for physics analysis.
Data volumes (4 experiments x 15 years)
Raw data: 1.6MB/sec, 3.2PB/year
Reconstructed data: 1.0MB/sec, 2.0PB/year
E.g. JLAB
CLAS raw data: 700MB/sec
GlueX raw data: 2GB/sec
SKA Radio Telescope: 700TB/sec
Global Digital Data Demography
Unprecedented 3V expansion of data:
Volumes
Velocities
Varieties
[Chart: Global Digital Data, subset of data producers in 2016 — projected future scientific data volumes in Exabytes for EOS, LHC, Facebook (FB), and Twitter (TW); scientific vs. non-scientific.]
Existing scientific data processing architectures will have difficulties handling future data volumes and data distribution.
DOE Exascale Initiative
Commercial data processing engines are well advanced (Apache Hadoop, Spark, Storm, etc.). Can we adopt them for our needs?
Data processing challenge

Data/software characteristics | Science data   | Social media data
Origin                        | Distributed    | Localized
Format                        | Diverse        | Uniform
Storage                       | Long term      | Short term
Processing unit               | File           | File
Processing steps              | Multiple       | Limited (map, reduce)
Language/technology           | Old/trusted    | Modern
Heritage code                 | Yes            | No
Yes, but…
To store, access and process data:
Optimize data migration
Bring the application to the data (as much as possible)
Data location and format agnosticism
Data in-stream processing
File-based processing
In-stream processing
We need a new approach…
Say NO to data-format racism!
We are all bytes.
CS architectures to help
Micro-services architecture
Flow-based reactive programming (FBP)
Micro-services: Advantages
An application is made of components that communicate data:
Small, simple and independent
Easier to understand and develop
Fewer dependencies
Faster to build and deploy
Reduced develop-deploy-debug cycle
Easy to migrate to data
Scales independently
Independent optimizations
Improves fault isolation
Eliminates long-term commitment to a single technology stack
Easy to embrace new technologies
Flow-based programming (FBP)
The glue between micro-services: how they connect to achieve computational goals. Makes the application truly distributed.
A → B
{ A calls B } vs. { A messages B }
• B reacts on message
• Named inputs
• Sync/async message passing
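The call-vs-message distinction above can be sketched in a few lines. This is an illustrative toy, not CLARA's actual API: the `Service` class, its `post`/`run_once` methods, and the engine lambdas are all hypothetical names.

```python
import queue

# Direct coupling: A must know B's interface and blocks on the call.
def b_process(data):
    return data.upper()

def a_calls_b(data):
    return b_process(data)  # { A calls B }

# FBP-style: B reacts to messages arriving on a named input port.
# A only knows the port name, not B's implementation.
class Service:
    def __init__(self, name, engine):
        self.name = name            # named input
        self.inbox = queue.Queue()  # queue allows sync or async delivery
        self.engine = engine

    def post(self, msg):            # A sends a message and moves on
        self.inbox.put(msg)

    def run_once(self):
        msg = self.inbox.get()      # B reacts on message
        return self.engine(msg)

b = Service("B", lambda m: m.upper())
b.post("hello")                     # { A messages B }
print(b.run_once())                 # -> HELLO
```

The message-passing version decouples the sender from the receiver's lifetime and language, which is the property CLARA exploits below.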
CLARA implements micro-services and FBP
An application is defined as a network of loosely coupled processes, called services.
Services exchange data across predefined connections by message passing, where connections are specified externally to the services.
Services can be requested from different data processing applications.
Loose coupling of services makes polyglot data access and processing solutions possible (C++, Java, Python, Fortran).
Services communicate with each other by exchanging data quanta; a shared understanding of the transient data is thus the only coupling between services.
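Since the transient data is the only coupling between services, it helps to picture a data quantum as a self-describing envelope. The `DataQuantum` class and the engine below are a hypothetical sketch of this idea, not the actual xMsg/CLARA types.

```python
from dataclasses import dataclass

# Hypothetical data quantum: the only contract services share is this
# self-describing envelope (a mime-type plus raw bytes), never each
# other's code -- which is what allows C++, Java, Python and Fortran
# engines to interoperate.
@dataclass
class DataQuantum:
    mime_type: str   # e.g. "binary/data-hdf", "application/json"
    payload: bytes

def reader_engine(dq: DataQuantum) -> DataQuantum:
    # A reader service: consumes raw bytes, emits a derived product.
    assert dq.mime_type == "binary/data-hdf"
    decoded = dq.payload.decode("ascii")   # stand-in for real HDF decoding
    return DataQuantum("text/plain", decoded.upper().encode("ascii"))

out = reader_engine(DataQuantum("binary/data-hdf", b"radiance"))
print(out.mime_type, out.payload)          # -> text/plain b'RADIANCE'
```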
CLARA architecture
[Diagram: an orchestration layer and front end (FE) sit on top of a service bus (xMsg); the service layer consists of Data Processing Environments (DPEs), each hosting service containers (SC) with services (S) and a local registration; a gateway provides security, and the FE keeps the global registration.]
CLARA Service
A service engine (SE) deployed in a service container.
A service container may handle one request at a time, or multiple simultaneous requests via multiple engine instances in one container.
C++ engine: https://claraweb.jlab.org/clara//docs/quickstart/cpp.html
Java engine: https://claraweb.jlab.org/clara//docs/quickstart/java.html
Python engine: https://claraweb.jlab.org/clara//docs/quickstart/python.html
Data Unit (event) Stream Processing
Data-driven, data-centric design.
The focus is on transient data (event) modifications. The advantage over algorithm-driven design is that much greater ignorance of the data processing code is allowed (loose coupling).
Design = service composition + event routing.
Self-routing (no routing scheduler)
The event routing graph defines the application algorithm
Algorithm Examples
S1 + S2 + S3 + S4;
S3 + S5 + S6;
[Diagram: a chain S1 → S2 → S3 → S4, with S3 also branching to S5 → S6.]

S1 + S2 + S3;
while (S3 == "needs calibration") { F1 + F2 + S3; }
S3 + S4 + S6;
[Diagram: a chain S1 → S2 → S3 → S4 → S6, with a calibration loop S3 → F1 → F2 → S3.]
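A linear composition string like "S1 + S2 + S3 + S4" can be read as events flowing left to right through the named services. The toy interpreter below illustrates that reading; the `compose` function, the registry, and the service behaviors are all hypothetical, not CLARA's actual composition engine.

```python
# Toy interpreter for a linear service composition such as "S1 + S2 + S3":
# each service is a function over an event, and the event routes itself
# through the chain (no central routing scheduler).
def compose(spec, registry):
    chain = [registry[name.strip()] for name in spec.split("+")]
    def route(event):
        for service in chain:
            event = service(event)   # each hop hands the event to the next
        return event
    return route

# Illustrative service engines: each appends its processing step.
registry = {
    "S1": lambda e: e + ["decoded"],
    "S2": lambda e: e + ["calibrated"],
    "S3": lambda e: e + ["reconstructed"],
    "S4": lambda e: e + ["written"],
}

pipeline = compose("S1 + S2 + S3 + S4", registry)
print(pipeline([]))   # -> ['decoded', 'calibrated', 'reconstructed', 'written']
```

Branches and loops (the "S3 + S5 + S6" and calibration-loop examples above) would extend this with conditional routing rather than a fixed chain.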
CLARA Cloud
[Diagram: a user front-end connecting to Data Centers 1…N (e.g. JLAB); each data center exposes data buses (e.g. raw data, EBCMD grid) and a service bus of CLARA services.]
The CLARA FE hosts:
• Meta-data Database
• Data Server
• Application Server
• Web Server (JNLP support)
• Data Visualization Server

CLARA Distributed Data Provisioning
[Diagram: a distributed client layer of user front-ends spanning Data Buses 1…N, the service bus, and Data Centers 1…N.]
Inter-agency collaboration: NASA Information And Data System (NAIADS) for Earth Science Data Fusion and Analytics
Dr. Constantine Lukashin, Principal Investigator
NASA Langley Research Center, Hampton, VA
Aron Bartle, Co-Investigator
Mechdyne, Virginia Beach, VA
Dr. Vardan Gyurjyan, Co-Investigator
Thomas Jefferson National Accelerator Facility, Newport News, VA
Dr. Carlos Roithmayr, Co-Investigator
NASA Langley Research Center, Hampton, VA
Dr. Jun Wang, Collaborator
University of Nebraska, Lincoln, NE
Chris Currey, Collaborator
NASA Langley Research Center, Hampton, VA
Dr. Gagik Gavalian, Collaborator
Thomas Jefferson National Accelerator Facility, Newport News, VA
John Kusterer, Collaborator
NASA Langley Research Center, Hampton, VA
Proposal submitted in response to NASA Research Announcement NNH14ZDA001N – AIST Research Opportunities in
Space and Earth Sciences (ROSES) 2014, A41: Advanced Information Systems Technology
July 9, 2014
NASA Langley Interest in JLAB CLARA framework
V. Gyurjyan, et al., "CLARA: A Contemporary Approach to Physics Data Processing", Journal of Physics: Conference Series 331 (2011) 03201
Meetings at the administrative and technical levels
Created inter-agency collaboration
Submitted the proposal in response to NASA Research Announcement NNH14ZDA001N – AIST, Research Opportunities in Space and Earth Sciences (ROSES) 2014, A41: Advanced Information Systems Technology
The NAIADS proposal was one of 24 projects selected for federal funding, with a $1M budget for a 2-year period.
• NASA Science Mission Directorate Research Opportunities in Space and Earth Sciences – 2014 NNH14ZDA001N-AIST
A.41 Advanced Information Systems Technology (AIST)
• NASA's Science Mission Directorate, NASA Headquarters, Washington, DC, has selected proposals for the Advanced Information Systems Technology Program (AIST- 14) in support of the Earth Science Division (ESD). The AIST-14 will provide technologies to reduce the risk and cost of evolving NASA information systems to support future Earth observation and to transform those observations into Earth information.
• Through ESD’s Earth Science Technology Office a total of 24 proposals will be awarded over a 2-year period. The total amount of all the awards is roughly $25M.
• Increase the accessibility and utility of Earth science data and models, and
• Enable new Earth observation measurements and information products.
• A total of 124 proposals were evaluated of which 24 have been selected for award.
NAIADS Data fusion and processing
• OBJECTIVES:
• To demonstrate the NAIADS approach and full functionality using existing data; to benchmark NAIADS performance. Available data: 9 years of near-coincident measurements from SCIAMACHY and MODIS. Create a new fused SCIAMACHY/MODIS/ECMWF data product (requested by a number of projects).
• SCIAMACHY Level-1 Data:
• Spectral measurement for every footprint: 30 km x 230 km; swath 950 km (4 footprints) from a 10 AM Sun-synch orbit.
• ECMWF Data (re-analysis):
• Gridded (0.125°); 6 weather parameters; map every 6 hours.
• MODIS/Terra Level-2 Data:
• Level-2 cloud and aerosol data; spatial scale: 1 / 5 km and 10 km; swath 2300 km (global coverage daily); 10:30 AM Sun-synch orbit.
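At its simplest, fusing SCIAMACHY footprints with the gridded ECMWF fields means mapping each footprint center to a grid cell. The sketch below assumes a global regular 0.125° lat/lon grid starting at (-90°, 0°); the actual NAIADS gridding convention and collocation logic are far more involved, and the function name is hypothetical.

```python
def ecmwf_grid_index(lat, lon, resolution=0.125):
    """Map a footprint center (deg) to the nearest cell of a regular
    global lat/lon grid at the given resolution.

    Assumes the grid origin is at (-90 deg, 0 deg); the real NAIADS
    convention may differ.
    """
    row = round((lat + 90.0) / resolution)
    col = round((lon % 360.0) / resolution)
    return row, col

# A SCIAMACHY footprint center falls into one weather grid cell:
print(ecmwf_grid_index(36.9, 283.7))   # -> (1015, 2270)
```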
NAIADS deployment on AWS
AWS c4.8xlarge instances: 36 vCPUs ~= 18 physical cores
Staging data from AWS S3
Data processing rate based on average workflow execution over 10 SCIAMACHY files
Vertical scaling up to 1.4 kHz on a single AWS node
Continuous web monitoring of data processing
9 years of data have been processed; more data is currently being processed.
New inter-agency collaboration
A new interagency agreement between NASA LaRC and DOE JLAB to use CLARA for multi-satellite data processing.
Total budget: $250K for the period from July 1st 2017 to July 1st 2018.
Project PI: Paul W. Stackhouse
Project goal
Demonstrate Data Production Improvement for NASA's Surface Radiation Budget (SRB) Project.
Successful demonstration of CLARA for SRB will open opportunities for NASA missions and R&A to utilize the framework to move to cloud computing environments.
SRB is a general data fusion project and is thus similar to other global data processing projects where data sets from more than a single instrument are processed (i.e., CERES, DSCOVR, etc.).
Requirements:
SRB is planning to process 34+ years of data for scientific analysis and societal benefits
Multi-satellite fusion and increasing spatial resolution
Become "operational" for regular processing of new observations to lengthen the record
Benefits:
Improved data production capability to enable efficient production (at least a factor of 10 speed-up)
Support additional fusion data sets
Support higher data resolution
Support faster reprocessing with improved inputs/algorithms
The project also demonstrates utilization/adaptation of older vintage codes (i.e., Fortran 77) for the cloud environment.
SRB Data Production Flow
ISCCP = International Satellite Cloud Climatology Project. ISCCP processes all the world's geosynchronous and NOAA/MODIS polar orbiting data; SRB reads and processes these data in 2 stages.

ISCCP HXS Files: 70 GB/month
Stage 1 Fortran code: 9 hours/month (1°x1°), 14 hours/month (0.5°x0.5°)
Stage 1 Files: 114 GB/month (1°), 114 GB/month (0.5°)
Stage 2 Fortran code: 5 hours/month (1°), 7 hours/month (0.5°)
Stage 2 Files: 2.6 GB/month (1°), 10 GB/month (0.5°)

Total Time/Storage for 34 years (1983-2016):
• ISCCP HXS: 29TB
• Stage 1 processing: 153 days (1°), 238 days (0.5°). 2200 lines of Fortran code.
• Stage 1 files: 47TB each at 1° and 0.5°
• Stage 2 processing: 85 days (1°), 119 days (0.5°). 1700 lines of Fortran code.
• Stage 2 files: 1.1 TB (1°), 4.1 TB (0.5°)
21 hours to process 1 month of data
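The 34-year totals follow directly from the per-month Fortran timings, as a quick arithmetic check confirms:

```python
months = 34 * 12                     # 1983-2016: 408 monthly files

# Stage 1 Fortran: 9 h/month (1 deg), 14 h/month (0.5 deg)
assert months * 9 / 24 == 153.0      # days at 1 deg, as quoted
assert months * 14 / 24 == 238.0     # days at 0.5 deg

# Stage 2 Fortran: 5 h/month (1 deg), 7 h/month (0.5 deg)
assert months * 5 / 24 == 85.0
assert months * 7 / 24 == 119.0

# One month at 0.5 deg through both stages: 14 + 7 = 21 hours,
# matching the quoted single-node, single-threaded figure.
# The CLARA flow quoted later does the same month in 45 minutes:
print(21 * 60 / 45)                  # -> 28.0x speed-up
```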
NAIADS-SRB CLARA Data Flow Diagram
[Diagram: File → Event Builder CLARA Service (C++) → Grid-pixel Dispatcher (C++) → Stage 1 (Fortran) → Grid-pixel Container (C++) → Stage 2 CLARA Service (Fortran) → Writer CLARA Service (C++) → File; when the grid container is empty, the flow advances to the next month/day/hour grid.]
45 minutes to process 1 month of data
• Single node
• Single-threaded Stage 1
Conclusions
CLARA: a novel approach to contemporary (big, distributed) scientific data processing.
Based on micro-services technology, implementing a subset of the FIPA specifications.
Capable of preserving and using decades of experience and heritage code.
Makes possible the transition of monolithic software applications to a micro-services architecture, capable of addressing modern data processing imperatives (speed, agility, scalability) on contemporary hardware infrastructures (cloud computing, vertical and horizontal scaling, etc.)
Makes inter-disciplinary data correlation studies a reality.
Successful inter-agency (DOE JLAB and NASA LaRC) collaborations.
Ready to try?
https://claraweb.jlab.org/clara