CLARA: Data-Stream Processing Framework · 2020-05-14

Post on 31-Jul-2020


CLARA: Data-Stream Processing Framework

Framework for streaming readout

V. Gyurjyan for JLAB EPSCI Group

gurjyan@jlab.org

Outline

• Problem statement and possible solutions
  - Hardware diversification
  - Parallelization
  - Streaming

• Micro-services architecture
  - Micro-service vs monolith

• Flow based programming paradigm
  - Reactive communication

• CLARA: reactive data-stream processing framework that implements micro-services architecture and FBP

• Summary


Problem we face

[Figure: Global Digital Data, in exabytes (axis 0 to 40), Scientific vs. Non-Scientific]

Experiment | Conditions | Event Rate | Data Rate | Comments
Moller | Production/integrated mode | 1920 Hz | 130 MB/s | Can be handled with the traditional DAQ.
EIC | L = 10^34 cm^-2 s^-1 | 450-550 Hz (background noise rates not included) | 20-25 GB/s (vertex tracker not included; it will generate ~240 GB/s) | ~10 kHz/µb, track multiplicity ~5. The JLAB EIC detector design will have millions of channels: the non-vertex detectors combined will have ~1M channels, plus an estimated 20-50M channels for the vertex detector. In total ~1000 ROCs. Control nightmare (starting and stopping a run). Streaming readout has fewer control requirements.
TIDIS | rTPC hit rates enormous (~800 kHz/pad) | | 4 GB/s | How to match up Super BigBite detected electrons with rTPC detected spectator protons is a big question. A conventional triggered DAQ will be challenged.
SoLID | 30-sector GEM | | 30 GB/s | 30 separate DAQs, each 1 GB/s? How to combine GEM readout with other detectors? Handling GEM hits shared between adjacent sectors.
CLAS12 | Phase 2 | 100 kHz | 5-7 GB/s |

CPU based architecture limitations

MIPS/clock speed plateau

Squeezing more cores per chip becomes difficult

A Roadmap for HEP Software and Computing R&D for the 2020s. HEP Software Foundation, Feb. 2018

“Frameworks face the challenge of handling the massive parallelism and heterogeneity that will be present in future computing facilities, including multi-core and many-core systems, GPUs, Tensor Processing Units (TPUs), and tiered memory systems, each integrated with storage and high-speed network interconnections.”

The Scale-Cube

[Figure: the Scale Cube. Labels recovered from the slide: "Scale by splitting similar things, such as events. Multi-threading"; "Y-axis"; "Vertical scaling or multi-threading".]

The Art of Scalability, by Martin L. Abbott and Michael T. Fisher. ISBN-13: 978-0134032801

Why decomposition into independent modules

• Smaller and independent code bases. Reinforce a maximum independence and isolation of functional components.
• Fault tolerant
• Overall micro-services based application evolves much faster
• No other dependencies other than data (loose coupling); can run on heterogeneous hardware and software infrastructures.
• Relatively easy evolution, due to
  - Requirement changes
  - Environment changes
  - Errors or security breaches
  - New equipment added or removed
  - Improvements to the system
• Encourages contribution and inclusion of new technologies

Micro-services vs Monolithic architecture

[Diagram: Presentation, Logic, Data]

Monolithic architecture

Pros
• Strong coupling, network independent
• Full control of your application

Cons
• No agility for isolating, compartmentalizing and decoupling data processing functionalities, suitable to run on diverse hardware/software infrastructures
• No agility for rapid development or scalability

Micro-services architecture

Pros
• Technology independent
• Fast iterations
• Small teams
• Fault isolation
• Scalable

Cons
• Networking complexity (distributed system)
• Requires administration and real-time orchestration

FBP paradigm and reactive programming

Passive programming (S1 → S2)
• S1: Proactive, responsible for change in S2
• S2: Passive, unaware of the dependency

Reactive programming (S1 → S2)
• S1: Broadcasts its own result
• S2: Subscribes to S1 change events and changes itself

Reactive programming enables event driven stream processing.

[Diagram: Publisher/Producer sends stream units (t0, t1, t2) to Subscriber/Consumer; feedback to control backpressure]

Flow based programming paradigm assumes reactive programming.
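The reactive relation between S1 and S2, together with the backpressure feedback, can be illustrated with a small sketch. It is not CLARA code: the class names `Publisher` and `Subscriber`, the `demo` helper, and the bounded-inbox approach to backpressure are all assumptions chosen for illustration.

```python
import queue
import threading


class Publisher:
    """S1: broadcasts its own results to whoever subscribed (hypothetical class)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, value):
        for callback in self._subscribers:
            callback(value)


class Subscriber:
    """S2: reacts to S1 change events and updates its own state (hypothetical class)."""

    def __init__(self, capacity):
        # Bounded inbox: when it is full, put() blocks the publisher.
        # This is one simple way to push backpressure upstream.
        self.inbox = queue.Queue(maxsize=capacity)
        self.state = []

    def on_event(self, value):
        self.inbox.put(value)  # blocks while the inbox is full

    def run(self, n_events):
        for _ in range(n_events):
            self.state.append(self.inbox.get() * 2)


def demo(n_events=5, capacity=2):
    s1, s2 = Publisher(), Subscriber(capacity)
    s1.subscribe(s2.on_event)  # S2 declares the dependency; S1 stays unaware of it
    worker = threading.Thread(target=s2.run, args=(n_events,))
    worker.start()
    for i in range(n_events):
        s1.publish(i)
    worker.join()
    return s2.state
```

With a single consumer thread and a FIFO inbox, the events are processed in publication order, and the publisher can never run more than `capacity` events ahead of the subscriber.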

CLARA Framework

Reactive, data-stream processing framework that implements micro-services architecture and FBP

• Provides service abstraction (data processing station) to present user algorithm (engine) as an independent service.
• Defines service communication channel (data-stream pipe) outside of the user engine.
• Stream-unit level workflow management system and API
• Defines streaming transient-data structure
• Supports C++, Java, Python languages


http://claraweb.jlab.org

Publications

• CLARA: A Contemporary Approach to Physics Data Processing, 2011, J. Phys.: Conf. Ser. 331 032013, doi:10.1088/1742-6596/331/3/032013
• Development of A Clara Service for Neutron Reconstruction, 2011, APS: 2011APS..DNP.EA024C
• Component Based Dataflow Processing Framework, 2015, IEEE, doi:10.1109/BigData.2015.7363971, ISBN: 978-1-4799-9926-2
• Earth Science Data Fusion with Event Building Approach, 2015, IEEE, doi:10.1109/BigData.2015.7363972, ISBN: 978-1-4799-9926-2
• CLARA: The CLAS12 Reconstruction and Analysis framework, 2016, J. Phys.: Conf. Ser. 762 012009, doi:10.1088/1742-6596/762/1/012009

Authors and chronology

• V. Gyurjyan, S. Mancilla, R. Oyarzun, S. Paul, A. Rodrigues
• Design: 2009
• Beta release and first application: 2010
• 3 master theses

Rewards

Research Opportunities in Space and Earth Sciences (ROSES) 2015. Funding for 3 years by NASA's Earth Science Technology Office (ESTO) and the Advanced Information Systems Technology (AIST) Program. NAIADS Project ID: AIST-14-0014. SRB Project ID: LARC-14-0014-2

Users Documentation

Basic components and a user code interface

[Diagram: Data Processing Station, Data-Stream Pipe, Orchestrator]

Data Processing Station, Data Processing Engine, Data Processing Micro-Service

Engine tutorials
• https://claraweb.jlab.org/clara/docs/quickstart/java.html
• https://claraweb.jlab.org/clara/docs/quickstart/cpp.html
• https://claraweb.jlab.org/clara/docs/quickstart/python.html

Single Interface
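The idea of presenting a user algorithm through a single interface, with the station handling everything around it, can be sketched as follows. This is not the CLARA engine API (the tutorials linked on this slide are the reference for that); `Engine`, `ThresholdEngine`, `Station`, and the method names `configure` and `execute` are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod


class Engine(ABC):
    """Hypothetical single interface that a user algorithm implements."""

    @abstractmethod
    def configure(self, settings: dict) -> None:
        """Receive configuration before any data arrives."""

    @abstractmethod
    def execute(self, data: dict) -> dict:
        """Process one stream unit and return the result."""


class ThresholdEngine(Engine):
    """Example user algorithm: keep hits at or above a configurable threshold."""

    def __init__(self):
        self.threshold = 0.0

    def configure(self, settings):
        self.threshold = float(settings.get("threshold", 0.0))

    def execute(self, data):
        return {"hits": [h for h in data["hits"] if h >= self.threshold]}


class Station:
    """Hypothetical data processing station: wraps an engine, owns the I/O.

    The engine only sees plain data; transport, threading and configuration
    delivery would live here, outside of the user code.
    """

    def __init__(self, engine: Engine, settings: dict):
        self.engine = engine
        self.engine.configure(settings)

    def process(self, stream):
        for unit in stream:
            yield self.engine.execute(unit)
```

A station built as `Station(ThresholdEngine(), {"threshold": 2})` can then be fed any iterable of stream units, and the engine itself never touches the transport.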

Data Processing Station

Multi-threading

Configuration

Communication
• 0MQ
• POSIX-SHM
• In-memory Data Grid (IDG)

Runtime Environment
• C++
• Java
• Python

Language Bindings
• https://github.com/JeffersonLab/clara-java.git
• https://github.com/JeffersonLab/clara-cpp.git
• https://github.com/JeffersonLab/clara-python.git

[Figure: CLAS12 Reconstruction Application Vertical Scaling. X-axis: Threads (0 to 80); Y-axis range 0 to 0.06. Node: Intel Xeon E5-2697A v4 @ 2.6 GHz]

[Figure: Amdahl's Law Curve Fit. X-axis: 0 to 80; Y-axis: 0 to 60. P = 0.995]
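For reference, the Amdahl's law curve behind the fit can be evaluated directly. The sketch below uses the standard formula, speedup(n) = 1 / ((1 - P) + P / n), with the parallel fraction P = 0.995 quoted on the slide; the function names are illustrative.

```python
def amdahl_speedup(n_threads: int, p: float = 0.995) -> float:
    """Ideal speedup on n_threads when a fraction p of the work is parallel."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return 1.0 / ((1.0 - p) + p / n_threads)


def parallel_efficiency(n_threads: int, p: float = 0.995) -> float:
    """Speedup divided by the number of threads (1.0 means perfect scaling)."""
    return amdahl_speedup(n_threads, p) / n_threads


if __name__ == "__main__":
    for n in (1, 8, 16, 32, 64):
        print(n, round(amdahl_speedup(n), 2), round(parallel_efficiency(n), 3))
```

With P = 0.995 the serial fraction is 0.5%, so the speedup saturates at 1 / (1 - P) = 200 no matter how many threads are added.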

configuration:
  io-services:
    writer:
      compression: 2
  services:
    MAGFIELDS:
      magfieldSolenoidMap: Symm_solenoid_r601_phi1_z1201_13June2018.dat
      magfieldTorusMap: Full_torus_r251_phi181_z251_08May2018.dat
      variation: rga_fall2018
    DCHB:
      variation: rga_fall2018
      dcGeometryVariation: rga_fall2018
      dcT2DFunc: "Polynomial"
    DCTB:
      variation: rga_fall2018
      dcGeometryVariation: rga_fall2018
    EC:
      variation: rga_fall2018
mime-types:
  - binary/data-hipo

[Bar chart: msec and Hz values per farm node, for data sets 5038 and 4013]

Node | msec (5038) | Hz (5038) | msec (4013) | Hz (4013)
farm19 | 6.78 | 147.5 | 5.8 | 172.4
farm18 | 9.73 | 102.8 | 9.51 | 105.2
farm16 | 13.2 | 75.8 | 12.17 | 82.2
farm14 | 20.67 | 48.4 | 19.89 | 50.3

2020-05-08 11:48:30.940: Benchmark results:
2020-05-08 11:48:30.941: average event time = 0.14 ms
2020-05-08 11:48:30.943: MAGFIELDS 2000 events total time = 0.02 s average event time = 0.01 ms
2020-05-08 11:48:30.945: FTCAL 2000 events total time = 0.26 s average event time = 0.13 ms
2020-05-08 11:48:30.946: FTHODO 2000 events total time = 0.29 s average event time = 0.15 ms
2020-05-08 11:48:30.948: FTEB 2000 events total time = 0.13 s average event time = 0.06 ms
2020-05-08 11:48:30.949: DCHB 2000 events total time = 1126.76 s average event time = 563.38 ms
2020-05-08 11:48:30.951: FTOFHB 2000 events total time = 3.93 s average event time = 1.96 ms
2020-05-08 11:48:30.952: EC 2000 events total time = 1.87 s average event time = 0.94 ms
2020-05-08 11:48:30.953: CVT 2000 events total time = 150.14 s average event time = 75.07 ms
2020-05-08 11:48:30.955: CTOF 2000 events total time = 4.75 s average event time = 2.37 ms
2020-05-08 11:48:30.956: CND 2000 events total time = 1.49 s average event time = 0.74 ms
2020-05-08 11:48:30.957: BAND 2000 events total time = 0.02 s average event time = 0.01 ms
2020-05-08 11:48:30.959: HTCC 2000 events total time = 0.11 s average event time = 0.05 ms
2020-05-08 11:48:30.960: LTCC 2000 events total time = 0.05 s average event time = 0.03 ms
2020-05-08 11:48:30.961: EBHB 2000 events total time = 1.60 s average event time = 0.80 ms
2020-05-08 11:48:30.963: DCTB 2000 events total time = 988.61 s average event time = 494.30 ms
2020-05-08 11:48:30.964: FTOFTB 2000 events total time = 3.98 s average event time = 1.99 ms
2020-05-08 11:48:30.965: EBTB 2000 events total time = 2.86 s average event time = 1.43 ms
2020-05-08 11:48:30.966: RICH 2000 events total time = 2.15 s average event time = 1.07 ms
2020-05-08 11:48:30.967: WRITER 2000 events total time = 7.09 s average event time = 3.55 ms
2020-05-08 11:48:30.968: TOTAL 2000 events total time = 2296.38 s average event time = 1148.19 ms
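One way to read the benchmark log is to ask which services dominate the per-event cost. The short calculation below copies the total times from the log (only the four largest entries, plus the logged TOTAL) and computes each share; the dictionary and function names are illustrative.

```python
# Total times in seconds for 2000 events, copied from the benchmark log.
TOTAL_S = 2296.38
SERVICE_TOTAL_S = {
    "DCHB": 1126.76,
    "DCTB": 988.61,
    "CVT": 150.14,
    "WRITER": 7.09,
}


def share(service: str) -> float:
    """Fraction of the logged TOTAL time spent in one service."""
    return SERVICE_TOTAL_S[service] / TOTAL_S


def top_services(n: int) -> list:
    """The n services with the largest total time, slowest first."""
    ranked = sorted(SERVICE_TOTAL_S, key=SERVICE_TOTAL_S.get, reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    for name in top_services(3):
        print(f"{name}: {100 * share(name):.1f}% of total time")
```

By this measure the two drift-chamber services (DCHB and DCTB) together account for roughly 92% of the logged total, so they are the natural first targets for parallelization or accelerator offloading.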

Data Stream Pipe

Language Bindings
• https://github.com/JeffersonLab/xmsg-java.git
• https://github.com/JeffersonLab/xmsg-cpp.git
• https://github.com/JeffersonLab/xmsg-python.git

0MQ / POSIX_SHM / IDG

Transient Stream Unit: Google ProtoBuf
• Meta-data
• Serialization
• Encryption
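A transient stream unit that carries meta-data alongside a serialized payload can be sketched as below. CLARA uses Google ProtoBuf for this; the sketch deliberately swaps in a length-prefixed JSON header from the Python standard library so that it runs without extra dependencies. The class `StreamUnit`, its fields, and the wire layout are assumptions, not the xMsg format.

```python
import json
import struct
from dataclasses import dataclass, field


@dataclass
class StreamUnit:
    """Hypothetical envelope: meta-data plus an opaque binary payload."""

    mime_type: str
    payload: bytes
    metadata: dict = field(default_factory=dict)

    def serialize(self) -> bytes:
        # Layout (an assumption): 4-byte big-endian header length,
        # JSON header, then the raw payload bytes.
        header = json.dumps(
            {"mime_type": self.mime_type, "metadata": self.metadata}
        ).encode("utf-8")
        return struct.pack("!I", len(header)) + header + self.payload

    @classmethod
    def deserialize(cls, blob: bytes) -> "StreamUnit":
        (header_len,) = struct.unpack("!I", blob[:4])
        header = json.loads(blob[4:4 + header_len].decode("utf-8"))
        return cls(header["mime_type"], blob[4 + header_len:], header["metadata"])
```

Keeping the meta-data separate from the payload lets a pipe route or filter units by type without decoding the data itself. Encryption, the third item on the slide, is left out of the sketch.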

Workflow orchestrator


Orchestrator

Hardware Optimizations

Service Registration/Discovery

Data-Set Handling and Distribution

Farm (batch or cloud) Interface

Application Deployment and Execution

Application Monitoring, Real-time Benchmarking

Command-Line Interface

Exception Logging and Reporting
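Of the orchestrator responsibilities listed above, service registration and discovery is the easiest to show in a few lines. The sketch is an in-memory stand-in under stated assumptions: `ServiceRecord`, `Registry`, and the address strings are hypothetical, and a real deployment would keep this state in the framework's registration service rather than in a local dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRecord:
    """Hypothetical registration entry for one deployed service."""

    name: str
    address: str
    language: str


class Registry:
    """Minimal in-memory registration/discovery table (illustrative only)."""

    def __init__(self):
        self._records = {}

    def register(self, record: ServiceRecord) -> None:
        self._records[(record.name, record.address)] = record

    def unregister(self, name: str, address: str) -> None:
        self._records.pop((name, address), None)

    def discover(self, name=None, language=None) -> list:
        """Return matching records, sorted for a stable order."""
        matches = (
            r for r in self._records.values()
            if (name is None or r.name == name)
            and (language is None or r.language == language)
        )
        return sorted(matches, key=lambda r: (r.name, r.address))
```

An orchestrator could use `discover(name="DCHB")` to find every node that currently hosts that service before deciding where to route a data set.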

CLAS12 Data Processing Applications


MAGF FCAL FTHODO FTEB DCHB FTOFHB EC CVT

EBTB FTOFTB DCTB EBHB LTCC HTCC BAND CND

CDG

Data Quality Assurance

Event Reconstruction Application

J/ψ/TCS; π0, e−γγ; Inclusive Hadron; MesonX/VS; e−π+N; e−P; ppX

Data Provisioning

Physics Analysis Application

Data Acquisition System

Detector reconstruction services (DRS)
• Cluster, segment finder
• Road finder

Heterogeneous data-stream processing (LDRD-2018)

[Diagram: streaming pipeline from Detector 1 / Sector 1 through Detector N / Sector N. Hardware labels: Flash ADCs/TDCs, FADC, VXI Crate, VXS, SSP, VTP. Service labels: SIS (×4), DRS (×2), SRS, DRS X, EBS, EP, ERS 1 ... ERS N, DSTP, ANAS 1, ANAS 2 ... ANAS N.]

Stream interface services (SIS)
• Runs on FPGA
• Zero suppression
• Electronic noise removal

Sector reconstruction services (SRS)
• Geometry matching
• Partial software trigger

Detector reconstruction services (DRS)
• Full detector reconstruction
• Calibration, alignment
• Kalman filter
• Note: some DRS services might run on GPGPU/TPU

EB service (GT) or CODA (triggered)

Event persistency: raw data

Event reconstruction services

DST persistency: reconstructed data

Analysis services

Some detectors might need other detectors' data to complete their reconstruction.

Note: the code base (C++) of an FPGA-based CLARA service will be deployed on CPU first for verification and quality control, and only afterwards will be deployed into FPGAs. To use existing hardware in a streaming mode, data must be reduced while streaming.
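Zero suppression, listed among the stream interface services, is the simplest form of the in-stream data reduction this slide calls for. The sketch below shows the idea on the CPU in Python; the slide describes these services as C++ code destined for FPGAs, so this is only an illustration. The function name, the fixed pedestal, and the threshold value are assumptions.

```python
def zero_suppress(samples, pedestal, threshold):
    """Keep only samples that rise above pedestal + threshold.

    Returns (index, amplitude) pairs, where amplitude is the
    pedestal-subtracted value. Everything else is dropped from the stream.
    """
    return [
        (i, s - pedestal)
        for i, s in enumerate(samples)
        if s - pedestal > threshold
    ]


if __name__ == "__main__":
    # Hypothetical ADC samples around a pedestal of 100 counts.
    adc = [100, 101, 150, 99, 180, 102]
    print(zero_suppress(adc, pedestal=100, threshold=10))
```

Keeping the sample index with each surviving amplitude preserves the timing information that later reconstruction stages need.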

Hall-B VTP Test-setup 2

Clara Test Implementation

Hall B VTP Test 2(emulates multi-stream system: stream/FADC)

[Diagram: FADC modules feed the VTP, followed by CLARA services. Labels: FADC (×2), VTP, SRO10, CSS (×10), CSA (×3), CSP, CER, 10 Gb/sec. Legend: Stream Source, Stream Aggregator, Stream Processor, Event Recorder.]

Summary

• To address the 3V (volume, velocity, variety) expansion of scientific data, we need to design frameworks capable of leveraging data streams, as well as the massive parallelism and heterogeneity of future computing facilities.

• CLARA is a mature data-stream processing framework that uses micro-services architecture and the flow-based programming paradigm. It is currently in production use at JLAB and NASA Langley.

• CLARA and JANA are being tested together on the Hall-B SRO test-setup 2 for evaluation, laying a foundation for an integrated data processing framework for future experiments at home and elsewhere.

Thank you

Backups

Structure

[Diagram: CLARA structure. Layers: FE, Service Layer, Orchestration Layer. Components: Registration, gateway, security, DPE (×4, each shown with two SC boxes), Local Registration (×4). Transport for Meta-Data and Data: ZeroMQ (xMsg), in-Memory Data-Grid, POSIX Shared Memory (FIPC).]

Event Reconstruction Application (sub-event level parallelization)

[Diagram: service composition graph. Nodes: FCAL, FTHODO, FTEB, DCHB, FTOFTB, EC, CVT, EBTB, CTOFTB, DCTB, EBHB, LTCC, HTCC, RICH, CND, MAGF, FTOFHB.]

Heterogeneous deployment algorithm

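[Diagram: Farm Node with a Java-DPE and a C++-DPE. Labels: In-Process SHM, In-Memory Data-Grid, FTOF, DCHBc, EC, DCHBg. Monitored quantities: $ACR_{DCHB\text{-}GPU}(t_i)$, $ACR_{DCHB\text{-}CPU}(t_i)$, $ACR_{FTOF}(t_i)$.]

$$P_g = \frac{\sum_i ACR_{FTOF}(t_i)}{\sum_i ACR_{DCHB\text{-}GPU}(t_i)}, \qquad P_c = \frac{\sum_i ACR_{FTOF}(t_i)}{\sum_i ACR_{DCHB\text{-}CPU}(t_i)}$$

If $P_g < P_c$, route the data-stream through DCHBg.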
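The routing rule on this slide can be written out as a small decision function, reading Pg and Pc as the ratio of summed ACR samples for FTOF to summed ACR samples for the GPU and CPU variants of DCHB. The sample values in the usage example are hypothetical, the function names are illustrative, and the fallback to DCHBc when the condition does not hold is an assumption, since the slide states only the Pg < Pc case.

```python
def acr_ratio(acr_ftof, acr_dchb):
    """Ratio of summed FTOF samples to summed DCHB samples over the same t_i."""
    return sum(acr_ftof) / sum(acr_dchb)


def choose_route(acr_ftof, acr_dchb_gpu, acr_dchb_cpu):
    """Apply the slide's rule: if Pg < Pc, route through DCHBg."""
    p_g = acr_ratio(acr_ftof, acr_dchb_gpu)
    p_c = acr_ratio(acr_ftof, acr_dchb_cpu)
    # Assumption: the CPU variant (DCHBc) is used whenever the rule does not fire.
    return "DCHBg" if p_g < p_c else "DCHBc"


if __name__ == "__main__":
    # Hypothetical monitoring samples, not measured values.
    ftof = [10.0, 10.0]
    print(choose_route(ftof, acr_dchb_gpu=[40.0, 40.0], acr_dchb_cpu=[20.0, 20.0]))
```

Because both ratios share the same numerator, the comparison reduces to comparing the two DCHB sums. Keeping the FTOF term makes each ratio a dimensionless quantity that can be monitored on its own.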

Data-quantum size and GPU occupancy

[Diagram: Farm Node with a Java-DPE and a C++-DPE. Labels: In-Process SHM, In-Memory Data-Grid, FTOF, EC, DCHBg. Annotation: set data-quantum size.]

Data-processing chain per NUMA

[Diagram: Farm Node with two NUMA sockets, NUMA 0 and NUMA 1. Labels: Java-DPE (×2), In-Process SHM (×2), Backpressure control.]

Start DPE pinned to a NUMA socket.

Results

[Figure: Rate vs. Threads for a Single NUMA Socket. CLAS12 Reconstruction Application: v.5.9.0, Data File: clas_004013.hipo, NUMA0. X-axis: Number of Threads (0 to 140); Y-axis: Processing Rate [Hz] (0 to 180). Series: AMD Rome, Xeon E5-2687A, Xeon Gold 6148. Inset axes: 0 to 30 and 0 to 80.]

• AMD EPYC 7502: 1.5 GHz, 128/128, NUMA-2
• Xeon E-2687A: 2.6 GHz, 32/32, NUMA-2
• Xeon Gold 6148: 2.4 GHz, 40/40, NUMA-4

NUMA socket physical core limit for each node

[Figure: CLAS12 Reconstruction Application Vertical Scaling. X-axis: Threads (0 to 70); Y-axis range 0 to 0.06.]

Data File: clas_005038.evio.00130.hipo
Node: Intel Xeon E5-2697A v4 @ 2.6 GHz
Clara: v4.3.11
Plugin 1: coatjava-6.3.1
Plugin 2: grapes-2.1

[Figure: CLAS12 Reconstruction Application Vertical Scaling, Amdahl's Law Curve Fit. X-axis: 0 to 70; Y-axis: 0 to 60. Series: Calculated Speedup, Amdahl's Law Speedup.]

Data File: clas_005038.evio.00130.hipo
Node: Intel Xeon E5-2697A v4 @ 2.6 GHz
Clara: v4.3.11
Plugin 1: coatjava-6.3.1
Plugin 2: grapes-2.1

P = 0.995

99.5% parallel efficiency over physical cores