CLARA: Data-Stream Processing Framework · 2020-05-14
CLARA: Data-Stream Processing Framework
Framework for streaming readout
V. Gyurjyan for JLAB EPSCI Group
Outline
• Problem statement and possible solutions
  - Hardware diversification
  - Parallelization
  - Streaming
• Micro-services architecture
  - Micro-service vs monolith
• Flow based programming paradigm
  - Reactive communication
• CLARA: reactive data-stream processing framework that implements micro-services architecture and FBP
• Summary
Problem we face
[Figure: global digital data volume in exabytes (vertical axis 0 to 40), scientific vs. non-scientific]
Experiment | Conditions | Event rate | Data rate | Comments
Moller | Production/integrated mode | 1920 Hz | 130 MB/s | Can be handled with the traditional DAQ.
EIC | L = 10^34 cm^-2 s^-1; ~10 KHz/µb; track multiplicity = ~5 | 450-550 Hz | 20-25 GB/s, not including the vertex tracker, which will generate ~240 GB/s | Background noise rates not included. The JLAB EIC detector design will have millions of channels: the non-vertex detectors combined will have ~1M channels, plus an estimated 20-50M channels for the vertex detector. In total ~1000 ROCs. Control nightmare (starting/stopping a run). Streaming readout has fewer control requirements.
TIDIS | rTPC hit rates enormous (~800 KHz/pad) | | 4 GB/s | How to match up Super BigBite detected electrons with rTPC detected spectator protons is a big question. Conventional triggered DAQ will be challenged.
SoLID | 30 sector GEM | | 30 GB/s | 30 separate DAQs, each 1 GB/s? How to combine GEM readout with other detectors? Handling GEM hits sharing adjacent sectors.
CLAS12 | Phase 2 | 100 KHz | 5-7 GB/s |
CPU based architecture limitations
• MIPS/clock speed plateau
• Squeezing more cores per chip becomes difficult
A Roadmap for HEP Software and Computing R&D for the 2020s. HEP Software Foundation, Feb. 2018
“Frameworks face the challenge of handling the massive parallelism and heterogeneity that will be present in future computing facilities, including multi-core and many-core systems, GPUs, Tensor Processing Units (TPUs), and tiered memory systems, each integrated with storage and high-speed network interconnections.”
The Scale-Cube
[Figure: the scale cube. Recoverable labels: "Y-axis"; "Scale by splitting similar things, such as events. Multi-threading"; "Vertical scaling or multi-threading"]
Source: The Art of Scalability, by Martin L. Abbott and Michael T. Fisher. ISBN-13: 978-0134032801
Why decomposition into independent modules
• Smaller and independent code bases. Reinforces maximum independence and isolation of functional components.
• Fault tolerant
• Overall micro-services based application evolves much faster
• No dependencies other than data (loose coupling); can run on heterogeneous hardware and software infrastructures.
• Relatively easy evolution, due to
  - Requirement changes
  - Environment changes
  - Errors or security breaches
  - New equipment added or removed
  - Improvements to the system
• Encourages contribution and inclusion of new technologies
Micro-services vs Monolithic architecture
[Figure: application layers: Presentation, Logic, Data]
Monolithic architecture
Pros
• Strong coupling, network independent
• Full control of your application
Cons
• No agility for isolating, compartmentalizing and decoupling data processing functionalities, suitable to run on diverse hardware/software infrastructures
• No agility for rapid development or scalability
Micro-services architecture
Pros
• Technology independent
• Fast iterations
• Small teams
• Fault isolation
• Scalable
Cons
• Networking complexity (distributed system)
• Requires administration and real-time orchestration
FBP paradigm and reactive programming
Passive programming (S1, S2)
• S1: Proactive, responsible for change in S2
• S2: Passive, unaware of the dependency
Reactive programming (S1, S2)
• S1: Broadcasts its own result
• S2: Subscribes to S1 change events and changes itself
• Enables event driven stream processing
[Figure: Publisher/Producer sending stream units t0, t1, t2 to Subscriber/Consumer, with feedback to control backpressure]
Flow based programming paradigm assumes reactive programming
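The publisher/subscriber picture with backpressure feedback can be sketched in a few lines of Python. This is an illustration only, not CLARA code: the names `publisher`, `subscriber` and `run` are hypothetical, and a bounded `queue.Queue` stands in for the data-stream pipe, so that a full queue blocks the producer and plays the role of the backpressure signal.

```python
import queue
import threading


def publisher(pipe, n_events):
    """Publish events; put() blocks while the pipe is full (backpressure)."""
    for i in range(n_events):
        pipe.put(i)
    pipe.put(None)  # end-of-stream marker (an assumption of this sketch)


def subscriber(pipe, results):
    """React to each published event until the end-of-stream marker arrives."""
    while True:
        event = pipe.get()
        if event is None:
            break
        results.append(event * 2)  # stand-in for real event processing


def run(n_events=100, capacity=4):
    """Connect one publisher and one subscriber through a bounded pipe."""
    pipe = queue.Queue(maxsize=capacity)
    results = []
    consumer = threading.Thread(target=subscriber, args=(pipe, results))
    consumer.start()
    publisher(pipe, n_events)
    consumer.join()
    return results
```

The subscriber never calls the publisher and the publisher never inspects the subscriber; the only coupling is the data flowing through the pipe, which is the property the reactive model relies on.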
CLARA Framework
Reactive, data-stream processing framework that implements micro-services architecture and FBP
• Provides service abstraction (data processing station) to present user algorithm (engine) as an independent service.
• Defines service communication channel (data-stream pipe) outside of the user engine.
• Stream-unit level workflow management system and API
• Defines streaming transient-data structure
• Supports C++, Java, Python languages
http://claraweb.jlab.org
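The separation between a user engine and the service that hosts it can be illustrated with a small sketch. Everything here is hypothetical (`Engine`, `ThresholdEngine`, `Service`, `run_chain` are names invented for the example) and it is not the CLARA API; the actual engine interfaces are described in the CLARA quickstart tutorials. The point is only that the engine sees configuration and data, while communication belongs to the wrapper.

```python
from abc import ABC, abstractmethod


class Engine(ABC):
    """User algorithm: knows nothing about transport or threading."""

    @abstractmethod
    def configure(self, settings: dict) -> None: ...

    @abstractmethod
    def execute(self, data: dict) -> dict: ...


class ThresholdEngine(Engine):
    """Toy engine that keeps only hits above a configurable threshold."""

    def configure(self, settings):
        self.threshold = settings.get("threshold", 0)

    def execute(self, data):
        return {"hits": [h for h in data["hits"] if h > self.threshold]}


class Service:
    """Framework-side wrapper: owns the engine and would own the pipe."""

    def __init__(self, name, engine, settings=None):
        self.name = name
        self.engine = engine
        self.engine.configure(settings or {})

    def on_data(self, data):
        return self.engine.execute(data)


def run_chain(services, data):
    """Pass one data unit through a linear chain of services."""
    for service in services:
        data = service.on_data(data)
    return data
```

For example, `run_chain([Service("filter", ThresholdEngine(), {"threshold": 5})], {"hits": [1, 6, 9]})` passes the data unit through a one-service chain.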
Publications
• CLARA: A Contemporary Approach to Physics Data Processing, 2011, J. Phys.: Conf. Ser. 331 032013, doi:10.1088/1742-6596/331/3/032013
• Development of A Clara Service for Neutron Reconstruction, 2011, APS: 2011APS..DNP.EA024C
• Component Based Dataflow Processing Framework, 2015, IEEE, doi:10.1109/BigData.2015.7363971, ISBN: 978 1-4799-9926-2
• Earth Science Data Fusion with Event Building Approach, 2015, IEEE, doi:10.1109/BigData.2015.7363972, ISBN: 978 1-4799-9926-2
• CLARA: The CLAS12 Reconstruction and Analysis framework, 2016, J. Phys.: Conf. Ser. 762 012009, doi:10.1088/1742-6596/762/1/012009
Authors and chronology
• V. Gyurjyan, S. Mancilla, R. Oyarzun, S. Paul, A. Rodrigues
• Design: 2009
• Beta release and first application: 2010
• 3 master theses
Awards
• Research Opportunities in Space and Earth Sciences (ROSES) 2015. Funding for 3 years by NASA's Earth Science Technology Office (ESTO) and the Advanced Information Systems Technology (AIST) Program. NAIADS Project ID: AIST-14-0014. SRB Project ID: LARC-14-0014-2
Users
Documentation
Basic components and a user code interface
[Figure: basic components: Data Processing Station, Data-Stream Pipe, Orchestrator]
Data Processing Station: Data Processing Engine, Data Processing Micro-Service
Engine Tutorials
• https://claraweb.jlab.org/clara/docs/quickstart/java.html
• https://claraweb.jlab.org/clara/docs/quickstart/cpp.html
• https://claraweb.jlab.org/clara/docs/quickstart/python.html
Single Interface
Data Processing Station
• Multi-threading
• Configuration
• Communication: 0MQ, POSIX-SHM, In-memory Data Grid (IDG)
• Runtime Environment: C++, Java, Python
Language Bindings
• https://github.com/JeffersonLab/clara-java.git
• https://github.com/JeffersonLab/clara-cpp.git
• https://github.com/JeffersonLab/clara-python.git
[Figure: CLAS12 Reconstruction Application Vertical Scaling, plotted against Threads (0 to 80). Node: [email protected]]
[Figure: Amdahl's Law Curve Fit, plotted against Threads (0 to 80), P = 0.995]
configuration:
  io-services:
    writer:
      compression: 2
  services:
    MAGFIELDS:
      magfieldSolenoidMap: Symm_solenoid_r601_phi1_z1201_13June2018.dat
      magfieldTorusMap: Full_torus_r251_phi181_z251_08May2018.dat
      variation: rga_fall2018
    DCHB:
      variation: rga_fall2018
      dcGeometryVariation: rga_fall2018
      dcT2DFunc: "Polynomial"
    DCTB:
      variation: rga_fall2018
      dcGeometryVariation: rga_fall2018
    EC:
      variation: rga_fall2018
mime-types:
  - binary/data-hipo
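[Figure: average event time (msec) and processing rate (Hz) per farm node, for data sets 5038 and 4013]
Node | msec (5038) | Hz (5038) | msec (4013) | Hz (4013)
farm19 | 6.78 | 147.5 | 5.8 | 172.4
farm18 | 9.73 | 102.8 | 9.51 | 105.2
farm16 | 13.2 | 75.8 | 12.17 | 82.2
farm14 | 20.67 | 48.4 | 19.89 | 50.3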
2020-05-08 11:48:30.940: Benchmark results:
2020-05-08 11:48:30.941: average event time = 0.14 ms
2020-05-08 11:48:30.943: MAGFIELDS 2000 events total time = 0.02 s average event time = 0.01 ms
2020-05-08 11:48:30.945: FTCAL 2000 events total time = 0.26 s average event time = 0.13 ms
2020-05-08 11:48:30.946: FTHODO 2000 events total time = 0.29 s average event time = 0.15 ms
2020-05-08 11:48:30.948: FTEB 2000 events total time = 0.13 s average event time = 0.06 ms
2020-05-08 11:48:30.949: DCHB 2000 events total time = 1126.76 s average event time = 563.38 ms
2020-05-08 11:48:30.951: FTOFHB 2000 events total time = 3.93 s average event time = 1.96 ms
2020-05-08 11:48:30.952: EC 2000 events total time = 1.87 s average event time = 0.94 ms
2020-05-08 11:48:30.953: CVT 2000 events total time = 150.14 s average event time = 75.07 ms
2020-05-08 11:48:30.955: CTOF 2000 events total time = 4.75 s average event time = 2.37 ms
2020-05-08 11:48:30.956: CND 2000 events total time = 1.49 s average event time = 0.74 ms
2020-05-08 11:48:30.957: BAND 2000 events total time = 0.02 s average event time = 0.01 ms
2020-05-08 11:48:30.959: HTCC 2000 events total time = 0.11 s average event time = 0.05 ms
2020-05-08 11:48:30.960: LTCC 2000 events total time = 0.05 s average event time = 0.03 ms
2020-05-08 11:48:30.961: EBHB 2000 events total time = 1.60 s average event time = 0.80 ms
2020-05-08 11:48:30.963: DCTB 2000 events total time = 988.61 s average event time = 494.30 ms
2020-05-08 11:48:30.964: FTOFTB 2000 events total time = 3.98 s average event time = 1.99 ms
2020-05-08 11:48:30.965: EBTB 2000 events total time = 2.86 s average event time = 1.43 ms
2020-05-08 11:48:30.966: RICH 2000 events total time = 2.15 s average event time = 1.07 ms
2020-05-08 11:48:30.967: WRITER 2000 events total time = 7.09 s average event time = 3.55 ms
2020-05-08 11:48:30.968: TOTAL 2000 events total time = 2296.38 s average event time = 1148.19 ms
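A benchmark report of this kind is easy to turn into per-service time shares. The sketch below is an illustration, not a CLARA tool: the function name `time_shares`, the regular expression and the assumed line layout (service name, event count, `total time = ... s`, separated by spaces) are all assumptions of the example. The three sample lines reuse the DCHB, CVT and DCTB figures from the log.

```python
import re

# Assumed layout: "<NAME> <N> events total time = <seconds> s ..."
LINE = re.compile(
    r"(?P<name>[A-Z0-9]+)\s+(?P<n>\d+)\s+events\s+total time\s*=\s*(?P<total>[\d.]+)\s*s"
)

SAMPLE = [
    "2020-05-08 11:48:30.949: DCHB 2000 events total time = 1126.76 s average event time = 563.38 ms",
    "2020-05-08 11:48:30.953: CVT 2000 events total time = 150.14 s average event time = 75.07 ms",
    "2020-05-08 11:48:30.963: DCTB 2000 events total time = 988.61 s average event time = 494.30 ms",
]


def time_shares(lines):
    """Return each service's fraction of the summed per-service total time."""
    totals = {}
    for line in lines:
        match = LINE.search(line)
        if match and match.group("name") != "TOTAL":  # skip the summary row
            totals[match.group("name")] = float(match.group("total"))
    grand_total = sum(totals.values())
    return {name: t / grand_total for name, t in totals.items()}
```

In the full report, DCHB and DCTB together take 2115.37 s of the 2296.38 s total, about 92%, which is why the drift-chamber services are the natural candidates for acceleration.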
Data Stream Pipe
Language Bindings
• https://github.com/JeffersonLab/xmsg-java.git
• https://github.com/JeffersonLab/xmsg-cpp.git
• https://github.com/JeffersonLab/xmsg-python.git
Transport: 0MQ / POSIX_SHM / IDG
Transient Stream Unit: Google ProtoBuf
• Meta-data
• Serialization
• Encryption
Workflow orchestrator
Orchestrator
Hardware Optimizations
Service Registration/Discovery
Data-Set Handling and Distribution
Farm (batch or cloud) Interface
Application Deployment and Execution
Application Monitoring, Real-time Benchmarking
Command-Line Interface
Exception Logging and Reporting
CLAS12 Data Processing Applications
[Figure: CLAS12 data processing applications]
Data Acquisition System
Data Provisioning
Event Reconstruction Application: MAGF, FCAL, FTHODO, FTEB, DCHB, FTOFHB, EC, CVT, EBTB, FTOFTB, DCTB, EBHB, LTCC, HTCC, BAND, CND
Data Quality Assurance
CDG
Physics Analysis Application: J/ψ/TCS; π0, e−γγ; Inclusive Hadron; Meson X/VS; e−π+N; e−P; ppX
Detector reconstruction services (DRS)
• Cluster, segment finder
• Road finder
Heterogeneous data-stream processing (LDRD-2018)
[Figure: heterogeneous data-stream processing chain. Labels: Detector 1 Sector 1 ... Detector N Sector N; Flash ADCs/TDCs; FADC; VXI Crate; VXS; VTP; SSP; SIS (×4); DRS; SRS; EBS; ERS1 ... ERSN; EP; DRSX; DSTP; ANAS1, ANAS2 ... ANASN]
Stream interface services (SIS)
• Runs on FPGA
• Zero suppression
• Electronic noise removal
Sector reconstruction services (SRS)
• Geometry matching
• Partial software trigger
EB service (GT) or CODA (triggered)
Event persistency: raw data
Event reconstruction services
Detector reconstruction services (DRS)
• Full detector reconstruction
• Calibration, alignment
• Kalman filter
• Note: some DRS services might run on GPGPU/TPU
Some detectors might need other detectors' data to complete their reconstruction.
DST persistency: reconstructed data
Analysis services
Note: the code base (C++) of an FPGA-based CLARA service will be deployed on CPU first for verification and quality control, and only afterwards deployed into FPGAs. To use existing hardware in a streaming mode, data must be reduced while streaming.
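Zero suppression, listed among the stream interface services, is the simplest form of the data reduction the note asks for. The following is a minimal sketch under stated assumptions, not the FPGA or C++ implementation: the function name `zero_suppress`, the fixed pedestal and the single threshold are all hypothetical choices made for the example.

```python
def zero_suppress(samples, pedestal, threshold):
    """Keep only samples that exceed the pedestal by more than the threshold.

    Returns (sample index, pedestal-subtracted amplitude) pairs, so the
    position of every surviving sample in the original stream is preserved.
    """
    return [
        (index, value - pedestal)
        for index, value in enumerate(samples)
        if value - pedestal > threshold
    ]
```

For a channel that is mostly at pedestal, the output shrinks from one word per sample to one pair per hit, which is what allows the remaining stream to fit through existing links.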
Hall-B VTP Test-setup 2
Clara Test Implementation
Hall B VTP Test 2(emulates multi-stream system: stream/FADC)
[Figure: test layout. Labels: FADC, VTP, SRO, CSS (×10), CSA (×3), CSP, CER, 10 Gb/sec]
Legend: Stream Source, Stream Aggregator, Event Recorder, Stream Processor
Summary
• To address the 3V (volume, velocity, variety) expansion of scientific data, we need to design frameworks capable of leveraging data streams, as well as the massive parallelism and heterogeneity of future computing facilities.
• CLARA is a mature data-stream processing framework that uses a micro-services architecture and the flow-based programming paradigm; it is currently in production use at JLAB and NASA Langley.
• CLARA and JANA are being tested together on the Hall-B SRO test-setup 2 for evaluation, laying the foundation for an integrated data processing framework for future experiments at home and elsewhere.
Thank you
Backups
Structure
[Figure: CLARA structure. Labels: FE; Orchestration Layer; Service Layer; Registration; gateway; security; DPE (×4), each with SC SC and Local Registration]
Transport: Meta-Data and Data over ZeroMQ (xMsg), in-Memory Data-Grid, POSIX Shared Memory (FIPC)
Event Reconstruction Application (sub-event level parallelization)
[Figure: service graph of the event reconstruction application. Services: FCAL, FTHODO, FTEB, DCHB, FTOFTB, EC, CVT, EBTB, CTOFTB, DCTB, EBHB, LTCC, HTCC, RICH, CND, MAGF, FTOFHB]
Heterogeneous deployment algorithm
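[Figure: Farm Node. Labels: In-Process SHM; Java-DPE; In-Memory Data-Grid; FTOF, DCHBc, EC; C++-DPE; DCHBg]
Quantities: $ACR_{DCHB\text{-}GPU}(t_i)$, $ACR_{DCHB\text{-}CPU}(t_i)$, $ACR_{FTOF}(t_i)$
$P_g = \frac{\sum_i ACR_{FTOF}(t_i)}{\sum_i ACR_{DCHB\text{-}GPU}(t_i)}$
$P_c = \frac{\sum_i ACR_{FTOF}(t_i)}{\sum_i ACR_{DCHB\text{-}CPU}(t_i)}$
If $P_g < P_c$, route the data-stream through DCHBg.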
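The routing rule on this slide compares two ratios built from sampled ACR values. A minimal sketch of that comparison follows, assuming the samples are already available as plain lists of numbers; the function name `route_dchb` and the list-based inputs are hypothetical, and the slide does not say how the samples are collected or what the summation range is.

```python
def route_dchb(acr_ftof, acr_dchb_gpu, acr_dchb_cpu):
    """Choose the DCHB variant using the P_g < P_c rule from the slide.

    Each argument is a sequence of ACR samples taken at times t_i.
    """
    p_g = sum(acr_ftof) / sum(acr_dchb_gpu)
    p_c = sum(acr_ftof) / sum(acr_dchb_cpu)
    return "DCHBg" if p_g < p_c else "DCHBc"
```

Because both ratios share the same numerator, the rule reduces to comparing the summed ACR of the GPU variant with that of the CPU variant.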
Data-quantum size and GPU occupancy
[Figure: Farm Node. Labels: In-Process SHM; Java-DPE; In-Memory Data-Grid; FTOF, EC; C++-DPE; DCHBg]
Set data-quantum size
Data-processing chain per NUMA
[Figure: Farm Node with two Java-DPEs, labelled NUMA 0 and NUMA 1, each with In-Process SHM, and backpressure control]
Start DPE pinned to a NUMA socket.
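Pinning a process to one NUMA socket can be sketched with the Python standard library on Linux. This is an illustration of the idea, not the way CLARA starts its DPEs: `parse_cpulist` and `pin_to_numa_node` are hypothetical helpers, the sysfs path is the standard Linux location for a node's CPU list, and the sketch restricts CPU affinity only (memory placement would need a separate mechanism).

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-3,8-11' into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_to_numa_node(node):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        cpus = parse_cpulist(handle.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

Threads started after the call inherit the affinity mask, so a processing chain launched this way stays on the cores of the chosen socket.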
Results
[Figure: Rate vs. Threads for a Single NUMA Socket. CLAS12 Reconstruction Application: v.5.9.0, Data File: clas_004013.hipo, NUMA0. X-axis: Number of Threads (0 to 140); Y-axis: Processing Rate [Hz] (0 to 180). Series: AMD Rome, Xeon E5-2687A, Xeon Gold 6148. Inset: NUMA socket physical core limit for each node]
• AMD EPYC 7502: 1.5 GHz, 128/128, NUMA-2
• Xeon E5-2687A: 2.6 GHz, 32/32, NUMA-2
• Xeon Gold 6148: 2.4 GHz, 40/40, NUMA-4
[Figure: CLAS12 Reconstruction Application Vertical Scaling, plotted against Threads (0 to 70). Data File: clas_005038.evio.00130.hipo, Node: [email protected]: v4.3.11, Plugin 1: coatjava-6.3.1, Plugin 2: grapes-2.1]
[Figure: CLAS12 Reconstruction Application Vertical Scaling, Amdahl's Law Curve Fit, plotted against Threads (0 to 70). Series: Calculated Speedup, Amdahl's Law Speedup. Data File: clas_005038.evio.00130.hipo, Node: [email protected]: v4.3.11, Plugin 1: coatjava-6.3.1, Plugin 2: grapes-2.1]
P = 0.995
99.5% parallel efficiency over physical cores
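For reference, Amdahl's law gives the speedup on N threads as S(N) = 1 / ((1 - P) + P / N), where P is the parallel fraction of the work. The short sketch below evaluates it for the fitted value P = 0.995; the function name `amdahl_speedup` is chosen for the example and the thread counts are illustrative, not read off the plot.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Amdahl's law: speedup on `threads` workers for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


def speedup_table(parallel_fraction, thread_counts):
    """Return (threads, speedup) pairs for a list of thread counts."""
    return [(n, amdahl_speedup(parallel_fraction, n)) for n in thread_counts]
```

With P = 0.995 the formula gives a speedup of about 48.7 on 64 threads, and the asymptotic limit 1 / (1 - P) is 200, so the serial 0.5% only becomes the dominant cost at thread counts well beyond a single node.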