Architectural Software Support for Processing Clusters
Johannes Gutleber, Luciano Orsini
European Organization for Nuclear Research
Div. EP/CMD, The CMS Collaboration
CERN, 1211 Geneva 23, Switzerland
The Issue
1988: "The biggest problem with creating distributed computing systems is devising a method of intercomputer communication that is reliable, fast and simple."
J.E. Tomayko, NASA CR-182505, p. 228, Mar 1988
2000: "High-speed networks […] can obtain communication speeds close to those of supercomputers, but realizing this potential is a challenging problem."
H. Bal, ACM Op Sys Rev, p. 79, Oct 2000
The Approach
Do not…
• invest in alternative communication paradigms
• optimise communication libraries
Provide architectural software support:
• a lightweight framework for homogeneous communication
• configuration with low-level communication libraries
• plug-in application components
• support for homogeneous subsystem interface design
Architectural Software Support
• Architecture support comprises
  – a processing model
  – subsystem addressing
  – configuration and control
  – Application Programmer Interface requirements
Everything that is needed to build and operate a distributed application.
Motivation
• In large-scale data acquisition systems we have to cope with
  – long operational lifetimes (10-15 years)
  – modifications due to generation jumps (networking, processing)
  – deployment of one application in various environments
  – bridging of hardware/software performance gaps
• From this special case we can extrapolate to general cluster-based systems
  – search engines, document retrieval systems
  – plant control systems
  – medical imaging networks in hospitals
Available tools don't match the requirements.
Architecture Basis: I2O
• A specification for a hardware- and operating-system-independent device driver framework
• Targeted at collaboration between host and intelligent devices, and at intelligent device intercommunication
(Diagram: UNIX and Windows OSMs on the host and HDMs on an FPGA and an IOP, linked through the messaging layer over the PCI bus.)
I2O IOP Environment
• Inbound/outbound queues (pass frame pointers, zero-copy)
• Homogeneous frame format
• Event-driven processing
• Uniform hardware access API
(Diagram: network IRQ, HDM framework, inbound and outbound queues, handler functions foo() and bar().)
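The queue-pair model can be illustrated with a short C++ sketch. It is a minimal illustration under stated assumptions, not the XDAQ or I2O implementation: `Frame`, `QueuePair` and `EventLoop` are hypothetical names, a frame is reduced to a function code plus a payload, and handlers are keyed by that function code. What it shows is that the queues carry only frame pointers, so posting, dispatching and replying never copy the payload.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified frame: a function code that selects the handler plus its payload.
struct Frame {
    std::uint8_t function;
    std::vector<std::uint8_t> payload;
};

// Inbound/outbound queue pair. Only pointers are queued, so the payload is never copied.
class QueuePair {
public:
    void postInbound(Frame* f) { inbound_.push_back(f); }
    void postOutbound(Frame* f) { outbound_.push_back(f); }
    std::size_t outboundSize() const { return outbound_.size(); }

    // Returns the next inbound frame, or nullptr when the queue is empty.
    Frame* nextInbound() {
        if (inbound_.empty()) return nullptr;
        Frame* f = inbound_.front();
        inbound_.pop_front();
        return f;
    }

private:
    std::deque<Frame*> inbound_;
    std::deque<Frame*> outbound_;
};

// Event-driven processing: every dequeued frame triggers the handler bound to its function code.
class EventLoop {
public:
    using Handler = std::function<void(Frame&)>;

    void bind(std::uint8_t function, Handler h) { handlers_[function] = std::move(h); }

    // Drains the inbound queue and returns how many frames reached a handler.
    // Frames with no bound handler are skipped in this sketch.
    std::size_t run(QueuePair& q) {
        std::size_t dispatched = 0;
        while (Frame* f = q.nextInbound()) {
            auto it = handlers_.find(f->function);
            if (it != handlers_.end()) {
                it->second(*f);
                ++dispatched;
            }
        }
        return dispatched;
    }

private:
    std::unordered_map<std::uint8_t, Handler> handlers_;
};
```

A handler that answers a request can post the same frame pointer to the outbound queue; a real executive would presumably also return unhandled frames to a buffer pool rather than silently skipping them.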
I2O Message Frame
Used to implement an active message model. Frame layout in 32-bit words (bit 31 on the left, bit 0 on the right):
Standard frame
• Word 0: MessageSize (bits 31-16) | MessageFlags (bits 15-8) | VersionOffset (bits 7-0)
• Word 1: Function (= FFh, bits 31-24) | InitiatorAddress (bits 23-12) | TargetAddress (bits 11-0)
• Word 2: InitiatorContext, assigned by the message layer and used for routing back the reply
Private frame extension
• Word 3: TransactionContext, assigned by the application and returned in the reply (cookie)
• Word 4: XFunctionCode and OrganizationID (16 bits each)
• Word 5 onward: PrivatePayload = function parameters
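The bit positions of the first two words can be made concrete with a few pack/unpack helpers. This is a sketch assuming the layout listed for this frame (12-bit TIDs, 8-bit function code, 16-bit message size); the helper names are hypothetical, and explicit shifts are used instead of C bit-fields because bit-field ordering is implementation-defined.

```cpp
#include <cstdint>

// Sketch only: hypothetical helpers, not an XDAQ or I2O header.
constexpr std::uint32_t kTidMask = 0xFFF;        // TargetAddress/InitiatorAddress are 12 bits
constexpr std::uint8_t kPrivateFunction = 0xFF;  // Function code of private (application) frames

// Word 0: MessageSize (31-16) | MessageFlags (15-8) | VersionOffset (7-0)
constexpr std::uint32_t packWord0(std::uint16_t size, std::uint8_t flags, std::uint8_t versionOffset) {
    return (static_cast<std::uint32_t>(size) << 16) |
           (static_cast<std::uint32_t>(flags) << 8) | versionOffset;
}

// Word 1: Function (31-24) | InitiatorAddress (23-12) | TargetAddress (11-0)
constexpr std::uint32_t packWord1(std::uint32_t target, std::uint32_t initiator, std::uint8_t function) {
    return (target & kTidMask) | ((initiator & kTidMask) << 12) |
           (static_cast<std::uint32_t>(function) << 24);
}

constexpr std::uint32_t targetTid(std::uint32_t word1) { return word1 & kTidMask; }
constexpr std::uint32_t initiatorTid(std::uint32_t word1) { return (word1 >> 12) & kTidMask; }
constexpr std::uint8_t functionCode(std::uint32_t word1) { return static_cast<std::uint8_t>(word1 >> 24); }
```

For example, `packWord1(10, 20, kPrivateFunction)` addresses TID 10 on behalf of TID 20 with a private function, and the three accessors recover those values.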
I2O Messaging
• A message frame contains two addresses
  – initiatorTid: where the message comes from
  – destinationTid: the DDM/ISM it shall go to
• A message is associated with a handler function
  – predefined functions for I2O messages
  – private frame extension for application-specific messages
• Message length is limited to 256 KB; the frame should only contain control information
  – message data should go into scatter-gather lists
• I2O frame byte order is little endian
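Because I2O frames are little endian while some cluster nodes (PowerPC, SPARC) are big endian, frame words are best serialised byte by byte rather than copied from memory. The sketch below assumes that `MessageSize` counts 32-bit words, which bounds a frame at 0xFFFF × 4 = 262,140 bytes; `putLE32` and `getLE32` are illustrative names, not a library API.

```cpp
#include <cstddef>
#include <cstdint>

// Largest frame a 16-bit MessageSize can describe if it counts 32-bit words (assumption).
constexpr std::size_t kMaxFrameBytes = 0xFFFFu * 4u;  // 262140 bytes, about 256 KB

// Stores a 32-bit frame word in little-endian byte order, whatever the host byte order is.
inline void putLE32(std::uint8_t* dst, std::uint32_t value) {
    for (std::size_t i = 0; i < 4; ++i)
        dst[i] = static_cast<std::uint8_t>(value >> (8 * i));  // least significant byte first
}

// Reads a little-endian 32-bit frame word back into host representation.
inline std::uint32_t getLE32(const std::uint8_t* src) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i)
        value |= static_cast<std::uint32_t>(src[i]) << (8 * i);
    return value;
}
```

Since only shifts are used, the same code produces identical wire bytes on Intel, PowerPC and SPARC hosts.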
Peer and Peer2Peer Operations
• Peer operation uses the queue pair on one PCI segment
• Peer-to-peer commands for network communication
(Diagram components: executive, messaging layer, peer transport agent, peer transport DDM, device driver module; I2O message frames and non-I2O messages.)
I2O Peer Operation for Clusters
• Application component = device
• Processing node = IOP
• Controller node = host
• Homogeneous communication
  – frameSend for local, remote, host
  – single addressing scheme (Tid)
• Application framework
(Diagram: two nodes, each running an application on top of an executive, messaging layer and peer transport agent, exchanging I2O message frames over a peer transport.)
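Homogeneous communication, one `frameSend` call and one TID-based addressing scheme whether the target is local, remote or the host, can be sketched as a routing table inside the executive. `Executive`, `PeerTransport` and `CountingTransport` are hypothetical stand-ins, frames are reduced to strings, and the real signature of `frameSend` is not given on the slide.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

using Tid = std::uint16_t;  // 12-bit I2O TID held in 16 bits

// Hypothetical peer transport interface; concrete ones might wrap TCP, Myrinet GM, DLPI or a FIFO.
class PeerTransport {
public:
    virtual ~PeerTransport() = default;
    virtual void send(const std::string& node, Tid target, const std::string& frame) = 0;
};

// Stand-in transport that only counts frames, so routing can be exercised without a network.
class CountingTransport : public PeerTransport {
public:
    void send(const std::string&, Tid, const std::string&) override { ++sent; }
    int sent = 0;
};

class Executive {
public:
    using Handler = std::function<void(const std::string&)>;

    void addLocal(Tid tid, Handler h) { local_[tid] = std::move(h); }
    void addRemote(Tid tid, const std::string& node, PeerTransport* pt) { remote_[tid] = Route{node, pt}; }

    // One call for every destination: the caller names a TID and never sees whether
    // the target lives in this executive or behind a peer transport.
    bool frameSend(Tid target, const std::string& frame) {
        auto loc = local_.find(target);
        if (loc != local_.end()) {
            loc->second(frame);  // local delivery: call the handler directly
            return true;
        }
        auto rem = remote_.find(target);
        if (rem != remote_.end()) {
            rem->second.transport->send(rem->second.node, target, frame);
            return true;
        }
        return false;  // TID not known to this executive
    }

private:
    struct Route {
        std::string node;
        PeerTransport* transport;
    };
    std::map<Tid, Handler> local_;
    std::map<Tid, Route> remote_;
};
```

Swapping the network only means registering a different `PeerTransport` for the remote TIDs; application code calling `frameSend` stays unchanged.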
Applications are I2O Classes
• In XDAQ they are equivalent to C++ classes
• Each class exposes an interface that is implemented by the application
(Diagram: TargetAddr → ClassId, Instance → Dispatcher → Listener → DDmAdapter, UtilAdapter, UserAdapter → Application.)
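A minimal sketch of class/instance dispatch, assuming that a frame's TargetAddr resolves to a (ClassId, Instance) pair and that every application class implements a common listener interface. `Listener`, `Dispatcher` and `ReadoutUnit` are illustrative names, and the adapter layer (DDmAdapter, UtilAdapter, UserAdapter) is folded into a single `onMessage` call.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical interface that every application class implements.
class Listener {
public:
    virtual ~Listener() = default;
    virtual void onMessage(std::uint16_t xFunctionCode, const std::string& payload) = 0;
};

// Routes a frame to the application instance registered under (ClassId, Instance).
class Dispatcher {
public:
    using Key = std::pair<std::uint16_t, std::uint16_t>;  // (ClassId, Instance)

    void plug(std::uint16_t classId, std::uint16_t instance, std::shared_ptr<Listener> app) {
        apps_[Key{classId, instance}] = std::move(app);
    }

    // Returns false when no application instance is registered for the address.
    bool dispatch(std::uint16_t classId, std::uint16_t instance,
                  std::uint16_t xFunctionCode, const std::string& payload) {
        auto it = apps_.find(Key{classId, instance});
        if (it == apps_.end()) return false;
        it->second->onMessage(xFunctionCode, payload);
        return true;
    }

private:
    std::map<Key, std::shared_ptr<Listener>> apps_;
};

// Hypothetical readout-unit application that just counts what it receives.
class ReadoutUnit : public Listener {
public:
    void onMessage(std::uint16_t, const std::string& payload) override {
        ++count;
        bytes += payload.size();
    }
    int count = 0;
    std::size_t bytes = 0;
};
```

Plugging a new component then amounts to registering another `Listener` implementation under a fresh (ClassId, Instance) pair.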
Peer Transport Configurations
Polling Peer Transport Agent
+ low OS service overhead
- executive uses the CPU continuously
- peer transports must not block
Thread per Peer Transport
- higher OS service overhead
+ no CPU monopolisation
+ allows integration with other software
(Diagram: in both configurations a peer transport agent (PTA) serves the TCP, Myrinet, DLPI and FIFO peer transports.)
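The polling configuration reduces to a loop that visits every peer transport without ever waiting. The sketch below shows one polling pass; `PolledTransport` is a hypothetical stand-in for the TCP, Myrinet, DLPI or FIFO transports, not an XDAQ class.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical non-blocking transport: poll() hands over one pending frame or returns false at once.
class PolledTransport {
public:
    void inject(std::string frame) { pending_.push_back(std::move(frame)); }  // test hook

    bool poll(std::string& out) {
        if (pending_.empty()) return false;
        out = std::move(pending_.front());
        pending_.pop_front();
        return true;
    }

private:
    std::deque<std::string> pending_;
};

// One pass of a polling peer transport agent: visit every transport, drain what is ready,
// never wait. The executive would repeat this pass forever, which keeps the CPU busy, and a
// transport that blocked inside poll() would stall every other transport.
std::size_t pollOnce(std::vector<PolledTransport*>& transports, std::vector<std::string>& delivered) {
    std::size_t count = 0;
    for (PolledTransport* t : transports) {
        std::string frame;
        while (t->poll(frame)) {
            delivered.push_back(std::move(frame));
            ++count;
        }
    }
    return count;
}
```

In the thread-per-transport configuration each transport would instead block in its own thread and hand frames to the executive, paying OS scheduling overhead in exchange for not monopolising the CPU.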
I2O for Cluster Configuration
Executive tasks run on:
• RUIO (IOP480): VxWorks
• PPC (MVME2306): VxWorks, Linux
• Workstation: Intel Linux, Sparc Solaris
Boot
• Executives on each node in the cluster wait for I2O configuration messages
• Configuration and control can be done through
  – native I2O messages
  – an XML/HTTP mapping
• Parameter set/get is also done through I2O/XML
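Parameter set/get can be pictured as a named value store that an application exposes and that the XML mapping serialises. This sketch assumes string-valued parameters and invents a flat XML shape (a `<parameters>` element with one child per parameter, no escaping); the actual I2O/XML encoding is not shown on the slide, and `ParameterSet` is a hypothetical name.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical parameter store behind set/get requests.
class ParameterSet {
public:
    void set(const std::string& name, const std::string& value) { values_[name] = value; }

    // Returns the value, or std::nullopt if the parameter does not exist.
    std::optional<std::string> get(const std::string& name) const {
        auto it = values_.find(name);
        if (it == values_.end()) return std::nullopt;
        return it->second;
    }

    // Renders all parameters, sorted by name, as a flat XML fragment (invented format, no escaping).
    std::string toXml() const {
        std::string xml = "<parameters>";
        for (const auto& kv : values_)
            xml += "<" + kv.first + ">" + kv.second + "</" + kv.first + ">";
        xml += "</parameters>";
        return xml;
    }

private:
    std::map<std::string, std::string> values_;
};
```

A run-control client could then read the whole parameter set in one request and change single values with targeted set calls.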
I2O Configuration Commands
• Where (e.g. IOP 34): ExecSysTabSet
• How (e.g. TCP, DLPI, Myrinet)
• Who (e.g. RU1 – Tid 10, RU2 – Tid 20): ExecDeviceAssign
(Diagram: CMS DAQ components: detector front-end, Level 1 trigger, readout systems, builder networks, event manager, filter systems, computing services, run control.)
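Assuming ExecSysTabSet supplies the "where" and "how" of each IOP and ExecDeviceAssign the "who" (which TID lives on which IOP), an executive can combine the two tables to resolve a TID into the transport that reaches it. The types, method names and the node address below are hypothetical illustrations.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

struct IopEntry {      // filled by ExecSysTabSet: where the IOP is and how to reach it
    std::string transport;
    std::string address;
};

struct DeviceEntry {   // filled by ExecDeviceAssign: which IOP hosts a TID
    std::uint16_t iop;
    std::string name;
};

class Configuration {
public:
    void sysTabSet(std::uint16_t iop, IopEntry e) { iops_[iop] = std::move(e); }
    void deviceAssign(std::uint16_t tid, DeviceEntry d) { devices_[tid] = std::move(d); }

    // Resolves a TID to the transport entry of its IOP; empty if either table lacks an entry.
    std::optional<IopEntry> route(std::uint16_t tid) const {
        auto d = devices_.find(tid);
        if (d == devices_.end()) return std::nullopt;
        auto i = iops_.find(d->second.iop);
        if (i == iops_.end()) return std::nullopt;
        return i->second;
    }

private:
    std::map<std::uint16_t, IopEntry> iops_;
    std::map<std::uint16_t, DeviceEntry> devices_;
};
```

With the slide's example values, registering IOP 34 over TCP and assigning RU1 to Tid 10 on IOP 34 lets `route(10)` return the TCP entry.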
Ready
• What (e.g. libRU.so, libEVM.so): ExecSwDownload
(Diagram: local applications LocalApp2 and LocalApp3 and remote applications RemoteApp1, RemoteApp2 and RemoteApp3.)
Operational
(Diagram: applications App1, App2 and App3 communicating through frameSend(...), with a DdmSystemChange message.)
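The Boot, Ready and Operational slides suggest a simple life cycle for each executive. The state machine below is a hypothetical reading of that sequence, not a documented protocol: configuration commands are accepted in Boot, ExecSwDownload moves to Ready once the system table and device assignment are known, DdmSystemChange is assumed to be the trigger for Operational, and frameSend is accepted only when operational.

```cpp
#include <string>

enum class ExecState { Boot, Ready, Operational };

// Hypothetical executive life cycle inferred from the slide order.
class ExecutiveFsm {
public:
    ExecState state() const { return state_; }

    // Returns false if the command is not accepted in the current state.
    bool handle(const std::string& cmd) {
        switch (state_) {
        case ExecState::Boot:
            if (cmd == "ExecSysTabSet") { sysTabSet_ = true; return true; }
            if (cmd == "ExecDeviceAssign") { devicesAssigned_ = true; return true; }
            if (cmd == "ExecSwDownload" && sysTabSet_ && devicesAssigned_) {
                state_ = ExecState::Ready;  // software loaded on a configured node
                return true;
            }
            return false;
        case ExecState::Ready:
            if (cmd == "DdmSystemChange") { state_ = ExecState::Operational; return true; }
            return false;
        case ExecState::Operational:
            return cmd == "frameSend";  // applications now exchange frames
        }
        return false;
    }

private:
    ExecState state_ = ExecState::Boot;
    bool sysTabSet_ = false;
    bool devicesAssigned_ = false;
};
```

Rejecting out-of-order commands makes a misconfigured node visible to run control instead of letting it start exchanging frames half-configured.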
Efficiency Evolution
• Round-trip test, reporting the half round-trip time
• Overhead calculated as the difference from bare-bones use of the Myrinet GM library
(Plot: difference in µs versus time, June to November, for the original efficiency reported in the paper (450 MHz, PCI 32/33), after on-demand buffer-pool allocation (450 MHz, PCI 32/33), and on a 750 MHz, PCI 32/33 machine.)
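The measurement method translates directly into code: time many round trips, divide by twice the iteration count to get the half round-trip time, and subtract the bare GM figure to isolate the framework overhead. In this sketch `pingPong` is any callable performing one complete round trip; it is a hypothetical hook, not a GM or XDAQ function.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Mean half round-trip time in microseconds. `pingPong` must perform one full round trip;
// `iterations` must be greater than zero.
double halfRoundTripMicros(const std::function<void()>& pingPong, std::size_t iterations) {
    using clock = std::chrono::steady_clock;  // monotonic, unaffected by wall-clock changes
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) pingPong();
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    return elapsed.count() / (2.0 * static_cast<double>(iterations));
}

// Overhead attributed to the framework: its half round trip minus that of the bare transport.
double overheadMicros(double frameworkHalfRtt, double bareHalfRtt) {
    return frameworkHalfRtt - bareHalfRtt;
}
```

Averaging over many iterations hides timer resolution, which matters when the quantity of interest is only a few microseconds.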
Point To Point Efficiency
(Plot "GM/XDAQ Latencies": latency in microseconds (0 to 120) versus bytes transferred (0 to 4096) for XDAQ and GM 1.2.3, with linear trend line y = -0.0000x + 2.1289.)
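The trend line is presumably a least-squares fit of the XDAQ-minus-GM latency difference against message size: a slope that rounds to zero means the overhead is a constant of about 2.1 µs, independent of how many bytes are transferred. An ordinary least-squares line fit, as a sketch (`Line` and `fitLine` are illustrative names):

```cpp
#include <cstddef>
#include <vector>

struct Line {
    double slope;      // µs per byte when fed (bytes, µs) pairs
    double intercept;  // µs
};

// Ordinary least-squares fit y = slope * x + intercept.
// Requires x.size() == y.size() and at least two distinct x values.
Line fitLine(const std::vector<double>& x, const std::vector<double>& y) {
    const double n = static_cast<double>(x.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return {slope, (sy - slope * sx) / n};
}
```

Fed with (message size, latency difference) pairs, the intercept estimates the fixed per-invocation cost and the slope any size-dependent cost.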
CMS Data Acquisition System
(Diagram: custom readout with 100 kHz input @ 2 KB per node; O(500) real-time systems using I2O; Gigabit Ethernet, Myrinet or InfiniBand networks; O(500) builder units; O(2000) physics analysis nodes; software: SOAP, XML, Java, I2O.)
Prototype cluster 2000: 32 x 32 PCs, 2.5 Gbps Myrinet 2000, Gigabit Ethernet
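A back-of-envelope check of the figures on this slide, assuming "KB" means 1000 bytes and taking O(500) as exactly 500 readout nodes: each node sustains 200 MB/s, one event assembled from 500 fragments is about 1 MB, and the event builder must move about 100 GB/s in aggregate.

```cpp
#include <cstdint>

// Order-of-magnitude inputs; decimal KB and exactly 500 nodes are assumptions.
constexpr std::uint64_t kTriggerRateHz = 100'000;  // 100 kHz input rate
constexpr std::uint64_t kFragmentBytes = 2'000;    // ~2 KB per readout node per event
constexpr std::uint64_t kReadoutNodes = 500;       // O(500)

constexpr std::uint64_t perNodeBytesPerSec = kTriggerRateHz * kFragmentBytes;      // 2e8 B/s = 200 MB/s
constexpr std::uint64_t eventBytes = kFragmentBytes * kReadoutNodes;               // 1e6 B, about 1 MB
constexpr std::uint64_t aggregateBytesPerSec = perNodeBytesPerSec * kReadoutNodes; // 1e11 B/s = 100 GB/s
```

At these rates every microsecond of software overhead per message matters, which is why the per-invocation cost is tracked so closely.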
Summary
• Lightweight middleware with
  – 2.1 µs per remote function invocation (50 000 calls/s on GM)
  – abstraction from hardware
  – ease of adaptability and extensibility
  is feasible.
• Need architectural support
  – to efficiently integrate layers
  – to be able to keep pace with technology evolution without a need for change
  – to construct homogeneous applications for heterogeneous processing clusters
(Diagram: layer stack with processing and sensor readout applications on Util/DDM and XDAQ, above HTTP and TCP, the OS and device drivers, and PCI, Ethernet and Myrinet.)