Distributed Components for Integrating Large-Scale High Performance Computing
Applications
Nanbor Wang, Roopa Pundaleeka and Johan Carlsson
{nanbor,roopa,johan}@txcorp.com
Tech-X Corporation, Boulder, CO
CCA Meeting, October 11, 2007
Funded by DOE OASCR SBIR Grant #DE-FG02-04ER84099
Distributed Components 2 Nanbor Wang, Roopa Pundaleeka and Johan Carlsson
Outline
• Motivation
• Distributed and Parallel High-Performance Computing (DPHPC)
• Exploring Diverse Distributed Technologies
– Distributed Proxy Components
– New Transport Mechanism for Babel RMI
• Babel RMI
• Babel RMI over CORBA IIOP
– Performance Comparisons
• Future Work
Motivations for Distributed and Parallel Component-Based Software Engineering
• Existing component standards and frameworks were designed with enterprise applications in mind
– No support for features that are important for HPC scientific applications: interoperability with scientific programming languages (FORTRAN) and parallel computing infrastructure (MPI)
• Need to address the needs of HPC scientific applications: combustion modeling, global climate modeling, fusion plasma simulations
• Motivating scenarios for Distributed and Parallel HPC (DPHPC):
– Integrate separately developed and established codes
– Provide a different paradigm for partitioning problems, e.g., multi-physics simulations
– Provide ways to better utilize hardware with large CPU counts
– Combine the computing resources of multiple clusters/computing centers
– Enable parallel data streaming between a computing task and a post-processing task
Distributed Proxy CCA Components
• Connect distributed parallel components by composing remote-capable proxy components into applications
• Hide the distributed aspect from the localized parallel CCA framework
• Provide low-cost mechanisms for connecting incompatible CCA infrastructures, e.g., Ccafeine, Dune, Ccain, and SciRUN
[Diagram: the conceptual connection between Component A and Component B is realized by Remoting Component A and Remoting Component B communicating over a Distributed Middleware Bus]
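The proxy arrangement described above can be sketched in plain C++ (hypothetical names, no CCA framework dependency): a remoting proxy exposes the same port interface as the real component but forwards each call over a transport, so the local framework never sees the distributed aspect.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical port interface shared by local and remote implementations.
struct IntegratorPort {
  virtual ~IntegratorPort() = default;
  virtual double integrate(double lo, double hi) = 0;
};

// A real (local) component implementation with stand-in math.
struct LocalIntegrator : IntegratorPort {
  double integrate(double lo, double hi) override { return hi - lo; }
};

// Remoting proxy: same interface, but forwards the call through a
// transport callback (in the real system, the distributed middleware bus).
struct IntegratorProxy : IntegratorPort {
  using Transport = std::function<double(const std::string&, double, double)>;
  explicit IntegratorProxy(Transport t) : send_(std::move(t)) {}
  double integrate(double lo, double hi) override {
    return send_("integrate", lo, hi);  // remote invocation
  }
  Transport send_;
};
```

Because the proxy satisfies the same port interface, a framework such as Ccafeine can connect it exactly as it would connect the local component.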
New Transport Mechanism for Babel RMI
[Diagram: Babel RMI client and server communicating through the Babel RMI interface over the Simple Protocol]
• Babel generates mappings for remote invocations and uses the Simple Protocol by default
• Babel allows users to take advantage of various remoting technologies through third-party RMI libraries
• We are developing a CORBA protocol library for Babel RMI using TAO (version 1.5.1 or later)
– TAO is a C++-based CORBA middleware framework
– This protocol is essentially a bridge between Babel and TAO
[Diagram: Babel RMI client and server communicating through the Babel RMI interface over TAO IIOP]
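The plug-in capability mentioned above can be illustrated with a minimal sketch (hypothetical types, not Babel's actual API): each protocol library registers under a URL scheme, and the runtime routes an invocation to whichever library claims the object URL's scheme.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical protocol plug-in interface: each RMI transport
// (e.g. a Simple Protocol library or a TAO/IIOP bridge) registers
// under a URL scheme and handles invocations for that scheme.
struct RmiProtocol {
  virtual ~RmiProtocol() = default;
  virtual std::string invoke(const std::string& object_url,
                             const std::string& method) = 0;
};

class ProtocolRegistry {
 public:
  void add(const std::string& scheme, std::shared_ptr<RmiProtocol> p) {
    protocols_[scheme] = std::move(p);
  }
  // Route "scheme://host/obj" to the protocol registered for "scheme".
  std::string invoke(const std::string& url, const std::string& method) {
    auto pos = url.find("://");
    if (pos == std::string::npos) throw std::invalid_argument("bad URL");
    auto it = protocols_.find(url.substr(0, pos));
    if (it == protocols_.end()) throw std::out_of_range("no protocol");
    return it->second->invoke(url, method);
  }
 private:
  std::map<std::string, std::shared_ptr<RmiProtocol>> protocols_;
};
```

The TaoIIOP library described in the following slides plays the role of one such registered protocol.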
Adding CORBA protocol for Babel RMI
• Goal
– Utilize the CORBA wire protocol for Babel RMI communication between Babel clients and servants
– Allow interoperability between existing CORBA and Babel objects (e.g., with SciRUN CORBA support)
– Maintain the performance of the CORBA IIOP protocol
• Direct mapping approach
– Requires support for certain Babel types: complex numbers, multidimensional arrays, and exceptions
– Exchange messages in CORBA format
– Allows development of new SIDL-compatible CORBA objects
Client-side Operation Invocations
• CORBA uses the Common Data Representation (CDR), a binary serialization format, for transferring messages; data are packed directly into the CDR
• No more CORBA-to-Babel transformations of the data
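The idea of packing data directly can be sketched as follows. This is a simplified, CDR-flavored example (not real CORBA CDR: it omits alignment padding and byte-order negotiation): a length-prefixed array of doubles is copied straight into a byte buffer, as a client stub might do, and unpacked on the receiving side.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Pack a length-prefixed array of doubles into a byte buffer.
std::vector<uint8_t> pack_doubles(const std::vector<double>& a) {
  std::vector<uint8_t> buf(sizeof(uint32_t) + a.size() * sizeof(double));
  uint32_t n = static_cast<uint32_t>(a.size());
  std::memcpy(buf.data(), &n, sizeof n);                 // element count
  std::memcpy(buf.data() + sizeof n, a.data(),           // raw payload,
              a.size() * sizeof(double));                // no per-element conversion
  return buf;
}

// Recover the array on the other side of the wire.
std::vector<double> unpack_doubles(const std::vector<uint8_t>& buf) {
  uint32_t n = 0;
  std::memcpy(&n, buf.data(), sizeof n);
  std::vector<double> a(n);
  std::memcpy(a.data(), buf.data() + sizeof n, n * sizeof(double));
  return a;
}
```

Copying the whole payload in one step, rather than translating element by element through an intermediate representation, is what eliminates the CORBA-to-Babel transformation cost.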
Server-side Request Handling
• A default TAO servant handles all Babel invocations
• Requests are dispatched to the target Babel objects based on the instance/object ID
• Need to extend TAO's PortableServer class to expose the Input (for reading input parameters) and Output (for sending the results) CDRs
– SIDL Call and Response objects get references to the Input and Output CDRs, respectively
Server-side Request Handling in TAOIIOP
1. The default TAO object (TaoIIOPObject) extends TAO's PortableServer::ServantBase and implements the 'dispatch' method, which gets the Input and Output CDRs from the ServerReq object.
2. The dispatch method creates the sidl::rmi::Response, which stores the CDR.
3. It gets a reference to the target SIDL object from the InstanceRegistry.
4. It executes the target method:
   a. Pack methods are called on the response object for return, inout, and out parameters.
   b. The results are packed directly into the CDR.
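The dispatch flow above can be sketched with hypothetical, framework-free types: one default servant receives every request and routes it to the target object looked up by instance ID, which is the role TaoIIOPObject plays against Babel's InstanceRegistry.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for a method body: takes one double argument, returns a double.
using Method = std::function<double(double)>;

// Stand-in for a SIDL object: a named set of methods.
struct SidlObject {
  std::map<std::string, Method> methods;
};

// Stand-in for Babel's InstanceRegistry: instance ID -> object.
class InstanceRegistry {
 public:
  void bind(const std::string& id, SidlObject* obj) { objs_[id] = obj; }
  SidlObject* lookup(const std::string& id) {
    auto it = objs_.find(id);
    if (it == objs_.end()) throw std::out_of_range("unknown instance");
    return it->second;
  }
 private:
  std::map<std::string, SidlObject*> objs_;
};

// The "default servant": every request carries an instance ID, a method
// name, and (here) one argument read from the request; the return value
// would then be packed into the reply CDR.
double dispatch(InstanceRegistry& reg, const std::string& id,
                const std::string& method, double arg) {
  SidlObject* obj = reg.lookup(id);     // step 3: find the target object
  return obj->methods.at(method)(arg);  // step 4: execute the target method
}
```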
Features Implemented in TAOIIOP
• All Babel types except opaque
• Exception handling
• One-way method invocation
• Non-blocking / asynchronous method invocation
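The non-blocking style in the last bullet can be illustrated with standard C++ (this is not Babel's API, just the pattern): the stub launches the remote call on another thread and returns a future immediately, so the caller can overlap its own work and collect the result later. A one-way call would simply never retrieve the future.

```cpp
#include <future>

// Stand-in for an RMI round trip that takes noticeable time.
double slow_remote_call(double x) { return x * 2.0; }

// Non-blocking stub: returns immediately with a future for the result.
std::future<double> invoke_async(double x) {
  return std::async(std::launch::async, slow_remote_call, x);
}
```

Typical use: `auto f = invoke_async(21.0); /* ... other work ... */ double r = f.get();` blocks only when the result is actually needed.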
TAOIIOP V2.0 Optimizations
• The initial implementation provides a proof of concept, but performs many extra memory allocations, copies and conversions
• Added support for packing Babel types directly into the CORBA CDR
• No more conversions between Babel types and CORBA types to bridge their discrepancies
• Aggregated memory allocations
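The difference between the two versions can be sketched in miniature (illustrative only, not the actual TaoIIOP code): V1.0 converted and appended one element at a time, triggering repeated reallocations, while V2.0 sizes the output buffer once and copies the whole array in a single step.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// V1.0 style: per-element append; the buffer may reallocate many times.
std::vector<uint8_t> marshal_per_element(const std::vector<double>& a) {
  std::vector<uint8_t> buf;
  for (double v : a) {
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
    buf.insert(buf.end(), p, p + sizeof v);
  }
  return buf;
}

// V2.0 style: one allocation, one bulk copy (the "aggregated
// allocations" and zero-conversion idea).
std::vector<uint8_t> marshal_bulk(const std::vector<double>& a) {
  std::vector<uint8_t> buf(a.size() * sizeof(double));
  std::memcpy(buf.data(), a.data(), buf.size());
  return buf;
}
```

Both produce identical bytes; the bulk version simply avoids the per-element allocation and copy traffic that dominated the V1.0 profile.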
Performance Comparison 1
Throughput for 1D Array of Doubles
[Chart: throughput (MB/sec, 0-6) vs. payload (MB, 0-120); series: TaoIIOP 2.0, Simple Protocol, CORBA DistComp, TaoIIOP 1.0]
Performance Comparison 2
Throughput for 2D Array of Doubles
[Chart: throughput (MB/sec, 0-6) vs. payload (MB, 0-120); series: TaoIIOP 2.0, Simple Protocol, CORBA DistComp, TaoIIOP 1.0]
Performance Comparison 3
Throughput for 1D Array of Floating Point Complex Numbers
[Chart: throughput (MB/sec, 0-6) vs. payload (MB, 0-120); series: TaoIIOP 2.0, Simple Protocol, CORBA DistComp, TaoIIOP 1.0]
Performance Comparison 4
Throughput for 2D Array of Floating Point Complex Numbers
[Chart: throughput (MB/sec, 0-6) vs. payload (MB, 0-120); series: TaoIIOP 2.0, Simple Protocol, CORBA DistComp, TaoIIOP 1.0]
Performance Analysis
• TaoIIOP V1.0 consistently takes a performance hit
– Performs extra conversions between CORBA and Babel for arrays and complex-number types
– Multiple, fine-grained memory allocations
– Does not take advantage of TAO's key optimization mechanisms
• Distributed proxy components also suffer somewhat because of extra data marshalling
• TaoIIOP V2.0 gains about 10% for doubles and 30% for complex numbers compared to TaoIIOP 1.0
– Optimizations: made the CORBA-Babel mapping types native in TAO by implementing optimized, zero-copy marshaling and demarshaling support
Application of DPHPC
• We have developed an example application to demonstrate the use of DPHPC
– Separates post-simulation data processing (after each time step)
– Based on Vorpal, a C++ plasma and beam simulation code
– Implemented by Fang (Cherry) Liu (Indiana Univ.) during a summer internship
• Visible speedup using DPHPC
– The actual trend of the speedups is counter-intuitive
– We are exploring different RMI approaches (TAOIIOP, oneway) and examining ways to optimize the use case
[Chart: Post-processing Style — time (sec, 0-160) vs. number of particles added per time step (100-100000); series: Single Application, DPHPC Simple Protocol, RMI - Non-blocking]
Summary
• Implemented the distributed proxy components and the TaoIIOP Babel RMI protocol for connecting distributed CCA applications into large-scale systems
• Conducted performance benchmarking on the preliminary prototype implementation (version 1.0) to identify the key optimizations needed
• Implemented the optimizations to minimize the overhead (version 2.0)
• Developed a preliminary example application pairing a remote high-performance parallel application with local clusters for data analysis and/or visualization
– Work performed by summer intern Fang Liu from Indiana University