Distributed Components for Integrating Large-Scale High Performance Computing
Applications
Nanbor Wang, Roopa Pundaleeka and Johan Carlsson
{nanbor,roopa,johan}@txcorp.com
Tech-X Corporation, Boulder, CO
CCA Meeting, October 11, 2007
Funded by DOE OASCR SBIR Grant #DE-FG02-04ER84099
Distributed Components 2 Nanbor Wang, Roopa Pundaleeka and Johan Carlsson
Outline
• Motivation
• Distributed and Parallel High-Performance Computing (DPHPC)
• Exploring Diverse Distributed Technologies
– Distributed Proxy Components
– New Transport Mechanism for Babel RMI
• Babel RMI
• Babel RMI over CORBA IIOP
– Performance Comparisons
• Future Work
Motivations for Distributed and Parallel Component-Based Software Engineering
• Existing component standards and frameworks were designed with enterprise applications in mind
– No support for features that are important for HPC scientific applications: interoperability with scientific programming languages (FORTRAN) and parallel computing infrastructure (MPI)
• Need to address the needs of HPC scientific applications: combustion modeling, global climate modeling, fusion plasma simulations
• Motivating scenarios for Distributed and Parallel HPC (DPHPC):
– Integrate separately developed and established codes
– Provide a different paradigm for partitioning problems, e.g., multi-physics simulations
– Provide ways to better utilize hardware with large CPU counts
– Combine the computing resources of multiple clusters/computing centers
– Enable parallel data streaming between a computing task and a post-processing task
Distributed Proxy CCA Components
• Connect distributed parallel components by composing remote-capable proxy components into applications
• Hide the distributed aspect from the localized parallel CCA framework
• Provide low-cost mechanisms for connecting incompatible CCA infrastructures, e.g., Ccafeine, Dune, Ccain, and SciRUN
[Diagram: the conceptual connection between Component A and Component B is realized by Remoting Component A and Remoting Component B communicating over a Distributed Middleware Bus]
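The proxy arrangement described above can be sketched in plain C++ (hypothetical names, no CCA framework dependency): a remoting proxy exposes the same port interface as the real component but forwards each call over a transport, so the local framework never sees the distributed aspect.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical port interface shared by local and remote implementations.
struct IntegratorPort {
  virtual ~IntegratorPort() = default;
  virtual double integrate(double lo, double hi) = 0;
};

// A real (local) component implementation with stand-in math.
struct LocalIntegrator : IntegratorPort {
  double integrate(double lo, double hi) override { return hi - lo; }
};

// Remoting proxy: same interface, but forwards the call through a
// transport callback (in the real system, the distributed middleware bus).
struct IntegratorProxy : IntegratorPort {
  using Transport = std::function<double(const std::string&, double, double)>;
  explicit IntegratorProxy(Transport t) : send_(std::move(t)) {}
  double integrate(double lo, double hi) override {
    return send_("integrate", lo, hi);  // remote invocation
  }
  Transport send_;
};
```

Because the proxy satisfies the same port interface, a framework such as Ccafeine can connect it exactly as it would connect the local component.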
New Transport Mechanism for Babel RMI
[Diagram: Babel RMI client and server communicating through the Babel RMI interface over the Simple Protocol]
• Babel generates mappings for remote invocations and uses the Simple Protocol by default
• Babel allows users to take advantage of various remoting technologies through third-party RMI libraries
• We are developing a CORBA protocol library for Babel RMI using TAO (version 1.5.1 or later)
– TAO is a C++-based CORBA middleware framework
– This protocol is essentially a bridge between Babel and TAO
[Diagram: Babel RMI client and server communicating through the Babel RMI interface over TAO IIOP]
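The plug-in capability mentioned above can be illustrated with a minimal sketch (hypothetical types, not Babel's actual API): each protocol library registers under a URL scheme, and the runtime routes an invocation to whichever library claims the object URL's scheme.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical protocol plug-in interface: each RMI transport
// (e.g. a Simple Protocol library or a TAO/IIOP bridge) registers
// under a URL scheme and handles invocations for that scheme.
struct RmiProtocol {
  virtual ~RmiProtocol() = default;
  virtual std::string invoke(const std::string& object_url,
                             const std::string& method) = 0;
};

class ProtocolRegistry {
 public:
  void add(const std::string& scheme, std::shared_ptr<RmiProtocol> p) {
    protocols_[scheme] = std::move(p);
  }
  // Route "scheme://host/obj" to the protocol registered for "scheme".
  std::string invoke(const std::string& url, const std::string& method) {
    auto pos = url.find("://");
    if (pos == std::string::npos) throw std::invalid_argument("bad URL");
    auto it = protocols_.find(url.substr(0, pos));
    if (it == protocols_.end()) throw std::out_of_range("no protocol");
    return it->second->invoke(url, method);
  }
 private:
  std::map<std::string, std::shared_ptr<RmiProtocol>> protocols_;
};
```

The TaoIIOP library described in the following slides plays the role of one such registered protocol.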
Adding CORBA protocol for Babel RMI
• Goal
– Utilize the CORBA wire protocol for Babel RMI communication between Babel clients and servants
– Allow interoperability between existing CORBA and Babel objects (e.g., with SciRUN CORBA support)
– Maintain the performance of the CORBA IIOP protocol
• Direct mapping approach
– Requires support for certain Babel types: complex numbers, multidimensional arrays, and exceptions
– Exchange messages in CORBA format
– Allows development of new SIDL-compatible CORBA objects
Client-side Operation Invocations
• CORBA uses the Common Data Representation (CDR), a binary serialization format, for transferring messages; data are packed directly into the CDR
• No more CORBA-to-Babel transformations of the data
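The idea of packing data directly can be sketched as follows. This is a simplified, CDR-flavored example (not real CORBA CDR: it omits alignment padding and byte-order negotiation): a length-prefixed array of doubles is copied straight into a byte buffer, as a client stub might do, and unpacked on the receiving side.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Pack a length-prefixed array of doubles into a byte buffer.
std::vector<uint8_t> pack_doubles(const std::vector<double>& a) {
  std::vector<uint8_t> buf(sizeof(uint32_t) + a.size() * sizeof(double));
  uint32_t n = static_cast<uint32_t>(a.size());
  std::memcpy(buf.data(), &n, sizeof n);                 // element count
  std::memcpy(buf.data() + sizeof n, a.data(),           // raw payload,
              a.size() * sizeof(double));                // no per-element conversion
  return buf;
}

// Recover the array on the other side of the wire.
std::vector<double> unpack_doubles(const std::vector<uint8_t>& buf) {
  uint32_t n = 0;
  std::memcpy(&n, buf.data(), sizeof n);
  std::vector<double> a(n);
  std::memcpy(a.data(), buf.data() + sizeof n, n * sizeof(double));
  return a;
}
```

Copying the whole payload in one step, rather than translating element by element through an intermediate representation, is what eliminates the CORBA-to-Babel transformation cost.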
Server-side Request Handling
• A default TAO servant handles all Babel invocations
• Requests are dispatched to the target Babel objects based on the instance/object ID
• Need to extend TAO's PortableServer class to expose the Input (for reading input parameters) and Output (for sending the results) CDRs
– SIDL Call and Response objects get references to the Input and Output CDRs, respectively
Server-side Request Handling in TAOIIOP
1. The default TAO object (TaoIIOPObject) extends TAO's PortableServer::ServantBase and implements the 'dispatch' method, which gets the Input and Output CDRs from the ServerReq object.
2. The dispatch method creates the sidl::rmi::Response, which stores the CDR.
3. It gets a reference to the target SIDL object from the InstanceRegistry.
4. It executes the target method:
   a. Pack methods are called on the response object for return, inout, and out parameters.
   b. The results are packed directly into the CDR.
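The dispatch flow above can be sketched with hypothetical, framework-free types: one default servant receives every request and routes it to the target object looked up by instance ID, which is the role TaoIIOPObject plays against Babel's InstanceRegistry.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for a method body: takes one double argument, returns a double.
using Method = std::function<double(double)>;

// Stand-in for a SIDL object: a named set of methods.
struct SidlObject {
  std::map<std::string, Method> methods;
};

// Stand-in for Babel's InstanceRegistry: instance ID -> object.
class InstanceRegistry {
 public:
  void bind(const std::string& id, SidlObject* obj) { objs_[id] = obj; }
  SidlObject* lookup(const std::string& id) {
    auto it = objs_.find(id);
    if (it == objs_.end()) throw std::out_of_range("unknown instance");
    return it->second;
  }
 private:
  std::map<std::string, SidlObject*> objs_;
};

// The "default servant": every request carries an instance ID, a method
// name, and (here) one argument read from the request; the return value
// would then be packed into the reply CDR.
double dispatch(InstanceRegistry& reg, const std::string& id,
                const std::string& method, double arg) {
  SidlObject* obj = reg.lookup(id);     // step 3: find the target object
  return obj->methods.at(method)(arg);  // step 4: execute the target method
}
```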
Features Implemented in TAOIIOP
• All Babel types except opaque
• Exception handling
• One-way method invocation
• Non-blocking / asynchronous method invocation
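The non-blocking style in the last bullet can be illustrated with standard C++ (this is not Babel's API, just the pattern): the stub launches the remote call on another thread and returns a future immediately, so the caller can overlap its own work and collect the result later. A one-way call would simply never retrieve the future.

```cpp
#include <future>

// Stand-in for an RMI round trip that takes noticeable time.
double slow_remote_call(double x) { return x * 2.0; }

// Non-blocking stub: returns immediately with a future for the result.
std::future<double> invoke_async(double x) {
  return std::async(std::launch::async, slow_remote_call, x);
}
```

Typical use: `auto f = invoke_async(21.0); /* ... other work ... */ double r = f.get();` blocks only when the result is actually needed.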
TAOIIOP V2.0 Optimizations
• The initial implementation provides a proof of concept, but performs many extra memory allocations, copies and conversions
• Added support for packing Babel types directly into the CORBA CDR
• No more conversions between Babel types and CORBA types to bridge their discrepancies
• Aggregated memory allocations
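The difference between the two versions can be sketched in miniature (illustrative only, not the actual TaoIIOP code): V1.0 converted and appended one element at a time, triggering repeated reallocations, while V2.0 sizes the output buffer once and copies the whole array in a single step.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// V1.0 style: per-element append; the buffer may reallocate many times.
std::vector<uint8_t> marshal_per_element(const std::vector<double>& a) {
  std::vector<uint8_t> buf;
  for (double v : a) {
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
    buf.insert(buf.end(), p, p + sizeof v);
  }
  return buf;
}

// V2.0 style: one allocation, one bulk copy (the "aggregated
// allocations" and zero-conversion idea).
std::vector<uint8_t> marshal_bulk(const std::vector<double>& a) {
  std::vector<uint8_t> buf(a.size() * sizeof(double));
  std::memcpy(buf.data(), a.data(), buf.size());
  return buf;
}
```

Both produce identical bytes; the bulk version simply avoids the per-element allocation and copy traffic that dominated the V1.0 profile.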
Performance Comparison 1
Throughput for 1D Array of Doubles
[Chart: throughput (MB/sec, 0-6) vs. payload (MB, 0-120); series: TaoIIOP 2.0, Simple Protocol, CORBA DistComp, TaoIIOP 1.0]
Performance Comparison 2
Throughput for 2D Array of Doubles
[Chart: throughput (MB/sec, 0-6) vs. payload (MB, 0-120); series: TaoIIOP 2.0, Simple Protocol, CORBA DistComp, TaoIIOP 1.0]
Performance Comparison 3
Throughput for 1D Array of Floating Point Complex Numbers
[Chart: throughput (MB/sec, 0-6) vs. payload (MB, 0-120); series: TaoIIOP 2.0, Simple Protocol, CORBA DistComp, TaoIIOP 1.0]
Performance Comparison 4
Throughput for 2D Array of Floating Point Complex Numbers
[Chart: throughput (MB/sec, 0-6) vs. payload (MB, 0-120); series: TaoIIOP 2.0, Simple Protocol, CORBA DistComp, TaoIIOP 1.0]
Performance Analysis
• TaoIIOP V1.0 consistently takes a performance hit
– Performs extra conversions between CORBA and Babel for arrays and complex-number types
– Multiple, fine-grained memory allocations
– Does not take advantage of TAO's key optimization mechanisms
• Distributed proxy components also suffer somewhat because of extra data marshalling
• TaoIIOP V2.0 gains about 10% for doubles and 30% for complex numbers compared to TaoIIOP 1.0
– Optimizations: made the CORBA-Babel mapping types native in TAO by implementing optimized, zero-copy marshaling and demarshaling support
Application of DPHPC
• We have developed an example application to demonstrate the use of DPHPC
– Separates post-simulation data processing (after each time step)
– Based on Vorpal, a C++ plasma and beam simulation code
– Implemented by Fang (Cherry) Liu (Indiana Univ.) during a summer internship
• Visible speedup using DPHPC
– The actual trend of the speedups is counter-intuitive
– We are exploring different RMI approaches (TAOIIOP, oneway) and examining ways to optimize the use case
[Chart: Post-processing Style — time (sec, 0-160) vs. number of particles added per time step (100-100000); series: Single Application, DPHPC Simple Protocol, RMI - Non-blocking]
Summary
• Implemented the distributed proxy components and the TaoIIOP Babel RMI protocol for connecting distributed CCA applications into large-scale systems
• Conducted performance benchmarking on the preliminary prototype implementation (version 1.0) to identify the key optimizations needed
• Implemented the optimizations to minimize the overhead (version 2.0)
• Developed a preliminary example application pairing a remote high-performance parallel application with local clusters for data analysis and/or visualization
– Work performed by summer intern Fang Liu from Indiana University