
Cooperative Parallelism:
An evolutionary programming model for exploiting massively parallel systems

David Jefferson, John May,
Nathan Barton, Rich Becker, Jarek Knap,
Gary Kumfert, James Leek, John Tannahill
Lawrence Livermore National Laboratory

This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory, under contract No. W-7405-Eng-48.


Blue Gene/L

65,536 x 2 processors, 360 Tflops (peak)

Petaflop (peak) machine in 2 years

Petaflop (sustained) in 5 years


Co-op is a new programming paradigm and components model for petascale simulation

• Petascale performance driven by need for multiphysics, multiscale models
  – fluid -- molecule
  – continuum metal -- crystal
  – plasma -- charged particle
  – classical -- quantum
• Multiphysics, multiscale models call for a simulation components architecture
  – whole, parallel simulation codes used as building blocks in larger simulations
  – allows composition (federation) and reuse of codes already mature and trusted
• Multiphysics, multiscale models naturally exhibit MPMD parallelism
  – different subsystems, or length and time scales, require multiphysics
  – multiphysics most efficient with different codes in parallel
• Efficient use of petascale resources requires more dynamic simulation algorithms
  – much more flexible use of resources: dynamic (sub)allocation of processor nodes
  – adaptive sampling family of multiscale algorithms


Co-op allows parallel simulations to be used as components in larger computations

• Large parallel models treated as single objects:
  – coupled with little knowledge of each other's internals
• Coupled models:
  – different languages
  – different parallel decomposition
  – different physics
• Components:
  – dynamically launched
  – internally parallel
  – externally parallel
  – communicate in parallel

[Figure: coupling in time, space, state space, and scale; ensemble coupling for parametric sensitivity or optimization]


Strain rate localization can be predicted with multiscale expanding cylinder model

1/8 exploding cylinder:
• expands radially
• rings with reflecting strain rate waves
• develops diagonal shear bands


Classic SPMD embedding of fine-scale calculations

• nodes statically allocated and scheduled
• fine scale models executed sequentially

[Figure: 64 nodes vs. time for one major cycle; coarse scale model interleaved with fine scale physics]


Adaptive Sampling: a class of dynamic algorithms for multiscale simulation

• Apply fine scale model where continuum model is invalid…
• …but just a sample of the elements
• Elsewhere, interpolate material response function from results previously calculated (a code sketch follows this slide)
• Much less fine scale work; remaining computation may be seriously unbalanced, however
• More than an order of magnitude of performance improvement may be achieved
• Adaptive sampling is not AMR!

[Figure: regions where the coarse model is generally accurate vs. where coarse model assumptions break down]


Co-op model adds layer of dynamic MPMD parallelism to familiar SPMD paradigm

New parallelism layer:
• MPMD federation: composed of symponents that use remote method invocation (RMI)

Familiar parallelism layers:
• SPMD symponent: composed of processes that use MPI
• Process: composed of threads that use shared variables, locks, etc.
• Thread: sequential, with vector, pipeline, or multi-issue parallelism


Adaptive sampling app with integrated fine scale DB

• n = 100 processes
• z/p = 10^4 zones/process
• z = 10^6 zones
• T = 10^4 timesteps
• τ = 100 msec/timestep
• f = 10^-2 (eval fraction)

[Figure: the continuum model (CSM: ALE3D + CouplerLib + FSDB) coupled to FSM Masters, each with its own set of FSM Servers]


Co-op Architecture

• NodeSet allocate / deallocate (sketched below)
  – Contiguous node sets only
  – Suballocation from original allocation
  – Algorithms somewhat like memory allocation
• Symponent launch
  – Array of symponents can be launched on array of nodesets by single call
• Component termination detection
  – Parent symponent notified if child terminates
• Component kill
  – Must work when target is deadlocked, looping, etc.
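A rough C++ sketch of that sequence, with invented names throughout (the slides do not give the real Co-op API): suballocate contiguous nodesets from the job's allocation, then launch one symponent per nodeset.

    // Hypothetical illustration of NodeSet suballocation plus an
    // array-of-symponents launch; only the pattern mirrors the slide.
    #include <cstdio>
    #include <vector>

    struct NodeSet { int first, count; };   // contiguous node sets only

    // Bump allocator over the job's static allocation; a real version
    // would behave "somewhat like memory allocation" (first fit, free lists).
    class NodeSetAllocator {
        int next_ = 0, total_;
    public:
        explicit NodeSetAllocator(int totalNodes) : total_(totalNodes) {}
        NodeSet allocate(int n) {
            NodeSet ns{next_, n};
            next_ += n;
            if (next_ > total_) { std::fprintf(stderr, "out of nodes\n"); ns.count = 0; }
            return ns;
        }
    };

    int main() {
        NodeSetAllocator alloc(64);               // job's original allocation
        std::vector<NodeSet> sets;
        for (int i = 0; i < 8; ++i)
            sets.push_back(alloc.allocate(8));    // suballocate 8 nodesets
        // "Array of symponents launched on array of nodesets by single call";
        // the loop below stands in for that one call.
        for (const NodeSet& ns : sets)
            std::printf("launch FSMServer on nodes [%d..%d]\n",
                        ns.first, ns.first + ns.count - 1);
    }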


Remote Method Invocation (RMI)

• General semantics
  – Operation done by a thread on a symponent
  – It can be nonblocking: caller gets a ticket and can later check, or wait for, completion of the RMI (a calling sketch follows this list)
  – Exceptions supported
  – Concurrent RMIs on same symponent executed in nondeterministic order
• Three kinds of RMI recognized
  – Sequential body, threaded execution
    • inter-thread synchronization required
    • MPI in body not permitted
    • thread concurrency limited by OS
  – Parallel body, serialized execution
    • atomic
    • no recursion; no circularity (results in deadlock)
    • MPI permitted and needed in body
  – One way
    • "call" does not involve a return
    • essentially an asynchronous, one-sided "active" message
  – Others might be recognized in the future
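The nonblocking ticket discipline can be mimicked in plain C++; here std::future plays the role of the Co-op ticket, and remoteEvaluate with its 50 ms delay is invented for illustration, not part of any Co-op API.

    // Issue an RMI, overlap local work with it, then wait on the ticket.
    #include <chrono>
    #include <future>
    #include <iostream>
    #include <thread>

    // Stand-in for a remote method body; the sleep imitates network latency.
    double remoteEvaluate(double strain) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return 200.0 * strain;
    }

    int main() {
        // Nonblocking "call": the future is the ticket.
        std::future<double> ticket =
            std::async(std::launch::async, remoteEvaluate, 0.01);

        // Caller keeps computing while the RMI is outstanding...
        double local = 0.0;
        for (int i = 0; i < 1000; ++i) local += i * 1e-6;

        // ...then checks or waits for completion via the ticket.
        std::cout << "remote: " << ticket.get()
                  << ", local: " << local << "\n";
    }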


More about RMI

• Inter-symponent synchronization
  – RMIs queued, and executed only when callee executes atConsistentState() method (a toy model of this rule follows this list)
  – Last RMI signaled by special RMI: continue()
• Intra-symponent synchronization
  – Sequential body, threaded RMIs must use proper POSIX inter-thread synchronization
• Implementation
  – Babel RMI over TCP
  – Persistent connections at the moment (except for one-way); soon to be non-persistent
  – Future implementations over MPI-2, UDP, and native packet transports
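That queueing rule is easy to model in miniature. The toy below (hypothetical names, no real transport, single process) buffers incoming RMIs and runs them only when the callee declares a consistent state between timesteps.

    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>

    static std::queue<std::function<void()>> pending;  // RMIs not yet run
    static std::mutex mtx;

    void enqueueRMI(std::function<void()> rmi) {       // transport side
        std::lock_guard<std::mutex> lk(mtx);
        pending.push(std::move(rmi));
    }

    void atConsistentState() {                         // callee side
        std::lock_guard<std::mutex> lk(mtx);
        while (!pending.empty()) { pending.front()(); pending.pop(); }
    }

    int main() {
        enqueueRMI([] { std::cout << "query()\n"; });
        enqueueRMI([] { std::cout << "insert()\n"; });
        for (int step = 0; step < 2; ++step) {
            // ... advance the simulation one timestep ...
            atConsistentState();    // queued RMIs execute only here
        }
    }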


Babel and Co-op are intimately related

• Symponents are Babel objects
• Co-op RMI implemented over Babel RMI
• Symponent APIs expressed in Babel's SIDL language
• Any thread with a reference to a symponent can call RMIs on it
• References can be passed as args, results
• Caller and callee can be in different languages
• Co-op rests totally on Babel for
  – RMI syntax
  – SIDL specification language
  – Language interoperability
  – Parts of implementation of RMI


MPMD refactoring and parallelized fine scale models

[Figure: 64 nodes vs. time; coarse scale model with parallelized fine scale physics]


Adaptive Sampling

• evaluation fraction is the most critical performance parameter

[Figure: 64 nodes vs. time; coarse scale model, full fine scale simulations, interpolated fine scale behavior]
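One way to read that claim (a back-of-envelope gloss, not from the slides): if each of the z zones needs a material response every timestep, the work per timestep is roughly f·z·C_fsm + (1 − f)·z·C_db, where f is the evaluation fraction and C_fsm ≫ C_db are the per-call costs of a full fine scale run and a DB retrieval/interpolation. The first term dominates, so total work is nearly proportional to f.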


Adaptive sampling + active load balancing yields dramatic speedup

[Figure: nodes vs. time; coarse scale model, adaptively sampled fine scale simulations, database retrieval and interpolation]


Performance of adaptive sampling using the Co-op programming model

[Chart: wall-clock time (0 to 100,000) vs. sim time (µsec, 0.0 to 25.0) for three cases: classic model embedding, adaptive sampling, and adaptive sampling with load balancing]


Conclusions

• MP/MS simulation drives need for petascale performance
• MP/MS simulation requires
  – componentized model construction
  – MPMD execution
  – dynamic instantiation of components
    • hence dynamic node allocation
  – language interoperability
• Adaptive Sampling is amazingly powerful


End


PSI Project Overview

David Jefferson

Lawrence Livermore National Lab


Distribution of Coarse-scale and Fine-scale Models across Processors

[Figure: coarse-scale model feeding many instances of the fine-scale model across processors; wall-clock time span of one coarse-scale timestep]



MPMD refactoring allows better scheduling of fine scale model executions

• remote fine scale models
• nodes dynamically allocated and scheduled
• improved performance due to better balance

[Figure: 64 nodes vs. time; coarse scale model with dynamically scheduled fine scale physics]


Additional parallelism then becomes available

• fine scale model executions independent
• "nearest neighbor" DB queries are mostly independent and easily parallelizable as well

[Figure: 125 nodes vs. time; coarse scale model, adaptively sampled fine scale simulations, database retrieval and interpolation]


Multiscale material science application


Multiscale material science application with parallel FS database

• n = 100 processes
• z/p = 10^4 zones/process
• z = 10^6 zones
• T = 10^4 timesteps
• τ = 100 msec/timestep
• f = 10^-2 (eval fraction)

[Figure: the CSM (ALE3D + CouplerLib) coupled to FSM Masters (each with FSM Servers) and to DB Clones 1 … k (each a DB Master with DB Servers); per-timestep rates: query(), insert(), and runFSM() each have max rate z/τ and mean rate f·z/τ]
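A quick check of those rates from the parameters above, using the symbols as reconstructed here: the peak query rate is z/τ = 10^6 zones / 0.1 s = 10^7 queries/sec, while the mean fine scale rate is f·z/τ = 10^-2 × 10^7 = 10^5 runFSM() calls/sec, so on average ninety-nine of every hundred zone lookups are satisfied by the DB clones rather than by a fine scale run.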



The PSI Project

• Development of the Co-op model of hybrid componentized MPMD computation
  – Definition of computational model and semantic issues
  – Implementation of Co-op runtime system
  – Implementation of extensions to Babel
• Development of multiscale simulation technology using Co-op
  – Theory and practice of adaptive sampling
  – Implementation of adaptive sampling coupler within Co-op framework
  – Implementation of Fine Scale Model "database" suitable for adaptive sampling
    • M-tree database with nearest neighbor queries



Co-op Capabilities

• NodeSet allocate/deallocate
  – Suballocation of nodeset of any size from job's static allocation
  – Free sets of nodesets, not nodes
• Symponent launch / kill
  – Any process can launch an SPMD executable as a new symponent with any number of processes on a nodeset whose size divides n
  – Parent-child hierarchy: parent process notified of child death; child killed if parent dies
  – Launch uses SLURM srun
  – Runaway or wedged symponent can be killed and its nodeset recovered
• Symponent remote references
  – Symponents can have remote references to one another, which they use for making RMI calls
  – Remote references can be used as arguments in RMI calls
• Symponents and Babel
  – Symponents are Babel objects, and present SIDL interfaces
  – Symponents inherit interfaces in a type hierarchy, so they can be treated in object-oriented fashion
  – A symponent RMI is a Babel RMI
    • Full type safety
    • Language independence / interoperability



Co-op Capabilities

• Symponent RMI & synchronization
  – RMI calls are from a thread to a symponent
  – RMIs are one-sided, unexpected, and by default nonblocking
  – Any number of in- and out-args of any size and type can be used
  – Full exception-throwing capability
  – RMIs can only be executed when callee calls atConsistentState()
  – Special "system" RMIs inherited by all symponents: continue() and kill()
  – Two kinds of user RMIs (a sketch of the second kind follows this list)
    • Sequential body, threaded execution
      – Body executes in rank 0 process only
      – Body is sequential, and does not need MPI
      – Concurrent RMIs must synchronize with one another as threads
    • Parallel body, serialized execution
      – Each may be parallel, running on all processes of callee symponent, but multiple RMI calls are serially executed, and hence atomic
      – Normally use MPI
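For the second kind, a parallel-body RMI might look like the MPI fragment below; this sketches the body only, with the Co-op machinery that invokes it remotely and serializes concurrent calls assumed rather than shown.

    #include <mpi.h>
    #include <cstdio>

    // Hypothetical parallel RMI body: runs collectively on every process
    // of the callee symponent and is free to use MPI. Co-op would execute
    // such bodies one at a time, so each call is atomic.
    double totalZones(long localZones) {
        long total = 0;
        MPI_Allreduce(&localZones, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        return static_cast<double>(total);
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        double z = totalZones(10000);            // z/p = 10^4 zones/process
        if (rank == 0) std::printf("z = %.0f zones\n", z);
        MPI_Finalize();
    }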


Adaptive Sampling substitutes DB retrieval and interpolation for full fine scale evaluation

• subscale results tabulated in a DB
• faster DB queries and interpolations substituted for slower fine scale model executions

[Figure: 64 nodes vs. time; coarse scale model, adaptively sampled fine scale simulations, database retrieval and interpolation]


Current implementation of Co-op runs multiscale models on Linux cluster

[Figure, built up across the final four slides: each node runs Linux with a Co-opd daemon; the CSM and FSM symponents sit on Co-oplib, Babel, and MPI; symponents are launched via SLURM / srun and communicate by RMI (over UDP). Not shown: SLURM daemons and srun() processes.]