Challenges using Roadrunner

10
Slide 1 Challenges using Roadrunner Ken Koch Los Alamos National Laboratory Operated by the Los Alamos National Security, LLC for the DOE/NNSA

Transcript of Challenges using Roadrunner

Page 1: Challenges using Roadrunner

Slide 1

Challenges using Roadrunner

Ken KochLos Alamos National Laboratory

Operated by the Los Alamos National Security, LLC for the DOE/NNSA

Page 2: Challenges using Roadrunner

The Roadrunner Petaflop SystemMore at http://www lanl gov/roadrunner

Slide 2

More at http://www.lanl.gov/roadrunner

Two QS22’s(2 Cells each)

Triblade Node with PCIe-connected CellsSPESPU

PowerXCell 8i: an improved Cell processor

LS21 with

Expansionblade

( )

PowerPC

8 optimizedvector cores

PPCCPU

2 Opterons

Design objective: One Cell processor for every Opteron core, plus the same memory footprint for each (4GB each),

with the fastest feasible interconnects

Connected Unit cluster 12 240 PowerXCell 8i chips ⇒ 1 33 PF 49 TBCell

to PCIeDDR2

memory

Vastly improved double precision performanceLarger DDR2-based memory

Connected Unit cluster180 compute nodes w/ Cells

+ 12 I/O nodes

12,240 PowerXCell 8i chips ⇒ 1.33 PF, 49 TB6,120 dual-core Opterons ⇒ 44 TF, 49 TB

17 CUs3264 nodes PPE

SPE (8)

Posix ThreadsDMAIBM ALFLANL CML

PowerPC compiler

SPE compiler SPE (8)SPE (8)SPE (8)SPE (8)SPE (8)SPE (8)SPE (8)

ringbus

wor

k to

geth

er

SPEProgram

12 links per CU to each of 8 switches

Eight 2nd-stage 288-port IB 4X DDR switches

288-port IB 4X DDR Switch288-port IB 4X DDR Switch

Opteron

IBM DaCSIBM ALFLANL CML

x86 compiler

p

PCIe

ee p

rogr

ams

wx86

Program

PPEProgram

Operated by the Los Alamos National Security, LLC for the DOE/NNSA

Eight 2 stage 288 port IB 4X DDR switches

MPI (cluster) IB(one per node)Th

re

Page 3: Challenges using Roadrunner

Slide 3

IBM created the PowerXCell 8i, a improved variant of the PlayStation 3 Cell processorthe PlayStation 3 Cell processor.

• Cell Broadband Engine (CBE*) developed by Sony-Toshiba-IBM

S S

SPESPU

• used in Sony PlayStation 3• 8 Synergistic Processing Elements (SPEs)

• 128-bit vector cores• 256 kB local memory

( S S )(LS = Local Store)• Direct Memory Access (DMA)

engine (25.6 GB/s each)• Chip interconnect (EIB)• Run SPE-code as POSIX threads

PowerPC

toRun SPE code as POSIX threads(SPMD, MPMD, streaming)

• 1 PowerPC PPE runs Linux OS

to PCIememory

Design: SPEs provide optimalDesign: SPEs provide optimal flop/s per watt in minimal area

This is an Exascale trend

8 optimizedvector cores

PPCCPU

Operated by the Los Alamos National Security, LLC for the DOE/NNSA

* trademark of Sony Computer Entertainment, Inc.

Page 4: Challenges using Roadrunner

Slide 4

Three types of processors are programmed to work togethertogether.

SPE (8)SPECell

SPE (8)SPE (8)• Parallel computing on Cell• data partitioning & work queue pipelining• process management & synchronization

SPE (8)

Posix ThreadsDMAIBM ALF

SPE compiler SPE (8)SPE (8)SPE (8)SPE (8)SPE (8)SPE (8)SPE (8)

ringbus

• Remote communication to/from CellPPE

IBM ALFLANL CML

PowerPC compiler

bus

• data communication & synchronization• process management & synchronization• computationally-intense offload

IBM DaCSIBM ALFLANL CML

86

PCIe

• MPI remains as the foundation

Opteron

MPI (cluster)

x86 compiler

IB(oneper

Operated by the Los Alamos National Security, LLC for the DOE/NNSA

pernode)

Page 5: Challenges using Roadrunner

Several Challenges seen on Roadrunner

Slide 5

Several Challenges seen on Roadrunner

• Exposing on-node parallelism for Cell’s 8 SPUs• Threads are used for 8 SPUs

• Tiling work to fit and stream in and out of the 256KB local memoriesmemories• Breaking up the data into chunks• Using DMA engine to overlap read-ahead/compute/write-

behind

• Breaking application into three collaborating parts (O t PPC SPU )(Opteron, PPC, SPUs)• Most MPI programs are SPMD and bulk synchronous• Relay for MPI messages (Cell Opteron IB and back up)

Operated by the Los Alamos National Security, LLC for the DOE/NNSA

Page 6: Challenges using Roadrunner

Slide 6

How do you keep the 256KB SPUs busy?How do you keep the 256KB SPUs busy?

Break the work into astream of pieces

SPE

SPE

SPE

SPE

SPE

grid tilesti l

problemdomain

SPE

SPE

SPEor particlebundles(can includeghost zones)

domainof a Cell

processor SPE

data chunks stream in & outof 8 SPEs using asynch DMAs

Operated by the Los Alamos National Security, LLC for the DOE/NNSA

g yand triple-buffering

Page 7: Challenges using Roadrunner

Slide 7

Put it all together: MPI+DaCS+DMA+SIMDPut it all together: MPI+DaCS+DMA+SIMDpipelinedwork units

HostCPU

CellPPE

SPESPESPESPESPESPESPESPE

upload

downloadDaCS

MPI • DMAs are simply block memory transfers

• HW asynchronous (no SPE stalls)

• DDR2 memory latency and

DMA Get (first prefetch)Switch work buffersDMA Get (first prefetch)Switch work buffers

“relay” of DaCS ⇔ MPI messages

MPI

DDR2 memory latency and BW performanceDMA Get (prefetch)

DMA Wait (complet current)ComputeDMA Put (store behind)

DMA Get (prefetch)DMA Wait (complet current)ComputeDMA Put (store behind)

Compute & memoryDMA transfers areoverlapped in HW!

DMA Get:mfc_get( LS_addr, Mem_addr, size, tag, 0, 0);

DMA Put:( )DMA Wait (previous put)Switch work buffers

DMA Wait (put)

( )DMA Wait (previous put)Switch work buffers

DMA Wait (put)

mfc_put( Mem_addr, LS_addr, size, tag, 0, 0);

DMA Wait:mfc_write_tag_mask(1<<tag);mfc_read_tag_status_all();

MPI & DaCS can alsobe fully asynchronous

Operated by the Los Alamos National Security, LLC for the DOE/NNSA

Page 8: Challenges using Roadrunner

Other Challenges

Slide 8

Other Challenges

• Lack of programming tools beyond compilers• Like the earliest days of parallel computing

• Busy code developers worried about portability and longevity of an exotic platformlongevity of an exotic platform• Social acceptance• “Too busy. Too difficult. Not mainstream.”

Operated by the Los Alamos National Security, LLC for the DOE/NNSA

Page 9: Challenges using Roadrunner

Roadrunner Open Science

Slide 9

Roadrunner Open Science

• Ten Science projects targeted Roadrunner

• February through September 2009• Significant system and Panasas integration and stabilization

work was ongoing during this same timework was ongoing during this same time

• 7 of 10 Projects made extensive Science runs• 2 had good code running but ran out of time2 had good code running but ran out of time• 1 ran into technical and staffing difficulties• A couple are running again with

the machine in Classifiedthe machine in Classifiedmode

Operated by the Los Alamos National Security, LLC for the DOE/NNSA

Page 10: Challenges using Roadrunner

The ten Roadrunner Open Science projectsSlide 10

Science (code) Description Status

Laser Plasma Instabilities (VPIC)

Study the nonlinear physics of laser backscatter energy transfer and plasma instabilities related to the National Ignition Facility (NIF). Completed

Magnetic Reconnection(VPIC)

Study the continuous breaking and rearrangement of magnetic field lines in plasmas relevant to both space and laboratory applications. Completed(VPIC) plasmas relevant to both space and laboratory applications.

Thermonuclear Burn Kinetics (VPIC)

Study how the TN burn process impacts the velocity distributions of the reacting particle populations and the impact that has on sustaining the burn. (ASC effort)

Code completeOpen science incomplete

Spall and Ejecta( )

Study how materials break up internally, Spall, and how pieces fly off, Ejecta, as shock waves force the material to break apart at the atomic scale (ASC Mostly completed(SPaSM) as shock waves force the material to break apart at the atomic scale. (ASC Weapons Science effort)

Mostly completed

HIV Phylogenetics(ML)

Determine “best” evolutionary relationship trees from a large set of actual genetic HIV genetic sequences (phylogenetic tree) for HIV vaccine targeting. Completed

Properties of Metallic N i (P R )

Apply the parallel-replica approach at the atomistic scale for simulating t i l ti f i i l f it h i f t d i CompletedNanowires (ParRep) material properties of nanowires crucial for switches in future nanodevices. Completed

DNS of Reacting Turbulence (CFDNS)

Study thermonuclear burning in turbulent conditions in Type Ia supernovae using Direct Numerical Simulations (DNS) with full rad-hydro. Completed

The Roadrunner Create a repository of particle simulations of the distribution of matter in the universe to look at galaxy-scale concentrations and structures (dark matter Mostly completedUniverse (RRU) universe to look at galaxy-scale concentrations and structures (dark matter halos).

Mostly completed

SupernovaeLight-Curves (Cassio)

Study the impact of 2D asymmetries on the radiative light output in core collapse supernovae. Coupled RAGE on Opteron-only with Jayenne-Milagro IMC (accelerated).

Code completeOpen science incomplete

Operated by the Los Alamos National Security, LLC for the DOE/NNSA

Cellulosomes (Gromacs) Study the effectiveness of the decomposition of cellulosic sheets of plant fiber by cellusome bacteria related to biofuels (cellulosic alcohol) production

Code work stopped due to performance issues & manpower