Page 1

High Performance Computing on the Cell Broadband Engine

Vas Chellappa
Electrical & Computer Engineering
Carnegie Mellon University

Dec 3 2008

Page 2

Designing “faster” processors

Need for speed

Parallelism: forms
  Superscalar
  Pipelining
  Vector
  Multi-core
  Multi-node

Page 3

Designing “faster” processors

Need for speed

Parallelism: forms (and limitations)
  Superscalar (power density)
  Pipelining (latch overhead: frequency scaling, branching)
  Vector (programming, only numeric)
  Multi-core (memory wall, programming)
  Multi-node (interconnects, reliability)

Page 4

Multi-core Parallelism

The future is definitely multi-core parallelism. But what problems/limitations do multi-cores have?

Increased programming burden
Scaling issues: power, interconnects, etc.

Page 5

The Cell BE Approach

Frequency wall: many simple, in-order cores
Power wall: vectorized, in-order, arithmetic cores
Memory wall: Memory Flow Controller handles programmer-driven DMA in the background

[Block diagram: Cell BE chip with the PPE and eight SPEs (each with a Local Store) connected by the EIB to main memory]

Page 6

Presentation Overview

Cell Broadband Engine: Design

Programming on the Cell

Exercise: implement addition of vectors

Wrap-up

Page 7

Cell Broadband Engine

Designed for high-density floating-point computation (PlayStation 3, IBM Roadrunner)

Compute:
  Heterogeneous multi-core (1 PPE + 8 SPEs)
  204 Gflop/s (only SPEs)
  High-speed on-chip interconnect

Memory system:
  Explicit scratchpad-type "local store"
  DMA-based programming

Challenges:
  Parallelization, vectorization, explicit memory
  New design: new programming paradigm

[Block diagram: SPEs with Local Stores on the EIB, connected to main memory]

Page 8

Cell BE Processor: A Closer Look

Power Processing Element (PPE)
Synergistic Processing Element (SPE) x8
Local Stores (LS)

[Block diagram: Cell BE chip with the PPE and eight SPEs (each with a Local Store) on the EIB, connected to main memory]

Page 9

Power Processing Element (PPE)

Purpose: operating system, program control
Uses the POWER Instruction Set Architecture
2-way multithreaded
Cache: 32 KB L1-I, 32 KB L1-D, 512 KB L2
AltiVec SIMD
System functions: virtualization, address translation/protection, exception handling

Page 10

Synergistic Processing Element (SPE)

SPU = processor + LS; SPE = SPU + MFC
  Synergistic Processing Unit (SPU)
  Local Store (LS)
  Memory Flow Controller (MFC)

Page 11

Synergistic Processing Unit (SPU)

Number cruncher
Vectorization (4-way/2-way)
Peak performance (each SPE):
  25.6 Gflop/s (single precision): 3.2 GHz x 4-way (vector) x 2 (FMA)
  <2 Gflop/s (double precision): not pipelined
  eDP version: full-speed double precision (12.8 Gflop/s)
  Comparison: Intel

128 vector registers, each 128 bits wide
Even and odd pipelines
In-order, shallow pipelines
No branch prediction (hinting only)
Completely deterministic

Page 12

Local Stores (LS) and Memory Flow Cont. (MFC)

Local Stores
  Each SPU contains a 256 KB LS (instead of a cache)
  Explicit read/write (programmer issues DMAs)
  Extremely fast (6-cycle load latency to the SPU)

Memory Flow Controller
  Co-processor that handles DMAs (in the background)
  8/16 command-queue entries
  Handles DMA lists (scatter/gather)
  Barriers, fences, tag groups, etc.
  Mailboxes, signals
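As an illustration of the mailbox facility, a minimal SPE-side sketch follows (the helper name and the meaning of the exchanged values are hypothetical; the PPE side would use libspe's mailbox calls):

#include <spu_mfcio.h>

/* SPE-side mailbox use. Both calls block: the write stalls if the 32-bit
   outbound mailbox is full, the read stalls until the PPE has written a
   value into the inbound mailbox. */
void mailbox_handshake(unsigned int my_id)    /* hypothetical helper */
{
    unsigned int cmd;

    spu_write_out_mbox(my_id);                /* tell the PPE this SPE is ready */
    cmd = spu_read_in_mbox();                 /* wait for a command word back   */
    (void) cmd;                               /* real code would act on cmd     */
}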

Page 13

Element Interconnect Bus (EIB)

4 data rings (16 B wide each): 2 clockwise, 2 counter-clockwise
Supports multiple data transfers
Data ports: 25.6 GB/s per direction
204.8 GB/s sustained peak

Page 14

Direct Memory Access (DMA)

Programmer driven
Packet sizes: 1 B to 16 KB
Several alignment constraints (bus errors!)
Packet size vs. performance
DMA lists
Get, put: SPE-centric view
Mailboxes/signals are also DMAs
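As a sketch of how these constraints are usually satisfied (the buffer name and size are made up): a 128-byte-aligned local-store buffer and a transfer size that is a multiple of 16 bytes and at most 16 KB.

#include <spu_mfcio.h>

#define N 1024                                       /* hypothetical element count */

/* DMA buffers should be at least 16-byte aligned (128 bytes is best for
   performance); legal sizes are 1, 2, 4, 8 or a multiple of 16 bytes, up to 16 KB. */
volatile float x[N] __attribute__((aligned(128)));   /* local-store buffer */

void fetch_x(unsigned long long x_ea)                /* x_ea: main-memory address of X */
{
    mfc_get(x, x_ea, N * sizeof(float), 0, 0, 0);    /* GET under tag 0 */
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();                       /* wait for completion */
}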

Page 15

Systems using the Cell

Sony PlayStation 3
  6 available SPEs
    7th: hypervisor
    8th: defective (yield issues)
  Can run Linux (Fedora / Yellow Dog Linux)
  Various PS3-cluster projects

IBM BladeCenter QS20/QS22
  Two Cell processors
  InfiniBand/Ethernet

Page 16

IBM Roadrunner

Supercomputer at Los Alamos National Lab (NM)
Main purpose: model the decay of the US nuclear arsenal

Performance
  World's fastest [TOP500.org]
  Peak: 1.7 petaflop/s; first to top 1.0 petaflop/s on Linpack

Design: hybrid
  Dual-core 64-bit AMD Opterons at 1.8 GHz (6,480 Opterons)
  One 3.2 GHz Cell attached to each Opteron core (12,960 Cells)

Design hierarchy
  QS22 Blade = 2 PowerXCell 8i
  TriBlade = LS21 Opteron Blade + 2x QS22 Cell Blades (PCIe x8)
  Connected Unit = 180 TriBlades (InfiniBand)
  Cluster = 18 CUs (InfiniBand)

Page 17

Presentation Overview

Cell Broadband Engine: Design

Programming on the Cell

Exercise: implement addition of vectors

Wrap-up

Page 18

Programming on the Cell: Philosophy

Major differences from traditional processors:

  Not designed for scalar performance
  Explicit memory access
  Heterogeneous multi-core

Using the SPEs
  SPMD model (Single Program, Multiple Data)
  Streaming model

Page 19

Programming Tips

What kind of code is good or bad for SPEs?

No branching (no prediction); use branch hinting
No scalar (no support)
Use intrinsics for vectorization, DMA
Context switches are expensive: program + data reside in the LS and have to be swapped in/out
DMA code: alignment, alignment, alignment!
Libraries are available to emulate a software-managed cache
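As a sketch of the branch-hinting tip: since the SPU has no hardware predictor, the compiler can be told which way a branch usually goes, for example with GCC's __builtin_expect (the helper below is made up).

/* Mark the error path as unlikely so spu-gcc can lay out (and hint)
   the common path as the fall-through. */
static inline int checked_double(int x)    /* hypothetical helper */
{
    if (__builtin_expect(x < 0, 0))        /* rarely taken */
        return -1;
    return 2 * x;                          /* common path */
}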

Page 20

DMA Programming

Main idea: hide memory accesses with multibuffering
  Compute on one buffer in the LS
  Write back / read in other batches of data
  Like a completely controlled cache
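A minimal double-buffering sketch (chunk size, function name, and the placement of the compute step are assumptions): while the SPU computes on the chunk it already has, the MFC fetches the next one under a second tag.

#include <spu_mfcio.h>

#define CHUNK 4096                        /* hypothetical DMA size, multiple of 128 bytes */

volatile char buf[2][CHUNK] __attribute__((aligned(128)));

void stream_in(unsigned long long ea, int nchunks)
{
    int cur = 0, next = 1, i;

    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);              /* prime: fetch chunk 0 under tag 0 */

    for (i = 0; i < nchunks; i++) {
        if (i + 1 < nchunks)                              /* start the next fetch early */
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);                     /* wait only on the current tag */
        mfc_read_tag_status_all();

        /* ... compute on buf[cur] while the next DMA proceeds ... */

        cur ^= 1;                                         /* swap buffers/tags */
        next ^= 1;
    }
}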

Inter-chip communication
  Mailboxes
  Signals
  DMA

Page 21

Tools for Cell Programming

IBM’s Cell SDK 3.0
  spu-gcc, ppu-gcc, xlc compilers
  Simulator
  libspe: SPE runtime management library

Other tools:
  Assembly visualizer (useful because SPEs are in-order)
  Single-source compiler
  No OpenMP right now
  Tools from RapidMind, Mercury, etc.

Page 22

Program Design

Use knowledge of the architecture to build a model: back-of-the-envelope calculations
  Cost of processing?
  Cost of communication?
  Trends? Limits?

How close is the model?
What programming improvements can be made to fit the architecture better?

Page 23

Presentation Overview

Cell Broadband Engine: Design

Programming on the Cell

Exercise: implement addition of vectors

Wrap-up

Page 24

Creating PPE Program, SPE Threads

Each program consists of PPE and SPE sections
The program is started on the PPE; the PPE creates SPE threads
  pthreads-based implementation (not full pthreads)

PPE data structure to keep track of SPE threads
PPE/SPE shared data structure for argument passing:
  X, Y, Z addresses
  Thread id
  Returned cycle count
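A sketch of the PPE side using libspe2 and pthreads; the embedded-program handle name (spu_vecadd), the control-block layout, and the helper names are assumptions that mirror the skeleton described above.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <libspe2.h>

extern spe_program_handle_t spu_vecadd;        /* assumed name of the embedded SPE program */

typedef struct {                               /* shared PPE/SPE argument block (assumed layout) */
    unsigned long long x_addr, y_addr, z_addr;
    unsigned int id;
    unsigned int cycles;                       /* returned cycle count */
} control_block_t __attribute__((aligned(128)));

typedef struct {
    spe_context_ptr_t ctx;
    control_block_t  *cb;
} thread_arg_t;

static void *spe_thread(void *arg)
{
    thread_arg_t *t = (thread_arg_t *) arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;

    /* Run the SPE program to completion; argp passes the control block's
       address so the SPE can DMA it into its local store. */
    if (spe_context_run(t->ctx, &entry, 0, t->cb, NULL, NULL) < 0) {
        perror("spe_context_run");
        exit(1);
    }
    return NULL;
}

/* Per SPE: create a context, load the embedded program, launch a pthread. */
static int launch_spe(pthread_t *tid, thread_arg_t *arg, control_block_t *cb)
{
    arg->ctx = spe_context_create(0, NULL);
    if (arg->ctx == NULL)
        return -1;
    if (spe_program_load(arg->ctx, &spu_vecadd) != 0)
        return -1;
    arg->cb = cb;
    return pthread_create(tid, NULL, spe_thread, arg);
}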

Page 25

DMA Access

spu_writech(MFC_WrTagMask, -1);              /* select all tag groups */

spu_mfcdma64(local_store_address,            /* LS buffer (destination of a GET) */
             effective_addr_high,            /* high 32 bits of the main-memory address */
             effective_addr_low,             /* low 32 bits of the main-memory address */
             size_in_bytes,
             tag_id, MFC_GET_CMD);

spu_mfcstat(MFC_TAG_UPDATE_ALL);             /* block until the DMA completes */

Use my DMA_BL_GET, DMA_BL_PUT macros

Page 26

Compiling

Compile PPE and SPE programs separately
Details: specify the SPE program name, call it from the PPE
32/64 bit (watch out for pointer sizes, etc.)
The Cell SDK has sample Makefiles
We will use a simple Makefile

Page 27

Performance Evaluation: Timing

Performance measure: runtime, Gflop/s

Timing
  Each SPE has its own decrementer
  Decrements at an independent, lower frequency (about 80 MHz on the PS3); see cat /proc/cpuinfo
  Reset the counter to its highest value
  Measure on each SPE? Average? Min? Max? Which one fits the real-world scenario best?
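A minimal timing sketch using the channel-based intrinsics (the function name is ours; the frequency comment reflects the PS3's approximate timebase):

#include <spu_intrinsics.h>
#include <spu_mfcio.h>

/* The decrementer counts DOWN at the timebase frequency (roughly 80 MHz on
   the PS3, as reported by /proc/cpuinfo), not at the 3.2 GHz core clock. */
unsigned int time_region(void)
{
    unsigned int t_start, t_end;

    spu_writech(SPU_WrDec, 0x7fffffff);    /* reset the counter to a high value */
    t_start = spu_readch(SPU_RdDec);

    /* ... region being measured ... */

    t_end = spu_readch(SPU_RdDec);
    return t_start - t_end;                /* elapsed ticks; seconds = ticks / timebase */
}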

Page 28

Exercise 1: Add/Mul Two Arrays

Goal: X[] += Y[] * Z[]

Part 1: Infrastructure, understand skeleton code
Part 2: Parallelization and vectorization (easy)
Part 3: Hiding memory access costs

Page 29

Part 1

Goal:
  Understand the skeleton code
  Get the infrastructure up and running (compiler, basic code)

Evaluate: scalar, sequential code performance

PPU's tasks:
  Initialize vectors in main memory
  Start up threads for each SPU, and let them run
  Verify/print results, performance

Use only a single SPU. SPU's tasks:
  Get (DMA) all 3 arrays from main memory
  Perform computation
  Put (DMA) the result back to main memory
  Write back the time to the PPU

Your tasks:
  Compile
  Transform code
  Timer code

Page 30

Part 2

Goal:
  Parallelize across 4 SPEs (easy with the skeleton code)
  Vectorize X[] += Y[] * Z[] (easy)

Evaluate:
  Parallel code performance
  Vectorized parallel code performance

PPU:
  Start up 4 SPU threads
  Performance evaluation: how?

SPU:
  DMA-get, compute, DMA-put only its own chunk
  4-way single-precision vectorization

Your tasks:
  Parallelize
  Vectorize
  Performance?

vector float d = spu_madd(a, b, c);
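For instance, the vectorized inner loop might look like the following sketch (array names and the assumption that n is a multiple of 4 are ours):

#include <spu_intrinsics.h>

/* 4-way single-precision X[] += Y[] * Z[]; assumes 16-byte-aligned
   local-store arrays and n divisible by 4. */
void madd_arrays(float *x, const float *y, const float *z, int n)
{
    vector float *vx = (vector float *) x;
    const vector float *vy = (const vector float *) y;
    const vector float *vz = (const vector float *) z;
    int i;

    for (i = 0; i < n / 4; i++)
        vx[i] = spu_madd(vy[i], vz[i], vx[i]);    /* x = y * z + x, four floats at a time */
}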

Page 31

Part 3

Goal: hide memory accesses
How?

Page 32

Presentation Overview

Cell Broadband Engine: Design

Programming on the Cell

Exercise: implement addition of vectors

Wrap-up

Page 33

Exercise Debriefing

How effectively did we use the architecture?
  Parallelization and vectorization are mandatory!
  Memory overlapping: big difference

Do our optimizations work for a large size range?
  Smaller sizes: smaller packet sizes?

Real-world problems (Fourier transform, Walsh-Hadamard transform (WHT))
  Real-world problems are rarely embarrassingly parallel
  Additional complexities?

Page 34

WHT on the Cell

Vectorization: as before
Parallelization: locality-aware!
Explicit memory access (code provided)
  Multibuffering? How?
Inter-SPE data exchange
  Algorithms that generate large packet sizes?
  Overlap?
  Fast barrier

Page 35

WHT: Data Exchange

Page 36

WHT: Data Exchange

Page 37

WHT: Data Exchange

Page 38

WHT: Data Exchange

Page 39

DMA Issues

External multibuffering (streaming)
Strategies for different problem sizes:
  Small/medium: data exchange on-chip, streaming
  Large: trickier; break the problem down into parts
Using all memory banks

Page 40

Cell Philosophy

Cell philosophies: do they extend to other systems? Yes: the fundamental problems are the same.

Distributed-memory computing (clusters, supercomputers)
  Processing is faster than interconnects
  Higher interconnect bandwidth with larger packets

Multicore processors
  Trend: NUMA, even on-chip
  Locality-aware parallelism

Page 41

Wrap-Up

Programming the Cell BE for high-performance computing

Cell: a chip multiprocessor designed for HPC. Applications range from video gaming to supercomputers.

Programming burden is a factor for performance: parallelization, vectorization, memory handling.

Automated tools yield limited performance. For performance (especially on the Cell), programmers must understand the micro-architecture and its tradeoffs.