Page 1: Parallel Programming & Stuff

Jud Leonard
February 28, 2008

Page 2: SiCortex Systems

Page 3: Outline

• Parallel problems
  – Simulation models
  – Imaging
  – Monte Carlo methods
  – Embarrassing parallelism
• Software issues due to parallelism
  – Communication
  – Synchronization
  – Simultaneity
  – Debugging

Page 4: Limits to Scaling

• Amdahl's Law: the serial fraction eventually dominates (standard forms below)
  – Seldom the limitation in practice
  – Gustafson: big problems have lots of parallelism
• Often in practice, communication dominates
  – Each node treats a smaller volume
  – Each node must communicate with more partners
  – More, smaller messages in the fabric
• Improved communication enables scaling
• Communication is key to higher performance
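
For reference, the two laws named in the first bullet, in their standard textbook form (not taken from the slide), with p the parallel fraction of the work and P the number of processors:

    % Amdahl (fixed problem size): the serial fraction caps the speedup
    S_A(P) = \frac{1}{(1 - p) + p/P}

    % Gustafson (problem scaled with P): speedup keeps growing with P
    S_G(P) = (1 - p) + p \cdot P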

Page 5: Physical System Simulations

• Spatial partition of problem
  – Works best if compute load evenly distributed
    • Weather, climate
    • Fluid dynamics
  – Complex boundary management after load balancing
• Partition criteria must balance:
  – Communication
  – Compute
  – Storage

Page 6: Example: 3D Convolution

• Operate on an N³ array with M³ processors
• Result is a weighted sum of neighbor points
• Single processor
  – No communication cost
  – Compute time ≈ N³
• 3D partition (see the efficiency estimate below)
  – Communication ≈ (N/M)²
  – Compute time ≈ (N/M)³
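
A back-of-the-envelope efficiency estimate that follows from the two per-node costs above (my derivation, with a and b the per-point compute and communication costs):

    E(M) \approx \frac{a (N/M)^3}{a (N/M)^3 + b (N/M)^2}
               = \frac{1}{1 + (b/a)(M/N)}

Efficiency stays high only while M is small compared with N; as M grows, the (N/M)² surface (communication) term overtakes the (N/M)³ volume (compute) term.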

Page 7: Scalability of 3D Convolution

[Chart: Effect of Cost Ratio on Scaling Efficiency; scaling efficiency (1% to 100%) and elapsed time versus number of processors (1 to 10^10), for 10x and 100x scaling.]

Page 8: Example: Logic Simulation

• Modern chips contain many millions of gates
  – Enormous inherent parallelism in the model
• Product quality depends on test coverage
  – Economic incentive
• Perfect application for parallel simulation
  – Why has nobody done it?
    • Communication costs
    • Complexity of the partition problem
      – Multidimensional non-linear optimization

Page 9: Example: Seismic Imaging

• Similar to radar, sonar, MRI…
• Record echoes of a distinctive signal
  – Correlate across time and space (sketch below)
  – Estimate remote structure from variation in echo delay at multiple sensors
• Terabytes of data
  – Need efficient algorithms
  – Every sensor affected by the whole structure
  – How to partition for efficiency?
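
A minimal C sketch of the per-sensor correlation step (my own illustration, not from the talk; trace, sig, and best_lag are invented names):

    /* Slide a known source signature across one recorded sensor trace and
     * return the lag with the largest correlation, i.e. the dominant echo
     * delay for that sensor.  Each (sensor, lag) pair is independent work,
     * which is what a parallel imaging code distributes. */
    int best_lag(const double *trace, int ntrace, const double *sig, int nsig)
    {
        int best = 0;
        double best_corr = 0.0;
        for (int lag = 0; lag + nsig <= ntrace; lag++) {
            double corr = 0.0;
            for (int i = 0; i < nsig; i++)
                corr += trace[lag + i] * sig[i];   /* correlation at this lag */
            if (corr > best_corr) {
                best_corr = corr;
                best = lag;
            }
        }
        return best;   /* delay in samples */
    }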

Page 10: New Issues due to Parallelism

• Communication costs
  – My memory is more accessible than others'
  – Planning, sequencing halo exchanges (sketch below)
    • Bulk transfers most efficient, but take longer
  – Subroutine syntax vs language intrinsic
  – Coherence and synchronization explicitly managed
  – Issues of grain size
• Synchronization
  – Coordination of "loose" parallelism
  – Identification of necessary sync points
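
A minimal sketch of a planned, bulk halo exchange in MPI, assuming a 1-D decomposition with one ghost cell on each side (names like halo_exchange and nlocal are illustrative, not from the talk):

    #include <mpi.h>

    /* u holds nlocal interior points in u[1..nlocal], with ghost cells at
     * u[0] and u[nlocal+1].  One bulk message per neighbor: send my boundary
     * cell, receive the neighbor's boundary cell into my ghost cell. */
    void halo_exchange(double *u, int nlocal, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                     &u[nlocal + 1], 1, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[nlocal],     1, MPI_DOUBLE, right, 1,
                     &u[0],          1, MPI_DOUBLE, left,  1,
                     comm, MPI_STATUS_IGNORE);
    }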

Page 11: Mind Games

• Simultaneity
  – Contrary to habitual sequential mindset
  – Access to variables is not well-ordered between parallel threads
  – Order is not repeatable (example below)
• Debugging
  – Printf?
  – Breakpoints?
  – Timestamps?
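
A small C illustration of non-repeatable ordering (my own example, assuming POSIX threads): two threads increment an unprotected shared counter, so updates interleave differently, and can be lost, on every run.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;              /* shared, unsynchronized */

    static void *bump(void *arg)
    {
        for (int i = 0; i < 1000000; i++)
            counter++;                    /* racy read-modify-write */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }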

Page 12: Interesting Problems - Parallelism

• Event-driven simulation
• Load balancing
• Debugging
  – Correctness
    • Dependency
    • Synchronization
  – Performance
    • Critical paths

Page 13: The Kautz Digraph

• Log diameter (base 3, in our case)
  – Reach any of 972 nodes in 6 or fewer steps
• Multiple disjoint paths
  – Fault tolerance
  – Congestion avoidance
• Large bisection width
  – No choke points as the network grows
• Natural tree structure
  – Parallel broadcast & multicast
  – Parallel barriers & collectives

Page 14: Alphabetic Construction

• Node names are strings of length k (diameter)
  – Alphabet of d+1 letters (d = degree)
  – No letter repeats in adjacent positions
  – ABAC: allowed
  – ABAA: not allowed
• Network order = (d+1)·d^(k-1)
  – d+1 choices for the first letter
  – d choices for each of the remaining k-1 letters
• Connections correspond to shifts (see the sketch below)
  – ABAC, CBAC, DBAC -> BACA, BACB, BACD
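
A small C sketch of this construction (my own illustration, not SiCortex code): enumerate every valid name of length K over D+1 letters and print each node's D out-neighbors by the shift rule.

    #include <stdio.h>
    #include <string.h>

    #define D 3   /* degree */
    #define K 4   /* diameter (name length); order = (D+1)*D^(K-1) = 108 */

    static void emit(char *name, int pos)
    {
        if (pos == K) {
            printf("%.*s ->", K, name);
            char next[K + 1];
            memcpy(next, name + 1, K - 1);       /* drop the first letter */
            for (char c = 'A'; c < 'A' + D + 1; c++) {
                if (c == next[K - 2]) continue;  /* no adjacent repeat */
                next[K - 1] = c;
                next[K] = '\0';
                printf(" %s", next);             /* one of the D out-neighbors */
            }
            printf("\n");
            return;
        }
        for (char c = 'A'; c < 'A' + D + 1; c++) {
            if (pos > 0 && c == name[pos - 1]) continue;
            name[pos] = c;
            emit(name, pos + 1);
        }
    }

    int main(void)
    {
        char name[K + 1];
        emit(name, 0);
        return 0;
    }

For example, the node ABAC comes out with out-neighbors BACA, BACB, and BACD, matching the slide.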

Page 15: Noteworthy

• Most paths simply shift in the destination ID
  – ABCD -> BCDB -> CDBA -> DBAD -> BADC
• Unless tail overlaps head
  – ABCD -> BCDA -> CDAB
• A few nodes have bidirectionally-connected neighbors
  – ABAB <-> BABA
• A "necklace" consists of nodes whose names are merely rotations of each other
  – ABCD -> BCDA -> CDAB -> DABC -> ABCD again

Page 16: Whatsa Kautz Graph?

[Diagram: diameter-1 Kautz graph with 4 nodes labeled 0-3.]

Diam  Order
  1       4
  2      12
  3      36
  4     108
  5     324
  6     972

Page 17: Kautz Graph Topology

[Diagram: diameter-2 Kautz graph with 12 nodes labeled 0-11.]

Page 18: Whatsa Kautz Graph?

[Diagram: diameter-3 Kautz graph with 36 nodes labeled 0-35.]

Page 19: Interconnect Fabric

• Logarithmic diameter
  – Low latency
  – Low contention
  – Low switch degree
• Multiple paths
  – Fault tolerant to link, node, or module failures
  – Congestion avoidance
• Cost-effective
  – Scalable
  – Modular

[Diagram: node block diagram with six CPUs and their caches, shared L2 cache, memory controller with two DDR DIMMs, DMA engine, fabric switch, and PCIe.]

Page 20: DMA Engine API

• Per-process structures:
  – Command and Event queues in user space
  – Buffer Descriptor table (writable by kernel only)
  – Route Descriptor table (writable by kernel only)
  – Heap (user readable/writable)
  – Counters (control conditional execution)
• Simple command set (see the sketch below):
  – Send Event: immediate data for remote event queue
  – Put Im Heap: immediate data for remote heap
  – Send Command: nested command for remote execution
  – Put Buffer to Buffer: RDMA transfer
  – Do Command: conditionally execute a command string
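
A purely hypothetical C sketch of what one command-queue entry might look like; every name here is invented to mirror the bullets above and is not the actual SiCortex definition.

    #include <stdint.h>

    enum dma_opcode {          /* the five commands listed above */
        DMA_SEND_EVENT,        /* immediate data for a remote event queue */
        DMA_PUT_IM_HEAP,       /* immediate data for a remote heap */
        DMA_SEND_COMMAND,      /* nested command for remote execution */
        DMA_PUT_BUF_TO_BUF,    /* RDMA transfer between described buffers */
        DMA_DO_COMMAND         /* conditionally execute a command string */
    };

    struct dma_command {       /* one entry in the user-space command queue */
        enum dma_opcode op;
        uint32_t route;        /* index into the kernel-owned route table */
        uint32_t src_buf;      /* index into the kernel-owned buffer table */
        uint32_t dst_buf;
        uint32_t counter;      /* counter that gates conditional execution */
        uint64_t immediate;    /* payload for the immediate-data commands */
    };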

Page 21: Interesting Problems - SiCortex

• Collectives optimized for the Kautz digraph
  – Optimization for a subset
  – Primitive operations
• Partitions
  – Best subsets to choose
  – Best communication pattern within a subset
• Topology mapping (see the sketch below)
  – N-dimensional mesh
  – Tree
  – Systolic array
• Global shared memory
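
One standard way to express the mesh-mapping case at the MPI level is a Cartesian communicator; a minimal sketch (the 3-D shape and reorder choice are illustrative):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int dims[3] = {0, 0, 0};       /* let MPI pick a balanced 3-D shape */
        int periods[3] = {0, 0, 0};    /* non-periodic mesh (illustrative) */
        MPI_Dims_create(nprocs, 3, dims);

        MPI_Comm mesh;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods,
                        1 /* allow rank reordering to fit the fabric */, &mesh);

        int left, right;
        MPI_Cart_shift(mesh, 0, 1, &left, &right);  /* neighbors along dim 0 */

        MPI_Comm_free(&mesh);
        MPI_Finalize();
        return 0;
    }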

Page 22: Brains and Beauty, too!

Page 23: ICE9 Die Layout

Page 24: 27-node Module

[Photo with callouts: ICE9 node chips, DDR2 DIMMs, PCI Express module options (dual Gigabit Ethernet, Fibre Channel, 10 Gb Ethernet, InfiniBand), power regulator, backpanel connector, module service processor, MSP Ethernet.]

Page 25: What's new or unique? What's not?

• Designed for HPC
• It's not x86
  – Performance = low power
• Communication
  – Kautz digraph topology
  – Messaging: 1st-class op
  – Mesochronous cluster
• Open source everything
• Performance counters
• Reliable by design
  – ECC everywhere
  – Thousands of monitors
• Factors of 3
• Lighted gull wing doors!
• Linux (Gentoo)
• Little-endian
• MIPS-64 ISA
• PathScale compiler
• GNU toolchain
• IEEE floating point
• MPI
• PCI Express I/O