Pictorial Sample of Parallel Applications
Euro-Par 2000 Tutorial © Koniges, Keyes, Rabenseifner, Heroux
Major Classes of Parallel Computing Architectures
• MPP: classic Massively Parallel Processor with distributed memory (RISC-based processors)
  – Each processor has its own separate memory
  – CRAY T3D/E, IBM SP2 original series
  – Can be ambiguous: can use a shared address space model ("virtual shared memory")
• SMP: Symmetric MultiProcessor, occasionally used to denote "Shared-Memory Parallel"
  – Processors share a single global area of RAM
  – Often a crossbar-switched interconnect (Sun HPC 1000)
  – Symmetry: all processors have equal access to memory and other parts of the system (e.g., I/O)
Major Classes of Parallel Computing Architectures (cont)
• PVP: Parallel Vector Processor
  – Cray J90/C90/T90 series
  – NEC SX series, Fujitsu VX series
• ccNUMA: "cache-coherent Non-Uniform Memory Access"
  – Shared physical address space with automatic hardware replication
  – Sometimes called DSM, or Distributed Shared Memory
  – SGI Origin 2000
• Clusters
  – SMP clusters and PVP clusters
  – Beowulfs, Appleseeds, and NOWs
• Hybrids (includes the newer SPs)
For the details, consult:
  – Hennessy and Patterson, Computer Architecture, Morgan Kaufmann Publishers, 1996
  – Culler and Singh, Parallel Computer Architecture, Morgan Kaufmann Publishers, 1999
ASCI Computing Systems
[Figure: capability roadmap over CY 1997-2006, showing plan/develop/use phases for each platform: Red (1+ Tflop / 0.5 TB; Intel Tflop demo 12/96), Blue (3+ Tflop / 1.5 TB; IBM SST demo 9/98, SGI/Cray demo 11/98), White (10+ Tflop / 4 TB), then 30+ Tflop / 10 TB, a 50+ Tflop / 25 TB "Mid Life Kicker," and 100+ Tflop / 30 TB.]
Simplest Diagrams of SMP, MPP
[Figure: an SMP has several CPUs connected through a fast interconnect to a single shared memory; an MPP has many CPU+memory pairs joined by an interconnect that varies by machine.]

SMP (Symmetric Multi-Processor):
• 2-128 processors, circa late '90s
• Memory shared
• High-powered processors (some vector, mostly micro)
• "SMP" means equal access, including I/O
• Sometimes the term is generalized to mean Shared-Memory Parallel

MPP (Massively Parallel Processor):
• 40-4000 processors, circa '90s
• Memory physically distributed
• High-powered micros (Alpha, PowerPC)
Simplistic View of MPP Node
[Figure: an MPP node: a CPU with caches and local memory, attached to the switch/network. Interconnects: omega network (IBM SP2, Meiko); 3D torus (Cray T3E). Processors: RS6000 (IBM), DEC Alpha (Cray), Sparc (TMC, Meiko). Memory: ~16-512 MBytes RAM per processor; newer machines, ~GBs. Communication mechanism: shared memory (T3D/E); message passing (SP2).]
Memory Hierarchy Trapezoid:
[Figure: a trapezoid from Registers at the top, down through L1 cache, L2 cache, and main memory, to disk at the bottom; cost per byte decreases toward the bottom, and access time decreases toward the top.]
• Caches are used (except on the Tera) because the access times from the top to the bottom of the hierarchy are very different.
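The effect of these very different access times can be estimated with the standard average-memory-access-time (AMAT) model. A minimal sketch; the latencies and hit rates below are illustrative assumptions, not measurements from any machine in this tutorial:

```python
# Average memory access time down a cache hierarchy:
# AMAT = t_L1 + miss_L1 * (t_L2 + miss_L2 * t_mem)
# All latencies are in processor clock cycles; values are assumed.

def amat(levels):
    """levels: list of (hit_time_cycles, miss_rate) from fastest to slowest;
    the last level is main memory, with miss_rate 0."""
    total = 0.0
    reach_prob = 1.0  # probability an access gets this far down the hierarchy
    for hit_time, miss_rate in levels:
        total += reach_prob * hit_time
        reach_prob *= miss_rate
    return total

# Illustrative hierarchy: L1 (2 cycles, 5% miss), L2 (10 cycles, 20% miss),
# main memory (100 cycles, never misses).
hierarchy = [(2, 0.05), (10, 0.20), (100, 0.0)]
print(round(amat(hierarchy), 2))  # 2 + 0.05*10 + 0.05*0.2*100 = 3.5 cycles
```

Even with high hit rates, the slow bottom of the hierarchy dominates the average as soon as misses occur, which is why problem size relative to cache matters so much.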
Schematic of a T3E node (Compaq/DEC EV5)
• Chips are becoming increasingly complicated.
• In the first chips used on the T3D, there was only one level of cache, so performance was greatly affected by problem size relative to the cache.
• Stream buffers improve data flow into the cache.
• Most MPPs today have specialized RISC chips.
Legend: E's: integer functional units; F's: floating-point functional units; WB: write buffer; I: instruction cache; D: data cache; S: secondary data cache; SB: stream buffers; DRAM: local memory.
Diagram of the Torus Interconnect
• In the T3D/E, processor access times are not affected by global data layout.
• Whether or not data fits in cache is an issue.
IBM Switch and I/O Nodes
[Figure: compute nodes and I/O nodes connected through the switch; the I/O side includes the VSD, stripe group manager, token manager server, and metanode.]
• Traffic over the switch can affect performance.
• IBM supports GPFS, a true parallel file system.
Cache Coherence and NUMA
• SMPs were introduced in the '70s and include the "vector supercomputers."
• SMPs suffered from bottlenecks due to network contention and the difference between processor speed and memory access rate.
• Cache coherency led to the introduction of ccNUMA in the early 1990s.
[Figure: a symmetric multiprocessor: processor/cache pairs (P/C) connected through an interconnection network to memory and I/O modules.]
Pfister’s construction of the NUMA Architecture
• Scales more like MPPs than bus-based SMPs
• The hardware tells you that the separate memories are one main memory
• One address space over the entire machine
• The amount of time to access a memory value depends on whether it is local to a node or remote in another node
  – a factor of 2-3, or (much) more
[Figure: several nodes, each containing processor/cache pairs, memory, and I/O, joined by an interconnection network; memory is distributed among the nodes.]
Parallel Architectures, cont.
• ccNUMA: a generalization of the SMP; "cache-coherent Non-Uniform Memory Access" (RISC-based)
  – Examples: SGI Origin 2000, Sequent NUMA-Q
  – Scalable network
• Clusters
  – "A parallel or distributed system that consists of a collection of interconnected whole computers, that is utilized as a single, unified computing resource."1
  – "A layer of software lets users, managers, or even applications think there's just one 'thing' there: the cluster, and not its constituent computers."2
1, 2 Pfister, Gregory, In Search of Clusters, 1995, and "Sizing Up Parallel Architectures," 1998
Cluster Example: SMP nodes bundled into large-scale systems connected by some type of interconnect (Compaq/DEC Alpha cluster)
[Figure: several SMP nodes, each with multiple CPUs and caches sharing a memory, joined by an interconnection network.]
• Complicated memory hierarchies; multiple CPUs at the processing node; cache misses very expensive
• Coarse- to fine-grain interprocessor communication
• Lots of I/O, internet communication
• Many different types of control: messages, shared memory; tasks, processes, threads
• Each node runs its own version of the same operating system
Beowulf Clusters
• Somewhere between MPPs and NOWs
  – MPPs: larger, with a faster interconnect
  – NOWs soak up extra cycles on networks; not usually dedicated
  – The interconnect is isolated from the external network
  – Global process ID
• 10.9 Gflop/s by Caltech in '97
• See http://www.beowulf.org
Hybrid systems are becoming common, e.g., ASCI Blue-Pacific
• Each node has 4 IBM/PowerPC 604e processors
• Distributed memory across nodes
• Shared memory within a single node
• Optimal programming model may require a combination of message passing across the nodes and some form of threads or OpenMP within the nodes
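The two-level decomposition described above can be sketched in plain Python rather than MPI+OpenMP: the outer loop partitions work as the message-passing layer would (one partition per node), and a thread pool stands in for the threads within each 4-way SMP node. The node and thread counts are illustrative, not properties of Blue-Pacific:

```python
# Sketch of a hybrid two-level decomposition: partitions across "nodes"
# (where a real code would use MPI), threads within each node.
from concurrent.futures import ThreadPoolExecutor

def node_work(chunk, threads_per_node=4):
    """Work done inside one node: split the chunk across threads."""
    size = max(1, len(chunk) // threads_per_node)
    pieces = [chunk[i:i + size] for i in range(0, len(chunk), size)]
    with ThreadPoolExecutor(max_workers=threads_per_node) as pool:
        return sum(pool.map(sum, pieces))  # each thread sums its piece

data = list(range(1000))
nodes = 4  # across nodes, a real code would exchange messages; here we partition
size = len(data) // nodes
partials = [node_work(data[i * size:(i + 1) * size]) for i in range(nodes)]
print(sum(partials))  # global reduction; prints 499500
```

The global sum at the end plays the role of the cross-node reduction (e.g., MPI_Reduce) that the message-passing layer would perform.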
Architecture Characteristics based on Chip Design
Parallel Vector:
• Flat, uniform-access memory
• Large register resource
• Large effective issue rate on loops (16+ instructions/clock)
• Current vector SMP systems may not provide a cost-effective means of scaling performance to desired levels
• Processors are not keeping pace with Moore's Law

RISC-based:
• Cache memory hierarchy
• Scalar-optimized registers
• Large (4+) issue rate on loops; provides an advantage on scalar code, too
• Hierarchical memory system designs may not scale on certain problems
• Manufacturer's peak performance is rarely achieved

The fight is not over; the ideal parallel architecture is a moving target.
Modern vendors incorporate the best of each system into their new designs
• PVP, Parallel Vector Platform: SGI/Cray SV1; NEC SX-4, SX-5
• Convergence of architectures?
[Figure: vendor roadmap circa 1997: the parallel-vector line (J90, J90se, J90++, T90, PV+) and the scalable-RISC line (T3E, T3E 900, Origin2000, O2000+) converge toward SN 1 and SN 2.]
Applications programmers must rely on software standards to make their work portable.
Then, along comes Tera...
• Buys out Cray. Convergence of architectures?? "That is ridiculous…" (Burton Smith)
• Up to 128 threads per processor
• No data caches
• 64-bit data, addresses, and instructions
• Incorporating CMOS (Complementary Metal-Oxide-Semiconductor) technology
Current ASCI systems are clustered for very large problems. Here, the LLNL Blue-Pacific (SST)
[Figure: three SP sectors (S, Y, K) interconnected by 24 HPGN links each, with HiPPI and FDDI external networks. One sector has 2.5 GB/node memory, 24.5 TB global disk, and 8.3 TB local disk; the other two have 1.5 GB/node memory, 20.5 TB global disk, and 4.4 TB local disk.]
Each SP sector is comprised of:
• 488 Silver nodes
• 24 HPGN links
System parameters:
• 3.89 TFLOP/s peak
• 2.6 TB memory
• 62.5 TB global disk
SST achieved >1.2 TFLOP/s on sPPM, a problem >70x larger than ever solved before!
• IBM RS6000 SP Silver node
  – Sixteen thin nodes per frame
  – 4-way SMP: PowerPC 604e at 332 MHz
  – 32 KB D and 32 KB I L1 cache; 256 KB L2 cache
  – 1.5 or 2.5 GB SDRAM memory
  – 1.3 GB/s memory bandwidth
  – 9.1 or 18.2 GB local disk
  – 2.656 GigaFLOP/s peak
  – 114 LFK GM
  – SPECfp95 of 12.6; SPECint95 of 14.4
  – TB3MX switch adapter, 150 MB/s bi-directional
  – Each node runs AIX 4.3.2
Silver Node Architecture
[Figure: four PowerPC 604e processors, each with an L2 cache, on a 128-bit, 83 MHz bus to a memory controller with two memory cards; a PCI bridge provides local I/O (SCSI disk, Ethernet, PCI slots); a switch adapter connects to the SP switch link; a node supervisor attaches over TTY.]
Nighthawk Node Architecture
• 4 processor cards; 8-way SMP
• Up to 4 memory cards; up to 16 GB (32 GB in the future)
• Switched data: 5 (32B, 1:2) x 4 (64B, 1:4); 3.2 GB/s per port with 200 MHz processors (B/F = 2)
• SP switch adapters plug directly into the node crossbar (5th card); address rebroadcast for performance; 200 MHz CPUs; 6.4 GF/node
• Up to 2 switch adapters, 2 x 500 MB/s links each
• 7 external I/O links, 2 x 250 MB/s each; 1 internal I/O link, 2 x 250 MB/s
[Figure: pairs of POWER3 CPUs with L2 caches sit on 6XX buses (16B, 1:2) behind proc-card controllers; a crossbar switch joins the processor cards to the memory cards over a (logical) address bus (64B, 1:4) and a 6XX data bus (32B, 1:2).]
I/O Hardware Architecture of SST
[Figure: a 488-node IBM SP sector with 56 GPFS servers and 432 thin Silver nodes; 24 SP links connect to the second-level switch; system data and control networks tie the sector together.]
Each SST sector:
• Has local and two global I/O file systems
• 2.2 GB/s delivered global I/O performance
• 3.66 GB/s delivered local I/O performance
• Separate SP first-level switches
• Independent command and control
Full-system mode:
• Application launch over the full 1,464 Silver nodes
• 1,048 MPI/US tasks, 2,048 MPI/IP tasks
• High-speed, low-latency communication between all nodes
Hitachi SR 8000-F1/112 (Rank 5 in the TOP500, June 2000)
• System: 112 nodes; 1.34 TFLOP/s peak; 1.03 TFLOP/s Linpack; 0.9 TB memory
• Node: 8 CPUs, 12 GFLOP/s; 8 GB, SMP; pseudo-vector; external b/w 950 MB/s
• CPU: 1.5 GFLOP/s, 375 MHz; 4 GB/s memory b/w
• Installed: Q1 2000 at LRZ
Earth Simulator Project: ESRDC / GS 40 (NEC)
• System: 640 nodes, 40 TFLOP/s; 10 TB memory; optical single-stage 640x640 crossbar (!); 50 m x 20 m without peripherals
• Node: 8 CPUs, 64 GFLOP/s; 16 GB, SMP; external b/w 2 x 16 GB/s
• CPU: vector; 8 GFLOP/s, 500 MHz; single-chip, 0.15 µm; 32 GB/s memory b/w
• Virtual Earth, simulating:
  – Climate change (global warming)
  – El Niño, hurricanes, droughts
  – Air pollution (acid rain, ozone hole)
  – Diastrophism (earthquakes, volcanism)
• Installation: 2002. http://www.gaia.jaeri.go.jp/public/e_publicconts.html
[Figure: nodes 1 through 640 joined by the optical single-stage 640x640 crossbar.]
Application performance and the system balance it requires:

  Compute    Storage    I/O
  100 GF     125 TB     3 GB/s
  3 TF       7.5 PB     90 GB/s
  10 TF      25 PB      300 GB/s

[Figure: projected growth from '96 through 2003 in computing speed (10^11 to 10^14 FLOPS), memory (terabytes), parallel I/O (gigabytes/sec), network speed (gigabits/sec), and archival storage (petabytes), for program counts from 1 to 10^5.]

The key to a usable system is balance.
MuSST hardware architecture is well balanced for ASCI requirements
MuSST (PERF) system:
• 4 login/network nodes w/ 16 GB SDRAM
• 8 PDEBUG nodes w/ 16 GB SDRAM
• 258 PBATCH nodes w/ 16 GB, 226 w/ 8 GB
• 12.8 GB/s delivered global I/O performance
• 5.12 GB/s delivered local I/O performance
• 24 Gb Ethernet external network
Programming/usage model:
• Application launch over ~492 NH-2 nodes
• 16-way MuSPPA; shared memory; 32b MPI
• 4,096 MPI/US tasks, 2,048 MPI/IP tasks
• Likely usage is 4 MPI tasks/node with 4 threads per MPI task
• Single STDIO interface
[Figure: a 512-node NH-2 IBM SP with 16 GPFS servers, 484 NH-2 PBATCH nodes, 8 NH-2 PDEBUG nodes, and login/jumbo NFS nodes, over system data and control networks.]
Performance Issues: Chip and machine level definitions/concepts
• Peak floating-point performance: the maximum performance rate possible for a given operation; MFLOPS, GFLOPS (10^9), TFLOPS (10^12)
• Performance in OPS is based on the clock rate of the processor and the number of operations per clock
• On the chip, there are various ways to increase the number of operations per clock cycle
• Theoretical peak performance (Dongarra, SC97), for P processors:

  R* = P x (number of arithmetic function units per CPU) / (clock cycle time)
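As a quick sketch of how this peak formula works out in practice (the machine parameters below are illustrative assumptions, not the specification of any system in this tutorial):

```python
# Theoretical peak R* = P * (functional units per CPU) / (clock cycle time);
# dividing by the cycle time is the same as multiplying by the clock rate.

def theoretical_peak_flops(processors, fp_units_per_cpu, clock_hz):
    """Peak FLOP/s, assuming every FP unit retires one result per clock on every CPU."""
    return processors * fp_units_per_cpu * clock_hz

# Illustrative example: 512 CPUs, 2 FP units each (e.g., multiply + add),
# 500 MHz clock -> 512 * 2 * 5e8 = 5.12e11 FLOP/s.
peak = theoretical_peak_flops(512, 2, 500e6)
print(peak / 1e9)  # 512.0 GFLOP/s
```

As the following slides emphasize, this is an upper bound that real applications rarely approach.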
Actual Application Performance
• Algorithm and problem characteristics
  – problem size, regularity of data, geometry
• Programming model (appropriate for architecture?)
• Compiler efficiency
• Available operations (vector units? floating-point support)
• R* chip performance (clock, instructions per clock)
• Processor/memory performance (cache hits and misses, specialized hardware, e.g., stream buffers)
• Optimization of problem for single CPU
• Operational environment (effect of other processes, scheduling)
• I/O quantity and performance
Useful performance of a parallel computer
• Chip performance, and thus R*, is only the first factor in determining the speed of the application
• Bandwidth: the rate of data transfer, in MB/sec
• Latency: how long it takes to get the first bit of information to a processor (a message of size zero)
  – Effective latency takes cache hits/misses into consideration
• Part of obtaining good performance is finding ways to reduce the so-called Processor-Memory Performance Gap
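Latency and bandwidth combine in the usual linear cost model for a message of n bytes, t(n) = latency + n/bandwidth. A minimal sketch; the latency and bandwidth figures are illustrative assumptions:

```python
# Linear message-cost model: t(n) = L + n/B, with L the latency (seconds)
# and B the bandwidth (bytes/s). Both values below are assumed, not measured.

LATENCY = 10e-6        # 10 microseconds per message (assumed)
BANDWIDTH = 100e6      # 100 MB/s sustained (assumed)

def transfer_time(nbytes):
    return LATENCY + nbytes / BANDWIDTH

def effective_bandwidth(nbytes):
    """Delivered bytes/s; approaches B only for large messages."""
    return nbytes / transfer_time(nbytes)

# The "half-performance" message size n_1/2 = L * B is the message length
# at which half the peak bandwidth is delivered (about 1000 bytes here).
n_half = LATENCY * BANDWIDTH
print(round(effective_bandwidth(n_half) / BANDWIDTH, 6))  # 0.5
```

The model makes the trade-off concrete: short messages are dominated by latency, long messages by bandwidth, which is why aggregating many small messages into fewer large ones usually pays off.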
Some benchmark suites to compare useful performance
• NAS Parallel Benchmarks (science.nas.nasa.gov/Software/NPB)
  – A set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, derived from CFD applications, consist of five kernels and three pseudo-applications.
• SPEC, the Standard Performance Evaluation Corporation (www.spec.org)
  – A standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems; has started a set of benchmarks based on large industrial applications.
• Linpack Parallel Benchmark suite (performance.netlib.org)
  – Provides a forum for comparison of massively parallel machines by solving a system of linear equations and allowing the performance numbers to reflect the largest problem run on the machine.
  – Lists Rmax (Gflop/s for the largest problem) and R* (a theoretical number), providing input for the Top500 list.
• Stream Benchmark (www.cs.virginia.edu/stream)
  – Sustainable memory bandwidth (CPUs are getting faster more quickly than memory is).
• B_eff and B_eff_io (www.hlrs.de/mpi/b_eff/ and www.hlrs.de/mpi/b_eff_io)
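The flavor of the Stream measurement can be sketched in a few lines. This is a toy version of the "triad" kernel (a[i] = b[i] + s*c[i]), not the official benchmark, and interpreted-Python numbers fall far below what compiled Stream reports:

```python
# Toy version of the Stream "triad" kernel a[i] = b[i] + s*c[i],
# reporting sustainable memory bandwidth in MB/s. Illustrative only;
# the real benchmark is compiled C/Fortran.
import time
from array import array

N = 1_000_000
b = array('d', [1.0] * N)
c = array('d', [2.0] * N)
s = 3.0

start = time.perf_counter()
a = array('d', (b[i] + s * c[i] for i in range(N)))
elapsed = time.perf_counter() - start

# Triad touches three arrays of 8-byte doubles: two reads and one write.
mbytes = 3 * 8 * N / 1e6
print(f"triad bandwidth: {mbytes / elapsed:.1f} MB/s")
```

Because the kernel does almost no arithmetic per byte moved, its rate is set by the memory system, which is exactly the gap the slide describes.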
Some perspective on Amdahl’s Law
1954: FORTRAN invented by John Backus at IBM for the 701
1955: IBM introduces the 704 (5 kflops); Gene Amdahl chief designer
1964: CDC 6600 "supercomputer"
1966: Flynn's taxonomy (MIMD, ...)
1967: Amdahl's Law
1971: Intel's single-chip CPU (meant for a calculator)
1972: 64-PE ILLIAC-IV (5 Mflops/PE)
1973: Ethernet local area network
1976: Cray-1, ~160 Mflops
1979: IBM's John Cocke, first RISC
1985: TMC's CM-1 with 256x256 1-bit PEs
1993: CRI's T3D
  Speedup = 1 / ((1 - f) + f/P)

where f is the fraction of the code that is parallelizable and P is the number of PEs.

[Figure: speedup vs. number of PEs (4 to 1024) for f = 0.8, 0.9, 0.95, 0.99, 0.999, and 1.0; only f very close to 1 yields near-linear speedup.]
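The curves in the figure come straight from Amdahl's formula; a few representative points:

```python
# Amdahl's Law: speedup = 1 / ((1 - f) + f/P), where f is the parallel
# fraction of the code and P the number of PEs.

def amdahl_speedup(f, P):
    return 1.0 / ((1.0 - f) + f / P)

print(amdahl_speedup(1.0, 1024))  # perfectly parallel: 1024.0
print(amdahl_speedup(0.0, 1024))  # purely serial: 1.0
# Even 99%-parallel code tops out near 1/(1 - f) = 100:
print(round(amdahl_speedup(0.99, 1024), 1))  # 91.2
```

The asymptote 1/(1 - f) as P grows is why the serial fraction, however small it looks, is so often underestimated.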
Reporting Performance: Fundamental Measurements
• Parallel speedup S(n, P), where n is the problem size and P the number of processors: S(n, P) = T(n, 1) / T(n, P)
  – Ideally, speedup should be measured w.r.t. the best serial algorithm, not just the parallel algorithm on a serial computer
• Parallel efficiency
  – Measures how well the implementation scales on the given architecture
• Wall-clock speedup
• Scaled speedup
  – Important, since one real utility of MPPs is solving the big problem
  – Also, no matter what, the absolute speedup should always decrease (exceptional cases of superlinear speedup excluded)
• Amdahl's Law is still often underestimated
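The first two measurements above reduce to two one-line formulas; a minimal sketch, with made-up illustrative timings rather than measurements:

```python
# Speedup and parallel efficiency from measured wall-clock times.
# All timing values below are illustrative assumptions.

def speedup(t1, tp):
    """S(n, P) = T(n, 1) / T(n, P)."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(n, P) = S(n, P) / P; 1.0 means perfect scaling."""
    return speedup(t1, tp) / p

t_serial = 100.0                        # seconds on 1 processor (assumed)
timings = {8: 14.0, 16: 7.5, 32: 4.2}   # seconds on P processors (assumed)
for p, tp in timings.items():
    print(p, round(speedup(t_serial, tp), 1), round(efficiency(t_serial, tp, p), 2))
```

The efficiency column typically falls as P grows, which is the quantitative form of the "absolute speedup should always decrease" remark above.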
Example: parallel speedup compared to 1 processor of a C90

Molecular dynamics provides the source term for diffusion. Atom trajectories r_i(t) are obtained from integration of the classical equations of motion,

  m_i d²r_i/dt² = F_i,   F_i = -∇_i Σ_j V(r_ij).

[Figure: T3D performance in GFLOPS vs. number of processors (0-300), rising to roughly 2 GFLOPS, compared against 1 YMP C90 processor; inset shows an As (15 keV) cascade.]

Example from Tomas Diaz de la Rubia, LLNL, working with AT&T.
Example of wall clock speedup
Example: B-Si self-interstitial complex in Si. Memory: 102 Mbytes; data: 210 Mbytes.
[Figure: total time to solve vs. number of nodes (8-64); at 256 nodes: 12 minutes, 720 MFlops.]
Slide courtesy of L. Yang, working with Xerox.
Example of parallel efficiency
Using the Shared Memory Access Library (SHMEM) in C, for a system of 216 GaAs atoms.
[Figure: total time (sec) per iteration and parallel efficiency (0.8-1.0) vs. number of nodes (0-64).]
Slide courtesy of L. Yang, working with Xerox.
Example: Parallel efficiency of OVERFLOW/MLP
• OVERFLOW CFD code at NASA/Ames
• High, sustained GFLOP/s rate
• With Multi-Level Parallelism (MLP)
• Scalable at large CPU counts
• On a 512-processor ccNUMA Origin 2000
[Figure: performance (GFLOP/s, 0-75) vs. number of CPUs (0-512).]
Examples: fixed problem size and problem size scaled with the number of PEs

Performance results show nearly ideal speed-up for the explicit case (even for fixed problem size).
[Figure: CPU sec/timestep vs. number of processors (1-100, log-log) for the T3E and T3D, each against linear scaling.]
For the scaled problem, the speedup is virtually linear for both machines. Example from NIMROD.
• The T3E is a factor of 4.5 faster:
  – 2x processor speed
  – chaining
  – cache effects
Algorithm Design can affect both performance and scalability
MultiGrid-preconditioned CG (MGCG) performance: fast and scalable.
[Figure, left: norm of the relative residual (1e-10 to 1) vs. number of iterations (0-30); right: scaled speedup vs. number of processors (1-256).]
• The ParFlow MGCG linear solver is 100 times faster than competing solvers.
Example from ISPC: Bosl and Ashby, LLNL.