Pictorial Sample of Parallel Applications
Euro-Par 2000 Tutorial © Koniges, Keyes, Rabenseifner, Heroux
Major Classes of Parallel Computing Architectures
• MPP: classic Massively Parallel Processor with distributed memory (RISC-based processors)
  – Each processor has its own separate memory
  – CRAY T3D/E, IBM SP2 original series
  – Can be ambiguous: can use a shared address space model ("virtual shared memory")
• SMP: Symmetric MultiProcessor, occasionally used to denote "Shared-Memory Parallel"
  – Processors share a single global area of RAM
  – Often a crossbar-switched interconnect (Sun HPC 1000)
  – Symmetry: all processors have equal access to memory and other parts of the system (e.g., I/O)
Major Classes of Parallel Computing Architectures (cont)
• PVP: Parallel Vector Processor
  – Cray J90/C90/T90 series
  – NEC SX series, Fujitsu VX series
• ccNUMA: "cache-coherent Non-Uniform Memory Access"
  – Shared physical address space with automatic hardware replication
  – Sometimes called DSM, or Distributed Shared Memory
  – SGI Origin 2000
• Clusters
  – SMP clusters and PVP clusters
  – Beowulfs, Appleseeds, and NOWs
• Hybrids (includes the newer SPs)
For the details, consult:
  – Hennessy and Patterson, Computer Architecture, Morgan Kaufmann Publishers, 1996
  – Culler and Singh, Parallel Computer Architecture, Morgan Kaufmann Publishers, 1999
ASCI Computing Systems
[Figure: capability roadmap over CY 1997-2006, showing plan/develop/use phases for each platform: Red (1+ Tflop / 0.5 TB; Intel Tflop demo 12/96), Blue (3+ Tflop / 1.5 TB; IBM SST demo 9/98, SGI/Cray demo 11/98), White (10+ Tflop / 4 TB), then 30+ Tflop / 10 TB, a 50+ Tflop / 25 TB "Mid Life Kicker," and 100+ Tflop / 30 TB.]
Simplest Diagrams of SMP, MPP
[Figure: an SMP has several CPUs connected through a fast interconnect to a single shared memory; an MPP has many CPU+memory pairs joined by an interconnect that varies by machine.]

SMP (Symmetric Multi-Processor):
• 2-128 processors, circa late '90s
• Memory shared
• High-powered processors (some vector, mostly micro)
• "SMP" means equal access, including I/O
• Sometimes the term is generalized to mean Shared-Memory Parallel

MPP (Massively Parallel Processor):
• 40-4000 processors, circa '90s
• Memory physically distributed
• High-powered micros (Alpha, PowerPC)
Simplistic View of MPP Node
[Figure: an MPP node: a CPU with caches and local memory, attached to the switch/network. Interconnects: omega network (IBM SP2, Meiko); 3D torus (Cray T3E). Processors: RS6000 (IBM), DEC Alpha (Cray), Sparc (TMC, Meiko). Memory: ~16-512 MBytes RAM per processor; newer machines, ~GBs. Communication mechanism: shared memory (T3D/E); message passing (SP2).]
Memory Hierarchy Trapezoid:
[Figure: a trapezoid from Registers at the top, down through L1 cache, L2 cache, and main memory, to disk at the bottom; cost per byte decreases toward the bottom, and access time decreases toward the top.]
• Caches are used (except on the Tera) because the access times from the top to the bottom of the hierarchy are very different.
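The effect of these very different access times can be estimated with the standard average-memory-access-time (AMAT) model. A minimal sketch; the latencies and hit rates below are illustrative assumptions, not measurements from any machine in this tutorial:

```python
# Average memory access time down a cache hierarchy:
# AMAT = t_L1 + miss_L1 * (t_L2 + miss_L2 * t_mem)
# All latencies are in processor clock cycles; values are assumed.

def amat(levels):
    """levels: list of (hit_time_cycles, miss_rate) from fastest to slowest;
    the last level is main memory, with miss_rate 0."""
    total = 0.0
    reach_prob = 1.0  # probability an access gets this far down the hierarchy
    for hit_time, miss_rate in levels:
        total += reach_prob * hit_time
        reach_prob *= miss_rate
    return total

# Illustrative hierarchy: L1 (2 cycles, 5% miss), L2 (10 cycles, 20% miss),
# main memory (100 cycles, never misses).
hierarchy = [(2, 0.05), (10, 0.20), (100, 0.0)]
print(round(amat(hierarchy), 2))  # 2 + 0.05*10 + 0.05*0.2*100 = 3.5 cycles
```

Even with high hit rates, the slow bottom of the hierarchy dominates the average as soon as misses occur, which is why problem size relative to cache matters so much.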
Schematic of a T3E node (Compaq/DEC EV5)
• Chips are becoming increasingly complicated.
• In the first chips used on the T3D, there was only one level of cache, so performance was greatly affected by problem size relative to the cache.
• Stream buffers improve data flow into the cache.
• Most MPPs today have specialized RISC chips.
Legend: E's: integer functional units; F's: floating-point functional units; WB: write buffer; I: instruction cache; D: data cache; S: secondary data cache; SB: stream buffers; DRAM: local memory.
Diagram of the Torus Interconnect
• In the T3D/E, processor access times are not affected by global data layout.
• Whether or not data fits in cache is an issue.
IBM Switch and I/O Nodes
[Figure: compute nodes and I/O nodes connected through the switch; the I/O side includes the VSD, stripe group manager, token manager server, and metanode.]
• Traffic over the switch can affect performance.
• IBM supports GPFS, a true parallel file system.
Cache Coherence and NUMA
• SMPs were introduced in the '70s and include the "vector supercomputers."
• SMPs suffered from bottlenecks due to network contention and the difference between processor speed and memory access rate.
• Cache coherency led to the introduction of ccNUMA in the early 1990s.
[Figure: a symmetric multiprocessor: processor/cache pairs (P/C) connected through an interconnection network to memory and I/O modules.]
Pfister’s construction of the NUMA Architecture
• Scales more like MPPs than bus-based SMPs
• The hardware tells you that the separate memories are one main memory
• One address space over the entire machine
• The amount of time to access a memory value depends on whether it is local to a node or remote in another node
  – a factor of 2-3, or (much) more
[Figure: several nodes, each containing processor/cache pairs, memory, and I/O, joined by an interconnection network; memory is distributed among the nodes.]
Parallel Architectures, cont.
• ccNUMA: a generalization of the SMP; "cache-coherent Non-Uniform Memory Access" (RISC-based)
  – Examples: SGI Origin 2000, Sequent NUMA-Q
  – Scalable network
• Clusters
  – "A parallel or distributed system that consists of a collection of interconnected whole computers, that is utilized as a single, unified computing resource."1
  – "A layer of software lets users, managers, or even applications think there's just one 'thing' there: the cluster, and not its constituent computers."2
1, 2 Pfister, Gregory, In Search of Clusters, 1995, and "Sizing Up Parallel Architectures," 1998
Cluster Example: SMP nodes bundled into large-scale systems connected by some type of interconnect (Compaq/DEC Alpha cluster)
[Figure: several SMP nodes, each with multiple CPUs and caches sharing a memory, joined by an interconnection network.]
• Complicated memory hierarchies; multiple CPUs at the processing node; cache misses very expensive
• Coarse- to fine-grain interprocessor communication
• Lots of I/O, internet communication
• Many different types of control: messages, shared memory; tasks, processes, threads
• Each node runs its own version of the same operating system
Beowulf Clusters
• Somewhere between MPPs and NOWs
  – MPPs: larger, with a faster interconnect
  – NOWs soak up extra cycles on networks; not usually dedicated
  – The interconnect is isolated from the external network
  – Global process ID
• 10.9 Gflop/s by Caltech in '97
• See http://www.beowulf.org
Hybrid systems are becoming common, e.g., ASCI Blue-Pacific
• Each node has 4 IBM/PowerPC 604e processors
• Distributed memory across nodes
• Shared memory within a single node
• Optimal programming model may require a combination of message passing across the nodes and some form of threads or OpenMP within the nodes
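The two-level decomposition described above can be sketched in plain Python rather than MPI+OpenMP: the outer loop partitions work as the message-passing layer would (one partition per node), and a thread pool stands in for the threads within each 4-way SMP node. The node and thread counts are illustrative, not properties of Blue-Pacific:

```python
# Sketch of a hybrid two-level decomposition: partitions across "nodes"
# (where a real code would use MPI), threads within each node.
from concurrent.futures import ThreadPoolExecutor

def node_work(chunk, threads_per_node=4):
    """Work done inside one node: split the chunk across threads."""
    size = max(1, len(chunk) // threads_per_node)
    pieces = [chunk[i:i + size] for i in range(0, len(chunk), size)]
    with ThreadPoolExecutor(max_workers=threads_per_node) as pool:
        return sum(pool.map(sum, pieces))  # each thread sums its piece

data = list(range(1000))
nodes = 4  # across nodes, a real code would exchange messages; here we partition
size = len(data) // nodes
partials = [node_work(data[i * size:(i + 1) * size]) for i in range(nodes)]
print(sum(partials))  # global reduction; prints 499500
```

The global sum at the end plays the role of the cross-node reduction (e.g., MPI_Reduce) that the message-passing layer would perform.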
Architecture Characteristics based on Chip Design
Parallel Vector:
• Flat, uniform-access memory
• Large register resource
• Large effective issue rate on loops (16+ instructions/clock)
• Current vector SMP systems may not provide a cost-effective means of scaling performance to desired levels
• Processors are not keeping pace with Moore's Law

RISC-based:
• Cache memory hierarchy
• Scalar-optimized registers
• Large (4+) issue rate on loops; provides an advantage on scalar code, too
• Hierarchical memory system designs may not scale on certain problems
• Manufacturer's peak performance is rarely achieved

The fight is not over; the ideal parallel architecture is a moving target.
Modern vendors incorporate the best of each system into their new designs
• PVP, Parallel Vector Platform: SGI/Cray SV1; NEC SX-4, SX-5
• Convergence of architectures?
[Figure: vendor roadmap circa 1997: the parallel-vector line (J90, J90se, J90++, T90, PV+) and the scalable-RISC line (T3E, T3E 900, Origin2000, O2000+) converge toward SN 1 and SN 2.]
Applications programmers must rely on software standards to make their work portable.
Then, along comes Tera...
• Buys out Cray. Convergence of architectures?? "That is ridiculous…" (Burton Smith)
• Up to 128 threads per processor
• No data caches
• 64-bit data, addresses, and instructions
• Incorporating CMOS (Complementary Metal-Oxide-Semiconductor) technology
Current ASCI systems are clustered for very large problems. Here, the LLNL Blue-Pacific (SST)
[Figure: three SP sectors (S, Y, K) interconnected by 24 HPGN links each, with HiPPI and FDDI external networks. One sector has 2.5 GB/node memory, 24.5 TB global disk, and 8.3 TB local disk; the other two have 1.5 GB/node memory, 20.5 TB global disk, and 4.4 TB local disk.]
Each SP sector is comprised of:
• 488 Silver nodes
• 24 HPGN links
System parameters:
• 3.89 TFLOP/s peak
• 2.6 TB memory
• 62.5 TB global disk
SST achieved >1.2 TFLOP/s on sPPM, a problem >70x larger than ever solved before!
• IBM RS6000 SP Silver node
  – Sixteen thin nodes per frame
  – 4-way SMP: PowerPC 604e at 332 MHz
  – 32 KB D and 32 KB I L1 cache; 256 KB L2 cache
  – 1.5 or 2.5 GB SDRAM memory
  – 1.3 GB/s memory bandwidth
  – 9.1 or 18.2 GB local disk
  – 2.656 GigaFLOP/s peak
  – 114 LFK GM
  – SPECfp95 of 12.6; SPECint95 of 14.4
  – TB3MX switch adapter, 150 MB/s bi-directional
  – Each node runs AIX 4.3.2
Silver Node Architecture
[Figure: four PowerPC 604e processors, each with an L2 cache, on a 128-bit, 83 MHz bus to a memory controller with two memory cards; a PCI bridge provides local I/O (SCSI disk, Ethernet, PCI slots); a switch adapter connects to the SP switch link; a node supervisor attaches over TTY.]
Nighthawk Node Architecture
• 4 processor cards; 8-way SMP
• Up to 4 memory cards; up to 16 GB (32 GB in the future)
• Switched data: 5 (32B, 1:2) x 4 (64B, 1:4); 3.2 GB/s per port with 200 MHz processors (B/F = 2)
• SP switch adapters plug directly into the node crossbar (5th card); address rebroadcast for performance; 200 MHz CPUs; 6.4 GF/node
• Up to 2 switch adapters, 2 x 500 MB/s links each
• 7 external I/O links, 2 x 250 MB/s each; 1 internal I/O link, 2 x 250 MB/s
[Figure: pairs of POWER3 CPUs with L2 caches sit on 6XX buses (16B, 1:2) behind proc-card controllers; a crossbar switch joins the processor cards to the memory cards over a (logical) address bus (64B, 1:4) and a 6XX data bus (32B, 1:2).]
I/O Hardware Architecture of SST
[Figure: a 488-node IBM SP sector with 56 GPFS servers and 432 thin Silver nodes; 24 SP links connect to the second-level switch; system data and control networks tie the sector together.]
Each SST sector:
• Has local and two global I/O file systems
• 2.2 GB/s delivered global I/O performance
• 3.66 GB/s delivered local I/O performance
• Separate SP first-level switches
• Independent command and control
Full-system mode:
• Application launch over the full 1,464 Silver nodes
• 1,048 MPI/US tasks, 2,048 MPI/IP tasks
• High-speed, low-latency communication between all nodes
Hitachi SR 8000-F1/112 (Rank 5 in the TOP500, June 2000)
• System: 112 nodes; 1.34 TFLOP/s peak; 1.03 TFLOP/s Linpack; 0.9 TB memory
• Node: 8 CPUs, 12 GFLOP/s; 8 GB, SMP; pseudo-vector; external b/w 950 MB/s
• CPU: 1.5 GFLOP/s, 375 MHz; 4 GB/s memory b/w
• Installed: Q1 2000 at LRZ
Earth Simulator Project: ESRDC / GS 40 (NEC)
• System: 640 nodes, 40 TFLOP/s; 10 TB memory; optical single-stage 640x640 crossbar (!); 50 m x 20 m without peripherals
• Node: 8 CPUs, 64 GFLOP/s; 16 GB, SMP; external b/w 2 x 16 GB/s
• CPU: vector; 8 GFLOP/s, 500 MHz; single-chip, 0.15 µm; 32 GB/s memory b/w
• Virtual Earth, simulating:
  – Climate change (global warming)
  – El Niño, hurricanes, droughts
  – Air pollution (acid rain, ozone hole)
  – Diastrophism (earthquakes, volcanism)
• Installation: 2002. http://www.gaia.jaeri.go.jp/public/e_publicconts.html
[Figure: nodes 1 through 640 joined by the optical single-stage 640x640 crossbar.]
Application performance and the system balance it requires:

  Compute    Storage    I/O
  100 GF     125 TB     3 GB/s
  3 TF       7.5 PB     90 GB/s
  10 TF      25 PB      300 GB/s

[Figure: projected growth from '96 through 2003 in computing speed (10^11 to 10^14 FLOPS), memory (terabytes), parallel I/O (gigabytes/sec), network speed (gigabits/sec), and archival storage (petabytes), for program counts from 1 to 10^5.]

The key to a usable system is balance.
MuSST hardware architecture is well balanced for ASCI requirements
MuSST (PERF) system:
• 4 login/network nodes w/ 16 GB SDRAM
• 8 PDEBUG nodes w/ 16 GB SDRAM
• 258 PBATCH nodes w/ 16 GB, 226 w/ 8 GB
• 12.8 GB/s delivered global I/O performance
• 5.12 GB/s delivered local I/O performance
• 24 Gb Ethernet external network
Programming/usage model:
• Application launch over ~492 NH-2 nodes
• 16-way MuSPPA; shared memory; 32b MPI
• 4,096 MPI/US tasks, 2,048 MPI/IP tasks
• Likely usage is 4 MPI tasks/node with 4 threads per MPI task
• Single STDIO interface
[Figure: a 512-node NH-2 IBM SP with 16 GPFS servers, 484 NH-2 PBATCH nodes, 8 NH-2 PDEBUG nodes, and login/jumbo NFS nodes, over system data and control networks.]
Performance Issues: Chip and machine level definitions/concepts
• Peak floating-point performance: the maximum performance rate possible for a given operation; MFLOPS, GFLOPS (10^9), TFLOPS (10^12)
• Performance in OPS is based on the clock rate of the processor and the number of operations per clock
• On the chip, there are various ways to increase the number of operations per clock cycle
• Theoretical peak performance (Dongarra, SC97), for P processors:

  R* = P x (number of arithmetic function units per CPU) / (clock cycle time)
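As a quick sketch of how this peak formula works out in practice (the machine parameters below are illustrative assumptions, not the specification of any system in this tutorial):

```python
# Theoretical peak R* = P * (functional units per CPU) / (clock cycle time);
# dividing by the cycle time is the same as multiplying by the clock rate.

def theoretical_peak_flops(processors, fp_units_per_cpu, clock_hz):
    """Peak FLOP/s, assuming every FP unit retires one result per clock on every CPU."""
    return processors * fp_units_per_cpu * clock_hz

# Illustrative example: 512 CPUs, 2 FP units each (e.g., multiply + add),
# 500 MHz clock -> 512 * 2 * 5e8 = 5.12e11 FLOP/s.
peak = theoretical_peak_flops(512, 2, 500e6)
print(peak / 1e9)  # 512.0 GFLOP/s
```

As the following slides emphasize, this is an upper bound that real applications rarely approach.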
Actual Application Performance
• Algorithm and problem characteristics
  – problem size, regularity of data, geometry
• Programming model (appropriate for architecture?)
• Compiler efficiency
• Available operations (vector units? floating-point support)
• R* chip performance (clock, instructions per clock)
• Processor/memory performance (cache hits and misses, specialized hardware, e.g., stream buffers)
• Optimization of problem for single CPU
• Operational environment (effect of other processes, scheduling)
• I/O quantity and performance
Useful performance of a parallel computer
• Chip performance, and thus R*, is only the first factor in determining the speed of the application
• Bandwidth: the rate of data transfer, in MB/sec
• Latency: how long it takes to get the first bit of information to a processor (a message of size zero)
  – Effective latency takes cache hits/misses into consideration
• Part of obtaining good performance is finding ways to reduce the so-called Processor-Memory Performance Gap
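Latency and bandwidth combine in the usual linear cost model for a message of n bytes, t(n) = latency + n/bandwidth. A minimal sketch; the latency and bandwidth figures are illustrative assumptions:

```python
# Linear message-cost model: t(n) = L + n/B, with L the latency (seconds)
# and B the bandwidth (bytes/s). Both values below are assumed, not measured.

LATENCY = 10e-6        # 10 microseconds per message (assumed)
BANDWIDTH = 100e6      # 100 MB/s sustained (assumed)

def transfer_time(nbytes):
    return LATENCY + nbytes / BANDWIDTH

def effective_bandwidth(nbytes):
    """Delivered bytes/s; approaches B only for large messages."""
    return nbytes / transfer_time(nbytes)

# The "half-performance" message size n_1/2 = L * B is the message length
# at which half the peak bandwidth is delivered (about 1000 bytes here).
n_half = LATENCY * BANDWIDTH
print(round(effective_bandwidth(n_half) / BANDWIDTH, 6))  # 0.5
```

The model makes the trade-off concrete: short messages are dominated by latency, long messages by bandwidth, which is why aggregating many small messages into fewer large ones usually pays off.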
Some benchmark suites to compare useful performance
• NAS Parallel Benchmarks (science.nas.nasa.gov/Software/NPB)
  – A set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, derived from CFD applications, consist of five kernels and three pseudo-applications.
• SPEC, the Standard Performance Evaluation Corporation (www.spec.org)
  – A standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems; has started a set of benchmarks based on large industrial applications.
• Linpack Parallel Benchmark suite (performance.netlib.org)
  – Provides a forum for comparison of massively parallel machines by solving a system of linear equations and allowing the performance numbers to reflect the largest problem run on the machine.
  – Lists Rmax (Gflop/s for the largest problem) and R* (a theoretical number), providing input for the Top500 list.
• Stream Benchmark (www.cs.virginia.edu/stream)
  – Sustainable memory bandwidth (CPUs are getting faster more quickly than memory is).
• B_eff and B_eff_io (www.hlrs.de/mpi/b_eff/ and www.hlrs.de/mpi/b_eff_io)
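The flavor of the Stream measurement can be sketched in a few lines. This is a toy version of the "triad" kernel (a[i] = b[i] + s*c[i]), not the official benchmark, and interpreted-Python numbers fall far below what compiled Stream reports:

```python
# Toy version of the Stream "triad" kernel a[i] = b[i] + s*c[i],
# reporting sustainable memory bandwidth in MB/s. Illustrative only;
# the real benchmark is compiled C/Fortran.
import time
from array import array

N = 1_000_000
b = array('d', [1.0] * N)
c = array('d', [2.0] * N)
s = 3.0

start = time.perf_counter()
a = array('d', (b[i] + s * c[i] for i in range(N)))
elapsed = time.perf_counter() - start

# Triad touches three arrays of 8-byte doubles: two reads and one write.
mbytes = 3 * 8 * N / 1e6
print(f"triad bandwidth: {mbytes / elapsed:.1f} MB/s")
```

Because the kernel does almost no arithmetic per byte moved, its rate is set by the memory system, which is exactly the gap the slide describes.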
Some perspective on Amdahl’s Law
1954: FORTRAN invented by John Backus at IBM for the 701
1955: IBM introduces the 704 (5 kflops); Gene Amdahl chief designer
1964: CDC 6600 "supercomputer"
1966: Flynn's taxonomy (MIMD, ...)
1967: Amdahl's Law
1971: Intel's single-chip CPU (meant for a calculator)
1972: 64-PE ILLIAC-IV (5 Mflops/PE)
1973: Ethernet local area network
1976: Cray-1, ~160 Mflops
1979: IBM's John Cocke, first RISC
1985: TMC's CM-1 with 256x256 1-bit PEs
1993: CRI's T3D
  Speedup = 1 / ((1 - f) + f/P)

where f is the fraction of the code that is parallelizable and P is the number of PEs.

[Figure: speedup vs. number of PEs (4 to 1024) for f = 0.8, 0.9, 0.95, 0.99, 0.999, and 1.0; only f very close to 1 yields near-linear speedup.]
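The curves in the figure come straight from Amdahl's formula; a few representative points:

```python
# Amdahl's Law: speedup = 1 / ((1 - f) + f/P), where f is the parallel
# fraction of the code and P the number of PEs.

def amdahl_speedup(f, P):
    return 1.0 / ((1.0 - f) + f / P)

print(amdahl_speedup(1.0, 1024))  # perfectly parallel: 1024.0
print(amdahl_speedup(0.0, 1024))  # purely serial: 1.0
# Even 99%-parallel code tops out near 1/(1 - f) = 100:
print(round(amdahl_speedup(0.99, 1024), 1))  # 91.2
```

The asymptote 1/(1 - f) as P grows is why the serial fraction, however small it looks, is so often underestimated.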
Reporting Performance: Fundamental Measurements
• Parallel speedup S(n, P), where n is the problem size and P the number of processors: S(n, P) = T(n, 1) / T(n, P)
  – Ideally, speedup should be measured w.r.t. the best serial algorithm, not just the parallel algorithm on a serial computer
• Parallel efficiency
  – Measures how well the implementation scales on the given architecture
• Wall-clock speedup
• Scaled speedup
  – Important, since one real utility of MPPs is solving the big problem
  – Also, no matter what, the absolute speedup should always decrease (exceptional cases of superlinear speedup excluded)
• Amdahl's Law is still often underestimated
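The first two measurements above reduce to two one-line formulas; a minimal sketch, with made-up illustrative timings rather than measurements:

```python
# Speedup and parallel efficiency from measured wall-clock times.
# All timing values below are illustrative assumptions.

def speedup(t1, tp):
    """S(n, P) = T(n, 1) / T(n, P)."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(n, P) = S(n, P) / P; 1.0 means perfect scaling."""
    return speedup(t1, tp) / p

t_serial = 100.0                        # seconds on 1 processor (assumed)
timings = {8: 14.0, 16: 7.5, 32: 4.2}   # seconds on P processors (assumed)
for p, tp in timings.items():
    print(p, round(speedup(t_serial, tp), 1), round(efficiency(t_serial, tp, p), 2))
```

The efficiency column typically falls as P grows, which is the quantitative form of the "absolute speedup should always decrease" remark above.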
Example: parallel speedup compared to 1 processor of a C90

Molecular dynamics provides the source term for diffusion. Atom trajectories r_i(t) are obtained from integration of the classical equations of motion,

  m_i d²r_i/dt² = F_i,   F_i = -∇_i Σ_j V(r_ij).

[Figure: T3D performance in GFLOPS vs. number of processors (0-300), rising to roughly 2 GFLOPS, compared against 1 YMP C90 processor; inset shows an As (15 keV) cascade.]

Example from Tomas Diaz de la Rubia, LLNL, working with AT&T.
Example of wall clock speedup
Example: B-Si self-interstitial complex in Si. Memory: 102 Mbytes; data: 210 Mbytes.
[Figure: total time to solve vs. number of nodes (8-64); at 256 nodes: 12 minutes, 720 MFlops.]
Slide courtesy of L. Yang, working with Xerox.
Example of parallel efficiency
Using the Shared Memory Access Library (SHMEM) in C, for a system of 216 GaAs atoms.
[Figure: total time (sec) per iteration and parallel efficiency (0.8-1.0) vs. number of nodes (0-64).]
Slide courtesy of L. Yang, working with Xerox.
Example: Parallel efficiency of OVERFLOW/MLP
• OVERFLOW CFD code at NASA/Ames
• High, sustained GFLOP/s rate
• With Multi-Level Parallelism (MLP)
• Scalable at large CPU counts
• On a 512-processor ccNUMA Origin 2000
[Figure: performance (GFLOP/s, 0-75) vs. number of CPUs (0-512).]
Examples: fixed problem size and problem size scaled with the number of PEs

Performance results show nearly ideal speed-up for the explicit case (even for fixed problem size).
[Figure: CPU sec/timestep vs. number of processors (1-100, log-log) for the T3E and T3D, each against linear scaling.]
For the scaled problem, the speedup is virtually linear for both machines. Example from NIMROD.
• The T3E is a factor of 4.5 faster:
  – 2x processor speed
  – chaining
  – cache effects
Algorithm Design can affect both performance and scalability
MultiGrid-preconditioned CG (MGCG) performance: fast and scalable.
[Figure, left: norm of the relative residual (1e-10 to 1) vs. number of iterations (0-30); right: scaled speedup vs. number of processors (1-256).]
• The ParFlow MGCG linear solver is 100 times faster than competing solvers.
Example from ISPC: Bosl and Ashby, LLNL.