CIS 429/529 2007 Parallel Arch. Intro
Parallel Computer Architecture
Slides adapted from those of
David Patterson and David Culler, UC Berkeley
Krste Asanovic, MIT
Lecture Roadmap
– Motivation
– Flynn's Taxonomy
– History of parallel computers
– SIMD – vector architecture (light coverage)
– MIMD – shared memory and distributed memory architectures (in depth, with focus on memory coherence)
– Performance of parallel computers
Uniprocessor Performance (SPECint)
[Figure: performance relative to the VAX-11/780, 1978–2006, log scale; growth at 25%/year, then 52%/year, then ??%/year, with a roughly 3X gap opening against the 52%/year trend]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
Limits of ILP
[Figure: left – fraction of total cycles (%) vs. number of instructions issued (0 to 6+); right – speedup vs. instructions issued per cycle (0 to 15)]
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
The Rise of Multiprocessors
• Advantage of leveraging design investment by replication
– Rather than new unique designs
• Major need in the scientific computing arena
• Growth in data-intensive applications
– Databases, file servers, …
• Growing interest in servers and server performance
• Increasing desktop performance less important
– Outside of graphics
• Improved understanding of how to use multiprocessors effectively
– Especially servers, where there is significant natural TLP
Definition: Parallel Computer
• Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
Almasi and Gottlieb, Highly Parallel Computing, 1989
• Parallel Architecture = Computer Architecture + Communication Architecture
Parallel Architecture Design Issues
– How large a collection of processors?
– How powerful are the processing elements?
– How do they cooperate and communicate?
– How are data transmitted between processors?
– Where to put the memory and I/O?
– What type of interconnection?
– What are the HW and SW primitives for the programmer?
– Does it translate into performance?
Flynn's Taxonomy
• Flynn classified machines by their data and control streams in 1966:
– Single Instruction, Single Data (SISD): uniprocessor
– Single Instruction, Multiple Data (SIMD): vector machines, CM-2
– Multiple Instruction, Single Data (MISD): (????)
– Multiple Instruction, Multiple Data (MIMD): supercomputers, clusters, SMP servers
• SIMD exploits Data-Level Parallelism
• MIMD exploits Thread-Level Parallelism
• MIMD popular because
– Flexible: N programs or 1 multithreaded program
– Cost-effective: same microprocessor in desktop & MIMD
M.J. Flynn, "Very High-Speed Computing Systems", Proc. of the IEEE, Vol. 54, pp. 1901–1909, Dec. 1966.
Scientific Supercomputing
• Proving ground and driver for innovative architecture and techniques
– Market smaller relative to commercial as MPs become mainstream
– Dominated by vector machines starting in the 70s
– Microprocessors have made huge gains in floating-point performance
» high clock rates
» pipelined floating-point units (e.g., multiply-add every cycle)
» instruction-level parallelism
» effective use of caches (e.g., automatic blocking)
• Large-scale multiprocessors eventually dominate over vector supercomputers
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines a mainstay in many industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion efficiency)
– Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
» in all of the above
» entertainment (films like Toy Story)
» architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.
Applications: Speech and Image Processing
[Figure: compute demand from 1 MIPS to 10 GIPS, 1980–1995 – sub-band speech coding; 200-word isolated speech recognition; speaker verification; CELP speech coding; ISDN-CD stereo receiver; telephone number recognition; 1,000-word continuous speech recognition; CIF video; 5,000-word continuous speech recognition; HDTV receiver]
• Also CAD, Databases, …
Commercial Computing
• Relies on parallelism for the high end
– Computational power determines the scale of business that can be handled
• Databases, online transaction processing, decision support, data mining, data warehousing
Summary of Application Trends
• Transition to parallel computing has occurred for scientific and engineering computing
• Rapid progress in commercial computing
– Databases and transactions as well as financial applications
– Usually smaller-scale, but large-scale systems also used
• Desktop also uses multithreaded programs, which are a lot like parallel programs
• Demand for improving throughput on sequential workloads
• Solid application demand exists and will increase
Economics
• Commodity microprocessors not only fast but CHEAP
– Development costs tens of millions of dollars
– BUT many more are sold compared to supercomputers
– Crucial to take advantage of the investment and use the commodity building block
• Multiprocessors being pushed by software vendors (e.g., database) as well as hardware vendors
• Standardization makes small, bus-based SMPs a commodity
• Desktop: a few smaller processors versus one larger one
• Multiprocessor on a chip -> multicore
Supercomputer Applications
Typical application areas
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)
All involve huge computations on large data sets
In the 70s–80s, Supercomputer = Vector Machine
Multiprocessor Trends
[Figure: number of processors per bus-based multiprocessor, 1984–1998, growing from a handful to about 70 – Sequent B8000 and B2100, Symmetry21 and Symmetry81, Power, SGI PowerSeries, Challenge, and PowerChallenge/XL, Sun SS690MP 120/140, SS10, SS20, SS1000, SS1000E, SC2000, SC2000E, E6000, E10000, SE10, SE30, SE60, SE70, HP K400, AS2100, AS8400, P-Pro, CRAY CS6400]
Raw Parallel Performance: LINPACK
[Figure: LINPACK GFLOPS, 1985–1996, log scale from 0.1 to 10,000; CRAY peak – Xmp/416(4), Ymp/832(8), C90(16), T932(32); MPP peak – iPSC/860, nCUBE/2(1024), CM-2, CM-200, Delta, CM-5, T3D, Paragon XP/S, Paragon XP/S MP(1024), Paragon XP/S MP(6768), ASCI Red; early SIMD machines give way to MIMD MPPs, which overtake the vector machines]
Whither Parallel Machines?
• 1997, 500 fastest machines in the world: 319 MPPs, 73 bus-based shared memory (SMP), 106 parallel vector processors (PVP)
• 2000, 381 of 500 fastest: 144 IBM SP (~cluster), 121 Sun (bus SMP), 62 SGI (NUMA SMP), 54 Cray (NUMA SMP)
Vector Supercomputers
Epitomized by Cray-1, 1976:
Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory
Older Vector Machines

| Machine | Year | Clock | Regs | Elements | FUs | LSUs |
|---|---|---|---|---|---|---|
| Cray 1 | 1976 | 80 MHz | 8 | 64 | 6 | 1 |
| Cray XMP | 1983 | 120 MHz | 8 | 64 | 8 | 2 L, 1 S |
| Cray YMP | 1988 | 166 MHz | 8 | 64 | 8 | 2 L, 1 S |
| Cray C-90 | 1991 | 240 MHz | 8 | 128 | 8 | 4 |
| Cray T-90 | 1996 | 455 MHz | 8 | 128 | 8 | 4 |
| Conv. C-1 | 1984 | 10 MHz | 8 | 128 | 4 | 1 |
| Conv. C-4 | 1994 | 133 MHz | 16 | 128 | 3 | 1 |
| Fuj. VP200 | 1982 | 133 MHz | 8-256 | 32-1024 | 3 | 2 |
| Fuj. VP300 | 1996 | 100 MHz | 8-256 | 32-1024 | 3 | 2 |
| NEC SX/2 | 1984 | 160 MHz | 8+8K | 256+var | 16 | 8 |
| NEC SX/3 | 1995 | 400 MHz | 8+8K | 256+var | 16 | 8 |
Cray-1 (1976)
[Figure: Cray-1 datapath – single-port memory (16 banks of 64-bit words + 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction buffer refill); 4 instruction buffers (64-bit x 16) feeding NIP/LIP/CIP; 8 address registers A0–A7 backed by 64 B registers; 8 scalar registers S0–S7 backed by 64 T registers; 8 vector registers V0–V7 with vector mask and vector length; functional units for FP add/multiply/reciprocal, integer add/logic/shift, population count, and address add/multiply; memory bank cycle 50 ns, processor cycle 12.5 ns (80 MHz)]
Vector Programming Model
[Figure: scalar registers r0–r15; vector registers v0–v15, each holding elements [0]…[VLRMAX-1]; the vector length register VLR sets how many elements an instruction processes. A vector arithmetic instruction such as ADDV v3, v1, v2 adds v1 and v2 elementwise over [0]…[VLR-1]. Vector load/store instructions such as LV v1, r1, r2 move a vector between memory and a vector register using base address r1 and stride r2.]
Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
        LI     R4, 64
loop:   L.D    F0, 0(R1)
        L.D    F2, 0(R2)
        ADD.D  F4, F2, F0
        S.D    F4, 0(R3)
        DADDIU R1, 8
        DADDIU R2, 8
        DADDIU R3, 8
        DSUBIU R4, 1
        BNEZ   R4, loop

# Vector Code
        LI     VLR, 64
        LV     V1, R1
        LV     V2, R2
        ADDV.D V3, V1, V2
        SV     V3, R3
Vector Instruction Set Advantages
• Compact
– one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
– are independent
– use the same functional unit
– access disjoint registers
– access registers in the same pattern as previous instructions
– access a contiguous block of memory (unit-stride load/store)
– access memory in a known pattern (strided load/store)
• Scalable
– can run same object code on more parallel pipelines or lanes
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: V3 <- V1 * V2 flowing through a six-stage multiply pipeline]
Vector Memory System
[Figure: an address generator takes a base and stride from the vector unit and spreads element addresses across 16 memory banks (0–F); results return to the vector registers]
• Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency
• Bank busy time: cycles between accesses to the same bank
Vector Instruction Execution: ADDV C, A, B
[Figure: with one pipelined functional unit, one element pair A[i], B[i] enters per cycle, producing C[0], C[1], C[2], … in sequence; with four pipelined functional units (lanes), four element pairs enter per cycle – lane 0 produces C[0], C[4], C[8], …, lane 1 produces C[1], C[5], C[9], …, and so on]
Vectors Are Inexpensive
Scalar
• N ops per cycle => O(N²) circuitry
• HP PA-8000
– 4-way issue
– reorder buffer: 850K transistors
– incl. 6,720 5-bit register number comparators
Vector
• N ops per cycle => O(N + εN²) circuitry
• T0 vector micro
– 24 ops per cycle
– 730K transistors total
– only 23 5-bit register number comparators
– no floating point
Vectors Lower Power
Vector
• One instruction fetch, decode, dispatch per vector
• Structured register accesses
• Smaller code for high performance; less power in instruction cache misses
• Bypass cache
• One TLB lookup per group of loads or stores
• Move only necessary data across chip boundary
Single-issue Scalar
• One instruction fetch, decode, dispatch per operation
• Arbitrary register accesses add area and power
• Loop unrolling and software pipelining for high performance increase instruction cache footprint
• All data passes through cache; wastes power if no temporal locality
• One TLB lookup per load or store
• Off-chip access in whole cache lines
Superscalar Energy Efficiency Even Worse
Vector
• Control logic grows linearly with issue width
• Vector unit switches off when not in use
• Vector instructions expose parallelism without speculation
• Software control of speculation when desired:
– Whether to use vector mask or compress/expand for conditionals
Superscalar
• Control logic grows quadratically with issue width
• Control logic consumes energy regardless of available parallelism
• Speculation to increase visible parallelism wastes energy
Vector Applications
Limited to scientific computing?
• Multimedia processing (compression, graphics, audio synthesis, image processing)
• Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort)
• Lossy compression (JPEG, MPEG video and audio)
• Lossless compression (zero removal, RLE, differencing, LZW)
• Cryptography (RSA, DES/IDEA, SHA/MD5)
• Speech and handwriting recognition
• Operating systems/networking (memcpy, memset, parity, checksum)
• Databases (hash/join, data mining, image/video serving)
• Language run-time support (stdlib, garbage collection)
• even SPECint95
Newer Vector Computers
• Cray X1
– MIPS-like ISA + vectors, in CMOS
• NEC Earth Simulator
– Fastest computer in the world for 3 years; 40 TFLOPS
– 640 CMOS vector nodes
Key Architectural Features of X1
New vector instruction set architecture (ISA)
– Much larger register set (32x64 vector, 64+64 scalar)
– 64- and 32-bit memory and IEEE arithmetic
– Based on 25 years of experience compiling with the Cray-1 ISA
Decoupled execution
– Scalar unit runs ahead of vector unit, doing addressing and control
– Hardware dynamically unrolls loops and issues multiple loops concurrently
– Special sync operations keep pipeline full, even across barriers
– Allows the processor to perform well on short nested loops
Scalable, distributed shared memory (DSM) architecture
– Memory hierarchy: caches, local memory, remote memory
– Low-latency load/store access to the entire machine (tens of TBs)
– Processors support 1000's of outstanding refs with flexible addressing
– Very high bandwidth network
– Coherence protocol, addressing and synchronization optimized for DM
A Modern Vector Super: NEC SX-6 (2003)
• CMOS technology
– 500 MHz CPU, fits on a single chip
– SDRAM main memory (up to 64 GB)
• Scalar unit
– 4-way superscalar with out-of-order and speculative execution
– 64KB I-cache and 64KB data cache
• Vector unit
– 8 foreground VRegs + 64 background VRegs (256x64-bit elements/VReg)
– 1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit
– 8 lanes (8 GFLOPS peak, 16 FLOPS/cycle)
– 1 load & store unit (32x8-byte accesses/cycle)
– 32 GB/s memory bandwidth per processor
• SMP structure
– 8 CPUs connected to memory through crossbar
– 256 GB/s shared memory bandwidth (4096 interleaved banks)
SIMD Vector Summary
• Vector is an alternative model for exploiting ILP
• If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order machines
• Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching
• Will multimedia popularity revive vector architectures?
MIMD Architectures
• 2 classes of multiprocessors with respect to memory:
1. Centralized-Memory Multiprocessor
• < few dozen processor chips (and < 100 cores) in 2006
• Small enough to share a single, centralized memory
2. Physically Distributed-Memory Multiprocessor
• Larger number of chips and cores (greater scalability)
• BW demands => memory distributed among processors
Centralized vs. Distributed Memory
[Figure: centralized memory – processors P1…Pn, each with a cache ($), share memory modules through an interconnection network; distributed memory – each processor P1…Pn has a cache and a local memory, and the nodes are joined by an interconnection network. Scale increases from the centralized to the distributed design.]
Centralized Memory Multiprocessor
• Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors
• Large caches => a single memory can satisfy the memory demands of a small number of processors
• Can scale to a few dozen processors by using a switch and many memory banks
• Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases
Distributed Memory Multiprocessor
• Pro: Cost-effective way to scale memory bandwidth
– If most accesses are to local memory
• Pro: Reduces latency of local memory accesses
• Con: Communicating data between processors is more complex
• Con: Must change software to take advantage of increased memory BW
2 Models for Communication and Memory Architecture
1. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors
2. Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either
• UMA (Uniform Memory Access time) for shared address, centralized memory MP
• NUMA (Non-Uniform Memory Access time) for shared address, distributed memory MP
• In the past, confusion over whether "sharing" means sharing physical memory (Symmetric MP) or sharing the address space
Performance of Parallel Processing
• First challenge is the fraction of the program that is inherently sequential
• Suppose we want 80X speedup from 100 processors. What fraction of the original program can be sequential?
a. 10%
b. 5%
c. 1%
d. <1%
Challenges of Parallel Processing
• Second challenge is the long latency to remote memory
• Suppose a 32-CPU NUMA: base CPI is 0.5, and remote accesses make up 0.2% of instructions but take 400 cycles each
• What is the performance impact?
a. 1.5X
b. 2.0X
c. 2.5X
Speedup
• Speedup (p processors) = Performance (p processors) / Performance (1 processor)
• Speedup, fixed problem (p processors) = Time (1 processor) / Time (p processors)
Speedup – what's happening?
• Ideally, linear speedup
• In reality, communication overhead reduces it
• Surprisingly, super-linear speedup is achievable (e.g., when the per-processor share of the data starts fitting in cache)
Amdahl’s Law
• Most fundamental limitation on parallel speedup
• If fraction s of sequential execution is inherently serial, speedup <= 1/s
• Example: 2-phase calculation
 – sweep over an n-by-n grid and do some independent computation
 – sweep again and add each value to a global sum
• Time for first phase = n²/p
• Second phase serialized at the global variable, so time = n²
• Speedup <= 2n² / (n²/p + n²), or at most 2
• Trick: divide second phase into two
 – accumulate into a private sum during the sweep
 – add per-process private sums into the global sum
• Parallel time is n²/p + n²/p + p, and speedup is at best 2n² / (2n²/p + p) = 2pn² / (2n² + p²)
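The two variants above can be compared numerically; a small sketch (the sample n and p are arbitrary):

```python
# Two-phase n-by-n grid computation from the slide.
def speedup_serial_sum(n, p):
    # Phase 1 parallel (n^2/p); phase 2 serialized at the global sum (n^2).
    return 2 * n**2 / (n**2 / p + n**2)

def speedup_private_sum(n, p):
    # Both phases parallel (n^2/p each) plus a final p-step reduction.
    return 2 * n**2 / (2 * n**2 / p + p)

n, p = 1000, 100
print(f"serialized sum: {speedup_serial_sum(n, p):.2f}")   # bounded by 2
print(f"private sums:   {speedup_private_sum(n, p):.2f}")  # close to p
```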
Amdahl’s Law
[Figure: work done concurrently vs. time on p processors for cases (a), (b), and (c), with parallel phases of n²/p work and a serial phase of n² work]
Concurrency Profiles
– Area under curve is total work done, or time with 1 processor
– Horizontal extent is lower bound on time (infinite processors)
– Speedup is the ratio Σ_{k=1..∞} f_k·k / Σ_{k=1..∞} f_k·⌈k/p⌉; base case: 1 / (s + (1 − s)/p)
– Amdahl's law applies to any overhead, not just limited concurrency

[Figure: concurrency profile of a sample program — concurrency vs. clock cycle number]
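The concurrency-profile speedup can be computed directly from a profile; a sketch with a made-up profile (f maps each concurrency level to the number of cycles spent at it):

```python
import math

# Speedup from a concurrency profile: f[k] = number of cycles during which
# k units of work are available. On p processors, each such cycle's k units
# take ceil(k/p) cycles.
def profile_speedup(f, p):
    work = sum(k * cycles for k, cycles in f.items())                # 1-processor time
    time_p = sum(cycles * math.ceil(k / p) for k, cycles in f.items())
    return work / time_p

profile = {1: 100, 4: 50, 16: 25}   # hypothetical profile
print(profile_speedup(profile, 4))  # 2.8
```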
Communication Performance Metrics: Latency and Bandwidth
1. Bandwidth
 – Need high bandwidth in communication
 – Match limits in network, memory, and processor
 – Challenge is link speed of network interface vs. bisection bandwidth of network
2. Latency
 – Affects performance, since the processor may have to wait
 – Affects ease of programming, since it requires more thought to overlap communication and computation
 – Overhead to communicate is a problem in many machines
3. Latency Hiding
 – How can a mechanism help hide latency?
 – Increases programming system burden
 – Examples: overlap message send with computation, prefetch data, switch to other tasks
Networks
• Design Options:
• Topology
• Routing
• Direct vs. Indirect
• Physical implementation
• Evaluation Criteria:
• Latency
• Bisection Bandwidth
• Contention and hot-spot behavior
• Partitionability
• Cost and scalability
• Fault tolerance
Buses
• Simple and cost-effective for small-scale multiprocessors
• Not scalable (limited bandwidth; electrical complications)
[Figure: processors P sharing a single bus]
Crossbars
• Each port has link to every other port
+ Low latency and high throughput
- Cost grows as O(N^2) so not very scalable.
- Difficult to arbitrate and to get all data lines into and out of a centralized crossbar.
• Used in small-scale MPs and as building block for other networks (e.g., Omega).
[Figure: 4×4 crossbar connecting processors P to memories M]
Rings
• Cheap: Cost is O(N).
• Point-to-point wires and pipelining can be used to make them very fast.
+ High overall bandwidth
- High latency O(N)
• Examples: KSR machine, Hector
[Figure: six processors P connected in a ring]
Trees
• Cheap: Cost is O(N).
• Latency is O(log N).
• Easy to lay out as planar graphs (e.g., H-Trees).
• For random permutations, the root can become a bottleneck.
• To avoid the root becoming a bottleneck, use Fat-Trees (as in the CM-5):
 – channels are wider as you move toward the root.

[Figure: an H-Tree layout and a Fat Tree]
Hypercubes
• Also called binary n-cubes. # of nodes = N = 2^n.
• Latency is O(log N); out-degree of each PE is O(log N)
• Minimizes hops; good bisection BW; but tough to lay out in 3-space
• Popular in early message-passing computers (e.g., Intel iPSC, nCUBE)
• Used as a direct network ==> emphasizes locality

[Figure: 0-D through 4-D hypercubes]
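Routing in a hypercube reduces to bit arithmetic on node IDs; a minimal sketch:

```python
# In a binary n-cube, adjacent nodes differ in exactly one ID bit, so the
# minimal route length between two nodes is the Hamming distance of their IDs.
def hypercube_hops(a, b):
    return bin(a ^ b).count("1")

print(hypercube_hops(0b0000, 0b1111))  # 4 hops: opposite corners of a 4-D cube
print(hypercube_hops(0b0101, 0b0100))  # 1 hop: neighbors
```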
Multistage Logarithmic Networks
• Cost is O(N log N); latency is O(log N); throughput is O(N).
• Generally indirect networks.
• Many variations exist (Omega, Butterfly, Benes, ...).
• Used in many machines: BBN Butterfly, IBM RP3, ...
Omega Network
• All stages are same, so can use recirculating network.
• Single path from source to destination.
• Can add extra stages and pathways to minimize collisions and increase fault tolerance.
• Can support combining. Used in IBM RP3.
[Figure: 8×8 Omega network, inputs 000–111 to outputs 000–111]
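The single source-to-destination path can be found by destination-tag routing: at stage i, the message leaves on the switch output port given by bit i (MSB first) of the destination address. A sketch for the 8×8 case:

```python
# Destination-tag routing in an 8x8 (3-stage) Omega network: the output
# port taken at each stage is the corresponding bit of the destination
# address, most significant bit first.
def omega_route(dest, stages=3):
    return [(dest >> (stages - 1 - i)) & 1 for i in range(stages)]

print(omega_route(0b101))  # [1, 0, 1]: lower, upper, lower switch outputs
```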
Butterfly Network
[Figure: 8×8 Butterfly network, inputs 000–111 to outputs 000–111; stages split on MSB first, LSB last]
• Equivalent to the Omega network; the routing of messages is easy to see.
• Also very similar to hypercubes (direct vs. indirect, though).
• The bisection of the network is clearly N/2 channels.
• Can use higher-degree switches to reduce depth. Used in BBN machines.
k-ary n-cubes
• Generalization of hypercubes (k nodes along each dimension)
• Total # of nodes = N = k^n.
• k > 2 reduces # of channels at bisection, thus allowing for wider channels but more hops.
[Figure: a 4-ary 3-cube]
Routing Strategies and Latency
• Store-and-Forward routing:
• Tsf = Tc • ( D • L / W)
• L = msg length, D = # of hops, W = channel width, Tc = per-hop delay
• Wormhole routing:
• Twh = Tc • (D + L / W)
• # of hops is an additive rather than multiplicative factor
• Virtual Cut-Through routing:
• Older than and similar to wormhole routing; when blockage occurs, however, the message is removed from the network and buffered.
• Deadlocks are avoided through the use of virtual channels and a routing strategy that does not allow channel-dependency cycles.
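The two latency formulas above are easy to compare; a sketch with hypothetical parameter values:

```python
# Latency under the two routing strategies, using the slide's symbols:
# Tc = per-hop delay, D = number of hops, L = message length, W = channel width.
def store_and_forward(Tc, D, L, W):
    return Tc * D * (L / W)    # entire message is received before forwarding

def wormhole(Tc, D, L, W):
    return Tc * (D + L / W)    # header pays per-hop cost; body pipelines behind it

Tc, D, L, W = 1, 10, 1024, 8   # hypothetical values
print(store_and_forward(Tc, D, L, W))  # 1280.0 cycles
print(wormhole(Tc, D, L, W))           # 138.0 cycles
```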
Advantages of Low-Dimensional Nets
• What can be built in VLSI is often wire-limited
• LDNs are easier to lay out:
 – more uniform wiring density (easier to embed in 2-D or 3-D space)
 – mostly local connections (e.g., grids)
• Compared with HDNs (e.g., hypercubes), LDNs have:
 – shorter wires (reduces hop latency)
 – fewer wires (increases bandwidth given constant bisection width)
  » increased channel width is the major reason why LDNs win!
• Factors that limit end-to-end latency:
 – LDNs: number of hops
 – HDNs: length of message going across very narrow channels
• LDNs have better hot-spot throughput
 – more pins per node than HDNs
Cluster Interconnect Technology (off-the-shelf)
Dolphin
• Implementation of SCI (Scalable Coherent Interface)
• Torus topology
 – (Usually) no switches
  » Dolphin has small (8-ported) switches
Dolphin
• PCI Card
 – $1300–$2500 (USD)
• MPI performance:
 – Latency: 4 microseconds
 – Bandwidth: 300 MB/s
• Supported Software: Linux and Solaris
• Supported CPUs: x86, SPARC, Alpha
Giganet
• cLAN
 – VI interface
 – 1.25 Gb/s switched interconnect
• MPI Performance:
 – Latency: 12 microseconds
 – Bandwidth: 107 MB/s
• Supported Software: Linux and Windows
• Supported CPUs: x86
Myrinet
• 2 Gb/s switched interconnect
• MPI performance:
 – Latency: 10 microseconds
 – Bandwidth: 240 MB/s
• PCI Card
 – USD$1200–$1800
Myrinet
• Switch
 – 128 ports in one chassis
 – Cost: ~USD$50,000
• Supported Software: Linux, Windows, Solaris, Tru64, Irix, VxWorks
• Supported CPUs: x86, Itanium, Alpha, PowerPC, SPARC, O200
Quadrics
• Switched Interconnect
• Interconnect on the PSC Terascale machine
• MPI Performance:
 – Latency: 5 microseconds
 – Bandwidth: 300 MB/s
• Supported Software: Tru64 and Linux
• Supported CPUs: Alpha, IA-32
Ethernet
• 1 Gb/s switched interconnect
• PCI Card
 – “Free” (integrated into motherboard), or
 – USD$90
• MPI Performance:
 – Latency: 70 microseconds
 – Bandwidth: 50 MB/s
• High volume
• 128-port GbE switch
 » USD$160,000
• Hardware and software support: ubiquitous
Ethernet
• 10 GbE on the horizon– 10 GbE is equivalent to OC-192
Infiniband
• 2.5 Gb/s switched interconnect
 – Multiple link widths: 1x, 4x, 12x (250 MB/s, 1 GB/s, 3 GB/s)
• Engineered with many good networking ideas
 – E.g., OS bypass (registered memory), selectable transfer units, remote DMA
• Management and form factors specified up front
 – Focus on interoperability
  » Learning from Fibre Channel’s mistakes!
Summary

| Network | MPI Latency (μsec) | MPI BW (MB/s) | NIC Cost (USD) | Switch Cost (USD) | CPU Support | Software Support |
|---|---|---|---|---|---|---|
| Dolphin | 4 | 300 | 1300–2500 | NA | x86, SPARC, Alpha | Linux, Solaris |
| GigaNet | 12 | 107 | ??? | ??? | x86 | Linux, Windows |
| Myrinet | 10 | 240 | 1200–1800 | 50,000 | x86, IA-64, Alpha, PowerPC, SPARC, O200 | Linux, Windows, Solaris, Tru64, Irix, VxWorks |
| Quadrics | 5 | 300 | ??? | ??? | x86, Alpha | Tru64, Linux |
| Ethernet | 70 | 50 | 0–90 | 160,000 | ubiquitous | ubiquitous |