CS252 Graduate Computer Architecture Lecture 14 Multiprocessor Networks March 10 th , 2010
Lecture 22: Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB.

Review: Snoopy Cache Protocol

- Write Invalidate Protocol:
  - Multiple readers, single writer
  - Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
  - Read miss:
    - Write-through: memory is always up-to-date
    - Write-back: snoop in caches to find the most recent copy
- Write Broadcast Protocol (typically write-through)
- Write serialization: the bus serializes requests! The bus is the single point of arbitration
- Good for a small number of processors; how about 16 or more?
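
The write-invalidate behavior above can be sketched in a few lines. This is a minimal illustration, assuming write-through caches (so memory is always up to date) and a bus that delivers every invalidate to every other cache. The names `Bus` and `Cache` and the address `0x40` are illustrative assumptions, not taken from the lecture.

```python
class Bus:
    """Shared bus: every attached cache snoops every invalidate."""

    def __init__(self):
        self.caches = []
        self.memory = {}          # addr -> value (always current: write-through)

    def broadcast_invalidate(self, addr, writer):
        # All caches except the writer drop their copy of the block.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)


class Cache:
    def __init__(self, bus):
        self.lines = {}           # addr -> value for valid lines only
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                    # read miss
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, self)     # become the single writer
        self.lines[addr] = value
        self.bus.memory[addr] = value                 # write-through to memory


bus = Bus()
c0, c1 = Cache(bus), Cache(bus)
c0.read(0x40)
c1.read(0x40)          # two readers now share the block
c0.write(0x40, 7)      # c1's copy is invalidated by the bus broadcast
```

After the write, `c1` no longer holds the block; its next read misses and fetches the new value from memory. Because every write goes through the one bus, writes are serialized, which is also why the scheme stops scaling as processors are added.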

Larger MPs

- Separate memory per processor
- Local or remote access via memory controller
- One cache coherency solution: non-cached pages
- Alternative: a directory per cache that tracks the state of every block in every cache
  - Which caches have copies of the block, dirty vs. clean, ...
- Info per memory block vs. per cache block?
  - PLUS: in memory => simpler protocol (centralized/one location)
  - MINUS: in memory => directory is ƒ(memory size) vs. ƒ(cache size)
- Prevent the directory from becoming a bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of their blocks
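
The "directory is ƒ(memory size)" cost can be made concrete with a rough calculation. The sketch below assumes a full bit-vector directory kept with memory: one presence bit per processor plus a few state bits for every memory block. The function name, the 2 state bits, and the 64-byte block size are assumptions chosen for illustration.

```python
def directory_overhead(num_procs, block_bytes, state_bits=2):
    """Directory bits per memory block, as a fraction of the block's data bits."""
    bits_per_entry = num_procs + state_bits      # presence bits + state
    return bits_per_entry / (block_bytes * 8)


# Overhead for a few machine sizes with (assumed) 64-byte blocks.
for procs in (16, 64, 256):
    print(procs, f"{directory_overhead(procs, 64):.1%}")
```

Under these assumptions the overhead is a few percent at 16 processors, roughly 13% at 64, and about half of memory at 256, which is why the per-memory-block organization grows expensive on large machines.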

Distributed Directory MPs

[Figure: six nodes, each containing Proc+Cache, Memory, Dir, and I/O, all connected by an Interconnection Network.]

Directory Protocol

- Similar to Snoopy Protocol: three states
  - Shared: ≥ 1 processors have data, memory up-to-date
  - Uncached: no processor has it; not valid in any cache
  - Exclusive: 1 processor (owner) has data; memory out-of-date
- In addition to cache state, must track which processors have data when in the shared state (usually a bit vector, 1 if the processor has a copy)
- Keep it simple(r):
  - Writes to non-exclusive data => write miss
  - Processor blocks until access completes
  - Assume messages are received and acted upon in the order sent
- See textbook for directory state machine
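
The sharer bit vector mentioned above can be sketched with an ordinary integer, where bit `p` is 1 if processor `p` has a copy. The helper names are hypothetical and chosen only for this illustration.

```python
def add_sharer(vector, proc):
    """Set the presence bit for processor `proc`."""
    return vector | (1 << proc)


def remove_sharer(vector, proc):
    """Clear the presence bit for processor `proc`."""
    return vector & ~(1 << proc)


def sharers(vector):
    """List the processor numbers whose presence bit is set."""
    return [p for p in range(vector.bit_length()) if (vector >> p) & 1]


v = 0                      # Uncached: no presence bits set
v = add_sharer(v, 0)
v = add_sharer(v, 5)       # processors 0 and 5 now share the block
```

On a write miss, the home directory would walk `sharers(v)` to decide where to send invalidates, then reset the vector to the single new owner.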

Directory Protocol

- No bus and don't want to broadcast:
  - the interconnect is no longer a single arbitration point
  - all messages have explicit responses
- Terms: typically 3 processors involved
  - Local node: where a request originates
  - Home node: where the memory location of an address resides
  - Remote node: has a copy of a cache block, whether exclusive or shared
- Example messages on next slide: P = processor number, A = address

Directory Protocol Messages

| Message type | Source | Destination | Msg content | Function |
|---|---|---|---|---|
| Read miss | Local cache | Home directory | P, A | Processor P reads data at address A; make P a read sharer and arrange to send data back |
| Write miss | Local cache | Home directory | P, A | Processor P writes data at address A; make P the exclusive owner and arrange to send data back |
| Invalidate | Home directory | Remote caches | A | Invalidate a shared copy at address A |
| Fetch | Home directory | Remote cache | A | Fetch the block at address A and send it to its home directory |
| Fetch/Invalidate | Home directory | Remote cache | A | Fetch the block at address A and send it to its home directory; invalidate the block in the cache |
| Data value reply | Home directory | Local cache | Data | Return a data value from the home memory (read miss response) |
| Data write-back | Remote cache | Home directory | A, Data | Write back a data value for address A (invalidate response) |
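
The messages in the table can be tied to the three directory states with a small sketch of what the home directory does for one block. This is a simplified reading of the slides, not the textbook state machine: it assumes in-order delivery, ignores the data payloads, and treats the set `sharers` as the bit vector. `DirEntry` and `home_handle` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class DirEntry:
    state: str = "Uncached"                    # "Uncached", "Shared", "Exclusive"
    sharers: set = field(default_factory=set)  # processors holding a copy


def home_handle(entry, msg, p):
    """Update one directory entry; return the (message, destination) pairs sent."""
    out = []
    if msg == "Read miss":
        if entry.state == "Exclusive":
            (owner,) = entry.sharers           # exactly one owner
            out.append(("Fetch", owner))       # owner keeps a shared copy
        entry.sharers.add(p)
        entry.state = "Shared"
        out.append(("Data value reply", p))
    elif msg == "Write miss":
        if entry.state == "Shared":
            out += [("Invalidate", s) for s in sorted(entry.sharers - {p})]
        elif entry.state == "Exclusive":
            (owner,) = entry.sharers
            out.append(("Fetch/Invalidate", owner))
        entry.sharers = {p}                    # p becomes the exclusive owner
        entry.state = "Exclusive"
        out.append(("Data value reply", p))
    elif msg == "Data write-back":
        entry.sharers = set()                  # memory is up to date again
        entry.state = "Uncached"
    else:
        raise ValueError(f"unknown message: {msg}")
    return out


entry = DirEntry()
log = [
    home_handle(entry, "Read miss", 1),
    home_handle(entry, "Read miss", 2),
    home_handle(entry, "Write miss", 3),
    home_handle(entry, "Read miss", 1),
]
```

In this trace, processors 1 and 2 become sharers, processor 3's write miss invalidates both and makes 3 the owner, and the final read miss by processor 1 forces a fetch from the owner before the block returns to the Shared state.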

Parallel App: Commercial Workload

- Online transaction processing workload (OLTP) (like TPC-B or -C)
- Decision support system (DSS) (like TPC-D)
- Web index search (Altavista)

| Benchmark | % Time User Mode | % Time Kernel | % Time I/O (CPU Idle) |
|---|---|---|---|
| OLTP | 71% | 18% | 11% |
| DSS (range) | 82-94% | 3-5% | 4-13% |
| DSS (avg) | 87% | 4% | 9% |
| Altavista | > 98% | < 1% | < 1% |

Alpha 4100 SMP

- 4 CPUs
- Alpha 21164 @ 300 MHz
- L1$ 8KB direct mapped, write through
- L2$ 96KB, 3-way set associative
- L3$ 2MB (off chip), direct mapped
- Memory latency 80 clock cycles
- Cache to cache 125 clock cycles
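
For a sense of scale, the cycle counts above can be converted to time at the stated 300 MHz clock. This is simple arithmetic on the slide's numbers; the function and variable names are illustrative.

```python
CLOCK_MHZ = 300            # Alpha 21164 clock rate from the slide


def cycles_to_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


memory_ns = cycles_to_ns(80)             # memory latency
cache_to_cache_ns = cycles_to_ns(125)    # cache-to-cache transfer
```

That works out to roughly 267 ns for a memory access and roughly 417 ns for a cache-to-cache transfer, so on this machine fetching a block from another processor's cache costs more than going to memory.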

OLTP Performance as vary L3$ size

[Figure: X-axis: L3 cache size (1 MB, 2 MB, 4 MB, 8 MB). Y-axis: normalized execution time (0 to 100). Legend: Idle, PAL Code, Memory Access, L2/L3 Cache Access, Instruction Execution.]

L3 Miss Breakdown

[Figure: X-axis: cache size (1 MB, 2 MB, 4 MB, 8 MB). Y-axis: memory cycles per instruction (0 to 3.25). Legend: Instruction, Capacity/Conflict, Cold, False Sharing, True Sharing.]

Memory CPI as increase CPUs

[Figure: X-axis: processor count (1, 2, 4, 6, 8). Y-axis: memory cycles per instruction (0 to 3). Legend: Instruction, Conflict/Capacity, Cold, False Sharing, True Sharing. Configuration: L3 2MB, 2-way.]

OLTP Performance as vary L3$ size

[Figure: X-axis: block size in bytes (32, 64, 128, 256). Y-axis: misses per 1,000 instructions (0 to 16). Legend: Instruction, Capacity/Conflict, Cold, False Sharing, True Sharing.]

SGI Origin 2000

- A pure NUMA
- 2 CPUs per node
- Scales up to 2048 processors
- Designed for scientific computation vs. commercial processing
- Scalable bandwidth is crucial to Origin

Parallel App: Scientific/Technical

- FFT Kernel: 1D complex number FFT
  - 2 matrix transpose phases => all-to-all communication
  - Sequential time for n data points: O(n log n)
  - Example is 1 million point data set
- LU Kernel: dense matrix factorization
  - Blocking helps cache miss rate, 16x16
  - Sequential time for nxn matrix: O(n^3)
  - Example is 512 x 512 matrix

FFT Kernel

[Figure: FFT. X-axis: processor count (8, 16, 32, 64). Y-axis: average cycles per reference (0.0 to 5.5). Legend: 3-hop miss to remote cache, Remote miss to home, Miss to local memory, Hit.]

LU Kernel

[Figure: LU. X-axis: processor count (8, 16, 32, 64). Y-axis: average cycles per reference (0.0 to 5.5). Legend: 3-hop miss to remote cache, Remote miss to home, Miss to local memory, Hit.]

Parallel App: Scientific/Technical

- Barnes App: Barnes-Hut n-body algorithm solving a problem in galaxy evolution
  - n-body algorithms rely on forces dropping off with distance; if far enough away, can ignore (e.g., gravity is 1/d^2)
  - Sequential time for n data points: O(n log n)
  - Example is 16,384 bodies
- Ocean App: Gauss-Seidel multigrid technique to solve a set of elliptic partial differential equations
  - Red-black Gauss-Seidel colors points in the grid to consistently update points based on previous values of adjacent neighbors
  - Multigrid solves finite difference equations by iteration using a hierarchical grid
  - Communication when boundary accessed by adjacent subgrid
  - Sequential time for nxn grid: O(n^2)
  - Input: 130 x 130 grid points, 5 iterations
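
The red-black ordering used by Ocean can be sketched on a single grid. The example below is not the Ocean code and does no multigrid: it assumes Laplace's equation with a fixed boundary value as a stand-in elliptic problem, and the function names, grid size, and sweep count are illustrative choices.

```python
def red_black_sweep(u):
    """One Gauss-Seidel sweep over interior points, red points first, then black."""
    n = len(u)
    for color in (0, 1):                       # 0 = red, 1 = black
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == color:
                    # Every neighbor of a red point is black, and vice versa,
                    # so all points of one color can be updated independently.
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1])


def solve_laplace(n=6, sweeps=200, boundary=1.0):
    """Relax an n x n grid whose boundary is held at a constant value."""
    u = [[0.0] * n for _ in range(n)]
    for k in range(n):
        u[0][k] = u[n - 1][k] = u[k][0] = u[k][n - 1] = boundary
    for _ in range(sweeps):
        red_black_sweep(u)
    return u


grid = solve_laplace()
```

Because each color depends only on the other color, a parallel version can split the grid into subgrids and update all red points at once, then all black points; processors only need to exchange the values on subgrid boundaries, which is the communication pattern the slide describes.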

Barnes App

[Figure: Barnes. X-axis: processor count (8, 16, 32, 64). Y-axis: average cycles per memory reference (0.0 to 5.5). Legend: 3-hop miss to remote cache, Remote miss to home, Miss to local memory, Hit.]

Ocean App

[Figure: Ocean. X-axis ticks: 8, 16, 32, 64. Y-axis: average cycles per reference (0.0 to 5.5). Legend: 3-hop miss to remote cache, Remote miss, Local miss, Cache hit.]

Example: Sun Wildfire Prototype

1. Connect 2-4 SMPs via optional NUMA technology
   1. Use "off-the-shelf" SMPs as building blocks
2. For example, E6000 with up to 15 processor or I/O boards (2 CPUs/board)
   1. Gigaplane bus interconnect, 3.2 Gbytes/sec
3. Wildfire Interface board (WFI) replaces a CPU board => up to 112 processors (4 x 28)
   1. WFI board supports one coherent address space across 4 SMPs
   2. Each WFI has 3 ports that connect to up to 3 additional nodes, each with a dual directional 800 MB/sec connection
   3. Has a directory cache in the WFI interface: local or clean OK, otherwise sent to home node
   4. Multiple bus transactions

Example: Sun Wildfire Prototype

1. To reduce contention for a page, has Coherent Memory Replication (CMR)
2. Page-level mechanisms for migrating and replicating pages in memory; coherence is still maintained at the cache-block level
3. Page counters record misses to remote pages and are used to migrate/replicate pages with high counts
4. Migrate when a page is primarily used by one node
5. Replicate when multiple nodes share a page
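
The counter-driven migrate/replicate choice can be illustrated with a toy policy. This is not Sun's actual CMR algorithm: the function name, the miss threshold of 64, and the 90% dominance cutoff are all hypothetical values picked only to make the decision rule concrete.

```python
def cmr_decision(remote_misses, threshold=64, dominance=0.9):
    """Decide what to do with one page.

    remote_misses maps node id -> remote-miss count for the page.
    threshold and dominance are made-up tuning knobs for this sketch.
    """
    total = sum(remote_misses.values())
    if total < threshold:
        return ("leave", None)                 # not hot enough to act on
    node, top = max(remote_misses.items(), key=lambda kv: kv[1])
    if top / total >= dominance:
        return ("migrate", node)               # one node dominates: move the page
    return ("replicate", None)                 # several nodes share: copy it


quiet = cmr_decision({1: 10})
one_user = cmr_decision({1: 100, 2: 3})
shared = cmr_decision({1: 50, 2: 50})
```

Here a lightly used page is left alone, a page hammered by node 1 is migrated to node 1, and a page missed on equally by two nodes is replicated. Cache-block coherence still has to be maintained underneath whichever choice is made.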

Synchronization

- Why synchronize? Need to know when it is safe for different processes to use shared data
- For large-scale MPs, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization
- Study textbook for details
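
A minimal software-level illustration of "when it is safe to use shared data", using Python threads and a lock. It shows mutual exclusion only; the hardware primitives and scalable techniques the slide alludes to are left to the textbook. The thread count and iteration count are arbitrary.

```python
import threading

counter = 0
lock = threading.Lock()


def worker(n):
    """Increment the shared counter n times, one thread at a time."""
    global counter
    for _ in range(n):
        with lock:             # only the lock holder may touch `counter`
            counter += 1


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock, the final count is always 4 x 10,000. The same lock is also a serialization point: every thread queues behind it, which is the contention problem that grows with processor count.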

Fallacy: Amdahl's Law doesn't apply to parallel computers

- Since some part is sequential, can't go 100X?
- 1987 claim to break it, since 1000X speedup
  - Instead of using a fixed data set, scale the data set with # of processors!
  - Linear speedup with 1000 processors
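
The difference between the two views can be checked numerically. The sketch below compares Amdahl's fixed-size speedup with a scaled-size speedup, where the parallel part of the work grows with the processor count. The scaled formula is the one usually associated with Gustafson's scaled-speedup argument; the slide does not name it, so that attribution and the 99% parallel fraction are assumptions of this example.

```python
def amdahl_speedup(parallel_fraction, procs):
    """Fixed problem size: the sequential part limits the speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / procs)


def scaled_speedup(parallel_fraction, procs):
    """Problem size grows with procs; the fraction is measured on the parallel run."""
    return (1.0 - parallel_fraction) + parallel_fraction * procs


fixed = amdahl_speedup(0.99, 1000)     # about 91x
scaled = scaled_speedup(0.99, 1000)    # about 990x
```

With 1% sequential work, a fixed data set tops out near 91x on 1000 processors, while the scaled data set reports nearly linear speedup. Amdahl's Law is not violated; the two numbers answer different questions, because the scaled run is solving a much larger problem.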

Multiprocessor Future

- What MPs have been proved for: multiprogrammed workloads, commercial workloads (e.g. OLTP and DSS), scientific applications in some domains
- Supercomputing 2004: High-performance computing is growing?!
  - Cluster systems are unexpectedly powerful and inexpensive
  - Optical networking is being deployed
  - Grid software is under intensive research
  - Claims: students should learn parallel programming from high school, and undergraduates should be required to learn it!
- Multiprocessor advances
  - CMP or chip-level multiprocessing, e.g. IBM Power5 (with SMT)
  - MPs no longer dominate the TOP 500, but stay as the building blocks for clusters