Architecture and Compilers PART II
HPC Fall 2010, Prof. Robert van Engelen
Overview
- The PMS model
- Shared memory multiprocessors: basic shared memory systems; SMP, multicore, and COMA
- Distributed memory multicomputers: MPP systems; network topologies for message-passing multicomputers; distributed shared memory
- Pipeline and vector processors
- Comparison
- Taxonomies
PMS Architecture Model
- Processor (P): a device that performs operations on data
- Memory (M): a device that stores data
- Switch (S): a device that facilitates the transfer of data between devices
- Arcs denote connectivity

[Figure: a simple PMS model of a computer system with CPU and peripherals]
Shared Memory Multiprocessor
- Processors access shared memory via a common switch, e.g. a bus
- Problem: a single bus becomes a bottleneck
- Shared memory has a single address space
- The architecture is sometimes referred to as a “dance hall”
Shared Memory: the Bus Contention Problem
- Each processor competes for access to shared memory, both to fetch instructions and to load and store data
- Bus contention: access to memory is restricted to one processor at a time, which limits the speedup and scalability with respect to the number of processors
- Assume that each instruction requires 0 < m < 1 memory operations (the load-store ratio per instruction), that F instructions are performed per unit of time, and that at most W words can be moved over the bus per unit of time; then

      SP < W / (m F)

  regardless of the number of processors P
- In other words, parallel efficiency is limited unless P < W / (m F)
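As a concrete check of the bound, here is a minimal C sketch; the values of m, F, and W are purely illustrative assumptions, not from the slides:

    #include <stdio.h>

    int main(void) {
        /* Illustrative values (assumptions, not from the slides): */
        double m = 0.4;   /* memory operations per instruction, 0 < m < 1 */
        double F = 1e9;   /* instructions executed per unit of time       */
        double W = 1e9;   /* words the bus can move per unit of time      */

        /* Speedup bound SP < W / (m * F), independent of P:
           here 1e9 / (0.4 * 1e9) = 2.5, so adding more than two or
           three processors cannot pay off. */
        printf("speedup bound: SP < %g\n", W / (m * F));
        return 0;
    }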
Shared Memory: Work-Memory Ratio
- Work-memory ratio (FP:M ratio): the ratio of the number of floating-point operations to the number of distinct memory locations referenced in the innermost loop
- The same location is counted just once in the innermost loop
- Assumes effective use of registers (and cache) in the innermost loop for reuse
- Assumes no reuse across outer loops (register/cache use is saturated in the inner loop)
- Note that FP:M = 1/m - 1, so efficient utilization of a shared memory multiprocessor requires P < (FP:M + 1) × W / F

Example: 2 distinct memory locations and 1000 floating-point additions (one per iteration), so FP:M = 500

    for (i=0; i<1000; i++) x = x + i;

Example: 2N+1 distinct memory locations and 2N floating-point operations, so FP:M ≈ 1 when N is large

    for (i=0; i<N; i++) x = x + a[i]*b[i];
Shared Memory Multiprocessor with Local Cache
- Add a local cache to improve performance when W / F is small; with today’s systems we have W / F << 1
- Problem: how do we ensure cache coherence?
Shared Memory: Cache Coherence
- A cache coherence protocol ensures that processors obtain newly altered data when shared data is modified by another processor
- Because caches operate on cache lines, more data than the shared object alone can be affected, which may lead to false sharing

[Figure: thread 1 modifies shared data, thread 0 reads the modified shared data; the cache coherence protocol ensures that thread 0 obtains the newly altered data]
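To make false sharing concrete, here is a minimal C/pthreads sketch (an illustration, not from the slides; the 64-byte cache line size and the iteration count are assumptions). The two counters share one cache line, so each thread’s stores keep invalidating the other thread’s cached copy even though the threads never touch the same variable; the padded variant gives each counter its own line:

    #include <pthread.h>
    #include <stdio.h>

    /* Both counters live in the same cache line, so updates by
       different threads cause false sharing: coherence operates on
       whole cache lines, not individual variables. */
    struct counters { volatile long a, b; };

    /* Padded variant (assuming a 64-byte cache line): each counter
       occupies its own line, eliminating the coherence traffic. */
    struct padded { volatile long a; char pad[56]; volatile long b; };

    static struct counters c;

    static void *bump_a(void *arg) {
        for (long i = 0; i < 100000000L; i++) c.a++;
        return NULL;
    }

    static void *bump_b(void *arg) {
        for (long i = 0; i < 100000000L; i++) c.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, bump_a, NULL);
        pthread_create(&t1, NULL, bump_b, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("a=%ld b=%ld\n", c.a, c.b);
        return 0;
    }

Timing this program against a version that uses struct padded typically shows a large slowdown for the shared-line version, which is the cost of false sharing.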
COMA
- Cache-only memory architecture (COMA): a large cache per processor replaces shared memory
- A data item is either in one cache (non-shared) or in multiple caches (shared)
- The switch includes an engine that provides a single global address space and ensures cache coherence
Distributed Memory Multicomputer
- Massively parallel processor (MPP) systems with P > 1000
- Communication via message passing
- Nonuniform memory access (NUMA)
- Network topologies: mesh, hypercube, cross-bar switch
Computation-Communication Ratio
- The computation-communication ratio tcomp / tcomm is usually assessed analytically and/or measured empirically
- High communication overhead decreases speedup, so the ratio should be as high as possible
- For example: with data size n and number of processors P, consider the ratio tcomp / tcomm = 1000n / (10n^2)
[Figure: tcomp / tcomm = 1000n / (10n^2) and the resulting speedup SP, plotted against n = 10..100 for P = 1, 2, 4, 8]
SP = ts / tP = 1000n / (1000n/P + 10n^2)
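A minimal C sketch (illustrative) that evaluates the slide’s speedup model for a few values of n and P, reproducing the behavior in the plot:

    #include <stdio.h>

    /* Speedup model from the slide: ts = 1000n, tP = 1000n/P + 10n^2 */
    double speedup(double n, double P) {
        return 1000.0 * n / (1000.0 * n / P + 10.0 * n * n);
    }

    int main(void) {
        int P[] = { 1, 2, 4, 8 };
        for (int i = 0; i < 4; i++)
            for (double n = 10; n <= 100; n += 30)
                printf("P=%d n=%3.0f SP=%.2f\n", P[i], n, speedup(n, P[i]));
        return 0;
    }

Note how the 10n^2 communication term dominates for large n, so the speedup collapses toward zero regardless of P.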
Mesh Topology
- A network of P nodes has mesh size √P × √P
- Diameter is 2 × (√P - 1)
- A torus network wraps around the ends; diameter is √P - 1
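As an illustration (assuming row-major node numbering, which the slides do not specify), the hop count between two mesh nodes is the Manhattan distance between their coordinates:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hops between nodes u and v in a k x k mesh with row-major
       numbering: the Manhattan distance of their (row, column)
       coordinates. */
    int mesh_hops(int u, int v, int k) {
        return abs(u / k - v / k) + abs(u % k - v % k);
    }

    int main(void) {
        int k = 4;   /* 4 x 4 mesh, P = 16 */
        /* Opposite corners, node 0 and node 15: 2*(k-1) = 6 hops,
           which matches the mesh diameter 2*(sqrt(P) - 1). */
        printf("%d hops\n", mesh_hops(0, k * k - 1, k));
        return 0;
    }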
Hypercube Topology
- A d-dimensional hypercube has P = 2^d nodes
- Diameter is d = log2 P
- Node addressing is simple: the node number of a nearest neighbor differs in exactly one bit
- A routing algorithm flips bits to determine the possible paths; e.g. routing from node 001 to 111 has two shortest paths: 001 → 011 → 111 and 001 → 101 → 111

[Figure: hypercubes for d = 2, 3, 4]
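A minimal C sketch of the bit-flip routing idea (the fixed least-significant-first bit order and the decimal node printing are illustrative choices; flipping the differing bits in any order gives a shortest path):

    #include <stdio.h>

    /* Print one shortest path from src to dst in a d-cube: XOR gives
       the bits that differ; flip them one at a time, each flip moving
       to a nearest-neighbor node. */
    void route(unsigned src, unsigned dst, int d) {
        unsigned node = src;
        printf("%u", node);
        for (int bit = 0; bit < d; bit++) {
            unsigned mask = 1u << bit;
            if ((src ^ dst) & mask) {   /* this address bit differs */
                node ^= mask;           /* hop to the neighbor       */
                printf(" -> %u", node);
            }
        }
        printf("\n");
    }

    int main(void) {
        route(1, 7, 3);   /* 001 -> 011 -> 111, one of two shortest paths */
        return 0;
    }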
Cross-bar Switches
- Processors and memories are connected by a set of switches
- Enables simultaneous (contention-free) communication between processor i and memory σ(i), where σ is an arbitrary permutation of 1…P

[Figure: example with σ(1)=2, σ(2)=1, σ(3)=3]
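A small illustrative check (not from the slides) that a set of simultaneous requests is contention-free on a crossbar, i.e. that the requested mapping σ is a permutation:

    #include <stdio.h>

    /* A crossbar serves all P requests at once iff no memory module
       is requested twice, i.e. sigma is a permutation of 0..P-1. */
    int contention_free(const int sigma[], int P) {
        int used[64] = { 0 };                /* assumes P <= 64 */
        for (int i = 0; i < P; i++)
            if (used[sigma[i]]++) return 0;  /* module requested twice */
        return 1;
    }

    int main(void) {
        int sigma[] = { 1, 0, 2 };   /* the slide's example, 0-based */
        printf("%s\n", contention_free(sigma, 3)
                       ? "contention-free" : "conflict");
        return 0;
    }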
Multistage Interconnect Network
- Each switch has an upper output (0) and a lower output (1)
- A message travels through each switch based on the destination address: each bit in the destination address controls one switch on the way from source to destination
- For example, routing from 001 to 100: the first switch selects the lower output (1), the second switch selects the upper output (0), and the third switch selects the upper output (0)
- Contention can occur when two messages are routed through the same switch

[Figures: a 4×4 two-stage interconnect and an 8×8 three-stage interconnect]
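A minimal C sketch of destination-address routing (assuming the most significant bit controls the first stage, which matches the 001 → 100 example above):

    #include <stdio.h>

    /* Route a message through 'stages' switch stages: at each stage,
       one destination-address bit selects the upper (0) or lower (1)
       switch output, most significant bit first. */
    void route(unsigned dst, int stages) {
        for (int s = 0; s < stages; s++) {
            int bit = (dst >> (stages - 1 - s)) & 1;
            printf("stage %d: %s output (%d)\n",
                   s + 1, bit ? "lower" : "upper", bit);
        }
    }

    int main(void) {
        route(4, 3);   /* destination 100: lower, upper, upper */
        return 0;
    }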
Distributed Shared Memory
- Distributed shared memory (DSM) systems use physically distributed memory modules and a global address space that gives the illusion of shared virtual memory
- Hardware automatically translates each memory address into a local address or a remote memory address (accessed via message passing)
- Software approaches add a programming layer to simplify access to shared objects (hiding the communication)
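As a sketch of the hardware translation step, here is one hypothetical address layout (the choice of putting the home-node id in the high-order bits is an assumption for illustration):

    #include <stdio.h>

    #define NODE_BITS 4   /* hypothetical: 2^4 = 16 nodes */

    /* Split a 32-bit global DSM address into (home node, local offset). */
    void translate(unsigned addr, unsigned *node, unsigned *offset) {
        *node   = addr >> (32 - NODE_BITS);
        *offset = addr & ((1u << (32 - NODE_BITS)) - 1);
    }

    int main(void) {
        unsigned node, offset;
        translate(0x30001234u, &node, &offset);
        /* If node differs from the local node id, the access becomes
           a message to the remote node; otherwise it is a local load. */
        printf("node %u, offset 0x%x\n", node, offset);
        return 0;
    }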
Pipeline and Vector Processors
- Vector processors run operations on multiple data elements simultaneously
- A vector processor has a maximum vector length, e.g. 512
- Strip mining the loop produces an outer loop with stride 512, enabling vectorization of longer vector operations
- Pipelined vector architectures dispatch multiple vector operations per clock cycle
- Vector chaining allows the result of a previous vector operation to be fed directly into the next operation in the pipeline

Original loop:

    DO i = 0,9999
      z(i) = x(i) + y(i)
    ENDDO

Strip-mined loop (vector length 512; 19 full strips cover elements 0..9727, a remainder loop handles the last 272):

    DO j = 0,9216,512
      DO i = 0,511
        z(j+i) = x(j+i) + y(j+i)
      ENDDO
    ENDDO
    DO i = 0,271
      z(9728+i) = x(9728+i) + y(9728+i)
    ENDDO

Equivalent vector notation:

    DO j = 0,9216,512
      z(j:j+511) = x(j:j+511) + y(j:j+511)
    ENDDO
    z(9728:9999) = x(9728:9999) + y(9728:9999)
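The same strip-mining transformation written as a C sketch (the function name vadd and the fixed vector length VL = 512 are illustrative assumptions):

    /* Strip-mined form of: for (i = 0; i < n; i++) z[i] = x[i] + y[i];
       The inner loop has a fixed trip count of VL, matching the
       (assumed) maximum vector length, so it vectorizes directly. */
    #define VL 512

    void vadd(float *z, const float *x, const float *y, int n) {
        int j, i;
        for (j = 0; j + VL <= n; j += VL)   /* full strips */
            for (i = 0; i < VL; i++)
                z[j + i] = x[j + i] + y[j + i];
        for (i = j; i < n; i++)             /* remainder strip */
            z[i] = x[i] + y[i];
    }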
Comparison: Bandwidth, Latency and Capacity

[Table: bandwidth, latency, and capacity compared across the architectures]
Flynn’s Taxonomy
- Single instruction stream, single data stream (SISD): the traditional PC system
- Single instruction stream, multiple data stream (SIMD): similar to the MMX/SSE/AltiVec multimedia instruction sets; MASPAR
- Multiple instruction stream, multiple data stream (MIMD): single program, multiple data (SPMD) programming, in which each processor executes a copy of the program

                            Data stream
                            single    multiple
    Instruction   single    SISD      SIMD
    stream        multiple            MIMD
Task Parallelism versus Data Parallelism
- Task parallelism, MIMD: fork-join model with thread-level parallelism and shared memory; message-passing model with (distributed-processing) processes
- Data parallelism, SIMD: multiple processors (or units) operate on a segmented data set; SIMD model with vector and pipeline machines; SIMD-like multimedia extensions, e.g. MMX/SSE/AltiVec

Vector operation X[0:3] ⊕ Y[0:3] with an SSE instruction on the Pentium 4:

    src1:  X3       X2       X1       X0
    src2:  Y3       Y2       Y1       Y0
    dest:  X3⊕Y3    X2⊕Y2    X1⊕Y1    X0⊕Y0
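As a concrete sketch of the same four-wide operation using standard SSE intrinsics (the code itself is illustrative, not from the slides):

    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void) {
        float x[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        float y[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
        float z[4];

        /* One SSE instruction adds all four lanes at once: z = x + y */
        __m128 vx = _mm_loadu_ps(x);
        __m128 vy = _mm_loadu_ps(y);
        _mm_storeu_ps(z, _mm_add_ps(vx, vy));

        printf("%g %g %g %g\n", z[0], z[1], z[2], z[3]);
        return 0;
    }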
Further Reading
- [PP2] pages 13-26
- [SPC] pages 71-95
- [HPC] pages 25-28
- Optional: [SRC] pages 15-42