CS4 Parallel Architectures - Introduction
Transcript of CS4 Parallel Architectures - Introduction
CS4/MSc Parallel Architectures - 2009-2010
Instructor: Marcelo Cintra ([email protected] – 1.03 IF)
Lectures: Tue and Fri in G0.9 WRB at 10am
Pre-requisites: CS3 Computer Architecture
Practicals:
– Practical 1 – out week 3 (26/1/10); due week 5 (09/2/10)
– Practical 2 – out week 5 (09/2/10); due week 7 (23/2/10)
– Practical 3 – out week 7 (23/2/10); due week 9 (09/3/10)
– (MSc only) Practical 4 – out week 7 (26/2/10); due week 9 (12/3/10)
Books:
– (**) Culler & Singh - Parallel Computer Architecture: A Hardware/Software Approach – Morgan Kaufmann
– (*) Hennessy & Patterson - Computer Architecture: A Quantitative Approach – Morgan Kaufmann – 3rd or 4th editions
Lecture slides (no lecture notes)
More info: www.inf.ed.ac.uk/teaching/courses/pa/
Topics

Fundamental concepts
– Performance issues
– Parallelism in software
Uniprocessor parallelism
– Pipelining, superscalar, and VLIW processors
– Vector, SIMD processors
Interconnection networks
– Routing, static and dynamic networks
– Combining networks
Multiprocessors, Multicomputers, and Multithreading
– Shared memory and message passing systems
– Cache coherence and memory consistency
Performance and scalability
Lect. 1: Performance Issues

Why parallel architectures?
– Performance of sequential architectures is limited (by technology and ultimately by the laws of physics)
– Relentless increase in computing resources (transistors for logic and memory) that can no longer be exploited for sequential processing
– At any point in time many important applications cannot be solved with the best existing sequential architecture

Uses of parallel architectures
– To solve a single problem faster (e.g., simulating protein folding: researchweb.watson.ibm.com/bleugene)
– To solve a larger version of a problem (e.g., weather forecast: www.jamstec.go.jp/esc)
– To solve many problems at the same time (e.g., transaction processing)
Limits to Sequential Execution

Speed of light limit
– Computation/data flows through logic gates, memory devices, and wires
– All of these have a non-zero delay, at a minimum equal to the speed-of-light delay
– Thus, the speed of light and the minimum physical feature sizes impose a hard limit on the speed of any sequential computation

Von Neumann’s limit
– Programs consist of an ordered sequence of instructions
– Instructions are stored in memory and must be fetched in order (same for data)
– Thus, sequential computation is ultimately limited by the memory bandwidth
Examples of Parallel Architectures
– An ARM processor in a common mobile phone has 10s of instructions in-flight in its pipeline
– The Pentium IV executes up to 6 microinstructions per cycle and has up to 126 microinstructions in-flight
– Intel’s quad-core chips have four processors and are now in mainstream desktops and laptops
– Japan’s Earth Simulator has 5120 vector processors, each with 8 vector pipelines
– IBM’s largest BlueGene supercomputer has 131,072 processors
– Google has about 100,000 Linux machines connected in several cluster farms
Comparing Execution Times

Example:
  system A: TA = execution time of program P on A
  system B: TB = execution time of program P’ on B

Speedup: S = TB / TA; we say: A is S times faster, or A is (TB/TA x 100 - 100)% faster

Notes:
– For fairness, P and P’ must be the “best possible implementation” on each system
– If multiple programs are run then report the weighted arithmetic mean
– Must report all details, such as: input set, compiler flags, command line arguments, etc.
Amdahl’s Law

Let:
  F = fraction of the problem that can be optimized
  Sopt = speedup obtained on the optimized fraction

  Soverall = 1 / ((1 - F) + F/Sopt)

e.g.: F = 0.5 (50%), Sopt = 10:
  Soverall = 1 / ((1 - 0.5) + 0.5/10) = 1.8

with Sopt = ∞:
  Soverall = 1 / ((1 - 0.5) + 0) = 2

Bottom-line: performance improvements must be balanced
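The formula and both worked examples above can be checked directly:

```python
def amdahl(f, s_opt):
    """Overall speedup when a fraction f of the work is sped up by s_opt."""
    return 1.0 / ((1.0 - f) + f / s_opt)

print(round(amdahl(0.5, 10), 2))   # 1.82 (the slide rounds to 1.8)
print(amdahl(0.5, float('inf')))   # 2.0 (optimized fraction takes zero time)
```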
Amdahl’s Law and Efficiency

Let:
  F = fraction of the problem that can be parallelized
  Spar = speedup obtained on the parallelized fraction
  P = number of processors

  Soverall = 1 / ((1 - F) + F/Spar)
  E = Soverall / P

e.g.: 16 processors (Spar = 16), F = 0.9 (90%):
  Soverall = 1 / ((1 - 0.9) + 0.9/16) = 6.4
  E = 6.4 / 16 = 0.4 (40%)

Bottom-line: for good scalability E > 50%; when resources are “free” then lower efficiencies are acceptable
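The efficiency example above works out exactly:

```python
def amdahl(f, s_par):
    """Overall speedup with a fraction f parallelized at speedup s_par."""
    return 1.0 / ((1.0 - f) + f / s_par)

def efficiency(s_overall, p):
    """Parallel efficiency: achieved speedup divided by processor count."""
    return s_overall / p

s = amdahl(0.9, 16)
print(s)                  # 6.4
print(efficiency(s, 16))  # 0.4, i.e., 40%
```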
Performance Trends: Computer Families
Bottom-line: microprocessors have become the building blocks of most computer systems across the whole range of price-performance
[Figure (Culler and Singh, Fig. 1.1): performance (log scale) vs. year for different computer families, including mainframes and minicomputers]
Technological Trends: Moore’s Law
Bottom-line: overwhelming number of transistors allow for incredibly complex and highly integrated systems
[Figure: transistors (x1000, log scale) vs. year (1970–2010) for Intel CPUs: 4004, 8086, 80286, 80486, Pentium, Pentium II, Pentium III, Pentium IV, Core Duo, Xeon Multi-Core]
Tracking Technology: The role of CA
Bottom-line: architectural innovations complement technological improvements
[Figure (H&P Fig. 1.1): SPECint rating vs. year (1984–2000), improving at 1.35x/year early on and 1.58x/year later; machines include MIPS R2000, IBM Power1, HP 9000, DEC Alpha, and Intel Pentium III]
The Memory Gap
Bottom-line: memory access is increasingly expensive and CA must devise new ways of hiding this cost
[Figure (H&P Fig. 5.2): performance (log scale) of CPU vs. memory, 1980–2005, showing the widening gap]
Software Trends

Ever larger applications: memory requirements double every year
More powerful compilers and an increasing role of compilers in performance
Novel applications with different demands:
  e.g., multimedia
  – Streaming data
  – Simple fixed operations on regular and small data → MMX-like instructions
  e.g., web-based services
  – Huge data sets with little locality of access
  – Simple data lookups and processing → Transactional Memory(?) (www.cs.wisc.edu/trans-memory)

Bottom-line: architecture/compiler co-design
Current Trends in CA

Very complex processor designs:
– Hybrid branch prediction (MIPS R14000)
– Out-of-order execution (Pentium IV)
– Multi-banked on-chip caches (Alpha 21364)
– EPIC (Explicitly Parallel Instruction Computing) (Intel Itanium)

Parallelism and integration at the chip level:
– Chip-multiprocessors (CMP) (Sun T2, IBM Power6, Intel Itanium 2)
– Multithreading (Intel Hyperthreading, IBM Power6, Sun T2)
– Embedded Systems On a Chip (SOC)

Multiprocessors:
– Servers (Sun Fire, SGI Origin)
– Supercomputers (IBM BlueGene, SGI Origin, IBM HPCx)
– Clusters of workstations (Google server farm)

Power-conscious designs
Lect. 2: Types of Parallelism

Parallelism in Hardware (Uniprocessor)
– Parallel arithmetic
– Pipelining
– Superscalar, VLIW, SIMD, and vector execution

Parallelism in Hardware (Multiprocessor)
– Chip-multiprocessors a.k.a. multi-cores
– Shared-memory multiprocessors
– Distributed-memory multiprocessors
– Multicomputers a.k.a. clusters

Parallelism in Software
– Tasks
– Data parallelism
– Data streams

(note: a “processor” must be capable of independent control and of operating on non-trivial data types)
Taxonomy of Parallel Computers
According to instruction and data streams (Flynn):
– Single instruction, single data (SISD): this is the standard uniprocessor
– Single instruction, multiple data streams (SIMD): the same instruction is executed by all processors on different data; e.g., graphics processing
– Multiple instruction, single data streams (MISD): different instructions on the same data; never used in practice
– Multiple instruction, multiple data streams (MIMD): the “common” multiprocessor
  Each processor uses its own data and executes its own program (or part of the program)
  Most flexible approach
  Easier/cheaper to build by putting together “off-the-shelf” processors
Taxonomy of Parallel Computers
According to physical organization of processors and memory:
– Physically centralized memory, uniform memory access (UMA)
  All memory is allocated at the same distance from all processors
  Also called symmetric multiprocessors (SMP)
  Memory bandwidth is fixed and must accommodate all processors → does not scale to a large number of processors
  Used in most CMPs today (e.g., IBM Power5, Intel Core Duo)

[Diagram: several CPUs, each with its own cache, connected through an interconnection to a single main memory]
Taxonomy of Parallel Computers
According to physical organization of processors and memory:
– Physically distributed memory, non-uniform memory access (NUMA)
  A portion of memory is allocated with each processor (node)
  Accessing local memory is much faster than remote memory
  If most accesses are to local memory, then overall memory bandwidth increases linearly with the number of processors

[Diagram: nodes, each with a CPU, cache, and local memory, connected through an interconnection]
Taxonomy of Parallel Computers
According to memory communication model:
– Shared address or shared memory
  Processes in different processors can use the same virtual address space
  Any processor can directly access memory in another processor’s node
  Communication is done through shared memory variables
  Explicit synchronization with locks and critical sections
  Arguably easier to program
– Distributed address or message passing
  Processes in different processors use different virtual address spaces
  Each processor can only directly access memory in its own node
  Communication is done through explicit messages
  Synchronization is implicit in the messages
  Arguably harder to program
  Some standard message passing libraries exist (e.g., MPI)
Shared Memory vs. Message Passing

Shared memory:

  Producer (p1)           Consumer (p2)
  flag = 0;               flag = 0;
  ...                     ...
  a = 10;                 while (!flag) {}
  flag = 1;               x = a * y;

Message passing:

  Producer (p1)           Consumer (p2)
  ...                     ...
  a = 10;                 receive(p1, b, label);
  send(p2, a, label);     x = b * y;
Types of Parallelism in Applications

Instruction-level parallelism (ILP)
– Multiple instructions from the same instruction stream can be executed concurrently
– Generated and managed by hardware (superscalar) or by the compiler (VLIW)
– Limited in practice by data and control dependences

Thread-level or task-level parallelism (TLP)
– Multiple threads or instruction sequences from the same application can be executed concurrently
– Generated by the compiler/user and managed by the compiler and hardware
– Limited in practice by communication/synchronization overheads and by algorithm characteristics
Types of Parallelism in Applications
Data-level parallelism (DLP)
– Instructions from a single stream operate concurrently (temporally or spatially) on several data
– Limited by non-regular data manipulation patterns and by memory bandwidth

Transaction-level parallelism
– Multiple threads/processes from different transactions can be executed concurrently
– Sometimes not really considered as parallelism
– Limited by access to metadata and by interconnection bandwidth
Example: Equation Solver Kernel
The problem:
– Operate on a (n+2)x(n+2) matrix
– Points on the rim have fixed values
– Inner points are updated as:

  A[i,j] = 0.2 x (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

– Updates are in-place, so top and left are new values and bottom and right are old ones
– Updates occur over multiple sweeps
– Keep the difference between old and new values and stop when the difference for all points is small enough
Example: Equation Solver Kernel
Dependences:
– Computing the new value of a given point requires the new value of the point directly above and of the point to the left
– By transitivity, it requires all points in the sub-matrix in the upper-left corner
– Points along the top-right to bottom-left diagonals can be computed independently
Example: Equation Solver Kernel
ILP version (from sequential code):
– Machine instructions from each j iteration can occur in parallel
– Branch prediction allows overlap of multiple iterations of the j loop
– Some of the instructions from multiple j iterations can occur in parallel

  while (!done) {
    diff = 0;
    for (i=1; i<=n; i++) {
      for (j=1; j<=n; j++) {
        temp = A[i,j];
        A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
        diff += abs(A[i,j] - temp);
      }
    }
    if (diff/(n*n) < TOL) done = 1;
  }
Example: Equation Solver Kernel
TLP version (shared-memory):

  int mymin = 1 + (pid * n/P);
  int mymax = mymin + n/P - 1;

  while (!done) {
    diff = 0; mydiff = 0;
    for (i=mymin; i<=mymax; i++) {
      for (j=1; j<=n; j++) {
        temp = A[i,j];
        A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
        mydiff += abs(A[i,j] - temp);
      }
    }
    lock(diff_lock);
    diff += mydiff;
    unlock(diff_lock);
    barrier(bar, P);
    if (diff/(n*n) < TOL) done = 1;
    barrier(bar, P);
  }
Example: Equation Solver Kernel
TLP version (shared-memory) (for 2 processors):
– Each processor gets a chunk of rows
  E.g., with n=4 and P=2: processor 0 gets mymin=1 and mymax=2, and processor 1 gets mymin=3 and mymax=4

  int mymin = 1 + (pid * n/P);
  int mymax = mymin + n/P - 1;

  while (!done) {
    diff = 0; mydiff = 0;
    for (i=mymin; i<=mymax; i++) {
      for (j=1; j<=n; j++) {
        temp = A[i,j];
        A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
        mydiff += abs(A[i,j] - temp);
      }
    ...
Example: Equation Solver Kernel
TLP version (shared-memory):
– All processors can freely access the same data structure A
– Access to diff, however, must be taken in turns
– All processors update their own done variable together

  ...
    for (i=mymin; i<=mymax; i++) {
      for (j=1; j<=n; j++) {
        temp = A[i,j];
        A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
        mydiff += abs(A[i,j] - temp);
      }
    }
    lock(diff_lock);
    diff += mydiff;
    unlock(diff_lock);
    barrier(bar, P);
    if (diff/(n*n) < TOL) done = 1;
    barrier(bar, P);
  }
Types of Speedups and Scaling
Scalability: adding x times more resources to the machine yields close to x times better “performance”
– Usually the resources are processors, but they can also be memory size or interconnect bandwidth
– Usually means that with x times more processors we can get ~x times speedup for the same problem
– In other words: how does efficiency (see Lecture 1) hold as the number of processors increases?

In reality we have different scalability models:
– Problem constrained
– Time constrained
– Memory constrained

The most appropriate scalability model depends on the user’s interests
Types of Speedups and Scaling
Problem constrained (PC) scaling:
– Problem size is kept fixed
– Wall-clock execution time reduction is the goal
– Number of processors and memory size are increased
– “Speedup” is then defined as:

  SPC = Time(1 processor) / Time(p processors)

– Example: CAD tools that take days to run, weather simulation that does not complete in reasonable time
Types of Speedups and Scaling
Time constrained (TC) scaling:
– Maximum allowable execution time is kept fixed
– Problem size increase is the goal
– Number of processors and memory size are increased
– “Speedup” is then defined as:

  STC = Work(p processors) / Work(1 processor)

– Example: weather simulation with a refined grid
Types of Speedups and Scaling
Memory constrained (MC) scaling:
– Both problem size and execution time are allowed to increase
– The goal is to increase the problem size with the available memory with the smallest increase in execution time
– Number of processors and memory size are increased
– “Speedup” is then defined as:

  SMC = (Work(p processors) / Time(p processors)) x (Time(1 processor) / Work(1 processor)) = Increase in Work / Increase in Time

– Example: astrophysics simulation with more planets and stars
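The three scaling definitions above can be sketched as follows (the work and time figures in the example are illustrative only):

```python
def s_pc(time_1, time_p):
    """Problem-constrained: fixed problem, reduced wall-clock time."""
    return time_1 / time_p

def s_tc(work_p, work_1):
    """Time-constrained: fixed time budget, larger problem."""
    return work_p / work_1

def s_mc(work_p, time_p, work_1, time_1):
    """Memory-constrained: increase in work over increase in time."""
    return (work_p / time_p) * (time_1 / work_1)

# E.g., 8 processors do 8x the work in 1.2x the time:
print(s_mc(800.0, 120.0, 100.0, 100.0))  # ~6.67
```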
Lect. 3: Superscalar Processors I/II
Pipelining: several instructions are simultaneously at different stages of their execution
Superscalar: several instructions are simultaneously at the same stage of their execution
(Superpipelining: a very deep pipeline with very short stages to increase the amount of parallelism)
Out-of-order execution: instructions can be executed in an order different from that specified in the program
Dependences between instructions:
– Read after Write (RAW) (a.k.a. data dependence)
– Write after Read (WAR) (a.k.a. anti dependence)
– Write after Write (WAW) (a.k.a. output dependence)
– Control dependence
Speculative execution: tentative execution despite dependences
A 5-stage Pipeline
[Diagram: IF → ID → EXE → MEM → WB, with the general purpose registers read in ID and written in WB, and memory accessed in IF and MEM]

IF = instruction fetch (includes PC increment)
ID = instruction decode + fetching values from general purpose registers
EXE = arithmetic/logic operations or address computation
MEM = memory access or branch completion
WB = write back results to general purpose registers
A Pipelining Diagram

Start one instruction per clock cycle:

  cycle:  1    2    3    4    5    6
  IF:     I1   I2   I3   I4   I5   I6
  ID:          I1   I2   I3   I4   I5
  EXE:              I1   I2   I3   I4
  MEM:                   I1   I2   I3
  WB:                         I1   I2

Each instruction still takes 5 cycles, but instructions now complete every cycle: CPI → 1
Multiple-issue

Start two instructions per clock cycle:

  cycle:  1      2      3      4      5       6
  IF:     I1,I2  I3,I4  I5,I6  I7,I8  I9,I10  I11,I12
  ID:            I1,I2  I3,I4  I5,I6  I7,I8   I9,I10
  EXE:                  I1,I2  I3,I4  I5,I6   I7,I8
  MEM:                         I1,I2  I3,I4   I5,I6
  WB:                                 I1,I2   I3,I4

CPI → 0.5; IPC → 2
A Pipelined Processor (DLX)
[Figure (H&P Fig. A.18): the DLX pipelined processor datapath]
Advanced Superscalar Execution
Ideally: in an n-issue superscalar, n instructions are fetched, decoded, executed, and committed per cycle

In practice:
– Control flow changes spoil the fetch flow
– Data, control, and structural hazards spoil the issue flow
– Multi-cycle arithmetic operations spoil the execute flow

Buffers at issue (issue window or issue queue) and at commit (reorder buffer) decouple these stages from the rest of the pipeline and somewhat regularize breaks in the flow

[Diagram: the 5-stage pipeline with IF replaced by a fetch engine, and instruction buffers between the fetch engine and ID and before WB]
Problems At Instruction Fetch
Crossing instruction cache line boundaries
– e.g., 32-bit instructions and 32-byte instruction cache lines → 8 instructions per cache line; 4-wide superscalar processor
– More than one cache lookup is required in the same cycle
– What if one of the line accesses is a cache miss?
– Words from different lines must be ordered and packed into the instruction queue

Case 1: all instructions located in the same cache line and no branch
Case 2: instructions spread over more lines and no branch
Problems At Instruction Fetch
Control flow
– e.g., 32-bit instructions and 32-byte instruction cache lines → 8 instructions per cache line; 4-wide superscalar processor
– Branch prediction is required within the instruction fetch stage
– For wider-issue processors multiple predictions are likely required
– In practice most fetch units only fetch up to the first predicted taken branch

Case 1: single not-taken branch
Case 2: single taken branch outside the fetch range and into another cache line
Example Frequencies of Control Flow
  benchmark   taken %   avg. BB size   # of inst. between taken branches
  eqntott     86.2      4.20           4.87
  espresso    63.8      4.24           6.65
  xlisp       64.7      4.34           6.70
  gcc         67.6      4.65           6.88
  sc          70.2      4.71           6.71
  compress    60.9      5.39           8.85

Data from Rotenberg et al. for SPEC 92 Int

One branch/jump about every 4 to 6 instructions
One taken branch/jump about every 4 to 9 instructions
Solutions For Instruction Fetch
Advanced fetch engines that can perform multiple cache line lookups
– E.g., interleaved I-caches, where consecutive program lines are stored in different banks that can be accessed in parallel

Very fast, albeit not very accurate, branch predictors (e.g., the next line predictor in the Alpha 21464)
– Note: usually used in conjunction with more accurate but slower predictors (see Lecture 4)

Restructuring instruction storage to keep commonly consecutive instructions together (e.g., the Trace cache in the Pentium 4)
Example Advanced Fetch Unit
[Figure from Rotenberg et al.: a 2-way interleaved I-cache, masks to select instructions from each of the cache lines, a final alignment unit, and control flow prediction units: (i) Branch Target Buffer, (ii) Return Address Stack, (iii) Branch Predictor]
Trace Caches
Traditional I-cache: instructions laid out in program order
Dynamic execution order does not always follow program order (e.g., taken branches), and the dynamic order also changes

Idea:
– Store instructions in execution order (traces)
– Traces can start with any static instruction and are identified by the starting instruction’s PC
– Traces are created dynamically as instructions are fetched normally and branches are resolved
– Traces also contain the outcomes of the implicitly predicted branches
– When the same trace is encountered again (i.e., same starting instruction and same branch predictions) instructions are obtained from the trace cache
– Note that multiple traces can be stored with the same starting instruction
Pros/Cons of Trace Caches
+ Instructions come from a single trace cache line
+ Branches are implicitly predicted
  – The instruction that follows a branch is fixed in the trace and implies the branch’s direction (taken or not taken)
+ The I-cache is still present, so no need to change the cache hierarchy
+ In CISC ISAs (e.g., x86) the trace cache can keep decoded instructions (e.g., Pentium 4)
- Wasted storage, as instructions appear in both the I-cache and the trace cache, and possibly in multiple trace cache lines
- Not very good at handling indirect jumps and returns (which have multiple targets, instead of only taken/not taken) and even unconditional branches
- Not very good when there are traces with common sub-paths
Structure of a Trace Cache
[Figure from Rotenberg et al.: the structure of a trace cache]
Structure of a Trace Cache
Each line contains up to n instructions from up to m basic blocks

Control bits:
– Valid
– Tag
– Branch flags: m-1 bits to specify the directions of the up to m branches
– Branch mask: the number of branches in the trace
– Trace target address and fall-through address: the address of the next instruction to be fetched after the trace is exhausted

Trace cache hit:
– Tag must match
– Branch predictions must match the branch flags for all branches in the trace
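The hit condition above can be sketched as follows (the entry layout and names are illustrative, not the exact hardware format):

```python
def trace_hit(entry, fetch_pc, predictions):
    """Trace cache hit: the tag matches the fetch PC and the current
    branch predictions match the branch flags stored with the trace.
    entry: dict with 'tag' and 'branch_flags' (one direction per branch,
    True = taken); predictions: directions from the branch predictor."""
    flags = entry['branch_flags']
    return entry['tag'] == fetch_pc and predictions[:len(flags)] == flags

entry = {'tag': 0x1000, 'branch_flags': [True, False]}  # taken, not taken
print(trace_hit(entry, 0x1000, [True, False, True]))  # True: tag and flags match
print(trace_hit(entry, 0x1000, [True, True, True]))   # False: second prediction differs
```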
Trace Creation
– Starts on a trace cache miss
– Instructions are fetched up to the first predicted taken branch
– Instructions are collected, possibly from multiple basic blocks (when branches are predicted taken)
– The trace is terminated when either n instructions or m branches have been added
– The trace target/fall-through addresses are computed at the end
Example
I-cache lines contain 8 32-bit instructions; Trace Cache lines contain up to 24 instructions and 3 branches. The processor can issue up to 4 instructions per cycle.

Machine code:
  L1: I1 [ALU] ... I5 [Cond. Br. to L3]
  L2: I6 [ALU] ... I12 [Jump to L4]
  L3: I13 [ALU] ... I18 [ALU]
  L4: I19 [ALU] ... I24 [Cond. Br. to L1]

Basic blocks: B1 (I1-I5), B2 (I6-I12), B3 (I13-I18), B4 (I19-I24)

Layout in I-cache (8 instructions per line):
  ... I1 I2 I3
  I4 I5 I6 I7 I8 I9 I10 I11
  I12 I13 I14 I15 I16 I17 I18 I19
  I20 I21 I22 I23 I24 ...
Example
Common path: B1 (I1-I5) → B3 (I13-I18) → B4 (I19-I24)

Step 1: fetch I1-I3 (stop at end of line) → Trace Cache miss → start trace collection
Step 2: fetch I4-I5 (possible I-cache miss) (stop at predicted taken branch)
Step 3: fetch I13-I16 (possible I-cache miss)
Step 4: fetch I17-I19 (I18 is a predicted not taken branch; stop at end of line)
Step 5: fetch I20-I23 (possible I-cache miss) (stop at predicted taken branch)
Step 6: fetch I24-I27
Step 7: the fetch of I1-I4 is replaced by a Trace Cache access

Layout in Trace Cache (one line):
  I1 I2 I3 I4 I5 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24
References and Further Reading
Original hardware trace cache:
  “Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching”, E. Rotenberg, S. Bennett, and J. Smith, Intl. Symp. on Microarchitecture, December 1996.

Next trace prediction for trace caches:
  “Path-Based Next Trace Prediction”, Q. Jacobson, E. Rotenberg, and J. Smith, Intl. Symp. on Microarchitecture, December 1997.

A software trace cache:
  “Software Trace Cache”, A. Ramirez, J.-L. Larriba-Pey, C. Navarro, J. Torrellas, and M. Valero, Intl. Conf. on Supercomputing, June 1999.
Lect. 4: Superscalar Processors II/II
n-wide instruction width + m-deep pipeline + d-cycle delay to resolve branches:
– Up to n*m instructions in-flight
– Up to n*d instructions must be re-executed on a branch misprediction
– Current processors have 10 to 20 cycles of branch misprediction penalty

Current branch prediction accuracy is around 80%-90% for “difficult” applications and >95% for “easy” applications
Increasing prediction accuracy usually involves increasing the size of the tables
Different predictor types are good at different types of branch behavior
Current processors have multiple branch predictors with different accuracy-delay tradeoffs
Quantifying Prediction Accuracy
Two measures:
– Coverage: the fraction of branches for which the predictor has a prediction (Note: usually coverage is considered to be 100%, and no prediction equals predict not taken)
– Accuracy: the ratio of correctly predicted branches over the total number of branches predicted (Pitfall: higher accuracy is not necessarily better when coverage is lower)

The performance impact is proportional to (1 - accuracy), the penalty, and the number of branches in the application

Two ways of looking at accuracy improvements, e.g., accuracy improves from 95% to 97%:
  (97 - 95)/95 = 0.021 → only a 2% increase in accuracy
  (5 - 3)/5 = 0.4 → a 40% reduction in mispredictions
2-bit Branch Prediction

Branch prediction buffers:
– Match the branch PC during the IF or ID stages

2-bit saturating counter:
– 00: do not take
– 01: do not take
– 10: take
– 11: take

Example buffer contents:
  Branch PC   Outcome
  0x135c8     00
  0x147e0     01
  ...

  ...
  0x135c4: add r1,r2,r3
  0x135c8: bne r1,r0,n
  ...
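The 2-bit saturating counter above can be sketched as a small state machine:

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0/1 (00/01) predict not taken,
    states 2/3 (10/11) predict taken."""
    def __init__(self, state=0):
        self.state = state  # 0..3

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        # Saturate at 0 and 3 instead of wrapping around.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

c = TwoBitCounter()
c.update(True); c.update(True)
print(c.predict())  # True: two taken outcomes moved the counter to state 2
c.update(False)
print(c.predict())  # False: one not-taken outcome moved it back to state 1
```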
(2,2) Correlating Predictor
Organized as a table of 2-bit counters indexed by the sequence of past branch outcomes and by the branch PC: each branch selects four counters (00/01 = do not take, 10/11 = take), one for each combination of the last two branch outcomes (NT/NT, T/NT, NT/T, T/T)

For example: if the four counter values are 00 01 10 01 and the last two branches were, respectively, taken and not taken, then we will predict the branch as not taken (01)

This is an example of a context-based branch predictor
Two Level Branch Predictors
Two types of arrangement/indexing:
– Global: information is not particular to a branch and the table/information is not directly indexed by the branch’s PC
  Good when branches are highly correlated
– Local (a.k.a. per address): information is particular to a branch and the table/information is indexed by the branch’s PC
  Good when branches are individually highly biased
– Partially local: the table/information is indexed by part of the branch’s PC (in order to save bits in the tags for the tables)
– Note: sometimes global information may be indexed by information that was local, and is then somewhat indexed by the branch’s PC
Two Level Branch Predictors
1st level: history of the last n branches
– If global: a single History Register (HR) (an n-bit shift register) with the last outcomes of all branches
– If local: multiple HRs in a History Register Table (HRT) that is indexed by the branch’s PC, where each HR contains the last outcomes of the corresponding branch only

2nd level: the branch behavior for the last s occurrences of the history pattern
– If global: a single Pattern History Table (PHT) indexed by the resulting HR contents
– If local: multiple PHTs that are indexed by the branch’s PC, where each entry is indexed by the resulting HR contents
– Thus, 2^n entries for each HR
Two Level Branch Predictors
Example with global history and global pattern table (GAg):
– All branches use the same HR
– All branches use the same PHT
– The 2-bit saturating counter is only an example; other schemes are possible
– Meaning: “when the outcome of the last n branches (any branches) is 11…10, then the prediction is P, regardless of what branch is being predicted”

[Diagram: the n-bit Branch History Register (e.g., 11…10) indexes the Pattern History Table of 2-bit saturating counters; the selected entry P = 01 → predict not taken; the branch result is shifted into the HR]
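A minimal sketch of the GAg scheme described above, combining an n-bit global history register with a shared table of 2-bit saturating counters (the training pattern is illustrative):

```python
class GAg:
    """Global history register indexing a global PHT of 2-bit counters."""
    def __init__(self, n):
        self.n = n
        self.hr = 0                 # n-bit global history, newest outcome in bit 0
        self.pht = [1] * (2 ** n)   # 2-bit counters, initialized weakly not-taken

    def predict(self):
        return self.pht[self.hr] >= 2  # True = predict taken

    def update(self, taken):
        # Saturating update of the counter selected by the current history,
        # then shift the actual outcome into the history register.
        ctr = self.pht[self.hr]
        self.pht[self.hr] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        self.hr = ((self.hr << 1) | int(taken)) & ((1 << self.n) - 1)

p = GAg(4)
for outcome in [True, True, False] * 20:  # train on a taken/taken/not-taken pattern
    p.update(outcome)
# After training, the 4-bit history uniquely identifies the position in the
# period-3 pattern, so the predictor follows it perfectly.
```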
Two Level Branch Predictors

Example with local history and global pattern table (PAg):
– Each branch uses its own HR
– All branches use the same PHT
– Meaning: “when the outcome of the last n instances of the branch being predicted is 11…10, then the prediction is P, regardless of what branch is being predicted”

[Diagram: the branch PC indexes a table of tagged Branch History Registers; the selected HR contents (e.g., 11…10) index the shared Pattern History Table, whose entry gives the prediction P]
Two Level Branch Predictors

Example with local history and local pattern table (PAp):
– Each branch uses its own HR
– Each branch uses its own PHT
– Meaning: “when the outcome of the last n instances of the branch being predicted is 11…10, then the prediction is P for this particular branch”

[Diagram: the branch PC selects both a tagged Branch History Register and a per-branch Pattern History Table; the HR contents index that PHT, whose entry gives the prediction P]
Two Level Branch Predictors
Notes:
– When only part of the branch’s PC is used for indexing there is aliasing (i.e., multiple branches appear to be the same)
– In practice there is a finite number of entries in the tables with local information, so:
  Either these only cache information for the most recently seen branches
  Or the tables are indexed by hashing (usually with an XOR) the branch’s PC (this also leads to aliasing)
– Aliasing also happens with global information, as multiple branches appear to have the same behavior/prediction
– The accuracy of a predictor depends on:
  Local versus global information at each level
  The size of the tables in local schemes (the number of different branches that can be tracked)
  The depth of the history (n)
  The amount of aliasing
Two Level Branch Predictors
11
Updates:– The HR’s are updated with the outcome of the branch
being predicted (only the corresponding HR in case of local scheme)
– The predictor in the selected PHT entry is updated with the outcome of branch (e.g., a 2-bit saturating counter is incremented/decremented if the outcome is taken/not taken)
Taxonomy:– History Table type:
Global: GA; Local (per address): PA– Pattern Table type:
Global: g; Local (per address): p– Thus: GAg=global history table and global pattern table PAg=local history table and global pattern table– GAp combination does not make much sense
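As a concrete illustration of the two-level scheme above, here is a minimal PAg-style sketch in C. The table sizes, the indexing of history registers by low PC bits, and the counter initialization are illustrative assumptions, not details from the lecture:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical PAg sketch: per-branch history registers, one shared
 * Pattern History Table of 2-bit saturating counters. */
#define HIST_BITS 4                   /* depth of the history, n */
#define NUM_HRS   64                  /* per-branch history registers */
#define PHT_SIZE  (1 << HIST_BITS)    /* one entry per history pattern */

static uint8_t hr[NUM_HRS];           /* n-bit histories, 1 = taken */
static uint8_t pht[PHT_SIZE];         /* 2-bit counters, start at 0 */

bool predict(uint32_t pc) {
    uint8_t h = hr[pc % NUM_HRS];     /* aliasing if branches collide */
    return pht[h] >= 2;               /* counter >= 2 means predict taken */
}

void update(uint32_t pc, bool taken) {
    uint8_t *h = &hr[pc % NUM_HRS];
    uint8_t *c = &pht[*h];
    if (taken  && *c < 3) (*c)++;     /* saturating increment */
    if (!taken && *c > 0) (*c)--;     /* saturating decrement */
    *h = (uint8_t)(((*h << 1) | taken) & (PHT_SIZE - 1)); /* shift in outcome */
}
```

After warming up on a strictly alternating branch, the shared PHT learns both history patterns and the predictor becomes perfect, illustrating why local history captures periodic behavior that a single 2-bit counter cannot.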
Local vs. Global Predictors

The simple 2-bit predictor performs best for small predictor sizes, but saturates quickly and below the other predictors
Local (PAg) outperforms global (GAg) for all these predictor sizes

[Figure: predictor accuracy (%) versus predictor size, from 64B to 64KB, for the GAg, PAg, and simple 2-bit predictors; accuracies range from roughly 84% to 98%. Data from McFarling for SPEC 1989 Int and FP]

Combining Branch Predictors

Different predictors are good at different behaviors
Different predictors have different accuracy and latency
Combining predictors
– Can lead to schemes that are good at more behaviors
– Can generate quickly a reasonably accurate prediction and, with some more delay, a highly accurate prediction, which corrects the previous prediction if necessary
– Usually combines a simple and a complex predictor
Choosing between multiple predictors:
– A “meta-predictor” chooses the predictor that most likely has the correct prediction
– Augment predictors with confidence estimators

Combining Branch Predictors

Meta predictor
– Use a 2-bit saturating counter to select the predictor to use

[Figure: the PC indexes a table of 2-bit saturating selector counters; the selected counter S drives a 2:1 MUX that picks between the predictions of predictors P1 and P2 (e.g., S = 01 selects predictor P2) to produce the final prediction]

Combining Branch Predictors

Meta predictor
– 2-bit saturating counter interpretation:
  – 00: Use P2
  – 01: Use P2
  – 10: Use P1
  – 11: Use P1
– Updating the counter:
  – P1 correct and P2 correct this time: no change to counter
  – P1 correct and P2 incorrect this time: increment counter
  – P1 incorrect and P2 correct this time: decrement counter
  – P1 incorrect and P2 incorrect this time: no change to counter
Choosing among more than 2 predictors is more involved and rarely pays off
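The selector interpretation and update rules above can be sketched directly; the initial counter value below is an assumption:

```c
#include <stdbool.h>

/* Tournament selector sketch: a 2-bit saturating counter chooses
 * between two component predictors.  00/01 select P2, 10/11 select P1. */
static unsigned sel = 2;   /* assumed initial value: weakly prefer P1 */

bool choose(bool pred1, bool pred2) {
    return (sel >= 2) ? pred1 : pred2;   /* 10,11 -> P1; 00,01 -> P2 */
}

void update_selector(bool p1_correct, bool p2_correct) {
    if (p1_correct && !p2_correct && sel < 3) sel++;  /* move toward P1 */
    if (!p1_correct && p2_correct && sel > 0) sel--;  /* move toward P2 */
    /* both correct or both incorrect: no change */
}
```

The counter only moves when exactly one predictor is right, so the selector tracks which component has been more accurate recently.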
Example: The Alpha 21464 Predictors

8-wide out-of-order superscalar processor with a very deep pipeline and multithreading
Predictors take approximately 44KBytes of storage
Up to 16 branches predicted every cycle
Minimum misprediction penalty of 14 cycles (112 instructions); most common is 20 or 25 cycles (160 or 200 instructions)
Based on global schemes; local schemes were ruled out because:
– They would require up to 16 parallel lookups of the tables
– It is difficult to maintain per-branch information (e.g., the same branch may appear multiple times in such a deeply pipelined, wide-issue machine)
In addition to conditional branch prediction it has a jump predictor and a return address stack predictor

Example: The Alpha 21464 Predictors

Fetch unit:
– Can fetch up to 16 instructions from 2 dynamically consecutive I-cache lines
– Instruction fetch stops at the first taken branch (predicted not-taken branches (up to 16) do not stop fetch)
1st Predictor: Next Line Predictor
– Operates within a single cycle
– Unacceptably high misprediction rate
2nd Predictor: 2Bc-gskew
– Operates over 2 cycles and is pipelined
– Actually consists of 2 different predictors (a 2-bit saturating counter and an e-gskew) combined, with a meta predictor selector
– Uses a “de-aliasing” approach:
  – Partition the tables into multiple sets and use special hashing functions
  – Shown to reduce aliasing in global schemes

References and Further Reading

Seminal branch prediction work:
– “Two-Level Adaptive Training Branch Prediction”, T.-Y. Yeh and Y. Patt, Intl. Symp. on Microarchitecture, December 1991.
– “Alternative Implementations of Two-Level Adaptive Branch Prediction”, T.-Y. Yeh and Y. Patt, Intl. Symp. on Computer Architecture, June 1992.
– “Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation”, S.-T. Pan, K. So, and J. T. Rahmeh, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1992.
– “Combining Branch Predictors”, S. McFarling, WRL Technical Note TN-36, June 1993.
Adding confidence estimation to predictors:
– “Assigning Confidence to Conditional Branch Predictions”, E. Jacobsen, E. Rotenberg, and J. Smith, Intl. Symp. on Microarchitecture, December 1996.

References and Further Reading

Alpha 21464 predictor:
– “Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor”, A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides, Intl. Symp. on Computer Architecture, June 2002.
– “Next Cache Line and Set Prediction”, B. Calder and D. Grunwald, Intl. Symp. on Computer Architecture, June 1995.
– “Trading Conflict and Capacity Aliasing in Conditional Branch Predictors”, P. Michaud, A. Seznec, and R. Uhlig, Intl. Symp. on Computer Architecture, June 1997.
Neural net based branch predictors:
– “Fast Path-Based Neural Branch Prediction”, D. Jimenez, Intl. Symp. on Microarchitecture, December 2003.
Championship Branch Prediction:
– www.jilp.org/cbp/
– camino.rutgers.edu/cbp2/

Probing Further

Advanced register allocation and de-allocation:
– “Late Allocation and Early Release of Physical Registers”, T. Monreal, V. Vinals, J. Gonzalez, A. Gonzalez, and M. Valero, IEEE Trans. on Computers, October 2004.
Value prediction:
– “Exceeding the Dataflow Limit Via Value Prediction”, M. H. Lipasti and J. P. Shen, Intl. Symp. on Microarchitecture, December 1996.
Limitations to wide issue processors:
– “Complexity-Effective Superscalar Processors”, S. Palacharla, N. P. Jouppi, and J. Smith, Intl. Symp. on Computer Architecture, June 1997.
– “Clock Rate Versus IPC: the End of the Road for Conventional Microarchitectures”, V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, Intl. Symp. on Computer Architecture, June 2000.
Recent alternatives to out-of-order execution:
– “‘Flea-flicker’ Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense”, R. D. Barnes, S. Ryoo, and W. Hwu, Intl. Symp. on Microarchitecture, November 2005.
Lect. 5: Vector Processors

Many real-world problems, especially in science and engineering, map well to computation on arrays
The RISC approach is inefficient:
– Based on loops → requires dynamic or static unrolling to overlap computations
– Indexing arrays based on arithmetic updates of induction variables
– Fetching of array elements from memory based on individual, and unrelated, loads and stores
– Small register files
– Instruction dependences must be identified for each individual instruction
Idea:
– Treat operands as whole vectors, not as individual integer or floating-point numbers
– A single machine instruction now operates on whole vectors (e.g., a vector add)
– Loads and stores to memory also operate on whole vectors
– Individual operations on vector elements are independent, and only dependences between whole vector operations must be tracked

Execution Model

Straightforward RISC code:
– F2 contains the value of s
– R1 contains the address of the first element of a
– R2 contains the address of the first element of b
– R3 contains the address of the last element of a + 8

for (i=0; i<64; i++)
  a[i] = b[i] + s;

loop: L.D    F0,0(R2)   ;F0=array element of b
      ADD.D  F4,F0,F2   ;main computation
      S.D    F4,0(R1)   ;store result
      DADDUI R1,R1,8    ;increment index
      DADDUI R2,R2,8    ;increment index
      BNE    R1,R3,loop ;next iteration

Execution Model

Straightforward vector code:
– F2 contains the value of s
– R1 contains the address of the first element of a
– R2 contains the address of the first element of b
– Assume vector registers have 64 double precision elements
– Notes:
  – Some vector operations require access to the integer and FP register files as well
  – In practice vector registers are not of the exact size of the arrays
  – Refer to Figure G.3 of Hennessy & Patterson for a list of the most common types of vector instructions
  – Only 3 instructions are executed, compared to the 6*64=384 executed in the RISC version

for (i=0; i<64; i++)
  a[i] = b[i] + s;

LV      V1,R2    ;V1=array b
ADDVS.D V2,V1,F2 ;main computation
SV      V2,R1    ;store result

Execution Model (Pipelined)

In practice, the vector unit takes several cycles to operate on each element, but is pipelined
With multiple vector units, I2 can execute together with I1 (as we will see later)

[Figure: pipeline diagram over cycles 1–8; after I1 passes IF and ID, its element operations I1.1, I1.2, I1.3, … enter EXE one per cycle and flow through MEM and WB, while the following instruction I2 waits in ID]

Pros of Vector Processors

Reduced pressure on instruction fetch
– Fewer instructions are necessary to specify the same amount of work
Reduced pressure on instruction issue
– The reduced number of branches alleviates branch prediction
– Much simpler hardware for checking dependences
Simpler register file
– No need for too many ports, as only one element is used per cycle (for the pipelined approach)
More streamlined memory accesses
– Vector loads and stores specify a regular access pattern
– The high latency of initiating a memory access is amortized

Cons of Vector Processors

Requires a specialized, high-bandwidth memory system
– Caches do not usually work well with vector processors
– Usually built around heavily banked memory with data interleaving
Still requires a traditional scalar unit (integer and FP) for the non-vector operations
Difficult to maintain precise interrupts (can’t roll back all the individual operations already completed)
The compiler or programmer has to vectorize programs
Not very efficient for small vector sizes
Not suitable/efficient for many different classes of applications

Performance Issues

Performance of a vector instruction depends on the length of the operand vectors
Initiation rate
– The rate at which individual operations can start in a functional unit
– For fully pipelined units this is 1 operation per cycle
– Usually lower (more than one cycle per operation) for the load/store unit
Start-up time
– The time it takes to produce the first element of the result
– Depends on how deep the pipelines of the functional units are
– Especially large for the load/store unit
With an initiation rate of 1, the time to complete a single vector instruction is equal to the vector size + the start-up time, which is approximately equal to the vector size for large vectors

Performance Issues

Common vector processor performance metrics:
– R∞: the rate of execution of the processor with vectors of infinite size (i.e., with no overheads due to smaller vectors)
– N1/2: the vector length required for the processor to reach half of R∞
– Nv: the vector length required for the processor to match the performance of scalar execution (i.e., the point at which it pays off to execute in vector mode)
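Under the simple timing model above (initiation rate 1, completion time = start-up time + vector length), N1/2 can be computed directly; this sketch assumes that model and shows that N1/2 then equals the start-up time:

```c
/* With initiation rate 1, a length-n vector instruction takes
 * (startup + n) cycles, so the execution rate is n / (startup + n)
 * elements per cycle and R_inf = 1 element per cycle.  N_1/2 is the
 * smallest n whose rate reaches half of R_inf. */
int n_half(int startup) {
    int n = 1;
    /* rate >= 1/2  <=>  2*n >= startup + n  <=>  n >= startup */
    while (2 * n < startup + n)
        n++;
    return n;
}
```

So in this model a deeper start-up (e.g., the load/store unit) directly pushes N1/2 up: the vectors must be at least as long as the start-up time before the unit runs at half its peak rate.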
Dealing with Vector Sizes

Two new registers are used:
– The vector length register (VLR) specifies (to the hardware) what length is to be assumed for the next instruction to be issued
– The maximum vector length (MVL) specifies (to the programmer/compiler) what the maximum length is (i.e., the size of the registers in the particular machine)
Use strip mining for user arrays larger than MVL:

for (i=0; i<n; i++)
  a[i] = b[i] + s;

becomes

low = 0;
VL = n % MVL;                 /* set the length to the remainder part of the
                                 array when the size is not divisible by MVL;
                                 for instance, with n=140 and MVL=64 we have
                                 1 remainder chunk of 12 and 2 chunks of 64 */
for (j=0; j<=n/MVL; j++) {
  for (i=low; i<low+VL; i++)  /* this is the loop that gets vectorized */
    a[i] = b[i] + s;
  low = low + VL;
  VL = MVL;                   /* set the length back to MVL */
}
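A runnable scalar rendition of the strip-mined loop above, with MVL = 64 as an illustrative machine limit (the inner loop stands in for the vectorized chunk):

```c
/* Strip mining: process the odd-sized remainder chunk first, then
 * full MVL-sized chunks, exactly as in the transformed loop above. */
#define MVL 64

void strip_mined_add(double *a, const double *b, int n, double s) {
    int low = 0;
    int VL = n % MVL;                         /* remainder chunk first */
    for (int j = 0; j <= n / MVL; j++) {
        for (int i = low; i < low + VL; i++)  /* "vector" chunk of length VL */
            a[i] = b[i] + s;
        low += VL;
        VL = MVL;                             /* all remaining chunks are full */
    }
}
```

Note the loop bounds: the outer loop runs n/MVL + 1 times and the inner loop covers VL elements per chunk, so every element from 0 to n-1 is touched exactly once.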
Advanced Features: Masking

What if the operations involve only some elements of the array, depending on some run-time condition?

for (i=0; i<64; i++)
  if (b[i] != 0)
    a[i] = b[i] + s;

Solution: masking
– Add a new boolean vector register (the vector mask register)
– The vector instruction then only operates on elements of the vectors whose corresponding bit in the mask register is 1
– Add new vector instructions to set the mask register
  – E.g., SNEVS.D V1,F0 sets to 1 the bits in the mask register whose corresponding elements in V1 are not equal to the value in F0
  – The CVM instruction sets all bits of the mask register to 1

Advanced Features: Masking

Vector code:
– F2 contains the value of s and F0 contains zero
– R1 contains the address of the first element of a
– R2 contains the address of the first element of b
– Assume vector registers have 64 double precision elements

for (i=0; i<64; i++)
  if (b[i] != 0)
    a[i] = b[i] + s;

LV      V1,R2    ;V1=array b
SNEVS.D V1,F0    ;mask bit is 1 if b != 0
ADDVS.D V2,V1,F2 ;main computation (masked)
CVM              ;reset mask to all 1s
SV      V2,R1    ;store result
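A scalar sketch of the mask-register semantics used above (SNEVS.D-style mask generation followed by a masked add); the helper names are hypothetical:

```c
#include <stdbool.h>

/* Scalar model of vector masking: a boolean mask register selects
 * which elements a subsequent vector operation touches. */
#define VLEN 64

void set_mask_ne(bool mask[VLEN], const double v[VLEN], double f0) {
    for (int i = 0; i < VLEN; i++)
        mask[i] = (v[i] != f0);       /* 1 where the element differs from F0 */
}

void masked_add_scalar(double dst[VLEN], const double src[VLEN],
                       double s, const bool mask[VLEN]) {
    for (int i = 0; i < VLEN; i++)
        if (mask[i])                  /* only masked-in elements operate */
            dst[i] = src[i] + s;
}
```

Elements whose mask bit is 0 are left untouched, which is exactly how the data-dependent `if` inside the loop is vectorized without any branch.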
Advanced Features: Scatter-Gather

How can we handle sparse matrices?

for (i=0; i<64; i++)
  a[K[i]] = b[K[i]] + s;

Solution: scatter-gather
– Use the contents of an auxiliary vector to select which elements of the main vector are to be used
– This is done by pointing to the address in memory of the elements to be selected
– Add a new vector instruction to load memory values based on this auxiliary vector
  – E.g., LVI V1,(R1+V2) loads the elements of a user array from memory locations R1+V2(i)
  – Also SVI, the store counterpart

Advanced Features: Scatter-Gather

Vector code:
– F2 contains the value of s
– R1 contains the address of the first element of a
– R2 contains the address of the first element of b
– V3 contains the indices of a and b that need to be used
– Assume vector registers have 64 double precision elements

for (i=0; i<64; i++)
  a[K[i]] = b[K[i]] + s;

LVI     V1,(R2+V3) ;V1=array b indexed by V3
ADDVS.D V2,V1,F2   ;main computation
SVI     V2,(R1+V3) ;store result
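A scalar sketch of the gather (LVI) and scatter (SVI) operations above; the function names are illustrative:

```c
/* Gather-scatter: an index vector selects which memory elements
 * participate, as the LVI/SVI instructions do with R1+V(i). */
void gather(double *dst, const double *base, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = base[idx[i]];        /* LVI: load from base + idx[i] */
}

void scatter(double *base, const double *src, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        base[idx[i]] = src[i];        /* SVI: store to base + idx[i] */
}
```

Gathering compacts the scattered non-zero elements into a dense vector register, the arithmetic runs densely, and the scatter writes the results back to their sparse locations.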
Advanced Features: Striding

for (i=0; i<64; i++)
  a[i] = b[i][j] + s;

Assume that the 2D array b is laid out by rows
– Iterations access non-contiguous elements of b
– Could use scatter-gather, but this would waste a vector register
– The access pattern is very regular and a single integer, the stride, fully defines it
– Add a new vector instruction to load values from memory based on the stride
  – E.g., LVWS V1,(R1,R2) loads the elements of a user array from memory locations R1+i*R2
  – Also SVWS, the store counterpart

Advanced Features: Chaining

Forwarding in pipelined RISC processors allows dependent instructions to execute as soon as the result of the previous instruction is available:

ADD.D R1,R2,R3 # R1=R2+R3
MUL.D R4,R5,R1 # R4=R5*R1

[Figure: pipeline diagram over cycles 1–6; the add’s EXE result is forwarded directly into the mul’s EXE stage, so the mul executes one cycle behind the add instead of waiting for the value to reach WB]

Advanced Features: Chaining

A similar idea applies to vector instructions and is called chaining
– The difference is that chaining of vector instructions requires multiple functional units, as the same unit cannot be used back-to-back

ADDV.D V1,V2,V3 # V1=V2+V3
MULV.D V4,V5,V1 # V4=V5*V1

[Figure: pipeline diagram over cycles 1–6; as each element result A.1, A.2, A.3, … leaves the add unit it is chained into the multiply unit, whose element operations M.1, M.2, … proceed one cycle behind]
Example: The Earth Simulator

73rd fastest supercomputer as of the Top500 list of November 2008 (was 1st from March 2002 to September 2004)
Multiprocessor vector architecture
– 640 nodes, 8 vector processors per node → 5120 processors
– 8 pipelines per vector processor
– 10 TBytes of main memory
– Vector units contain 72 vector registers, each with 256 elements
Performance and power consumption
– 35.9 TFLOPS on the Top500 benchmark (the closest RISC-based multiprocessor (#72) reaches 36.6 TFLOPS using 9216 processors)
– 12800 kW power consumption
Designed specifically to simulate nature (e.g., weather, ocean, earthquakes) at a global scale (i.e., the whole earth)

Further Reading

The first truly successful vector supercomputer:
– “The CRAY-1 Computer System”, R. M. Russel, Communications of the ACM, January 1978.
A recent vector processor on a chip:
– “Vector vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks”, C. Kozyrakis and D. Patterson, Intl. Symp. on Microarchitecture, December 2002.
Integrating a vector unit with a state-of-the-art superscalar:
– “Tarantula: A Vector Extension to the Alpha Architecture”, R. Espasa, F. Ardanaz, J. Elmer, S. Felix, J. Galo, R. Gramunt, I. Hernandez, T. Ruan, G. Lowney, M. Mattina, and A. Seznec, Intl. Symp. on Computer Architecture, June 2002.
Lect. 6: SIMD Processors

Superscalar execution model:
– Mix of scalar ALUs
– n unrelated instructions per cycle
– 2n unrelated operands per cycle
– Results from any ALU can feed back to any ALU individually
– Operands are wide (32/64 bits)
Vector execution model:
– Vector ALU
– 1 vector instruction → multiple instances of the same operation
– Operands belong to an array
– Results are written back to the register file
– Operands are wide (32/64 bits)

[Figure: in both models an instruction sequencer feeds ALUs from a register file; the superscalar issues n independent instructions to scalar ALUs, while the vector model issues one instruction to a vector ALU]

Original SIMD Idea

Network of simple processing elements (PEs)
– PEs operate in lockstep under the control of a master sequencer, the array control unit (ACU) (note: masking is possible)
– PEs can exchange results with a small number of neighbors via special data-routing instructions
– Each PE has its own local memory or (less commonly) accesses memory via an alignment network
– PEs operate on very narrow operands (1 bit in the extreme case of the CM-1)
– Very large (up to 64K) number of PEs
– Usually operated as co-processors with a host computer to perform I/O and to handle external memory
Suitable for some scientific, AI, and visualization applications
Intended for use as supercomputers
Programmed via custom extensions of common HLLs

Original SIMD Idea

[Figure: a single instruction sequencer broadcasts each instruction to a 2D grid of processing elements, each with its own local memory (M) and connections to its neighbors]
Example: Equation Solver Kernel

The problem:
– Operate on an (n+2)x(n+2) matrix:

A[i,j] = 0.2 x (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

SIMD implementation:
– Assign one node to each PE
– Step 1: all PEs send their data to their east neighbors and simultaneously read the data sent by their west neighbors (nodes at the right, top, and bottom rim are masked out at this step)
– Steps 2 to 4: same as step 1 for west, south, and north (again, the appropriate nodes are masked out)
– Step 5: all PEs compute the new value using the equation above
– Note: strictly speaking we need some extra tricks to juggle new and old values
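A scalar sketch of one sweep of the solver above; the grid size is illustrative, and the “extra trick” for juggling new and old values is made explicit by writing into a separate output grid:

```c
/* One Jacobi-style sweep of the 5-point stencil above.  The interior
 * is N x N; the grid is (N+2) x (N+2) to hold the boundary rim. */
#define N 4   /* illustrative interior size */

void sweep(double out[N + 2][N + 2], double in[N + 2][N + 2]) {
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            out[i][j] = 0.2 * (in[i][j] + in[i][j - 1] + in[i - 1][j]
                               + in[i][j + 1] + in[i + 1][j]);
}
```

In the SIMD version each PE computes one `out[i][j]` in lockstep, having received the four neighbor values through the data-routing steps 1 to 4.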
Example: MasPar MP-1

Key features
– First SIMD to use a traditional RISC instruction set
– The ACU also performs non-SIMD operations/computation
– From 1K to 16K PEs
– PE array interconnects:
  – 2D mesh for 8-way (N, S, E, W, NE, SE, SW, NW) neighbor communication (X net)
  – Circuit-switched 3-stage hierarchical crossbar for any-to-any communication
  – Two global buses for ACU-PE lockstep control
– PEs have local memory for data (16KB) (instructions are stored in the ACU)
– PEs commonly operate on 32-bit words, but can also operate on individual bits, bytes, 16-bit words, and 64-bit words

Example: MasPar MP-1

[Figure from Blank: the MP-1 system, showing the PE array with its 2D mesh, the ACU and Unix host, and the crossbar with routers]

A Modern SIMD Co-processor

ClearSpeed CSX600
– Intended as an accelerator for high performance technical computing
– The current implementation has 96 PEs plus a scalar unit for non-SIMD operations (including control flow)
– Each PE is in fact a VLIW core
– 1, 2, 4, and 8 byte operands
– PEs can communicate directly with their right and left neighbors
– Also supports multithreading to hide I/O latency (Lecture 12)
– Uses traditional instruction and data caches in addition to the memory local to each PE
– Programmed with an extension of C
  – Poly variables: replicated in each PE with different values
  – Mono variables: only a single instance exists (either at the host, or replicated at the PEs but with synchronized values)

A Modern SIMD Co-processor

[Figure from ClearSpeed: the CSX600, showing the PE array with local memories (SRAM) and registers, the RISC scalar processor and ACU, and the neighbor communication infrastructure (swazzle)]

Multimedia SIMD Extensions

Key ideas:
– No network of processing elements, but an array of (narrow) ALUs
– No memories associated with the ALUs, but a pool of relatively wide (64 to 128 bit) registers that store several operands
– Still narrow operands (8 bits) and instructions that use operands of different sizes
– No direct communication between ALUs, but via registers and with special shuffling/permutation instructions
– Not co-processors or supercomputers, but tightly integrated into the CPU pipeline
– Still lockstep operation of the ALUs
– Special instructions to handle common media operations (e.g., saturated arithmetic)
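To illustrate the saturated arithmetic mentioned above, here is a scalar sketch of a lane-wise saturating unsigned-byte add, the kind of operation a multimedia SIMD instruction performs on all lanes of a register at once:

```c
#include <stdint.h>

/* Saturated arithmetic: adding two unsigned bytes clamps at 255
 * instead of wrapping around, which is what media code (e.g., pixel
 * blending) usually wants. */
uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + b;
    return (sum > 255) ? 255 : (uint8_t)sum;   /* clamp on overflow */
}

/* Apply lane-wise across one packed 64-bit register's worth of bytes. */
void sat_add_u8x8(uint8_t dst[8], const uint8_t x[8], const uint8_t y[8]) {
    for (int i = 0; i < 8; i++)
        dst[i] = sat_add_u8(x[i], y[i]);
}
```

A real SIMD extension executes all eight lane additions in lockstep in a single instruction; the loop here only models the semantics.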
Multimedia SIMD Extensions

SIMD extension execution model:

[Figure: the instruction sequencer and a wide register file feed a shuffling network and an array of narrow ALUs; inter-register operations add the packed sub-words of R1 and R2 lane by lane into R3, while intra-register operations combine sub-words within a single register]

Example: Intel SSE

Streaming SIMD Extensions, introduced in 1999 with the Pentium III
Improved over the earlier MMX (1997)
– MMX re-used the FP registers
– MMX only operated on integer operands
70 new machine instructions (SSE2 added 144 more in 2001) and 8 128-bit registers
– The registers are part of the architectural state
– Includes instructions to move values between SSE and x86 registers
– Operands can be: single (32-bit) and double (64-bit) precision FP; 8, 16, and 32 bit integers
– Some instructions support digital signal processing (DSP) and 3D
– SSE2 included instructions for handling the cache (recall that streaming data does not utilize caches efficiently)

A Modern SIMD Variation: Cell

IBM/Sony/Toshiba Cell Broadband Engine:
– Heterogeneous “multi-core” system with 1 PowerPC (PPE) + 8 SIMD engines (SPEs – “Synergistic Processing Elements”)
– On-chip storage based on “scratch pads” (very, very hard to program)
– Used in the Playstation 3
SIMD support:
– SPEs are incapable of independent control and are “slaves” to the PowerPC
– The PPE already supports SIMD extensions (IBM’s VMX)
– SPEs support SIMD through a specific instruction set
– 128 128-bit registers and a 128-bit datapath (note: no scalar registers in the SPE)
– Accessible to the programmer through HLL intrinsics (i.e., function calls, e.g., spu_add(a,b))
– Additional support for synchronization across the SPEs and PPE and for data transfer

References and Further Reading

Seminal SIMD work:
– “A Model of SIMD Machines and a Comparison of Various Interconnection Networks”, H. Siegel, IEEE Trans. on Computers, December 1979.
– “The Connection Machine”, D. Hillis, Ph.D. dissertation, MIT, 1985.
Two commercial SIMD supercomputers:
– “The CM-2 Technical Summary”, Thinking Machines Corporation, 1990.
– “The MasPar MP-1 Architecture”, T. Blank, Compcon, 1990.
A modern SIMD co-processor:
– “CSX Processor Architecture”, ClearSpeed, Whitepaper, 2006.
Lect. 7: Shared Mem. Multiprocessors I/V

Obtained by connecting full processors together
– Processors contain normal width (32 or 64 bit) datapaths
– Processors are capable of independent execution and control
– Processors have their own connection to memory
(Thus, by this definition, Sony’s Playstation 3 is not a multiprocessor, as the 8 SPEs in the Cell are not full processors)
Have a single OS for the whole system, support both processes and threads, and appear as a common multiprogrammed system
(Thus, by this definition, Beowulf clusters are not multiprocessors)
Can be used to run multiple sequential programs concurrently or parallel programs
Suitable for parallel programs where threads can follow different code

Shared Memory Multiprocessors

Recall the communication model:
– Threads in different processors can use the same virtual address space
– Communication is done through shared memory variables
– Explicit synchronization with locks (e.g., variable flag below) and critical sections

Producer (p1):        Consumer (p2):
  flag = 0;             flag = 0;
  …                     …
  a = 10;               while (!flag) {}
  flag = 1;             x = a * y;
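A runnable sketch of the flag-based producer/consumer idiom above, using C11 atomics and pthreads so that the write to a is guaranteed to be visible once the flag is observed; the doubling in the consumer stands in for x = a * y with y = 2:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Shared variables: the data and the synchronization flag. */
static int a_val = 0;
static atomic_int flag = 0;

static void *producer(void *arg) {
    (void)arg;
    a_val = 10;                           /* write the data first */
    atomic_store(&flag, 1);               /* then publish it */
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load(&flag) == 0) {}    /* spin until published */
    *(int *)arg = a_val * 2;              /* x = a * y, with y = 2 here */
    return NULL;
}

int run_pair(void) {
    int x = 0;
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, &x);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return x;
}
```

The atomic store/load pair is exactly where the coherence and consistency machinery discussed next does its work: without write propagation, the consumer could spin forever or read a stale a.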
Shared Memory Multiprocessors

Recall the two common organizations:
– Physically centralized memory, uniform memory access (UMA) (a.k.a. SMP)
– Physically distributed memory, non-uniform memory access (NUMA)
(Note that both organizations have caches between processors and memory)

[Figure: in the UMA organization all CPUs reach a single main memory through their caches; in the NUMA organization each CPU/cache pair has its own local memory module]

The Cache Coherence Problem

Example with three CPUs, write-back caches, and a shared variable A:
– T0: memory holds A=1; no cache holds A
– T1: CPU1 loads A → cache 1 now holds A=1
– T2: CPU2 loads A → cache 2 now holds A=1
– T3: CPU1 stores A=2 → cache 1 holds A=2; memory and cache 2 still hold the stale value A=1
– T4: CPU3 loads A → gets the old value A=1 from memory
– T5: CPU2 loads A → hits in its own cache and uses the stale value A=1!
Cache Coherence Protocols

Idea:
– Keep track of which processors have copies of which data
– Enforce that at any given time a single value of every datum exists:
  – By getting rid of copies of the data with old values → invalidate protocols
  – By updating everyone’s copy of the data → update protocols
In practice:
– Guarantee that old values are eventually invalidated/updated (write propagation) (recall that without synchronization there is no guarantee that a load will return the new value anyway)
– Guarantee that a single processor is allowed to modify a certain datum at any given time (write serialization)
– Must appear as if no caches were present
Note: must fit with the cache’s operation at the granularity of lines

Write-invalidate Example

– T1: CPU1 loads A → cache 1 holds A=1
– T2: CPU2 loads A → cache 2 holds A=1
– T3: CPU1 stores A=2 → cache 1 holds A=2 and cache 2’s copy is invalidated; memory still holds the stale A=1
– T4: CPU3 loads A → obtains the new value A=2 (supplied by cache 1)
– T5: CPU2 loads A → misses, since its copy was invalidated, and also obtains the new value A=2

Write-update Example

– T1: CPU1 loads A → cache 1 holds A=1
– T2: CPU2 loads A → cache 2 holds A=1
– T3: CPU1 stores A=2 → cache 2 and memory are updated, so all copies hold A=2
– T4: CPU3 loads A → gets the new value A=2 from memory
– T5: CPU2 loads A → hits in its own cache and gets the new value A=2

Invalidate vs. Update Protocols

Invalidate:
+ Multiple writes by the same processor to the cache block only require one invalidation
+ No need to send the new value of the data (less bandwidth)
– Caches must be able to provide up-to-date data upon request
– Must write back data to memory when evicting a modified block
Usually used with write-back caches (more popular)
Update:
+ The new value can be re-used without the need to ask for it again
+ Data can always be read from memory
+ Modified blocks can be evicted from caches silently
– Possibly multiple useless updates (more bandwidth)
Usually used with write-through caches (less popular)
Cache Coherence Protocols

Implementation:
– Can be in either hardware or software, but software schemes are not very practical (and will not be discussed further in this course)
Add state bits to cache lines to track the state of the line
– Most common: Invalid, Shared, Owned, Modified, Exclusive
– Protocols are usually named after the states supported
The global state of a memory line corresponds to the collection of its states in all caches
Cache lines transition between states upon load/store operations from the local processor and by remote processors
These state transitions must guarantee the invariant: no two cache copies can be simultaneously modified

Example: MSI Protocol

States:
– Modified (M): the block is cached only in this cache and has been modified
– Shared (S): the block is cached in this cache and possibly in other caches (no cache can modify the block)
– Invalid (I): the block is not cached

Example: MSI Protocol

Transactions originated at this CPU:
– Invalid → Shared on a CPU read miss
– Invalid → Modified on a CPU write miss
– Shared → Modified on a CPU write
– Shared stays Shared on a CPU read hit; Modified stays Modified on CPU read and write hits

Example: MSI Protocol

Transactions originated at other CPUs:
– Shared → Invalid on a remote write miss
– Modified → Invalid on a remote write miss
– Modified → Shared on a remote read miss
– Shared stays Shared on a remote read miss
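The MSI transitions described above can be collected into a single per-line state machine; this sketch merges the local and remote transition diagrams, with event names of my own choosing:

```c
/* MSI per-line state machine: events combine local CPU accesses and
 * observed remote misses (as a snooping controller would see them). */
typedef enum { MSI_I, MSI_S, MSI_M } msi_t;
typedef enum {
    CPU_READ, CPU_WRITE,        /* accesses by the local CPU */
    REMOTE_READ, REMOTE_WRITE   /* misses observed from other CPUs */
} event_t;

msi_t msi_next(msi_t st, event_t ev) {
    switch (st) {
    case MSI_I:
        if (ev == CPU_READ)  return MSI_S;  /* read miss: fetch shared copy */
        if (ev == CPU_WRITE) return MSI_M;  /* write miss: fetch exclusive */
        return MSI_I;
    case MSI_S:
        if (ev == CPU_WRITE)    return MSI_M;  /* upgrade: invalidate others */
        if (ev == REMOTE_WRITE) return MSI_I;  /* our copy is invalidated */
        return MSI_S;                          /* read hits, remote reads */
    case MSI_M:
        if (ev == REMOTE_READ)  return MSI_S;  /* supply data, downgrade */
        if (ev == REMOTE_WRITE) return MSI_I;  /* supply data, invalidate */
        return MSI_M;                          /* local read/write hits */
    }
    return MSI_I;
}
```

Note that no sequence of events lets two caches hold M for the same line at once: a remote write always forces the local copy out of M, which is exactly the invariant the protocol must maintain.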
Example: MESI Protocol

States:
– Modified (M): the block is cached only in this cache and has been modified
– Exclusive (E): the block is cached only in this cache, has not been modified, but can be modified at will
– Shared (S): the block is cached in this cache and possibly in other caches
– Invalid (I): the block is not cached
State E is obtained on reads when no other processor has a shared copy
– All processors must answer whether they have copies or not
  – Easily done in bus-based systems with a shared-OR line
– Or some device must know if processors have copies
Advantage over MSI:
– Often variables are loaded, modified in a register, and then stored
– A store in state E then does not require asking for permission to write

Example: MESI Protocol

Transactions originated at this CPU:
– Invalid → Shared on a CPU read miss with sharing
– Invalid → Exclusive on a CPU read miss with no sharing
– Invalid → Modified on a CPU write miss
– Shared → Modified on a CPU write (must inform everyone: an upgrade)
– Exclusive → Modified on a CPU write (can be done silently)
– Shared, Exclusive, and Modified stay unchanged on CPU read hits; Modified stays on write hits

Example: MESI Protocol

Transactions originated at other CPUs:
– Shared → Invalid on a remote write miss
– Exclusive → Invalid on a remote write miss
– Exclusive → Shared on a remote read miss
– Modified → Shared on a remote read miss
– Modified → Invalid on a remote write miss
– Shared stays Shared on a remote read miss

Possible Implementations

Three possible ways of implementing coherence protocols in hardware:
– Snooping: all cache controllers monitor all other caches’ activities and maintain the state of their lines
  – Commonly used with buses and in many CMPs today
– Directory: a central control device directly handles all cache activities and tells the caches what transitions to make
  – Can be of two types: centralized and distributed
  – Commonly used with scalable interconnects and in many CMPs today
– List: each cache controller keeps track of its own state and the identity and state of its neighbors in a linked list
  – E.g., the IEEE SCI protocol (ANSI/IEEE Std 1596-1992)
  – Only used in a few machines in the late 90’s
CS4/MSc Parallel Architectures - 2009-2010
Behavior of Cache Coherence Protocols
Uniprocessor cache misses (the 3 C's):
– Cold (or compulsory) misses: when a block is accessed for the first time
– Capacity misses: when a block is not in the cache because it was evicted because the cache was full
– Conflict misses: when a block is not in the cache because it was evicted because the cache set was full
 Coherence misses: when a block is not in the cache because it was invalidated by a write from another processor
– Hard to reduce: relates to the intrinsic communication and sharing of data in the parallel application
– False sharing coherence misses: processors modify different words of the cache block (no real communication or sharing) but end up invalidating the complete block
17
Behavior of Cache Coherence Protocols
False sharing increases with larger cache line size– Only true sharing remains with single word/byte cache lines
False sharing can be reduced with better placement of data in memory
True sharing tends to decrease with larger cache line sizes (due to locality)
Classifying misses in a multiprocessor is not straightforward– E.g., if P0 has line A in the cache and evicts it due to capacity limitation,
and later P1 writes to the same line: is this a capacity or a coherence miss?
It is both, as fixing one problem (e.g., increasing cache size) won’t fix the other (see Figure 5.20 of Culler&Singh for a complete decision chart)
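The effect of line size on false sharing can be illustrated with a toy model (my own sketch, not from the course): two processors repeatedly write adjacent words, and we count how often a write finds the line owned by the other cache.

```python
# Toy model of false sharing: P0 repeatedly writes word 0 and P1 word 1.
# With 1-word lines they touch different lines (no coherence misses);
# with 2-word lines every write invalidates the other cache's copy.

def coherence_misses(trace, line_words):
    """trace: list of (processor, word_address) writes.
    Counts writes that find the line owned (state M) by another cache."""
    owner = {}                       # line number -> last writing processor
    misses = 0
    for proc, word in trace:
        line = word // line_words
        if owner.get(line) not in (None, proc):
            misses += 1              # the other CPU held the line: invalidate
        owner[line] = proc
    return misses

trace = [(0, 0), (1, 1)] * 100       # interleaved writes to adjacent words
```

With `line_words=1` there are no coherence misses at all; with `line_words=2` almost every write misses, even though the processors never share a word.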
18
Behavior of Cache Coherence Protocols
Common types of data access patterns
– Private: data that is only accessed by a single processor
– Read-only shared: data that is accessed by multiple processors but only for reading (this includes instructions)
– Migratory: data that is used and modified by multiple processors, but in turns
– Producer-consumer: data that is updated by one processor and consumed by another
– Read-write: data that is used and modified by multiple processors simultaneously
 Falsely shared data
 Data used for synchronization (Lecture 10)
Bottom-line: threads don’t usually read and write the same data indiscriminately
19
Snooping coherence on simple shared bus
– “Easy” as all processors and memory controller can observe all transactions
– Bus-side cache controller monitors the tags of the lines involved and reacts if necessary by checking the contents and state of the local cache
– Bus provides a serialization point (i.e., every transaction A is either before or after another transaction B)
More complex with split transaction buses
1
[Diagram: P1 and P2, each with an L1 cache holding per-line state bits (00 = invalid, 01 = shared, 10 = modified), connected to memory by a shared bus]
Lect. 8: Shared Mem. Multiprocessors II/V
“The devil is in the details”, Classic Proverb
Problem: conflict when the processor and the bus-side controller must check the cache
 Solutions:
– Use dual-ported modules for the tag and state array
– Or, duplicate the tag and state array
 Both must be kept consistent when one is changed, which introduces some amount of conflicts
2
Snooping on Simple Bus

[Diagram: P1 and P2, each with an L1 cache and per-line state bits (00 = invalid, 01 = shared, 10 = modified); both the processor (Ld/St) and the bus-side controller need access to the tag and state array]
Problem: even if bus is atomic, transactions are not instantaneous and may require several steps → transactions are not atomic– E.g., part of a transaction may be delayed by a memory response
or by a bus-side controller that had to wait to access its tags– E.g., out-of-order processors may issue cache requests that
conflict with the current request being served
– E.g., an upgrade request may lose bus arbitration to another processor's request and may have to be re-issued as a full write miss (due to the required invalidation)
Solution:– Introduce transient states to cache lines and the protocol (the I,
S, M, etc states seen in Lecture 7 are then called the stable states)
3
Snooping on Simple Bus
Example: Extended MESI Protocol
Transactions originated at this CPU:
4
[State diagram: MESI extended with transient states. E.g., a CPU read from I enters a transient I→S,E state until the bus is granted: with sharing the line settles in S, with no sharing in E. A CPU write from I enters a transient I→M state until the bus is granted. A CPU write from S enters a transient S→M state: if the bus is granted with no conflict the line becomes M, but on a conflict the upgrade must be re-issued as a full write miss. Read and write hits in the stable states behave as before]
Problems:– Processor interacts with L1 while bus snooping device
interacts with L2, and propagating such operations up or down is not instantaneous
– L2 lines are usually bigger than L1 lines
5
Snooping with Multi-Level Hierarchies
[Diagram: P1 and P2, each with a private L1 and L2; the processor (Ld/St) interacts with L1 while the bus-side snooper sits at L2; each cache keeps per-line state bits (00 = invalid, 01 = shared, 10 = modified)]
Solution: 1. Maintain inclusion property
– Lines in L1 must also be in L2 → no data is found solely in L1, so there is no risk of missing a relevant transaction when snooping at L2
– Lines in M state in L1 must also be in M state in L2 → the snooping controller at L2 can identify all data that is modified locally
2. Propagate coherence transactions
6
Snooping with Multi-Level Hierarchies
Maintaining inclusion property
 Assume: L1 with associativity a1, number of sets n1, block size b1; L2 with associativity a2, number of sets n2, block size b2
– Difficulty: replacement policy (e.g., LRU)
 Assume: a1=a2=2; b1=b2; n2=k*n1; lines m1, m2, and m3 map to the same set in L1 and the same set in L2
7
Snooping with Multi-Level Hierarchies
[Diagram: both caches initially hold m1. (1) P loads m2: it misses in L1 (2) and L2 (3) and both are filled (4, 5). (6) P loads m1: it hits in L1, refreshing m1's LRU position in L1 only. (7) P loads m3: it misses in L1 (8) and L2 (9) and both are filled (10, 11) — L1's LRU evicts m2, but L2's LRU evicts m1, so m1 is now in L1 but not in L2: inclusion is violated]
Maintaining inclusion property
 Assume: L1 with associativity a1, number of sets n1, block size b1; L2 with associativity a2, number of sets n2, block size b2
– Difficulty: different line sizes
 Assume: a1=a2=1; b1=1, b2=2; n1=4, n2=8
Thus, words w0 and w17 can coexist in L1, but not in L2
8
Snooping with Multi-Level Hierarchies
[Diagram: the four L1 sets hold w0, w1, w2, w3 individually; in L2 the two-word lines w0/w1 and w16/w17 both map to set 0]
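The set arithmetic behind this example can be checked directly. The sketch below is mine (the function name is an assumption, not course code); the set index is the block number modulo the number of sets:

```python
# Direct-mapped L1 (4 sets, 1-word lines) and direct-mapped L2
# (8 sets, 2-word lines), as in the example above.

def cache_set(word_addr, block_words, num_sets):
    """Set index of the block containing word_addr."""
    return (word_addr // block_words) % num_sets

# w0 and w17 land in different L1 sets but both map to L2 set 0,
# so they can coexist in L1 but not in the direct-mapped L2.
```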
Maintaining inclusion property
– Most combinations of L1/L2 size, associativity, and line size do not automatically lead to inclusion
– One solution is to have a1=1, a2≥1, b1=b2, and n1≤n2
– A more common solution is to invalidate the L1 line (or lines, if b1<b2) upon replacing a line in L2
– Must also invalidate the L1 line(s) when an L2 line is invalidated due to coherence
 Propagate all invalidations from L2 to L1, whether relevant or not
 Or keep extra state in the L2 lines to tell whether the line is also present in L1 or not (inclusion bits)
– Finally, add a new state to L2 (modified-but-stale) to keep track of lines that are in state M in L1
9
Snooping with Multi-Level Hierarchies
Non-split-transaction buses are idle from when the address request is finished until the data returns from memory or another cache
In split-transaction buses transactions are split into a request transaction and a response transaction, which can be separated
Sometimes implemented as two buses: one for requests and one for responses
10
Snooping with Split-Transaction Buses
[Timing diagram: on a normal bus, address 1 occupies the bus until data 1 returns, so address 2 must wait; on a split-transaction bus, addresses 2 and 3 can be issued on the address lines while data 0 and data 1 return on the data lines]
Problems– Multiple requests can clash (e.g., a read and a write, or two writes, to the
same data) (Note that this is more complicated than the case in Slide 3, as now different transactions may be at different stages of service)
– Buffers used to hold pending transactions may fill up and cause incorrect execution and even deadlock (flow control is required)
– Responses from multiple requests may appear in a different order than their respective requests
Responses and requests must then be matched using tags for each transaction
Note: it may be necessary for snoop controllers to request more time before responding (e.g., when they can’t have quick enough access to the local cache tags)
Note: snoop controllers may have to keep track themselves of what transactions are pending, in case there is conflict
11
Snooping with Split-Transaction Buses
Clashing requests– Allow only one request at a time for each line (e.g., SGI
Challenge)
Flow control– Use negative acknowledgement (NACK) when buffers are full
(requests must be retried later; a bit more tricky with responses, due to danger of deadlock) (e.g., SGI Challenge)
– Or, design the size of all queues for the worst case scenario
Ordering of transactions
– Responses can appear in any order → the interleaving of the requests fully determines the order of transactions (e.g., SGI Challenge)
– Or, enforce a FIFO order of transactions across the whole system (caches + memory) (e.g., Sun Enterprise)
12
Snooping with Split-Transaction Buses
Sun Enterprise (1996-2001)
 Up to 30 UltraSparc processors (Enterprise 6000)
 The Gigaplane bus (3rd generation of buses from Sun):
– Peak bandwidth of 2.67GB/s at 83MHz
– Supports up to 16 nodes (either processor or I/O boards)
– 256bits data, 43bits address, 32bits ECC, and 57 control lines
– Split-transaction with up to 112 outstanding transactions
 Up to 30GB of main memory, 16-way interleaved
 Memory is physically located in the processor boards, but it is still a UMA system
13
[Diagram: the Gigaplane bus connecting CPU/Mem cards (each with two processors with L1/L2 caches, memory, and a bus interface) and I/O cards (bus interface with SBUS, FibreChannel, 100bT, and SCSI)]
Sun Fire (2001-present)
 Up to 106 UltraSparc III processors (Fire 15K)
 The Fireplane bus (4th generation of buses from Sun):
– Peak bandwidth of 9.6GB/s at 150MHz
– Actually implemented using 4 levels of switches, not bus lines
– Consists of two snooping domains connected by the upper level switch
 Up to 576GB of main memory
 Memory is physically located in the processor boards, but it is still a UMA system
14
[Diagram: switch hierarchy — levels 0 and 1 (3x3 data switch) form a low-end system with 2 processors (each with L1/L2 and memory); a level-2 data switch supports up to 8 processors; a level-3 18x18 data switch connects boards of up to 24 processors into systems of up to 106 processors]
Like a bus, rings easily support broadcasts
 Snooping is implemented by all controllers checking each message as it passes by and re-injecting it into the ring
 Potentially multiple transactions can be simultaneously on different stretches of the ring (harder to enforce proper ordering)
 Latency is large for long rings and grows linearly with the number of processors
 Used to provide coherence across multiple chips in current CMP systems (e.g., IBM Power 5)
15
Snooping with Ring
[Diagram: six nodes, each a processor with an L1 cache and local memory, connected in a ring]
References and Further Reading
16
Original (hardware) cache coherence works:
– "Using Cache Memory to Reduce Processor Memory Traffic", J. Goodman, Intl. Symp. on Computer Architecture, June 1983.
– "A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories", M. Papamarcos and J. Patel, Intl. Symp. on Computer Architecture, June 1984.
– "Hierarchical Cache/Bus Architecture for Shared-Memory Multiprocessors", A. Wilson Jr., Intl. Symp. on Computer Architecture, June 1987.
 An early survey of cache coherence protocols:
– "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model", J. Archibald and J.-L. Baer, ACM Trans. on Computer Systems, November 1986.
 Discussion on the difficulties of maintaining inclusion:
– "On the Inclusion Properties for Multi-Level Cache Hierarchies", J.-L. Baer and W.-H. Wang, Intl. Symp. on Computer Architecture, May 1988.
References and Further Reading
17
Modern bus-based coherent multiprocessors:
– "The Sun Fireplane System Interconnect", A. Charlesworth, Supercomputing Conf., November 2001.
 Some software cache coherence schemes:
– "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture", G. Pfister, W. Brantley, D. George, S. Harvey, W. Kleinfelder, K. McAuliffe, E. Melton, V. Norton, and J. Weiss, Intl. Conf. on Parallel Processing, August 1985.
– "Automatic Management of Programmable Caches", R. Cytron, S. Karlowsky, and K. McAuliffe, Intl. Conf. on Parallel Processing, August 1988.
Snooping coherence
– Global state of a memory line is the collection of its state in all caches, and there is no summary state anywhere
– All cache controllers monitor all other caches' activities and maintain the state of their lines
– Requires a broadcast shared medium (e.g., bus or ring) that also maintains a total order of all transactions
– Bus acts as a serialization point to provide ordering
 Directory coherence
– Global state of a memory line is the collection of its state in all caches, but there is a summary state at the directory
– Cache controllers do not observe all activity, but interact only with the directory
– Can be implemented on scalable networks, where there is no total order and no simple broadcast, but only one-to-one communication
– Directory acts as a serialization point to provide ordering (Lecture 11)
1
Lect. 9: Shared Mem. Multiprocessors III/V
Directory Structure
 Directory information (for every memory line):
– Line state bits (e.g., not cached, shared, modified)
– Sharing bit-vector: one bit for each processor that is sharing, or for the single processor that has the modified line
– Organized as a table indexed by the memory line address
 Directory controller:
– Hardware logic that interacts with the cache controllers and enforces cache coherence
2
[Diagram: each directory entry holds 2 line-state bits (00 = not cached, 01 = shared, 10 = modified) and a 3-bit sharing vector, so up to 3 processors can be supported. Example entries: state 00, sharing vector 000 — the line is not cached, the sharing vector is empty, and the memory value (4) is valid; state 01, sharing vector 101 — the line is shared in P0 and P2 and the memory value (9) is valid. Cache states: 00 = invalid, 01 = shared, 10 = modified]
Directory Operation Example: load with no sharers
3
[Diagram: P0 issues a Load that misses in its L1. The request goes to the directory, which finds the line not cached (state 00, empty sharing vector), so memory's value (4) is valid. The directory replies with the value, sets the line state to shared (01), and sets P0's bit in the sharing vector; P0's cache fills the line in state shared]
Directory Operation Example: load with sharers
4
[Diagram: the line is already shared (state 01) with one sharer's bit set in the sharing vector, and the memory value (4) is valid. A second processor issues a Load that misses; the directory replies directly with the value and adds the requester to the sharing vector; the requester's cache fills the line in state shared]
Directory Operation Example: store with sharers
5
[Diagram: the line is shared (state 01) by two processors. Another processor issues a Store that misses; the directory sends Invalidate messages to the current sharers, each invalidated cache replies with an Acknowledge, and once all acknowledgements are collected the directory sends the Reply to the requester. The line state becomes modified (10), with only the requester's bit set in the sharing vector]
Directory Operation Example: load with owner
6
[Diagram: the line is modified (state 10) with a single owner. Another processor issues a Load that misses; the directory forwards the request to the owner, which sends the value to the requester and an Acknowledge+Value to the directory. Memory is updated, the line state becomes shared (01), and both processors' bits are set in the sharing vector]
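The four examples above can be condensed into a toy directory for a single memory line. This is my own sketch (class and method names are mine); messages and data values are omitted, only the state and sharing vector are tracked:

```python
# A toy directory entry for one memory line: directory state
# ("uncached" / "shared" / "modified") plus a sharing vector,
# here represented as a set of processor ids.

class ToyDirectory:
    def __init__(self):
        self.state = "uncached"
        self.sharers = set()          # sharing vector

    def load(self, p):
        # On a load miss, add the requester as a sharer; if a single
        # owner held the line modified, it is forced back to shared.
        self.state = "shared"
        self.sharers.add(p)

    def store(self, p):
        # On a store miss, invalidate all other sharers (each must
        # acknowledge), then record the requester as the only owner.
        invalidations = self.sharers - {p}
        self.state = "modified"
        self.sharers = {p}
        return invalidations          # caches that must be invalidated
```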
Notes on Directory Operation
 On a write with multiple sharers it is necessary to collect and count all the invalidation acknowledgements (ACK) before actually writing
 On transactions that involve more complex state changes the directory must also receive an acknowledgement:
– In case something goes wrong
– To establish the completion of the load or store (Lecture 11)
 As with snooping on buses, "the devil is in the details": we actually need transient states, must deal with conflicting requests, and must handle multi-level caches
 As with buses, when buffers overflow we need to introduce NACKs
 Directories should work well if only a small number of processors share common data at any given time (otherwise broadcasts are better)
7
Quantitative Motivation for Directories
Number of invalidations per store miss on MSI with infinite caches
Bottom-line: number of sharers for read-write data is small
8
[Bar chart (Culler and Singh, Fig. 8.9): distribution of the number of invalidations per store miss (0, 1, 2, 3, …, up to the bins 12 to 15, 24 to 27, 36 to 39, 48 to 51, and 60 to 63) for LU, Radix, Ocean, Raytrace, Barnes-Hut, and Radiosity; the y-axis is the percentage of store misses (0-100), and in every application the overwhelming majority of store misses cause very few invalidations]
Example Implementation Difficulties
Operations have to be serialized locally
Operations have to be serialized at directory
9
Serializing locally at the processor (P0, P1, and the directory):
1. P0 sends a read request for line A.
2. P1 sends a read-exclusive request for line A (waits at dir.).
3. Dir. responds to (1) and sets the sharing vector (the message gets delayed).
4a/b. Dir. responds to (2), to both P0 (sharer, to invalidate) and P1 (new owner).
5. P0 invalidates line A and sends an acknowledgement.
Problem: when (3) finally arrives at P0, the stale value of line A is placed in the cache.
Solution: P0 must serialize transactions locally so that it won't react to (4b) while it has a read pending.

Serializing at the directory (P0, P1, and the directory):
1. P1 sends a read-exclusive request for line A.
2. Dir. forwards the request to P0 (owner).
3a/b. P0 sends the data to P1 and an ack. to the dir. (the ack gets delayed).
4. P1 receives (3a) and considers the read-exclusive complete. A replacement miss sends the updated value back to memory.
Problem: when (4) arrives, the dir. accepts it and overwrites memory; when (3b) finally arrives, the dir. completes the ownership transfer and thinks that P1 is the owner.
Solution: the dir. must serialize transactions so that it won't react to (4) while the ownership transfer is pending.
Directory Overhead
 Problem: consider a system with 128 processors, 256GB of memory, 1MB of L2 cache per processor, and 64byte cache lines
– 128 bits for the sharing vector plus 3 bits for state → ~16bytes
– Per line: 16/64 = 0.25 → 25% memory overhead
– Total: 0.25*256G = 64GB of memory overhead!
 Solution: Cached Directories
– At any given point in time there can be only 128M/64 = 2M lines actually cached in the whole system
– Lines not cached anywhere are implicitly in state "not cached" with a null sharing vector
– To maintain only the entries for the actively cached lines we need to keep the tags → 64bits = 8bytes
– Overhead per cached line: (8+16)/64 = 0.375 → 37.5% overhead
– Total overhead: 24bytes*2M = 48MB of memory overhead
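Redoing the overhead arithmetic as a quick sketch (my own; entry sizes are rounded up to whole bytes, as on the slide — note the cached-directory total comes out to 2M × 24 bytes = 48 MB):

```python
# Directory overhead for: 128 processors, 256 GB of memory,
# 1 MB of L2 per processor, 64-byte cache lines.

GB, MB = 2**30, 2**20
procs, mem, l2, line = 128, 256 * GB, 1 * MB, 64

entry_bytes = 16                             # 128-bit vector + 3 state bits, rounded up
full_overhead = (mem // line) * entry_bytes  # one entry per memory line: 64 GB

cached_lines = procs * l2 // line            # at most 2M lines cached at any time
tag_bytes = 8                                # 64-bit tag per cached entry
cached_overhead = cached_lines * (tag_bytes + entry_bytes)  # 2M * 24 B = 48 MB
```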
10
Scalability of Directory Information
Problem: the number of bits in the sharing vector limits the maximum number of processors in the system
– Larger machines are not possible once we decide on the size of the vector
– Smaller machines waste memory
 Solution: Limited Pointer Directories
– In practice only a small number of processors share each line at any time
– To keep the ID of one of n processors we need log2n bits, so to remember m sharers we need m IDs → m*log2n bits
– For n=128 and m=4 → 4*log2(128) = 28bits = 3.5bytes
– Total overhead: (3.5/64)*256G = 14GB of memory overhead
– Idea:
 Start with the pointer scheme
 If more than m processors attempt to share the same line, then trap to the OS and let the OS manage longer lists of sharers
 Maintain one extra bit per directory entry to identify the current representation
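The limited-pointer sizing can be sketched in the same way (my own function, not course code):

```python
# Sizing a limited-pointer directory entry: m pointers of log2(n) bits
# each, instead of an n-bit full sharing vector.

import math

def limited_pointer_bits(n_procs, m_pointers):
    """Bits needed to track m_pointers sharers out of n_procs processors."""
    return m_pointers * int(math.log2(n_procs))

bits = limited_pointer_bits(128, 4)        # 4 * 7 = 28 bits = 3.5 bytes
overhead = (bits / 8 / 64) * 256 * 2**30   # per 64-byte line, 256 GB of memory
```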
11
Distributed Directories Directories can be used with UMA systems, but are more
commonly used with NUMA systems
In this case the directory is actually distributed across the system
These machines are then called cc-NUMA, for cache-coherent-NUMA, and DSM, for distributed shared memory
12
[Diagram: four nodes, each with a CPU, cache, local memory, and a slice of the directory, connected by a scalable interconnection network]
Distributed Directories
 Now each part of the directory is only responsible for the memory lines of its node
 How are memory lines distributed across the nodes?
– Lines are mapped per OS page to nodes
– Pages are mapped to nodes following their physical address
– Mapping of physical pages to nodes is done statically in chunks
– E.g., 4 processors with 1GB of memory each and 4KB pages (thus, 1GB/4KB = 256K pages per node)
 Node 0 is responsible (home) for pages [0, 262143]
 Node 1 is responsible for pages [262144, 524287]
 Node 2 is responsible for pages [524288, 786431]
 Node 3 is responsible for pages [786432, 1048575]
 A load to address 1478656 goes to page 1478656/4096 = 361, which is homed at node 361/262144 = 0
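The static mapping can be written as one function (a sketch with my own names; note that with 1GB per node and 4KB pages, each node is home to 2^18 = 262144 pages):

```python
# Static page-to-node mapping for a cc-NUMA machine with 4KB pages
# and 1GB of memory per node, as in the example above.

PAGE_SIZE = 4096
PAGES_PER_NODE = (1 * 2**30) // PAGE_SIZE    # 262144 pages per node

def home_node(addr):
    """Node whose directory is responsible for physical address addr."""
    return (addr // PAGE_SIZE) // PAGES_PER_NODE
```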
13
Distributed Directories
 How is data mapped to nodes?
– With a single user, the OS can map a virtual page to any physical page → the OS can place data almost anywhere, albeit at the granularity of pages
– Common mapping policies:
 First-touch: the first processor to request a particular piece of data has the data's page mapped to its range of physical pages
– Good when each processor is the first to touch the data it needs, and other nodes do not access this page often
 Round-robin: as data is requested, virtual pages are mapped to physical pages in circular order (i.e., node 0, node 1, node 2, … node N, node 0, …)
– Good when one processor manipulates most of the data at the beginning of a phase (e.g., initialization of data)
– Good when some pages are heavily shared (hot pages)
 Note: data that is private is always mapped locally
– Advanced cc-NUMA OS functionality:
 The mapping of virtual pages to nodes can be changed on-the-fly (page migration)
 A virtual page with read-only data can be mapped to physical pages in multiple nodes (page replication)
14
Combined Coherence Schemes
Use bus-based snooping within nodes and a directory (or bus snooping) across nodes
– Bus-based snooping coherence for a small number of processors is relatively straightforward
– Hopefully communication across processors within a node will not have to go beyond this domain
– Easier to scale the machine size up and down
– Two levels of state:
 Per-node at the higher level (e.g., a whole node owns modified data, but the Dir. does not know which processor in the node actually has it)
 Per-processor at the lower level (e.g., by snooping inside the node we can find the exact owner and the exact up-to-date value)
15
[Diagram: two nodes, each with four CPUs and caches on a local bus with main memory and a directory; the nodes' directories are connected by a bus or scalable interconnect]
References and Further Reading
16
Original directory coherence idea:
– "A New Solution to Coherence Problems in Multicache Systems", L. Censier and P. Feautrier, IEEE Trans. on Computers, December 1978.
 Seminal work on distributed directories:
– "The DASH Prototype: Implementation and Performance", D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy, Intl. Symp. on Computer Architecture, June 1992.
 A commercial machine with distributed directories:
– "The SGI Origin: a ccNUMA Highly Scalable Server", J. Laudon and D. Lenoski, Intl. Symp. on Computer Architecture, June 1997.
 A commercial machine with SCI:
– "STiNG: a CC-NUMA Computer System for the Commercial Marketplace", T. Lovett and R. Clapp, Intl. Symp. on Computer Architecture, June 1996.
 Adaptive full/limited pointer distributed directory protocols:
– "An Evaluation of Directory Schemes for Cache Coherence", A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, Intl. Symp. on Computer Architecture, June 1988.
Probing Further
17
Page migration and replication for ccNUMA:
– "Operating System Support for Improving Data Locality on ccNUMA Compute Servers", B. Verghese, S. Devine, A. Gupta, and M. Rosenblum, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1996.
 Cache Only Memory Architectures:
– "Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures", P. Stenstrom, T. Joe, and A. Gupta, Intl. Symp. on Computer Architecture, June 1992.
 Recent alternative protocols (token, ring):
– "Token Coherence: Decoupling Performance and Correctness", M. Martin, M. Hill, and D. Wood, Intl. Symp. on Computer Architecture, June 2003.
– "Coherence Ordering for Ring-Based Chip Multiprocessors", M. Marty and M. Hill, Intl. Symp. on Microarchitecture, December 2006.
Synchronization is necessary to ensure that operations in a parallel program happen in the correct order
Different primitives are used at different levels of abstraction:
– High-level (e.g., critical sections, monitors, parallel sections and loops, atomic): supported in languages themselves or in language extensions (e.g., Java threads, OpenMP)
– Middle-level (e.g., semaphores, condition variables, locks, barriers): supported in libraries (e.g., POSIX threads)
– Low-level (e.g., compare&swap, test&set, load-link & store-conditional): supported in hardware
Higher level primitives can be constructed from lower level ones
Things to consider: deadlock, livelock, starvation
1
Lect. 10: Shared Mem. Multiprocessors IV/V
Example: Sync. in Java Threads
2
Synchronized Methods
– Concurrent calls to the method on the same object have to be serialized
– All data modified during one call to the method becomes atomically visible to all calls to other methods of the object
– E.g.:

public class SynchronizedCounter {
    private int c = 0;
    public synchronized void increment() {
        c++;
    }
}

SynchronizedCounter myCounter;

– Can be implemented with locks
Example: Sync. in OpenMP
3
Doall loops
– Iterations of the loop can be executed concurrently
– After the loop, all processors have to wait and a single one continues with the following code
– All data modified during the loop is visible after the loop
– E.g.:

#pragma omp parallel for \
    private(i,s) shared(A,B) \
    schedule(static)
for (i=0; i<N; i++) {
    s = …
    A[i] = B[i] + s;
}

– Can be implemented with a barrier
Example: Sync. in POSIX Threads
4
Locks
– Only one thread can own the lock at any given time
– Unlocking makes all the modified data visible to all threads, and locking forces the thread to obtain fresh copies of all data
– E.g.:

pthread_mutex_t mylock;

pthread_mutex_init(&mylock, NULL);
pthread_mutex_lock(&mylock);
Count++;
pthread_mutex_unlock(&mylock);

– Can be implemented with test&set
Example: Building CS from Locks
5
Relatively straightforward
 The actual implementation is encapsulated in a library function
 In practice, the library may implement different policies on how to wait for a lock and how to avoid starvation

Processor 0:
int A, B, C;            // initialization
lock_t mylock;
…
lock(&mylock);          // parallel phase
A = …;
B = …;
unlock(&mylock);

Processor 1:
…
lock(&mylock);
… = A + …;
… = C + …;
unlock(&mylock);
Example: Building CS from Ld/St?
6
E.g., Peterson's algorithm
 No! This is not a safe way to implement CS in a modern multiprocessor (Lecture 11)

Processor 0:
int A, B, C;                     // initialization
int mylock[2], turn;
mylock[0]=0; mylock[1]=0;
turn = 0;
…
mylock[0] = 1; turn = 1;         // parallel phase
while (mylock[1] && turn==1);
A = …;
B = …;
mylock[0] = 0;

Processor 1:
…
mylock[1] = 1; turn = 0;
while (mylock[0] && turn==0);
… = A + …;
… = C + …;
mylock[1] = 0;
Hardware Primitives
7
The hardware's job is to provide atomic memory operations, which involves both the processors and the memory subsystem
 Implemented in the ISA, but usually encapsulated in library function calls by manufacturers
 At a minimum, hardware must provide an atomic swap
 Examples:
– Compare&Swap (e.g., Sun Sparc) and Test&Set: if the value in memory is equal to the value in register Ra, then swap the memory value with the value in Rb and return memory's original value in Ra
 Can implement more complex conditions for synchronization
 Requires a comparison operation in memory, or must block the memory location until the processor is done with the comparison

CAS (R1),R2,R3   ;if MEM[R1]==R2 then MEM[R1]=R3; old MEM[R1] returned
Hardware Primitives
8
Examples:
– Fetch&Increment (e.g., Intel x86) (in general Fetch&Op): increment the value in memory and return the old value in a register
 Less flexible than Compare&Swap
 Requires an arithmetic operation in memory, or must block the memory location (or bus) until the processor is done with the operation (e.g., x86)
– Swap: swap the values in memory and in a register
 Least flexible of all
 Does not require a comparison or arithmetic operation in memory

lock; ADD (R1),R2   ;MEM[R1]=MEM[R1]+R2
LODSW               ;accumulator=MEM[DS:SI]
Building Locks with Hdw. Primitives
9
Example: Test&Set

int lock(int *mylock) {
    int value;
    value = test&set(mylock, 1);
    if (value) return FALSE;
    else return TRUE;
}

void unlock(int *mylock) {
    *mylock = 0;
    return;
}
What If the Lock is Taken?
10
Spin-wait lock:

while (!lock(&mylock));
…
unlock(&mylock);

– Each call to lock invokes the hardware primitive, which involves an expensive memory operation and takes up network bandwidth

Spin-wait on cache: Test-and-Test&Set
– Spin on the cached value using a normal load and rely on the coherence protocol:

while (TRUE) {
    if (lock(&mylock)) break;  /* acquired */
    while (mylock);            /* spin locally on the cached value */
}
…
unlock(&mylock);

– Still, all processors race to memory, and clash, once the lock is released
What If the Lock is Taken?
11
Software solution: Blocking locks and Backoff
– The wait can be implemented in the application itself (backoff) or by calling the OS to be put to sleep (blocking)
– The waiting time is usually increased exponentially with the number of retries
– Similar to the backoff mechanism adopted in the Ethernet protocol

while (TRUE) {
    if (lock(&mylock)) break;  /* acquired */
    wait(time);                /* back off before retrying */
}
…
unlock(&mylock);
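A sketch of randomized exponential backoff (my own; `try_lock`, `base`, and `cap` are names I introduce, not library calls):

```python
# Acquire a lock by retrying with randomized, exponentially growing
# waits between attempts, as in the Ethernet backoff mechanism.

import random
import time

def lock_with_backoff(try_lock, base=1e-6, cap=1e-3):
    """Retry try_lock(), sleeping exponentially longer after each failure.
    Returns the number of failed attempts before acquiring the lock."""
    wait = base
    attempts = 0
    while not try_lock():
        time.sleep(wait * random.random())   # randomized wait
        wait = min(wait * 2, cap)            # exponential increase, capped
        attempts += 1
    return attempts
```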
A Better Hardware Primitive
12
Load-link and Store-conditional
– Implement the atomic memory operation as two operations
– Load-link (LL):
 Registers the intention to acquire the lock
 Returns the present value of the lock
– Store-conditional (SC):
 Only stores the new value if no other processor attempted a store between our previous LL and now
 Returns 1 if it succeeds and 0 if it fails
– Relies on the coherence mechanism to detect conflicting SC's
– All operation is done locally at the cache controllers or directory; there is no need for a complex blocking operation in memory
– A new register is added to L1 to remember the pending LL from the local processor (e.g., the PowerPC RESERVE register)
– Also benefits from blocking and backoff
– Introduced in the MIPS processor, now also used in PowerPC and ARM
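The LL/SC semantics can be modeled in a few lines (my own sketch; a per-processor reservation plays the role of the RESERVE register, and any successful store to the address clears other reservations):

```python
# A model of LL/SC: each processor's reservation records the address of
# its pending LL; a successful SC to that address by any processor
# clears the other reservations, so their SCs fail and must retry.

class LLSCMemory:
    def __init__(self):
        self.mem = {}
        self.reserve = {}                 # processor id -> reserved address

    def ll(self, proc, addr):
        """Load-link: register the reservation, return the current value."""
        self.reserve[proc] = addr
        return self.mem.get(addr, 0)

    def sc(self, proc, addr, value):
        """Store-conditional: returns 1 on success, 0 on failure."""
        if self.reserve.get(proc) != addr:
            return 0                      # reservation lost: SC fails
        self.mem[addr] = value
        for p, a in list(self.reserve.items()):
            if a == addr:                 # clear all reservations on addr
                del self.reserve[p]
        return 1
```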
A Better Hardware Primitive
13
Load-link and Store-conditional operation

[Diagram: two scenarios over a coherence substrate connecting P0 and P1, each with an L1 and a RESERVE register. Left: P0 issues LL 0xA, the LL completes returning the value 1, and P0's SC 0xA succeeds. Right: both P0 and P1 issue LL 0xA; P0's SC 0xA succeeds first, clearing P1's reservation, so P1's SC fails and returns 0; P1 then re-issues LL 0xA, the LL completes, and this time its SC succeeds]
E.g., spin-wait with attempted swap
– At the end, if the SC succeeds, the value of the lock variable will be in R4
– If the lock is taken, then start over again

Building Locks with LL/SC

14

try:   OR   R3,R4,R0   ;move value to be exchanged
       LL   R2,0(R1)   ;value of lock loaded
       SC   R3,0(R1)   ;try to store value
       BEQZ R3,try     ;branch if SC failed
       MOV  R4,R2      ;move lock value into R4
check: BNEZ R4,try     ;try again if lock was taken
An Alternative Hdw. Approach
15
Locks have a relatively large overhead and, thus, are suitable for guarding relatively large amounts of data
 Some algorithms need to exchange only a small number of words each time
 Also, the consumer thread must wait for all data guarded by a lock to be ready before it can begin work
 A better approach for fine-grain synchronization: Full/Empty Bits
– Associate one bit with every memory word (1.5% overhead for 64bit words)
– Augment the behavior of load/store:
 Load: if the word is empty then trap to the OS (to wait); otherwise, return the value and set the bit to empty
 Store: if the word is full then trap to the OS (to deal with the error); otherwise, store the new value, set the bit to full, and release any threads pending on the word (with OS help)
 Reset: set the bit to empty
– Good for producer-consumer type of communication/synchronization
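A sketch of full/empty semantics (my own; real hardware traps to the OS to block the thread, so `load`/`store` below merely report whether they would block):

```python
# A single memory word with a full/empty bit. A store fills an empty
# word; a load consumes a full word and empties it again.

class FEWord:
    def __init__(self):
        self.full = False
        self.value = None

    def store(self, v):
        if self.full:
            return False          # would trap: word not yet consumed
        self.value, self.full = v, True
        return True

    def load(self):
        if not self.full:
            return None           # would trap and wait for the producer
        self.full = False         # consuming empties the word
        return self.value
```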
Example: Using Full/Empty Bits
16
Compare against the example in Slide 5

Processor 0:
int A, B, C;
…
A = …;        // blocks if not yet used
B = …;        // no impact on P1

Processor 1:
…
… = A + …;    // waits if not ready
… = C + …;    // does not have to wait
References and Further Reading (Slide 17)

A commercial machine with Full/Empty bits:
"The Tera Computer System", R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, Intl. Conf. on Supercomputing, June 1990.

Performance evaluations of synchronization for shared memory:
"The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors", T. Anderson, IEEE Trans. on Parallel and Distributed Systems, January 1990.
"Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", J. Mellor-Crummey and M. Scott, ACM Trans. on Computer Systems, February 1991.
Lect. 11: Shared Mem. Multiprocessors V/V (Slide 1)

Consider the following code:

initialization: A=0, B=0, C=0;

parallel:
P1: C = 1; A = 1;
P2: while (A==0); B = 1;
P3: while (B==0); print A;

– What are the possible outcomes?
  A==1, C==1?
  A==0, C==1?
  A==0, C==0?
  A==1, C==0?
– Answers:
  Yes. This is what one would expect.
  Yes. If the st to B overtakes the st to A on the interconnect toward P3.
  Yes. If the st to C overtakes the st to A from the same processor.
Memory Consistency (Slide 2)

Cache coherence:
– Guarantees eventual write propagation (no guarantees on when writes propagate)
– Guarantees a single order of all writes to the same memory location (no guarantees on the order of writes to different locations)

Memory consistency:
– Specifies the ordering of loads and stores to different memory locations
– Defined in so-called Memory Consistency Models
– This is really a "contract" between the hardware, the compiler, and the programmer
  i.e., the hardware and compiler will not violate the ordering specified
  i.e., the programmer will not assume a stricter order than that of the model
– Hardware/compiler provide "safety net" mechanisms so the user can enforce a stricter order than that provided by the model
Sequential Consistency (SC) (Slide 3)

Key ideas:
– The behavior of a multiprocessor should be the same as that of a time-shared uniprocessor
– Thus, memory ordering has to follow the individual order in each thread, and there can be any interleaving of such sequential segments
– The memory abstraction is that of a random switch to memory:
  (Diagram: processors P0, P1, …, Pn all connected through a single switch to one Memory)
– Notice that in practice many orderings are still valid
Terminology (Slide 4)

Issue: a memory operation leaves the processor and becomes visible to the memory subsystem
Performed: a memory operation appears to have taken place
– Performed w.r.t. processor X: as far as processor X can tell
  E.g., a store S by processor Y to variable A is performed w.r.t. processor X if a subsequent load by X to A returns the value of S (or the value of a store later than S, but never a value older than that of S)
  E.g., a load L is performed w.r.t. processor X if subsequent stores by any processor cannot affect the value returned by L to X
– Globally performed or complete: performed w.r.t. all processors
  E.g., a store S by processor Y to variable A is globally performed if any subsequent load by any processor to A returns the value of S
X-consistent execution: any execution that matches one of the possible total orders (interleavings) as defined by model X
Example: Sequential Consistency (Slide 5)

initialization: A=0, B=0, C=0;

P1: C = 1; A = 1;
P2: while (A==0); B = 1;
P3: while (B==0); print A;

Some valid SC orderings:

Ordering 1:
  P1: st C  # C=1
  P1: st A  # A=1
  P2: ld A  # while
  P2: st B  # B=1
  P3: ld B  # while
  P3: ld A  # print

Ordering 2:
  P1: st C  # C=1
  P2: ld A  # while
  …
  P1: st A  # A=1
  P2: ld A  # while
  P2: st B  # B=1
  P3: ld B  # while
  P3: ld A  # print

Ordering 3:
  P1: st C  # C=1
  P2: ld A  # while
  …
  P1: st A  # A=1
  P2: ld A  # while
  P3: ld B  # while
  …
  P2: st B  # B=1
  P3: ld B  # while
  P3: ld A  # print
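One way to see why SC rules out the surprising outcomes is to enumerate every interleaving by brute force. The sketch below is my own illustrative model (not from the course): each spin loop is reduced to a read that must observe 1 before the thread advances, and the values of A and C are recorded at the point where P3 executes its print.

```python
def sc_outcomes():
    """Enumerate all SC interleavings of the 3-thread example and
    collect the (A, C) values visible when P3 executes 'print A'."""
    P1 = [("st", "C"), ("st", "A")]
    P2 = [("spin", "A"), ("st", "B")]
    P3 = [("spin", "B"), ("print", "A")]
    threads = (P1, P2, P3)
    results = set()

    def step(pcs, mem):
        # Try every thread that can make progress from this state
        for i, thread in enumerate(threads):
            if pcs[i] >= len(thread):
                continue                    # thread already finished
            op, var = thread[pcs[i]]
            if op == "spin" and mem[var] == 0:
                continue                    # loop keeps spinning; no progress
            new_mem = dict(mem)
            if op == "st":
                new_mem[var] = 1
            elif op == "print":
                results.add((new_mem["A"], new_mem["C"]))
            new_pcs = list(pcs)
            new_pcs[i] += 1
            step(new_pcs, new_mem)

    step([0, 0, 0], {"A": 0, "B": 0, "C": 0})
    return results

print(sc_outcomes())   # {(1, 1)}: under SC only A==1, C==1 is possible
```

Every interleaving in which P3 reaches the print must have observed B==1, which in turn required P2 to observe A==1, which happened after P1's store to C; so under SC the only outcome is A==1, C==1.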
Sequential Consistency (SC) (Slide 6)

Sufficient conditions:
1. Threads issue memory operations in program order
2. Before issuing the next memory operation, threads wait until the last issued write completes (i.e., performs w.r.t. all other processors)
3. Before issuing the next memory operation, threads wait until the last issued read completes and until the matching write (i.e., the one whose value is returned to the read) also completes

Notes:
– Condition 3 is actually quite demanding and is the one that guarantees write atomicity
– In practice, necessary conditions may be more relaxed
– These conditions are easily violated in real hardware and compilers (e.g., write buffers in hdw. and ld-st scheduling in the compiler)
– Program order is defined by the source code (the programmer's intention) and may differ from that of the assembly code due to compiler optimizations
Relaxed Memory Consistency Models (Slide 7)

At a high level they relax ordering constraints between pairs of reads, writes, and reads-writes (e.g., reads are allowed to bypass writes, writes are allowed to bypass each other)
In practice there are some implementation artifacts (e.g., no write atomicity in Pentium)
Some models make synchronization explicit and different from normal loads and stores
Many models have been proposed and implemented:
– Total Store Ordering (TSO) (e.g., Sparc)
– Partial Store Ordering (PSO) (e.g., Sparc)
– Relaxed Memory Ordering (RMO) (e.g., Sparc)
– Processor Consistency (PC) (e.g., Pentium)
– Weak Ordering (WO)
– Release Consistency (RC)
– PowerPC
Relaxed Memory Consistency Models (Slide 8)

Note that control flow and data flow dependences within a thread must still be honored regardless of the consistency model
– E.g., in the running example (A=0, B=0, C=0):
  P1: C = 1; A = 1;
  P2: while (A==0); B = 1;    → the st to B cannot overtake the ld to A
  P3: while (B==0); print A;  → the ld to A cannot overtake the ld to B
– E.g.:
  A = 1;
  …
  A = 2;   → the second st to A cannot overtake the earlier st to A
  …
  B = A;   → the ld to A cannot overtake the earlier st to A
Example: Total Store Ordering (TSO) (Slide 9)

Reads are allowed to bypass writes (can hide write latency)
Similar to PC
Still makes the prior example work as expected, but breaks some intuitive assumptions, including Peterson's algorithm (Lecture 10)

Prior example:
P1: C = 1; A = 1;
P2: while (A==0); B = 1;

New example:
P1: A = 1; Print B;
P2: B = 1; Print A;

SC guarantees that A==0 and B==0 will never be printed
TSO allows it if the ld of B (P1) overtakes the st to A (P1) and the ld of A (P2) overtakes the st to B (P2)
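A store buffer is the classic hardware feature behind this TSO behavior. The sketch below is a toy model of my own (not any real machine's design): stores wait in a per-processor buffer, and loads read their own buffer first and memory otherwise, so both processors can still read the old values.

```python
def tso_example():
    """Store-buffer model of the TSO example: both loads can execute
    before either store drains to memory, printing A==0 and B==0."""
    mem = {"A": 0, "B": 0}
    buf = {1: [], 2: []}           # per-processor store buffers

    def store(p, var, val):
        buf[p].append((var, val))  # the store sits in the local buffer

    def load(p, var):
        for v, val in reversed(buf[p]):
            if v == var:
                return val         # forward from the own store buffer
        return mem[var]            # otherwise read memory

    store(1, "A", 1)               # P1: A = 1 (still buffered)
    store(2, "B", 1)               # P2: B = 1 (still buffered)
    b_at_p1 = load(1, "B")         # P1: Print B -> reads memory: 0
    a_at_p2 = load(2, "A")         # P2: Print A -> reads memory: 0
    return b_at_p1, a_at_p2

print(tso_example())   # (0, 0): the outcome SC forbids
```

Each load effectively bypassed the other processor's buffered store, which is exactly the read-bypasses-write relaxation the slide describes.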
Example: Release Consistency (RC) (Slide 10)

Reads and writes are allowed to bypass both reads and writes (i.e., any order that satisfies control flow and data flow is allowed)
Assumes explicit synchronization operations: acquire and release (Lecture 10). So, for correct operation, our example must become:

P1: C = 1; Release(A);
P2: while (!Lock(A)); B = 1;

Constraints:
– All previous writes must complete before a release can complete
– No subsequent reads can complete before a previous acquire completes
– All synchronization operations must be sequentially consistent (i.e., follow the rules of Slide 6, where an acquire is equivalent to a read and a release is equivalent to a write)
Example: Release Consistency (RC) (Slide 11)

Example: original program order:
  Block 1: read/write … read/write
  Acquire
  Block 2: read/write … read/write
  Release
  Block 3: read/write … read/write

Allowable overlaps:
– Reads and writes from block 1 can appear after the acquire (thus, initialization also requires an acquire-release pair)
– Reads and writes from block 3 can appear even before the release
– Between the acquire and release any order is valid in block 2 (and also 1 and 3)

Note that despite the many reorderings, this still matches our intuition of critical sections
Races and Proper Synchronization (Slide 12)

Races: unsynchronized loads and stores "race each other" through the memory hierarchy (e.g., the loads and stores to A, B, and C in the prior example)

Delay-set Analysis:
– A technique that allows one to identify races that require synchronization
– Mark all memory references in both threads and create arcs between them:
  Directed arcs that follow program order (shown in blue on the slide)
  Undirected arcs that follow cross-thread data dependences (shown in green on the slide; recall that the print implicitly contains a read)
– Cycles following the arcs indicate the problematic memory references

Example:
P1: A = 1; Print B;
P2: B = 1; Print A;
Memory Barriers/Fences (Slide 13)

How can I enforce some order of memory accesses?
– Ideally use the synchronization primitives, but these can be very costly

Memory Barriers/Fences:
– New instructions in the ISA, supported in the processor and memory system
– Specify that previously issued memory operations must complete before the processor is allowed to proceed past the barrier:
  Write-to-read barrier: all previous writes must complete before the next read can be issued
  Write-to-write barrier: all previous writes must complete before the next write can be issued
  Full barrier: all previous loads and stores must complete before the next memory operation can be issued

Note: not to be confused with synchronization barriers (Lecture 10)
Note: stricter models can be emulated with such barriers on systems that only support less strict models
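Continuing the store-buffer picture from the TSO example, a write-to-read barrier can be modeled as draining the buffer before the next load issues. The sketch below is illustrative only (a toy model, not a real ISA): with a fence between each store and the following load, this serialization makes both loads observe the new values.

```python
def tso_with_fences():
    """Store-buffer model with a write-to-read barrier: fence() makes
    all previously issued writes complete before the next read."""
    mem = {"A": 0, "B": 0}
    buf = {1: [], 2: []}           # per-processor store buffers

    def store(p, var, val):
        buf[p].append((var, val))

    def fence(p):
        for var, val in buf[p]:    # all previous writes complete
            mem[var] = val
        buf[p].clear()

    def load(p, var):
        for v, val in reversed(buf[p]):
            if v == var:
                return val
        return mem[var]

    store(1, "A", 1)
    fence(1)                       # write-to-read barrier on P1
    store(2, "B", 1)
    fence(2)                       # write-to-read barrier on P2
    return load(1, "B"), load(2, "A")

print(tso_with_fences())   # (1, 1) in this serialization
```

With the barriers in place, a load can no longer bypass the preceding store of its own processor, so the A==0 and B==0 outcome forbidden by SC is no longer reachable.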
Final Notes (Slide 14)

Many processors/systems support more than one consistency model, usually set at boot time
It is possible to decouple the consistency model presented to the programmer from that of the hardware/compiler
– E.g., the hardware may implement a relaxed model while the compiler guarantees SC via memory barriers
It is possible to allow a great degree of reordering with SC through speculative execution in hardware (with rollback when the stricter model is violated) (e.g., MIPS R10000/SGI Origin)
References and Further Reading (Slide 15)

Original definition of sequential consistency:
"How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", L. Lamport, IEEE Trans. on Computers, September 1979.

Original work on relaxed consistency models:
"Correct Memory Operation of Cache-Based Multiprocessors", C. Scheurich and M. Dubois, Intl. Symp. on Computer Architecture, June 1987.
"Weak Ordering: A New Definition", S. Adve and M. Hill, Intl. Symp. on Computer Architecture, June 1990.
"Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors", K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, Intl. Symp. on Computer Architecture, June 1990.

A very good tutorial on memory consistency models:
"Shared Memory Consistency Models: A Tutorial", S. Adve and K. Gharachorloo, IEEE Computer, December 1996.
References and Further Reading (Slide 16)

Problems with OO memory consistency models (e.g., Java):
"Fixing the Java Memory Model", W. Pugh, Conf. on Java Grande, June 1999.

Delay set analysis:
"Efficient and Correct Execution of Parallel Programs that Share Memory", D. Shasha and M. Snir, ACM Trans. on Programming Languages and Systems, February 1988.

Compiler support for SC on non-SC hardware:
"Analyses and Optimizations for Shared Address Space Programs", A. Krishnamurthy and K. Yelick, Journal of Parallel and Distributed Computing, February 1996.
"Hiding Relaxed Memory Consistency with Compilers", J. Lee and D. Padua, Intl. Conf. on Parallel Architectures and Compilation Techniques, October 2000.
Probing Further (Slide 17)

Transactional Memory:
"Transactional Memory: Architectural Support for Lock-Free Data Structures", M. Herlihy and J. Moss, Intl. Symp. on Computer Architecture, June 1993.
"Transactional Memory Coherence and Consistency", L. Hammond, V. Wong, M. Chen, B. Carlstrom, J. Davis, B. Hertzberg, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, Intl. Symp. on Computer Architecture, June 2004.
"Transactional Execution: Toward Reliable, High-Performance Multithreading", R. Rajwar and J. Goodman, IEEE Micro, November 2003.
Lect. 12: Multithreading (Slide 1)

Memory latencies, and even latencies to the lower-level caches, are becoming longer w.r.t. processor cycle times
There are basically 3 ways to hide/tolerate such latencies by overlapping computation with the memory access:
– Dynamic out-of-order scheduling
– Prefetching
– Multithreading
OOO execution and prefetching allow overlap of computation and memory access within the same thread (these were covered in CS3 Computer Architecture)
Multithreading allows overlap of the memory access of one thread/process with computation by another thread/process
Blocked Multithreading (Slide 2)

Basic idea:
– Recall multi-tasking: on I/O a process is context-switched out of the processor by the OS
  (Timeline: process 1 running → system call for I/O → OS interrupt handler running → process 2 running → I/O completion → OS interrupt handler running → process 1 running)
– With multithreading a thread/process is context-switched out of the pipeline by the hardware on longer-latency operations
  (Timeline: process 1 running → long-latency operation → hardware context switch → process 2 running → long-latency operation → hardware context switch → …)
Blocked Multithreading (Slide 3)

Basic idea:
– Unlike in multi-tasking, the context is still kept in the processor and the OS is not aware of any changes
– Context switch overhead is minimal (usually only a few cycles)
– Unlike in multi-tasking, the completion of the long-latency operation does not trigger a context switch (the blocked thread is simply marked as ready)
– Usually the long-latency operation is an L1 cache miss, but it can also be others, such as a fp or integer division (which takes 20 to 30 cycles and is unpipelined)

Context of a thread in the processor:
– Registers
– Program counter
– Stack pointer
– Other processor status words

Note: the term is commonly (mis)used to mean simply the fact that the system supports multiple threads
Blocked Multithreading (Slide 4)

Latency hiding example:
(Figure: threads A, B, C, and D execute in turn; the pipeline and memory latencies of one thread are overlapped with computation from the others, with context switch overheads and idle (stall) cycles marked. From Culler and Singh, Fig. 11.27)
Blocked Multithreading (Slide 5)

Hardware mechanisms:
– Keeping multiple contexts and supporting fast switch:
  One register file per context
  One set of special registers (including PC) per context
– Flushing instructions from the previous context from the pipeline after a context switch:
  Note that such squashed instructions add to the context switch overhead
  Note that keeping instructions from two different threads in the pipeline increases the complexity of the interlocking mechanism and requires that instructions be tagged with a context ID throughout the pipeline
– Possibly replicating other microarchitectural structures (e.g., branch prediction tables, load-store queues, non-blocking cache queues)

Employed in the Sun T1 and T2 systems (a.k.a. Niagara)
Blocked Multithreading (Slide 6)

Simple analytical performance model:
– Parameters:
  Number of threads (N): the number of threads supported in the hardware
  Busy time (R): time the processor spends computing between context switch points
  Switching time (C): time the processor spends on each context switch
  Latency (L): time required by the operation that triggers the switch
– To completely hide all of L we need enough N such that ~N*(R+C) equals L (strictly speaking, (N-1)*R + N*C = L)
  Fewer threads mean we can't hide all of L
  More threads are unnecessary
– Note: these are only average numbers and ideally N should be bigger to accommodate variation
(Timeline: a blocked thread's latency L is overlapped by the R+C segments of the other threads)
Blocked Multithreading (Slide 7)

Simple analytical performance model:
– The minimum value of N is referred to as the saturation point (Nsat):

  Nsat = (R + L) / (R + C)

– Thus, there are two regions of operation:
  Before saturation, adding more threads increases processor utilization linearly
  After saturation, processor utilization does not improve with more threads, but is limited by the switching overhead:

  Usat = R / (R + C)

– E.g.: for R=40, L=200, and C=10: Nsat = 240/50 = 4.8 and Usat = 40/50 = 0.8
(Figure: processor utilization vs. number of threads, rising linearly up to Nsat and flat at Usat beyond it. From Culler and Singh, Fig. 11.25)
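The two formulas can be checked numerically against the slide's example. The helper below is just the model transcribed into code (the function name is invented here).

```python
def blocked_mt_model(R, L, C):
    """Saturation point and saturated utilization for blocked
    multithreading: Nsat = (R + L) / (R + C), Usat = R / (R + C).

    R: busy time between switch points, L: latency of the triggering
    operation, C: context switch time (all in cycles).
    """
    n_sat = (R + L) / (R + C)
    u_sat = R / (R + C)
    return n_sat, u_sat

# The slide's example: R=40, L=200, C=10
n_sat, u_sat = blocked_mt_model(40, 200, 10)
print(n_sat, u_sat)   # 4.8 0.8: ~5 threads saturate at 80% utilization
```

Note how C caps the achievable utilization: even with unlimited threads, 10 cycles of switching per 40 busy cycles leaves 20% of the processor idle.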
Fine-grain or Interleaved Multithreading (Slide 8)

Basic idea:
– Instead of waiting for a long-latency operation, context switch on every cycle
– Threads waiting for a long-latency operation are marked not ready and are not considered for execution
– With enough threads, no two instructions from the same thread are in the pipeline at the same time → no need for pipeline interlock at all

Advantages and disadvantages over blocked multithreading:
+ No context switch overhead (no pipeline flush)
+ Better at handling short pipeline latencies/bubbles
– Possibly poor single-thread performance (each thread only gets the processor once every N cycles)
– Requires more threads to completely hide long latencies
– Slightly more complex hardware than blocked multithreading

Some machines have taken this idea to the extreme and eliminated caches altogether (e.g., Cray MTA-2, with 128 threads per processor)
Fine-grain or Interleaved Multithreading (Slide 9)

Latency hiding example:
(Figure: threads A through F issue in round-robin fashion, one cycle each; threads still blocked on memory latencies, e.g., A or E, are skipped; idle (stall) cycles are marked. From Culler and Singh, Fig. 11.28)
Fine-grain or Interleaved Multithreading (Slide 10)

Simple analytical performance model (see Slide 6):
– Parameters:
  Number of threads (N) and Latency (L)
  Busy time (R) is now 1 and switching time (C) is now 0
– To completely hide all of L we need enough N such that N-1 = L
– Again, these are only average numbers and ideally N should be bigger to accommodate variation
– The minimum value of N (i.e., N = L+1) is the saturation point (Nsat)
– Again, there are two regions of operation:
  Before saturation, adding more threads increases processor utilization linearly
  After saturation, processor utilization does not improve with more threads, but is 100% (i.e., Usat = 1)
Simultaneous Multithreading (SMT) (Slide 11)

Basic idea:
– Don't actually context switch, but on a superscalar processor fetch and issue instructions from different threads/processes simultaneously
– E.g., on a 4-issue processor:
  (Figure: issue slots over cycles with no multithreading, where a cache miss leaves whole cycles empty, vs. blocked, interleaved, and SMT, which fills both empty cycles and unused slots within a cycle)

Advantages:
+ Can handle not only long latencies and pipeline bubbles but also unused issue slots
+ Full performance in single-thread mode
– Most complex hardware of all multithreading schemes
Simultaneous Multithreading (SMT) (Slide 12)

Fetch policies:
– Non-multithreaded fetch: only fetch instructions from one thread in each cycle, in a round-robin alternation
– Partitioned fetch: divide the total fetch bandwidth equally between some of the available threads (requires a more complex fetch unit to fetch from multiple I-cache lines; see Lecture 3)
– Priority fetch: fetch more instructions for specific threads (e.g., those not in control speculation, those with the least number of instructions in the issue queue)

Issue policies:
– Round-robin: select one ready instruction from each ready thread in turn until all issue slots are full or there are no more ready instructions (note: should remember which thread was the last to have an instruction selected and start from there in the next cycle)
– Priority issue:
  E.g., threads with older instructions in the issue queue are tried first
  E.g., threads in control-speculative mode are tried last
  E.g., issue all pending branches first
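The round-robin issue policy, including the detail of remembering the last serviced thread, can be sketched as follows. This is an illustrative model only (the function name and data layout are invented here, not taken from any real SMT core).

```python
def round_robin_issue(queues, width, last):
    """Select up to `width` instructions, one per ready thread in turn,
    starting with the thread after the last one serviced.

    `queues` is a list of per-thread ready-instruction lists; returns
    the issued instructions and the new `last` thread index.
    """
    n = len(queues)
    queues = [list(q) for q in queues]   # work on copies
    issued = []
    t = (last + 1) % n       # start after the last serviced thread
    empty_scanned = 0        # consecutive threads with nothing ready
    while len(issued) < width and empty_scanned < n:
        if queues[t]:
            issued.append(queues[t].pop(0))
            last = t
            empty_scanned = 0
        else:
            empty_scanned += 1
        t = (t + 1) % n
    return issued, last

# 4-wide issue, thread 2 was serviced last, thread 1 has nothing ready
print(round_robin_issue([["a1", "a2"], [], ["c1"]], 4, 2))
# (['a1', 'c1', 'a2'], 0)
```

Returning the updated `last` index is what keeps the policy fair across cycles: the thread that got the final slot this cycle is the last to be considered in the next one.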
References and Further Reading (Slide 13)

Original work on multithreading:
"The Tera Computer System", R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, Intl. Conf. on Supercomputing, June 1990.
"Performance Tradeoffs in Multithreaded Processors", A. Agarwal, IEEE Trans. on Parallel and Distributed Systems, September 1992.
"Simultaneous Multithreading: Maximizing On-Chip Parallelism", D. Tullsen, S. Eggers, and H. Levy, Intl. Symp. on Computer Architecture, June 1995.
"Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, Intl. Symp. on Computer Architecture, June 1996.

Intel's hyper-threading mechanism:
"Hyper-Threading Technology Architecture and Microarchitecture", D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton, Intel Technology Journal, Q1 2002.
Lect. 13: Chip-Multiprocessors (CMP) (Slide 1)

Main driving forces:
– Complexity of design and verification of wider-issue superscalar processors would be unmanageable
– Performance gains from either wider issue width or deeper pipelines would be only marginal:
  Limited ILP in applications
  Wire delays and longer access times of larger structures
– Power consumption of the large centralized structures necessary in wider-issue superscalar processors would be unmanageable
– Increased relative importance of throughput-oriented computing as compared to latency-oriented computing
– Continuation of Moore's law, so that more transistors fit on a chip
Early (ca. 2006) CMP's (Slide 2)

Example: Intel Core Duo
– 2 cores:
  3-issue superscalar
  12-stage pipeline
  2-way simultaneous multithreading (HT)
  Up to 2.33GHz
  P6 (Pentium M) microarchitecture
– 2MB shared L2 cache
– 151M transistors in 65nm technology
– Power consumption between 9W and 30W
Current (ca. 2007) CMP's (Slide 3)

Example: Sun T2
– 8 cores:
  Single-issue, statically scheduled
  8-stage pipeline
  8-way multithreading (blocked)
  Up to 1.4GHz
  UltraSparc V9 IS
– 4MB shared L2 cache
– 65nm technology
– Power consumption around 72W
Future CMP’s?
4
Example: Intel Polaris (2007)– 80 cores
Single issue, statically scheduled
3.2GHz (up to 5GHz)– Scalable, packet-
switched, interconnect (8x10 mesh)
– No shared L2 or L3 cache
– No cache coherence– “Tiled” approach
Core + cache + router– Stacked memory
technology– Power consumption
around 62W Example: Intel SCC
(2010)– 48 cores (full IA-32
compatible)
CMP’s vs. Multi-chip Multiprocessors
5
While conceptually similar to traditional multiprocessors, CMP’s have specific issues:– Off-chip memory bandwidth: number of pins per
package does not increase much– On-chip interconnection network: wires and metal
layers are a very scarce resource– Shared memory hierarchy: processors must share some
lower level cache (e.g., L2 or L3) and the on-chip links between these
– Wire delays: actual physical distances to be crossed for communication affect the latency of the communication
– Power consumption and heat dissipation: both are much harder to fit within the limitations of a single chip package
Shared vs. Private L2 Caches (Slide 6)

Private caches:
+ Less chance of negative interference between processors
+ Simpler interconnections
– Possibly wasted storage in less loaded parts of the chip
– Must enforce coherence across L2's

Shared caches:
– More chance of negative interference between processors
+ Possible positive interference between processors
+ Better utilization of storage
+ Single/few threads have access to all resources when cores are idle
+ No need to enforce coherence (but still must enforce coherence across L1's) and the L2 can act as a coherence point (i.e., directory)
– All-to-one interconnect takes up a large area and may become a bottleneck

Note: L1 caches are tightly integrated into the pipeline and are an inseparable part of the core
Shared vs. Private L2 Caches (Slide 7)

Priority Inversion and Fair Sharing
– In uniprocessors and multi-chip multiprocessors: processes with higher priority are given more resources (e.g., more processors, larger scheduling quanta, more memory/caches, etc.) → faster execution
– In CMP's with shared resources (e.g., L2 caches, off-chip memory bandwidth, issue slots with multithreading):
  Dynamic allocation of resources to threads/processes is oblivious to the OS (e.g., LRU replacement policy in caches)
  Hardware policies attempt to maximize utilization across the board
  Hardware treats all threads/processes equally, and threads/processes compete dynamically for resources
– Thus, at run time, a lower-priority thread/process may grab a larger share of resources and may execute relatively faster than a higher-priority thread/process
– One of the biggest problems is that of fair cache sharing
– In more general terms, overall quality of service should be directly proportional to priority
Shared vs. Private L2 Caches (Slide 8)

Fair Sharing
– Example: gzip running alone vs. co-scheduled with applu, apsi, art, or swim on a shared L2
– Interference in the L2 causes gzip to have 3 to 10 times more L2 misses and to run at as low as half the original speed
– The effect of the interference depends on which other application is co-scheduled with gzip
(Figure: two charts showing normalized L2 misses per instruction and normalized IPC for gzip alone and for each co-scheduled combination. From Kim et al.)
Shared vs. Private L2 Caches (Slide 9)

Fair Sharing
– Condition for fair sharing:

  Tshr_1/Tded_1 = Tshr_2/Tded_2 = … = Tshr_n/Tded_n

  where Tded_i is the execution time of thread i when executed alone in the CMP with a dedicated L2 cache and Tshr_i is its execution time when sharing the L2 with the other n-1 threads
– To maximize fair sharing, minimize:

  M_ij = X_i - X_j,  where X_i = Tshr_i/Tded_i

– Possible solution: partition caches in different-sized portions, either statically or at run time
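The fairness condition can be evaluated directly from measured execution times. The helper below is an illustrative sketch (the function name and return layout are invented here): fair sharing means all the slowdowns X_i are equal, so the worst pairwise M_ij is zero.

```python
def fairness(t_ded, t_shr):
    """Compute X_i = Tshr_i / Tded_i for each thread and the worst
    pairwise difference M_ij = X_i - X_j.

    t_ded: execution times with a dedicated L2 (thread alone).
    t_shr: execution times when sharing the L2 with the others.
    """
    xs = [shr / ded for shr, ded in zip(t_shr, t_ded)]
    worst = max(abs(a - b) for a in xs for b in xs)
    return xs, worst

# Both threads slowed down 1.5x by sharing: perfectly fair
print(fairness([100, 200], [150, 300]))   # ([1.5, 1.5], 0.0)

# Thread 0 slowed 2x but thread 1 only 1.1x: unfair sharing
xs, worst = fairness([100, 200], [200, 220])
print(worst > 0.8)   # True
```

A cache-partitioning policy of the kind the slide mentions would adjust each thread's share until this worst M_ij is driven toward zero.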
NUCA L2 Caches (Slide 10)

On-chip L2 and L3 caches are expected to continue increasing in size (e.g., Core Duo has a 2MB L2 while Core 2 Duo has a 4MB L2)
Such caches are logically divided into a few (2 to 8) logical banks with independent access
Banks are physically divided into small (128KB to 512KB) sub-banks
Thus, future multi-megabyte L2 and L3 caches will likely have 32 or more sub-banks
Increasing wire delays mean that sub-banks closer to a given processor can be accessed more quickly than sub-banks further away
Also, some sub-banks will invariably be close to one processor and far from another, and some sub-banks will be at similar distances from a few processors
Bottom line: uniform (worst-case) access times will be increasingly inefficient
NUCA L2 Caches (Slide 11)

Key ideas:
– Allow and exploit the fact that different sub-banks have different access times
– Each sub-bank has its own wire set to the cache controller (which does increase overall area)
– Either statically or dynamically map and migrate the most heavily used lines to the banks closer to the processor
– By tweaking the dynamic mapping and migration mechanisms, such NUCA caches can adapt from private to shared caches
– Obviously, with such dynamic mapping and migration, searching the cache and performing replacements becomes more expensive

E.g., Sun's T2 uses a NUCA L2 cache with 8 banks spread across the chip borders, but with static mapping and no migration
Directory Coherence On-Chip? (Slide 12)

(Diagram: CC-NUMA nodes, each a CPU with an L2 cache plus memory and directory, contrasted with CMP tiles, each a CPU with an L1 cache plus an L2 slice and directory)

One-to-one mapping from CC-NUMA?
– L2 cache → L1 cache
– Main memory → L2 cache
– Dir. entry per memory line → Dir. entry per L2 cache line
– Mem. lines mapped to physical mem. by first-touch policy at OS page granularity → L2 lines mapped to physical L2 by first-touch policy at OS page granularity
Directory Coherence On-Chip (Slide 13)

The mapping problem (home node):
– OS page granularity is too coarse, and many lines needed by Px might actually be used by Py but still have to be cached in Px (OK for a large memory, but not OK for a small L2; it may also lead to imbalance in the mapping)
– Line granularity with first-touch needs a hardware/OS mapping of every individual cache line to a physical L2 (too expensive)
– Solution: map at line granularity but circularly based on physical address (mem. line 0 maps to L2 #0, mem. line 1 maps to L2 #1, etc.)
  The problem with this solution is that locality of use is lost!

The eviction problem:
– Upon eviction of an L2 (mem.) line the corresponding dir. entry is lost and all L1 cached copies must be invalidated (OK for the rare paging case in CC-NUMA, but not OK for a small L2)
– Solution: associate dir. entries not with L2 cache lines, but with cached L1 lines (replicated tags and exclusive L1-Home L2)
Exclusivity with Replicated Tags

The Dir. contains a copy of the L1 tags of lines mapped to the home L2, but the L2 does not have to keep the L1 data itself
– Good: lines can be evicted from the L2 silently (by exclusivity, they are not cached in any L1) and the Dir. does not change
– Bad: the replicated tags (i.e., the Dir. information) grow with the number of L1 caches
  E.g., for 8 cores with 32KB L1 caches with 32B lines (i.e., 1024 lines each) and fully associative → 8x1024 = 8,192 entries per Dir.
  (In practice, associativity reduces this overhead and alternatives exist)
(Diagram: CPU + L1 cache tiles, each with an L2 slice and its Dir. of replicated L1 tags)
References and Further Reading (Slide 14)

Early study of chip-multiprocessors:
"The Case for a Single-Chip Multiprocessor", K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1996.

More recent study of chip-multiprocessors (throughput-oriented):
"Maximizing CMP Throughput with Mediocre Cores", J. Davis, J. Laudon, and K. Olukotun, Intl. Conf. on Parallel Architecture and Compilation Techniques, September 2005.

First NUCA caches proposal (for uniprocessors):
"An Adaptive, Non-uniform Cache Structure for Wire-delay Dominated On-chip Caches", C. Kim, D. Burger, and S. Keckler, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 2002.
References and Further Reading (Slide 15)

NUCA cache study for CMP:
"Managing Wire Delay in Large Chip-Multiprocessor Caches", B. Beckmann and D. Wood, Intl. Symp. on Microarchitecture, December 2004.

Recent fair cache sharing studies:
"Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", S. Kim, D. Chandra, and Y. Solihin, Intl. Conf. on Parallel Architecture and Compilation Techniques, October 2004.
"CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms", R. Iyer, Intl. Conf. on Supercomputing, June 2004.

Other recent studies on priorities and quality of service in CMP/SMT:
"Symbiotic Job-Scheduling with Priorities for Simultaneous Multithreading Processors", A. Snavely, D. Tullsen, and G. Voelker, Intl. Conf. on Measurement and Modeling of Computer Systems, June 2002.
Lect. 14: Interconnection Networks
Communication networks (e.g., LANs and WANs)– Must follow industry standards– Must support many different types of packets– Many features, such as reliability, are handled by upper software layers– Currently based on buses (e.g., Ethernet LAN) and optic fiber (WAN)– Latency is high and bandwidth is low– Topologies are highly irregular (e.g., Internet)
Multiprocessor interconnects– Custom made and proprietary– Must only support a few (3 to 4) different types of packets– Most features handled in hardware– Many different topologies and technologies are commonly used– Latency is low and bandwidth is high– Topologies are very regular
Interconnection Networks
General organization
– Network controller (NC): links the processor (host) to the network
– Switches (SW): link different parts of the network internally
– Note: SWs may not be present at all in some topologies
[Figure: general organization; each host consists of a CPU, a cache, and memory attached to a network controller (NC), and the NCs are connected to a network of switches (SW)]
Interconnection Networks
Characterizing interconnects
– Topology: the “shape” or structure of the interconnect (e.g., buses, meshes, hypercubes, butterflies, etc.)
  Direct networks: each host+NC connects directly to other hosts+NCs
  Indirect networks: hosts+NCs connect to a subset of the switches, which are the entry points to the network and are themselves connected to other internal switches
– Routing algorithm: the rules and mechanisms for routing messages
  Dynamic: the route from a given A to B may change at different times
  Static: the route from a given A to B is fixed
– Switching strategy: how the exchange of messages is set up
  Circuit switching: the route and connection from source to destination are established and fixed before communication starts (e.g., like telephone calls)
  Packet switching: each part of the communication (a packet) is handled separately
– Flow control mechanism: how traffic is handled under conflict and/or congestion
Interconnection Networks
Terminology:
– Link: the physical connection between two hosts/switches
– Channel: a logical connection between two hosts/switches that are connected by a link (multiple channels may be multiplexed onto a single link)
– Degree of a switch: the number of input/output channels
– Simplex channel: communication can only happen in one direction
– Duplex channel: communication can happen in both directions
– Phit: the smallest physical unit of data that can be transferred over a link in a unit of time
– Flit: the smallest unit of data that can be exchanged between two hosts/switches (1 flit ≥ 1 phit)
– Hop: each step between two adjacent hosts/switches
– Permutation: a combination of pairs of hosts that can communicate simultaneously
Interconnection Networks
Important properties:
– Degree or radix: the smallest number of hosts/switches that any given host/switch connects to directly
– Diameter: the longest distance between any two hosts (in number of hops)
– Bisection: a collection of links whose removal would divide the network into two disconnected, equal-size parts
  Bisection width: the minimum number of links across all bisections
  Bisection bandwidth: the minimum bandwidth across all bisections
– Total bandwidth: the maximum communication bandwidth that can be attained
– Cost: usually given as a function of the total number of links, switches, and network controllers
– Scalability: how a given property (e.g., bandwidth, cost, diameter) scales with the number of hosts; usually given in O() terms (e.g., O(1) is constant and O(N) is linear)
– Fault tolerance: whether communication between any two nodes is still possible after the failure of some links
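These definitions can be made concrete by computing them by brute force on a small example network. A minimal sketch (the helper names `diameter` and `bisection_width` are illustrative, not from the slides):

```python
from itertools import combinations

def diameter(edges, n):
    # BFS from every node; the diameter is the longest shortest path (in hops)
    adj = {v: [] for v in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    best = 0
    for s in range(n):
        dist = {s: 0}
        frontier = [s]
        while frontier:
            nxt = []
            for v in frontier:
                for w in adj[v]:
                    if w not in dist:
                        dist[w] = dist[v] + 1
                        nxt.append(w)
            frontier = nxt
        best = max(best, max(dist.values()))
    return best

def bisection_width(edges, n):
    # minimum number of links crossing any equal-size split of the hosts
    best = None
    for half in combinations(range(n), n // 2):
        half = set(half)
        crossing = sum(1 for a, b in edges if (a in half) != (b in half))
        best = crossing if best is None else min(best, crossing)
    return best

# 4-node bidirectional ring: 0-1-2-3-0
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(diameter(ring, 4))         # 2 (= N/2)
print(bisection_width(ring, 4))  # 2
```

Enumerating all equal-size splits is exponential in N, so this is only feasible for tiny networks; the point is to make the definitions concrete, not to be efficient.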
Topologies: Buses
– Degree: N-1 (i.e., fully connected)
– Diameter: 1
– Bisection width: 1
– Total bandwidth: O(1)
– Cost: O(N)
– Permutations: single pair, broadcast (one-to-all), multicast (one-to-many)

[Figure: several CPUs and the main memory all attached to a single shared bus]
Topologies: Crossbar
– Degree: N-1 (i.e., fully connected)
– Diameter: 2 (sometimes also said to be 1)
– Bisection width: N
– Total bandwidth: O(N)
– Cost: O(N^2)
– Permutations: single-pair, any pair-wise permutation

[Figure: 4x4 crossbar connecting CPUs 1-4, with a switch at each crosspoint; each node is a CPU with memory]
Topologies: Bidirectional Ring
– Degree: 2
– Diameter: N/2
– Bisection width: 2
– Total bandwidth: O(N) (e.g., all nodes communicate with a neighbor)
– Cost: O(N)
– Permutations: single-pair, neighbor

[Figure: bidirectional ring of CPU+memory nodes; can also be laid out with same-size wires]
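Routing on a bidirectional ring goes whichever way around is shorter, which is why the diameter is N/2. A small illustrative sketch (the function name is assumed, not from the slides):

```python
def ring_route(src, dst, n):
    # Bidirectional ring of n nodes: pick the shorter of the two directions.
    cw = (dst - src) % n          # clockwise distance
    if cw <= n - cw:
        step, hops = 1, cw        # go clockwise
    else:
        step, hops = -1, n - cw   # go counter-clockwise
    path = [src]
    for _ in range(hops):
        path.append((path[-1] + step) % n)
    return path

print(ring_route(0, 6, 8))  # [0, 7, 6] -- 2 hops counter-clockwise, not 6 clockwise
```

No route produced this way is ever longer than N/2 hops, matching the diameter above.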
Topologies: 2-D Mesh
– Degree: 2 (maximum is 4 at internal nodes)
– Diameter: 2*(k-1), where k is the number of nodes per row/column, i.e., N^(1/2)
– Bisection width: k
– Total bandwidth: O(N)
– Cost: O(N)
– Permutations: single-pair, neighbor

[Figure: 2-D mesh of CPU+memory nodes]
Topologies: 2-D Torus
– Degree: 4
– Diameter: k
– Bisection width: 2*k
– Total bandwidth: O(N)
– Cost: O(N)
– Permutations: single-pair, neighbor

[Figure: 2-D torus of CPU+memory nodes (a mesh with wrap-around links)]
Topologies: 4-D Cube (hypercube)
– Degree: 4
– Diameter: 4
– Bisection width: 8
– Total bandwidth: O(N)
– Cost: O(N)
– Permutations: single-pair, neighbor

[Figure: 4-D hypercube of CPU+memory nodes]
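The hypercube numbers follow from closed-form formulas for a d-dimensional cube with N = 2^d hosts: each host has d neighbors (one per bit position), at most d bits need to be flipped to reach any destination, and 2^(d-1) links cross the middle. A small sketch (the helper name is illustrative):

```python
def hypercube_props(d):
    # d-dimensional binary hypercube with N = 2**d hosts
    n = 2 ** d
    degree = d            # one neighbor per bit that can be flipped
    diam = d              # at most d bit-flips between any two hosts
    bisection = n // 2    # 2**(d-1) links cross any dimension-aligned cut
    return degree, diam, bisection

print(hypercube_props(4))  # (4, 4, 8), matching the 4-D cube slide
```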
Topologies: Binary Tree
– Degree: 1 for hosts and 3 for switches
– Diameter: 2*log2 N
– Bisection width: 1
– Total bandwidth: O(N)
– Cost: O(N)
– Permutations: single-pair, neighbor
– Note: in a “fat” tree, the width of the links increases toward the root

[Figure: binary tree (also drawn in H-tree configuration); leaf nodes are CPU+memory hosts, while intermediate nodes and the root are switches]
Topologies: Switched Network
– Degree: 1 for hosts and 2 for switches
– Diameter: log2 N
– Bisection width: N/2
– Total bandwidth: O(N)
– Cost: O(N*log2 N)
– Permutations: depend on the actual topology

[Figure: multistage switched network connecting CPUs 1-4 on one side to CPUs 1-4 on the other through internal switches]
Topologies: Switched Network (e.g., Omega network)

[Figure: example Omega network]
Routing
Example: mesh and d-dimensional cubes
– Hosts are numbered as in a matrix
– To avoid deadlock, use dimension-ordered routing (a.k.a. X-Y routing in 2-D):
  Follow all the steps necessary in one dimension before changing dimensions
  Always choose dimensions in the same order
– E.g., from (1,1) to (3,3) and from (3,3) to (1,1):
(0,0) (0,1) (0,2) (0,3)
(1,0) (1,1) (1,2) (1,3)
(2,0) (2,1) (2,2) (2,3)
(3,0) (3,1) (3,2) (3,3)
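The dimension-ordered rule can be sketched directly: correct the first coordinate completely, then the second. Which dimension goes first is just a fixed convention, and the function name below is illustrative:

```python
def xy_route(src, dst):
    # Dimension-ordered (X-Y) routing on a 2-D mesh:
    # finish all hops in the first dimension before touching the second.
    path = [src]
    x, y = src
    dx, dy = dst
    while x != dx:               # first dimension
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:               # then the second dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((1, 1), (3, 3)))
# [(1, 1), (2, 1), (3, 1), (3, 2), (3, 3)]
```

Because every message crosses the dimensions in the same order, the cyclic channel dependences that cause deadlock cannot arise.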
Routing
Example: Omega network
– Hosts are numbered linearly in binary (log2 N bits are required)
– The routing function is given by F = S XOR D, where S and D are the binary numbers of the source and destination hosts, respectively
– At each level of the network, use the corresponding bit of the routing function to go:
  Straight, if the bit is 0
  Across, if the bit is 1
– Assign numbers to hosts appropriately (easy for Omega, but more complex for other networks)
– E.g., from 010 to 011: F=001 → straight, straight, across; and from 100 to 111: F=011 → straight, across, across
[Figure: 8-host Omega network; inputs 000-111 on the left and outputs 000-111 on the right, connected through three stages of switches]
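The XOR routing function can be sketched directly: the bits of F are consumed one per stage, most-significant bit first, with 0 meaning straight and 1 meaning across. The function name and encapsulation are illustrative:

```python
def omega_route(src, dst, stages=3):
    # Omega routing: F = S XOR D; at stage i, consume one bit of F
    # (MSB first). Bit 0 -> go straight, bit 1 -> go across.
    f = src ^ dst
    moves = []
    for i in reversed(range(stages)):
        bit = (f >> i) & 1
        moves.append("across" if bit else "straight")
    return moves

print(omega_route(0b010, 0b011))  # ['straight', 'straight', 'across']
print(omega_route(0b100, 0b111))  # ['straight', 'across', 'across']
```

Both outputs match the worked examples on the slide.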
Packet Switching
Store-and-forward
– Enough space must be pre-allocated in the destination router’s buffers for the complete packet
– The router must wait until the complete packet has been received before it can start forwarding it

Cut-through
– Enough space must be pre-allocated in the destination router’s buffers for the complete packet
– The router may start forwarding parts of the packet as soon as they arrive

Wormhole
– Packets are divided into small pieces called flow units (flits)
– Only the header flit contains the address of the destination and is responsible for setting up the route (trailing flits simply follow the header)
– There is no need to allocate enough buffer space for the entire packet (the packet spreads through multiple routers and links like a “worm”)
– May lead to deadlock
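The latency gap between store-and-forward and cut-through can be illustrated with a first-order model that ignores routing delay and contention (an assumption; the functions below are illustrative, with L the packet length, B the link bandwidth, and hops the path length):

```python
def store_and_forward(L, B, hops):
    # Each router must receive the whole packet before forwarding it,
    # so the full L/B transfer time is paid at every hop.
    return hops * (L / B)

def cut_through(L, B, hops, header=1):
    # Forwarding starts once the header has been seen at each hop,
    # so only the header delay is paid per hop; the packet body
    # streams through the pipeline of routers.
    return (hops - 1) * (header / B) + L / B

# 1024-byte packet, 1 byte/cycle links, 4 hops
print(store_and_forward(1024, 1, 4))  # 4096.0 cycles
print(cut_through(1024, 1, 4))        # 1027.0 cycles
```

Under this model store-and-forward latency grows with hops times packet size, while cut-through (and wormhole, which behaves similarly for latency) pays the packet transmission time only once.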
References and Further Reading
Recent books on multiprocessor interconnects:
– “Principles and Practice of Interconnection Networks”, W. Dally and B. Towles, Morgan Kaufmann, 2003.
– “Interconnection Networks”, J. Duato, Morgan Kaufmann, 2002.