©RG:E0243:L2- Parallel Architecture
E0-243: Computer Architecture
L2 – Parallel Architecture
Overview
• Parallel architecture
• Cache coherence problem
• Memory consistency
Trends
• Ever-increasing transistor density → multiple processors (multiple cores) on a single chip (CMP)
• Beyond instruction-level parallelism → thread-level parallelism
• Speculative execution → speculative multithreaded execution
Recall:
• Amdahl’s Law: for a program in which a fraction x of the execution is sequential, speedup is limited by 1/x.
• Speedup = (Exec. time on a uniprocessor) / (Exec. time on N processors)
• Efficiency = (Speedup on N processors) / N
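The 1/x bound can be checked numerically. The sketch below uses the usual finite-N form of Amdahl's Law, Speedup(N) = 1 / (x + (1 - x)/N), which is not written on the slide but tends to 1/x as N grows. The function names are illustrative.

```python
def amdahl_speedup(x: float, n: int) -> float:
    """Speedup on n processors when a fraction x of the run is sequential."""
    if not 0.0 <= x <= 1.0 or n < 1:
        raise ValueError("need 0 <= x <= 1 and n >= 1")
    # The sequential part takes x; the parallel part (1 - x) is split over n processors.
    return 1.0 / (x + (1.0 - x) / n)


def efficiency(x: float, n: int) -> float:
    """Efficiency = speedup on n processors / n."""
    return amdahl_speedup(x, n) / n


if __name__ == "__main__":
    # With x = 0.1 the speedup approaches, but never reaches, 1/x = 10.
    for n in (1, 2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.1, n), 2), round(efficiency(0.1, n), 3))
```

Efficiency falls as N grows because the sequential fraction keeps all but one processor idle for part of the run.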
Space of Parallel Computing
• Programming models
  - What the programmer uses in coding applications
  - Specifies synchronization and communication
  - Shared address space, e.g., OpenMP
  - Message passing, e.g., MPI
• Parallel architecture
  - Shared memory
    - Centralized shared memory (UMA)
    - Distributed shared memory (NUMA)
  - Distributed memory
    - A.k.a. message passing
    - E.g., clusters
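The two programming models can be contrasted with a small sketch. It uses Python threads as stand-ins for processors and is not OpenMP or MPI code; the function names are made up for illustration. In the first version workers communicate implicitly through a shared variable, in the second only through explicit messages.

```python
import queue
import threading


def sum_shared_address_space(chunks):
    """Workers communicate implicitly by updating one shared variable."""
    total = 0
    lock = threading.Lock()  # synchronization is explicit; communication is not

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:
            total += partial  # an ordinary load and store on shared data

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def sum_message_passing(chunks):
    """Workers keep their data private and communicate only by messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # explicit send

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    total = sum(mailbox.get() for _ in chunks)  # one explicit receive per worker
    for t in threads:
        t.join()
    return total
```

Both functions compute the same sum; they differ only in how the partial results reach the place where they are combined.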
Shared Memory Architectures
• Shared, global address space, hence called shared address space
• Any processor can directly reference any memory location
• Communication occurs implicitly as a result of loads and stores
• Centralized: latencies to memory are uniform, but uniformly large
• Distributed: Non-Uniform Memory Access (NUMA)
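Shared Memory Architecture
[Figure: Centralized shared memory: processors (P), each with a cache ($), connected through a network to shared memory modules (M). Distributed shared memory: nodes, each with a processor (P), a cache ($), and a local memory (M), connected by a network.]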
Distributed Memory Architecture
[Figure: processor nodes, each with a processor (P), a cache ($), and a memory (M), connected by a network.]
• Message passing architecture
• Memory is private to each node
• Processes communicate by messages
Caches and Cache Coherence
• Caches play a key role in all cases
  - Reduce average data access time
  - Reduce bandwidth demands placed on the shared interconnect
• Private processor caches create a problem
  - Copies of a variable can be present in multiple caches
  - A write by processor P may not be visible to P’!
  - P’ will keep accessing the stale value from its cache!
• Cache coherence problem
Cache Coherence Problem: Example
• Processors see different values for u after event 3
• With write-back caches, the value written back to memory depends on which cache flushes or writes back its value
[Figure: P1, P2, and P3, each with a private cache ($), on a bus shared with memory and I/O devices. Memory holds u: 5. Events in order: 1 Read (u: 5), 2 Read (u: 5), 3 Write (u = 7), 4 Read (u = ?), 5 Read (u = ?)]
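The five events can be replayed with a toy model of private write-back caches that have no coherence mechanism. This is a sketch under stated assumptions: which processor performs each event (P1 reads, P3 reads, P3 writes, P1 reads, P2 reads) is an assumption made for the illustration, and `WriteBackCache` is a hypothetical class, not a model of any real cache.

```python
class WriteBackCache:
    """A private write-back cache with no coherence mechanism."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> [value, dirty]

    def read(self, addr):
        if addr not in self.lines:  # miss: fetch the block from memory
            self.lines[addr] = [self.memory[addr], False]
        return self.lines[addr][0]  # hit: the cached value may be stale

    def write(self, addr, value):
        self.lines[addr] = [value, True]  # memory is not updated yet

    def write_back(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.memory[addr] = value


def replay():
    memory = {"u": 5}
    p1, p2, p3 = (WriteBackCache(memory) for _ in range(3))
    seen = {}
    seen[1] = p1.read("u")  # event 1: read miss, loads 5
    seen[2] = p3.read("u")  # event 2: read miss, loads 5
    p3.write("u", 7)        # event 3: only the writer's cache changes
    seen[4] = p1.read("u")  # event 4: hit on the stale copy
    seen[5] = p2.read("u")  # event 5: miss, but memory still holds the old value
    p3.write_back("u")      # only now does memory see the new value
    return seen, memory["u"]
```

In this model neither of the later readers observes the write, and memory only changes once the writer's cache writes the block back.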
Cache Coherence Problem
• Multiple processors with private caches
  - Potential data consistency problem: the cache coherence problem
• Processes shouldn’t read "stale" data
  - Intuitively, reading an address should return the last value written to that address
• Solutions
  - Hardware: cache coherence mechanisms
    - Invalidation-based vs. update-based
    - Snoopy vs. directory
  - Software: compiler-assisted cache coherence
Example: Snoopy Bus Protocols
• Assumption: shared bus interconnect where all cache controllers monitor all bus activity
  - Called snooping
• There is only one operation on the bus at a time; cache controllers can be built to take corrective action and enforce coherence in caches
  - Corrective action could involve updating or invalidating a cache block
Snoopy Invalidate Protocol
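[Figure: P1, P2, and P3, each with a private cache ($), on a bus shared with memory and I/O devices. Memory holds u: 5. Events in order: 1 Read (u: 5), 2 Read (u: 5), 3 Write (u = 7), 4 Read (u = ?)]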
Invalidate vs Update
• Basic question of program behavior: is a block written by one processor later read by others before it is overwritten?
• Invalidate
  - Readers will take a miss
  - Multiple writes without additional traffic
  - Clears out copies that are not used again
• Update
  - Avoids misses on later references
  - Multiple useless updates
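The trade-off can be made concrete by counting bus transactions for a single block under each policy. The cost model below is a deliberately crude assumption (one transaction per miss, per invalidation, and per broadcast update; write-backs to memory are ignored), and `bus_transactions` is an illustrative name.

```python
def bus_transactions(trace, policy):
    """Count bus transactions for one block.

    trace  : list of (processor, "R" or "W") accesses in bus order
    policy : "invalidate" or "update"
    """
    holders = set()  # processors that currently hold a valid copy
    count = 0
    for proc, op in trace:
        if op == "R":
            if proc not in holders:  # read miss
                count += 1
                holders.add(proc)
        elif policy == "invalidate":
            if holders != {proc}:    # must obtain the only valid copy first
                count += 1
                holders = {proc}
            # further writes by the same processor cause no traffic
        else:  # update
            if proc not in holders:  # write miss: fetch the block
                count += 1
                holders.add(proc)
            if len(holders) > 1:     # broadcast the new value to the other copies
                count += 1
    return count


# One processor writes, another reads, repeatedly: update avoids the read misses.
producer_consumer = [(0, "W"), (1, "R")] * 4
# A stale reader copy followed by a run of writes: every update is useless.
write_run = [(1, "R")] + [(0, "W")] * 8
```

Under this model the producer-consumer trace costs 8 transactions with invalidation and 5 with updates, while the write run costs 2 with invalidation and 10 with updates.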
MSI Invalidation Protocol
• Cache block states
  - I: Invalid
  - S: Shared (one or more cache copies)
  - M: Modified or dirty (only copy)
  - Encoded in 2 bits and updated by the protocol
• Processor events
  - PrRd (read)
  - PrWr (write)
• Bus transactions
  - BusRd: asks for a copy with no intent to modify
  - BusRdX: asks for a copy with intent to modify
  - Flush: write back (updates main memory)
MSI: State Transition Diagram
Arcs are labeled observed event / action taken:
• I: PrRd/BusRd → S; PrWr/BusRdX → M
• S: PrRd/-, BusRd/- (stay in S); PrWr/BusRdX → M; BusRdX/- → I
• M: PrRd/-, PrWr/- (stay in M); BusRd/Flush → S; BusRdX/Flush → I
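The MSI transitions can be written as a lookup table and exercised on a small bus model. This is a minimal sketch, not a hardware description: the table follows the standard MSI arcs, while `access`, `replay`, and the choice of which cache issues each event are assumptions made for the illustration.

```python
# (current state, event) -> (next state, bus action).
# Pairs that are not listed leave the block unchanged.
MSI = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def access(states, proc, event, bus_log):
    """Apply a processor event at cache `proc`; all other caches snoop the bus."""
    next_state, request = MSI[(states[proc], event)]
    if request is not None:
        bus_log.append(request)
        for other, state in enumerate(states):
            if other == proc:
                continue
            snooped_state, response = MSI.get((state, request), (state, None))
            if response is not None:
                bus_log.append(response)  # the dirty copy is written back
            states[other] = snooped_state
    states[proc] = next_state


def replay(events, n_caches=3):
    """Run (cache index, event) pairs from the all-Invalid state."""
    states = ["I"] * n_caches
    bus_log = []
    for proc, event in events:
        access(states, proc, event, bus_log)
    return states, bus_log
```

Replaying read, read, write, read, read with caches 0, 2, 2, 0, 1 shows the write invalidating the other copy, so the later read misses and forces a flush of the modified block instead of returning a stale value.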
MESI (4-state) Invalidation Protocol
• Problem with the MSI protocol
  - Reading and modifying data takes 2 bus transactions, even if no one is sharing
  - BusRd (I->S) followed by BusRdX or BusUpgr (S->M)
• Add an exclusive state: write locally without a bus transaction
  - Memory is up to date, so the cache is not necessarily the owner
• States
  - Invalid
  - Exclusive or exclusive-clean (only this cache has a copy, but not modified)
  - Shared (two or more caches may have copies)
  - Modified (dirty)
MESI - State Transition Diagram
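Arcs are labeled observed event / action taken:
• I: PrRd/BusRd(S̄) → E; PrRd/BusRd(S) → S; PrWr/BusRdX → M
• E: PrRd/- (stay in E); PrWr/- → M; BusRd/Flush → S; BusRdX/Flush → I
• S: PrRd/- (stay in S); PrWr/BusRdX → M; BusRd/Flush' (stay in S); BusRdX/Flush' → I
• M: PrRd/-, PrWr/- (stay in M); BusRd/Flush → S; BusRdX/Flush → I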
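The benefit of the exclusive state shows up when a single cache reads and then writes a block that nobody else holds. The sketch below is a simplified model under stated assumptions: it records only the requester's bus transaction plus a `Flush` when another cache holds the block modified, it does not model which cache supplies clean data, and `mesi_access` is an illustrative name.

```python
def mesi_access(states, proc, event):
    """Apply PrRd or PrWr at cache `proc`; return the bus traffic it causes."""
    state = states[proc]
    others = [i for i in range(len(states)) if i != proc]
    if event == "PrRd" and state == "I":
        # The shared signal is asserted if any other cache holds the block.
        shared = any(states[i] != "I" for i in others)
        request, next_state = "BusRd", ("S" if shared else "E")
    elif event == "PrWr" and state in ("I", "S"):
        request, next_state = "BusRdX", "M"
    elif event == "PrWr":
        request, next_state = None, "M"    # E or M: write locally, no bus transaction
    else:
        request, next_state = None, state  # read hit in S, E, or M
    traffic = []
    if request is not None:
        traffic.append(request)
        for i in others:
            if states[i] == "M":
                traffic.append("Flush")    # the dirty copy is written back
            if states[i] != "I":
                states[i] = "S" if request == "BusRd" else "I"
    states[proc] = next_state
    return traffic
```

Starting from `["I", "I"]`, a read by cache 0 costs one `BusRd` and lands in E; the following write is silent. Under MSI the same pair of accesses needs a `BusRd` followed by a `BusRdX`.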
Scalability Issues of Snoopy Protocol
• Snoopy caches are ideally suited to a bus-based interconnection network (IN).
• The performance of a shared-bus IN saturates for a large number of processors (beyond 8 processors).
• For a non-bus-based IN, coherence messages can be broadcast, but this is expensive.
• Only a few processors may have a copy of the shared data.
• It may be more efficient to maintain a directory of the caches that have a copy of the cache block.
Directory Based Coherence
• Memory (or cache) maintains a list (directory) of the processors that have a copy of a block
• On a write, the memory controller sends an Invalidate (or Update) signal only to the processors that have a copy
• Memory also knows the current owner (in the case of dirty blocks); the memory controller requests the updated copy from the owner
Generic Solution: Directories
[Figure: nodes, each with a processor (P1, ...), a cache, a communication assist, and a memory with its directory, connected by a scalable interconnection network. Each directory entry holds presence bits and a dirty bit.]
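A directory entry with presence bits and a dirty bit can be sketched as follows. This is an illustrative invalidation-based model, not a specific machine's protocol: the `Directory` class and the message names `"fetch"` and `"invalidate"` are assumptions, and the methods return the point-to-point messages the memory controller would send.

```python
class Directory:
    """Per-block presence bits and dirty bit, kept at the memory."""

    def __init__(self, n_procs):
        self.n_procs = n_procs
        self.entries = {}  # block -> {"presence": [bool, ...], "dirty": bool}

    def _entry(self, block):
        return self.entries.setdefault(
            block, {"presence": [False] * self.n_procs, "dirty": False}
        )

    def read(self, block, proc):
        entry = self._entry(block)
        messages = []
        if entry["dirty"]:
            owner = entry["presence"].index(True)
            if owner != proc:
                messages.append(("fetch", owner))  # ask the owner for the updated copy
                entry["dirty"] = False
        entry["presence"][proc] = True
        return messages

    def write(self, block, proc):
        entry = self._entry(block)
        messages = []
        if entry["dirty"]:
            owner = entry["presence"].index(True)
            if owner != proc:
                messages.append(("fetch", owner))
        # Invalidate only the processors whose presence bit is set.
        messages += [
            ("invalidate", p)
            for p, present in enumerate(entry["presence"])
            if present and p != proc
        ]
        entry["presence"] = [p == proc for p in range(self.n_procs)]
        entry["dirty"] = True
        return messages
```

Unlike a snoopy bus, no message reaches a processor whose presence bit is clear, which is what lets the scheme scale beyond a broadcast medium.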
Memory Consistency Model
• Memory consistency model
  - Order in which memory operations will appear to execute
    ⇒ What value can a read return?
• Contract between application software and the system
  ⇒ Affects ease of programming and performance
Understanding Program Order: Example
Initially A = B = 0;
Process P1        Process P2          Process P3
A = 1;            while (A==0);       while (B==0);
                  B = 1;              Print A;
• What value of A will be printed by process P3?
• Role of program order in ensuring that P3 reads the value of A as 1.
Example 2
Software implementation of mutex:
Process P1            Process P2
A = 0;                B = 0;
...                   ...
A = 1;                B = 1;
if (B == 0)           if (A == 0)
  critical section      critical section
• Can both P1 and P2 enter the critical section, i.e., evaluate the "if" condition as true?
Sequential Consistency: Definition
A system is sequentially consistent if
• Operations within a processor follow program order
• Operations of all processors were executed in some (interleaved) sequential order
• All processors see the same sequential order
Implicit Memory Model
• Sequential consistency (SC) [Lamport]
• Result of an execution appears as if
  • Operations from different processors executed in some sequential (interleaved) order
  • Memory operations of each process in program order
[Figure: processors P1, P2, P3, ..., Pn connected to a single MEMORY]
Sequential Consistency: Definition
A system is sequentially consistent if
• Operations within a processor follow program order
• Operations of all processors were executed in some (interleaved) sequential order
• All processors see the same sequential order
Initially A = B = 0;
Process P1        Process P2          Process P3
A = 1;            while (A==0);       while (B==0);
                  B = 1;              Print A;
Under SC can P3 print A as 0?
Initially A = B = 0;
Process P1        Process P2               Process P3
(w1) A = 1;       (r2) while (A==0);       (r3) while (B==0);
                  (w2) B = 1;              (r3’) Print A;
[Figure: graph of the ordering constraints among w1, r2, w2, r3, and r3’]
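The question can be answered by brute force: enumerate every interleaving that respects each process's program order and collect what P3 prints. This is a sketch under one modeling assumption: a `while` loop is treated as a single operation that can only be scheduled once its variable is nonzero (a spin that still reads 0 simply retries). The operation encoding is made up for the illustration.

```python
def sc_printed_values():
    """Return the set of values P3 can print under sequential consistency."""
    programs = {
        "P1": [("write", "A", 1)],
        "P2": [("spin", "A"), ("write", "B", 1)],
        "P3": [("spin", "B"), ("print", "A")],
    }
    printed = set()

    def explore(memory, pcs):
        for name, program in programs.items():
            pc = pcs[name]
            if pc == len(program):
                continue  # this process has finished
            op = program[pc]
            if op[0] == "spin" and memory[op[1]] == 0:
                continue  # still spinning: cannot move past the loop yet
            new_memory = dict(memory)
            if op[0] == "write":
                new_memory[op[1]] = op[2]
            elif op[0] == "print":
                printed.add(memory[op[1]])
            explore(new_memory, {**pcs, name: pc + 1})

    explore({"A": 0, "B": 0}, {name: 0 for name in programs})
    return printed
```

Only the order w1, r2, w2, r3, r3’ is possible here, so the set contains just the value 1: under SC, P3 cannot print 0.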
Sequential Consistency
• SC ensures all memory orders: Write → Read, Write → Write, Read → Read, Read → Write
• SC treats all memory operations the same way!
Processor Consistency: Definition
A system is processor consistent if
• Writes issued by a processor must be in program order
  - Read → read, read → write, and write → write order
  - But no write → read order
• Operations of all processors were executed in some (interleaved) sequential order
• All processors need not see the same sequential order of writes from different processors
Initially A = B = 0;
Process P1        Process P2          Process P3
A = 1;            while (A==0);       while (B==0);
                  B = 1;              Print A;
Example 2
Process P1            Process P2
A = 0;                B = 0;
...                   ...
A = 1;                B = 1;
if (B == 0)           if (A == 0)
  critical section      critical section
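Whether both processes can enter the critical section depends on whether write → read order is enforced. The sketch below enumerates all executions of the two-process example. It models the relaxation in the simplest possible way, by letting each process's read be performed before its own earlier write; that is an assumption standing in for hardware such as write buffers, and the function names are illustrative.

```python
from itertools import product


def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's internal order."""
    if not a or not b:
        yield a + b
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(allow_write_read_reorder):
    """Return the set of (value P1 reads for B, value P2 reads for A)."""
    p1 = [("P1", "write", "A"), ("P1", "read", "B")]
    p2 = [("P2", "write", "B"), ("P2", "read", "A")]
    # Without write -> read ordering, the read may be performed first.
    orders1 = [p1, p1[::-1]] if allow_write_read_reorder else [p1]
    orders2 = [p2, p2[::-1]] if allow_write_read_reorder else [p2]
    results = set()
    for o1, o2 in product(orders1, orders2):
        for schedule in interleavings(o1, o2):
            memory = {"A": 0, "B": 0}
            seen = {}
            for proc, kind, var in schedule:
                if kind == "write":
                    memory[var] = 1
                else:
                    seen[proc] = memory[var]
            results.add((seen["P1"], seen["P2"]))
    return results
```

Both processes enter the critical section exactly when the outcome is `(0, 0)`. With program order enforced (SC) that outcome never appears; once write → read order is dropped it does, so this mutex is not safe under processor consistency.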
Weak Consistency
• Distinguishes between ordinary memory operations and synchronization operations (e.g., lock acquire/release)
• A system is weakly consistent if
  - Before a load/store is allowed to perform, all previous synchronization accesses must be performed
  - Before a synchronization operation is performed, all previous loads/stores must be performed
  - Synchronization accesses are sequentially consistent
Weak Consistency
Weak ordering:
• Divide memory operations into data operations and synchronization operations
• Synchronization operations act like a fence:
  - All data operations before the synch in program order must complete before the synch is executed
  - All data operations after the synch in program order must wait for the synch to complete
  - Synchs are performed in program order
Weak Consistency
Weak ordering:
• Implementation of fence: the processor has a counter that is incremented when a data op is issued and decremented when a data op is completed
• Example: PowerPC has a SYNC instruction
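The counter scheme can be sketched in a few lines. The slide only describes the increment and decrement; the rule that the fence may proceed once the counter is back at zero is the natural reading of it and is stated here as an assumption. `FenceCounter` is a hypothetical name.

```python
class FenceCounter:
    """Tracks outstanding data operations for one processor."""

    def __init__(self):
        self.outstanding = 0

    def issue(self):
        """A data operation (load or store) is issued."""
        self.outstanding += 1

    def complete(self):
        """A previously issued data operation has completed."""
        if self.outstanding == 0:
            raise RuntimeError("no outstanding operation to complete")
        self.outstanding -= 1

    def fence_may_proceed(self):
        """Assumed rule: the synch executes only when nothing is outstanding."""
        return self.outstanding == 0
```

A single counter is enough because the fence does not care which operations are pending or in what order they finish, only that all of them have.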
An Example:
[Figure: the program-order sequence Load, Load, Store, Load, Store, Store, shown once under Sequential Consistency and once under Processor Consistency]
Example: Weak Consistency
[Figure: blocks of Load/Store operations separated by Sync(Acq) and Sync(Rel) operations. No ordering among the loads/stores within a block!]
Another model: Release Consistency
• Synchronization accesses are divided into
  - Acquires: operations like lock
  - Releases: operations like unlock
• Semantics of acquire:
  - Acquire must complete before all following memory accesses
• Semantics of release:
  - All memory operations before the release must complete
  - But accesses after the release in program order do not have to wait for the release
  - Operations which follow the release and which need to wait must be protected by an acquire
Release Consistency
• Further distinguishes between lock acquire and lock release synchronization operations
• A system is release consistent if
  - Before a load/store is allowed to perform, all previous acquire accesses must be performed
  - Before a release synchronization operation is performed, all previous loads/stores must be performed
  - Synchronization accesses are processor consistent
Example: Release Consistency
[Figure: the same blocks of Load/Store operations shown twice: under weak consistency, separated by Sync(Acq) and Sync(Rel); under release consistency, separated by Acquire and Release]
• Acquire treated as READ/LOAD
• Release treated as WRITE/STORE
Ordering in Consistency Models
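• Orderings compared (R = read, W = write, SA = synchronization acquire, SR = synchronization release):
  - Ordinary accesses: R→R, R→W, W→R, W→W
  - Acquire: SA→R, SA→W, R→SA, W→SA
  - Release: SR→R, SR→W, R→SR, W→SR
• Models compared: SC, PC, WC, RC
• Ordering among synchronization accesses:
  - WC: SA→SA, SA→SR, SR→SR, SR→SA
  - RC: SA→SA, SA→SR, SR→SR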
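The definitions given on the earlier slides can be collected into one predicate that says whether two operations from the same processor must be performed in program order. This is one reading of those definitions, not a reproduction of the slide's table, and it ignores the dependences between accesses to the same address that every model preserves. The function name is illustrative.

```python
SYNC = {"SA", "SR"}  # synchronization acquire, synchronization release


def must_preserve(model, first, second):
    """True if `first` must be performed before a later `second`.

    Operations are "R", "W", "SA", or "SR"; SC and PC are only asked
    about ordinary reads and writes.
    """
    if model == "SC":
        return True                          # all four orders are enforced
    if model == "PC":
        return (first, second) != ("W", "R")  # only write -> read is relaxed
    if model == "WC":
        # Every synch is a fence; data operations between synchs are unordered.
        return first in SYNC or second in SYNC
    if model == "RC":
        if first in SYNC and second in SYNC:
            return (first, second) != ("SR", "SA")  # synchs are processor consistent
        # Accesses wait for earlier acquires; a release waits for earlier accesses.
        return first == "SA" or second == "SR"
    raise ValueError(f"unknown model: {model}")
```

For example, `must_preserve("WC", "SR", "R")` is true while `must_preserve("RC", "SR", "R")` is false, which is the freedom release consistency adds: accesses after a release need not wait for it.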
Reading Material
• S. V. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial," WRL Research Report 95/7. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
• K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," ISCA 1990.