1
Shared Memory Multiprocessors
Logical design and software interactions
2
Shared Memory Multiprocessors
Symmetric Multiprocessors (SMPs)• Symmetric access to all of main memory from any processor
Dominate the server market• Building blocks for larger systems; arriving to desktop
Attractive as throughput servers and for parallel programs• Fine-grain resource sharing• Uniform access via loads/stores• Automatic data movement and coherent replication in caches• Useful for operating system too
Normal uniprocessor mechanisms to access data (reads and writes)• Key is extension of memory hierarchy to support multiple processors
3
Supporting Programming Models
• Address translation and protection in hardware (hardware SAS)• Message passing using shared memory buffers
– can be very high performance since no OS involvement necessary• Focus here on supporting coherent shared address space
[Figure: layered view. Programming models (multiprogramming, shared address space, message passing) sit above the communication abstraction at the user/system boundary; below it are compilation or library, operating systems support, communication hardware at the hardware/software boundary, and the physical communication medium.]
4
Natural Extensions of Memory System
[Figure: four organizations, each with processors P1…Pn, caches ($), and memory: (a) shared first-level cache behind a switch; (b) bus-based shared memory with private caches and interleaved main memory plus I/O devices; (c) dancehall, with caches and memories on opposite sides of an interconnection network; (d) distributed memory, with a memory attached to each processor–cache node on the network.]
5
Caches and Cache Coherence
Caches play key role in all cases• Reduce average data access time• Reduce bandwidth demands placed on shared interconnect
But private processor caches create a problem• Copies of a variable can be present in multiple caches • A write by one processor may not become visible to others
– They’ll keep accessing stale value in their caches
• Cache coherence problem• Need to take actions to ensure visibility
6
Focus: Bus-based, Centralized Memory
Shared cache• Low-latency sharing and prefetching across processors• Sharing of working sets• No coherence problem (and hence no false sharing either)• But high bandwidth needs and negative interference (e.g. conflicts)• Hit and miss latency increased due to intervening switch and cache size• Mid 80s: to connect couple of processors on a board (Encore, Sequent)• Today: for multiprocessor on a chip (for small-scale systems or nodes)
Dancehall • No longer popular: everything is uniformly far away
Distributed memory• Most popular way to build scalable systems, discussed later
7
Outline
Coherence and Consistency
Snooping Cache Coherence Protocols
Quantitative Evaluation of Cache Coherence Protocols
Synchronization
Snoop-based Multiprocessor Design
8
A Coherent Memory System: Intuition
Reading a location should return latest value written (by any process)
Easy in uniprocessors• Except for I/O: coherence between I/O devices and processors• But infrequent so software solutions work
– uncacheable memory, uncacheable operations, flush pages, pass I/O data through caches
Would like same to hold when processes run on different processors • E.g. as if the processes were interleaved on a uniprocessor
But coherence problem much more critical in multiprocessors• Pervasive• Performance-critical• Must be treated as a basic hardware design issue
9
Example Cache Coherence Problem
• Processors see different values for u after event 3• With write back caches, value written back to memory depends on
happenstance of which cache flushes or writes back value when– Processes accessing main memory may see very stale value
• Unacceptable to programs, and frequent!
[Figure: P1, P2, P3 with private caches on a bus with memory and I/O devices. u is 5 in memory; P1 and P3 read u and cache the value 5, then P3 writes u = 7. Subsequent reads of u by P1 and P2 may still see the stale value 5.]
10
Problems with the Intuition
Recall: Value returned by read should be last value written
But “last” is not well-defined
Even in seq. case, last defined in terms of program order, not time• Order of operations in the machine language presented to processor• “Subsequent” defined in analogous way, and well defined
In parallel case, program order defined within a process, but need to make sense of orders across processes
Must define a meaningful semantics
11
Some Basic Definitions
Extend from definitions in uniprocessors to those in multiprocessors
Memory operation: a single read (load), write (store) or read-modify-write access to a memory location
• Assumed to execute atomically w.r.t each other
Issue: a memory operation issues when it leaves processor’s internal environment and is presented to memory system (cache, buffer …)
Perform: operation appears to have taken place, as far as processor can tell from other memory operations it issues
• A write performs w.r.t. the processor when a subsequent read by the processor returns the value of that write or a later write
• A read performs w.r.t. the processor when subsequent writes issued by the processor cannot affect the value returned by the read
In multiprocessors, stay same but replace “the” by “a” processor• Also, complete: perform with respect to all processors• Still need to make sense of order in operations from different processes
12
Sharpening the Intuition
Imagine a single shared memory and no caches• Every read and write to a location accesses the same physical location• Operation completes when it does so
Memory imposes a serial or total order on operations to the location• Operations to the location from a given processor are in program order• The order of operations to the location from different processors is some
interleaving that preserves the individual program orders
“Last” now means most recent in a hypothetical serial order that maintains these properties
For the serial order to be consistent, all processors must see writes to the location in the same order (if they bother to look, i.e. to read)
Note that the total order is never really constructed in real systems• Don’t even want memory, or any hardware, to see all operations
But program should behave as if some serial order is enforced• Order in which things appear to happen, not actually happen
13
Formal Definition of Coherence
Results of a program: values returned by its read operations
A memory system is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical serial order of all operations to the location that is consistent with the results of the execution and in which:
1. operations issued by any particular process occur in the order issued by that process, and
2. the value returned by a read is the value written by the last write to that location in the serial order
Two necessary features:• Write propagation: value written must become visible to others • Write serialization: writes to location seen in same order by all
– if I see w1 after w2, you should not see w2 before w1– no need for analogous read serialization since reads not visible to others
14
Cache Coherence Using a Bus
Built on top of two fundamentals of uniprocessor systems• Bus transactions• State transition diagram in cache
Uniprocessor bus transaction:• Three phases: arbitration, command/address, data transfer• All devices observe addresses, one is responsible
Uniprocessor cache states:• Effectively, every block is a finite state machine• Write-through, write no-allocate has two states: valid, invalid• Writeback caches have one more state: modified (“dirty”)
Multiprocessors extend both these somewhat to implement coherence
15
Snooping-based Coherence
Basic Idea
Transactions on bus are visible to all processors
Processors or their representatives can snoop (monitor) bus and take action on relevant events (e.g. change state)
Implementing a Protocol
Cache controller now receives inputs from both sides: • Requests from processor, bus requests/responses from snooper
In either case, takes zero or more actions• Updates state, responds with data, generates new bus transactions
Protocol is distributed algorithm: cooperating state machines• Set of states, state transition diagram, actions
Granularity of coherence is typically cache block• Like that of allocation in cache and transfer to/from cache
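As a concrete illustration of these cooperating state machines, here is a minimal MSI-style sketch in C: one handler for processor-side events and one for snooped bus events, each updating per-block state and possibly generating a bus transaction. The state, event, and function names are illustrative assumptions, not from any particular machine.

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } BlockState;
typedef enum { PR_RD, PR_WR } ProcEvent;             /* processor-side inputs */
typedef enum { BUS_NONE, BUS_RD, BUS_RDX } BusXact;  /* bus transactions */

/* Processor side: may generate a new bus transaction as a side effect. */
BusXact on_processor(BlockState *s, ProcEvent e) {
    switch (*s) {
    case INVALID:
        *s = (e == PR_RD) ? SHARED : MODIFIED;
        return (e == PR_RD) ? BUS_RD : BUS_RDX;
    case SHARED:
        if (e == PR_WR) { *s = MODIFIED; return BUS_RDX; }
        return BUS_NONE;                               /* read hit */
    case MODIFIED:
        return BUS_NONE;                               /* read or write hit */
    }
    return BUS_NONE;
}

/* Bus side (snooper): reacts to other caches' transactions for this block. */
void on_snoop(BlockState *s, BusXact x, int *must_flush) {
    *must_flush = (*s == MODIFIED);       /* dirty copy must be supplied / written back */
    if (x == BUS_RDX)
        *s = INVALID;                     /* another cache wants to write */
    else if (x == BUS_RD && *s == MODIFIED)
        *s = SHARED;
}

int main(void) {
    BlockState s = INVALID; int flush;
    BusXact x = on_processor(&s, PR_WR);  /* write miss: I -> M, issue BusRdX */
    on_snoop(&s, BUS_RD, &flush);         /* another cache reads: M -> S, flush */
    printf("state=%d xact=%d flush=%d\n", s, x, flush);
    return 0;
}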
16
Memory Consistency
• Intuition not guaranteed by coherence• Sometimes expect memory to respect order between accesses to
different locations issued by a given process– to preserve orders among accesses to same location by different processes
• Coherence doesn’t help: pertains only to single location
Writes to a location become visible to all in the same order. But when does a write become visible?
• How to establish orders between a write and a read by different procs?
– Typically use event synchronization, by using more than one location
/* Assume initial value of A and flag is 0 */
P1:  A = 1;              P2:  while (flag == 0);  /* spin idly */
     flag = 1;                print A;
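The same flag idiom written with C11 atomics and POSIX threads, as a hedged sketch: a release store on flag paired with acquire loads is one way real systems make the intended order explicit, whereas plain shared variables give no such guarantee. The harness (thread functions, variable names) is illustrative, not part of the slides.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int A = 0;
atomic_int flag = 0;

void *p1(void *arg) {
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);   /* publish A */
    return NULL;
}

void *p2(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                     /* spin idly */
    printf("%d\n", A);                                        /* prints 1 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}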
17
Another Example of Orders
• What’s the intuition?• Whatever it is, we need an ordering model for clear semantics
– across different locations as well– so programmers can reason about what results are possible
• This is the memory consistency model
P1 P2
/*Assume initial values of A and B are 0*/
(1a) A = 1; (2a) print B;
(1b) B = 2; (2b) print A;
18
Memory Consistency Model
Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another
• What orders are preserved?• Given a load, constrains the possible values returned by it
Without it, can’t tell much about an SAS program’s execution
Implications for both programmer and system designer• Programmer uses to reason about correctness and possible results• System designer can use to constrain how much accesses can be
reordered by compiler or hardware
Contract between programmer and system
19
Sequential Consistency
• (as if there were no caches, and a single memory)• Total order achieved by interleaving accesses from different processes• Maintains program order, and memory operations, from all processes,
appear to [issue, execute, complete] atomically w.r.t. others• Programmer’s intuition is maintained
“A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport, 1979]
Processors issuing memory references as per program order
[Figure: processors P1 … Pn each issuing memory references in program order to a single shared memory through a switch; the "switch" is randomly set after each memory reference.]
20
What Really is Program Order?
Intuitively, order in which operations appear in source code• Straightforward translation of source code to assembly• At most one memory operation per instruction
But not the same as order presented to hardware by compiler
So which is program order?
Depends on which layer, and who’s doing the reasoning
We assume order as seen by programmer
21
SC Example
– possible outcomes for (A,B): (0,0), (1,0), (1,2); impossible under SC: (0,2)– we know 1a->1b and 2a->2b by program order– A = 0 implies 2b->1a, which implies 2a->1b– B = 2 implies 1b->2a, which leads to a contradiction
– BUT, actual execution 1b->1a->2b->2a is SC, despite not program order• appears just like 1a->1b->2a->2b as visible from results
– actual execution 1b->2a->2b->1a is not SC
What matters is the order in which operations appear to execute, not the order in which they actually execute
P1 P2
/*Assume initial values of A and B are 0*/
(1a) A = 1; (2a) print B;
(1b) B = 2; (2b) print A;
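A runnable version of this experiment, as a sketch (the pthread harness and iteration count are assumptions, not from the slides): with sequentially consistent atomics the outcome (B = 2, A = 0) is forbidden, matching the argument above; with relaxed ordering it may appear on machines or compilers that reorder the accesses.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int A, B;
int printedB, printedA;

void *writer(void *arg) {
    atomic_store(&A, 1);            /* (1a) */
    atomic_store(&B, 2);            /* (1b) */
    return NULL;
}

void *reader(void *arg) {
    printedB = atomic_load(&B);     /* (2a) */
    printedA = atomic_load(&A);     /* (2b) */
    return NULL;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&A, 0);
        atomic_store(&B, 0);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer, NULL);
        pthread_create(&t2, NULL, reader, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (printedB == 2 && printedA == 0)
            printf("non-SC outcome (0,2) observed\n");   /* never occurs under seq_cst */
    }
    return 0;
}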
22
Implementing SC
Two kinds of requirements• Program order
– memory operations issued by a process must appear to become visible (to others and itself) in program order
• Atomicity– in the overall total order, one memory operation should appear to complete
with respect to all processes before the next one is issued– needed to guarantee that total order is consistent across processes – tricky part is making writes atomic
23
Write Atomicity
Write atomicity: the position in the total order at which a write appears to perform should be the same for all processes
• Nothing a process does after it has seen the new value produced by a write W should be visible to other processes until they too have seen W
• In effect, extends write serialization to writes from multiple processes
• Transitivity implies A should print as 1 under SC
• Problem if P2 leaves loop, writes B, and P3 sees new B but old A (from its cache, say)

P1:  A = 1;      P2:  while (A == 0);      P3:  while (B == 0);
                      B = 1;                    print A;
24
More Formally
Each process’s program order imposes partial order on set of all operations
Interleaving of these partial orders defines a total order on all operations
Many total orders may be SC (SC does not define particular interleaving)
SC Execution: An execution of a program is SC if the results it produces are the same as those produced by some possible total order (interleaving)
SC System: A system is SC if any possible execution on that system is an SC execution
25
Sufficient Conditions for SC
• Every process issues memory operations in program order• After a write operation is issued, the issuing process waits for the write to
complete before issuing its next operation• After a read operation is issued, the issuing process waits for the read to
complete, and for the write whose value is being returned by the read to complete, before issuing its next operation (provides write atomicity)
Sufficient, not necessary, conditions
Clearly, compilers should not reorder for SC, but they do!• Loop transformations, register allocation (eliminates!)
Even if issued in order, hardware may violate for better performance• Write buffers, out of order execution
Reason: uniprocessors care only about dependences to same location• Makes the sufficient conditions very restrictive for performance
26
Our Treatment of Ordering
Assume for now that compiler does not reorder
Hardware needs mechanisms to detect:• Detect write completion (read completion is easy)• Ensure write atomicity
For all protocols and implementations, we will see• How they satisfy coherence, particularly write serialization• How they satisfy sufficient conditions for SC (write completion and
write atomicity)• How they can ensure SC but not through sufficient conditions
Will see that centralized bus interconnect makes it easier
27
MESI (4-state) Invalidation Protocol
Problem with MSI protocol• Reading and modifying data is 2 bus xactions, even if no one is sharing
– e.g. even in sequential program– BusRd (I->S) followed by BusRdX or BusUpgr (S->M)
Add exclusive state: write locally without xaction, but not modified• Main memory is up to date, so cache not necessarily owner• States
– invalid– exclusive or exclusive-clean (only this cache has copy, but not modified)– shared (two or more caches may have copies)– modified (dirty)
• I -> E on PrRd if no one else has a copy– needs “shared” signal on bus: wired-OR line asserted in response to BusRd
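A small sketch of the read-miss decision just described, in C: the snooping caches drive a wired-OR shared line, and the requester loads the block in E if no one asserts it, in S otherwise. The function and type names are illustrative assumptions.

#include <stdio.h>

typedef enum { M, E, S, I } MesiState;

/* Requester side: state to load the block in after a read miss. */
MesiState read_miss(int shared_line_asserted) {
    return shared_line_asserted ? S : E;
}

/* Snooper side: contribution to the wired-OR shared line, need to flush,
   and this cache's own next state when it sees the BusRd. */
MesiState snoop_bus_rd(MesiState cur, int *assert_shared, int *flush) {
    *assert_shared = (cur != I);   /* "I have a copy" */
    *flush         = (cur == M);   /* dirty copy must be supplied / written back */
    return (cur == I) ? I : S;     /* M, E, S all drop to S */
}

int main(void) {
    int shared, flush;
    MesiState other = snoop_bus_rd(E, &shared, &flush);  /* other cache held E */
    MesiState mine  = read_miss(shared);                 /* so requester loads in S */
    printf("other=%d mine=%d flush=%d\n", other, mine, flush);
    return 0;
}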
28
MESI State Transition Diagram
• BusRd(S) means shared line asserted on BusRd transaction• Flush’: if cache-to-cache sharing (see next), only one cache flushes data• MOESI protocol: Owned state: exclusive but memory not valid
[Figure: MESI state transition diagram over states M, E, S, I. Processor reads and writes (PrRd, PrWr) and snooped bus transactions (BusRd, BusRdX) drive the transitions; a read miss with the shared line asserted, BusRd(S), takes the block to S rather than E, and a block in M or E flushes its data when another cache's BusRd or BusRdX is observed.]
29
Lower-level Protocol Choices
Who supplies data on miss when not in M state: memory or cache
Original, Illinois MESI: cache, since assumed faster than memory• Cache-to-cache sharing
Not true in modern systems• Intervening in another cache more expensive than getting from memory
Cache-to-cache sharing also adds complexity• How does memory know it should supply data (must wait for caches)• Selection algorithm if multiple caches have valid data
But valuable for cache-coherent machines with distributed memory• May be cheaper to obtain from nearby cache than distant memory• Especially when constructed out of SMP nodes (Stanford DASH)
30
Synchronization
“A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Types of Synchronization• Mutual Exclusion• Event synchronization
– point-to-point– group– global (barriers)
31
History and Perspectives
Much debate over hardware primitives over the years
Conclusions depend on technology and machine style• speed vs flexibility
Most modern methods use a form of atomic read-modify-write• IBM 370: included atomic compare&swap for multiprogramming• x86: any instruction can be prefixed with a lock modifier• High-level language advocates want hardware locks/barriers
– but it goes against the “RISC” flow, and has other problems
• SPARC: atomic register-memory ops (swap, compare&swap)• MIPS, IBM Power: no atomic operations but pair of instructions
– load-locked, store-conditional– later used by PowerPC and DEC Alpha too
Rich set of tradeoffs
32
Components of a Synchronization Event
Acquire method• Acquire right to the synch (enter critical section, go past event)
Waiting algorithm• Wait for synch to become available when it isn’t
Release method• Enable other processors to acquire right to the synch
Waiting algorithm is independent of type of synchronization
33
Waiting Algorithms
Blocking• Waiting processes are descheduled• High overhead• Allows processor to do other things
Busy-waiting• Waiting processes repeatedly test a location until it changes value• Releasing process sets the location• Lower overhead, but consumes processor resources• Can cause network traffic
Busy-waiting better when• Scheduling overhead is larger than expected wait time• Processor resources are not needed for other tasks• Scheduler-based blocking is inappropriate (e.g. in OS kernel)
Hybrid methods: busy-wait a while, then block
34
Role of System and User
User wants to use high-level synchronization operations• Locks, barriers...• Doesn’t care about implementation
System designer: how much hardware support in implementation?• Speed versus cost and flexibility• Waiting algorithm difficult in hardware, so provide support for others
Popular trend:• System provides simple hardware primitives (atomic operations)• Software libraries implement lock, barrier algorithms using these• But some propose and implement full-hardware synchronization
35
Challenges
Same synchronization may have different needs at different times• Lock accessed with low or high contention• Different performance requirements: low latency or high throughput• Different algorithms best for each case, and need different primitives
Multiprogramming can change synchronization behavior and needs• Process scheduling and other resource interactions• May need more sophisticated algorithms, not so good in dedicated case
Rich area of software-hardware interactions• Which primitives available affects what algorithms can be used• Which algorithms are effective affects what primitives to provide
Need to evaluate using workloads
36
Mutual Exclusion: Hardware Locks
Separate lock lines on the bus: holder of a lock asserts the line• Priority mechanism for multiple requestors
Lock registers (Cray XMP)• Set of registers shared among processors
Inflexible, so not popular for general purpose use– few locks can be in use at a time (one per lock line)– hardwired waiting algorithm
Primarily used to provide atomicity for higher-level software locks
37
First Attempt at Simple Software Lock
lock: ld register, location /* copy location to register */
cmp location, #0 /* compare with 0 */
bnz lock /* if not 0, try again */
st location, #1 /* store 1 to mark it locked */
ret /* return control to caller */
and
unlock: st location, #0 /* write 0 to location */
ret /* return control to caller */
Problem: lock needs atomicity in its own implementation• Read (test) and write (set) of lock variable by a process not atomic
Solution: atomic read-modify-write or exchange instructions• atomically test value of location and set it to another value, return success
or failure somehow
38
Atomic Exchange Instruction
Specifies a location and register. In atomic operation:• Value in location read into a register• Another value (function of value read or not) stored into location
Many variants• Varying degrees of flexibility in second part
Simple example: test&set• Value in location read into a specified register• Constant 1 stored into location• Successful if value loaded into register is 0• Other constants could be used instead of 1 and 0
Can be used to build locks
39
Simple Test&Set Lock
lock: t&s register, location
bnz lock /* if not 0, try again */
ret /* return control to caller */
unlock: st location, #0 /* write 0 to location */
ret /* return control to caller */
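The same spin lock in C, using C11's atomic_flag, whose test_and_set is exactly the atomic exchange described above; a minimal sketch for illustration only (no backoff, and the busy-wait generates the traffic discussed on the next slides).

#include <stdatomic.h>

atomic_flag lock_var = ATOMIC_FLAG_INIT;

void lock(void) {
    while (atomic_flag_test_and_set(&lock_var))
        ;                              /* t&s returned 1: lock was held, keep spinning */
}

void unlock(void) {
    atomic_flag_clear(&lock_var);      /* store 0: release */
}

int main(void) {
    lock();
    /* critical section */
    unlock();
    return 0;
}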
Other read-modify-write primitives can be used too• Swap• Fetch&op• Compare&swap
– Three operands: location, register to compare with, register to swap with– Not commonly supported by RISC instruction sets
Can be cacheable or uncacheable (we assume cacheable)
40
T&S Lock Microbenchmark Performance
• Performance degrades because unsuccessful test&sets generate traffic
On SGI Challenge. Code: lock; delay(c); unlock; Same total no. of lock calls as p increases; measure time per transfer
[Figure: time (μs) vs. number of processors (up to 16) for Test&set (c = 0), Test&set with exponential backoff (c = 3.64 μs), Test&set with exponential backoff (c = 0), and Ideal.]
41
Enhancements to Simple Lock Algorithm
Reduce frequency of issuing test&sets while waiting• Test&set lock with backoff• Don’t back off too much or will be backed off when lock becomes free• Exponential backoff works quite well empirically: delay for the i-th attempt = k*c^i
Busy-wait with read operations rather than test&set• Test-and-test&set lock• Keep testing with ordinary load
– cached lock variable will be invalidated when release occurs
• When value changes (to 0), try to obtain lock with test&set– only one attemptor will succeed; others will fail and start testing again
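A sketch of the test-and-test&set lock with exponential backoff in C11 atomics: waiters spin with ordinary loads on the (cached) lock variable and issue the atomic exchange only when the lock looks free. The backoff constants and use of sched_yield are illustrative assumptions.

#include <stdatomic.h>
#include <sched.h>

atomic_int the_lock = 0;                           /* 0 = free, 1 = held */

void tts_lock(void) {
    int delay = 1;
    for (;;) {
        while (atomic_load(&the_lock) != 0)
            ;                                      /* test: read-only spin in cache */
        if (atomic_exchange(&the_lock, 1) == 0)
            return;                                /* test&set only when it looked free */
        for (int i = 0; i < delay; i++)            /* lost the race: back off */
            sched_yield();
        if (delay < 1024)
            delay *= 2;                            /* exponential backoff, capped */
    }
}

void tts_unlock(void) {
    atomic_store(&the_lock, 0);                    /* release: invalidates waiters' copies */
}

int main(void) {
    tts_lock();
    tts_unlock();
    return 0;
}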
42
Performance Criteria (T&S Lock)
Uncontended Latency• Very low if repeatedly accessed by same processor; indept. of p
Traffic• Lots if many processors compete; poor scaling with p• Each t&s generates invalidations, and all rush out again to t&s
Storage• Very small (single variable); independent of p
Fairness• Poor, can cause starvation
Test&set with backoff: similar, but less traffic
Test-and-test&set: slightly higher latency, much less traffic
But still all rush out to read miss and test&set on release
• Traffic for p processors to access once each: O(p^2)
Luckily, better hardware primitives as well as algorithms exist
43
Improved Hardware Primitives: LL-SC
Goals: • Test with reads• Failed read-modify-write attempts don’t generate invalidations• Nice if single primitive can implement range of r-m-w operations
Load-Locked (or -linked), Store-Conditional
LL reads variable into register
Follow with arbitrary instructions to manipulate its value
SC tries to store back to location if and only if no one else has written to the variable since this processor’s LL
• If SC succeeds, means all three steps happened atomically• If fails, doesn’t write or generate invalidations (need to retry LL)• Success indicated by condition codes; implementation later
44
Simple Lock with LL-SC
lock:   ll reg1, location   /* LL location to reg1 */
        sc location, reg2   /* SC reg2 into location */
        beqz reg2, lock     /* if failed, start again */
        ret
unlock: st location, #0     /* write 0 to location */
        ret
Can do more fancy atomic ops by changing what’s between LL & SC• But keep it small so SC likely to succeed• Don’t include instructions that would need to be undone (e.g. stores)
SC can fail (without putting transaction on bus) if:• Detects intervening write even before trying to get bus• Tries to get bus but another processor’s SC gets bus first
LL, SC are not lock, unlock respectively• Only guarantee no conflicting write to lock variable between them• But can use directly to implement simple operations on shared variables
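For example, a fetch&increment on a shared variable has the LL/SC shape: read, compute, conditionally store, retry on failure. Portable C has no LL/SC, so this sketch uses a compare-and-swap retry loop as a stand-in with the same structure; purely illustrative.

#include <stdatomic.h>

int fetch_and_inc(atomic_int *loc) {
    int old;
    do {
        old = atomic_load(loc);                    /* "LL": read current value */
    } while (!atomic_compare_exchange_weak(loc, &old, old + 1));
                                                   /* "SC": store iff unchanged */
    return old;
}

int main(void) {
    atomic_int x = 0;
    return fetch_and_inc(&x);                      /* returns 0, leaves x == 1 */
}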
45
More Efficient SW Locking Algorithms
Problem with Simple LL-SC lock
• No invals on failure, but read misses by all waiters after both release and successful SC by winner
• No test-and-test&set analog, but can use backoff to reduce burstiness• Doesn’t reduce traffic to minimum, and not a fair lock
Better SW algorithms for bus (for r-m-w instructions or LL-SC)• Only one process to try to get lock upon release
– valuable when using test&set instructions; LL-SC does it already
• Only one process to have read miss upon release– valuable with LL-SC too
• Ticket lock achieves first• Array-based queueing lock achieves both• Both are fair (FIFO) locks as well
46
Ticket Lock
Only one r-m-w (from only one processor) per acquire
Works like waiting line at deli or bank• Two counters per lock (next_ticket, now_serving)• Acquire: fetch&inc next_ticket; wait for now_serving to equal it
– atomic op when arrive at lock, not when it’s free (so less contention)
• Release: increment now-serving• FIFO order, low latency for low-contention if fetch&inc cacheable• Still O(p) read misses at release, since all spin on same variable
– like simple LL-SC lock, but no inval when SC succeeds, and fair
• Can be difficult to find a good amount to delay on backoff– exponential backoff not a good idea due to FIFO order– backoff proportional to now-serving - next-ticket may work well
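A ticket lock sketch in C11 atomics along these lines; the yield loop implementing proportional backoff is an illustrative assumption (see the note above about choosing delay amounts).

#include <stdatomic.h>
#include <sched.h>

typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);     /* one fetch&inc per acquire */
    for (;;) {
        unsigned serving = atomic_load(&l->now_serving);
        if (serving == my)
            return;                                          /* our turn, in FIFO order */
        for (unsigned i = 0; i < my - serving; i++)
            sched_yield();                                   /* backoff ~ queue position */
    }
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);                    /* wake the next waiter */
}

int main(void) {
    ticket_lock_t l = { 0, 0 };
    ticket_acquire(&l);
    ticket_release(&l);
    return 0;
}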
Wouldn’t it be nice to poll different locations ...
47
Array-based Queuing Locks
Waiting processes poll on different locations in an array of size p• Acquire
– fetch&inc to obtain address on which to spin (next array element)– ensure that these addresses are in different cache lines or memories
• Release– set next location in array, thus waking up process spinning on it
• O(1) traffic per acquire with coherent caches• FIFO ordering, as in ticket lock• But, O(p) space per lock• Good performance for bus-based machines• Not so great for non-cache-coherent machines with distributed memory
– array location I spin on not necessarily in my local memory (solution later)
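A sketch of such an array-based queueing lock in C11 atomics; MAXPROCS, the cache-line padding, and the slot layout are illustrative assumptions.

#include <stdatomic.h>

#define MAXPROCS 64

typedef struct {
    struct { atomic_int has_lock; char pad[60]; } slot[MAXPROCS];  /* ~one 64-byte line per slot */
    atomic_uint next_slot;
} queue_lock_t;

void qlock_init(queue_lock_t *l) {
    for (int i = 0; i < MAXPROCS; i++)
        atomic_store(&l->slot[i].has_lock, i == 0);    /* only slot 0 starts "unlocked" */
    atomic_store(&l->next_slot, 0);
}

unsigned qlock_acquire(queue_lock_t *l) {
    unsigned my = atomic_fetch_add(&l->next_slot, 1) % MAXPROCS;   /* fetch&inc: my slot */
    while (!atomic_load(&l->slot[my].has_lock))
        ;                                              /* spin on my own location only */
    return my;                                         /* caller passes this to release */
}

void qlock_release(queue_lock_t *l, unsigned my) {
    atomic_store(&l->slot[my].has_lock, 0);                     /* reset my slot */
    atomic_store(&l->slot[(my + 1) % MAXPROCS].has_lock, 1);    /* wake the next waiter */
}

int main(void) {
    static queue_lock_t l;
    qlock_init(&l);
    unsigned me = qlock_acquire(&l);
    qlock_release(&l, me);
    return 0;
}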
48
Lock Performance on SGI Challenge
• Simple LL-SC lock does best at small p due to unfairness– Not so with delay between unlock and next lock– Need to be careful with backoff
• Ticket lock with proportional backoff scales well, as does array lock• Methodologically challenging, and need to look at real workloads
Loop: lock; delay(c); unlock; delay(d);
[Figure: time (μs) vs. number of processors (up to 16) for Array-based, LL-SC, LL-SC with exponential backoff, Ticket, and Ticket with proportional backoff locks, in three cases: (a) Null (c = 0, d = 0), (b) Critical-section (c = 3.64 μs, d = 0), (c) Delay (c = 3.64 μs, d = 1.29 μs).]
49
Point to Point Event Synchronization
Software methods:• Interrupts• Busy-waiting: use ordinary variables as flags• Blocking: use semaphores
Full hardware support: full-empty bit with each word in memory• Set when word is “full” with newly produced data (i.e. when written)• Unset when word is “empty” due to being consumed (i.e. when read)• Natural for word-level producer-consumer synchronization
– producer: write if empty, set to full; consumer: read if full; set to empty
• Hardware preserves atomicity of bit manipulation with read or write• Problem: flexibility
– multiple consumers, or multiple writes before consumer reads?– needs language support to specify when to use– composite data structures?
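A software analogue of the full/empty idea using an ordinary flag, in the spirit of the busy-waiting method above: the producer fills the word and marks it full, the consumer waits for full, reads, and marks it empty. The release/acquire orderings and names are illustrative assumptions.

#include <stdatomic.h>

int datum;                        /* the "word" being produced/consumed */
atomic_int full = 0;              /* software full/empty flag */

void produce(int v) {
    while (atomic_load_explicit(&full, memory_order_acquire))
        ;                         /* wait until empty */
    datum = v;
    atomic_store_explicit(&full, 1, memory_order_release);    /* mark full */
}

int consume(void) {
    while (!atomic_load_explicit(&full, memory_order_acquire))
        ;                         /* wait until full */
    int v = datum;
    atomic_store_explicit(&full, 0, memory_order_release);    /* mark empty */
    return v;
}

int main(void) {
    produce(42);
    return consume() == 42 ? 0 : 1;
}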
50
Barriers
Software algorithms implemented using locks, flags, counters
Hardware barriers• Wired-AND line separate from address/data bus• Set input high when arrive, wait for output to be high to leave• In practice, multiple wires to allow reuse• Useful when barriers are global and very frequent• Difficult to support arbitrary subset of processors
– even harder with multiple processes per processor
• Difficult to dynamically change number and identity of participants– e.g. latter due to process migration
• Not common today on bus-based machines
Let’s look at software algorithms with simple hardware primitives
51
A Simple Centralized Barrier
Shared counter maintains number of processes that have arrived• increment when arrive (lock), check until reaches numprocs
struct bar_type {int counter; struct lock_type lock; int flag = 0;} bar_name;

BARRIER (bar_name, p) {
  LOCK(bar_name.lock);
  if (bar_name.counter == 0)
    bar_name.flag = 0;                 /* reset flag if first to reach */
  mycount = bar_name.counter++;        /* mycount is private */
  UNLOCK(bar_name.lock);
  if (mycount == p) {                  /* last to arrive */
    bar_name.counter = 0;              /* reset for next barrier */
    bar_name.flag = 1;                 /* release waiters */
  }
  else while (bar_name.flag == 0) {};  /* busy wait for release */
}
• Problem?
52
A Working Centralized Barrier
Consecutively entering the same barrier doesn’t work• Must prevent process from entering until all have left previous instance• Could use another counter, but increases latency and contention
Sense reversal: wait for flag to take different value consecutive times• Toggle this value only when all processes reach
BARRIER (bar_name, p) {
  local_sense = !(local_sense);        /* toggle private sense variable */
  LOCK(bar_name.lock);
  mycount = bar_name.counter++;        /* mycount is private */
  if (bar_name.counter == p) {         /* last to arrive */
    UNLOCK(bar_name.lock);
    bar_name.flag = local_sense;       /* release waiters */
  }
  else {
    UNLOCK(bar_name.lock);
    while (bar_name.flag != local_sense) {};
  }
}
53
Centralized Barrier Performance
Latency• Want short critical path in barrier• Centralized has critical path length at least proportional to p
Traffic• Barriers likely to be highly contended, so want traffic to scale well• About 3p bus transactions in centralized
Storage Cost• Very low: centralized counter and flag
Fairness• Same processor should not always be last to exit barrier• No such bias in centralized
Key problems for centralized barrier are latency and traffic• Especially with distributed memory, traffic goes to same node
54
Improved Barrier Algorithms for a Bus
Software combining tree
• Only k processors access the same location, where k is the degree of the tree
• Separate arrival and exit trees, and use sense reversal
• Valuable in distributed network: communicate along different paths
• On bus, all traffic goes on same bus, and no less total traffic
• Higher latency (log p steps of work, and O(p) serialized bus xactions)
• Advantage on bus is use of ordinary reads/writes instead of locks
[Figure: flat barrier (all processors contend on one location) vs. tree-structured barrier (little contention at each node).]
55
Barrier Performance on SGI Challenge
• Centralized does quite well– Will discuss fancier barrier algorithms for distributed machines
• Helpful hardware support: piggybacking of reads misses on bus– Also for spinning on highly contended locks
[Figure: barrier time (μs) vs. number of processors (1–8) for Centralized, Combining tree, Tournament, and Dissemination barriers on the SGI Challenge.]
56
Synchronization Summary
Rich interaction of hardware-software tradeoffs
Must evaluate hardware primitives and software algorithms together• primitives determine which algorithms perform well
Evaluation methodology is challenging• Use of delays, microbenchmarks• Should use both microbenchmarks and real workloads
Simple software algorithms with common hardware primitives do well on bus
• Will see more sophisticated techniques for distributed machines• Hardware support still subject of debate
Theoretical research argues for swap or compare&swap, not fetch&op• Algorithms that ensure constant-time access, but complex
Snoop-based Multiprocessor Design
58
Design Goals
Performance and cost depend on design and implementation too
Goals• Correctness• High Performance• Minimal Hardware
Often at odds• High Performance => multiple outstanding low-level events
=> more complex interactions
=> more potential correctness bugs
We’ll start simply and add concurrency to the design
59
Correctness Issues
Fulfil conditions for coherence and consistency• Write propagation, serialization; for SC: completion, atomicity
Deadlock: all system activity ceases• Cycle of resource dependences
Livelock: no processor makes forward progress although transactions are performed at hardware level
• e.g. simultaneous writes in invalidation-based protocol– each requests ownership, invalidating other, but loses it before winning
arbitration for the bus
Starvation: one or more processors make no forward progress while others do.
• e.g. interleaved memory system with NACK on bank busy• Often not completely eliminated (not likely, not catastrophic)
60
Base Cache Coherence Design
Single-level write-back cache
Invalidation protocol
One outstanding memory request per processor
Atomic memory bus transactions
• For BusRd, BusRdX: no intervening transactions allowed on bus between issuing address and receiving data
• BusWB: address and data simultaneous, and sunk by memory system before any new bus request
Atomic operations within process• One finishes before next in program order starts
Examine write serialization, completion, atomicity
Then add more concurrency/complexity and examine again
61
Some Design Issues
Design of cache controller and tags• Both processor and bus need to look up
How and when to present snoop results on bus
Dealing with write backs
Overall set of actions for memory operation not atomic• Can introduce race conditions
New issues deadlock, livelock, starvation, serialization, etc.
Implementing atomic operations (e.g. read-modify-write)
Let’s examine one by one ...
62
Cache Controller and Tags
Cache controller stages components of an operation• Itself a finite state machine (but not same as protocol state machine)
Uniprocessor: On a miss:• Assert request for bus• Wait for bus grant• Drive address and command lines• Wait for command to be accepted by relevant device• Transfer data
In snoop-based multiprocessor, cache controller must: • Monitor bus and processor
– Can view as two controllers: bus-side, and processor-side– With single-level cache: dual tags (not data) or dual-ported tag RAM
• must reconcile when updated, but usually only looked up
• Respond to bus transactions when necessary (multiprocessor-ready)
63
Reporting Snoop Results: How?
Collective response from caches must appear on bus
Example: in MESI protocol, need to know• Is block dirty; i.e. should memory respond or not?• Is block shared; i.e. transition to E or S state on read miss?
Three wired-OR signals• Shared: asserted if any cache has a copy• Dirty: asserted if some cache has a dirty copy
– needn’t know which, since it will do what’s necessary
• Snoop-valid: asserted when OK to check other two signals– actually inhibit until OK to check
Illinois MESI requires priority scheme for cache-to-cache transfers• Which cache should supply data when in shared state?• Commercial implementations allow memory to provide data
64
Reporting Snoop Results: When?
Memory needs to know what, if anything, to do
Fixed number of clocks from address appearing on bus• Dual tags required to reduce contention with processor• Still must be conservative (update both on write: E -> M)• Pentium Pro, HP servers, Sun Enterprise
Variable delay• Memory assumes cache will supply data till all say “sorry”• Less conservative, more flexible, more complex• Memory can fetch data and hold just in case (SGI Challenge)
Immediately: Bit-per-block in memory• Extra hardware complexity in commodity main memory system
65
Writebacks
To allow processor to continue quickly, want to service miss first and then process the write back caused by the miss asynchronously
• Need write-back buffer
• Must handle bus transactions relevant to buffered block
– snoop the WB buffer
[Figure: snoop-capable cache controller. A processor-side controller and a bus-side controller share the cache data RAM, a write-back buffer, and two sets of tags and state (one for the processor, one for snooping); comparators check snooped addresses against the tags and the write-back buffer, and snoop state is driven onto the system bus along with address/command and data.]
66
Non-Atomic State Transitions
Memory operation involves many actions by many entities, incl. bus
• Look up cache tags, bus arbitration, actions by other controllers, ...• Even if bus is atomic, overall set of actions is not• Can have race conditions among components of different operations
Suppose P1 and P2 attempt to write cached block A simultaneously• Each decides to issue BusUpgr to allow S –> M
Issues• Must handle requests for other blocks while waiting to acquire bus • Must handle requests for this block A
– e.g. if P2 wins, P1 must invalidate copy and modify request to BusRdX
67
Handling Non-atomicity: Transient States
• Increase complexity, so many seek to avoid– e.g. don’t use BusUpgr, rather other mechanisms to avoid data transfer
Two types of states•Stable (e.g. MESI)•Transient or Intermediate
[Figure: MESI state diagram extended with transient states I→M, S→M, and I→S,E. A processor request in I (or a write in S) first moves to a transient state and requests the bus (PrRd/BusReq, PrWr/BusReq); on BusGrant the controller issues the corresponding BusRd(S), BusRdX, or BusUpgr and completes the transition, while BusRd/BusRdX transactions snooped in the meantime are handled (e.g. forcing a flush or an upgrade of the pending request).]
68
Serialization
Processor-cache handshake must preserve serialization of bus order• e.g. on write to block in S state, mustn’t write data in block until
ownership is acquired.– other transactions that get bus before this one may seem to appear later
Write completion for SC: needn’t wait for inval to actually happen• Just wait till it gets bus (here, will happen before next bus xaction)• Commit versus complete• Don’t know when inval actually inserted in destination process’s local
order, only that it’s before next xaction and in same order for all procs• Local write hits become visible not before next bus transaction• Same argument will extend to more complex systems• What matters is not when written data gets on the bus (write back), but
when subsequent reads are guaranteed to see it
Write atomicity: if a read returns value of a write W, W has already gone to bus and therefore completed if it needed to
69
Deadlock, Livelock, Starvation
Request-reply protocols can lead to protocol-level, fetch deadlock• In addition to buffer deadlock discussed earlier• When attempting to issue requests, must service incoming transactions
– e.g. cache controller awaiting bus grant must snoop and even flush blocks– else may not respond to request that will release bus: deadlock
Livelock: many processors try to write same line. Each one:• Obtains exclusive ownership via bus transaction (assume not in cache)• Realizes block is in cache and tries to write it• Livelock: I obtain ownership, but you steal it before I can write, etc.• Solution: don’t let exclusive ownership be taken away before write
Starvation: solve by using fair arbitration on bus and FIFO buffers• May require too much buffering; if retries used, priorities as heuristics
70
Implementing Atomic Operations
Read-modify-write: read component and write component
• Cacheable variable, or perform read-modify-write at memory– cacheable has lower latency and bandwidth needs for self-reacquisition– also allows spinning in cache without generating traffic while waiting– at-memory has lower transfer time– usually traffic and latency considerations dominate, so use cacheable
• Natural to implement with two bus transactions: read and write– can lock down bus: okay for atomic bus, but not for split-transaction– get exclusive ownership, read-modify-write, only then allow others access– compare&swap more difficult in RISC machines: two registers+memory
71
Implementing LL-SC
Lock flag and lock address register at each processor
LL reads block, sets lock flag, puts block address in register
Incoming invalidations checked against address: if match, reset flag• Also if block is replaced and at context switches
SC checks lock flag as indicator of intervening conflicting write• If reset, fail; if not, succeed
Livelock considerations• Don’t allow replacement of lock variable between LL and SC
– split or set-assoc. cache, and don’t allow memory accesses between LL, SC– (also don’t allow reordering of accesses across LL or SC)
• Don’t allow failing SC to generate invalidations (not an ordinary write)
Performance: both LL and SC can miss in cache• Prefetch block in exclusive state at LL• But exclusive request reintroduces livelock possibility: use backoff
72
Multi-level Cache Hierarchies
How to snoop with multi-level caches?• independent bus snooping at every level?• maintain cache inclusion
Requirements for Inclusion• data in higher-level cache is subset of data in lower-level cache• modified in higher-level => marked modified in lower-level
Now only need to snoop lowest-level cache• If L2 says not present (modified), then not so in L1 too• If BusRd seen to block that is modified in L1, L2 itself knows this
Is inclusion automatically preserved• Replacements: all higher-level misses go to lower level• Modifications
73
Violations of Inclusion
The two caches (L1, L2) may choose to replace different block• Differences in reference history
– set-associative first-level cache with LRU replacement– example: blocks m1, m2, m3 fall in same set of L1 cache...
• Split higher-level caches– instruction, data blocks go in different caches at L1, but may collide in L2– what if L2 is set-associative?
• Differences in block size
But a common case works automatically• L1 direct-mapped, fewer sets than in L2, and block size same
74
Preserving Inclusion Explicitly
Propagate lower-level (L2) replacements to higher-level (L1)• Invalidate or flush (if dirty) messages
Propagate bus transactions from L2 to L1• Propagate all transactions, or use inclusion bits
Propagate modified state from L1 to L2 on writes?• Write-through L1, or modified-but-stale bit per block in L2 cache
Correctness issues altered?• Not really, if all propagation occurs correctly and is waited for• Writes commit when they reach the bus, acknowledged immediately• But performance problems, so want to not wait for propagation• Discuss after split-transaction busses
Dual cache tags less important: each cache is filter for other
75
Split-Transaction Bus
[Figure: timeline contrasting an atomic bus, where the address/command phase is held through the memory access delay until the data returns, with a split-transaction bus, where bus arbitration, the address/command phase, and the later data phase of each transaction are separate and other transactions can intervene during the memory access delay.]
Split bus transaction into request and response sub-transactions• Separate arbitration for each phase
Other transactions may intervene• Improves bandwidth dramatically• Response is matched to request• Buffering between bus and cache controllers
Reduce serialization down to the actual bus arbitration
76
Complications
New request can appear on bus before previous one serviced• Even before snoop result obtained• Conflicting operations to same block may be outstanding on bus• e.g. P1, P2 write block in S state at same time
– both get bus before either gets snoop result, so both think they’ve won
• Note: different from overall non-atomicity discussed earlier
Buffers are small, so may need flow control
Buffering implies revisiting snoop issues• When and how snoop results and data responses are provided• In order w.r.t. requests? (PPro, DEC Turbolaser: yes; SGI, Sun: no)• Snoop and data response together or separately?
– SGI together, SUN separately
Large space, much industry innovation: let’s look at one example first
77
Example (based on SGI Challenge)
No conflicting requests for same block allowed on bus• 8 outstanding requests total, makes conflict detection tractable
Flow-control through negative acknowledgement (NACK)• NACK as soon as request appears on bus, requestor retries• Separate command (incl. NACK) + address and tag + data buses
Responses may be in different order than requests• Order of transactions determined by requests• Snoop results presented on bus with response
Look at• Bus design, and how requests and responses are matched• Snoop results and handling conflicting requests• Flow control• Path of a request through the system
78
Bus Design and Req-Resp Matching
Essentially two separate buses, arbitrated independently• “Request” bus for command and address• “Response” bus for data
Out-of-order responses imply need for matching req-response• Request gets 3-bit tag when wins arbitration (8 outstanding max)• Response includes data as well as corresponding request tag• Tags allow response to not use address bus, leaving it free
Separate bus lines for arbitration, and for snoop results
79
Bus Design (continued)
Each of request and response phase is 5 bus cycles (best case)
• Response: 4 cycles for data (128 bytes, 256-bit bus), 1 turnaround• Request phase: arbitration, resolution, address, decode, ack• Request-response transaction takes 3 or more of these
Cache tags looked up in decode; extend ack cycle if not possible• Determine who will respond, if any• Actual response comes later, with re-arbitration
Write-backs have request phase only: arbitrate both data+addr buses
Upgrades have only request part; ack’ed by bus on grant (commit)
[Figure: pipelined request and response phases for two read operations: the address bus repeats Arb, Rslv, Addr, Dcd, Ack cycles; each read's address request and tag check occupy one such sequence, and after data arbitration the corresponding grant, tag, and four-cycle data transfer (D0–D3) appear on the data bus, overlapped with the next operation.]
80
Bus Design (continued)
Tracking outstanding requests and matching responses
• Eight-entry “request table” in each cache controller
• New request on bus added to all at same index, determined by tag
• Entry holds address, request type, state in that cache (if determined already), ...
• All entries checked on bus or processor accesses for match, so fully associative
• Entry freed when response appears, so tag can be reassigned by bus
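A sketch of such a request table as a C data structure; the field names and widths are illustrative assumptions, not the actual Challenge hardware format.

#include <stdint.h>
#include <stdbool.h>

#define NUM_TAGS 8                /* 3-bit tag => 8 outstanding requests */

typedef struct {
    bool     valid;               /* entry freed when the response appears */
    uint64_t block_addr;          /* address of the outstanding request */
    uint8_t  req_type;            /* BusRd, BusRdX, BusUpgr, writeback, ... */
    uint8_t  my_state;            /* this cache's state for the block, if known */
    bool     i_am_originator;     /* did this controller issue the request? */
    bool     grab_response;       /* non-originator also wants the data */
} request_entry_t;

typedef struct {
    request_entry_t entry[NUM_TAGS];      /* indexed by the request's bus tag */
} request_table_t;

/* Checked fully associatively: every bus or processor access compares its
   block address against all valid entries. */
int match(const request_table_t *t, uint64_t block_addr) {
    for (int tag = 0; tag < NUM_TAGS; tag++)
        if (t->entry[tag].valid && t->entry[tag].block_addr == block_addr)
            return tag;
    return -1;                            /* no conflicting outstanding request */
}

int main(void) {
    static request_table_t t;
    t.entry[3] = (request_entry_t){ .valid = true, .block_addr = 0x1000 };
    return match(&t, 0x1000) == 3 ? 0 : 1;
}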
81
Bus Interface with Request Table
[Figure: bus interface augmented with the eight-entry request table. Snooped address/command is compared against the request table, the write-back buffer, and the cache tags; each entry records the originator, the request, and this cache's response information, and a response queue matches responses arriving on the data + tag bus to table entries before issue/merge with the processor's own requests, write-backs, and responses.]
82
Snoop Results and Conflicting Requests
Variable-delay snooping
Shared, dirty and inhibit wired-OR lines, as before
Snoop results presented when response appears• Determined earlier, in request phase, and kept in request table entry• (Also determined who will respond)• Writebacks and upgrades don’t have data response or snoop result
Avoiding conflicting requests on bus • easy: don’t issue request for conflicting request that is in request table
Recall writes committed when request gets bus
83
Flow Control
Not just at incoming buffers from bus to cache controller
Cache system’s buffer for responses to its requests• Controller limits number of outstanding requests, so easy
Mainly needed at main memory in this design• Each of the 8 transactions can generate a writeback• Can happen in quick succession (no response needed)• SGI Challenge: separate NACK lines for address and data buses
– Asserted before ack phase of request (response) cycle is done– Request (response) cancelled everywhere, and retries later– Backoff and priorities to reduce traffic and starvation
• SUN Enterprise: destination initiates retry when it has a free buffer– source keeps watch for this retry– guaranteed space will still be there, so only two “tries” needed at most
84
Handling a Read Miss
Need to issue BusRd
First check request table. If hit:• If prior request exists for same block, want to grab data too!
– “want to grab response” bit– “original requestor” bit
• non-original grabber must assert sharing line so others will load in S rather than E state
• If prior request incompatible with BusRd (e.g. BusRdX)– wait for it to complete and retry (processor-side controller)
• If no prior request, issue request and watch out for race conditions– conflicting request may win arbitration before this one, but this one
receives bus grant before conflict is apparent• watch for conflicting request in slot before own, degrade request to
“no action” and withdraw till conflicting request satisfied
85
Upon Issuing the BusRd Request
All processors enter request into table, snoop for request in cache
Memory starts fetching block
1. Cache with dirty block responds before memory ready• Memory aborts on seeing response• Waiters grab data
– some may assert inhibit to extend response phase till done snooping– memory must accept response as WB (might even have to NACK)
2. Memory responds before cache with dirty block• Cache with dirty block asserts inhibit line till done with snoop• When done, asserts dirty, causing memory to cancel response• Cache with dirty issues response, arbitrating for bus
3. No dirty block: memory responds when inhibit line released• Assume cache-to-cache sharing not used (for non-modified data)
86
Handling a Write Miss
Similar to read miss, except:• Generate BusRdX• Main memory does not sink response since will be modified again• No other processor can grab the data
If block present in shared state, issue BusUpgr instead• No response needed• If another processor was going to issue BusUpgr, changes to BusRdX
as with atomic bus
87
Write Serialization
With split-transaction buses, usually bus order is determined by order of requests appearing on bus
• actually, the ack phase, since requests may be NACKed• by end of this phase, they are committed for visibility in order
A write that follows a read transaction to the same location should not be able to affect the value returned by that read
• Easy in this case, since conflicting requests not allowed• Read response precedes write request on bus
Similarly, a read that follows a write transaction won’t return old value
88
Detecting Write Completion
Problem: invalidations don’t happen as soon as request appears on bus• They’re buffered between bus and cache• Commitment does not imply performing or completion• Need additional mechanisms
Key property to preserve: processor shouldn’t see new value produced by a write before previous writes in bus order are visible to it
1. Don’t let certain types of incoming transactions be reordered in buffers– in particular, data reply should not overtake invalidation request – okay for invalidations to be reordered: only reply actually brings data in
2. Allow reordering in buffers, but ensure important orders preserved at key points
– e.g. flush incoming invalidations/updates from queues and apply before processor completes operation that may enable it to see a new value
89
Commitment of Writes (Operations)
More generally, distinguish between performing and commitment of a write w:
Performed w.r.t a processor: invalidation actually applied
Committed w.r.t a processor: guaranteed that once that processor sees the new value associated with W, any subsequent read by it will see new values of all writes that were committed w.r.t that processor before W.
Global bus serves as point of commitment, if buffers are FIFO• benefit of a serializing broadcast medium for interconnect
Note: acks from bus to processor must logically come via same FIFO• not via some special signal, since otherwise can violate ordering
90
Write Atomicity
Still provided naturally by broadcast nature of bus
Recall that bus implies:• writes commit in same order w.r.t. all processors• read cannot see value produced by write before write has committed
on bus and hence w.r.t. all processors
Previous techniques allow substitution of “complete” for “commit” in above statements
• that’s write atomicity
Will discuss deadlock, livelock, starvation after multilevel caches plus split transaction bus
91
Alternatives: In-order Responses
FIFO request table suffices
Dirty cache does not release inhibit line till it is ready to supply data• No deadlock problem since does not rely on anyone else
But performance problems possible at interleaved memory• Major motivation for allowing out-of-order responses
Allow conflicting requests more easily• Two BusRdX requests one after the other on bus for same block
– latter controller invalidates its block, as before– but earlier requestor sees later request before its own data response– with out-of-order response, not known which response will appear first– with in-order, known, and actually can use performance optimization– earlier controller responds to latter request by noting that latter is pending– when its response arrives, updates word, short-cuts block back on to bus,
invalidates its copy (reduces ping-pong latency)
92
Other Alternatives
Fixed delay from request to snoop result also makes it easier• Can have conflicting requests even if data responses not in order• e.g. SUN Enterprise
– 64-byte line and 256-bit bus => 2 cycle data transfer– so 2-cycle request phase used too, for uniform pipelines– too little time to snoop and extend request phase– snoop results presented 5 cycles after address (unless inhibited)– by later data response arrival, conflicting requestors know what to do
Don’t even need request to go on same bus, as long as order is well-defined
• SUN SparcCenter2000 had 2 busses, Cray 6400 had 4• Multiple requests go on bus in same cycle• Priority order established among them is logical order
93
Multi-Level Caches with ST Bus
Introduces deadlock and serialization problems
Key new problem: many cycles to propagate through hierarchy• Must let others propagate too for bandwidth, so queues between levels
[Figure: a processor with L1 and L2 caches attached to the bus, with queues between the processor and L1, between L1 and L2 (responses/requests in both directions), and between L2 and the bus, numbered to show the path a request and its response take through the hierarchy.]
94
Deadlock Considerations
Fetch deadlock:• Must buffer incoming requests/responses while request outstanding• One outstanding request per processor => need space to hold p requests
plus one reply (latter is essential)• If smaller (or if multiple o/s requests), may need to NACK• Then need priority mechanism in bus arbiter to ensure progress
Buffer deadlock:• L1 to L2 queue filled with read requests, waiting for response from L2• L2 to L1 queue filled with bus requests waiting for response from L1• Latter condition only when cache closer than lowest level is write back• Could provide enough buffering, or general solutions discussed later
If # o/s bus transactions smaller than total o/s cache misses, response from cache must get bus before new requests from it allowed
• Queues may need to support bypassing
95
Sequential Consistency
Separation of commitment from completion even greater now• More performance-critical that commitment replace completion
Fortunately techniques for single-level cache and ST bus extend• Just use them at each level• i.e. either don’t allow certain reorderings of transactions at any level• Or don’t let outgoing operation proceed past level before incoming
invalidations/updates at that level are applied
96
Multiple Outstanding Processor Requests
So far assumed only one: not true of modern processors
Danger: operations from same processor can complete out of order• e.g. write buffer: until serialized by bus, should not be visible to others• Uniprocessors use write buffer to insert multiple writes in succession
– multiprocessors usually can’t do this while ensuring consistent serialization– exception: writes are to same block, and no intervening ops in program order
Key question: who should wait to issue next op till previous completes• Key to high performance: processor needn’t do it (so can overlap)• Queues/buffers/controllers can ensure writes not visible to external world and
reads don’t complete (even if back) until allowed (more later)
Other requirement: caches must be lockup free to be effective• Merge operations to a block, so rest of system sees only one o/s to block
All needed mechanisms for correctness available (deeper queues for performance)
97
Case Studies of Bus-based Machines
SGI Challenge, with Powerpath bus
SUN Enterprise, with Gigaplane bus• Take very different positions on the design issues discussed above
Overview
For each system:• Bus design• Processor and Memory System• Input/Output system• Microbenchmark memory access results
Application performance and scaling (SGI Challenge)
98
SGI Challenge Overview
36 MIPS R4400 (peak 2.7 GFLOPS, 4 per board) or 18 MIPS R8000 (peak 5.4 GFLOPS, 2 per board)
8-way interleaved memory (up to 16 GB)
4 I/O busses of 320 MB/s each
1.2 GB/s Powerpath-2 bus @ 47.6 MHz, 16 slots, 329 signals
128-byte lines (1 + 4 cycles)
Split-transaction with up to 8 outstanding reads
• all transactions take five cycles
[Figure: (a) a four-processor board with R4400 CPUs and caches; (b) machine organization: the Powerpath-2 bus (256 data, 40 address, 47.6 MHz) connecting processor boards, interleaved memory (16 GB maximum), and an I/O subsystem with VME-64, SCSI-2, graphics, and HPPI interfaces.]
99
SUN Enterprise Overview
Up to 30 UltraSPARC processors (peak 9 GFLOPs)
Gigaplane™ bus has peak bandwidth 2.67 GB/s; up to 30 GB memory
16 bus slots, for processing or I/O boards • 2 CPUs and 1GB memory per board
– memory distributed, unlike Challenge, but protocol treats as centralized
• Each I/O board has 2 64-bit 25 MHz SBUSes
[Figure: Gigaplane™ bus (256 data, 41 address, 83 MHz) connecting CPU/memory cards (each with two processors with L2 caches and a memory controller) and I/O cards, each through a bus interface/switch.]
100
Bus Design Issues
Multiplexed versus non-multiplexed (separate addr and data lines)
Wide versus narrow data busses
Bus clock rate• Affected by signaling technology, length, number of slots...
Split transaction versus atomic
Flow control strategy
101
SGI Powerpath-2 Bus
Non-multiplexed, 256-data/40-address, 47.6 MHz, 8 o/s requests
Wide => more interface chips so higher latency, but more bw at slower clock
Large block size also calls for wider bus
Uses Illinois MESI protocol (cache-to-cache sharing)
More detail in chapter
Five bus phases: 1. Arbitration, 2. Resolution, 3. Address, 4. Decode, 5. Acknowledge
[Figure: the bus cycles through these five phases whenever at least one requestor is present; with no requestors it stays in arbitration.]
102
Bus Timing
[Figure: Powerpath-2 bus timing. The address bus repeats the Arb, Rslv, Addr, Decode, Ack sequence; command/address, address ack, state, data arbitration, data-resource ID, and inhibit lines are overlaid on the same cycles, and the data bus carries the four-cycle transfers (D0–D3) that follow each granted data arbitration.]
103
Processor and Memory Systems
4 MIPS R4400 processors per board share A and D chips
A chip has address bus interface, request table, control logic
CC chip per processor has duplicate set of tags
Processor requests go from CC chip to A chip to bus
4 bit-sliced D chips interface CC chip to bus
[Figure: board organization. Four MIPS R4400 processors, each with an L2 cache and a CC chip holding duplicate tags, connect through the A chip (address) and four bit-sliced D chips (data) to the Powerpath-2 bus.]
104
Memory Access Latency
250ns access time from address on bus to data on bus
But overall latency seen by processor is 1000ns!• 300 ns for request to get from processor to bus
– down through cache hierarchy, CC chip and A chip
• 400ns later, data gets to D chips– 3 bus cycles to address phase of request transaction, 12 to access main
memory, 5 to deliver data across bus to D chips
• 300ns more for data to get to processor chip– up through D chips, CC chip, and 64-bit wide interface to processor
chip, load data into primary cache, restart pipeline
105
Challenge I/O Subsystem
Multiple I/O cards on system bus, each has 320MB/s HIO bus• Personality ASICs connect these to devices (standard and graphics)
Proprietary HIO bus• 64-bit multiplexed address/data, same clock as system bus• Split read transactions, up to 4 per device• Pipelined, but centralized arbitration, with several transaction lengths• Address translation via mapping RAM in system bus interface
Why the decouplings? (Why not connect directly to system bus?)
I/O board acts like a processor to memory system
[Figure: Challenge I/O board. A system-bus-to-HIO-bus interface (address map, address datapath) connects the system address bus and 1.2 GB/s system data bus to the 320 MB/s HIO bus, on which personality ASICs attach HIO peripherals: SCSI, VME, HPPI, and graphics.]
106
Challenge Memory System Performance
Read microbenchmark with various strides and array sizes
Ping-pong flag-spinning microbenchmark: round-trip time 6.2 μs.
[Figure: read time (ns) vs. stride (4 bytes to 4 MB) for array sizes from 16 KB to 8 MB, with plateaus corresponding to L2 misses, memory (MEM) accesses, and TLB misses.]
107
Sun Gigaplane Bus
Non-multiplexed, split-transaction, 256-data/41-address, 83.5 MHz• Plus 32 ECC lines, 7 tag, 18 arbitration, etc. Total 388.
Cards plug in on both sides: 8 per side
112 outstanding transactions, up to 7 from each board• Designed for multiple outstanding transactions per processor
Emphasis on reducing latency, unlike Challenge• Speculative arbitration if address bus not scheduled from prev. cycle• Else regular 1-cycle arbitration, and 7-bit tag assigned in next cycle
Snoop result associated with request phase (5 cycles later)
Main memory can stake claim to data bus 3 cycles into this, and start memory access speculatively
• Two cycles later, asserts tag bus to inform others of coming transfer
MOESI protocol (owned state for cache-to-cache sharing)
108
Gigaplane Bus Timing
[Figure: Gigaplane bus timing over cycles 0–14, with arbitration, address (Rd A, then Rd B), state (share/~own vs. own), tag, status (OK vs. cancel), and data (D0, D1) lines; address and data slots alternate each cycle, and a speculative memory response is cancelled when a cache asserts ownership.]
109
Enterprise Processor and Memory System
2 procs per board, external L2 caches, 2 mem banks with x-bar
Data lines buffered through UDB to drive internal 1.3 GB/s UPA bus
Wide path to memory so full 64-byte line in 1 mem cycle (2 bus cyc)
Addr controller adapts proc and bus protocols, does cache coherence
• its tags keep a subset of states needed by bus (e.g. no M/E distinction)
[Figure: Enterprise CPU/memory board and I/O board. Two UltraSPARC processors with L2 cache tags and UDBs, an address controller (with D-tags), and a data controller (crossbar) connect to memory (16 72-bit SIMMs) and to the Gigaplane connector (control, address, 288-bit data); the I/O board uses the same address and data controllers and two SysIO ASICs driving 64-bit 25 MHz SBUS slots, fast wide SCSI, 10/100 Ethernet, and two FiberChannel modules.]
110
Enterprise I/O System
I/O board has same bus interface ASICs as processor boards
But internal bus half as wide, and no memory path
Only cache block sized transactions, like processing boards• Uniformity simplifies design• ASICs implement single-block cache, follows coherence protocol
Two independent 64-bit, 25 MHz Sbuses• One for two dedicated FiberChannel modules connected to disk• One for Ethernet and fast wide SCSI• Can also support three SBUS interface cards for arbitrary peripherals
Performance and cost of I/O scale with no. of I/O boards
111
Memory Access Latency
300 ns read miss latency
11-cycle minimum bus protocol at 83.5 MHz is 130 ns of this time
Rest is path through caches and the DRAM access
TLB misses add 340 ns
[Figure: read time (ns) vs. stride (4 bytes to 4 MB) for array sizes from 16 KB to 8 MB on the Enterprise.]
Ping-pong microbenchmark is 1.7 μs round-trip (5 mem accesses)
112
Application Speedups (Challenge)
• Problem in Ocean with small problem: communication and barrier cost• Problem in Radix: contention on bus due to very high traffic
– also leads to high imbalances and barrier wait time
[Figure: speedup vs. number of processors (1–16) on the Challenge for Barnes-Hut (16 K and 512 K particles), Ocean (n = 130 and n = 1,024), Radix (1 M and 4 M keys), LU (n = 1,024 and n = 2,048), Raytrace (balls and car), and Radiosity (room and large room).]
113
Application Scaling under Other Models
[Figure: scaling under different models (Naive TC, Naive MC, TC, MC, PC): work in instructions, number of bodies, and number of points per grid vs. number of processors (1–15), together with the resulting speedups.]