Shared Memory Multiprocessors
A. Jantsch / Z. Lu / I. Sander
Outline
Shared memory architectures: centralized memory, distributed memory
Shared memory programming: critical sections, mutexes and semaphores
Caches: write-through / write-back caches, the cache coherency problem
April 20, 2023 SoC Architecture 2
Shared Memory Architectures
Shared Memory Architectures
Shared memory multiprocessors come in two main classes:
Symmetric Multiprocessors (SMP)
  Symmetric access to all of main memory from any processor
  Also called UMA (uniform memory access)
Distributed Shared Memory (DSM)
  Access time depends on the location of the data word in memory
  Also called NUMA (non-uniform memory access)
Shared Memory Architectures
A shared memory programming model has a direct representation in hardware
Caches
  Increase performance
  Demand cache coherence and memory consistency protocols
Shared Cache Architecture
Several processors are connected via a switch to a shared cache in front of main memory
Has been used for a very small number of processors
Is difficult to use for a large number of processors, since the shared cache must deliver an extremely high bandwidth
[Figure: processors P1 … Pm connected through a switch to a shared cache backed by main memory]
Bus-based Shared Memory
The interconnect is a shared bus between the processors' local caches and the memory
Has been used for up to 20-30 processors
Scaling is limited due to the bandwidth limitations of the shared bus
[Figure: processors P1 … Pm, each with a private cache, connected by a shared bus to main memory]
Dancehall Architecture
A scalable point-to-point network is placed between the caches and the memory modules that together form the main memory
Due to the size of the interconnection network, the memory can be very far from the processors
[Figure: processors P1 … Pm, each with a cache, connected through an interconnection network to the memory modules]
Distributed Memory
Not a symmetric approach: the local memory is much closer than the rest of the global memory.
The structure scales very well.
It is important in the design to use the local memory efficiently.
[Figure: processors P1 … Pm, each with a cache and a local memory, connected through an interconnection network]
ARM’s Cortex-A9
[Figure: Cortex-A9 MPCore block diagram. Four Cortex-A9 CPUs, each with FPU/NEON, instruction cache, data cache, and PTM interface, connect to the Snoop Control Unit (SCU), which provides cache-to-cache transfers and snoop filtering. Surrounding blocks: generalized interrupt control and distribution, timers, the Accelerator Coherence Port (ACP, AXI-3 slave), the advanced bus interface unit, the L2 cache controller (PL310) with two ACP AXI-3 masters, the primary AMBA 3 64-bit interface, an optional 2nd interface with address filtering, and 64-bit non-cached AXI-3 masters for H/W accelerators.]
Three AXI buses on the Cortex-A9:
  The ACP port provides coherent access to the L1 and L2 caches through the Snoop Control Unit
  Two AXI masters off of the L2 cache provide 8 GB/s at 500 MHz access to the main SoC bus
Tilera Gx Architecture
4x4, 6x6, 8x8, 10x10 chips
3 instructions per cycle per core
32 MB on-chip cache
750 GOPS (32-bit operations)
200 Tbps on-chip interconnect bandwidth
500 Gbps memory bandwidth
~1 GHz operating frequency
10 W – 55 W power consumption
5 mesh networks (32 bit; dimension-order routing; 1-cycle traversal):
  QDN: Request Dynamic Network, 64-bit, for memory requests
  RDN: Response Dynamic Network, 112-bit, for memory responses
  TDN: Tile Dynamic Network, 128-bit, for cache-to-cache communication
  UDN: User Dynamic Network, 32-bit, under SW control
  IDN: I/O Dynamic Network, 32-bit, for non-memory I/O
Shared Memory Programming
Process and history
A process executes a sequence of statements. Each statement consists of one or more atomic (indivisible) actions which transform one state into another (state transition).
The process state is formed by the values of the variables at a point in time.
The process history is a trace of one execution: a sequence of atomic operations.
Example
P1: S0 → S1 → S2 → … → Sm, via atomic actions A1, A2, …, Am
Atomic Operations
Indivisible sequences of state transitions
Fine-grained atomic operations
  Machine instructions (read, write, test-and-set, read-modify-write, swap, etc.)
  Atomicity is guaranteed by hardware
Coarse-grained atomic actions
  A sequence of fine-grained atomic actions
  Should not be interrupted
  Internal state transitions are not visible "outside"
Concurrent execution
The concurrent execution of multiple processes can be viewed as the interleaving of their sequences of atomic actions. A history is a trace of ONE execution, i.e., an interleaving of atomic actions of processes.
Example
Individual histories
  P1: s0 → s1
  P2: p0 → p1
Interleaved execution histories
  Trace 1: s0→p0→s1→p1
  Trace 2: s0→s1→p0→p1
How many traces?
A concurrent program of n processes, each with m atomic actions, can produce N = (nm)! / (m!)^n different histories!
Example
  2 processes, each with 2 actions (n=2, m=2): N=6
  3 processes, each with 2 actions (n=3, m=2): N=90
  3 processes, each with 3 actions (n=3, m=3): N=1680
Implication
  This makes it impossible to show the correctness of a program by testing (run the program and see what happens). Design a "correct" program in the first place. For shared-variable programming, the problems are concerned with accessing shared variables; therefore a key issue is process synchronization.
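The history-count formula can be sanity-checked with a short script (a minimal sketch; the helper name `num_histories` is introduced here, not part of the slides):

```python
from math import factorial

def num_histories(n: int, m: int) -> int:
    """Histories of n processes with m atomic actions each: (n*m)! / (m!)^n."""
    return factorial(n * m) // factorial(m) ** n

print(num_histories(2, 2))  # 6
print(num_histories(3, 2))  # 90
print(num_histories(3, 3))  # 1680
```

The quotient is always an integer because it is a multinomial coefficient, so integer division is exact here.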
Concurrent Execution Example
Possible Results: 0, 1, 2, 3, Undefined
Task A: x:=0; y:=0; Print(x+y);
Task B: x:=x+1; y:=y+2;
Concurrent Execution Example
Interleaving 1: A: x:=0; y:=0; Print(x+y); then B: x:=x+1; y:=y+2;
  Result: 0
Interleaving 2: A: x:=0; y:=0; then B: x:=x+1; y:=y+2; then A: Print(x+y);
  Result: 3
Interleaving 3: A: x:=0; then B: x:=x+1; then A: y:=0; Print(x+y); then B: y:=y+2;
  Result: 1
Concurrent Execution Example
Interleaving 4: B: x:=x+1; then A: x:=0; y:=0; then B: y:=y+2; then A: Print(x+y);
  Result: 2
Interleaving 5: Task B's statements are expanded into several machine instructions:
  Task B: R1:=x; x:=R1+1; R2:=y; y:=R2+2;
  If Task A's statements interleave between B's loads and stores, updates can be lost and the result is undefined.
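The set of possible printed values can be enumerated mechanically. The sketch below treats each statement as atomic and assumes x and y start at 0, so the "undefined" case from the register-level expansion does not arise; the action encoding and the `run` helper are invented for illustration:

```python
from itertools import permutations

# Atomic statements as (task, index) pairs.
# Task A: x:=0; y:=0; Print(x+y)   Task B: x:=x+1; y:=y+2
A = [("A", 0), ("A", 1), ("A", 2)]
B = [("B", 0), ("B", 1)]

def run(order):
    """Execute one interleaving; return the value Print would output."""
    x = y = 0
    printed = None
    for task, i in order:
        if task == "A":
            if i == 0: x = 0
            elif i == 1: y = 0
            else: printed = x + y
        else:
            if i == 0: x = x + 1
            else: y = y + 2
    return printed

# Valid histories = permutations of all 5 actions that preserve each
# task's own program order.
results = set()
for perm in permutations(A + B):
    if [a for a in perm if a[0] == "A"] == A and [b for b in perm if b[0] == "B"] == B:
        results.add(run(perm))
print(sorted(results))  # [0, 1, 2, 3]
```

With n=2 and m in {2, 3} this enumerates the 10 legal histories and recovers exactly the results 0, 1, 2, 3 listed on the slide.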
Synchronization
Synchronization constrains the possible histories to the desirable (good) histories.
Synchronization methods
  Mutual exclusion (mutex)
    Exclusive access to shared variables within a critical section
    A mechanism that guarantees serialization of critical sections (atomicity of critical sections with respect to each other)
  Condition synchronization
    Delaying a process until the state satisfies a boolean condition
Lesson learnt: synchronization is required whenever processes read and write shared variables, to preserve data dependencies.
Concurrent Execution Example
Result: 3, provided that
  a. both tasks behave sequentially, and
  b. writes are seen in the same order by both tasks
Task A: x:=0; y:=0; S1:=1; while (S2==0); Print(x+y);
Task B: while (S1==0); x:=x+1; y:=y+2; S2:=1;
Initially: S1:=0; S2:=0;
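The flag synchronization above can be sketched with Python's threading.Event in place of busy-waiting on S1 and S2 (a sketch, not the slides' pseudocode language; Event.set/wait provide the ordering guarantee that assumption (b) demands):

```python
import threading

x = y = 0
output = []
s1 = threading.Event()   # plays the role of flag S1
s2 = threading.Event()   # plays the role of flag S2

def task_a():
    global x, y
    x = 0
    y = 0
    s1.set()              # S1 := 1
    s2.wait()             # while (S2 == 0);
    output.append(x + y)  # Print(x+y)

def task_b():
    global x, y
    s1.wait()             # while (S1 == 0);
    x = x + 1
    y = y + 2
    s2.set()              # S2 := 1

threads = [threading.Thread(target=f) for f in (task_a, task_b)]
for t in threads: t.start()
for t in threads: t.join()
print(output)  # [3]
```

Because the events force B's updates to happen after A's initializations and before A's print, the only possible result is 3.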
Critical section
CS: a piece of code that can only be executed by one process at a time
  Provides mutually exclusive access to shared resources (sequences of statements accessing shared variables)
  Two sections are critical with respect to each other if they must not be executed simultaneously, i.e., mutually exclusive sections.
Some synchronization mechanism is required at the entry and exit of the CS to ensure exclusive use.
Critical Section Example
Task A:
  Do private operations;
  Critical section begin;
  Update shared state;
  Critical section end;
  Do private operations;
Task B:
  Do private operations;
  Critical section begin;
  Update shared state;
  Critical section end;
  Do private operations;
The Critical section problem
Design entry and exit protocols that satisfy the following properties:
  Mutual exclusion
    At most one process at a time is entering, executing, and exiting the critical section.
  Absence of deadlocks (livelocks)
    One of the competing processes succeeds in entering
    Termination: the CS should terminate in finite time
  Absence of unnecessary delay
    A process is not prevented from entering if others do not compete.
  Fairness (eventual entry, liveness)
    A process should eventually enter the CS.
Solutions
Locking mechanisms
  Lock on enter; unlock on exit
  Variants of locks: spin locks (busy-waiting), queueing locks, etc.
Semaphores
  A general solution to the synchronization problem, for both mutual exclusion and condition synchronization.
Lock
Enter CS: set the lock when it is cleared. < await (lock==false) lock := true >;
Exit CS: clear/release the lock
lock := false;
Synonyms: enter-exit, lock-unlock, acquire-release
Example:
bool lock = false;

process CS1 {
  while (true) {
    <await (lock==false) lock := true>;  // entry
    CS;
    lock := false;                       // exit
    non-critical section;
  }
}

process CS2 {
  while (true) {
    <await (lock==false) lock := true>;  // entry
    CS;
    lock := false;                       // exit
    non-critical section;
  }
}
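The entry/exit protocol above maps directly onto a lock in a real threading library. A minimal Python sketch using threading.Lock (the 4 threads and 10,000 iterations are just an illustrative workload):

```python
import threading

lock = threading.Lock()
counter = 0  # shared variable, updated only inside the critical section

def worker(n_iters: int) -> None:
    global counter
    for _ in range(n_iters):
        with lock:        # entry: acquire, blocks until the lock is free
            counter += 1  # critical section
        # exit: the lock is released when the with-block ends
        # non-critical section would go here

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 40000: no increments are lost
```

Without the lock, the read-modify-write `counter += 1` could interleave between threads and lose updates, exactly as in the "undefined" interleaving example earlier.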
Lock implementation
Lock/unlock in terms of instructions:
  Locking consists of several instructions
  Unlock is an ordinary store instruction.
To support the atomicity of locking, locks need hardware support, i.e., special atomic memory instructions. General semantics: <read location, test the value read, compute a new value, and store the new value to the location>
Many variants: read-modify-write, test&set, fetch&increment, swap, etc.

lock:   load  register, location  // copy location to register
        cmp   register, #0        // compare with 0
        bnz   lock                // if not 0, try again (spin)
        store location, #1        // store 1, marking locked
        ret
unlock: store location, #0
        ret

Note that the load-test-store sequence must execute atomically (e.g., as a single test&set instruction); otherwise two processors can both read 0 and both take the lock.
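The atomic read-modify-write semantics can be sketched in Python; the `Memory` class and its `test_and_set` method are illustrative stand-ins for the hardware primitive, not a real API:

```python
class Memory:
    """Toy memory with an atomic test&set, sketching the hardware primitive."""
    def __init__(self):
        self.cells = {}

    def test_and_set(self, loc):
        """Atomically read the old value and write 1; return the old value."""
        old = self.cells.get(loc, 0)
        self.cells[loc] = 1
        return old

    def store(self, loc, value):
        self.cells[loc] = value

mem = Memory()

def acquire(loc):
    # Spin until test&set returns 0, i.e. we are the one who set the lock.
    while mem.test_and_set(loc) != 0:
        pass

def try_acquire(loc):
    return mem.test_and_set(loc) == 0

def release(loc):
    mem.store(loc, 0)  # unlock is an ordinary store

acquire("L")            # first acquire succeeds immediately
r1 = try_acquire("L")   # False: the lock is already held
release("L")
r2 = try_acquire("L")   # True: the lock was free again
print(r1, r2)
```

Because test&set returns the old value and writes 1 in one indivisible step, only one contender can ever observe 0 and enter the critical section.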
Semaphore
A semaphore is a special kind of shared variable manipulated by two atomic operations, P and V.
Semaphores provide a low-level but efficient signaling mechanism for both mutual exclusion and condition synchronization.
Inspired by the railroad semaphore: an up/down signal flag.
The semaphore operations are named in Dutch:
  P (decrement when positive) stands for "proberen" (to test) or "passeren" (to pass)
  V (increment) stands for "verhogen" (to increase) or "vrijgeven" (to release)
Semaphore syntax and semantics
Declaration
  sem s = expr; // single semaphore
Initialization
  Defaults to 1
  The value of a semaphore is a non-negative integer
Operations
  P(s): <await (s>0) s := s-1;> // wait, down
  V(s): <s := s+1;> // signal, up
Semaphore types
Binary semaphore: takes only the values 0 and 1.
Split binary semaphore: a set of binary semaphores where at most one semaphore is 1 at a time; the sum of the semaphore values is in [0,1].
General (counting) semaphore: takes any non-negative integer value and can be used for condition synchronization; for example, it serves as a resource counter that counts the number of available resources.
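A counting semaphore as a resource counter can be sketched with Python's threading.Semaphore; the `in_use`/`max_in_use` counters are instrumentation added here for illustration:

```python
import threading

resources = threading.Semaphore(3)  # 3 identical resources available
in_use = 0
max_in_use = 0
guard = threading.Lock()            # protects the two counters

def use_resource():
    global in_use, max_in_use
    resources.acquire()             # P: wait until a resource is available
    with guard:
        in_use += 1
        max_in_use = max(max_in_use, in_use)
    # ... use the resource ...
    with guard:
        in_use -= 1
    resources.release()             # V: return the resource

threads = [threading.Thread(target=use_resource) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
print(max_in_use <= 3)  # True: never more than 3 holders at once
```

The semaphore's value counts the free resources; P blocks when the count reaches 0, which is exactly the condition synchronization described above.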
Mutex semaphore
A CS may be executed with mutual exclusion by enclosing it within P and V operations on a binary semaphore.
Example: the semaphore is initialized to 1 to indicate that the CS is free

sem mutex = 1;

process CS[i = 0 to n] {
  while (true) {
    P(mutex);  // entry, down
    CS;
    V(mutex);  // exit, up
    non-critical section;
  }
}
Caches and Cache Coherency
Caches and Cache Coherence
Caches play a key role in all cases
  Reduce average data access time
  Reduce bandwidth demands placed on the shared interconnect
But private processor caches create a problem
  Copies of a variable can be present in multiple caches
  A write by one processor may not become visible to others
    They will keep accessing the stale value in their caches
  This is the cache coherence problem
  Actions are needed to ensure visibility
Cache Memories
A cache memory is used to reduce the access time to memory
Cache misses can occur since the cache is much smaller than the memory
[Figure: Processor, Cache, Main Memory]
Cache Memories
Which parts of memory reside in the cache is decided by a replacement algorithm
There are different protocols for a write operation: Write-Back and Write-Through
Cache Memories: Read Operation
If the memory location is in the cache (cache hit), the data is read from the cache.
If the memory location is not in the cache (cache miss), the block containing the data is read from memory and the cache is updated.
Cache Memories: Write Operation (Write Hit)
Write-Through Protocol
  A write operation updates the main memory location
  Depending on the protocol, the cache may also be updated
  In this course we assume the cache is updated on a write hit
Cache Memories: Write Operation (Write Hit)
Write-Back Protocol
  A write operation updates only the cache location and marks it as updated with an associated flag bit (dirty flag)
  The main memory is updated later, when the block containing the marked address is removed from the cache.
Cache Memories: Write Operation (Write Miss)
Since the data is not necessarily needed on a write, there are two options:
  Write Allocate: the block is allocated on a write miss, followed by the corresponding write-hit actions
  No-Write Allocate: write misses do not affect the cache; instead, only the lower-level memory is updated.
Cache Memories: Write Operation (Write Miss)
Write-through and write-back can be combined with write-allocate or no-write-allocate
Typically:
  Write-back caches use write-allocate
  Write-through caches use no-write-allocate
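The difference between the two write-hit policies can be illustrated with a toy one-block cache (the `Cache` class is invented for this sketch; it models write-allocate behaviour and a single cached block only):

```python
class Cache:
    """Toy one-block cache illustrating write-through vs write-back."""
    def __init__(self, memory, policy):
        assert policy in ("write-through", "write-back")
        self.memory = memory   # backing store: dict address -> value
        self.policy = policy
        self.addr = None       # address of the single cached block
        self.value = None
        self.dirty = False

    def read(self, addr):
        if self.addr != addr:  # miss: evict the old block, fetch the new one
            self.evict()
            self.addr, self.value = addr, self.memory[addr]
        return self.value

    def write(self, addr, value):
        self.read(addr)        # write-allocate: bring the block in first
        self.value = value
        if self.policy == "write-through":
            self.memory[addr] = value  # memory updated on every write
        else:
            self.dirty = True          # memory updated only on eviction

    def evict(self):
        if self.dirty:
            self.memory[self.addr] = self.value
            self.dirty = False
        self.addr = None

wb = Cache({"u": 5}, "write-back")
wb.write("u", 7)
stale = wb.memory["u"]    # 5: memory is stale until eviction
wb.evict()
flushed = wb.memory["u"]  # 7: dirty block written back

wt = Cache({"u": 5}, "write-through")
wt.write("u", 7)
through = wt.memory["u"]  # 7: memory updated immediately
print(stale, flushed, through)
```

The stale value visible in the write-back case is precisely what turns into the cache coherence problem once several caches share the memory, as the following slides show.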
States for Cache Blocks
Write-through: Invalid, Valid
Write-back: Invalid, Valid, Dirty (not updated in memory)
Cache Coherence (Uniprocessor)
1. P1 reads location u (value 5) from main memory
2. P3 reads location u from main memory
3. P3 writes u, changing the value to 7
4. P1 reads value u again
5. P2 reads location u from main memory
[Figure: a single processor running three processes P1, P2, P3, with one write-back cache on the bus to main memory]
Cache Coherence (Uniprocessor)
1. P1 reads location u (value 5) from main memory
  The block containing u=5 is loaded into the cache
[State: cache u=5, main memory u=5]
Cache Coherence (Uniprocessor)
2. P3 reads location u from main memory
  Cache and memory still have the value u=5
[State: cache u=5, main memory u=5]
Cache Coherence (Uniprocessor)
3. P3 writes u, changing the value to 7
  The cache is updated (u=7) and u is marked dirty. Memory is not changed!
[State: cache u=7 (dirty), main memory u=5]
Cache Coherence (Uniprocessor)
4. P1 reads value u again
  Since the cache is common to all processes, there is no problem even though main memory is not updated! All processes have the same view of the cache!
[State: cache u=7 (dirty), main memory u=5]
Cache Coherence (Uniprocessor)
5. P2 reads location u from main memory
  Since the cache is common to all processes, there is no problem even though main memory is not updated! All processes have the same view of the cache!
[State: cache u=7 (dirty), main memory u=5]
Cache Coherence(Uniprocessor)
If only a uniprocessor is involved, there is no cache coherence problem!
However, if another device on the bus has direct memory access (like a DMA controller), the cache may not represent the contents of memory and the cache coherence problem can occur!
Cache Coherence Problem
1. P1 reads location u (value 5) from main memory
2. P3 reads location u from main memory
3. P3 writes u, changing the value to 7
4. P1 reads value u again
5. P2 reads location u from main memory
[Figure: processors P1, P2, P3, each with a private cache, connected by a bus to main memory]
Cache Coherence Problem (Write-Through Cache)
1. P1 reads location u (value 5) from main memory
  P1's cache is updated (u=5)
[State: P1 cache u=5, main memory u=5]
Cache Coherence Problem (Write-Through Cache)
2. P3 reads location u from main memory
  P3's cache is updated (u=5)
[State: P1 cache u=5, P3 cache u=5, main memory u=5]
Cache Coherence Problem (Write-Through Cache)
3. P3 writes u, changing the value to 7
  Main memory is updated (u=7)
  P3's cache is updated (we assume the cache is updated on a write hit)
[State: P1 cache u=5 (stale), P3 cache u=7, main memory u=7]
Cache Coherence Problem (Write-Through Cache)
4. P1 reads value u again
  P1 reads the value from its cache (u=5), which is not the correct value!
[State: P1 cache u=5 (stale), P3 cache u=7, main memory u=7]
Cache Coherence Problem (Write-Through Cache)
5. P2 reads location u from main memory
  P2 reads the value from main memory (u=7), which is the correct value
[State: P1 cache u=5 (stale), P2 cache u=7, P3 cache u=7, main memory u=7]
Cache Coherence Problem (Write-Back Cache)
1. P1 reads location u (value 5) from main memory
  P1's cache is updated (u=5)
[State: P1 cache u=5, main memory u=5]
Cache Coherence Problem (Write-Back Cache)
2. P3 reads location u from main memory
  P3's cache is updated (u=5)
[State: P1 cache u=5, P3 cache u=5, main memory u=5]
Cache Coherence Problem (Write-Back Cache)
3. P3 writes u, changing the value to 7
  P3's cache is updated (u=7) and location u is marked dirty
  Main memory is NOT updated!
[State: P1 cache u=5, P3 cache u=7 (dirty), main memory u=5]
Cache Coherence Problem (Write-Back Cache)
4. P1 reads value u again
  P1 reads the value from its cache (u=5), which is not the correct value!
[State: P1 cache u=5, P3 cache u=7 (dirty), main memory u=5]
Cache Coherence Problem (Write-Back Cache)
5. P2 reads location u from main memory
  P2 reads the value from main memory (u=5), which is not the correct value, since P3's write-back has not yet happened!
[State: P1 cache u=5, P2 cache u=5, P3 cache u=7 (dirty), main memory u=5]
Cache Coherence
Since communication between processors is done by means of shared memory, cache coherence must be guaranteed.
The hardware should support cache coherence!
Everybody should have the same view of the memory system!
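One way hardware provides this guarantee on a bus is write-invalidate snooping. A toy sketch replaying the five-step write-through example (`Bus` and `SnoopingCache` are illustrative class names, not a real protocol implementation):

```python
class Bus:
    """Broadcast medium: every cache snoops every write."""
    def __init__(self):
        self.caches = []

    def broadcast_write(self, writer, addr):
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)

class SnoopingCache:
    """Write-through cache with write-invalidate snooping."""
    def __init__(self, bus, memory):
        self.data = {}
        self.bus = bus
        self.memory = memory
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.data:  # miss: fetch from memory
            self.data[addr] = self.memory[addr]
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value
        self.memory[addr] = value             # write-through
        self.bus.broadcast_write(self, addr)  # others invalidate their copy

    def snoop_invalidate(self, addr):
        self.data.pop(addr, None)

memory = {"u": 5}
bus = Bus()
p1, p2, p3 = [SnoopingCache(bus, memory) for _ in range(3)]

p1.read("u")         # P1 caches u=5
p3.read("u")         # P3 caches u=5
p3.write("u", 7)     # memory updated, P1's copy invalidated
v1 = p1.read("u")    # 7: P1 misses and fetches the new value
v2 = p2.read("u")    # 7
print(v1, v2)
```

With invalidation on every bus write, the stale-read in step 4 of the write-through example can no longer occur: all processors again have the same view of the memory system.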