Shared Memory Multiprocessors
A. Jantsch / Z. Lu / I. Sander
Outline
Shared memory architectures: centralized memory, distributed memory
Shared memory programming: critical sections, mutexes and semaphores
Caches: write-through / write-back caches, the cache coherency problem
April 20, 2023 SoC Architecture 2
Shared Memory Architectures
Shared Memory Architectures
Shared memory multiprocessors come in two main classes:
Symmetric Multiprocessors (SMP)
  Symmetric access to all of main memory from any processor
  Also called UMA (uniform memory access)
Distributed Shared Memory (DSM)
  Access time depends on the location of the data word in memory
  Also called NUMA (non-uniform memory access)
Shared Memory Architectures
A shared memory programming model has a direct representation in hardware
Caches
  Increase performance
  Demand cache coherence and memory consistency protocols
Shared Cache Architecture
Several processors are connected via a switch to a shared cache in front of main memory
Has been used for a very small number of processors
Is difficult to use for a large number of processors, since the shared cache must deliver an extremely high bandwidth
[Figure: processors P1 … Pm connected through a switch to a shared cache backed by main memory]
Bus-based Shared Memory
The interconnect is a shared bus between the processors' local caches and the memory
Has been used for up to 20-30 processors
Scaling is limited due to the bandwidth limitations of the shared bus
[Figure: processors P1 … Pm, each with a private cache, connected by a shared bus to main memory]
Dancehall Architecture
A scalable point-to-point network is placed between the caches and the memory modules that together form the main memory
Due to the size of the interconnection network, the memory can be very far from the processors
[Figure: processors P1 … Pm, each with a cache, connected through an interconnection network to the memory modules]
Distributed Memory
Not a symmetric approach: the local memory is much closer than the rest of the global memory.
The structure scales very well.
It is important in the design to use the local memory efficiently.
[Figure: processors P1 … Pm, each with a cache and a local memory, connected through an interconnection network]
ARM’s Cortex-A9
[Figure: Cortex-A9 MPCore block diagram. Four Cortex-A9 CPUs, each with FPU/NEON, instruction cache, data cache, and PTM interface, connect to the Snoop Control Unit (SCU), which provides cache-to-cache transfers and snoop filtering. Surrounding blocks: generalized interrupt control and distribution, timers, the Accelerator Coherence Port (ACP, AXI-3 slave), the advanced bus interface unit, the L2 cache controller (PL310) with two ACP AXI-3 masters, the primary AMBA 3 64-bit interface, an optional 2nd interface with address filtering, and 64-bit non-cached AXI-3 masters for H/W accelerators.]
Three AXI buses on the Cortex-A9:
  The ACP port provides coherent access to the L1 and L2 caches through the Snoop Control Unit
  Two AXI masters off of the L2 cache provide 8 GB/s at 500 MHz access to the main SoC bus
Tilera Gx Architecture
4x4, 6x6, 8x8, 10x10 chips
3 instructions per cycle per core
32 MB on-chip cache
750 GOPS (32-bit operations)
200 Tbps on-chip interconnect bandwidth
500 Gbps memory bandwidth
~1 GHz operating frequency
10 W – 55 W power consumption
5 mesh networks (32 bit; dimension-order routing; 1-cycle traversal):
  QDN: Request Dynamic Network, 64-bit, for memory requests
  RDN: Response Dynamic Network, 112-bit, for memory responses
  TDN: Tile Dynamic Network, 128-bit, for cache-to-cache communication
  UDN: User Dynamic Network, 32-bit, under SW control
  IDN: I/O Dynamic Network, 32-bit, for non-memory I/O
Shared Memory Programming
Process and history
A process executes a sequence of statements. Each statement consists of one or more atomic (indivisible) actions which transform one state into another (state transition).
The process state is formed by the values of the variables at a point in time.
The process history is a trace of one execution: a sequence of atomic operations.
Example
P1: S0 → S1 → S2 → … → Sm, via atomic actions A1, A2, …, Am
Atomic Operations
Indivisible sequences of state transitions
Fine-grained atomic operations
  Machine instructions (read, write, test-and-set, read-modify-write, swap, etc.)
  Atomicity is guaranteed by hardware
Coarse-grained atomic actions
  A sequence of fine-grained atomic actions
  Should not be interrupted
  Internal state transitions are not visible "outside"
Concurrent execution
The concurrent execution of multiple processes can be viewed as the interleaving of their sequences of atomic actions. A history is a trace of ONE execution, i.e., an interleaving of atomic actions of processes.
Example
Individual histories
  P1: s0 → s1
  P2: p0 → p1
Interleaved execution histories
  Trace 1: s0→p0→s1→p1
  Trace 2: s0→s1→p0→p1
How many traces?
A concurrent program of n processes, each with m atomic actions, can produce N = (nm)! / (m!)^n different histories!
Example
  2 processes, each with 2 actions (n=2, m=2): N=6
  3 processes, each with 2 actions (n=3, m=2): N=90
  3 processes, each with 3 actions (n=3, m=3): N=1680
Implication
  This makes it impossible to show the correctness of a program by testing (run the program and see what happens). Design a "correct" program in the first place. For shared-variable programming, the problems are concerned with accessing shared variables; therefore a key issue is process synchronization.
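The history-count formula can be sanity-checked with a short script (a minimal sketch; the helper name `num_histories` is introduced here, not part of the slides):

```python
from math import factorial

def num_histories(n: int, m: int) -> int:
    """Histories of n processes with m atomic actions each: (n*m)! / (m!)^n."""
    return factorial(n * m) // factorial(m) ** n

print(num_histories(2, 2))  # 6
print(num_histories(3, 2))  # 90
print(num_histories(3, 3))  # 1680
```

The quotient is always an integer because it is a multinomial coefficient, so integer division is exact here.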
Concurrent Execution Example
Possible Results: 0, 1, 2, 3, Undefined
Task A: x:=0; y:=0; Print(x+y);
Task B: x:=x+1; y:=y+2;
Concurrent Execution Example
Interleaving 1: A: x:=0; y:=0; Print(x+y); then B: x:=x+1; y:=y+2;
  Result: 0
Interleaving 2: A: x:=0; y:=0; then B: x:=x+1; y:=y+2; then A: Print(x+y);
  Result: 3
Interleaving 3: A: x:=0; then B: x:=x+1; then A: y:=0; Print(x+y); then B: y:=y+2;
  Result: 1
Concurrent Execution Example
Interleaving 4: B: x:=x+1; then A: x:=0; y:=0; then B: y:=y+2; then A: Print(x+y);
  Result: 2
Interleaving 5: Task B's statements are expanded into several machine instructions:
  Task B: R1:=x; x:=R1+1; R2:=y; y:=R2+2;
  If Task A's statements interleave between B's loads and stores, updates can be lost and the result is undefined.
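The set of possible printed values can be enumerated mechanically. The sketch below treats each statement as atomic and assumes x and y start at 0, so the "undefined" case from the register-level expansion does not arise; the action encoding and the `run` helper are invented for illustration:

```python
from itertools import permutations

# Atomic statements as (task, index) pairs.
# Task A: x:=0; y:=0; Print(x+y)   Task B: x:=x+1; y:=y+2
A = [("A", 0), ("A", 1), ("A", 2)]
B = [("B", 0), ("B", 1)]

def run(order):
    """Execute one interleaving; return the value Print would output."""
    x = y = 0
    printed = None
    for task, i in order:
        if task == "A":
            if i == 0: x = 0
            elif i == 1: y = 0
            else: printed = x + y
        else:
            if i == 0: x = x + 1
            else: y = y + 2
    return printed

# Valid histories = permutations of all 5 actions that preserve each
# task's own program order.
results = set()
for perm in permutations(A + B):
    if [a for a in perm if a[0] == "A"] == A and [b for b in perm if b[0] == "B"] == B:
        results.add(run(perm))
print(sorted(results))  # [0, 1, 2, 3]
```

With n=2 and m in {2, 3} this enumerates the 10 legal histories and recovers exactly the results 0, 1, 2, 3 listed on the slide.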
Synchronization
Synchronization constrains the possible histories to the desirable (good) histories.
Synchronization methods
  Mutual exclusion (mutex)
    Exclusive access to shared variables within a critical section
    A mechanism that guarantees serialization of critical sections (atomicity of critical sections with respect to each other)
  Condition synchronization
    Delaying a process until the state satisfies a boolean condition
Lesson learnt: synchronization is required whenever processes read and write shared variables, to preserve data dependencies.
Concurrent Execution Example
Result: 3, provided that
  a. both tasks behave sequentially, and
  b. writes are seen in the same order by both tasks
Task A: x:=0; y:=0; S1:=1; while (S2==0); Print(x+y);
Task B: while (S1==0); x:=x+1; y:=y+2; S2:=1;
Initially: S1:=0; S2:=0;
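The flag synchronization above can be sketched with Python's threading.Event in place of busy-waiting on S1 and S2 (a sketch, not the slides' pseudocode language; Event.set/wait provide the ordering guarantee that assumption (b) demands):

```python
import threading

x = y = 0
output = []
s1 = threading.Event()   # plays the role of flag S1
s2 = threading.Event()   # plays the role of flag S2

def task_a():
    global x, y
    x = 0
    y = 0
    s1.set()              # S1 := 1
    s2.wait()             # while (S2 == 0);
    output.append(x + y)  # Print(x+y)

def task_b():
    global x, y
    s1.wait()             # while (S1 == 0);
    x = x + 1
    y = y + 2
    s2.set()              # S2 := 1

threads = [threading.Thread(target=f) for f in (task_a, task_b)]
for t in threads: t.start()
for t in threads: t.join()
print(output)  # [3]
```

Because the events force B's updates to happen after A's initializations and before A's print, the only possible result is 3.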
Critical section
CS: a piece of code that can only be executed by one process at a time
  Provides mutually exclusive access to shared resources (sequences of statements accessing shared variables)
  Two sections are critical with respect to each other if they must not be executed simultaneously, i.e., mutually exclusive sections.
Some synchronization mechanism is required at the entry and exit of the CS to ensure exclusive use.
Critical Section Example
Task A:
  Do private operations;
  Critical section begin;
  Update shared state;
  Critical section end;
  Do private operations;
Task B:
  Do private operations;
  Critical section begin;
  Update shared state;
  Critical section end;
  Do private operations;
The Critical section problem
Design entry and exit protocols that satisfy the following properties:
  Mutual exclusion
    At most one process at a time is entering, executing, and exiting the critical section.
  Absence of deadlocks (livelocks)
    One of the competing processes succeeds in entering
    Termination: the CS should terminate in finite time
  Absence of unnecessary delay
    A process is not prevented from entering if others do not compete.
  Fairness (eventual entry, liveness)
    A process should eventually enter the CS.
Solutions
Locking mechanisms
  Lock on enter; unlock on exit
  Variants of locks: spin locks (busy-waiting), queueing locks, etc.
Semaphores
  A general solution to the synchronization problem, for both mutual exclusion and condition synchronization.
Lock
Enter CS: set the lock when it is cleared. < await (lock==false) lock := true >;
Exit CS: clear/release the lock
lock := false;
Synonyms: enter-exit, lock-unlock, acquire-release
Example:
bool lock = false;

process CS1 {
  while (true) {
    <await (lock==false) lock := true>;  // entry
    CS;
    lock := false;                       // exit
    non-critical section;
  }
}

process CS2 {
  while (true) {
    <await (lock==false) lock := true>;  // entry
    CS;
    lock := false;                       // exit
    non-critical section;
  }
}
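The entry/exit protocol above maps directly onto a lock in a real threading library. A minimal Python sketch using threading.Lock (the 4 threads and 10,000 iterations are just an illustrative workload):

```python
import threading

lock = threading.Lock()
counter = 0  # shared variable, updated only inside the critical section

def worker(n_iters: int) -> None:
    global counter
    for _ in range(n_iters):
        with lock:        # entry: acquire, blocks until the lock is free
            counter += 1  # critical section
        # exit: the lock is released when the with-block ends
        # non-critical section would go here

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 40000: no increments are lost
```

Without the lock, the read-modify-write `counter += 1` could interleave between threads and lose updates, exactly as in the "undefined" interleaving example earlier.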
Lock implementation
Lock/unlock in terms of instructions:
  Locking consists of several instructions
  Unlock is an ordinary store instruction.
To support the atomicity of locking, locks need hardware support, i.e., special atomic memory instructions. General semantics: <read location, test the value read, compute a new value, and store the new value to the location>
Many variants: read-modify-write, test&set, fetch&increment, swap, etc.

lock:   load  register, location  // copy location to register
        cmp   register, #0        // compare with 0
        bnz   lock                // if not 0, try again (spin)
        store location, #1        // store 1, marking locked
        ret
unlock: store location, #0
        ret

Note that the load-test-store sequence must execute atomically (e.g., as a single test&set instruction); otherwise two processors can both read 0 and both take the lock.
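The atomic read-modify-write semantics can be sketched in Python; the `Memory` class and its `test_and_set` method are illustrative stand-ins for the hardware primitive, not a real API:

```python
class Memory:
    """Toy memory with an atomic test&set, sketching the hardware primitive."""
    def __init__(self):
        self.cells = {}

    def test_and_set(self, loc):
        """Atomically read the old value and write 1; return the old value."""
        old = self.cells.get(loc, 0)
        self.cells[loc] = 1
        return old

    def store(self, loc, value):
        self.cells[loc] = value

mem = Memory()

def acquire(loc):
    # Spin until test&set returns 0, i.e. we are the one who set the lock.
    while mem.test_and_set(loc) != 0:
        pass

def try_acquire(loc):
    return mem.test_and_set(loc) == 0

def release(loc):
    mem.store(loc, 0)  # unlock is an ordinary store

acquire("L")            # first acquire succeeds immediately
r1 = try_acquire("L")   # False: the lock is already held
release("L")
r2 = try_acquire("L")   # True: the lock was free again
print(r1, r2)
```

Because test&set returns the old value and writes 1 in one indivisible step, only one contender can ever observe 0 and enter the critical section.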
Semaphore
A semaphore is a special kind of shared variable manipulated by two atomic operations, P and V.
Semaphores provide a low-level but efficient signaling mechanism for both mutual exclusion and condition synchronization.
Inspired by the railroad semaphore: an up/down signal flag.
The semaphore operations are named in Dutch:
  P (decrement when positive) stands for "proberen" (to test) or "passeren" (to pass)
  V (increment) stands for "verhogen" (to increase) or "vrijgeven" (to release)
Semaphore syntax and semantics
Declaration
  sem s = expr; // single semaphore
Initialization
  Defaults to 1
  The value of a semaphore is a non-negative integer
Operations
  P(s): <await (s>0) s := s-1;> // wait, down
  V(s): <s := s+1;> // signal, up
Semaphore types
Binary semaphore: takes only the values 0 and 1.
Split binary semaphore: a set of binary semaphores where at most one semaphore is 1 at a time; the sum of the semaphore values is in [0,1].
General (counting) semaphore: takes any non-negative integer value and can be used for condition synchronization; for example, it serves as a resource counter that counts the number of available resources.
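A counting semaphore as a resource counter can be sketched with Python's threading.Semaphore; the `in_use`/`max_in_use` counters are instrumentation added here for illustration:

```python
import threading

resources = threading.Semaphore(3)  # 3 identical resources available
in_use = 0
max_in_use = 0
guard = threading.Lock()            # protects the two counters

def use_resource():
    global in_use, max_in_use
    resources.acquire()             # P: wait until a resource is available
    with guard:
        in_use += 1
        max_in_use = max(max_in_use, in_use)
    # ... use the resource ...
    with guard:
        in_use -= 1
    resources.release()             # V: return the resource

threads = [threading.Thread(target=use_resource) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
print(max_in_use <= 3)  # True: never more than 3 holders at once
```

The semaphore's value counts the free resources; P blocks when the count reaches 0, which is exactly the condition synchronization described above.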
Mutex semaphore
A CS may be executed with mutual exclusion by enclosing it within P and V operations on a binary semaphore.
Example: the semaphore is initialized to 1 to indicate that the CS is free

sem mutex = 1;

process CS[i = 0 to n] {
  while (true) {
    P(mutex);  // entry, down
    CS;
    V(mutex);  // exit, up
    non-critical section;
  }
}
Caches and Cache Coherency
Caches and Cache Coherence
Caches play a key role in all cases
  Reduce average data access time
  Reduce bandwidth demands placed on the shared interconnect
But private processor caches create a problem
  Copies of a variable can be present in multiple caches
  A write by one processor may not become visible to others
    They will keep accessing the stale value in their caches
  This is the cache coherence problem
  Actions are needed to ensure visibility
Cache Memories
A cache memory is used to reduce the access time to memory
Cache misses can occur since the cache is much smaller than the memory
[Figure: Processor, Cache, Main Memory]
Cache Memories
Which parts of memory reside in the cache is decided by a replacement algorithm
There are different protocols for a write operation: Write-Back and Write-Through
Cache Memories: Read Operation
If the memory location is in the cache (cache hit), the data is read from the cache.
If the memory location is not in the cache (cache miss), the block containing the data is read from memory and the cache is updated.
Cache Memories: Write Operation (Write Hit)
Write-Through Protocol
  A write operation updates the main memory location
  Depending on the protocol, the cache may also be updated
  In this course we assume the cache is updated on a write hit
Cache Memories: Write Operation (Write Hit)
Write-Back Protocol
  A write operation updates only the cache location and marks it as updated with an associated flag bit (dirty flag)
  The main memory is updated later, when the block containing the marked address is removed from the cache.
Cache Memories: Write Operation (Write Miss)
Since the data is not necessarily needed on a write, there are two options:
  Write Allocate: the block is allocated on a write miss, followed by the corresponding write-hit actions
  No-Write Allocate: write misses do not affect the cache; instead, only the lower-level memory is updated.
Cache Memories: Write Operation (Write Miss)
Write-through and write-back can be combined with write-allocate or no-write-allocate
Typically:
  Write-back caches use write-allocate
  Write-through caches use no-write-allocate
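The difference between the two write-hit policies can be illustrated with a toy one-block cache (the `Cache` class is invented for this sketch; it models write-allocate behaviour and a single cached block only):

```python
class Cache:
    """Toy one-block cache illustrating write-through vs write-back."""
    def __init__(self, memory, policy):
        assert policy in ("write-through", "write-back")
        self.memory = memory   # backing store: dict address -> value
        self.policy = policy
        self.addr = None       # address of the single cached block
        self.value = None
        self.dirty = False

    def read(self, addr):
        if self.addr != addr:  # miss: evict the old block, fetch the new one
            self.evict()
            self.addr, self.value = addr, self.memory[addr]
        return self.value

    def write(self, addr, value):
        self.read(addr)        # write-allocate: bring the block in first
        self.value = value
        if self.policy == "write-through":
            self.memory[addr] = value  # memory updated on every write
        else:
            self.dirty = True          # memory updated only on eviction

    def evict(self):
        if self.dirty:
            self.memory[self.addr] = self.value
            self.dirty = False
        self.addr = None

wb = Cache({"u": 5}, "write-back")
wb.write("u", 7)
stale = wb.memory["u"]    # 5: memory is stale until eviction
wb.evict()
flushed = wb.memory["u"]  # 7: dirty block written back

wt = Cache({"u": 5}, "write-through")
wt.write("u", 7)
through = wt.memory["u"]  # 7: memory updated immediately
print(stale, flushed, through)
```

The stale value visible in the write-back case is precisely what turns into the cache coherence problem once several caches share the memory, as the following slides show.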
States for Cache Blocks
Write-through: Invalid, Valid
Write-back: Invalid, Valid, Dirty (not updated in memory)
Cache Coherence (Uniprocessor)
1. P1 reads location u (value 5) from main memory
2. P3 reads location u from main memory
3. P3 writes u, changing the value to 7
4. P1 reads value u again
5. P2 reads location u from main memory
[Figure: a single processor running three processes P1, P2, P3, with one write-back cache on the bus to main memory]
Cache Coherence (Uniprocessor)
1. P1 reads location u (value 5) from main memory
  The block containing u=5 is loaded into the cache
[State: cache u=5, main memory u=5]
Cache Coherence (Uniprocessor)
2. P3 reads location u from main memory
  Cache and memory still have the value u=5
[State: cache u=5, main memory u=5]
Cache Coherence (Uniprocessor)
3. P3 writes u, changing the value to 7
  The cache is updated (u=7) and u is marked dirty. Memory is not changed!
[State: cache u=7 (dirty), main memory u=5]
Cache Coherence (Uniprocessor)
4. P1 reads value u again
  Since the cache is common to all processes, there is no problem even though main memory is not updated! All processes have the same view of the cache!
[State: cache u=7 (dirty), main memory u=5]
Cache Coherence (Uniprocessor)
5. P2 reads location u from main memory
  Since the cache is common to all processes, there is no problem even though main memory is not updated! All processes have the same view of the cache!
[State: cache u=7 (dirty), main memory u=5]
Cache Coherence(Uniprocessor)
If only a uniprocessor is involved, there is no cache coherence problem!
However, if another device on the bus has direct memory access (like a DMA controller), the cache may not represent the contents of memory and the cache coherence problem can occur!
Cache Coherence Problem
1. P1 reads location u (value 5) from main memory
2. P3 reads location u from main memory
3. P3 writes u, changing the value to 7
4. P1 reads value u again
5. P2 reads location u from main memory
[Figure: processors P1, P2, P3, each with a private cache, connected by a bus to main memory]
Cache Coherence Problem (Write-Through Cache)
1. P1 reads location u (value 5) from main memory
  P1's cache is updated (u=5)
[State: P1 cache u=5, main memory u=5]
Cache Coherence Problem (Write-Through Cache)
2. P3 reads location u from main memory
  P3's cache is updated (u=5)
[State: P1 cache u=5, P3 cache u=5, main memory u=5]
Cache Coherence Problem (Write-Through Cache)
3. P3 writes u, changing the value to 7
  Main memory is updated (u=7)
  P3's cache is updated (we assume the cache is updated on a write hit)
[State: P1 cache u=5 (stale), P3 cache u=7, main memory u=7]
Cache Coherence Problem (Write-Through Cache)
4. P1 reads value u again
  P1 reads the value from its cache (u=5), which is not the correct value!
[State: P1 cache u=5 (stale), P3 cache u=7, main memory u=7]
Cache Coherence Problem (Write-Through Cache)
5. P2 reads location u from main memory
  P2 reads the value from main memory (u=7), which is the correct value
[State: P1 cache u=5 (stale), P2 cache u=7, P3 cache u=7, main memory u=7]
Cache Coherence Problem (Write-Back Cache)
1. P1 reads location u (value 5) from main memory
  P1's cache is updated (u=5)
[State: P1 cache u=5, main memory u=5]
Cache Coherence Problem (Write-Back Cache)
2. P3 reads location u from main memory
  P3's cache is updated (u=5)
[State: P1 cache u=5, P3 cache u=5, main memory u=5]
Cache Coherence Problem (Write-Back Cache)
3. P3 writes u, changing the value to 7
  P3's cache is updated (u=7) and location u is marked dirty
  Main memory is NOT updated!
[State: P1 cache u=5, P3 cache u=7 (dirty), main memory u=5]
Cache Coherence Problem (Write-Back Cache)
4. P1 reads value u again
  P1 reads the value from its cache (u=5), which is not the correct value!
[State: P1 cache u=5, P3 cache u=7 (dirty), main memory u=5]
Cache Coherence Problem (Write-Back Cache)
5. P2 reads location u from main memory
  P2 reads the value from main memory (u=5), which is not the correct value, since P3's write-back has not yet happened!
[State: P1 cache u=5, P2 cache u=5, P3 cache u=7 (dirty), main memory u=5]
Cache Coherence
Since communication between processors is done by means of shared memory, cache coherence must be guaranteed.
The hardware should support cache coherence!
Everybody should have the same view of the memory system!
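One way hardware provides this guarantee on a bus is write-invalidate snooping. A toy sketch replaying the five-step write-through example (`Bus` and `SnoopingCache` are illustrative class names, not a real protocol implementation):

```python
class Bus:
    """Broadcast medium: every cache snoops every write."""
    def __init__(self):
        self.caches = []

    def broadcast_write(self, writer, addr):
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)

class SnoopingCache:
    """Write-through cache with write-invalidate snooping."""
    def __init__(self, bus, memory):
        self.data = {}
        self.bus = bus
        self.memory = memory
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.data:  # miss: fetch from memory
            self.data[addr] = self.memory[addr]
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value
        self.memory[addr] = value             # write-through
        self.bus.broadcast_write(self, addr)  # others invalidate their copy

    def snoop_invalidate(self, addr):
        self.data.pop(addr, None)

memory = {"u": 5}
bus = Bus()
p1, p2, p3 = [SnoopingCache(bus, memory) for _ in range(3)]

p1.read("u")         # P1 caches u=5
p3.read("u")         # P3 caches u=5
p3.write("u", 7)     # memory updated, P1's copy invalidated
v1 = p1.read("u")    # 7: P1 misses and fetches the new value
v2 = p2.read("u")    # 7
print(v1, v2)
```

With invalidation on every bus write, the stale-read in step 4 of the write-through example can no longer occur: all processors again have the same view of the memory system.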