Transcript of E0-243: Computer Architecture, L2 – Parallel Architecture.

Page 1:

E0-243: Computer Architecture

L2 – Parallel Architecture

Page 2:

Overview

Parallel architecture
Cache coherence problem
Memory consistency

Page 3:

Trends

Ever-increasing transistor density ⇒ multiple processors (multiple cores) on a single chip (CMP)

Beyond instruction-level parallelism: thread-level parallelism

Speculative execution ⇒ speculative multithreaded execution

Page 4:

Recall:

Amdahl's Law: for a program whose fraction x must execute sequentially, speedup is limited to 1/x.

Speedup = (Exec. time on uniprocessor) / (Exec. time on N processors)

Efficiency = (Speedup on N processors) / N
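As a quick sanity check of these formulas, here is a minimal C sketch (the 10% sequential fraction and the processor counts are illustrative assumptions, not from the slides):

#include <stdio.h>

/* Amdahl's Law: with sequential fraction x, speedup on N processors
   is 1 / (x + (1 - x)/N), which approaches the limit 1/x as N grows. */
static double amdahl_speedup(double x, int n) {
    return 1.0 / (x + (1.0 - x) / n);
}

int main(void) {
    double x = 0.10;                        /* 10% sequential (illustrative) */
    int procs[] = { 2, 8, 64, 1024 };
    for (int i = 0; i < 4; i++) {
        double s = amdahl_speedup(x, procs[i]);
        /* Efficiency = Speedup / N, as defined above. */
        printf("N=%4d  speedup=%5.2f  efficiency=%.3f\n",
               procs[i], s, s / procs[i]);
    }
    return 0;                               /* speedup never exceeds 1/x = 10 */
}

Even with 1024 processors the speedup stays below 10, which is the 1/x bound.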

Page 5:

Space of Parallel Computing

Programming models
What the programmer uses in coding applications; specifies synchronization and communication.
Shared address space, e.g., OpenMP
Message passing, e.g., MPI
(A minimal code sketch of both models appears after this list.)

Parallel architecture
Shared memory:
Centralized shared memory (UMA)
Distributed shared memory (NUMA)
Distributed memory, a.k.a. message passing, e.g., clusters
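As an illustration of the two programming models, a minimal C/OpenMP sketch of a shared-address-space summation, with the message-passing counterpart noted in a comment (the example itself is an illustrative assumption, not from the slides):

#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;
    /* Shared address space: threads communicate implicitly through shared
       data; the reduction combines per-thread partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;
    printf("sum = %f\n", sum);
    /* In the message-passing model (e.g., MPI), each process would keep a
       private partial sum and combine the results explicitly, for example
       with MPI_Reduce, instead of writing to a shared variable. */
    return 0;
}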

Page 6:

Shared Memory Architectures

Shared, global address space, hence called shared address space.
Any processor can directly reference any memory location; communication occurs implicitly as a result of loads and stores.
Centralized: latencies to memory are uniform, but uniformly large.
Distributed: Non-Uniform Memory Access (NUMA).

Page 7:

Figure: Shared Memory Architecture.
Centralized shared memory: processors, each with a private cache ($, P), connect through a network to a set of shared memory modules (M).
Distributed shared memory: each node pairs a processor and cache ($, P) with a local memory module (M), and the nodes are connected by a network.

Page 8:

Distributed Memory Architecture

Figure: Message-passing architecture. Each processing node contains a processor (P), cache ($), and private memory (M), and the nodes are connected by a network.
Memory is private to each node; processes communicate by messages.

Page 9:

Caches and Cache Coherence

Caches play a key role in all cases:
Reduce average data access time.
Reduce bandwidth demands placed on the shared interconnect.

Private processor caches create a problem:
Copies of a variable can be present in multiple caches.
A write by processor P may not be visible to P', which will keep accessing the stale value from its cache: the cache coherence problem.

Page 10:

Cache Coherence Problem: Example

Processors see different values for u after event 3. With write-back caches, the value written back to memory depends on which cache flushes or writes back its copy.

Figure: P1, P2, and P3, each with a private cache ($), share a bus to memory and I/O devices; location u initially holds 5 in memory.
1. P1 reads u (caches u:5).
2. P3 reads u (caches u:5).
3. P3 writes u = 7.
4. P1 reads u: it may still see the stale value 5 from its own cache.
5. P2 reads u: which value does it get?

Page 11:

Cache Coherence Problem

Multiple processors with private caches create a potential data consistency problem: the cache coherence problem.
Processes shouldn't read `stale' data; intuitively, reading an address should return the last value written to that address.

Solutions:
Hardware: cache coherence mechanisms (invalidation-based vs. update-based; snoopy vs. directory)
Software: compiler-assisted cache coherence

Page 12:

Example: Snoopy Bus Protocols

Assumption: a shared-bus interconnect on which all cache controllers monitor all bus activity, called snooping.
Only one operation goes over the bus at a time, so cache controllers can be built to take corrective action and enforce coherence in the caches. Corrective action could involve updating or invalidating a cache block.

Page 13:

Snoopy Invalidate Protocol

Figure: the same bus-based example. P1 and P3 read u (events 1 and 2) and cache u:5. When P3 writes u = 7 (event 3), the bus transaction is snooped and the other cached copies of u are invalidated, so P1's next read of u (event 4) misses and fetches the up-to-date value instead of the stale 5.

Page 14:

Invalidate vs Update

Basic question of program behavior: is a block written by one processor later read by other processors before it is overwritten?

Invalidate:
readers will take a miss
multiple writes by the same processor cause no additional traffic
clears out copies that are not used again

Update:
avoids misses on later references
may generate multiple useless updates

Page 15:

MSI Invalidation Protocol

Cache block states (encoded in 2 bits and updated by the protocol):
I: Invalid
S: Shared (one or more cached copies)
M: Modified or Dirty (the only copy)

Processor events:
PrRd (read)
PrWr (write)

Bus transactions:
BusRd: asks for a copy with no intent to modify
BusRdX: asks for a copy with intent to modify
Flush: write back (updates main memory)

Page 16:

MSI: State Transition Diagram

Figure: MSI state transition diagram (notation: observed event / action taken).
From I: PrRd / BusRd -> S; PrWr / BusRdX -> M.
From S: PrRd / — (hit); BusRd / —; PrWr / BusRdX -> M; BusRdX / — -> I.
From M: PrRd / — and PrWr / — (hits); BusRd / Flush -> S; BusRdX / Flush -> I.
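The diagram above can also be written as a small state-machine sketch in C (a minimal illustration; the state, event, and action names mirror the slides, everything else is an assumption):

#include <stdio.h>

typedef enum { I, S, M } State;                    /* MSI cache block states   */
typedef enum { PrRd, PrWr, BusRd, BusRdX } Event;  /* processor and bus events */

/* One MSI transition: given the current state and an observed event,
   return the next state and the bus action (if any) to issue. */
static State msi_next(State st, Event ev, const char **action) {
    *action = "-";
    switch (st) {
    case I:
        if (ev == PrRd) { *action = "BusRd";  return S; }
        if (ev == PrWr) { *action = "BusRdX"; return M; }
        return I;                   /* bus events do not affect an invalid block */
    case S:
        if (ev == PrWr)   { *action = "BusRdX"; return M; }
        if (ev == BusRdX) return I; /* another cache wants to modify the block   */
        return S;                   /* PrRd and BusRd: no change, no action      */
    case M:
        if (ev == BusRd)  { *action = "Flush"; return S; }
        if (ev == BusRdX) { *action = "Flush"; return I; }
        return M;                   /* PrRd and PrWr hit locally                 */
    }
    return st;
}

int main(void) {
    const char *act;
    State st = I;
    Event trace[] = { PrRd, PrWr, BusRd, BusRdX };  /* example event trace */
    for (int i = 0; i < 4; i++) {
        st = msi_next(st, trace[i], &act);
        printf("after event %d: state=%d action=%s\n", i, st, act);
    }
    return 0;
}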

Page 17:

MESI (4-state) Invalidation Protocol

Problem with the MSI protocol: reading and then modifying data costs 2 bus transactions even when no one else is sharing, i.e., BusRd (I->S) followed by BusRdX or BusUpgr (S->M).

Add an exclusive state so the block can be written locally without a bus transaction. Memory is up to date, so the cache is not necessarily the owner.

States:
invalid
exclusive, or exclusive-clean (only this cache has a copy, and it is not modified)
shared (two or more caches may have copies)
modified (dirty)
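For comparison with the MSI sketch above, the processor-side transitions that change under MESI can be sketched as follows (illustrative only; the shared_signal flag models the bus "shared" line, all names are assumptions, and bus-side transitions are omitted):

/* MESI processor-side transitions; note the silent E -> M upgrade. */
typedef enum { INV, EXC, SHD, MOD } MesiState;

static MesiState mesi_on_processor(MesiState st, int is_write, int shared_signal,
                                   const char **bus_action) {
    *bus_action = "-";
    switch (st) {
    case INV:
        if (!is_write) { *bus_action = "BusRd";
                         return shared_signal ? SHD : EXC; }  /* read miss  */
        *bus_action = "BusRdX"; return MOD;                   /* write miss */
    case EXC:
        return is_write ? MOD : EXC;  /* write hit: E -> M with NO bus transaction */
    case SHD:
        if (is_write) { *bus_action = "BusRdX"; return MOD; }
        return SHD;
    case MOD:
        return MOD;                   /* read and write hits stay in M */
    }
    return st;
}

The read-then-write sequence on unshared data now costs a single bus transaction (BusRd into E, then a silent upgrade to M), which is exactly the saving over MSI described above.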

Page 18:

MESI - State Transition Diagram

Figure: MESI state transition diagram (notation: observed event / action taken; BusRd(S) means another cache asserted the shared signal, BusRd(S̄) means no other cache has a copy).
From I: PrRd / BusRd(S̄) -> E; PrRd / BusRd(S) -> S; PrWr / BusRdX -> M.
From E: PrRd / — (hit); PrWr / — -> M (no bus transaction); BusRd / Flush -> S; BusRdX / Flush -> I.
From S: PrRd / — (hit); BusRd / Flush; PrWr / BusRdX -> M; BusRdX / Flush -> I.
From M: PrRd / — and PrWr / — (hits); BusRd / Flush -> S; BusRdX / Flush -> I.

Page 19:

Scalability Issues of Snoopy Protocol

A snoopy cache is ideally suited to a bus-based interconnection network (IN).

A shared-bus IN saturates, so performance stops scaling for a large number of processors (beyond about 8).

For a non-bus-based IN, coherence messages would have to be broadcast, which is expensive.

Only a few processors may actually hold a copy of the shared data, so it may be more efficient to maintain a directory of the caches that have a copy of each cache block.

Page 20:

Directory Based Coherence

Memory (or a cache) maintains a list (directory) of the processors that have a copy of a block.

On a write, the memory controller sends an Invalidate (or Update) signal only to the processors that have a copy.

Memory also knows the current owner (in the case of dirty blocks); the memory controller requests the updated copy from the owner.

Page 21:

Generic Solution: Directories

Figure: each node contains a processor (P1, ...), cache, communication assist, and a slice of memory with its directory, and the nodes are connected by a scalable interconnection network. Each directory entry holds presence bits (one per node) and a dirty bit.
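A minimal C sketch of what one directory entry might look like (the field names and the 64-node limit are illustrative assumptions, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 64                 /* illustrative system size */

/* One directory entry per memory block: which caches hold a copy,
   and whether some cache holds the block dirty (modified). */
typedef struct {
    uint64_t presence;               /* bit i set => node i caches the block      */
    bool     dirty;                  /* true => exactly one cached, modified copy */
} DirEntry;

/* On a write by node 'writer', invalidations go only to the nodes whose
   presence bit is set, excluding the writer itself. */
static uint64_t invalidation_targets(const DirEntry *e, int writer) {
    return e->presence & ~(1ULL << writer);
}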

Page 22:

Memory Consistency Model

Memory consistency model: the order in which memory operations will appear to execute ⇒ what value can a read return?

A contract between application software and the system ⇒ affects ease of programming and performance.

Page 23:

Understanding Program Order: Example

Initially A = B = 0;
Process P1: A = 1;
Process P2: while (A==0); B = 1;
Process P3: while (B==0); Print A;

What value of A will be printed by process P3?

Note the role of program order in ensuring that P3 reads the value of A as 1.
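Under sequential consistency the chain A = 1, P2 observing A, B = 1, P3 observing B forces P3 to print 1. A minimal C11 sketch of the same pattern using seq_cst atomics, which give SC-like behavior (the thread setup is an illustrative assumption, not from the slides):

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int A = 0, B = 0;                 /* initially A = B = 0 */

int p1(void *arg) { (void)arg; atomic_store(&A, 1); return 0; }
int p2(void *arg) { (void)arg; while (atomic_load(&A) == 0) ; atomic_store(&B, 1); return 0; }
int p3(void *arg) { (void)arg; while (atomic_load(&B) == 0) ; printf("A = %d\n", atomic_load(&A)); return 0; }

int main(void) {
    thrd_t t1, t2, t3;
    thrd_create(&t3, p3, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_create(&t1, p1, NULL);
    thrd_join(t1, NULL); thrd_join(t2, NULL); thrd_join(t3, NULL);
    /* With the default seq_cst ordering this always prints A = 1. */
    return 0;
}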

Page 24:

Example 2

Software implementation of mutual exclusion:

Process P1: A = 0; ... A = 1; if (B == 0) critical section
Process P2: B = 0; ... B = 1; if (A == 0) critical section

Can both P1 and P2 enter the critical section, i.e., both evaluate the "if" condition as true?
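This is the classic store-buffering (Dekker-style) pattern. A minimal C11 sketch of how it can break on a machine that lets a load bypass an earlier store, and how a full fence restores the intended behavior (the fence placement is an illustrative assumption, not from the slides):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int A, B;                      /* both start at 0 */

/* Run by process P1; P2 is symmetric with A and B swapped. */
bool try_enter_p1(void) {
    atomic_store_explicit(&A, 1, memory_order_relaxed);     /* "A = 1;" */
    /* Without this fence, a machine that relaxes the write -> read order can
       let the load of B below be performed before the store of A becomes
       visible, so BOTH processes may read 0 and both enter. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&B, memory_order_relaxed) == 0;  /* "if (B == 0)" */
}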

Page 25:

Sequential Consistency: Definition

A system is sequentially consistent if:
operations within a processor follow program order;
operations of all processors were executed in some (interleaved) sequential order;
all processors see the same sequential order.

Page 26:

Implicit Memory Model

Sequential consistency (SC) [Lamport]: the result of an execution appears as if
• operations from different processors executed in some sequential (interleaved) order, and
• memory operations of each process occurred in program order.

Figure: processors P1, P2, P3, ..., Pn all access a single shared MEMORY.

Page 27:

Sequential Consistency: Definition

A system is sequentially consistent if:
operations within a processor follow program order;
operations of all processors were executed in some (interleaved) sequential order;
all processors see the same sequential order.

Initially A = B = 0;
Process P1: A = 1;
Process P2: while (A==0); B = 1;
Process P3: while (B==0); Print A;

Page 28:

Under SC can P3 print A as 0?

Initially A = B = 0;
Process P1: (w1) A = 1;
Process P2: (r2) while (A==0); (w2) B = 1;
Process P3: (r3) while (B==0); (r3') Print A;

Under SC, w1 -> r2 (P2 leaves its loop only after reading A = 1), r2 -> w2 and r3 -> r3' (program order), and w2 -> r3 (P3 leaves its loop only after reading B = 1). Chaining these orders, r3' must return 1, so P3 cannot print A as 0.

Page 29:

Sequential Consistency

SC ensures all memory orders: Write -> Read, Write -> Write, Read -> Read, Read -> Write.

SC treats all memory operations the same way!

Page 30:

Processor Consistency: Definition

A system is processor consistent if:
writes issued by a processor are observed in program order, i.e., the read -> read, read -> write, and write -> write orders are enforced, but not the write -> read order;
operations of all processors were executed in some (interleaved) sequential order;
all processors need not see the same sequential order of writes from different processors.

Revisit the two examples under processor consistency:

Initially A = B = 0;
Process P1: A = 1;
Process P2: while (A==0); B = 1;
Process P3: while (B==0); Print A;

Process P1: A = 0; ... A = 1; if (B == 0) critical section
Process P2: B = 0; ... B = 1; if (A == 0) critical section

Page 31:

Example 2

Process P1: A = 0; ... A = 1; if (B == 0) critical section
Process P2: B = 0; ... B = 1; if (A == 0) critical section

Under processor consistency the read of the other process's flag may be performed before the earlier write is visible (no write -> read order), so both processes can evaluate their condition as true and both can enter the critical section.

Page 32:

Weak Consistency

Distinguishes between ordinary memory operations and synchronization operations (e.g., lock acquire/release).

A system is weakly consistent if:
before a load/store is allowed to perform, all previous synchronization accesses must have performed;
before a synchronization operation is allowed to perform, all previous loads/stores must have performed;
synchronization accesses are sequentially consistent.

Page 33:

Weak Consistency

Weak ordering: divide memory operations into data operations and synchronization operations.

Synchronization operations act like a fence:
all data operations before the synch in program order must complete before the synch is executed;
all data operations after the synch in program order must wait for the synch to complete;
synchs are performed in program order.

Page 34:

Weak Consistency

Weak ordering, implementation of the fence: the processor keeps a counter that is incremented when a data operation is issued and decremented when it completes, so a synch can proceed only once the counter reaches zero.

Example: PowerPC has the SYNC instruction.
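At the programming-language level, C11's atomic_thread_fence plays an analogous fence role. A minimal sketch, assuming a simple produce/consume pattern (the variable names are illustrative, not from the slides):

#include <stdatomic.h>

int data1, data2;                      /* ordinary (data) operations */
atomic_int ready;                      /* synchronization variable   */

void producer(void) {
    data1 = 1;                         /* data operations before the synch */
    data2 = 2;
    /* Fence: as in weak ordering, the data operations above must complete
       before the synchronization store below. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);   /* synch access */
}

void consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                              /* wait on the synchronization variable */
    atomic_thread_fence(memory_order_seq_cst);
    /* data1 and data2 may be read safely here: the fences keep the data
       operations on the correct side of the synchronization accesses. */
}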

Page 35:

An Example:

Figure: the same instruction sequence (Load, Load, Store, Load, Store, Store) shown under the two models. Under Sequential Consistency every operation must perform in program order. Under Processor Consistency a later Load may perform before an earlier Store (the Write -> Read order is relaxed), while the remaining orders are maintained.

Page 36:

Example: Weak Consistency

Figure: a stream of Load/Store operations separated by synchronization operations, Sync(Acq) ... Sync(Rel) ... Sync(Acq) ... Sync(Rel). There is no ordering among the loads/stores within a region between two Syncs, but no load/store may be reordered across a Sync, and the Syncs themselves perform in program order.

Page 37:

Another model: Release Consistency

Synchronization accesses are divided into:
acquires: operations like lock
releases: operations like unlock

Semantics of acquire: the acquire must complete before all following memory accesses.

Semantics of release: all memory operations before the release must complete, but accesses after the release in program order do not have to wait for the release; operations that follow the release and that need to wait must be protected by an acquire.

Page 38:

Release Consistency

Further distinguishes between lock-acquire and lock-release synchronization operations.

A system is release consistent if:
before a load/store is allowed to perform, all previous acquire accesses must have performed;
before a release synchronization operation is allowed to perform, all previous loads/stores must have performed;
synchronization accesses are processor consistent.

Page 39:

Example: Release Consistency

Figure: the same regions of Load/Store operations shown under Weak Consistency (with Sync(Acq)/Sync(Rel) acting as full fences) and under Release Consistency (with Acquire and Release). Under release consistency an Acquire orders only the accesses that follow it and a Release waits only for the accesses that precede it, so ordinary loads/stores may move upward past a preceding Release or downward past a following Acquire.
• Acquire treated as READ/LOAD
• Release treated as WRITE/STORE
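C11/C++11 acquire/release atomics map directly onto this model. A minimal sketch of a spinlock whose lock is an acquire and whose unlock is a release (the lock variable and function names are illustrative assumptions, not from the slides):

#include <stdatomic.h>

atomic_flag lock_word = ATOMIC_FLAG_INIT;
int shared_counter;                      /* data protected by the lock */

void lock(void) {
    /* Acquire: no later load/store may be performed before this succeeds. */
    while (atomic_flag_test_and_set_explicit(&lock_word, memory_order_acquire))
        ;                                /* spin */
}

void unlock(void) {
    /* Release: all earlier loads/stores must complete before this, but
       later independent accesses need not wait for it. */
    atomic_flag_clear_explicit(&lock_word, memory_order_release);
}

void increment(void) {
    lock();
    shared_counter++;                    /* inside the critical section */
    unlock();
}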

Page 40:

Ordering in Consistency Models

Orderings enforced by each consistency model, where R = read, W = write, SA = synchronization acquire, SR = synchronization release, and "X -> Y" means an earlier X must perform before a later Y (the original table has columns for R->R, R->W, W->R, W->W, SA->R, SA->W, R->SA, W->SA, SR->R, SR->W, R->SR, W->SR):

SC: all of the above orders.
PC: all of the above orders except W->R.
WC: SA->R, SA->W, SR->R, SR->W, R->SA, W->SA, R->SR, W->SR; ordinary reads and writes are not ordered among themselves. Among synchs: SA->SA, SA->SR, SR->SR, SR->SA.
RC: SA->R, SA->W, R->SR, W->SR. Among synchs: SA->SA, SA->SR, SR->SR.

Page 41:

Reading Material

S. V. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial," WRL Research Report 95/7. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf

K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," ISCA 1990.