Transcript of E0-243: Computer Architecture, L2 – Parallel Architecture.

Page 1:

E0-243: Computer Architecture

L2 – Parallel Architecture

Page 2:

Overview

Parallel architecture
Cache coherence problem
Memory consistency

Page 3:

Trends

Ever-increasing transistor density ⇒ multiple processors (multiple cores) on a single chip (CMP)

Beyond instruction-level parallelism: thread-level parallelism

Speculative execution ⇒ speculative multithreaded execution

Page 4:

Recall:

Amdahl's Law: for a program whose fraction x must execute sequentially, speedup is limited to 1/x.

Speedup = (Exec. time on uniprocessor) / (Exec. time on N processors)

Efficiency = (Speedup on N processors) / N
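As a quick sanity check of these formulas, here is a minimal C sketch (the 10% sequential fraction and the processor counts are illustrative assumptions, not from the slides):

#include <stdio.h>

/* Amdahl's Law: with sequential fraction x, speedup on N processors
   is 1 / (x + (1 - x)/N), which approaches the limit 1/x as N grows. */
static double amdahl_speedup(double x, int n) {
    return 1.0 / (x + (1.0 - x) / n);
}

int main(void) {
    double x = 0.10;                        /* 10% sequential (illustrative) */
    int procs[] = { 2, 8, 64, 1024 };
    for (int i = 0; i < 4; i++) {
        double s = amdahl_speedup(x, procs[i]);
        /* Efficiency = Speedup / N, as defined above. */
        printf("N=%4d  speedup=%5.2f  efficiency=%.3f\n",
               procs[i], s, s / procs[i]);
    }
    return 0;                               /* speedup never exceeds 1/x = 10 */
}

Even with 1024 processors the speedup stays below 10, which is the 1/x bound.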

Page 5:

Space of Parallel Computing

Programming models
What the programmer uses in coding applications; specifies synchronization and communication.
Shared address space, e.g., OpenMP
Message passing, e.g., MPI
(A minimal code sketch of both models appears after this list.)

Parallel architecture
Shared memory:
Centralized shared memory (UMA)
Distributed shared memory (NUMA)
Distributed memory, a.k.a. message passing, e.g., clusters
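As an illustration of the two programming models, a minimal C/OpenMP sketch of a shared-address-space summation, with the message-passing counterpart noted in a comment (the example itself is an illustrative assumption, not from the slides):

#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;
    /* Shared address space: threads communicate implicitly through shared
       data; the reduction combines per-thread partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;
    printf("sum = %f\n", sum);
    /* In the message-passing model (e.g., MPI), each process would keep a
       private partial sum and combine the results explicitly, for example
       with MPI_Reduce, instead of writing to a shared variable. */
    return 0;
}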

Page 6:

Shared Memory Architectures

Shared, global address space, hence called shared address space.
Any processor can directly reference any memory location; communication occurs implicitly as a result of loads and stores.
Centralized: latencies to memory are uniform, but uniformly large.
Distributed: Non-Uniform Memory Access (NUMA).

Page 7:

Figure: Shared Memory Architecture.
Centralized shared memory: processors, each with a private cache ($, P), connect through a network to a set of shared memory modules (M).
Distributed shared memory: each node pairs a processor and cache ($, P) with a local memory module (M), and the nodes are connected by a network.

Page 8:

Distributed Memory Architecture

Figure: Message-passing architecture. Each processing node contains a processor (P), cache ($), and private memory (M), and the nodes are connected by a network.
Memory is private to each node; processes communicate by messages.

Page 9:

Caches and Cache Coherence

Caches play a key role in all cases:
Reduce average data access time.
Reduce bandwidth demands placed on the shared interconnect.

Private processor caches create a problem:
Copies of a variable can be present in multiple caches.
A write by processor P may not be visible to P', which will keep accessing the stale value from its cache: the cache coherence problem.

Page 10:

Cache Coherence Problem: Example

Processors see different values for u after event 3. With write-back caches, the value written back to memory depends on which cache flushes or writes back its copy.

Figure: P1, P2, and P3, each with a private cache ($), share a bus to memory and I/O devices; location u initially holds 5 in memory.
1. P1 reads u (caches u:5).
2. P3 reads u (caches u:5).
3. P3 writes u = 7.
4. P1 reads u: it may still see the stale value 5 from its own cache.
5. P2 reads u: which value does it get?

Page 11:

Cache Coherence Problem

Multiple processors with private caches create a potential data consistency problem: the cache coherence problem.
Processes shouldn't read `stale' data; intuitively, reading an address should return the last value written to that address.

Solutions:
Hardware: cache coherence mechanisms (invalidation-based vs. update-based; snoopy vs. directory)
Software: compiler-assisted cache coherence

Page 12:

Example: Snoopy Bus Protocols

Assumption: a shared-bus interconnect on which all cache controllers monitor all bus activity, called snooping.
Only one operation goes over the bus at a time, so cache controllers can be built to take corrective action and enforce coherence in the caches. Corrective action could involve updating or invalidating a cache block.

Page 13:

Snoopy Invalidate Protocol

Figure: the same bus-based example. P1 and P3 read u (events 1 and 2) and cache u:5. When P3 writes u = 7 (event 3), the bus transaction is snooped and the other cached copies of u are invalidated, so P1's next read of u (event 4) misses and fetches the up-to-date value instead of the stale 5.

Page 14:

Invalidate vs Update

Basic question of program behavior: is a block written by one processor later read by other processors before it is overwritten?

Invalidate:
readers will take a miss
multiple writes by the same processor cause no additional traffic
clears out copies that are not used again

Update:
avoids misses on later references
may generate multiple useless updates

Page 15:

MSI Invalidation Protocol

Cache block states (encoded in 2 bits and updated by the protocol):
I: Invalid
S: Shared (one or more cached copies)
M: Modified or Dirty (the only copy)

Processor events:
PrRd (read)
PrWr (write)

Bus transactions:
BusRd: asks for a copy with no intent to modify
BusRdX: asks for a copy with intent to modify
Flush: write back (updates main memory)

Page 16:

MSI: State Transition Diagram

Figure: MSI state transition diagram (notation: observed event / action taken).
From I: PrRd / BusRd -> S; PrWr / BusRdX -> M.
From S: PrRd / — (hit); BusRd / —; PrWr / BusRdX -> M; BusRdX / — -> I.
From M: PrRd / — and PrWr / — (hits); BusRd / Flush -> S; BusRdX / Flush -> I.
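The diagram above can also be written as a small state-machine sketch in C (a minimal illustration; the state, event, and action names mirror the slides, everything else is an assumption):

#include <stdio.h>

typedef enum { I, S, M } State;                    /* MSI cache block states   */
typedef enum { PrRd, PrWr, BusRd, BusRdX } Event;  /* processor and bus events */

/* One MSI transition: given the current state and an observed event,
   return the next state and the bus action (if any) to issue. */
static State msi_next(State st, Event ev, const char **action) {
    *action = "-";
    switch (st) {
    case I:
        if (ev == PrRd) { *action = "BusRd";  return S; }
        if (ev == PrWr) { *action = "BusRdX"; return M; }
        return I;                   /* bus events do not affect an invalid block */
    case S:
        if (ev == PrWr)   { *action = "BusRdX"; return M; }
        if (ev == BusRdX) return I; /* another cache wants to modify the block   */
        return S;                   /* PrRd and BusRd: no change, no action      */
    case M:
        if (ev == BusRd)  { *action = "Flush"; return S; }
        if (ev == BusRdX) { *action = "Flush"; return I; }
        return M;                   /* PrRd and PrWr hit locally                 */
    }
    return st;
}

int main(void) {
    const char *act;
    State st = I;
    Event trace[] = { PrRd, PrWr, BusRd, BusRdX };  /* example event trace */
    for (int i = 0; i < 4; i++) {
        st = msi_next(st, trace[i], &act);
        printf("after event %d: state=%d action=%s\n", i, st, act);
    }
    return 0;
}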

Page 17:

MESI (4-state) Invalidation Protocol

Problem with the MSI protocol: reading and then modifying data costs 2 bus transactions even when no one else is sharing, i.e., BusRd (I->S) followed by BusRdX or BusUpgr (S->M).

Add an exclusive state so the block can be written locally without a bus transaction. Memory is up to date, so the cache is not necessarily the owner.

States:
invalid
exclusive, or exclusive-clean (only this cache has a copy, and it is not modified)
shared (two or more caches may have copies)
modified (dirty)
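For comparison with the MSI sketch above, the processor-side transitions that change under MESI can be sketched as follows (illustrative only; the shared_signal flag models the bus "shared" line, all names are assumptions, and bus-side transitions are omitted):

/* MESI processor-side transitions; note the silent E -> M upgrade. */
typedef enum { INV, EXC, SHD, MOD } MesiState;

static MesiState mesi_on_processor(MesiState st, int is_write, int shared_signal,
                                   const char **bus_action) {
    *bus_action = "-";
    switch (st) {
    case INV:
        if (!is_write) { *bus_action = "BusRd";
                         return shared_signal ? SHD : EXC; }  /* read miss  */
        *bus_action = "BusRdX"; return MOD;                   /* write miss */
    case EXC:
        return is_write ? MOD : EXC;  /* write hit: E -> M with NO bus transaction */
    case SHD:
        if (is_write) { *bus_action = "BusRdX"; return MOD; }
        return SHD;
    case MOD:
        return MOD;                   /* read and write hits stay in M */
    }
    return st;
}

The read-then-write sequence on unshared data now costs a single bus transaction (BusRd into E, then a silent upgrade to M), which is exactly the saving over MSI described above.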

Page 18:

MESI - State Transition Diagram

Figure: MESI state transition diagram (notation: observed event / action taken; BusRd(S) means another cache asserted the shared signal, BusRd(S̄) means no other cache has a copy).
From I: PrRd / BusRd(S̄) -> E; PrRd / BusRd(S) -> S; PrWr / BusRdX -> M.
From E: PrRd / — (hit); PrWr / — -> M (no bus transaction); BusRd / Flush -> S; BusRdX / Flush -> I.
From S: PrRd / — (hit); BusRd / Flush; PrWr / BusRdX -> M; BusRdX / Flush -> I.
From M: PrRd / — and PrWr / — (hits); BusRd / Flush -> S; BusRdX / Flush -> I.

Page 19:

Scalability Issues of Snoopy Protocol

A snoopy cache is ideally suited to a bus-based interconnection network (IN).

A shared-bus IN saturates, so performance stops scaling for a large number of processors (beyond about 8).

For a non-bus-based IN, coherence messages would have to be broadcast, which is expensive.

Only a few processors may actually hold a copy of the shared data, so it may be more efficient to maintain a directory of the caches that have a copy of each cache block.

Page 20:

Directory Based Coherence

Memory (or a cache) maintains a list (directory) of the processors that have a copy of a block.

On a write, the memory controller sends an Invalidate (or Update) signal only to the processors that have a copy.

Memory also knows the current owner (in the case of dirty blocks); the memory controller requests the updated copy from the owner.

Page 21:

Generic Solution: Directories

Figure: each node contains a processor (P1, ...), cache, communication assist, and a slice of memory with its directory, and the nodes are connected by a scalable interconnection network. Each directory entry holds presence bits (one per node) and a dirty bit.
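A minimal C sketch of what one directory entry might look like (the field names and the 64-node limit are illustrative assumptions, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 64                 /* illustrative system size */

/* One directory entry per memory block: which caches hold a copy,
   and whether some cache holds the block dirty (modified). */
typedef struct {
    uint64_t presence;               /* bit i set => node i caches the block      */
    bool     dirty;                  /* true => exactly one cached, modified copy */
} DirEntry;

/* On a write by node 'writer', invalidations go only to the nodes whose
   presence bit is set, excluding the writer itself. */
static uint64_t invalidation_targets(const DirEntry *e, int writer) {
    return e->presence & ~(1ULL << writer);
}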

Page 22:

Memory Consistency Model

Memory consistency model: the order in which memory operations will appear to execute ⇒ what value can a read return?

A contract between application software and the system ⇒ affects ease of programming and performance.

Page 23:

Understanding Program Order: Example

Initially A = B = 0;
Process P1: A = 1;
Process P2: while (A==0); B = 1;
Process P3: while (B==0); Print A;

What value of A will be printed by process P3?

Note the role of program order in ensuring that P3 reads the value of A as 1.
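Under sequential consistency the chain A = 1, P2 observing A, B = 1, P3 observing B forces P3 to print 1. A minimal C11 sketch of the same pattern using seq_cst atomics, which give SC-like behavior (the thread setup is an illustrative assumption, not from the slides):

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int A = 0, B = 0;                 /* initially A = B = 0 */

int p1(void *arg) { (void)arg; atomic_store(&A, 1); return 0; }
int p2(void *arg) { (void)arg; while (atomic_load(&A) == 0) ; atomic_store(&B, 1); return 0; }
int p3(void *arg) { (void)arg; while (atomic_load(&B) == 0) ; printf("A = %d\n", atomic_load(&A)); return 0; }

int main(void) {
    thrd_t t1, t2, t3;
    thrd_create(&t3, p3, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_create(&t1, p1, NULL);
    thrd_join(t1, NULL); thrd_join(t2, NULL); thrd_join(t3, NULL);
    /* With the default seq_cst ordering this always prints A = 1. */
    return 0;
}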

Page 24:

Example 2

Software implementation of mutual exclusion:

Process P1: A = 0; ... A = 1; if (B == 0) critical section
Process P2: B = 0; ... B = 1; if (A == 0) critical section

Can both P1 and P2 enter the critical section, i.e., both evaluate the "if" condition as true?
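This is the classic store-buffering (Dekker-style) pattern. A minimal C11 sketch of how it can break on a machine that lets a load bypass an earlier store, and how a full fence restores the intended behavior (the fence placement is an illustrative assumption, not from the slides):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int A, B;                      /* both start at 0 */

/* Run by process P1; P2 is symmetric with A and B swapped. */
bool try_enter_p1(void) {
    atomic_store_explicit(&A, 1, memory_order_relaxed);     /* "A = 1;" */
    /* Without this fence, a machine that relaxes the write -> read order can
       let the load of B below be performed before the store of A becomes
       visible, so BOTH processes may read 0 and both enter. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&B, memory_order_relaxed) == 0;  /* "if (B == 0)" */
}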

Page 25:

Sequential Consistency: Definition

A system is sequentially consistent if:
operations within a processor follow program order;
operations of all processors were executed in some (interleaved) sequential order;
all processors see the same sequential order.

Page 26:

Implicit Memory Model

Sequential consistency (SC) [Lamport]: the result of an execution appears as if
• operations from different processors executed in some sequential (interleaved) order, and
• memory operations of each process occurred in program order.

Figure: processors P1, P2, P3, ..., Pn all access a single shared MEMORY.

Page 27:

Sequential Consistency: Definition

A system is sequentially consistent if:
operations within a processor follow program order;
operations of all processors were executed in some (interleaved) sequential order;
all processors see the same sequential order.

Initially A = B = 0;
Process P1: A = 1;
Process P2: while (A==0); B = 1;
Process P3: while (B==0); Print A;

Page 28:

Under SC can P3 print A as 0?

Initially A = B = 0;
Process P1: (w1) A = 1;
Process P2: (r2) while (A==0); (w2) B = 1;
Process P3: (r3) while (B==0); (r3') Print A;

Under SC, w1 -> r2 (P2 leaves its loop only after reading A = 1), r2 -> w2 and r3 -> r3' (program order), and w2 -> r3 (P3 leaves its loop only after reading B = 1). Chaining these orders, r3' must return 1, so P3 cannot print A as 0.

Page 29:

Sequential Consistency

SC ensures all memory orders: Write -> Read, Write -> Write, Read -> Read, Read -> Write.

SC treats all memory operations the same way!

Page 30:

Processor Consistency: Definition

A system is processor consistent if:
writes issued by a processor are observed in program order, i.e., the read -> read, read -> write, and write -> write orders are enforced, but not the write -> read order;
operations of all processors were executed in some (interleaved) sequential order;
all processors need not see the same sequential order of writes from different processors.

Revisit the two examples under processor consistency:

Initially A = B = 0;
Process P1: A = 1;
Process P2: while (A==0); B = 1;
Process P3: while (B==0); Print A;

Process P1: A = 0; ... A = 1; if (B == 0) critical section
Process P2: B = 0; ... B = 1; if (A == 0) critical section

Page 31:

Example 2

Process P1: A = 0; ... A = 1; if (B == 0) critical section
Process P2: B = 0; ... B = 1; if (A == 0) critical section

Under processor consistency the read of the other process's flag may be performed before the earlier write is visible (no write -> read order), so both processes can evaluate their condition as true and both can enter the critical section.

Page 32:

Weak Consistency

Distinguishes between ordinary memory operations and synchronization operations (e.g., lock acquire/release).

A system is weakly consistent if:
before a load/store is allowed to perform, all previous synchronization accesses must have performed;
before a synchronization operation is allowed to perform, all previous loads/stores must have performed;
synchronization accesses are sequentially consistent.

Page 33:

Weak Consistency

Weak ordering: divide memory operations into data operations and synchronization operations.

Synchronization operations act like a fence:
all data operations before the synch in program order must complete before the synch is executed;
all data operations after the synch in program order must wait for the synch to complete;
synchs are performed in program order.

Page 34:

Weak Consistency

Weak ordering, implementation of the fence: the processor keeps a counter that is incremented when a data operation is issued and decremented when it completes, so a synch can proceed only once the counter reaches zero.

Example: PowerPC has the SYNC instruction.
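At the programming-language level, C11's atomic_thread_fence plays an analogous fence role. A minimal sketch, assuming a simple produce/consume pattern (the variable names are illustrative, not from the slides):

#include <stdatomic.h>

int data1, data2;                      /* ordinary (data) operations */
atomic_int ready;                      /* synchronization variable   */

void producer(void) {
    data1 = 1;                         /* data operations before the synch */
    data2 = 2;
    /* Fence: as in weak ordering, the data operations above must complete
       before the synchronization store below. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);   /* synch access */
}

void consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                              /* wait on the synchronization variable */
    atomic_thread_fence(memory_order_seq_cst);
    /* data1 and data2 may be read safely here: the fences keep the data
       operations on the correct side of the synchronization accesses. */
}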

Page 35:

An Example:

Figure: the same instruction sequence (Load, Load, Store, Load, Store, Store) shown under the two models. Under Sequential Consistency every operation must perform in program order. Under Processor Consistency a later Load may perform before an earlier Store (the Write -> Read order is relaxed), while the remaining orders are maintained.

Page 36:

Example: Weak Consistency

Figure: a stream of Load/Store operations separated by synchronization operations, Sync(Acq) ... Sync(Rel) ... Sync(Acq) ... Sync(Rel). There is no ordering among the loads/stores within a region between two Syncs, but no load/store may be reordered across a Sync, and the Syncs themselves perform in program order.

Page 37:

Another model: Release Consistency

Synchronization accesses are divided into:
acquires: operations like lock
releases: operations like unlock

Semantics of acquire: the acquire must complete before all following memory accesses.

Semantics of release: all memory operations before the release must complete, but accesses after the release in program order do not have to wait for the release; operations that follow the release and that need to wait must be protected by an acquire.

Page 38:

Release Consistency

Further distinguishes between lock-acquire and lock-release synchronization operations.

A system is release consistent if:
before a load/store is allowed to perform, all previous acquire accesses must have performed;
before a release synchronization operation is allowed to perform, all previous loads/stores must have performed;
synchronization accesses are processor consistent.

Page 39:

Example: Release Consistency

Figure: the same regions of Load/Store operations shown under Weak Consistency (with Sync(Acq)/Sync(Rel) acting as full fences) and under Release Consistency (with Acquire and Release). Under release consistency an Acquire orders only the accesses that follow it and a Release waits only for the accesses that precede it, so ordinary loads/stores may move upward past a preceding Release or downward past a following Acquire.
• Acquire treated as READ/LOAD
• Release treated as WRITE/STORE
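C11/C++11 acquire/release atomics map directly onto this model. A minimal sketch of a spinlock whose lock is an acquire and whose unlock is a release (the lock variable and function names are illustrative assumptions, not from the slides):

#include <stdatomic.h>

atomic_flag lock_word = ATOMIC_FLAG_INIT;
int shared_counter;                      /* data protected by the lock */

void lock(void) {
    /* Acquire: no later load/store may be performed before this succeeds. */
    while (atomic_flag_test_and_set_explicit(&lock_word, memory_order_acquire))
        ;                                /* spin */
}

void unlock(void) {
    /* Release: all earlier loads/stores must complete before this, but
       later independent accesses need not wait for it. */
    atomic_flag_clear_explicit(&lock_word, memory_order_release);
}

void increment(void) {
    lock();
    shared_counter++;                    /* inside the critical section */
    unlock();
}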

Page 40:

Ordering in Consistency Models

Orderings enforced by each consistency model, where R = read, W = write, SA = synchronization acquire, SR = synchronization release, and "X -> Y" means an earlier X must perform before a later Y (the original table has columns for R->R, R->W, W->R, W->W, SA->R, SA->W, R->SA, W->SA, SR->R, SR->W, R->SR, W->SR):

SC: all of the above orders.
PC: all of the above orders except W->R.
WC: SA->R, SA->W, SR->R, SR->W, R->SA, W->SA, R->SR, W->SR; ordinary reads and writes are not ordered among themselves. Among synchs: SA->SA, SA->SR, SR->SR, SR->SA.
RC: SA->R, SA->W, R->SR, W->SR. Among synchs: SA->SA, SA->SR, SR->SR.

Page 41:

Reading Material

S. V. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial," WRL Research Report 95/7. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf

K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," ISCA 1990.