MCU architecture
Program Optimization for Multi-core: Hardware side of it
Contents
Virtual Memory and Caches (Recap)
Fundamentals of Parallel Computers: ILP vs. TLP
Parallel Programming: Shared Memory and Message Passing
Performance Issues in Shared Memory
Shared Memory Multiprocessors: Consistency and Coherence
Synchronization
Memory consistency models
Case Studies of CMP
RECAP: VIRTUAL MEMORY AND CACHE
Why virtual memory?
With a 32-bit address you can access 4 GB of physical memory (you will never get the full memory though)
Seems enough for most day-to-day applications
But there are important applications that have a much bigger memory footprint: databases, scientific apps operating on large matrices etc.
Even if your application fits entirely in physical memory, it seems unfair to load the full image at startup
Just takes away memory from other processes, but probably doesn't need the full image at any point of time during execution: hurts multiprogramming
Need to provide an illusion of bigger memory: Virtual Memory (VM)
Virtual memory
Need an address to access virtual memory
Virtual Address (VA)
Assume a 32-bit VA
Every process sees 4 GB of virtual memory
This is much better than a 4 GB physical memory shared between multiprogrammed processes
The size of the VA is really fixed by the processor data path width
64-bit processors (Alpha 21264, 21364; Sun UltraSPARC; AMD Athlon64, Opteron; IBM POWER4, POWER5; MIPS R10000 onwards; Intel Itanium etc., and recently Intel Pentium 4) provide bigger virtual memory to each process
Large virtual and physical memory is very important in the commercial server market: need to run large databases
Addressing VM
There are primarily three ways to address VM
Paging, Segmentation, Segmented paging
We will focus on flat paging only
Paged VM
The entire VM is divided into small units called pages
Virtual pages are loaded into physical page frames as and when needed (demand paging)
Thus the physical memory is also divided into equal-sized page frames
The processor generates virtual addresses
But memory is physically addressed: need a VA to PA translation
VA to PA translation
The VA generated by the processor is divided into two parts: Page offset and Virtual page number (VPN)
Assume a 4 KB page: within a 32-bit VA, the lower 12 bits will be the page offset (offset within a page) and the remaining 20 bits are the VPN (hence 1 M virtual pages total)
The page offset remains unchanged in the translation
Need to translate the VPN to a physical page frame number (PPFN)
This translation is held in a page table resident in memory: so first we need to access this page table
How to get the address of the page table?
VA to PA translation
Accessing the page table
The page table base register (PTBR) contains the starting physical address of the page table
PTBR is normally accessible in kernel mode only
Assume each entry in the page table is 32 bits (4 bytes)
Thus the required page table entry address is PTBR + (VPN * 4)
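To make the translation arithmetic concrete, here is a small C sketch (not from the original slides; the sample addresses and macro names are illustrative) that splits a 32-bit VA and forms the PTE address, assuming the 4 KB pages and 4-byte entries described above:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                 /* 4 KB page -> 12 offset bits */
#define PAGE_MASK  0xFFFu
#define PTE_SIZE   4u                 /* each page table entry is 4 bytes */

int main(void) {
    uint32_t va   = 0x12345678;       /* illustrative virtual address */
    uint32_t ptbr = 0x00400000;       /* illustrative page table base */
    uint32_t vpn    = va >> PAGE_SHIFT;       /* upper 20 bits */
    uint32_t offset = va & PAGE_MASK;         /* lower 12 bits */
    uint32_t pte_pa = ptbr + vpn * PTE_SIZE;  /* PTBR + (VPN * 4) */
    printf("VPN=0x%05x offset=0x%03x PTE at 0x%08x\n", vpn, offset, pte_pa);
    return 0;
}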
Page fault
The valid bit within the 32 bits tells you if the translation is valid
If this bit is reset, that means the page is not resident in memory: results in a page fault
In case of a page fault the kernel needs to bring in the page to memory from disk
The disk address is normally provided by the page table entry (different interpretation of the 31 bits)
Also, the kernel needs to allocate a new physical page frame for this virtual page
If all frames are occupied it invokes a page replacement policy
TLB
Why can't we cache the most recently used translations? Translation Look-aside Buffers (TLB)
Small set of registers (normally fully associative)
Each entry has two parts: the tag, which is simply the VPN, and the corresponding PTE
The tag may also contain a process id
On a TLB hit you just get the translation in one cycle (may take slightly longer depending on the design)
On a TLB miss you may need to access memory to load the PTE into the TLB (more later)
Normally there are two TLBs: instruction and data
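A minimal sketch of the fully associative lookup described above; the entry layout and size are illustrative assumptions, not a real design. Real hardware performs all the tag comparisons in parallel; software can only loop:

#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    uint32_t vpn;     /* tag: virtual page number */
    uint32_t pte;     /* cached page table entry */
    int      valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns 1 on a TLB hit (with the PTE), 0 on a miss. */
int tlb_lookup(uint32_t vpn, uint32_t *pte) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pte = tlb[i].pte;
            return 1;  /* hit */
        }
    }
    return 0;          /* miss: walk the page table */
}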
Caches
Once you have completed the VA to PA translation you have the physical address. What's next?
You need to access memory with that PA
Instruction and data caches hold most recently used (temporally close) and nearby (spatially close) data
Use the PA to access the cache first
Caches are organized as arrays of cache lines
Each cache line holds several contiguous bytes (32, 64 or 128 bytes)
Addressing a cache
The PA is divided into several parts: TAG | INDEX | BLK. OFFSET
The block offset determines the starting byte address within a cache line
The index tells you which cache line to access
In that cache line you compare the tag to determine hit/miss
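For concreteness, a sketch of this split for a hypothetical direct-mapped cache with 64-byte lines and 512 lines (32 KB total); the parameters are illustrative, not from the slides:

#include <stdint.h>

/* 64-byte line -> 6 offset bits; 512 lines -> 9 index bits; rest is tag */
uint32_t blk_offset(uint32_t pa)  { return pa & 0x3F; }
uint32_t cache_index(uint32_t pa) { return (pa >> 6) & 0x1FF; }
uint32_t cache_tag(uint32_t pa)   { return pa >> 15; }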
Addressing a cache
[Diagram: the PA is split into TAG, INDEX and BLK. OFFSET; the index selects a line whose stored TAG, STATE and DATA are read; the stored tag is compared with the PA tag to signal HIT/MISS, and the block offset plus the access size (how many bytes) select the data returned.]
Set associative cache
The example assumes one cache line per index
Called a direct-mapped cache
A different access to a line evicts the resident cache line
This is either a capacity or a conflict miss
Conflict misses can be reduced by providing multiple lines per index
Access to an index returns a set of cache lines
For an n-way set associative cache there are n lines per set
Carry out multiple tag comparisons in parallel to see if any one in the set hits
2-way set associative
[Diagram: the index selects a set of two lines; the stored TAG0 and TAG1 are compared with the PA tag in parallel, each line carrying its own STATE and DATA.]
Set associative cache
When you need to evict a line in a particular set you run a replacement policy
LRU is a good choice: keeps the most recently used lines (favors temporal locality)
Thus you reduce the number of conflict misses
Two extremes of set size: direct-mapped (1-way) and fully associative (all lines are in a single set)
Example: 32 KB cache, 2-way set associative, line size of 64 bytes: number of indices or number of sets = 32*1024/(2*64) = 256, and hence the index is 8 bits wide
Example: same size and line size, but fully associative: number of sets is 1; within the set there are 32*1024/64 or 512 lines; you need 512 tag comparisons for each access
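A small sketch that reproduces the arithmetic of the two examples above:

#include <stdio.h>

int main(void) {
    unsigned size = 32 * 1024, ways = 2, line = 64;
    unsigned sets = size / (ways * line);            /* 32768 / 128 = 256 sets */
    unsigned index_bits = 0;
    while ((1u << index_bits) < sets) index_bits++;  /* log2(256) = 8 */
    printf("sets=%u, index is %u bits wide\n", sets, index_bits);
    return 0;
}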
Cache hierarchy
Ideally want to hold everything in a fast cache
Never want to go to the memory
But, with increasing size the access time increases
A large cache will slow down every access
So, put increasingly bigger and slower caches between the processor and the memory
Keep the most recently used data in the nearest cache: register file (RF)
Next level of cache: level 1 or L1 (same speed or slightly slower than RF, but much bigger)
Then L2: way bigger than L1 and much slower
Cache hierarchy
Example: Intel Pentium 4 (NetBurst)
128 registers accessible in 2 cycles
L1 data cache: 8 KB, 4-way set associative, 64-byte line size, accessible in 2 cycles for integer loads
L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 7 cycles
Example: Intel Itanium 2 (code name Madison)
128 registers accessible in 1 cycle
L1 instruction and data caches: each 16 KB, 4-way set associative, 64-byte line size, accessible in 1 cycle
Unified L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 5 cycles
Unified L3 cache: 6 MB, 24-way set associative, 128-byte line size, accessible in 14 cycles
States of a cache line
The life of a cache line starts off in the invalid state (I)
An access to that line takes a cache miss and fetches the line from main memory
If it was a read miss the line is filled in the shared state (S) [we will discuss it later; for now just assume that this is equivalent to a valid state]
In case of a store miss the line is filled in the modified state (M); instruction cache lines do not normally enter the M state (no store to Icache)
The eviction of a line in M state must write the line back to the memory (this is called a writeback cache); otherwise the effect of the store would be lost
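The fill and eviction rules above can be summarized in a minimal sketch (just the three states named on this slide, not a full coherence protocol):

typedef enum { I, S, M } line_state;   /* invalid, shared, modified */

line_state fill_on_read_miss(void)  { return S; }  /* read miss fills in S */
line_state fill_on_store_miss(void) { return M; }  /* store miss fills in M */

/* Evicting a line in M state must write it back to memory. */
int needs_writeback(line_state st) { return st == M; }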
The first instruction
Accessing the first instruction
Take the starting PC
Access the iTLB with the VPN extracted from the PC: iTLB miss
Invoke the iTLB miss handler
Calculate the PTE address
If PTEs are cached in the L1 data and L2 caches, look them up with the PTE address: you will miss there also
Access the page table in main memory: the PTE is invalid: page fault
Invoke the page fault handler
Allocate a page frame, read the page from disk, update the PTE, load the PTE into the iTLB, restart fetch
The first instruction
Now you have the physical address
Access the Icache: miss
Send a refill request to higher levels: you miss everywhere
Send the request to the memory controller (north bridge)
Access main memory
Read the cache line
Refill all levels of cache as the cache line returns to the processor
Extract the appropriate instruction from the cache line with the block offset
This is the longest possible latency in an instruction/data access
TLB access
For every cache access (instruction or data) you need to access the TLB first
Puts the TLB in the critical path
Want to start indexing into the cache and read the tags while the TLB lookup takes place
Virtually indexed physically tagged cache
Extract the index from the VA, start reading the tag while looking up the TLB
Once the PA is available do the tag comparison
Overlaps TLB reading and tag reading
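A standard rule of thumb for this trick (an addition beyond the slide): the index must come from VA bits that do not change under translation, i.e., index bits plus block offset bits must fit within the page offset. A sketch:

/* For the VA-derived index to equal the PA-derived index, it must lie
   entirely within the untranslated page-offset bits. */
int vipt_index_safe(unsigned index_bits, unsigned offset_bits,
                    unsigned page_shift) {
    return index_bits + offset_bits <= page_shift;
}
/* With 4 KB pages (12 bits) and 64-byte lines (6 bits) the index can be
   at most 6 bits: 64 sets, e.g., a 16 KB 4-way L1 as in the Itanium 2
   example earlier. */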
Memory op latency
L1 hit: ~1 ns
L2 hit: ~5 ns
L3 hit: ~10-15 ns
Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns
If a load misses in all caches it will eventually come to the head of the ROB and block instruction retirement (in-order retirement is a must)
Gradually, the pipeline backs up, the processor runs out of resources such as ROB entries and physical registers
Ultimately, the fetcher stalls: severely limits ILP
MLP
Need memory-level parallelism (MLP)
Simply speaking, need to mutually overlap several memory operations
Step 1: Non-blocking cache
Allow multiple outstanding cache misses
Mutually overlap multiple cache misses
Supported by all microprocessors today (Alpha 21364 supported 16 outstanding cache misses)
Step 2: Out-of-order load issue
Issue loads out of program order (address is not known at the time of issue)
How do you know the load didn't issue before a store to the same address? Issuing stores must check for this memory-order violation
Out-of-order loads

sw 0(r7), r6
/* other instructions */
lw r2, 80(r20)

Assume that the load issues before the store because r20 gets ready before r6 or r7
The load accesses the store buffer (used for holding already executed store values before they are committed to the cache at retirement)
If it misses in the store buffer it looks up the caches and, say, gets the value somewhere
After several cycles the store issues and it turns out that 0(r7) == 80(r20) or they overlap; now what?
Load/store ordering
Out-of-order load issue relies on speculative memory disambiguation
Assumes that there will be no conflicting store
If the speculation is correct, you have issued the load much earlier and you have allowed the dependents to also execute much earlier
If there is a conflicting store, you have to squash the load and all the dependents that have consumed the load value and re-execute them systematically
Turns out that the speculation is correct most of the time
To further minimize load squashes, microprocessors use simple memory dependence predictors (predicts if a load is going to conflict with a pending store based on that load's or load/store pair's past behavior)
MLP and memory wall
Today microprocessors try to hide cache misses by initiating early prefetches:
Hardware prefetchers try to predict the next several load addresses and initiate cache line prefetches if they are not already in the cache
All processors today also support prefetch instructions; so you can specify in your program when to prefetch what: this gives much better control compared to a hardware prefetcher
Researchers are working on load value prediction
Even after doing all these, memory latency remains the biggest bottleneck
Today microprocessors are trying to overcome one single wall: the memory wall
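As an illustration of software prefetching, a sketch using GCC's __builtin_prefetch builtin; the prefetch distance of 16 elements is an illustrative tuning choice, not from the slides:

/* Prefetch a[i+16] while summing a[i]: by the time the loop reaches
   that element, the line is hopefully already in the cache. Prefetch
   instructions do not fault, so running past the end is harmless. */
double sum_with_prefetch(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low temporal locality */
        sum += a[i];
    }
    return sum;
}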
Fundamentals of Parallel Computers
Agenda
Convergence of parallel architectures
Fundamental design issues
ILP vs. TLP
Communication architecture
Historically, parallel architectures are tied to programming models
Diverse designs made it impossible to write portable parallel software
But the driving force was the same: need for fast processing
Today parallel architecture is seen as an extension of microprocessor architecture with a communication architecture
Defines the basic communication and synchronization operations and provides hw/sw implementation of those
Layered architecture
A parallel architecture can be divided into several layers
Parallel applications
Programming models: shared address, message passing, multiprogramming, data parallel, dataflow etc.
Compiler + libraries
Operating systems support
Communication hardware
Physical communication medium
Communication architecture = user/system interface + hw implementation (roughly defined by the last four layers)
Compiler and OS provide the user interface to communicate between and synchronize threads
Shared address
Communication takes place through a logically shared portion of memory
User interface is normal load/store instructions
Load/store instructions generate virtual addresses
The VAs are translated to PAs by the TLB or page table
The memory controller then decides where to find this PA
Actual communication is hidden from the programmer
The general communication hw consists of multiple processors connected over some medium so that they can talk to memory banks and I/O devices
The architecture of the interconnect may vary depending on projected cost and target performance
Shared address
Communication medium
Interconnect could be a crossbar switch so that any processor can talk to any memory bank in one hop (provides latency and bandwidth advantages)
Scaling a crossbar becomes a problem: cost is proportional to the square of the size
Instead, could use a scalable switch-based network; latency increases and bandwidth decreases because now multiple processors contend for switch ports
[Diagram: "dance hall" organization: processors (P) on one side of an INTERCONNECT, memory banks (MEM) and I/O on the other.]
Shared address
Communication medium
From the mid-80s the shared bus became popular, leading to the design of SMPs
Pentium Pro Quad was the first commodity SMP
Sun Enterprise server provided a highly pipelined wide shared bus for scalability reasons; it also distributed the memory to each processor, but there was no local bus on the boards, i.e., the memory was still symmetric (must use the shared bus)
NUMA or DSM architectures provide a better solution to the scalability problem; the symmetric view is replaced by local and remote memory, and each node (containing processor(s) with caches, memory controller and router) gets connected via a scalable network (mesh, ring etc.); examples include Cray/SGI T3E, SGI Origin 2000, Alpha GS320, Alpha/HP GS1280 etc.
Message passing
Very popular for large-scale computing
The system architecture looks exactly the same as DSM, but there is no shared memory
The user interface is via send/receive calls to the message layer
The message layer is integrated into the I/O system instead of the memory system
Send specifies a local data buffer that needs to be transmitted; send also specifies a tag
A matching receive at the destination node with the same tag reads in the data from a kernel space buffer to user memory
Effectively, provides a memory-to-memory copy
Message passing
Actual implementation of the message layer
Initially it was very topology dependent
A node could talk only to its neighbors through FIFO buffers
These buffers were small in size and therefore while sending a message, send would occasionally block waiting for the receive to start reading the buffer (synchronous message passing)
Soon the FIFO buffers got replaced by DMA (direct memory access) transfers so that a send can initiate a transfer from memory to I/O buffers and finish immediately (DMA happens in the background); the same applies to the receiving end also
The parallel algorithms were designed specifically for certain topologies: a big problem
Message passing
To improve usability of machines, the message layer started providing support for arbitrary source and destination (not just nearest neighbors)
Essentially involved storing a message in intermediate hops and forwarding it to the next node on the route
Later this store-and-forward routing got moved to hardware where a switch could handle all the routing activities
Further improved to do pipelined wormhole routing so that the time taken to traverse the intermediate hops became small compared to the time it takes to push the message from processor to network (limited by node-to-network bandwidth)
Examples include IBM SP2, Intel Paragon
Each node of Paragon had two i860 processors, one of which was dedicated to servicing the network (send/recv. etc.)
Convergence
Shared address and message passing are two distinct programming models, but the architectures look very similar
Both have a communication assist or network interface to initiate messages or transactions
In shared memory this assist is integrated with the memory controller
In message passing this assist normally used to be integrated with the I/O, but the trend is changing
There are message passing machines where the assist sits on the memory bus, or machines where DMA over the network is supported (direct transfer from source memory to destination memory)
Finally, it is possible to emulate send/recv. on shared memory through shared buffers, flags and locks
Possible to emulate a shared virtual memory on message passing machines through modified page fault handlers
A generic architecture
In all the architectures we have discussed thus far, a node essentially contains processor(s) + caches, memory and a communication assist (CA)
CA = network interface (NI) + communication controller
The nodes are connected over a scalable network
The main difference remains in the architecture of the CA
And even under a particular programming model (e.g., shared memory) there are a lot of choices in the design of the CA
Most innovations in parallel architecture take place in the communication assist (also called communication controller or node controller)
A generic architecture
[Diagram: each NODE contains a processor (P) with CACHE, MEM and a CA connected by a crossbar (XBAR); the nodes are connected over a SCALABLE NETWORK.]
Design issues
Need to understand architectural components that affect software
Compiler, library, program
User/system interface and hw/sw interface
How do programming models efficiently talk to the communication architecture?
How to implement efficient primitives in the communication layer?
In a nutshell, what issues of a parallel machine will affect the performance of the parallel applications?
Naming, Operations, Ordering, Replication, Communication cost
Naming
How are the data in a program referenced?
In sequential programs a thread can access any variable in its virtual address space
In shared memory programs a thread can access any private or shared variable (same load/store model as sequential programs)
In message passing programs a thread can access local data directly
Clearly, naming requires some support from hw and OS
Need to make sure that the accessed virtual address gets translated to the correct physical address
Operations
What operations are supported to access data?
For sequential and shared memory models load/store are sufficient
For message passing models send/receive are needed to access remote data
For shared memory, hw (essentially the CA) needs to make sure that a load/store operation gets correctly translated to a message if the address is remote
For message passing, the CA or the message layer needs to copy data from local memory and initiate a send, or copy data from the receive buffer to local memory
Ordering
How are the accesses to the same data ordered?
For the sequential model, it is the program order: true dependence order
For shared memory, within a thread it is the program order; across threads, some valid interleaving of accesses as expected by the programmer and enforced by synchronization operations (locks, point-to-point synchronization through flags, global synchronization through barriers)
Ordering issues are very subtle and important in the shared memory model (some microprocessor re-ordering tricks may easily violate correctness when used in a shared memory context)
For message passing, ordering across threads is implied through point-to-point send/receive pairs (producer-consumer relationship) and mutual exclusion is inherent (no shared variable)
Replication
How is the shared data locally replicated?
This is very important for reducing communication traffic
In microprocessors data is replicated in the cache to reduce memory accesses
In message passing, replication is explicit in the program and happens through receive (a private copy is created)
In shared memory a load brings in the data to the cache hierarchy so that subsequent accesses can be fast; this is totally hidden from the program and therefore the hardware must provide a layer that keeps track of the most recent copies of the data (this layer is central to the performance of shared memory multiprocessors and is called the cache coherence protocol)
Communication cost
Three major components of the communication architecture affect performance
Latency: time to do an operation (e.g., load/store or send/recv.)
Bandwidth: rate of performing an operation
Overhead or occupancy: how long the communication layer is occupied doing an operation
Latency
Already a big problem for microprocessors
Even bigger problem for multiprocessors due to remote operations
Must optimize the application or hardware to hide or lower latency (algorithmic optimizations, or prefetching, or overlapping computation with communication)
Communication cost
Bandwidth
How many ops in unit time, e.g., how many bytes transferred per second
Local BW is provided by heavily banked memory or a faster and wider system bus
Communication BW has two components: 1. node-to-network BW (also called network link BW) measures how fast bytes can be pushed into the router from the CA; 2. within-network bandwidth, affected by the scalability of the network and the architecture of the switch or router
Linear cost model: Transfer time = T0 + n/B, where T0 is the start-up overhead, n is the number of bytes transferred and B is the BW
Not sufficient since overlap of comp. and comm. is not considered; also does not count how the transfer is done (pipelined or not)
Communication cost
Better model:
Communication time for n bytes = Overhead + CA occupancy + Network latency + Size/BW + Contention
T(n) = Ov + Oc + L + n/B + Tc
Overhead and occupancy may be functions of n
Contention depends on the queuing delay at various components along the communication path, e.g., waiting time at the communication assist or controller, waiting time at the router etc.
Overall communication cost = frequency of communication x (communication time - overlap with useful computation)
Frequency of communication depends on various factors such as how the program is written or the granularity of communication supported by the underlying hardware
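To make the model concrete, a small worked sketch; all parameter values are made up for illustration, not taken from the slides:

#include <stdio.h>

/* T(n) = Ov + Oc + L + n/B + Tc, with times in microseconds and
   B in bytes per microsecond. */
double comm_time(double Ov, double Oc, double L,
                 double n, double B, double Tc) {
    return Ov + Oc + L + n / B + Tc;
}

int main(void) {
    /* 1 us overhead, 0.5 us occupancy, 2 us latency, 1024 bytes at
       100 bytes/us (100 MB/s), 0.3 us contention */
    printf("T(1024) = %.2f us\n", comm_time(1.0, 0.5, 2.0, 1024, 100.0, 0.3));
    return 0;
}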
ILP vs. TLP
Microprocessors enhance performance of a sequential program by extracting parallelism from an instruction stream (called instruction-level parallelism)
Multiprocessors enhance performance of an explicitly parallel program by running multiple threads in parallel (called thread-level parallelism)
TLP provides parallelism at a much larger granularity compared to ILP
In multiprocessors ILP and TLP work together
Within a thread ILP provides a performance boost
Across threads TLP provides speedup over a sequential version of the parallel program
Parallel Programming
Agenda
Steps in writing a parallel program
Example
Writing a parallel program
Start from a sequential description
Identify work that can be done in parallel
Partition work and/or data among threads or processes
Decomposition and assignment
Add necessary communication and synchronization
Orchestration
Map threads to processors (Mapping)
How good is the parallel program?
Measure speedup = sequential execution time / parallel execution time = number of processors ideally
Some definitions
Task
Arbitrary piece of sequential work
Concurrency is only across tasks
Fine-grained task vs. coarse-grained task: controls granularity of parallelism (spectrum of grain: one instruction to the whole sequential program)
Process/thread
Logical entity that performs a task
Communication and synchronization happen between threads
Processors
Physical entity on which one or more processes execute
Decomposition
Find concurrent tasks and divide the program into tasks
Level or grain of concurrency needs to be decided here
Too many tasks: may lead to too much overhead communicating and synchronizing between tasks
Too few tasks: may lead to idle processors
Goal: just enough tasks to keep the processors busy
Number of tasks may vary dynamically
New tasks may get created as the computation proceeds: new rays in ray tracing
Number of available tasks at any point in time is an upper bound on the achievable speedup
Static assignment
Given a decomposition it is possible to assign tasks statically
For example, some computation on an array of size N can be decomposed statically by assigning a range of indices to each process: for k processes, P0 operates on indices 0 to (N/k)-1, P1 operates on N/k to (2N/k)-1, ..., Pk-1 operates on (k-1)N/k to N-1
For regular computations this works great: simple and low-overhead
What if the nature of the computation depends on the index?
For certain index ranges you do some heavy-weight computation while for others you do something simple
Is there a problem?
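A minimal sketch of this static range assignment; the remainder handling for the last process is an assumption the slide glosses over:

/* Process pid of k processes covers [lo, hi) of an array of size N. */
void static_range(int pid, int k, int N, int *lo, int *hi) {
    *lo = pid * (N / k);
    *hi = (pid == k - 1) ? N : (pid + 1) * (N / k);
}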
Decomposition types
Decomposition by data
The most commonly found decomposition technique
The data set is partitioned into several subsets and each subset is assigned to a process
The type of computation may or may not be identical on each subset
Very easy to program and manage
Computational decomposition
Not so popular: tricky to program and manage
All processes operate on the same data, but probably carry out different kinds of computation
More common in systolic arrays, pipelined graphics processor units (GPUs) etc.
Orchestration
Involves structuring communication and synchronization among processes, organizing data structures to improve locality, and scheduling tasks
This step normally depends on the programming model and the underlying architecture
Goal is to
Reduce communication and synchronization costs
Maximize locality of data reference
Schedule tasks to maximize concurrency: do not schedule dependent tasks in parallel
Reduce the overhead of parallelization and concurrency management (e.g., management of the task queue, overhead of initiating a task etc.)
Mapping
At this point you have a parallel program
Just need to decide which and how many processes go to each processor of the parallel machine
Could be specified by the program
Pin particular processes to a particular processor for the whole life of the program; the processes cannot migrate to other processors
Could be controlled entirely by the OS
Schedule processes on idle processors
Various scheduling algorithms are possible, e.g., round robin: process #k goes to processor #k
NUMA-aware OS normally takes into account multiprocessor-specific metrics in scheduling
How many processes per processor? Most common is one-to-one
An example
Iterative equation solver
Main kernel in the Ocean simulation
Update each 2-D grid point via Gauss-Seidel iterations
A[i,j] = 0.2 * (A[i,j] + A[i,j+1] + A[i,j-1] + A[i+1,j] + A[i-1,j])
Pad the n by n grid to (n+2) by (n+2) to avoid corner problems
Update only the interior n by n grid
One iteration consists of updating all n^2 points in-place and accumulating the difference from the previous value at each point
If the difference is less than a threshold, the solver is said to have converged to a stable grid equilibrium
Sequential program

int n;
float **A, diff;

begin main()
    read (n); /* size of grid */
    Allocate (A);
    Initialize (A);
    Solve (A);
end main

begin Solve (A)
    int i, j, done = 0;
    float temp;
    while (!done)
        diff = 0.0;
        for i = 0 to n-1
            for j = 0 to n-1
                temp = A[i,j];
                A[i,j] = 0.2 * (A[i,j] + A[i,j+1] + A[i,j-1] + A[i-1,j] + A[i+1,j]);
                diff += fabs (A[i,j] - temp);
            endfor
        endfor
        if (diff/(n*n) < TOL) then done = 1;
    endwhile
end Solve
Decomposition
Look for concurrency in loop iterations
In this case iterations are really dependent
Iteration (i, j) depends on iterations (i, j-1) and (i-1, j)
Each anti-diagonal can be computed in parallel
Must synchronize after each anti-diagonal (or pt-to-pt)
Alternative: red-black ordering (different update pattern)
Decomposition

while (!done)
    diff = 0.0;
    for_all i = 0 to n-1
        for_all j = 0 to n-1
            temp = A[i, j];
            A[i, j] = 0.2 * (A[i, j] + A[i, j+1] + A[i, j-1] + A[i-1, j] + A[i+1, j]);
            diff += fabs (A[i, j] - temp);
        end for_all
    end for_all
    if (diff/(n*n) < TOL) then done = 1;
end while

Offers concurrency across elements: degree of concurrency is n^2
Make the j loop sequential to have row-wise decomposition: degree n concurrency
Assignment
Possible static assignment: block row decomposition
Process 0 gets rows 0 to (n/p)-1, process 1 gets rows n/p to (2n/p)-1, etc.
Another static assignment: cyclic row decomposition
Process 0 gets rows 0, p, 2p, ...; process 1 gets rows 1, p+1, 2p+1, ...
Dynamic assignment
Grab the next available row, work on that, grab a new row, ...
Static block row assignment minimizes nearest-neighbor communication by assigning contiguous rows to the same process
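A hedged sketch of the dynamic "grab the next available row" scheme using a shared counter; the __atomic_fetch_add builtin (GCC/Clang) and all names are assumptions beyond the slide's abstract description:

int next_row = 0;   /* shared counter, one grid row per task */

void worker(int n) {
    for (;;) {
        int row = __atomic_fetch_add(&next_row, 1, __ATOMIC_RELAXED);
        if (row >= n) break;   /* no rows left */
        /* update_row(row): sweep one row of the grid (illustrative) */
    }
}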
Shared memory version
void Solve (void)
{
    int i, j, pid, done = 0;
    float temp, local_diff;
    GET_PID (pid);
    while (!done) {
        local_diff = 0.0;
        if (!pid) gm->diff = 0.0;
        BARRIER (gm->barrier, P); /* why? */
        for (i = pid*(n/P); i < (pid+1)*(n/P); i++) {
            for (j = 0; j < n; j++) {
                temp = gm->A[i][j];
                gm->A[i][j] = 0.2 * (gm->A[i][j] + gm->A[i][j-1] +
                    gm->A[i][j+1] + gm->A[i+1][j] + gm->A[i-1][j]);
                local_diff += fabs (gm->A[i][j] - temp);
            } /* end for */
        } /* end for */
        LOCK (gm->diff_lock);
        gm->diff += local_diff;
        UNLOCK (gm->diff_lock);
        BARRIER (gm->barrier, P);
        if (gm->diff/(n*n) < TOL) done = 1;
        BARRIER (gm->barrier, P); /* why? */
    } /* end while */
}
Mutual exclusion
Use LOCK/UNLOCK around critical sections
Updates to the shared variable diff must be sequential
Heavily contended locks may degrade performance
Try to minimize the use of critical sections: they are sequential anyway and will limit speedup
This is the reason for using a local_diff instead of accessing gm->diff every time
Also, minimize the size of the critical section because the longer you hold the lock, the longer the waiting time for other processors at lock acquire
LOCK optimization
Suppose each processor updates a shared variable holding a global cost value, only if its local cost is less than the global cost: found frequently in minimization problems

/* Naive version: may lead to heavy lock contention
   if everyone tries to update at the same time */
LOCK (gm->cost_lock);
if (my_cost < gm->cost) {
    gm->cost = my_cost;
}
UNLOCK (gm->cost_lock);

/* Optimized version: test before locking, then re-test */
if (my_cost < gm->cost) {
    LOCK (gm->cost_lock);
    if (my_cost < gm->cost) { /* make sure */
        gm->cost = my_cost;
    }
    UNLOCK (gm->cost_lock);
} /* this works because gm->cost is monotonically decreasing */
More synchronization
Global synchronization
Through barriers
Often used to separate computation phases
Point-to-point synchronization
A process directly notifies another about a certain event on which the latter was waiting
Producer-consumer communication pattern
Semaphores are used for concurrent programming on a uniprocessor through P and V functions
Normally implemented through flags on shared memory multiprocessors (busy wait or spin)

P0: A = 1; flag = 1;
P1: while (!flag); use (A);
Message passing
What is different from shared memory?
No shared variable: expose communication through send/receive
No lock or barrier primitive
Must implement synchronization through send/receive
Grid solver example
P0 allocates and initializes matrix A in its local memory
Then it sends the block rows, n, P to each processor, i.e., P1 waits to receive rows n/P to 2n/P-1 etc. (this is one-time)
Within the while loop the first thing that every processor does is to send its first and last rows to the upper and the lower processors (corner cases need to be handled)
Then each processor waits to receive the neighboring two rows from the upper and the lower processors
Message passing
At the end of the loop each processor sends its local_diff to P0 and P0 sends back the accumulated diff so that each processor can locally compute the done flag
Major changes

/* include files */
MAIN_ENV;
int P, n;
void Solve ();
struct gm_t {
    LOCKDEC (diff_lock);
    BARDEC (barrier);
    float **A, diff;
} *gm;

int main (int argc, char **argv)
{
    int i;
    MAIN_INITENV;
    gm = (struct gm_t*) G_MALLOC (sizeof (struct gm_t));
    LOCKINIT (gm->diff_lock);
    BARINIT (gm->barrier);
    n = atoi (argv[1]);
    P = atoi (argv[2]);
    /* [Slide annotation "Local Alloc." marks the following allocation] */
    gm->A = (float**) G_MALLOC ((n+2)*sizeof (float*));
    for (i = 0; i < n+2; i++) {
        gm->A[i] = (float*) G_MALLOC ((n+2)*sizeof (float));
    }
    Initialize (gm->A);
    for (i = 1; i < P; i++) { /* starts at 1 */
        CREATE (Solve);
    }
    Solve ();
    WAIT_FOR_END (P-1);
    MAIN_END;
}
Major changes
The slide repeats the shared memory Solve function shown earlier and marks where the message passing version changes:
if (pid) Recv rows, n, P (one-time, at entry)
Send up/down, Recv up/down (boundary rows, at the start of each iteration)
Send local_diff to P0 (after the local sweep)
Recv diff (to test convergence locally)
Message passing
This algorithm is deterministic
May converge to a different solution compared to the shared memory version if there are multiple solutions: why?
There is a fixed specific point in the program (at the beginning of each iteration) when the neighboring rows are communicated
This is not true for shared memory
Message Passing Grid Solver
MPI-like environment
MPI stands for Message Passing Interface
A C library that provides a set of message passing primitives (e.g., send, receive, broadcast etc.) to the user
PVM (Parallel Virtual Machine) is another well-known platform for message passing programming
Background in MPI is not necessary for understanding this lecture
Only need to know
When you start an MPI program every thread runs the same main function
We will assume that we pin one thread to one processor, just as we did in shared memory
Instead of using the exact MPI syntax we will use some macros that call the MPI functions
MAIN_ENV;
/* define message tags */
#define ROW 99
#define DIFF 98
#define DONE 97

int main (int argc, char **argv)
{
    int pid, P, done, i, j, N;
    float tempdiff, local_diff, temp, **A;
    MAIN_INITENV;
    GET_PID (pid);
    GET_NUMPROCS (P);
    N = atoi (argv[1]);
    tempdiff = 0.0;
    done = 0;
    A = (float **) malloc ((N/P+2) * sizeof (float *));
    for (i = 0; i < N/P+2; i++) {
        A[i] = (float *) malloc (sizeof (float) * (N+2));
    }
    initialize (A);
    while (!done) {
        local_diff = 0.0;
        /* MPI_CHAR means raw byte format */
        if (pid) { /* send my first row up */
            SEND (&A[1][1], N*sizeof (float), MPI_CHAR, pid-1, ROW);
        }
        if (pid != P-1) { /* recv last row */
            RECV (&A[N/P+1][1], N*sizeof (float), MPI_CHAR, pid+1, ROW);
        }
        if (pid != P-1) { /* send last row down */
            SEND (&A[N/P][1], N*sizeof (float), MPI_CHAR, pid+1, ROW);
        }
        if (pid) { /* recv first row from above */
            RECV (&A[0][1], N*sizeof (float), MPI_CHAR, pid-1, ROW);
        }
        for (i=1; i
Performance Issues

Agenda
Partitioning for performance
Data access and communication
Summary
Goal is to understand simple trade-offs involved in writing a parallel program keeping an eye on parallel performance
Getting good performance out of a multiprocessor is difficult
Programmers need to be careful
A little carelessness may lead to extremely poor performance
Partitioning for perf.
Partitioning plays an important role in the parallel performance
This is where you essentially determine the tasks
A good partitioning should practice
Load balance
Minimal communication
Low overhead to determine and manage task assignment (sometimes called extra work)
A well-balanced parallel program automatically has low barrier or point-to-point synchronization time
Ideally I want all the threads to arrive at a barrier at the same time
Load balancing
Achievable speedup is bounded above by
Sequential exec. time / Max. time for any processor
Thus speedup is maximized when the maximum time and minimum time across all processors are close (want to minimize the variance of parallel execution time)
This directly gets translated to load balancing
What leads to a high variance?
Ultimately all processors finish at the same time, but some do useful work all over this period while others may spend a significant time at synchronization points
This may arise from a bad partitioning
There may be other architectural reasons for load imbalance beyond the scope of a programmer, e.g., network congestion, unforeseen cache conflicts etc. (slows down a few threads)
Dynamic task queues
Introduced in the last lecture
Normally implemented as part of the parallel program
Two possible designs
Centralized task queue: a single queue of tasks; may lead to heavy contention because insertion and deletion to/from the queue must be critical sections
Distributed task queues: one queue per processor
Issue with distributed task queues
When the queue of a particular processor is empty what does it do? Task stealing
Task stealing
A processor may choose to steal tasks from another processor's queue if the former's queue is empty
How many tasks to steal? Whom to steal from?
The biggest question: how to detect termination? Really a distributed consensus!
Task stealing, in general, may increase overhead and communication, but a smart design may lead to excellent load balance (normally hard to design efficiently)
This is a form of a more general technique called Receiver Initiated Diffusion (RID) where the receiver of the task initiates the task transfer
In Sender Initiated Diffusion (SID) a processor may choose to insert into another processor's queue if the former's task queue is full above a threshold
Architect's job
Normally load balancing is a responsibility of the programmer
However, an architecture may provide efficient primitives to implement task queues and task stealing
For example, the task queue may be allocated in a special shared memory segment, accesses to which may be optimized by special hardware in the memory controller
But this may expose some of the architectural features to the programmer
There are multiprocessors that provide efficient implementations for certain synchronization primitives; this may improve load balance
Sophisticated hardware tricks are possible: dynamic load monitoring and favoring slow threads dynamically
Partitioning and communication
Need to reduce inherent communication
This is the part of communication determined by the assignment of tasks
There may be other communication traffic also (more later)
Goal is to assign tasks such that accessed data are mostly local to a process
Ideally I do not want any communication
But in life sometimes you need to talk to people to get some work done!
Domain decomposition
Normally applications show a local bias on data usage
Communication is short-range, e.g., nearest neighbor
Even if it is long-range it falls off with distance
View the dataset of an application as the domain of the problem, e.g., the 2-D grid in the equation solver
If you consider a point in this domain, in most of the applications it turns out that this point depends on points that are close by
Partitioning can exploit this property by assigning contiguous pieces of data to each process
Exact shape of the decomposed domain depends on the application and load balancing requirements
Comm-to-comp ratio
Surely, there could be many different domain decompositions for a particular problem
For the grid solver we may have a square block decomposition, block row decomposition or cyclic row decomposition
How to determine which one is good? Communication-to-computation ratio
Assume P processors and an N x N grid for the grid solver
[Diagram: square block decomposition for P = 16, the grid tiled P0 P1 P2 P3 / P4 P5 P6 P7 / ... / P15]
Size of each block: N/√P by N/√P
Communication (perimeter): 4N/√P
Computation (area): N^2/P
Comm-to-comp ratio = 4√P/N
Comm-to-comp ratio
For block row decomposition
Each strip has N/P rows
Communication (boundary rows): 2N
Computation (area): N^2/P (same as square block)
Comm-to-comp ratio: 2P/N
For cyclic row decomposition
Each processor gets N/P isolated rows
Communication: 2N^2/P
Computation: N^2/P
Comm-to-comp ratio: 2
Normally N is much larger than P
Asymptotically, square block yields the lowest comm-to-comp ratio
Comm-to-comp ratio
The idea is to measure the volume of inherent communication per computation
In most cases it is beneficial to pick the decomposition with the lowest comm-to-comp ratio
But it depends on the application structure, i.e., picking the lowest comm-to-comp ratio may have other problems
Normally this ratio gives you a rough estimate of the average communication bandwidth requirement of the application, i.e., how frequent communication is
But it does not tell you the nature of the communication, i.e., bursty or uniform
For the grid solver, comm. happens only at the start of each iteration; it is not uniformly distributed over computation
Thus the worst case BW requirement may exceed the average comm-to-comp ratio
Extra work
Extra work in a parallel version of a sequential program may result from
Decomposition
Assignment techniques
Management of the task pool etc.
Speedup is bounded above by
Sequential work / Max (Useful work + Synchronization + Comm. cost + Extra work)
where the Max is taken over all processors
But this is still incomplete
We have only considered communication cost from the viewpoint of the algorithm and ignored the architecture completely
Data access and communication
The memory hierarchy (caches and main memory) plays a significant role in determining communication cost
May easily dominate the inherent communication of the algorithm
For a uniprocessor, the execution time of a program is given by useful work time + data access time
Useful work time is normally called the busy time or busy cycles
Data access time can be reduced either by architectural techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality
Data access
In multiprocessors
Every processor wants to see the memory interface as its own local cache and the main memory
In reality it is much more complicated
If the system has a centralized memory (e.g., SMPs), there are still caches of other processors; if the memory is distributed then some part of it is local and some is remote
For shared memory, data movement from local or remote memory to cache is transparent, while for message passing it is explicit
View a multiprocessor as an extended memory hierarchy where the extension includes caches of other processors, remote memory modules and the network topology
Artifactual comm.
Communication caused by artifacts of the extended memory hierarchy
Data accesses not satisfied in the cache or local memory cause communication
Inherent communication is caused by data transfers determined by the program
Artifactual communication is caused by poor allocation of data across distributed memories, unnecessary data in a transfer, unnecessary transfers due to system-dependent transfer granularity, redundant communication of data, finite replication capacity (in cache or memory)
Inherent communication assumes infinite capacity and perfect knowledge of what should be transferred
Spatial locality
Consider a square block decomposition of the grid solver and a C-like row major layout, i.e., A[i][j] and A[i][j+1] have contiguous memory locations
[Diagram: memory allocation in which a page and a cache line straddle a partition boundary.]
The same page is local to a processor while remote to others; the same applies to straddling cache lines. Ideally, I want to have all pages within a partition local to a single processor. The standard trick is to convert the 2D array to 4D.
2D to 4D conversion
Essentially you need to change the way memory is allocated
The matrix A needs to be allocated in such a way that the elements falling within a partition are contiguous
The first two dimensions of the new 4D matrix are the block row and column indices, i.e., for the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16 processors)
The next two dimensions hold the data elements within that partition
Thus the 4D array may be declared as float B[√P][√P][N/√P][N/√P]
The element B[3][2][5][10] corresponds to the element in the 10th column, 5th row of the partition of P14
Now all elements within a partition have contiguous addresses
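A minimal sketch of this declaration and index mapping for 16 processors; N, RP and the helper function are illustrative assumptions:

#define N  1024           /* grid dimension (illustrative) */
#define RP 4              /* sqrt(P) for P = 16 processors */

float B[RP][RP][N/RP][N/RP];   /* one contiguous block per partition */

/* Global element (i, j) of the N x N grid lives in the partition of
   processor (i / (N/RP)) * RP + (j / (N/RP)). */
float get (int i, int j) {
    return B[i / (N/RP)][j / (N/RP)][i % (N/RP)][j % (N/RP)];
}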
Transfer granularity
How much data do you transfer in one communication?
For message passing it is explicit in the program
For shared memory this is really under the control of the cache coherence protocol: there is a fixed size for which transactions are defined (normally the block size of the outermost level of the cache hierarchy)
In shared memory you have to be careful
Since the minimum transfer size is a cache line you may end up transferring extra data, e.g., in the grid solver the elements of the left and right neighbors for a square block decomposition (you need only one element, but must transfer the whole cache line): no good solution
Worse: false sharing
If the algorithm is designed so poorly that
Two processors write to two different words within a cache line at the same time
The cache line keeps on moving between the two processors
The processors are not really accessing or updating the same element, but whatever they are updating happens to fall within a cache line: not true sharing, but false sharing
For shared memory programs false sharing can easily degrade performance by a lot
Easy to avoid: just pad up to the end of the cache line before starting the allocation of the data for the next processor (wastes memory, but improves performance)
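A sketch of the padding fix just described, assuming a 64-byte cache line (the line size and all names are illustrative):

#define LINE_SIZE 64
#define NTHREADS  8

/* Without the pad, several counters would share one cache line and
   ping-pong between processors; with it, each owns its own line. */
struct per_thread {
    long counter;
    char pad[LINE_SIZE - sizeof (long)];
};

struct per_thread counters[NTHREADS];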
Hot-spots
Avoid a location hot-spot by either staggering accesses to the same location or by designing the algorithm to exploit a tree-structured communication
Module hot-spot
Normally happens when a particular node saturates handling too many messages (need not be to the same memory location) within a short amount of time
The normal solution again is to design the algorithm in such a way that these messages are staggered over time
Rule of thumb: design the communication pattern such that it is not bursty; want to distribute it uniformly over time
Overlap
Increase overlap between communication and computation
Not much to do at the algorithm level unless the programming model and/or OS provide some primitives to carry out prefetching, block data transfer, non-blocking receive etc.
Normally, these techniques increase bandwidth demand because you end up communicating the same amount of data, but in a shorter amount of time (execution time hopefully goes down if you can exploit overlap)
Summary
Parallel programs introduce three overhead terms: busy overhead (extra work), remote data access time, and synchronization time
Goal of a good parallel program is to minimize these three terms
Goal of a good parallel computer architecture is to provide sufficient support to let programmers optimize these three terms (and this is the focus of the rest of the course)
Four organizations
Shared cache
The switch is a simple controller for granting access to cache banks
Interconnect is between the processors and the shared cache
Which level of the cache hierarchy is shared depends on the design: chip multiprocessors today normally share the outermost level (L2 or L3 cache)
The cache and memory are interleaved to improve bandwidth by allowing multiple concurrent accesses
Normally small scale due to heavy bandwidth demand on the switch and shared cache
[Diagram: P0..Pn connect through a SWITCH to an INTERLEAVED SHARED CACHE backed by INTERLEAVED MEMORY.]
Four organizations
Bus-based SMP
Interconnect is a shared bus located between the private cache hierarchies and the memory controller
Scalability is limited by the shared bus bandwidth
The most popular organization for small to medium-scale servers
Possible to connect 30 or so processors with smart bus design
Bus bandwidth requirement is lower compared to the shared cache approach
Why?
[Diagram: P0..Pn, each with a private CACHE, attached to a shared BUS along with MEM.]
Four organizations
Dancehall
Better scalability compared to the previous two designs
The difference from the bus-based SMP is that the interconnect is a scalable point-to-point network (e.g., crossbar or other topology)
Memory is still symmetric from all processors
Drawback: a cache miss may take a long time since all memory banks are too far off from the processors (may be several network hops)
[Diagram: P0..Pn with private CACHEs on one side of an INTERCONNECT, MEM banks on the other.]
Four organizations
Distributed shared memory
The most popular scalable organization
Each node now has local memory banks
Shared memory on other nodes must be accessed over the network: remote memory access
Non-uniform memory access (NUMA): latency to access local memory is much smaller compared to remote memory
Caching is very important to reduce remote memory accesses
[Figure: each node pairs a processor and its cache with local memory banks; the nodes are connected by a scalable interconnect]
Four organizations
In all four organizations caches play an important role in reducing latency and bandwidth requirements
If an access is satisfied in the cache, the transaction does not appear on the interconnect, so the bandwidth requirement on the interconnect is lower (a shared L1 cache does not have this advantage)
In distributed shared memory (DSM) the cache and the local memory should be used cleverly
Bus-based SMP and DSM are the two designs supported today by industry vendors
In bus-based SMP every cache miss is launched on the shared bus so that all processors can see all transactions
In DSM this is not the case
Hierarchical design
Possible to combine bus-based SMP and DSM to build hierarchical shared memory
Sun Wildfire connects four large SMPs (28 processors each) over a scalable interconnect to form a 112-processor multiprocessor
IBM POWER4 has two processors on chip with private L1 caches, but shared L2 and L3 caches (this is called a chip multiprocessor); connecting these chips over a network yields scalable multiprocessors
The next few lectures will focus on bus-based SMPs only
Cache Coherence
Intuitive memory model
For sequential programs we expect a memory location to return the latest value written to that location
For concurrent programs running as multiple threads or processes on a single processor we expect the same model to hold, because all threads see the same cache hierarchy (same as a shared L1 cache)
For multiprocessors there remains a danger of using a stale value: in SMP or DSM the caches are not shared and processors are allowed to replicate data independently in each cache; hardware must ensure that cached values are coherent across the system and satisfy the programmer's intuitive memory model
Example
Assume a write-through cache, i.e., every store updates the value in the cache as well as in memory
P0: reads x from memory, puts it in its cache, and gets the value 5
P1: reads x from memory, puts it in its cache, and gets the value 5
P1: writes x=7, updates its cached value and the memory value
P0: reads x from its cache and gets the (stale) value 5
P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is completely incoherent)
P2: writes x=10, updates its cached value and the memory value
What went wrong?
For the write-through cache
The memory value may be correct if the writes are correctly ordered
But the system allowed a store to proceed while there was already a cached copy elsewhere
Lesson learned: must invalidate all cached copies before allowing a store to proceed
Writeback cache
The problem is even more complicated: stores are no longer immediately visible to memory
Writeback order is important
Lesson learned: do not allow more than one copy of a cache line in M state
Definitions
Memory operation: a read (load), a write (store), or a read-modify-write
Assumed to take place atomically
A memory operation is said to issue when it leaves the issue queue and looks up the cache
A memory operation is said to perform with respect to a processor when its effect is fixed relative to that processor's other issued memory operations:
A read is said to perform with respect to a processor when subsequent writes issued by that processor cannot affect the returned read value
A write is said to perform with respect to a processor when a subsequent read from that processor to the same address returns the new value
Ordering memory operations
A memory operation is said to complete when it has performed with respect to all processors in the system
Assume that there is a single shared memory and no caches
Memory operations complete in shared memory when they access the corresponding memory locations
Operations from the same processor complete in program order: this imposes a partial order among the memory operations
Operations from different processors are interleaved in such a way that the program order is maintained for each processor: memory imposes some total order (many are possible)
Cache coherence
Formal definition
A memory system is coherent if the values returned by reads to a memory location during an execution of a program are such that all operations to that location can form a hypothetical total order that is consistent with the serial order and has the following two properties:
1. Operations issued by any particular processor perform according to the issue order
2. The value returned by a read is the value written to that location by the last write in the total order
Two necessary features that follow from the above:
A. Write propagation: writes must eventually become visible to all processors
B. Write serialization: every processor should see the writes to a location in the same order (if I see w1 before w2, you should not see w2 before w1)
Bus-based SMP
Extend the philosophy of uniprocessor bus transactions
Three phases: arbitrate for the bus, launch command (often called request) and address, transfer data
Every device connected to the bus can observe the transaction
The appropriate device responds to the request
In SMP, processors also observe the transactions and may take appropriate actions to guarantee coherence
The other device on the bus that will be of interest to us is the memory controller (the north bridge on standard motherboards)
Depending on the bus transaction, a cache block executes a finite state machine implementing the coherence protocol
Snoopy protocols
Cache coherence protocols implemented in bus-based machines are called snoopy protocols
The processors snoop or monitor the bus and take appropriate protocol actions based on the snoop results
The cache controller now receives requests both from the processor and from the bus
Since cache state is maintained on a per-line basis, that also dictates the coherence granularity
Cannot normally take a coherence action on parts of a cache line
The coherence protocol is implemented as a finite state machine on a per-cache-line basis
The snoop logic in each processor grabs the address from the bus and decides if any action should be taken on the cache line containing that address (only if the line is in the cache)
State transition
The finite state machine for each cache line:
On a write miss no line is allocated
The state remains at I: this is called write-through write-no-allocate
A/B means: A is generated by the processor, B is the resulting bus transaction (if any)
Changes for write-through write-allocate?
[State diagram: two states, I and V]
I --PrRd/BusRd--> V; I --PrWr/BusWr--> I (no allocation on a write miss)
V --PrRd/- --> V; V --PrWr/BusWr--> V (every write goes to the bus)
V --BusWr (snoop)--> I (another processor's write invalidates the copy)
Write-through is bad
High bandwidth requirement
Every write appears on the bus
Assume a 3 GHz processor running an application with 10% store instructions at a CPI of 1
If the application runs for 100 cycles it generates 10 stores; assuming each store is 4 bytes, 40 bytes are generated per 100/3 ns, i.e., a bandwidth of 1.2 GB/s (see the check below)
A 1 GB/s bus cannot even support one processor
There are multiple processors, and there are also read misses
Writeback caches absorb most of the write traffic
Writes that hit in the cache do not go on the bus (not visible to others)
The cost is a complicated coherence protocol with many choices
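A quick back-of-envelope check of the bandwidth estimate above (the program just restates the arithmetic):

    #include <stdio.h>

    int main(void) {
        double insns_per_sec = 3e9;  /* 3 GHz at CPI = 1 */
        double store_frac    = 0.10; /* 10% of instructions are stores */
        double bytes_per_st  = 4.0;  /* 4-byte stores */
        double bw = insns_per_sec * store_frac * bytes_per_st;  /* bytes/s */
        printf("write-through store bandwidth = %.1f GB/s\n", bw / 1e9);  /* 1.2 */
        return 0;
    }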
Memory consistency
Need a more formal description of memory ordering
How to establish the order between reads and writes from different processors?
The clearest way is to use synchronization:
P0: A=1; flag=1
P1: while (!flag); print A;
Another example (assume A=0, B=0 initially):
P0: A=1; print B;
P1: B=1; print A;
What do you expect? (A C11 rendering of the first example follows below)
The memory consistency model is a contract between the programmer and the hardware regarding memory ordering
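A hedged C11 rendering of the flag example (function names are invented; with the default sequentially consistent atomics, P1 is guaranteed to print 1 once it leaves its loop):

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int A = 0, flag = 0;

    void p0(void) {            /* P0: A=1; flag=1 */
        atomic_store(&A, 1);
        atomic_store(&flag, 1);
    }

    void p1(void) {            /* P1: while (!flag); print A */
        while (!atomic_load(&flag))
            ;                                 /* spin until the flag is visible */
        printf("A = %d\n", atomic_load(&A));  /* must print 1 */
    }
    /* p0 and p1 would run on two different threads/processors. */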
Consistency model
A multiprocessor normally advertises the supported memory consistency model
This essentially tells the programmer what the possible correct outcomes of a program could be when run on that machine
Cache coherence deals with memory operations to the same location, but not with different locations
Without a formally defined order across all memory operations it often becomes impossible to argue about what is correct and what is wrong in shared memory
There are various memory consistency models
Sequential consistency (SC) is the most intuitive one and we will focus on it now (more consistency models later)
OOO and SC
Consider a simple example (all variables are zero initially)
P0: x=w+1; r=y+1;
P1: y=2; w=y+1;
Suppose the load that reads w misses and so w is not ready for a long time; therefore x=w+1 cannot complete immediately; eventually w returns with the value 3
Inside the microprocessor r=y+1 completes (but does not commit) before x=w+1 and gets the old value of y (possibly from cache); eventually the instructions commit in order with x=4, r=1, y=2, w=3
So we have the following partial orders
P0: x=w+1 < r=y+1 and P1: y=2 < w=y+1
Cross-thread: w=y+1 < x=w+1 and r=y+1 < y=2
Combining these yields a cyclic, hence contradictory, total order
What went wrong? We will discuss it in detail later
SC example
Consider the following example
P0: A=1; print B;
P1: B=1; print A;
Possible outcomes for an SC machine:
(A, B) = (0,1); interleaving: B=1; print A; A=1; print B
(A, B) = (1,0); interleaving: A=1; print B; B=1; print A
(A, B) = (1,1); interleavings: A=1; B=1; print A; print B or A=1; B=1; print B; print A (among others)
(A, B) = (0,0) is impossible: the read of A must occur before the write of A and the read of B must occur before the write of B, i.e., print A < A=1 and print B < B=1; but program order gives A=1 < print B and B=1 < print A; thus print B < B=1 < print A < A=1 < print B, which implies print B < print B, a contradiction
(A small program that enumerates these interleavings follows below)
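To make the case analysis mechanical, here is a small hypothetical checker that enumerates every interleaving consistent with both program orders and records the possible printed (A, B) pairs; it reports (0,1), (1,0), and (1,1), but never (0,0):

    #include <stdio.h>

    int main(void) {
        /* Ops: 0 = A=1, 1 = print B (P0); 2 = B=1, 3 = print A (P1). */
        int seen[2][2] = {{0, 0}, {0, 0}};
        /* The 6 interleavings preserving program order correspond to the
           choices of the two slots (i < j) that P0's operations occupy. */
        for (int i = 0; i < 4; i++) {
            for (int j = i + 1; j < 4; j++) {
                int order[4], n0 = 0, n1 = 0;
                for (int k = 0; k < 4; k++)
                    order[k] = (k == i || k == j) ? n0++ : 2 + n1++;
                int A = 0, B = 0, pA = 0, pB = 0;
                for (int k = 0; k < 4; k++) {
                    if (order[k] == 0) A = 1;        /* A=1      */
                    else if (order[k] == 1) pB = B;  /* print B  */
                    else if (order[k] == 2) B = 1;   /* B=1      */
                    else pA = A;                     /* print A  */
                }
                seen[pA][pB] = 1;
            }
        }
        for (int a = 0; a < 2; a++)
            for (int b = 0; b < 2; b++)
                if (seen[a][b])
                    printf("(A, B) = (%d, %d) is possible under SC\n", a, b);
        return 0;
    }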
Implementing SC
Two basic requirements
Memory operations issued by a processor must become visible to others in program order
Need to make sure that all processors see the same total order of memory operations: in the previous example, for the (0,1) case both P0 and P1 should see the same interleaving: B=1; print A; A=1; print B
The tricky part is to make sure that writes become visible in the same order to all processors
Write atomicity: as if each write is an atomic operation
Otherwise, two processors may end up using different values (which may still be correct from the viewpoint of cache coherence, but will violate SC)
Write atomicity
Example (A=0, B=0 initially)
P0: A=1;
P1: while (!A); B=1;
P2: while (!B); print A;
A correct execution on an SC machine should print A=1
A=0 will be printed only if the write to A is not visible to P2, but it is clearly visible to P1 since P1 came out of its loop
Thus A=0 requires that P1 sees the order A=1 < B=1 while P2 sees the order B=1 < A=1, i.e., from the viewpoint of the whole system the write A=1 was not atomic
Without write atomicity P2 may proceed to print 0 with a stale value from its cache (a C11 version follows below)
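A hedged C11 rendering with real threads (this uses C11's optional <threads.h>; names are invented). Sequentially consistent atomics make the write to A atomic in the above sense, so the program must print 1:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int A = 0, B = 0;

    int p0(void *arg) { (void)arg; atomic_store(&A, 1); return 0; }

    int p1(void *arg) {
        (void)arg;
        while (!atomic_load(&A)) ;   /* wait until A=1 is visible here */
        atomic_store(&B, 1);
        return 0;
    }

    int p2(void *arg) {
        (void)arg;
        while (!atomic_load(&B)) ;   /* wait until B=1 is visible here */
        printf("A = %d\n", atomic_load(&A));   /* must print 1 */
        return 0;
    }

    int main(void) {
        thrd_t t0, t1, t2;
        thrd_create(&t0, p0, NULL);
        thrd_create(&t1, p1, NULL);
        thrd_create(&t2, p2, NULL);
        thrd_join(t0, NULL); thrd_join(t1, NULL); thrd_join(t2, NULL);
        return 0;
    }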
Summary of SC
Program order from each processor creates a partial order among memory operations
Interleaving these partial orders defines a total order
Sequential consistency: one of many possible total orders
A multiprocessor is said to be SC if every execution on this machine is SC compliant
Sufficient (but not necessary) conditions for SC:
Issue memory operations in program order
Every processor waits for its write to complete before issuing the next operation
Every processor waits for a read to complete, and for the write that produced the returned value to complete, before issuing the next operation (important for write atomicity)
Back to shared bus
A centralized shared bus makes it easy to support SC
Writes and reads are all serialized in a total order through the bus transaction ordering
If a read gets the value of a previous write, that write is guaranteed to be complete because its bus transaction is complete
The write order seen by all processors is the same in a write-through system because every write causes a transaction and hence is visible to all in the same order
In a nutshell, every processor sees the same total bus order for all memory operations, and therefore any bus-based SMP with write-through caches is SC
What about a multiprocessor with writeback caches? (No SMP uses a write-through protocol due to the high bandwidth demand)
Stores
Look at stores a little more closely
There are three situations at the time a store issues: the line is not in the cache, the line is in the cache in S state, or the line is in the cache in one of the M, E, and O states
If the line is in I state, the store generates a read-exclusive request on the bus and gets the line in M state
If the line is in S or O state, the processor only has read permission for that line; the store generates an upgrade request on the bus and the upgrade acknowledgment grants it write permission (this is a data-less transaction)
If the line is in M or E state, no bus transaction is generated; the cache already has write permission for the line (this is the case of a write hit; the previous two are write misses)
(A sketch of this decision follows below)
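A minimal sketch of this three-way decision as code (state and transaction names follow the slides; the enum and function names are invented):

    typedef enum { ST_I, ST_S, ST_E, ST_O, ST_M } line_state_t;
    typedef enum { TXN_NONE, TXN_BUSRDX, TXN_BUSUPGR } bus_txn_t;

    /* What a store issue puts on the bus, given the current line state. */
    bus_txn_t on_store(line_state_t st) {
        switch (st) {
        case ST_I:              return TXN_BUSRDX;   /* write miss: read-exclusive */
        case ST_S: case ST_O:   return TXN_BUSUPGR;  /* have data, need permission */
        case ST_E: case ST_M:   return TXN_NONE;     /* write hit: no transaction */
        }
        return TXN_NONE;
    }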
Invalidation vs. update
Two main classes of protocols: invalidation-based and update-based
This dictates what action should be taken on a write
Invalidation-based protocols invalidate sharers when a write miss (upgrade or readX) appears on the bus
Update-based protocols update the sharer caches with the new value on a write: this requires write transactions (carrying just the modified bytes) on the bus even on write hits (not very attractive with writeback caches)
Advantage of update-based protocols: sharers continue to hit in the cache, while in invalidation-based protocols sharers will miss the next time they try to access the line
Advantage of invalidation-based protocols: only write misses go on the bus (suited to writeback caches) and subsequent stores to the same line are cache hits
Which one is better?
Difficult to answer
Depends on program behavior and hardware cost
When is an update-based protocol good? For what sharing pattern? (large-scale producer/consumer)
Otherwise it would just waste bus bandwidth doing useless updates
When is an invalidation-based protocol good? For a sequence of multiple writes to a cache line: it saves the intermediate write transactions
Also think about the overhead of initiating small updates for every write in update protocols
Invalidation-based protocols are much more popular
Some systems support both, or a hybrid based on the dynamic sharing pattern of a cache line
MSI protocol
Forms the foundation of invalidation-based writeback protocols
Assumes only three supported cache line states: I, S, and M
There may be multiple processors caching a line in S state
There must be exactly one processor caching a line in M state, and it is the owner of the line
If none of the caches has the line, memory must have the most up-to-date copy of the line
Processor requests to the cache: PrRd, PrWr
Bus transactions: BusRd, BusRdX, BusUpgr, BusWB
State transition
[State diagram: three states, I, S, M]
I --PrRd/BusRd--> S; I --PrWr/BusRdX--> M
S --PrRd/- --> S; S --BusRd/- --> S; S --{BusRdX, BusUpgr}/- --> I; S --CacheEvict/- --> I; S --PrWr/BusUpgr--> M
M --PrRd/- --> M; M --PrWr/- --> M; M --BusRd/Flush--> S; M --BusRdX/Flush--> I; M --CacheEvict/BusWB--> I
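The same transitions can be written down as a per-line controller; a hedged sketch (names are invented, and the data movement on Flush/BusWB is elided):

    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
    typedef enum { ACT_NONE, ACT_BUSRD, ACT_BUSRDX, ACT_BUSUPGR,
                   ACT_BUSWB, ACT_FLUSH } action_t;

    /* Processor side: PrRd (is_write = 0) or PrWr (is_write = 1). */
    action_t on_proc(msi_state_t *st, int is_write) {
        switch (*st) {
        case MSI_I:
            *st = is_write ? MSI_M : MSI_S;
            return is_write ? ACT_BUSRDX : ACT_BUSRD;
        case MSI_S:
            if (is_write) { *st = MSI_M; return ACT_BUSUPGR; }
            return ACT_NONE;                  /* read hit */
        case MSI_M:
            return ACT_NONE;                  /* read or write hit */
        }
        return ACT_NONE;
    }

    /* Snoop side: another processor's transaction observed on the bus. */
    action_t on_snoop(msi_state_t *st, action_t bus) {
        if (*st == MSI_M && bus == ACT_BUSRD)  { *st = MSI_S; return ACT_FLUSH; }
        if (*st == MSI_M && bus == ACT_BUSRDX) { *st = MSI_I; return ACT_FLUSH; }
        if (*st == MSI_S && (bus == ACT_BUSRDX || bus == ACT_BUSUPGR))
            *st = MSI_I;
        return ACT_NONE;
    }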
MSI protocol
A few things to note
The Flush operation essentially launches the line on the bus
The processor with the cache line in M state is responsible for flushing the line on the bus whenever there is a BusRd or BusRdX transaction generated by some other processor
On BusRd the line transitions from M to S, but not from M to I. Why? Also, at this point both the requester and memory pick up the line from the bus; the requester puts the line in its cache in S state while memory writes the line back. Why does memory need to write back?
On BusRdX the line transitions from M to I and this time memory does not need to pick up the line from the bus. Only the requester picks up the line and puts it in M state in its cache. Why?
MSI example
Take the following example: P0 reads x, P1 reads x, P1 writes x, P0 reads x, P2 reads x, P3 writes x
Assume the state of the cache line containing the address of x is I in all processors
P0 generates BusRd, memory provides the line, P0 puts the line in S state
P1 generates BusRd, memory provides the line, P1 puts the line in S state
P1 generates BusUpgr, P0 snoops and invalidates its line, memory does not respond, P1 sets the state of its line to M
P0 generates BusRd, P1 flushes the line and goes to S state, P0 puts the line in S state, memory writes the line back
P2 generates BusRd, memory provides the line, P2 puts the line in S state
P3 generates BusRdX, P0, P1, and P2 snoop and invalidate, memory provides the line, P3 puts the line in its cache in M state
MESI protocol
The most popular invalidation-based protocol; appears, e.g., in Intel Xeon MP
Why is the E state needed?
The MSI protocol requires two transactions to go from I to M even if there are no intervening requests for the line: a BusRd followed by a BusUpgr
We can save one transaction by having the memory controller respond to the first BusRd with E state if there is no other sharer in the system
How to know if there is no other sharer? This needs a dedicated control wire that gets asserted by any sharer (wired OR); see the sketch below
A processor can write to a line in E state silently and take it to M state
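A tiny sketch of the read-miss decision (the wired-OR shared signal is modeled as a flag; all names are invented):

    typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state_t;

    /* On a BusRd from I: every current sharer asserts the shared wire during
       the snoop phase; the requester samples it to choose E or S. */
    mesi_state_t on_read_miss(int shared_wire_asserted) {
        return shared_wire_asserted ? MESI_S : MESI_E;
    }

    /* A later store to a line in E needs no bus transaction (silent E -> M),
       saving the BusUpgr that MSI would have required. */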
State transition
[State diagram: four states, I, S, E, M; (S)/(!S) denotes the shared signal on BusRd]
I --PrRd/BusRd(S)--> S; I --PrRd/BusRd(!S)--> E; I --PrWr/BusRdX--> M
E --PrRd/- --> E; E --PrWr/- --> M (silent); E --BusRd/Flush--> S; E --BusRdX/Flush--> I; E --CacheEvict/- --> I
S --PrRd/- --> S; S --PrWr/BusUpgr--> M; S --BusRd/Flush--> S; S --{BusRdX, BusUpgr}/Flush--> I; S --CacheEvict/- --> I
M --PrRd/- --> M; M --PrWr/- --> M; M --BusRd/Flush--> S; M --BusRdX/Flush--> I; M --CacheEvict/BusWB--> I
MESI protocol
If a cache line is in M state, the processor holding the line is definitely responsible for flushing it on the next BusRd or BusRdX transaction
If a line is not in M state, who is responsible? Memory, or another cache holding the line in S or E state?
The original Illinois MESI protocol assumed cache-to-cache transfer, i.e., any processor in E or S state is responsible for the flush