MCU architecture


Transcript of MCU architecture

  • 7/30/2019 MCU architecture

    1/232

    Program Optimization for Multi-core: Hardware side of it

  • 7/30/2019 MCU architecture

    2/232

    June-July 2009 2

    Contents: Virtual Memory and Caches (Recap)

    Fundamentals of Parallel Computers: ILP vs. TLP

    Parallel Programming: Shared Memory and Message Passing

    Performance Issues in Shared Memory

    Shared Memory Multiprocessors: Consistency and Coherence

    Synchronization

    Memory consistency models

    Case Studies of CMP

  • 7/30/2019 MCU architecture

    3/232

    RECAP: VIRTUAL MEMORY AND CACHE

  • 7/30/2019 MCU architecture

    4/232

    June-July 2009 4

    Why virtual memory? With a 32-bit address you can access 4 GB of
    physical memory (you will never get the full memory though). Seems enough for most day-to-day applications
    But there are important applications that have much
    bigger memory footprint: databases, scientific apps operating on large matrices etc.
    Even if your application fits entirely in physical memory it seems unfair to load the full image at startup
    Just takes away memory from other processes, but probably doesn't need the full image at any point of time during execution: hurts multiprogramming
    Need to provide an illusion of bigger memory: Virtual Memory (VM)

  • 7/30/2019 MCU architecture

    5/232

    June-July 2009 5

    Virtual memory: Need an address to access virtual memory
    Virtual Address (VA)
    Assume a 32-bit VA: every process sees 4 GB of virtual memory
    This is much better than a 4 GB physical memory shared between multiprogrammed processes
    The size of VA is really fixed by the processor data path width
    64-bit processors (Alpha 21264, 21364; Sun UltraSPARC; AMD Athlon64, Opteron; IBM POWER4, POWER5; MIPS R10000 onwards; Intel Itanium etc., and recently Intel Pentium 4) provide bigger virtual memory to each process
    Large virtual and physical memory is very important in the commercial server market: need to run large databases

  • 7/30/2019 MCU architecture

    6/232

    June-July 2009 6

    Addressing VM: There are primarily three ways to address VM

    Paging, Segmentation, Segmented paging

    We will focus on flat paging only

    Paged VM

    The entire VM is divided into small units called pages Virtual pages are loaded into physical page frames as

    and when needed (demand paging)

    Thus the physical memory is also divided into equal-sized page frames
    The processor generates virtual addresses

    But memory is physically addressed: need a VA to PA

    translation

  • 7/30/2019 MCU architecture

    7/232

    June-July 2009 7

    VA to PA translation: The VA generated by the processor is divided into

    two parts: Page offset and Virtual page number (VPN)

    Assume a 4 KB page: within a 32-bit VA, lower 12 bits

    will be page offset (offset within a page) and the

    remaining 20 bits are VPN (hence 1 M virtual pages total)

    The page offset remains unchanged in the translation

    Need to translate VPN to a physical page frame number

    (PPFN)

    This translation is held in a page table resident in

    memory: so first we need to access this page table

    How to get the address of the page table?

  • 7/30/2019 MCU architecture

    8/232

    June-July 2009 8

    VA to PA translation: Accessing the page table
    The Page table base register (PTBR) contains the starting physical address of the page table
    PTBR is normally accessible in the kernel mode only
    Assume each entry in the page table is 32 bits (4 bytes)
    Thus the required page table entry address is
    PTBR + (VPN * 4)
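
    As a concrete illustration of the two slides above, here is a minimal C sketch (not from the slides) of the VPN/offset split and the PTE address computation, assuming 4 KB pages and 4-byte page table entries as stated; the function name pte_address is made up.

    #include <stdint.h>

    #define PAGE_SHIFT 12u                /* 4 KB page => 12-bit page offset */
    #define PTE_SIZE   4u                 /* each page table entry is 4 bytes */

    uint32_t pte_address(uint32_t ptbr, uint32_t va)
    {
        uint32_t vpn    = va >> PAGE_SHIFT;                 /* upper 20 bits of the VA */
        uint32_t offset = va & ((1u << PAGE_SHIFT) - 1u);   /* lower 12 bits, unchanged by translation */
        (void) offset;                                      /* the offset is appended to the PPFN later */
        return ptbr + vpn * PTE_SIZE;                       /* physical address of the PTE */
    }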

  • 7/30/2019 MCU architecture

    9/232

    June-July 2009 9

    Page fault: The valid bit within the 32 bits tells you if the

    translation is valid If this bit is reset that means the page is not

    resident in memory: results in a page fault

    In case of a page fault the kernel needs to bring in the page to memory from disk
    The disk address is normally provided by the page table entry (different interpretation of the 31 bits)
    Also the kernel needs to allocate a new physical page frame for this virtual page
    If all frames are occupied it invokes a page replacement policy

  • 7/30/2019 MCU architecture

    10/232

  • 7/30/2019 MCU architecture

    11/232

    June-July 2009 11

    TLB: Why can't we cache the most recently used
    translations? Translation Look-aside Buffers (TLB)

    Small set of registers (normally fully associative)

    Each entry has two parts: the tag which is simply VPN

    and the corresponding PTE

    The tag may also contain a process id

    On a TLB hit you just get the translation in one cycle

    (may take slightly longer depending on the design)

    On a TLB miss you may need to access memory to load

    the PTE in TLB (more later)

    Normally there are two TLBs: instruction and data
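
    A toy C sketch (not from the slides) of what a fully associative TLB lookup does logically; real hardware compares all tags in parallel, and the entry format and size here are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    struct tlb_entry { bool valid; uint32_t vpn; uint32_t pte; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    bool tlb_lookup(uint32_t vpn, uint32_t *pte_out)
    {
        for (int i = 0; i < TLB_ENTRIES; i++) {     /* hardware does all comparisons at once */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *pte_out = tlb[i].pte;              /* TLB hit: translation available quickly */
                return true;
            }
        }
        return false;                               /* TLB miss: load the PTE from memory */
    }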

  • 7/30/2019 MCU architecture

    12/232

    June-July 2009 12

    Caches Once you have completed the VA to PA translation

    you have the physical address. What's next?

    You need to access memory with that PA

    Instruction and data caches hold most recently

    used (temporally close) and nearby (spatially close) data

    Use the PA to access the cache first

    Caches are organized as arrays of cache lines Each cache line holds several contiguous bytes

    (32, 64 or 128 bytes)

  • 7/30/2019 MCU architecture

    13/232

    June-July 2009 13

    Addressing a cache: The PA is divided into several parts

    The block offset determines the starting byte

    address within a cache line

    The index tells you which cache line to access

    In that cache line you compare the tag to determine

    hit/miss

    TAG INDEX BLK. OFFSET
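
    A small sketch (illustrative, not from the slides) of the PA split described above, assuming 64-byte lines and 256 lines; the helper names are made up.

    #include <stdint.h>

    #define BLK_OFFSET_BITS 6u    /* 64-byte cache line */
    #define INDEX_BITS      8u    /* 256 lines (direct-mapped) */

    static inline uint32_t blk_offset(uint32_t pa) { return pa & ((1u << BLK_OFFSET_BITS) - 1u); }
    static inline uint32_t line_index(uint32_t pa) { return (pa >> BLK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1u); }
    static inline uint32_t line_tag(uint32_t pa)   { return pa >> (BLK_OFFSET_BITS + INDEX_BITS); }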

  • 7/30/2019 MCU architecture

    14/232

    June-July 2009 14

    Addressing a cache (figure): the PA is split into TAG, INDEX and BLK. OFFSET; the INDEX selects a line whose TAG, STATE and DATA are read, the tag comparison yields HIT/MISS, and the DATA is returned according to the access size (how many bytes)

  • 7/30/2019 MCU architecture

    15/232

  • 7/30/2019 MCU architecture

    16/232

    June-July 2009 16

    Set associative cache: The example assumes one cache line per index

    Called a direct-mapped cache

    A different access to a line evicts the resident cache line

    This is either a capacity or a conflict miss
    Conflict misses can be reduced by providing multiple lines per index

    Access to an index returns a set of cache lines

    For an n-way set associative cache there are n lines per

    set

    Carry out multiple tag comparisons in parallel to see if any one in the set hits

  • 7/30/2019 MCU architecture

    17/232

    June-July 2009 17

    2-way set associative cache (figure): the PA's INDEX selects a set of two lines; both lines' TAG, STATE and DATA are read and the two tag comparisons (TAG0, TAG1) proceed in parallel

  • 7/30/2019 MCU architecture

    18/232

    June-July 2009 18

    Set associative cache: When you need to evict a line in a particular set you

    run a replacement policy LRU is a good choice: keeps the most recently used lines

    (favors temporal locality)

    Thus you reduce the number of conflict misses

    Two extremes of set size: direct-mapped (1-way) and fully associative (all lines are in a single set)
    Example: 32 KB cache, 2-way set associative, line size of 64 bytes: number of indices or number of sets = 32*1024/(2*64) = 256 and hence the index is 8 bits wide
    Example: Same size and line size, but fully associative: number of sets is 1; within the set there are 32*1024/64 = 512 lines; you need 512 tag comparisons for each access
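
    The arithmetic in the 2-way example above can be checked with a few lines of C (illustrative only):

    #include <stdio.h>

    int main(void)
    {
        unsigned cache_bytes = 32 * 1024;  /* 32 KB cache */
        unsigned line_bytes  = 64;         /* 64-byte lines */
        unsigned ways        = 2;          /* 2-way set associative */

        unsigned sets = cache_bytes / (ways * line_bytes);      /* 256 sets */
        unsigned index_bits = 0;
        for (unsigned s = sets; s > 1; s >>= 1) index_bits++;   /* log2(256) = 8 */

        printf("sets = %u, index bits = %u\n", sets, index_bits);
        return 0;
    }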

  • 7/30/2019 MCU architecture

    19/232

    June-July 2009 19

    Cache hierarchy: Ideally want to hold everything in a fast cache

    Never want to go to the memory

    But, with increasing size the access time increases

    A large cache will slow down every access

    So, put increasingly bigger and slower caches between the processor and the memory

    Keep the most recently used data in the nearest

    cache: register file (RF) Next level of cache: level 1 or L1 (same speed or

    slightly slower than RF, but much bigger)

    Then L2: way bigger than L1 and much slower

  • 7/30/2019 MCU architecture

    20/232

    June-July 2009 20

    Cache hierarchy: Example: Intel Pentium 4 (NetBurst)
    128 registers accessible in 2 cycles
    L1 data cache: 8 KB, 4-way set associative, 64 bytes line size, accessible in 2 cycles for integer loads

    L2 cache: 256 KB, 8-way set associative, 128 bytes line

    size, accessible in 7 cycles Example: Intel Itanium 2 (code name Madison)

    128 registers accessible in 1 cycle

    L1 instruction and data caches: each 16 KB, 4-way set

    associative, 64 bytes line size, accessible in 1 cycle Unified L2 cache: 256 KB, 8-way set associative, 128

    bytes line size, accessible in 5 cycles

    Unified L3 cache: 6 MB, 24-way set associative, 128 bytes line size, accessible in 14 cycles

  • 7/30/2019 MCU architecture

    21/232

    June-July 2009 21

    States of a cache line: The life of a cache line starts off in the invalid state (I)
    An access to that line takes a cache miss and fetches the line from main memory
    If it was a read miss the line is filled in the shared state (S) [we will discuss it later; for now just assume that this is equivalent to a valid state]
    In case of a store miss the line is filled in the modified state (M); instruction cache lines do not normally enter the M state (no store to the Icache)
    The eviction of a line in M state must write the line back to the memory (this is called a writeback cache); otherwise the effect of the store would be lost
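
    An illustrative C sketch (not from the slides) of the states mentioned above and the writeback-on-eviction rule; writeback_to_memory is a hypothetical helper.

    enum line_state { STATE_I, STATE_S, STATE_M };   /* invalid, shared, modified */

    struct cache_line { enum line_state state; /* tag, data, ... */ };

    extern void writeback_to_memory(struct cache_line *line);   /* hypothetical */

    void evict(struct cache_line *line)
    {
        if (line->state == STATE_M)
            writeback_to_memory(line);   /* a modified line must be written back */
        line->state = STATE_I;           /* the line returns to the invalid state */
    }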

  • 7/30/2019 MCU architecture

    22/232

  • 7/30/2019 MCU architecture

    23/232

    June-July 2009 23

    The first instruction: Accessing the first instruction

    Take the starting PC

    Access iTLB with the VPN extracted from PC: iTLB miss

    Invoke iTLB miss handler

    Calculate PTE address

    If PTEs are cached in L1 data and L2 caches, look them

    up with PTE address: you will miss there also

    Access page table in main memory: PTE is invalid: page

    fault

    Invoke page fault handler

    Allocate page frame, read page from disk, update PTE,

    load PTE in iTLB, restart fetch

  • 7/30/2019 MCU architecture

    24/232

    June-July 2009 24

    The first instruction: Now you have the physical address

    Access Icache: miss

    Send refill request to higher levels: you miss everywhere

    Send request to memory controller (north bridge)

    Access main memory Read cache line

    Refill all levels of cache as the cache line returns to the

    processor

    Extract the appropriate instruction from the cache line with the block offset
    This is the longest possible latency in an instruction/data access

  • 7/30/2019 MCU architecture

    25/232

    June-July 2009 25

    TLB access For every cache access (instruction or data) you

    need to access the TLB first

    Puts the TLB in the critical path

    Want to start indexing into cache and read the tags

    while TLB lookup takes place Virtually indexed physically tagged cache

    Extract index from the VA, start reading tag while looking

    up TLB

    Once the PA is available do tag comparison

    Overlaps TLB reading and tag reading

  • 7/30/2019 MCU architecture

    26/232

    June-July 2009 26

    Memory op latency: L1 hit: ~1 ns

    L2 hit: ~5 ns L3 hit: ~10-15 ns

    Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns
    If a load misses in all caches it will eventually come to the head of the ROB and block instruction retirement (in-order retirement is a must)
    Gradually, the pipeline backs up, the processor runs out of resources such as ROB entries and physical registers

    Ultimately, the fetcher stalls: severely limits ILP


  • 7/30/2019 MCU architecture

    27/232

    June-July 2009 27

    MLP Need memory-level parallelism (MLP)

    Simply speaking, need to mutually overlap several memory operations

    Step 1: Non-blocking cache Allow multiple outstanding cache misses

    Mutually overlap multiple cache misses

    Supported by all microprocessors today (Alpha 21364 supported 16 outstanding cache misses)

    Step 2: Out-of-order load issue Issue loads out of program order (address is not known

    at the time of issue)

    How do you know the load didn't issue before a store to the same address? Issuing stores must check for this memory-order violation


  • 7/30/2019 MCU architecture

    28/232

    June-July 2009 28

    Out-of-order loads
    sw 0(r7), r6

    /* other instructions */

    lw r2, 80(r20)

    Assume that the load issues before the store

    because r20 gets ready before r6 or r7 The load accesses the store buffer (used for holding already executed store values before they are committed to the cache at retirement)

    If it misses in the store buffer it looks up the cachesand, say, gets the value somewhere

    After several cycles the store issues and it turns out that 0(r7)==80(r20) or they overlap; now what?


  • 7/30/2019 MCU architecture

    29/232

    June-July 2009 29

    Load/store ordering: Out-of-order load issue relies on speculative

    memory disambiguation Assumes that there will be no conflicting store

    If the speculation is correct, you have issued the load much earlier and you have allowed the dependents to
    also execute much earlier
    If there is a conflicting store, you have to squash the load
    and all the dependents that have consumed the load value and re-execute them systematically
    Turns out that the speculation is correct most of the time
    To further minimize the load squash, microprocessors use simple memory dependence predictors (predicts if a load is going to conflict with a pending store based on that load's or load/store pair's past behavior)


  • 7/30/2019 MCU architecture

    30/232

    June-July 2009 30

    MLP and memory wall: Today microprocessors try to hide cache misses by
    initiating early prefetches: Hardware prefetchers try to predict the next several load addresses and initiate cache line prefetches if they are not already in the cache
    All processors today also support prefetch instructions;
    so you can specify in your program when to prefetch what: this gives much better control compared to a hardware prefetcher
    Researchers are working on load value prediction
    Even after doing all these, memory latency remains the biggest bottleneck
    Today microprocessors are trying to overcome one single wall: the memory wall
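
    As an illustration of software prefetching, the sketch below uses the GCC/Clang __builtin_prefetch intrinsic; the loop, names and prefetch distance are made up, and the slides do not prescribe any particular interface.

    void scale(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; i++) {
            if (i + 64 < n)
                __builtin_prefetch(&src[i + 64]);   /* hint: start fetching a later element now */
            dst[i] = 2.0f * src[i];
        }
    }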

  • 7/30/2019 MCU architecture

    31/232

    Fundamentals of Parallel Computers


  • 7/30/2019 MCU architecture

    32/232

    June-July 2009 32

    Agenda

    Convergence of parallel architectures
    Fundamental design issues

    ILP vs. TLP


  • 7/30/2019 MCU architecture

    33/232

    June-July 2009 33

    Communication architecture: Historically, parallel architectures are tied to programming models
    Diverse designs made it impossible to write portable
    parallel software
    But the driving force was the same: need for fast processing
    Today parallel architecture is seen as an extension of microprocessor architecture with a
    communication architecture, which defines the basic communication and synchronization
    operations and provides hw/sw implementation of those


  • 7/30/2019 MCU architecture

    34/232

    June-July 2009 34

    Layered architecture: A parallel architecture can be divided into several

    layers Parallel applications

    Programming models: shared address, message passing, multiprogramming, data parallel, dataflow etc.

    Compiler + libraries

    Operating systems support

    Communication hardware

    Physical communication medium

    Communication architecture = user/system interface + hw implementation (roughly defined by the last four layers)
    Compiler and OS provide the user interface to
    communicate between and synchronize threads


  • 7/30/2019 MCU architecture

    35/232

    June-July 2009 35

    Shared address: Communication takes place through a logically
    shared portion of memory

    User interface is normal load/store instructions

    Load/store instructions generate virtual addresses

    The VAs are translated to PAs by TLB or page table

    The memory controller then decides where to find this PA Actual communication is hidden from the programmer

    The general communication hw consists of multiple processors connected over some medium so that

    they can talk to memory banks and I/O devices The architecture of the interconnect may vary depending

    on projected cost and target performance


  • 7/30/2019 MCU architecture

    36/232

    June-July 2009 36

    Shared address: Communication medium
    Interconnect could be a crossbar switch so that any processor can talk to any memory bank in one hop (provides latency and bandwidth advantages)
    Scaling a crossbar becomes a problem: cost is proportional to the square of the size
    Instead, could use a scalable switch-based network; latency increases and bandwidth decreases because now multiple processors contend for switch ports

    (Figure: "dance hall" organization - processors P connect through an INTERCONNECT to MEM banks and I/O)


  • 7/30/2019 MCU architecture

    37/232

    June-July 2009 37

    Shared address: Communication medium
    From the mid 80s the shared bus became popular, leading to the
    design of SMPs; the Pentium Pro Quad was the first commodity SMP
    The Sun Enterprise server provided a highly pipelined wide shared bus for scalability reasons; it also distributed the
    memory to each processor, but there was no local bus on the boards i.e. the memory was still symmetric (must use the shared bus)
    NUMA or DSM architectures provide a better solution to the scalability problem; the symmetric view is replaced by
    local and remote memory and each node (containing processor(s) with caches, memory controller and router) gets connected via a scalable network (mesh, ring etc.); examples include Cray/SGI T3E, SGI Origin 2000, Alpha GS320, Alpha/HP GS1280 etc.


  • 7/30/2019 MCU architecture

    38/232

    June-July 2009 38

    Message passing: Very popular for large-scale computing
    The system architecture looks exactly the same as DSM, but there is no shared memory
    The user interface is via send/receive calls to the message layer
    The message layer is integrated to the I/O system instead of the memory system
    Send specifies a local data buffer that needs to be transmitted; send also specifies a tag
    A matching receive at the destination node with the same tag reads in the data from a kernel space buffer to user memory
    Effectively, provides a memory-to-memory copy


  • 7/30/2019 MCU architecture

    39/232

    June-July 2009 39

    Message passing: Actual implementation of the message layer
    Initially it was very topology dependent
    A node could talk only to its neighbors through FIFO buffers
    These buffers were small in size and therefore while sending a message send would occasionally block
    waiting for the receive to start reading the buffer (synchronous message passing)
    Soon the FIFO buffers got replaced by DMA (direct memory access) transfers so that a send can initiate a transfer from memory to I/O buffers and finish immediately (DMA happens in the background); the same applies to the receiving end also
    The parallel algorithms were designed specifically for certain topologies: a big problem


  • 7/30/2019 MCU architecture

    40/232

    June-July 2009 40

    Message passing: To improve usability of machines, the message
    layer started providing support for arbitrary source
    and destination (not just nearest neighbors)
    Essentially involved storing a message in intermediate
    hops and forwarding it to the next node on the route
    Later this store-and-forward routing got moved to hardware where a switch could handle all the routing activities
    Further improved to do pipelined wormhole routing so that the time taken to traverse the intermediate hops became small compared to the time it takes to push the
    message from processor to network (limited by node-to-network bandwidth)
    Examples include IBM SP2, Intel Paragon
    Each node of Paragon had two i860 processors, one of which was dedicated to servicing the network (send/recv. etc.)


  • 7/30/2019 MCU architecture

    41/232

    June-July 2009 41

    Convergence: Shared address and message passing are two
    distinct programming models, but the architectures
    look very similar
    Both have a communication assist or network interface to
    initiate messages or transactions
    In shared memory this assist is integrated with the memory controller
    In message passing this assist normally used to be integrated with the I/O, but the trend is changing
    There are message passing machines where the assist sits on the memory bus or machines where DMA over
    network is supported (direct transfer from source memory to destination memory)
    Finally, it is possible to emulate send/recv. on shared memory through shared buffers, flags and locks
    Possible to emulate a shared virtual mem. on message passing machines through modified page fault handlers


  • 7/30/2019 MCU architecture

    42/232

    June-July 2009 42
    A generic architecture: In all the architectures we have discussed thus far

    a node essentially contains processor(s) + caches,

    memory and a communication assist (CA)

    CA = network interface (NI) + communication controller

    The nodes are connected over a scalable network

    The main difference remains in the architecture ofthe CA

    And even under a particular programming model (e.g.,
    shared memory) there are a lot of choices in the design of
    the CA

    Most innovations in parallel architecture take place in

    the communication assist (also called communication

    controller or node controller)


  • 7/30/2019 MCU architecture

    43/232

    June-July 2009 43
    A generic architecture
    (Figure: NODEs connected by a SCALABLE NETWORK; each node contains a processor P with CACHE, MEM and a CA joined by a crossbar XBAR)


  • 7/30/2019 MCU architecture

    44/232

    June-July 2009 44
    Design issues: Need to understand architectural components that

    affect software Compiler, library, program

    User/system interface and hw/sw interface

    How do programming models efficiently talk to the
    communication architecture? How to implement efficient primitives in the

    communication layer?

    In a nutshell, what issues of a parallel machine will affect

    the performance of the parallel applications?

    Naming, Operations, Ordering, Replication, Communication cost


  • 7/30/2019 MCU architecture

    45/232

    June-July 2009 45

    Naming: How are the data in a program referenced?

    In sequential programs a thread can access any variable in its virtual address space

    In shared memory programs a thread can access any

    private or shared variable (same load/store model of

    sequential programs) In message passing programs a thread can access local

    data directly

    Clearly, naming requires some support from hw

    and OS Need to make sure that the accessed virtual address

    gets translated to the correct physical address


  • 7/30/2019 MCU architecture

    46/232

    June-July 2009 46

    Operations: What operations are supported to access data?
    For sequential and shared memory models load/store are sufficient

    For message passing models send/receive are needed

    to access remote data

    For shared memory, hw (essentially the CA) needs to make sure that a load/store operation gets correctly

    translated to a message if the address is remote

    For message passing, CA or the message layer needs to

    copy data from local memory and initiate a send, or copy data from the receive buffer to local memory


  • 7/30/2019 MCU architecture

    47/232

    June-July 2009 47

    Ordering: How are the accesses to the same data ordered?

    For sequential model, it is the program order: true

    dependence order For shared memory, within a thread it is the program

    order, across threads some valid interleaving of accesses as expected by the programmer and enforced
    by synchronization operations (locks, point-to-point synchronization through flags, global synchronization through barriers)
    Ordering issues are very subtle and important in the shared memory model (some microprocessor re-ordering tricks
    may easily violate correctness when used in a shared memory context)
    For message passing, ordering across threads is implied through point-to-point send/receive pairs (producer-consumer relationship) and mutual exclusion is inherent

    (no shared variable)


  • 7/30/2019 MCU architecture

    48/232

    June-July 2009 48

    Replication: How is the shared data locally replicated?

    This is very important for reducing communication traffic In microprocessors data is replicated in the cache to

    reduce memory accesses

    In message passing, replication is explicit in the program

    and happens through receive (a private copy is created) In shared memory a load brings in the data to the cache

    hierarchy so that subsequent accesses can be fast; this

    is totally hidden from the program and therefore the

    hardware must provide a layer that keeps track of the most recent copies of the data (this layer is central to the

    performance of shared memory multiprocessors and is

    called the cache coherence protocol)


  • 7/30/2019 MCU architecture

    49/232

    June-July 2009 49

    Communication cost: Three major components of the communication
    architecture that affect performance
    Latency: time to do an operation (e.g., load/store or send/recv.)
    Bandwidth: rate of performing an operation
    Overhead or occupancy: how long is the communication layer occupied doing an operation
    Latency: Already a big problem for microprocessors
    Even bigger problem for multiprocessors due to remote operations
    Must optimize the application or hardware to hide or lower latency (algorithmic optimizations or prefetching or overlapping computation with communication)


  • 7/30/2019 MCU architecture

    50/232

    June-July 2009 50

    Communication cost: Bandwidth
    How many ops in unit time e.g. how many bytes
    transferred per second
    Local BW is provided by heavily banked memory or a
    faster and wider system bus
    Communication BW has two components: 1. node-to-network BW (also called network link BW) measures how fast bytes can be pushed into the router from the CA, 2. within-network bandwidth: affected by scalability of the network and architecture of the switch or router
    Linear cost model: Transfer time = T0 + n/B where
    T0 is start-up overhead, n is number of bytes transferred and B is BW
    Not sufficient since overlap of comp. and comm. is not
    considered; also does not count how the transfer is done
    (pipelined or not)


  • 7/30/2019 MCU architecture

    51/232

    June-July 2009 51

    Communication cost: Better model:
    Communication time for n bytes = Overhead + CA
    occupancy + Network latency + Size/BW + Contention
    T(n) = Ov + Oc + L + n/B + Tc
    Overhead and occupancy may be functions of n
    Contention depends on the queuing delay at various
    components along the communication path e.g. waiting time at the communication assist or controller, waiting time at the router etc.
    Overall communication cost = frequency of communication x (communication time - overlap with
    useful computation)
    Frequency of communication depends on various factors such as how the program is written or the granularity of communication supported by the underlying hardware
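
    To make the model concrete, here is a tiny C helper with made-up numbers (they are illustrative, not measurements from any real machine):

    /* T(n) = Ov + Oc + L + n/B + Tc */
    double comm_time(double ov, double oc, double l, double n_bytes, double bw, double tc)
    {
        return ov + oc + l + n_bytes / bw + tc;
    }

    /* e.g., comm_time(0.5e-6, 0.2e-6, 1.0e-6, 1024.0, 1.0e9, 0.3e-6) is the time
       to move 1024 bytes at 1 GB/s given the assumed fixed costs (about 3 microseconds). */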


  • 7/30/2019 MCU architecture

    52/232

    June-July 2009 52

    ILP vs. TLP: Microprocessors enhance performance of a
    sequential program by extracting parallelism from
    an instruction stream (called instruction-level parallelism)
    Multiprocessors enhance performance of an explicitly parallel program by running multiple
    threads in parallel (called thread-level parallelism)
    TLP provides parallelism at a much larger
    granularity compared to ILP
    In multiprocessors ILP and TLP work together
    Within a thread ILP provides a performance boost
    Across threads TLP provides speedup over a sequential version of the parallel program

  • 7/30/2019 MCU architecture

    53/232

    Parallel Programming

  • 7/30/2019 MCU architecture

    54/232

    Agenda

  • 7/30/2019 MCU architecture

    55/232

    June-July 2009 55

    Agenda Steps in writing a parallel program

    Example


  • 7/30/2019 MCU architecture

    56/232

    June-July 2009 56

    Writing a parallel program
    Start from a sequential description
    Identify work that can be done in parallel
    Partition work and/or data among threads or processes
    Decomposition and assignment
    Add necessary communication and synchronization: Orchestration
    Map threads to processors (Mapping)
    How good is the parallel program?
    Measure speedup = sequential execution time/parallel
    execution time = number of processors ideally


  • 7/30/2019 MCU architecture

    57/232

    June-July 2009 57

    Some definitions: Task

    Arbitrary piece of sequential work Concurrency is only across tasks

    Fine-grained task vs. coarse-grained task: controls

    granularity of parallelism (spectrum of grain: one

    instruction to the whole sequential program) Process/thread

    Logical entity that performs a task

    Communication and synchronization happen between

    threads

    Processors

    Physical entity on which one or more processes execute


  • 7/30/2019 MCU architecture

    58/232

    June-July 2009 58

    Decomposition: Find concurrent tasks and divide the program into

    tasks Level or grain of concurrency needs to be decided here

    Too many tasks: may lead to too much overhead of

    communicating and synchronizing between tasks

    Too few tasks: may lead to idle processors Goal: Just enough tasks to keep the processors busy

    Number of tasks may vary dynamically

    New tasks may get created as the computation

    proceeds: new rays in ray tracing

    Number of available tasks at any point in time is an upper

    bound on the achievable speedup


  • 7/30/2019 MCU architecture

    59/232

    June-July 2009 59

    Static assignment: Given a decomposition it is possible to assign
    tasks statically
    For example, some computation on an array of size N
    can be decomposed statically by assigning a range of indices to each process: for k processes P0 operates on indices 0 to (N/k)-1, P1 operates on N/k to (2N/k)-1, ...,
    Pk-1 operates on (k-1)N/k to N-1 (see the sketch below)
    For regular computations this works great: simple and
    low-overhead
    What if the nature of the computation depends on the

    index? For certain index ranges you do some heavy-weight

    computation while for others you do something simple

    Is there a problem?
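
    The sketch referred to above: a minimal C rendering (not from the slides) of the static block assignment; work_on and the variable names are hypothetical.

    extern void work_on(int index);              /* hypothetical per-element work */

    void static_block_assignment(int N, int k, int pid)
    {
        int lo = pid * (N / k);                                /* first index owned by process pid */
        int hi = (pid == k - 1) ? N : (pid + 1) * (N / k);     /* one past the last owned index */
        for (int i = lo; i < hi; i++)
            work_on(i);
    }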

  • 7/30/2019 MCU architecture

    60/232

  • 7/30/2019 MCU architecture

    61/232


  • 7/30/2019 MCU architecture

    62/232

    June-July 2009 62

    Decomposition types: Decomposition by data

    The most commonly found decomposition technique The data set is partitioned into several subsets and each

    subset is assigned to a process

    The type of computation may or may not be identical on

    each subset Very easy to program and manage

    Computational decomposition

    Not so popular: tricky to program and manage

    All processes operate on the same data, but probably carry out different kinds of computation

    More common in systolic arrays, pipelined graphics

    processor units (GPUs) etc.


  • 7/30/2019 MCU architecture

    63/232

    June-July 2009 63

    Orchestration: Involves structuring communication and

    synchronization among processes, organizing data

    structures to improve locality, and scheduling tasks

    This step normally depends on the programming model

    and the underlying architecture

    Goal is to Reduce communication and synchronization costs

    Maximize locality of data reference

    Schedule tasks to maximize concurrency: do not

    schedule dependent tasks in parallel

    Reduce overhead of parallelization and concurrency

    management (e.g., management of the task queue,

    overhead of initiating a task etc.)


  • 7/30/2019 MCU architecture

    64/232

    June-July 2009 64

    Mapping: At this point you have a parallel program

    Just need to decide which and how many processes go

    to each processor of the parallel machine Could be specified by the program

    Pin particular processes to a particular processor for the whole life of the program; the processes cannot migrate

    to other processors Could be controlled entirely by the OS

    Schedule processes on idle processors

    Various scheduling algorithms are possible e.g., round

    robin: process#k goes to processor#k NUMA-aware OS normally takes into account

    multiprocessor-specific metrics in scheduling

    How many processes per processor? Most common

    is one-to-one


  • 7/30/2019 MCU architecture

    65/232

    June-July 2009 65

    An example: Iterative equation solver

    Main kernel in Ocean simulation Update each 2-D grid point via Gauss-Seidel iterations

    A[i,j] = 0.2*(A[i,j]+A[i,j+1]+A[i,j-1]+A[i+1,j]+A[i-1,j])

    Pad the n by n grid to (n+2) by (n+2) to avoid corner

    problems

    Update only interior n by n grid

    One iteration consists of updating all n^2 points in-place

    and accumulating the difference from the previous value

    at each point

    If the difference is less than a threshold, the solver is said

    to have converged to a stable grid equilibrium


  • 7/30/2019 MCU architecture

    66/232

    June-July 2009 66

    Sequential program
    int n;

    float **A, diff;

    begin main()

    read (n); /* size of grid */

    Allocate (A);

    Initialize (A);

    Solve (A);

    end main

    begin Solve (A)

    int i, j, done = 0;

    float temp;

    while (!done)

    diff = 0.0;

    for i = 0 to n-1

    for j = 0 to n-1

    temp = A[i,j];

    A[i,j] = 0.2*(A[i,j]+A[i,j+1]+A[i,j-1]+
    A[i-1,j]+A[i+1,j]);

    diff += fabs (A[i,j] - temp);

    endfor

    endfor

    if (diff/(n*n) < TOL) then done = 1;

    endwhile

    end Solve


  • 7/30/2019 MCU architecture

    67/232

    June-July 2009 67

    Decomposition: Look for concurrency in loop iterations
    In this case iterations are really dependent
    Iteration (i, j) depends on iterations (i, j-1) and (i-1, j)

    Each anti-diagonal can be computed in parallel

    Must synchronize after each anti-diagonal (or pt-to-pt)

    Alternative: red-black ordering (different update pattern)

  • 7/30/2019 MCU architecture

    68/232


  • 7/30/2019 MCU architecture

    69/232

    June-July 2009 69

    Decomposition
    while (!done)

    diff = 0.0;

    for_all i = 0 to n-1

    for_all j = 0 to n-1

    temp = A[i, j];

    A[i, j] = 0.2*(A[i, j]+A[i, j+1]+A[i, j-1]+A[i-1, j]+A[i+1, j]);
    diff += fabs (A[i, j] - temp);
    end for_all
    end for_all

    if (diff/(n*n) < TOL) then done = 1;

    end while

    Offers concurrency across elements: degree of concurrency is n^2
    Make the j loop sequential to have row-wise decomposition: degree n concurrency


  • 7/30/2019 MCU architecture

    70/232

    June-July 2009 70

    Assignment: Possible static assignment: block row
    decomposition
    Process 0 gets rows 0 to (n/p)-1, process 1 gets rows n/p to (2n/p)-1 etc.
    Another static assignment: cyclic row decomposition
    Process 0 gets rows 0, p, 2p, ...; process 1 gets rows 1, p+1, 2p+1, ...
    Dynamic assignment
    Grab next available row, work on that, grab a new row, ...
    Static block row assignment minimizes nearest neighbor communication by assigning contiguous rows to the same process

  • 7/30/2019 MCU architecture

    71/232


  • 7/30/2019 MCU architecture

    72/232

    June-July 2009 72

    Shared memory version

    void Solve (void)

    {

    int i, j, pid, done = 0;

    float temp, local_diff;

    GET_PID (pid);

    while (!done) {

    local_diff = 0.0;

    if (!pid) gm->diff = 0.0;

    BARRIER (gm->barrier, P);/*why?*/

    for (i = pid*(n/P); i < (pid+1)*(n/P);

    i++) {

    for (j = 0; j < n; j++) {

    temp = gm->A[i] [j];

    gm->A[i] [j] = 0.2*(gm->A[i] [j] +
    gm->A[i] [j-1] + gm->A[i] [j+1] + gm-
    >A[i+1] [j] + gm->A[i-1] [j]);
    local_diff += fabs (gm->A[i] [j] -
    temp);} /* end for */

    } /* end for */

    LOCK (gm->diff_lock);

    gm->diff += local_diff;

    UNLOCK (gm->diff_lock);

    BARRIER (gm->barrier, P);

    if (gm->diff/(n*n) < TOL) done = 1;

    BARRIER (gm->barrier, P); /* why? */

    } /* end while */

    }


  • 7/30/2019 MCU architecture

    73/232

    June-July 2009 73

    Mutual exclusion: Use LOCK/UNLOCK around critical sections

    Updates to shared variable diff must be sequential Heavily contended locks may degrade performance

    Try to minimize the use of critical sections: they are

    sequential anyway and will limit speedup

    This is the reason for using a local_diff instead of accessing gm->diff every time

    Also, minimize the size of critical section because the

    longer you hold the lock, longer will be the waiting time

    for other processors at lock acquire


  • 7/30/2019 MCU architecture

    74/232

    June-July 2009 74

    LOCK optimization: Suppose each processor updates a shared variable
    holding a global cost value, only if its local cost is less than the global cost: found frequently in minimization problems
    /* Version 1: may lead to heavy lock contention if everyone tries to update at the same time */
    LOCK (gm->cost_lock);
    if (my_cost < gm->cost) {
        gm->cost = my_cost;
    }
    UNLOCK (gm->cost_lock);
    /* Version 2: test outside the lock first */
    if (my_cost < gm->cost) {
        LOCK (gm->cost_lock);
        if (my_cost < gm->cost) { /* make sure */
            gm->cost = my_cost;
        }
        UNLOCK (gm->cost_lock);
    } /* this works because gm->cost is monotonically decreasing */


  • 7/30/2019 MCU architecture

    75/232

    June-July 2009 75

    More synchronization: Global synchronization

    Through barriers

    Often used to separate computation phases

    Point-to-point synchronization

    A process directly notifies another about a certain event

    on which the latter was waiting Producer-consumer communication pattern

    Semaphores are used for concurrent programming on

    uniprocessor through P and V functions

    Normally implemented through flags on shared memory multiprocessors (busy wait or spin)

    P0: A = 1; flag = 1;

    P1: while (!flag); use (A);
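
    Written out as C, the flag-based point-to-point synchronization above looks like the sketch below (illustrative; `use` is a hypothetical consumer, and on a real machine this also depends on the memory consistency model discussed later in these slides).

    extern void use(int value);        /* hypothetical consumer of A */

    volatile int A = 0, flag = 0;      /* shared between P0 and P1 */

    void producer(void) { A = 1; flag = 1; }            /* P0 */
    void consumer(void) { while (!flag) ; use(A); }     /* P1: busy wait (spin), then consume */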


  • 7/30/2019 MCU architecture

    76/232

    June-July 2009 76

    Message passing: What is different from shared memory?

    No shared variable: expose communication through

    send/receive No lock or barrier primitive

    Must implement synchronization through send/receive

    Grid solver example

    P0 allocates and initializes matrix A in its local memory Then it sends the block rows, n, P to each processor i.e.

    P1 waits to receive rows n/P to 2n/P-1 etc. (this is one-time)

    Within the while loop the first thing that every processor does is to send its first and last rows to the upper and the lower processors (corner cases need to be handled)
    Then each processor waits to receive the neighboring two rows from the upper and the lower processors


  • 7/30/2019 MCU architecture

    77/232

    June-July 2009 77

    Message passing
    At the end of the loop each processor sends its local_diff to P0 and P0 sends back the accumulated

    diff so that each processor can locally compute the

    done flag


  • 7/30/2019 MCU architecture

    78/232

    June-July 2009 78

    Maj o r ch an ges

    /* include files */

    MAIN_ENV;

    int P, n;

    void Solve ();

    struct gm_t {

    LOCKDEC (diff_lock);

    BARDEC (barrier);
    float **A, diff;

    } *gm;

    int main (char **argv, int argc)

    { int i; int P, n; float **A;

    MAIN_INITENV;

    gm = (struct gm_t*) G_MALLOC(sizeof (struct gm_t));

    LOCKINIT (gm->diff_lock);

    BARINIT (gm->barrier);

    n = atoi (argv[1]);
    P = atoi (argv[2]);

    gm->A = (float**) G_MALLOC((n+2)*sizeof (float*));

    for (i = 0; i < n+2; i++) {

    gm->A[i] = (float*) G_MALLOC((n+2)*sizeof (float));

    }

    Initialize (gm->A);

    for (i = 1; i < P; i++) { /* starts at 1 */

    CREATE (Solve);

    }
    Solve ();

    WAIT_FOR_END (P-1);

    MAIN_END;

    }

    (Slide annotation: "Local Alloc.")


  • 7/30/2019 MCU architecture

    79/232

    June-July 2009 79

    Maj o r ch an ges

    void Solve (void)

    {

    int i, j, pid, done = 0;

    float temp, local_diff;

    GET_PID (pid);

    while (!done) {

    local_diff = 0.0;

    if (!pid) gm->diff = 0.0;

    BARRIER (gm->barrier, P);/*why?*/

    for (i = pid*(n/P); i < (pid+1)*(n/P);

    i++) {

    for (j = 0; j < n; j++) {

    temp = gm->A[i] [j];

    gm->A[i] [j] = 0.2*(gm->A[i] [j] +
    gm->A[i] [j-1] + gm->A[i] [j+1] + gm-
    >A[i+1] [j] + gm->A[i-1] [j]);
    local_diff += fabs (gm->A[i] [j] -
    temp);} /* end for */

    } /* end for */

    LOCK (gm->diff_lock);

    gm->diff += local_diff;

    UNLOCK (gm->diff_lock);

    BARRIER (gm->barrier, P);

    if (gm->diff/(n*n) < TOL) done = 1;

    BARRIER (gm->barrier, P); /* why? */

    } /* end while */

    }

    (Slide annotations marking the message passing changes: "if (pid) Recv rows, n, P", "Send up/down", "Recv up/down", "Send local_diff to P0", "Recv diff")


  • 7/30/2019 MCU architecture

    80/232

    June-July 2009 80

    Message p assin g

    This algorithm is deterministic

    May converge to a different solution compared to the shared memory version if there are multiple
    solutions: why? There is a fixed specific point in the program (at the

    beginning of each iteration) when the neighboring rows

    are communicated

    This is not true for shared memory

  • 7/30/2019 MCU architecture

    81/232

    Message Passing Grid Solver
    MPI-like environment

  • 7/30/2019 MCU architecture

    82/232

    June-July 2009 82

    MPI-like environment: MPI stands for Message Passing Interface
    A C library that provides a set of message passing primitives (e.g., send, receive, broadcast etc.) to the user

    PVM (Parallel Virtual Machine) is another well-known platform for message passing programming

    Background in MPI is not necessary for understanding this lecture

    Only need to know When you start an MPI program every thread runs the

    same main function We will assume that we pin one thread to one processor

    just as we did in shared memory

    Instead of using the exact MPI syntax we will use

    some macros that call the MPI functions

    MAIN_ENV; /* define message tags */
    #define ROW 99
    while (!done) {
    local_diff = 0.0;
    /* MPI_CHAR means raw byte format */

  • 7/30/2019 MCU architecture

    83/232

    June-July 2009 83

    #define ROW 99

    #define DIFF 98

    #define DONE 97

    int main(int argc, char **argv){

    int pid, P, done, i, j, N;

    float tempdiff, local_diff, temp, **A;

    MAIN_INITENV;

    GET_PID(pid);

    GET_NUMPROCS(P);

    N = atoi(argv[1]);

    tempdiff = 0.0;

    done = 0;

    A = (float **) malloc ((N/P+2) * sizeof(float *));

    for (i=0; i < N/P+2; i++) {

    A[i] = (float *) malloc (sizeof(float)* (N+2));

    }

    initialize(A);

    /* MPI_CHAR means raw byte format */

    if (pid) { /* send my first row up */

    SEND(&A[1][1], N*sizeof(float),MPI_CHAR, pid-1, ROW);

    }
    if (pid != P-1) { /* recv last row */

    RECV(&A[N/P+1][1], N*sizeof(float),MPI_CHAR, pid+1, ROW);

    }

    if (pid != P-1) { /* send last row down */

    SEND(&A[N/P][1], N*sizeof(float),MPI_CHAR, pid+1, ROW);

    }

    if (pid) { /* recv first row from above */

    RECV(&A[0][1], N*sizeof(float),MPI_CHAR, pid-1, ROW);

    }

    for (i=1; i

  • 7/30/2019 MCU architecture

    84/232


  • 7/30/2019 MCU architecture

    85/232

    June-July 2009 85

    Performance Issues

  • 7/30/2019 MCU architecture

    86/232

    June-July 2009 86

    Agenda: Partitioning for performance

    Data access and communication Summary

    Goal is to understand simple trade-offs involved in writing a parallel program keeping an eye on

    parallel performance

    Getting good performance out of a multiprocessor is

    difficult Programmers need to be careful

    A little carelessness may lead to extremely poor

    performance


  • 7/30/2019 MCU architecture

    87/232

    June-July 2009 87

    Partitioning for performance: Partitioning plays an important role in the parallel

    performance This is where you essentially determine the tasks

    A good partitioning should practise

    Load balance

    Minimal communication

    Low overhead to determine and manage task

    assignment (sometimes called extra work)

    A well-balanced parallel program automatically has low barrier or point-to-point synchronization time

    Ideally I want all the threads to arrive at a barrier at the

    same time


  • 7/30/2019 MCU architecture

    88/232

    June-July 2009 88

    Load balancing: Achievable speedup is bounded above by

    Sequential exec. time / Max. time for any processor

    Thus speedup is maximized when the maximum time and minimum time across all processors are close (want to minimize the variance of parallel execution time)

    This directly gets translated to load balancing

    What leads to a high variance? Ultimately all processors finish at the same time But some do useful work all over this period while others

    may spend a significant time at synchronization points

    This may arise from a bad partitioning

    There may be other architectural reasons for load imbalance beyond the scope of a programmer e.g., network congestion, unforeseen cache conflicts etc. (slows down a few threads)


  • 7/30/2019 MCU architecture

    89/232

    June-July 2009 89

    Dynamic task queues: Introduced in the last lecture
    Normally implemented as part of the parallel program

    Two possible designs

    Centralized task queue: a single queue of tasks; may

    lead to heavy contention because insertion and deletion

    to/from the queue must be critical sections

    Distributed task queues: one queue per processor

    Issue with distributed task queues When a queue of a particular processor is empty whatdoes it do? Task stealing


  • 7/30/2019 MCU architecture

    90/232

    June-July 2009 90

    Task stealing: A processor may choose to steal tasks from
    another processor's queue if the former's queue is
    empty
    How many tasks to steal? Whom to steal from?
    The biggest question: how to detect termination? Really a distributed consensus!
    Task stealing, in general, may increase overhead and communication, but a smart design may lead to excellent load balance (normally hard to design efficiently)
    This is a form of a more general technique called Receiver Initiated Diffusion (RID) where the receiver of the task initiates the task transfer
    In Sender Initiated Diffusion (SID) a processor may choose to insert into another processor's queue if the former's task queue is full above a threshold


  • 7/30/2019 MCU architecture

    91/232

    June-July 2009 91

    Architect's job: Normally load balancing is a responsibility of the
    programmer
    However, an architecture may provide efficient primitives to implement task queues and task stealing
    For example, the task queue may be allocated in a special shared memory segment, accesses to which may be optimized by special hardware in the memory controller
    But this may expose some of the architectural features to the programmer
    There are multiprocessors that provide efficient
    implementations for certain synchronization primitives; this may improve load balance
    Sophisticated hardware tricks are possible: dynamic load monitoring and favoring slow threads dynamically


  • 7/30/2019 MCU architecture

    92/232

    June-July 2009 92

    Partitioning and communication
    Need to reduce inherent communication
    This is the part of communication determined by
    assignment of tasks
    There may be other communication traffic also (more
    later)
    Goal is to assign tasks such that accessed data are mostly local to a process
    Ideally I do not want any communication
    But in life sometimes you need to talk to people to get
    some work done!


  • 7/30/2019 MCU architecture

    93/232

    June-July 2009 93

    Domain decomposition: Normally applications show a local bias on data

    usage Communication is short-range e.g. nearest neighbor

    Even if it is long-range it falls off with distance

    View the dataset of an application as the domain of the

    problem e.g., the 2-D grid in equation solver If you consider a point in this domain, in most of the

    applications it turns out that this point depends on points

    that are close by

    Partitioning can exploit this property by assigning contiguous pieces of data to each process

    Exact shape of decomposed domain depends on the

    application and load balancing requirements


  • 7/30/2019 MCU architecture

    94/232

    June-July 2009 94

    Comm-to-comp ratio: Surely, there could be many different domain
    decompositions for a particular problem
    For the grid solver we may have a square block decomposition, block row decomposition or cyclic row
    decomposition
    How to determine which one is good? Communication-to-computation ratio
    Assume P processors and an NxN grid for the grid solver
    (Figure: square block decomposition for P=16, blocks P0 P1 P2 P3 / P4 P5 P6 P7 / ... / P15)
    Size of each block: N/sqrt(P) by N/sqrt(P)
    Communication (perimeter): 4N/sqrt(P)
    Computation (area): N^2/P
    Comm-to-comp ratio = 4*sqrt(P)/N


  • 7/30/2019 MCU architecture

    95/232

    June-July 2009 95

    Comm-to-comp ratio: For block row decomposition

    Each strip has N/P rows Communication (boundary rows): 2N

    Computation (area): N^2/P (same as square block)
    Comm-to-comp ratio: 2P/N
    For cyclic row decomposition: Each processor gets N/P isolated rows
    Communication: 2N^2/P
    Computation: N^2/P

    Comm-to-comp ratio: 2

    Normally N is much much larger than P

    Asymptotically, square block yields lowest comm-to-

    comp ratio
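
    As a quick numerical illustration (the values are made up, not from the slides), take N = 1024 and P = 16:

    square block: 4*sqrt(P)/N = 16/1024 = 1/64
    block row:    2P/N        = 32/1024 = 1/32
    cyclic row:   2

    so the square block decomposition communicates the least per unit of computation.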


  • 7/30/2019 MCU architecture

    96/232

    June-July 2009 96

    Comm-to-comp ratio: Idea is to measure the volume of inherent
    communication per computation
    In most cases it is beneficial to pick the decomposition with the lowest comm-to-comp ratio
    But depends on the application structure i.e. picking the lowest comm-to-comp may have other problems
    Normally this ratio gives you a rough estimate about the average communication bandwidth requirement of the application i.e. how frequent is communication
    But it does not tell you the nature of communication i.e. bursty or uniform
    For the grid solver comm. happens only at the start of each iteration; it is not uniformly distributed over the computation
    Thus the worst case BW requirement may exceed the average comm-to-comp ratio


  • 7/30/2019 MCU architecture

    97/232

    June-July 2009 97

    Extra work in a parallel version of a sequential program may result from

    Decomposition

    Assignment techniques

    Management of the task pool etc.

    Speedup is bounded above by

    Sequential work / Max (Useful work +

    Synchronization + Comm. cost + Extra work)

    where the Max is taken over all processors

    But this is still incomplete

    We have only considered communication cost from the

    viewpoint of the algorithm and ignored the architecture

    completely


  • 7/30/2019 MCU architecture

    98/232

    June-July 2009 98

    Data access and communication: The memory hierarchy (caches and main memory)
    plays a significant role in determining communication cost
    May easily dominate the inherent communication of the
    algorithm
    For a uniprocessor, the execution time of a program
    is given by useful work time + data access time
    Useful work time is normally called the busy time or busy
    cycles
    Data access time can be reduced either by architectural
    techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality


  • 7/30/2019 MCU architecture

    99/232

    June-July 2009 99

    Data access: In multiprocessors

    Every processor wants to see the memory interface as its

    own local cache and the main memory In reality it is much more complicated

    If the system has a centralized memory (e.g., SMPs),

    there are still caches of other processors; if the memory

    is distributed then some part of it is local and some is remote

    For shared memory, data movement from local or remote

    memory to cache is transparent while for message

    passing it is explicit View a multiprocessor as an extended memory hierarchy

    where the extension includes caches of other

    processors, remote memory modules and the network

    topology


  • 7/30/2019 MCU architecture

    100/232

    June-July 2009 100

    Artifactual communication: Communication caused by artifacts of the extended memory hierarchy

    Data accesses not satisfied in the cache or local memory

    cause communication

    Inherent communication is caused by data transfers

    determined by the program

    Artifactual communication is caused by poor allocation of

    data across distributed memories, unnecessary data in a

    transfer, unnecessary transfers due to system-dependent

    transfer granularity, redundant communication of data,

    finite replication capacity (in cache or memory)

    Inherent communication assumes infinite capacity and perfect knowledge of what should be

    transferred

  • 7/30/2019 MCU architecture

    101/232

  • 7/30/2019 MCU architecture

    102/232

    Spatial locality

  • 7/30/2019 MCU architecture

    103/232

    June-July 2009 103

Consider a square block decomposition of the grid solver and a C-like row-major layout, i.e., A[i][j] and A[i][j+1] have contiguous memory locations

[Figure: memory allocation showing pages that straddle partition boundaries and cache lines that cross partitions]

The same page is local to one processor while remote to others; the same applies to straddling cache lines. Ideally, I want to have all pages within a partition local to a single processor. The standard trick is to convert the 2D array to 4D.

2D to 4D conversion


Essentially you need to change the way memory is allocated

The matrix A needs to be allocated in such a way that the elements falling within a partition are contiguous

The first two dimensions of the new 4D matrix are block row and column indices, i.e., for the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16 processors)

The next two dimensions hold the data elements within that partition

Thus the 4D array may be declared as float B[P][P][N/P][N/P]

The element B[3][2][5][10] corresponds to the element in the 10th column, 5th row of the partition of P14

Now all elements within a partition have contiguous addresses
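
A minimal C sketch of this layout is given below. The constants N and PDIM, and the elem() helper that maps a global (i, j) index onto the 4D array, are illustrative assumptions and not from the original slides; PDIM plays the role of P above, i.e., the number of partitions along each dimension (4 for 16 processors).

    #include <stdio.h>

    #define N    1024          /* assumed grid dimension */
    #define PDIM 4             /* partitions per dimension (16 processors) */
    #define BLK  (N / PDIM)    /* edge length of one partition */

    /* 4D layout: the first two indices select the partition (block row,
       block column); the last two index the element inside the block.
       All elements of one partition are now contiguous in memory. */
    static float B4[PDIM][PDIM][BLK][BLK];

    /* Hypothetical helper: map a global (i, j) index onto the 4D layout. */
    static float *elem(int i, int j)
    {
        return &B4[i / BLK][j / BLK][i % BLK][j % BLK];
    }

    int main(void)
    {
        *elem(5, 10) = 1.0f;   /* row 5, column 10 of block (0, 0) */
        printf("%f\n", *elem(5, 10));
        return 0;
    }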

Transfer granularity


How much data do you transfer in one communication?

For message passing it is explicit in the program

For shared memory this is really under the control of the cache coherence protocol: there is a fixed size for which transactions are defined (normally the block size of the outermost level of the cache hierarchy)

In shared memory you have to be careful

Since the minimum transfer size is a cache line you may end up transferring extra data, e.g., in the grid solver the elements of the left and right neighbors for a square block decomposition (you need only one element, but must transfer the whole cache line): no good solution

Worse: false sharing


If the algorithm is designed so poorly that

Two processors write to two different words within a cache line at the same time

The cache line keeps on moving between the two processors

The processors are not really accessing or updating the same element, but whatever they are updating happens to fall within one cache line: not true sharing, but false sharing

For shared memory programs false sharing can easily degrade performance by a lot

Easy to avoid: just pad up to the end of the cache line before starting the allocation of the data for the next processor (wastes memory, but improves performance)
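
A minimal C sketch of the padding trick is shown below; the 64-byte line size and the per-processor counter structure are assumptions used only for illustration.

    #include <stdio.h>

    #define NPROC     4
    #define LINE_SIZE 64   /* assumed cache line size in bytes */

    /* Without padding, the NPROC counters would share one or two cache
       lines and concurrent updates from different processors would keep
       bouncing those lines around (false sharing). Padding each counter
       to a full line makes the writes land on independent lines. */
    struct padded_counter {
        long count;
        char pad[LINE_SIZE - sizeof(long)];
    };

    static struct padded_counter counters[NPROC];

    int main(void)
    {
        counters[0].count++;   /* updates by different processors now */
        counters[1].count++;   /* touch different cache lines         */
        printf("sizeof(struct padded_counter) = %zu bytes\n",
               sizeof(struct padded_counter));
        return 0;
    }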


Hot-spots


Avoid a location hot-spot by either staggering accesses to the same location or by designing the algorithm to exploit tree-structured communication

Module hot-spot

Normally happens when a particular node saturates handling too many messages (need not be to the same memory location) within a short amount of time

Normal solution again is to design the algorithm in such a way that these messages are staggered over time

Rule of thumb: design the communication pattern such that it is not bursty; want to distribute it uniformly over time
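
As one illustration of trading a location hot-spot for tree-structured communication, the short C sketch below combines per-processor partial sums pairwise in log2(P) rounds instead of having every processor add into one shared accumulator; the value of P and the partial sums are assumed for the example.

    #include <stdio.h>

    #define P 8   /* assumed number of processors */

    int main(void)
    {
        double partial[P];
        for (int p = 0; p < P; p++)
            partial[p] = (double)(p + 1);        /* stand-in for local results */

        /* Reduction tree: in each round, pairs of partial sums are merged,
           so no single memory location receives updates from all P
           processors at once. */
        for (int stride = 1; stride < P; stride *= 2)
            for (int p = 0; p + stride < P; p += 2 * stride)
                partial[p] += partial[p + stride];

        printf("sum = %.1f\n", partial[0]);      /* 36.0 for P = 8 */
        return 0;
    }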

Overlap


Increase overlap between communication and computation

Not much to do at the algorithm level unless the programming model and/or OS provide some primitives to carry out prefetching, block data transfer, non-blocking receive, etc.

Normally, these techniques increase bandwidth demand because you end up communicating the same amount of data, but in a shorter amount of time (execution time hopefully goes down if you can exploit overlap)
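
For message passing, one such primitive is a non-blocking receive; the MPI sketch below (with assumed buffer size, ranks, and tag) posts the receive early, does independent local work, and only waits when the remote data is actually needed.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        int rank;
        double buf[N], local[N];
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < N; i++) local[i] = rank + i;

        if (rank == 0) {
            MPI_Irecv(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            double s = 0.0;
            for (int i = 0; i < N; i++) s += local[i];   /* overlapped work */
            MPI_Wait(&req, MPI_STATUS_IGNORE);           /* data needed now */
            printf("local sum %.1f, first remote element %.1f\n", s, buf[0]);
        } else if (rank == 1) {
            MPI_Send(local, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }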

Summary


Parallel programs introduce three overhead terms: busy overhead (extra work), remote data access time, and synchronization time

Goal of a good parallel program is to minimize these three terms

Goal of a good parallel computer architecture is to provide sufficient support to let programmers optimize these three terms (and this is the focus of the rest of the course)


Four organizations


Shared cache

The switch is a simple controller for granting access to cache banks

Interconnect is between the processors and the shared cache

Which level of the cache hierarchy is shared depends on the design: chip multiprocessors today normally share the outermost level (L2 or L3 cache)

The cache and memory are interleaved to improve bandwidth by allowing multiple concurrent accesses

Normally small scale due to heavy bandwidth demand on the switch and shared cache

[Figure: P0 ... Pn connected through a switch to an interleaved shared cache and interleaved memory]

Four organizations


Bus-based SMP

Scalability is limited by the shared bus bandwidth

Interconnect is a shared bus located between the private cache hierarchies and the memory controller

The most popular organization for small to medium-scale servers

Possible to connect 30 or so processors with smart bus design

Bus bandwidth requirement is lower compared to the shared cache approach

Why?

[Figure: P0 ... Pn, each with a private cache, connected by a shared bus to memory]

Four organizations


Dancehall

Better scalability compared to the previous two designs

The difference from bus-based SMP is that the interconnect is a scalable point-to-point network (e.g., crossbar or other topology)

Memory is still symmetric from all processors

Drawback: a cache miss may take a long time since all memory banks are too far off from the processors (may be several network hops)

[Figure: P0 ... Pn, each with a private cache, connected through an interconnect to memory banks on the far side]

Four organizations


Distributed shared memory

The most popular scalable organization

Each node now has local memory banks

Shared memory on other nodes must be accessed over the network: remote memory access

Non-uniform memory access (NUMA)

Latency to access local memory is much smaller compared to remote memory

Caching is very important to reduce remote memory access

[Figure: P0 ... Pn, each with a private cache and local memory, joined by the interconnect]

Four organizations


In all four organizations caches play an important role in reducing latency and bandwidth requirement

If an access is satisfied in cache, the transaction will not appear on the interconnect and hence the bandwidth requirement of the interconnect will be less (a shared L1 cache does not have this advantage)

In distributed shared memory (DSM) cache and local memory should be used cleverly

Bus-based SMP and DSM are the two designs supported today by industry vendors

In bus-based SMP every cache miss is launched on the shared bus so that all processors can see all transactions

In DSM this is not the case

Hierarchical design


Possible to combine bus-based SMP and DSM to build hierarchical shared memory

Sun Wildfire connects four large SMPs (28 processors each) over a scalable interconnect to form a 112-processor multiprocessor

IBM POWER4 has two processors on-chip with private L1 caches, but shared L2 and L3 caches (this is called a chip multiprocessor); connect these chips over a network to form scalable multiprocessors

Next few lectures will focus on bus-based SMPs only

Cache Coherence


    Intuitive memory model

For sequential programs we expect a memory location to return the latest value written to that location

For concurrent programs running as multiple threads or processes on a single processor we expect the same model to hold because all threads see the same cache hierarchy (same as a shared L1 cache)

For multiprocessors there remains a danger of using a stale value: in SMP or DSM the caches are not shared and processors are allowed to replicate data independently in each cache; hardware must ensure that cached values are coherent across the system and that they satisfy the programmer's intuitive memory model

Example


Assume a write-through cache, i.e., every store updates the value in the cache as well as in memory

P0: reads x from memory, puts it in its cache, and gets the value 5

P1: reads x from memory, puts it in its cache, and gets the value 5

P1: writes x=7, updates its cached value and the memory value

P0: reads x from its cache and gets the value 5

P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is completely incoherent)

P2: writes x=10, updates its cached value and the memory value


What went wrong?


For a write-through cache

The memory value may be correct if the writes are correctly ordered

But the system allowed a store to proceed when there is already a cached copy

Lesson learned: must invalidate all cached copies before allowing a store to proceed

Writeback cache

Problem is even more complicated: stores are no longer visible to memory immediately

Writeback order is important

Lesson learned: do not allow more than one copy of a cache line in M state


Definitions


Memory operation: a read (load), a write (store), or a read-modify-write

Assumed to take place atomically

A memory operation is said to issue when it leaves the issue queue and looks up the cache

A memory operation is said to perform with respect to a processor when that processor can tell so from its other issued memory operations

A read is said to perform with respect to a processor when subsequent writes issued by that processor cannot affect the returned read value

A write is said to perform with respect to a processor when a subsequent read from that processor to the same address returns the new value

Ordering memory operations


A memory operation is said to complete when it has performed with respect to all processors in the system

Assume that there is a single shared memory and no caches

Memory operations complete in shared memory when they access the corresponding memory locations

Operations from the same processor complete in program order: this imposes a partial order among the memory operations

Operations from different processors are interleaved in such a way that the program order is maintained for each processor: memory imposes some total order (many are possible)


Cache coherence

Formal definition


A memory system is coherent if the values returned by reads to a memory location during an execution of a program are such that all operations to that location can form a hypothetical total order that is consistent with the serial order and has the following two properties:

1. Operations issued by any particular processor perform according to the issue order

2. The value returned by a read is the value written to that location by the last write in the total order

Two necessary features that follow from the above:

A. Write propagation: writes must eventually become visible to all processors

B. Write serialization: every processor should see the writes to a location in the same order (if I see w1 before w2, you should not see w2 before w1)

Bus-based SMP


Extend the philosophy of uniprocessor bus transactions

Three phases: arbitrate for the bus, launch command (often called request) and address, transfer data

Every device connected to the bus can observe the transaction

Appropriate device responds to the request

In SMP, processors also observe the transactions and may take appropriate actions to guarantee coherence

The other device on the bus that will be of interest to us is the memory controller (north bridge in standard motherboards)

Depending on the bus transaction, a cache block executes a finite state machine implementing the coherence protocol

Snoopy protocols


Cache coherence protocols implemented in bus-based machines are called snoopy protocols

The processors snoop or monitor the bus and take appropriate protocol actions based on snoop results

The cache controller now receives requests both from the processor and the bus

Since cache state is maintained on a per-line basis, that also dictates the coherence granularity

Cannot normally take a coherence action on parts of a cache line

The coherence protocol is implemented as a finite state machine on a per cache line basis

The snoop logic in each processor grabs the address from the bus and decides if any action should be taken on the cache line containing that address (only if the line is in cache)


State transition


The finite state machine for each cache line:

On a write miss no line is allocated; the state remains at I: called write-through write-no-allocate

A/B means: A is generated by the processor, B is the resulting bus transaction (if any)

Changes for write-through write-allocate?

[State diagram with two states, I and V:
 I -> V on PrRd/BusRd
 I -> I on PrWr/BusWr
 V -> V on PrRd/- and PrWr/BusWr
 V -> I on BusWr observed by the snoop logic]


Write through is bad


High bandwidth requirement

Every write appears on the bus

Assume a 3 GHz processor running an application with 10% store instructions; assume a CPI of 1

If the application runs for 100 cycles it generates 10 stores; assume each store is 4 bytes; 40 bytes are generated per 100/3 ns, i.e., a bandwidth of 1.2 GB/s

A 1 GB/s bus cannot even support one processor

There are multiple processors and also there are read misses

Writeback caches absorb most of the write traffic

Writes that hit in the cache do not go on the bus (not visible to others)

Complicated coherence protocol with many choices
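
The back-of-the-envelope store bandwidth above can be checked with a few lines of C; the inputs (3 GHz, CPI of 1, 10% stores, 4-byte stores) are the slide's assumptions.

    #include <stdio.h>

    int main(void)
    {
        double insns_per_sec = 3e9;   /* 3 GHz at a CPI of 1 */
        double store_frac    = 0.10;  /* 10% of instructions are stores */
        double bytes_per_st  = 4.0;   /* each store writes 4 bytes */

        double bw = insns_per_sec * store_frac * bytes_per_st;
        printf("write-through store traffic = %.1f GB/s\n", bw / 1e9);  /* 1.2 */
        return 0;
    }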

Memory consistency


Need a more formal description of memory ordering

How to establish the order between reads and writes from different processors?

The clearest way is to use synchronization

P0: A=1; flag=1

P1: while (!flag); print A;

Another example (assume A=0, B=0 initially)

P0: A=1; print B;

P1: B=1; print A;

What do you expect?

The memory consistency model is a contract between the programmer and the hardware regarding memory ordering
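
The first pattern above can be written as a runnable C sketch; this is only an illustration, and it uses C11 sequentially consistent atomics for flag so that the compiler and hardware preserve the intended order (with plain variables the printed value would depend on the machine's memory consistency model, which is exactly the issue raised here).

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int A = 0;
    atomic_int flag = 0;

    void *p0(void *arg)              /* P0: A = 1; flag = 1; */
    {
        A = 1;
        atomic_store(&flag, 1);      /* sequentially consistent store */
        return NULL;
    }

    void *p1(void *arg)              /* P1: while (!flag); print A; */
    {
        while (!atomic_load(&flag))
            ;                        /* spin until P0 sets the flag */
        printf("A = %d\n", A);       /* prints 1 */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, p0, NULL);
        pthread_create(&t1, NULL, p1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }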

Consistency model


A multiprocessor normally advertises the supported memory consistency model

This essentially tells the programmer what the possible correct outcomes of a program could be when run on that machine

Cache coherence deals with memory operations to the same location, but not different locations

Without a formally defined order across all memory operations it often becomes impossible to argue about what is correct and what is wrong in shared memory

Various memory consistency models

Sequential consistency (SC) is the most intuitive one and we will focus on it now (more consistency models later)


OOO and SC


Consider a simple example (all are zero initially)

P0: x=w+1; r=y+1;

P1: y=2; w=y+1;

Suppose the load that reads w takes a miss and so w is not ready for a long time; therefore, x=w+1 cannot complete immediately; eventually w returns with value 3

Inside the microprocessor r=y+1 completes (but does not commit) before x=w+1 and gets the old value of y (possibly from cache); eventually instructions commit in order with x=4, r=1, y=2, w=3

So we have the following partial orders

P0: x=w+1 < r=y+1 and P1: y=2 < w=y+1

Cross-thread: w=y+1 < x=w+1 and r=y+1 < y=2

Combine these to get a contradictory total order

What went wrong? We will discuss it in detail later

SC example


Consider the following example

P0: A=1; print B;

P1: B=1; print A;

Possible outcomes for an SC machine

(A, B) = (0,1); interleaving: B=1; print A; A=1; print B

(A, B) = (1,0); interleaving: A=1; print B; B=1; print A

(A, B) = (1,1); interleaving: A=1; B=1; print A; print B or A=1; B=1; print B; print A

(A, B) = (0,0) is impossible: the read of A must occur before the write of A and the read of B must occur before the write of B, i.e., print A < A=1 and print B < B=1; but A=1 < print B and B=1 < print A; thus print B < B=1 < print A < A=1 < print B, which implies print B < print B, a contradiction

Implementing SC


Two basic requirements

Memory operations issued by a processor must become visible to others in program order

Need to make sure that all processors see the same total order of memory operations: in the previous example, for the (0,1) case both P0 and P1 should see the same interleaving: B=1; print A; A=1; print B

The tricky part is to make sure that writes become visible in the same order to all processors

Write atomicity: as if each write is an atomic operation

Otherwise, two processors may end up using different values (which may still be correct from the viewpoint of cache coherence, but will violate SC)

Write atomicity


Example (A=0, B=0 initially)

P0: A=1;

P1: while (!A); B=1;

P2: while (!B); print A;

A correct execution on an SC machine should print A=1

A=0 will be printed only if the write to A is not visible to P2, but clearly it is visible to P1 since it came out of the loop

Thus A=0 is possible if P1 sees the order A=1 < B=1 and P2 sees the order B=1 < A=1, i.e., from the viewpoint of the whole system the write A=1 was not atomic

Without write atomicity P2 may proceed to print 0 with a stale value from its cache

Summary of SC


Program order from each processor creates a partial order among memory operations

Interleaving of these partial orders defines a total order

Sequential consistency: one of many total orders

A multiprocessor is said to be SC if any execution on this machine is SC compliant

Sufficient but not necessary conditions for SC

Issue memory operations in program order

Every processor waits for a write to complete before issuing the next operation

Every processor waits for a read to complete and for the write that affects the returned value to complete before issuing the next operation (important for write atomicity)

Back to shared bus


Centralized shared bus makes it easy to support SC

Writes and reads are all serialized in a total order through the bus transaction ordering

If a read gets the value of a previous write, that write is guaranteed to be complete because that bus transaction is complete

The write order seen by all processors is the same in a write-through system because every write causes a transaction and hence is visible to all in the same order

In a nutshell, every processor sees the same total bus order for all memory operations and therefore any bus-based SMP with write-through caches is SC

What about a multiprocessor with writeback caches?

No SMP uses a write-through protocol due to the high bandwidth demand


Stores


Look at stores a little more closely

There are three situations at the time a store issues: the line is not in the cache, the line is in the cache in S state, or the line is in the cache in one of the M, E, and O states

If the line is in I state, the store generates a read-exclusive request on the bus and gets the line in M state

If the line is in S or O state, that means the processor only has read permission for that line; the store generates an upgrade request on the bus and the upgrade acknowledgment gives it the write permission (this is a data-less transaction)

If the line is in M or E state, no bus transaction is generated; the cache already has write permission for the line (this is the case of a write hit; the previous two are write misses)

Invalidation vs. update


Two main classes of protocols: invalidation-based and update-based

Dictates what action should be taken on a write

Invalidation-based protocols invalidate sharers when a write miss (upgrade or readX) appears on the bus

Update-based protocols update the sharer caches with the new value on a write: requires write transactions (carrying just the modified bytes) on the bus even on write hits (not very attractive with writeback caches)

Advantage of update-based protocols: sharers continue to hit in the cache, while in invalidation-based protocols sharers will miss the next time they try to access the line

Advantage of invalidation-based protocols: only write misses go on the bus (suited for writeback caches) and subsequent stores to the same line are cache hits

Which one is better?


Difficult to answer

Depends on program behavior and hardware cost

When is an update-based protocol good?

What sharing pattern? (large-scale producer/consumer)

Otherwise it would just waste bus bandwidth doing useless updates

When is an invalidation-based protocol good?

Sequence of multiple writes to a cache line

Saves intermediate write transactions

Also think about the overhead of initiating small updates for every write in update protocols

Invalidation-based protocols are much more popular

Some systems support both or maybe some hybrid based on the dynamic sharing pattern of a cache line

MSI protocol


Forms the foundation of invalidation-based writeback protocols

Assumes only three supported cache line states: I, S, and M

There may be multiple processors caching a line in S state

There must be exactly one processor caching a line in M state and it is the owner of the line

If none of the caches have the line, memory must have the most up-to-date copy of the line

Processor requests to cache: PrRd, PrWr

Bus transactions: BusRd, BusRdX, BusUpgr, BusWB

State transition


[State diagram with three states I, S, M:
 I -> S on PrRd/BusRd
 I -> M on PrWr/BusRdX
 S -> S on PrRd/- and BusRd/-
 S -> I on {BusRdX, BusUpgr}/- and CacheEvict/-
 S -> M on PrWr/BusUpgr
 M -> M on PrRd/- and PrWr/-
 M -> S on BusRd/Flush
 M -> I on BusRdX/Flush and CacheEvict/BusWB]
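
A compact C sketch of this per-line state machine is given below; the enum names and the two handler functions are illustrative stand-ins rather than the slide's notation, and data movement (flush, writeback) is reduced to comments.

    #include <stdio.h>

    typedef enum { I, S, M } line_state_t;
    typedef enum { NONE, BUS_RD, BUS_RDX, BUS_UPGR } bus_op_t;

    /* Local processor access: update the line state and return the bus
       transaction that must be issued (NONE on a hit that needs no bus). */
    static bus_op_t on_processor(line_state_t *st, int is_write)
    {
        if (*st == I) { *st = is_write ? M : S; return is_write ? BUS_RDX : BUS_RD; }
        if (*st == S && is_write) { *st = M; return BUS_UPGR; }
        return NONE;                       /* read hit in S, or any hit in M */
    }

    /* Transaction from another processor observed by the snoop logic.
       A line in M state would also flush its data on BusRd/BusRdX. */
    static void on_snoop(line_state_t *st, bus_op_t op)
    {
        if (*st == M && op == BUS_RD)
            *st = S;                       /* flush, keep a shared copy */
        else if (*st == M && op == BUS_RDX)
            *st = I;                       /* flush, then invalidate    */
        else if (*st == S && (op == BUS_RDX || op == BUS_UPGR))
            *st = I;                       /* invalidate shared copy    */
    }

    int main(void)
    {
        line_state_t p0 = I, p1 = I;
        bus_op_t op;

        op = on_processor(&p0, 0); on_snoop(&p1, op);   /* P0 reads x  */
        op = on_processor(&p1, 0); on_snoop(&p0, op);   /* P1 reads x  */
        op = on_processor(&p1, 1); on_snoop(&p0, op);   /* P1 writes x */

        printf("P0 state %d (0=I), P1 state %d (2=M)\n", p0, p1);
        return 0;
    }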

MSI protocol


Few things to note

Flush operation essentially launches the line on the bus

The processor with the cache line in M state is responsible for flushing the line on the bus whenever there is a BusRd or BusRdX transaction generated by some other processor

On BusRd the line transitions from M to S, but not M to I. Why? Also at this point both the requester and memory pick up the line from the bus; the requester puts the line in its cache in S state while memory writes the line back. Why does memory need to write back?

On BusRdX the line transitions from M to I and this time memory does not need to pick up the line from the bus. Only the requester picks up the line and puts it in M state in its cache. Why?


MSI example


Take the following example: P0 reads x, P1 reads x, P1 writes x, P0 reads x, P2 reads x, P3 writes x

Assume the state of the cache line containing the address of x is I in all processors

P0 generates BusRd, memory provides the line, P0 puts the line in S state

P1 generates BusRd, memory provides the line, P1 puts the line in S state

P1 generates BusUpgr, P0 snoops and invalidates the line, memory does not respond, P1 sets the state of the line to M

P0 generates BusRd, P1 flushes the line and goes to S state, P0 puts the line in S state, memory writes the line back

P2 generates BusRd, memory provides the line, P2 puts the line in S state

P3 generates BusRdX, P0, P1, P2 snoop and invalidate, memory provides the line, P3 puts the line in its cache in M state

MESI protocol


The most popular invalidation-based protocol, e.g., appears in Intel Xeon MP

Why need the E state?

The MSI protocol requires two transactions to go from I to M even if there are no intervening requests for the line: BusRd followed by BusUpgr

We can save one transaction by having the memory controller respond to the first BusRd with E state if there is no other sharer in the system

How to know if there is no other sharer? Needs a dedicated control wire that gets asserted by a sharer (wired OR)

Processor can write to a line in E state silently and take it to M state

State transition


[State diagram with four states I, E, S, M:
 I -> S on PrRd/BusRd(S)
 I -> E on PrRd/BusRd(!S)
 I -> M on PrWr/BusRdX
 E -> E on PrRd/-
 E -> M on PrWr/-
 E -> S on BusRd/Flush
 E -> I on BusRdX/Flush and CacheEvict/-
 S -> S on PrRd/- and BusRd/Flush
 S -> M on PrWr/BusUpgr
 S -> I on {BusRdX, BusUpgr}/Flush and CacheEvict/-
 M -> M on PrRd/- and PrWr/-
 M -> S on BusRd/Flush
 M -> I on BusRdX/Flush and CacheEvict/BusWB]

MESI protocol


If a cache line is in M state, definitely the processor with the line is responsible for flushing it on the next BusRd or BusRdX transaction

If a line is not in M state, who is responsible?

Memory or other caches in S or E state?

The original Illinois MESI protocol assumed cache-to-cache transfer, i.e., any processor in E or S state is responsible for the flush