MCU architecture
Program Optimization for Multi-core: Hardware side of it
Contents
Virtual Memory and Caches (Recap)
Fundamentals of Parallel Computers: ILP vs. TLP
Parallel Programming: Shared Memory and Message Passing
Performance Issues in Shared Memory
Shared Memory Multiprocessors: Consistency and Coherence
Synchronization
Memory consistency models
Case Studies of CMP
RECAP: VIRTUAL MEMORY AND CACHE
Why virtual memory?
With a 32-bit address you can access 4 GB of physical memory (you will never get the full memory though)
Seems enough for most day-to-day applications
But there are important applications that have a much bigger memory footprint: databases, scientific apps operating on large matrices etc.
Even if your application fits entirely in physical memory, it seems unfair to load the full image at startup
Just takes away memory from other processes, but probably doesn't need the full image at any point of time during execution: hurts multiprogramming
Need to provide an illusion of bigger memory: Virtual Memory (VM)
Virtual memory
Need an address to access virtual memory
Virtual Address (VA)
Assume a 32-bit VA
Every process sees 4 GB of virtual memory
This is much better than a 4 GB physical memory shared between multiprogrammed processes
The size of the VA is really fixed by the processor data path width
64-bit processors (Alpha 21264, 21364; Sun UltraSPARC; AMD Athlon64, Opteron; IBM POWER4, POWER5; MIPS R10000 onwards; Intel Itanium etc., and recently Intel Pentium 4) provide bigger virtual memory to each process
Large virtual and physical memory is very important in the commercial server market: need to run large databases
Addressing VM
There are primarily three ways to address VM
Paging, Segmentation, Segmented paging
We will focus on flat paging only
Paged VM
The entire VM is divided into small units called pages
Virtual pages are loaded into physical page frames as and when needed (demand paging)
Thus the physical memory is also divided into equal-sized page frames
The processor generates virtual addresses
But memory is physically addressed: need a VA to PA translation
VA to PA translation
The VA generated by the processor is divided into two parts: Page offset and Virtual page number (VPN)
Assume a 4 KB page: within a 32-bit VA, the lower 12 bits will be the page offset (offset within a page) and the remaining 20 bits are the VPN (hence 1 M virtual pages total)
The page offset remains unchanged in the translation
Need to translate the VPN to a physical page frame number (PPFN)
This translation is held in a page table resident in memory: so first we need to access this page table
How to get the address of the page table?
VA to PA translation
Accessing the page table
The page table base register (PTBR) contains the starting physical address of the page table
PTBR is normally accessible in kernel mode only
Assume each entry in the page table is 32 bits (4 bytes)
Thus the required page table entry address is PTBR + (VPN * 4)
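To make the translation arithmetic concrete, here is a small C sketch (not from the original slides; the sample addresses and macro names are illustrative) that splits a 32-bit VA and forms the PTE address, assuming the 4 KB pages and 4-byte entries described above:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                 /* 4 KB page -> 12 offset bits */
#define PAGE_MASK  0xFFFu
#define PTE_SIZE   4u                 /* each page table entry is 4 bytes */

int main(void) {
    uint32_t va   = 0x12345678;       /* illustrative virtual address */
    uint32_t ptbr = 0x00400000;       /* illustrative page table base */
    uint32_t vpn    = va >> PAGE_SHIFT;       /* upper 20 bits */
    uint32_t offset = va & PAGE_MASK;         /* lower 12 bits */
    uint32_t pte_pa = ptbr + vpn * PTE_SIZE;  /* PTBR + (VPN * 4) */
    printf("VPN=0x%05x offset=0x%03x PTE at 0x%08x\n", vpn, offset, pte_pa);
    return 0;
}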
Page fault
The valid bit within the 32 bits tells you if the translation is valid
If this bit is reset, that means the page is not resident in memory: results in a page fault
In case of a page fault the kernel needs to bring in the page to memory from disk
The disk address is normally provided by the page table entry (different interpretation of the 31 bits)
Also, the kernel needs to allocate a new physical page frame for this virtual page
If all frames are occupied it invokes a page replacement policy
TLB
Why can't we cache the most recently used translations? Translation Look-aside Buffers (TLB)
Small set of registers (normally fully associative)
Each entry has two parts: the tag, which is simply the VPN, and the corresponding PTE
The tag may also contain a process id
On a TLB hit you just get the translation in one cycle (may take slightly longer depending on the design)
On a TLB miss you may need to access memory to load the PTE into the TLB (more later)
Normally there are two TLBs: instruction and data
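A minimal sketch of the fully associative lookup described above; the entry layout and size are illustrative assumptions, not a real design. Real hardware performs all the tag comparisons in parallel; software can only loop:

#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    uint32_t vpn;     /* tag: virtual page number */
    uint32_t pte;     /* cached page table entry */
    int      valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns 1 on a TLB hit (with the PTE), 0 on a miss. */
int tlb_lookup(uint32_t vpn, uint32_t *pte) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pte = tlb[i].pte;
            return 1;  /* hit */
        }
    }
    return 0;          /* miss: walk the page table */
}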
Caches
Once you have completed the VA to PA translation you have the physical address. What's next?
You need to access memory with that PA
Instruction and data caches hold most recently used (temporally close) and nearby (spatially close) data
Use the PA to access the cache first
Caches are organized as arrays of cache lines
Each cache line holds several contiguous bytes (32, 64 or 128 bytes)
Addressing a cache
The PA is divided into several parts: TAG | INDEX | BLK. OFFSET
The block offset determines the starting byte address within a cache line
The index tells you which cache line to access
In that cache line you compare the tag to determine hit/miss
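For concreteness, a sketch of this split for a hypothetical direct-mapped cache with 64-byte lines and 512 lines (32 KB total); the parameters are illustrative, not from the slides:

#include <stdint.h>

/* 64-byte line -> 6 offset bits; 512 lines -> 9 index bits; rest is tag */
uint32_t blk_offset(uint32_t pa)  { return pa & 0x3F; }
uint32_t cache_index(uint32_t pa) { return (pa >> 6) & 0x1FF; }
uint32_t cache_tag(uint32_t pa)   { return pa >> 15; }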
Addressing a cache
[Diagram: the PA is split into TAG, INDEX and BLK. OFFSET; the index selects a line whose stored TAG, STATE and DATA are read; the stored tag is compared with the PA tag to signal HIT/MISS, and the block offset plus the access size (how many bytes) select the data returned.]
Set associative cache
The example assumes one cache line per index
Called a direct-mapped cache
A different access to a line evicts the resident cache line
This is either a capacity or a conflict miss
Conflict misses can be reduced by providing multiple lines per index
Access to an index returns a set of cache lines
For an n-way set associative cache there are n lines per set
Carry out multiple tag comparisons in parallel to see if any one in the set hits
2-way set associative
[Diagram: the index selects a set of two lines; the stored TAG0 and TAG1 are compared with the PA tag in parallel, each line carrying its own STATE and DATA.]
Set associative cache
When you need to evict a line in a particular set you run a replacement policy
LRU is a good choice: keeps the most recently used lines (favors temporal locality)
Thus you reduce the number of conflict misses
Two extremes of set size: direct-mapped (1-way) and fully associative (all lines are in a single set)
Example: 32 KB cache, 2-way set associative, line size of 64 bytes: number of indices or number of sets = 32*1024/(2*64) = 256, and hence the index is 8 bits wide
Example: same size and line size, but fully associative: number of sets is 1; within the set there are 32*1024/64 or 512 lines; you need 512 tag comparisons for each access
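A small sketch that reproduces the arithmetic of the two examples above:

#include <stdio.h>

int main(void) {
    unsigned size = 32 * 1024, ways = 2, line = 64;
    unsigned sets = size / (ways * line);            /* 32768 / 128 = 256 sets */
    unsigned index_bits = 0;
    while ((1u << index_bits) < sets) index_bits++;  /* log2(256) = 8 */
    printf("sets=%u, index is %u bits wide\n", sets, index_bits);
    return 0;
}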
Cache hierarchy
Ideally want to hold everything in a fast cache
Never want to go to the memory
But, with increasing size the access time increases
A large cache will slow down every access
So, put increasingly bigger and slower caches between the processor and the memory
Keep the most recently used data in the nearest cache: register file (RF)
Next level of cache: level 1 or L1 (same speed or slightly slower than RF, but much bigger)
Then L2: way bigger than L1 and much slower
Cache hierarchy
Example: Intel Pentium 4 (NetBurst)
128 registers accessible in 2 cycles
L1 data cache: 8 KB, 4-way set associative, 64-byte line size, accessible in 2 cycles for integer loads
L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 7 cycles
Example: Intel Itanium 2 (code name Madison)
128 registers accessible in 1 cycle
L1 instruction and data caches: each 16 KB, 4-way set associative, 64-byte line size, accessible in 1 cycle
Unified L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 5 cycles
Unified L3 cache: 6 MB, 24-way set associative, 128-byte line size, accessible in 14 cycles
States of a cache line
The life of a cache line starts off in the invalid state (I)
An access to that line takes a cache miss and fetches the line from main memory
If it was a read miss the line is filled in the shared state (S) [we will discuss it later; for now just assume that this is equivalent to a valid state]
In case of a store miss the line is filled in the modified state (M); instruction cache lines do not normally enter the M state (no store to Icache)
The eviction of a line in M state must write the line back to the memory (this is called a writeback cache); otherwise the effect of the store would be lost
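The fill and eviction rules above can be summarized in a minimal sketch (just the three states named on this slide, not a full coherence protocol):

typedef enum { I, S, M } line_state;   /* invalid, shared, modified */

line_state fill_on_read_miss(void)  { return S; }  /* read miss fills in S */
line_state fill_on_store_miss(void) { return M; }  /* store miss fills in M */

/* Evicting a line in M state must write it back to memory. */
int needs_writeback(line_state st) { return st == M; }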
The first instruction
Accessing the first instruction
Take the starting PC
Access the iTLB with the VPN extracted from the PC: iTLB miss
Invoke the iTLB miss handler
Calculate the PTE address
If PTEs are cached in the L1 data and L2 caches, look them up with the PTE address: you will miss there also
Access the page table in main memory: the PTE is invalid: page fault
Invoke the page fault handler
Allocate a page frame, read the page from disk, update the PTE, load the PTE into the iTLB, restart fetch
The first instruction
Now you have the physical address
Access the Icache: miss
Send a refill request to higher levels: you miss everywhere
Send the request to the memory controller (north bridge)
Access main memory
Read the cache line
Refill all levels of cache as the cache line returns to the processor
Extract the appropriate instruction from the cache line with the block offset
This is the longest possible latency in an instruction/data access
TLB access
For every cache access (instruction or data) you need to access the TLB first
Puts the TLB in the critical path
Want to start indexing into the cache and read the tags while the TLB lookup takes place
Virtually indexed physically tagged cache
Extract the index from the VA, start reading the tag while looking up the TLB
Once the PA is available do the tag comparison
Overlaps TLB reading and tag reading
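A standard rule of thumb for this trick (an addition beyond the slide): the index must come from VA bits that do not change under translation, i.e., index bits plus block offset bits must fit within the page offset. A sketch:

/* For the VA-derived index to equal the PA-derived index, it must lie
   entirely within the untranslated page-offset bits. */
int vipt_index_safe(unsigned index_bits, unsigned offset_bits,
                    unsigned page_shift) {
    return index_bits + offset_bits <= page_shift;
}
/* With 4 KB pages (12 bits) and 64-byte lines (6 bits) the index can be
   at most 6 bits: 64 sets, e.g., a 16 KB 4-way L1 as in the Itanium 2
   example earlier. */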
Memory op latency
L1 hit: ~1 ns
L2 hit: ~5 ns
L3 hit: ~10-15 ns
Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns
If a load misses in all caches it will eventually come to the head of the ROB and block instruction retirement (in-order retirement is a must)
Gradually, the pipeline backs up, the processor runs out of resources such as ROB entries and physical registers
Ultimately, the fetcher stalls: severely limits ILP
MLP
Need memory-level parallelism (MLP)
Simply speaking, need to mutually overlap several memory operations
Step 1: Non-blocking cache
Allow multiple outstanding cache misses
Mutually overlap multiple cache misses
Supported by all microprocessors today (Alpha 21364 supported 16 outstanding cache misses)
Step 2: Out-of-order load issue
Issue loads out of program order (address is not known at the time of issue)
How do you know the load didn't issue before a store to the same address? Issuing stores must check for this memory-order violation
Out-of-order loads

sw 0(r7), r6
/* other instructions */
lw r2, 80(r20)

Assume that the load issues before the store because r20 gets ready before r6 or r7
The load accesses the store buffer (used for holding already executed store values before they are committed to the cache at retirement)
If it misses in the store buffer it looks up the caches and, say, gets the value somewhere
After several cycles the store issues and it turns out that 0(r7) == 80(r20) or they overlap; now what?
Load/store ordering
Out-of-order load issue relies on speculative memory disambiguation
Assumes that there will be no conflicting store
If the speculation is correct, you have issued the load much earlier and you have allowed the dependents to also execute much earlier
If there is a conflicting store, you have to squash the load and all the dependents that have consumed the load value and re-execute them systematically
Turns out that the speculation is correct most of the time
To further minimize load squashes, microprocessors use simple memory dependence predictors (predicts if a load is going to conflict with a pending store based on that load's or load/store pair's past behavior)
MLP and memory wall
Today microprocessors try to hide cache misses by initiating early prefetches:
Hardware prefetchers try to predict the next several load addresses and initiate cache line prefetches if they are not already in the cache
All processors today also support prefetch instructions; so you can specify in your program when to prefetch what: this gives much better control compared to a hardware prefetcher
Researchers are working on load value prediction
Even after doing all these, memory latency remains the biggest bottleneck
Today microprocessors are trying to overcome one single wall: the memory wall
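As an illustration of software prefetching, a sketch using GCC's __builtin_prefetch builtin; the prefetch distance of 16 elements is an illustrative tuning choice, not from the slides:

/* Prefetch a[i+16] while summing a[i]: by the time the loop reaches
   that element, the line is hopefully already in the cache. Prefetch
   instructions do not fault, so running past the end is harmless. */
double sum_with_prefetch(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low temporal locality */
        sum += a[i];
    }
    return sum;
}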
Fundamentals of Parallel Computers
Agenda
Convergence of parallel architectures
Fundamental design issues
ILP vs. TLP
Communication architecture
Historically, parallel architectures are tied to programming models
Diverse designs made it impossible to write portable parallel software
But the driving force was the same: need for fast processing
Today parallel architecture is seen as an extension of microprocessor architecture with a communication architecture
Defines the basic communication and synchronization operations and provides hw/sw implementation of those
Layered architecture
A parallel architecture can be divided into several layers
Parallel applications
Programming models: shared address, message passing, multiprogramming, data parallel, dataflow etc.
Compiler + libraries
Operating systems support
Communication hardware
Physical communication medium
Communication architecture = user/system interface + hw implementation (roughly defined by the last four layers)
Compiler and OS provide the user interface to communicate between and synchronize threads
Shared address
Communication takes place through a logically shared portion of memory
User interface is normal load/store instructions
Load/store instructions generate virtual addresses
The VAs are translated to PAs by the TLB or page table
The memory controller then decides where to find this PA
Actual communication is hidden from the programmer
The general communication hw consists of multiple processors connected over some medium so that they can talk to memory banks and I/O devices
The architecture of the interconnect may vary depending on projected cost and target performance
Shared address
Communication medium
Interconnect could be a crossbar switch so that any processor can talk to any memory bank in one hop (provides latency and bandwidth advantages)
Scaling a crossbar becomes a problem: cost is proportional to the square of the size
Instead, could use a scalable switch-based network; latency increases and bandwidth decreases because now multiple processors contend for switch ports
[Diagram: "dance hall" organization: processors (P) on one side of an INTERCONNECT, memory banks (MEM) and I/O on the other.]
Shared address
Communication medium
From the mid-80s the shared bus became popular, leading to the design of SMPs
Pentium Pro Quad was the first commodity SMP
Sun Enterprise server provided a highly pipelined wide shared bus for scalability reasons; it also distributed the memory to each processor, but there was no local bus on the boards, i.e., the memory was still symmetric (must use the shared bus)
NUMA or DSM architectures provide a better solution to the scalability problem; the symmetric view is replaced by local and remote memory, and each node (containing processor(s) with caches, memory controller and router) gets connected via a scalable network (mesh, ring etc.); examples include Cray/SGI T3E, SGI Origin 2000, Alpha GS320, Alpha/HP GS1280 etc.
Message passing
Very popular for large-scale computing
The system architecture looks exactly the same as DSM, but there is no shared memory
The user interface is via send/receive calls to the message layer
The message layer is integrated into the I/O system instead of the memory system
Send specifies a local data buffer that needs to be transmitted; send also specifies a tag
A matching receive at the destination node with the same tag reads in the data from a kernel space buffer to user memory
Effectively, provides a memory-to-memory copy
Message passing
Actual implementation of the message layer
Initially it was very topology dependent
A node could talk only to its neighbors through FIFO buffers
These buffers were small in size and therefore while sending a message, send would occasionally block waiting for the receive to start reading the buffer (synchronous message passing)
Soon the FIFO buffers got replaced by DMA (direct memory access) transfers so that a send can initiate a transfer from memory to I/O buffers and finish immediately (DMA happens in the background); the same applies to the receiving end also
The parallel algorithms were designed specifically for certain topologies: a big problem
Message passing
To improve usability of machines, the message layer started providing support for arbitrary source and destination (not just nearest neighbors)
Essentially involved storing a message in intermediate hops and forwarding it to the next node on the route
Later this store-and-forward routing got moved to hardware where a switch could handle all the routing activities
Further improved to do pipelined wormhole routing so that the time taken to traverse the intermediate hops became small compared to the time it takes to push the message from processor to network (limited by node-to-network bandwidth)
Examples include IBM SP2, Intel Paragon
Each node of Paragon had two i860 processors, one of which was dedicated to servicing the network (send/recv. etc.)
Convergence
Shared address and message passing are two distinct programming models, but the architectures look very similar
Both have a communication assist or network interface to initiate messages or transactions
In shared memory this assist is integrated with the memory controller
In message passing this assist normally used to be integrated with the I/O, but the trend is changing
There are message passing machines where the assist sits on the memory bus, or machines where DMA over the network is supported (direct transfer from source memory to destination memory)
Finally, it is possible to emulate send/recv. on shared memory through shared buffers, flags and locks
Possible to emulate a shared virtual memory on message passing machines through modified page fault handlers
A generic architecture
In all the architectures we have discussed thus far, a node essentially contains processor(s) + caches, memory and a communication assist (CA)
CA = network interface (NI) + communication controller
The nodes are connected over a scalable network
The main difference remains in the architecture of the CA
And even under a particular programming model (e.g., shared memory) there are a lot of choices in the design of the CA
Most innovations in parallel architecture take place in the communication assist (also called communication controller or node controller)
A generic architecture
[Diagram: each NODE contains a processor (P) with CACHE, MEM and a CA connected by a crossbar (XBAR); the nodes are connected over a SCALABLE NETWORK.]
Design issues
Need to understand architectural components that affect software
Compiler, library, program
User/system interface and hw/sw interface
How do programming models efficiently talk to the communication architecture?
How to implement efficient primitives in the communication layer?
In a nutshell, what issues of a parallel machine will affect the performance of the parallel applications?
Naming, Operations, Ordering, Replication, Communication cost
Naming
How are the data in a program referenced?
In sequential programs a thread can access any variable in its virtual address space
In shared memory programs a thread can access any private or shared variable (same load/store model as sequential programs)
In message passing programs a thread can access local data directly
Clearly, naming requires some support from hw and OS
Need to make sure that the accessed virtual address gets translated to the correct physical address
Operations
What operations are supported to access data?
For sequential and shared memory models load/store are sufficient
For message passing models send/receive are needed to access remote data
For shared memory, hw (essentially the CA) needs to make sure that a load/store operation gets correctly translated to a message if the address is remote
For message passing, the CA or the message layer needs to copy data from local memory and initiate a send, or copy data from the receive buffer to local memory
Ordering
How are the accesses to the same data ordered?
For the sequential model, it is the program order: true dependence order
For shared memory, within a thread it is the program order; across threads, some valid interleaving of accesses as expected by the programmer and enforced by synchronization operations (locks, point-to-point synchronization through flags, global synchronization through barriers)
Ordering issues are very subtle and important in the shared memory model (some microprocessor re-ordering tricks may easily violate correctness when used in a shared memory context)
For message passing, ordering across threads is implied through point-to-point send/receive pairs (producer-consumer relationship) and mutual exclusion is inherent (no shared variable)
Replication
How is the shared data locally replicated?
This is very important for reducing communication traffic
In microprocessors data is replicated in the cache to reduce memory accesses
In message passing, replication is explicit in the program and happens through receive (a private copy is created)
In shared memory a load brings in the data to the cache hierarchy so that subsequent accesses can be fast; this is totally hidden from the program and therefore the hardware must provide a layer that keeps track of the most recent copies of the data (this layer is central to the performance of shared memory multiprocessors and is called the cache coherence protocol)
Communication cost
Three major components of the communication architecture affect performance
Latency: time to do an operation (e.g., load/store or send/recv.)
Bandwidth: rate of performing an operation
Overhead or occupancy: how long the communication layer is occupied doing an operation
Latency
Already a big problem for microprocessors
Even bigger problem for multiprocessors due to remote operations
Must optimize the application or hardware to hide or lower latency (algorithmic optimizations, or prefetching, or overlapping computation with communication)
Communication cost
Bandwidth
How many ops in unit time, e.g., how many bytes transferred per second
Local BW is provided by heavily banked memory or a faster and wider system bus
Communication BW has two components: 1. node-to-network BW (also called network link BW) measures how fast bytes can be pushed into the router from the CA; 2. within-network bandwidth, affected by the scalability of the network and the architecture of the switch or router
Linear cost model: Transfer time = T0 + n/B, where T0 is the start-up overhead, n is the number of bytes transferred and B is the BW
Not sufficient since overlap of comp. and comm. is not considered; also does not count how the transfer is done (pipelined or not)
Communication cost
Better model:
Communication time for n bytes = Overhead + CA occupancy + Network latency + Size/BW + Contention
T(n) = Ov + Oc + L + n/B + Tc
Overhead and occupancy may be functions of n
Contention depends on the queuing delay at various components along the communication path, e.g., waiting time at the communication assist or controller, waiting time at the router etc.
Overall communication cost = frequency of communication x (communication time - overlap with useful computation)
Frequency of communication depends on various factors such as how the program is written or the granularity of communication supported by the underlying hardware
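To make the model concrete, a small worked sketch; all parameter values are made up for illustration, not taken from the slides:

#include <stdio.h>

/* T(n) = Ov + Oc + L + n/B + Tc, with times in microseconds and
   B in bytes per microsecond. */
double comm_time(double Ov, double Oc, double L,
                 double n, double B, double Tc) {
    return Ov + Oc + L + n / B + Tc;
}

int main(void) {
    /* 1 us overhead, 0.5 us occupancy, 2 us latency, 1024 bytes at
       100 bytes/us (100 MB/s), 0.3 us contention */
    printf("T(1024) = %.2f us\n", comm_time(1.0, 0.5, 2.0, 1024, 100.0, 0.3));
    return 0;
}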
ILP vs. TLP
Microprocessors enhance performance of a sequential program by extracting parallelism from an instruction stream (called instruction-level parallelism)
Multiprocessors enhance performance of an explicitly parallel program by running multiple threads in parallel (called thread-level parallelism)
TLP provides parallelism at a much larger granularity compared to ILP
In multiprocessors ILP and TLP work together
Within a thread ILP provides a performance boost
Across threads TLP provides speedup over a sequential version of the parallel program
Parallel Programming
Agenda
Steps in writing a parallel program
Example
Writing a parallel program
Start from a sequential description
Identify work that can be done in parallel
Partition work and/or data among threads or processes
Decomposition and assignment
Add necessary communication and synchronization
Orchestration
Map threads to processors (Mapping)
How good is the parallel program?
Measure speedup = sequential execution time / parallel execution time = number of processors ideally
Some definitions
Task
Arbitrary piece of sequential work
Concurrency is only across tasks
Fine-grained task vs. coarse-grained task: controls granularity of parallelism (spectrum of grain: one instruction to the whole sequential program)
Process/thread
Logical entity that performs a task
Communication and synchronization happen between threads
Processors
Physical entity on which one or more processes execute
Decomposition
Find concurrent tasks and divide the program into tasks
Level or grain of concurrency needs to be decided here
Too many tasks: may lead to too much overhead communicating and synchronizing between tasks
Too few tasks: may lead to idle processors
Goal: just enough tasks to keep the processors busy
Number of tasks may vary dynamically
New tasks may get created as the computation proceeds: new rays in ray tracing
Number of available tasks at any point in time is an upper bound on the achievable speedup
Static assignment
Given a decomposition it is possible to assign tasks statically
For example, some computation on an array of size N can be decomposed statically by assigning a range of indices to each process: for k processes, P0 operates on indices 0 to (N/k)-1, P1 operates on N/k to (2N/k)-1, ..., Pk-1 operates on (k-1)N/k to N-1
For regular computations this works great: simple and low-overhead
What if the nature of the computation depends on the index?
For certain index ranges you do some heavy-weight computation while for others you do something simple
Is there a problem?
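A minimal sketch of this static range assignment; the remainder handling for the last process is an assumption the slide glosses over:

/* Process pid of k processes covers [lo, hi) of an array of size N. */
void static_range(int pid, int k, int N, int *lo, int *hi) {
    *lo = pid * (N / k);
    *hi = (pid == k - 1) ? N : (pid + 1) * (N / k);
}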
Decomposition types
Decomposition by data
The most commonly found decomposition technique
The data set is partitioned into several subsets and each subset is assigned to a process
The type of computation may or may not be identical on each subset
Very easy to program and manage
Computational decomposition
Not so popular: tricky to program and manage
All processes operate on the same data, but probably carry out different kinds of computation
More common in systolic arrays, pipelined graphics processor units (GPUs) etc.
Orchestration
Involves structuring communication and synchronization among processes, organizing data structures to improve locality, and scheduling tasks
This step normally depends on the programming model and the underlying architecture
Goal is to
Reduce communication and synchronization costs
Maximize locality of data reference
Schedule tasks to maximize concurrency: do not schedule dependent tasks in parallel
Reduce the overhead of parallelization and concurrency management (e.g., management of the task queue, overhead of initiating a task etc.)
Mapping
At this point you have a parallel program
Just need to decide which and how many processes go to each processor of the parallel machine
Could be specified by the program
Pin particular processes to a particular processor for the whole life of the program; the processes cannot migrate to other processors
Could be controlled entirely by the OS
Schedule processes on idle processors
Various scheduling algorithms are possible, e.g., round robin: process #k goes to processor #k
NUMA-aware OS normally takes into account multiprocessor-specific metrics in scheduling
How many processes per processor? Most common is one-to-one
An example
Iterative equation solver
Main kernel in the Ocean simulation
Update each 2-D grid point via Gauss-Seidel iterations
A[i,j] = 0.2 * (A[i,j] + A[i,j+1] + A[i,j-1] + A[i+1,j] + A[i-1,j])
Pad the n by n grid to (n+2) by (n+2) to avoid corner problems
Update only the interior n by n grid
One iteration consists of updating all n^2 points in-place and accumulating the difference from the previous value at each point
If the difference is less than a threshold, the solver is said to have converged to a stable grid equilibrium
Sequential program

int n;
float **A, diff;

begin main()
    read (n); /* size of grid */
    Allocate (A);
    Initialize (A);
    Solve (A);
end main

begin Solve (A)
    int i, j, done = 0;
    float temp;
    while (!done)
        diff = 0.0;
        for i = 0 to n-1
            for j = 0 to n-1
                temp = A[i,j];
                A[i,j] = 0.2 * (A[i,j] + A[i,j+1] + A[i,j-1] + A[i-1,j] + A[i+1,j]);
                diff += fabs (A[i,j] - temp);
            endfor
        endfor
        if (diff/(n*n) < TOL) then done = 1;
    endwhile
end Solve
Decomposition
Look for concurrency in loop iterations
In this case iterations are really dependent
Iteration (i, j) depends on iterations (i, j-1) and (i-1, j)
Each anti-diagonal can be computed in parallel
Must synchronize after each anti-diagonal (or pt-to-pt)
Alternative: red-black ordering (different update pattern)
Decomposition

while (!done)
    diff = 0.0;
    for_all i = 0 to n-1
        for_all j = 0 to n-1
            temp = A[i, j];
            A[i, j] = 0.2 * (A[i, j] + A[i, j+1] + A[i, j-1] + A[i-1, j] + A[i+1, j]);
            diff += fabs (A[i, j] - temp);
        end for_all
    end for_all
    if (diff/(n*n) < TOL) then done = 1;
end while

Offers concurrency across elements: degree of concurrency is n^2
Make the j loop sequential to have row-wise decomposition: degree n concurrency
Assignment
Possible static assignment: block row decomposition
Process 0 gets rows 0 to (n/p)-1, process 1 gets rows n/p to (2n/p)-1, etc.
Another static assignment: cyclic row decomposition
Process 0 gets rows 0, p, 2p, ...; process 1 gets rows 1, p+1, 2p+1, ...
Dynamic assignment
Grab the next available row, work on that, grab a new row, ...
Static block row assignment minimizes nearest-neighbor communication by assigning contiguous rows to the same process
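A hedged sketch of the dynamic "grab the next available row" scheme using a shared counter; the __atomic_fetch_add builtin (GCC/Clang) and all names are assumptions beyond the slide's abstract description:

int next_row = 0;   /* shared counter, one grid row per task */

void worker(int n) {
    for (;;) {
        int row = __atomic_fetch_add(&next_row, 1, __ATOMIC_RELAXED);
        if (row >= n) break;   /* no rows left */
        /* update_row(row): sweep one row of the grid (illustrative) */
    }
}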
Shared memory version
void Solve (void)
{
    int i, j, pid, done = 0;
    float temp, local_diff;
    GET_PID (pid);
    while (!done) {
        local_diff = 0.0;
        if (!pid) gm->diff = 0.0;
        BARRIER (gm->barrier, P); /* why? */
        for (i = pid*(n/P); i < (pid+1)*(n/P); i++) {
            for (j = 0; j < n; j++) {
                temp = gm->A[i][j];
                gm->A[i][j] = 0.2 * (gm->A[i][j] + gm->A[i][j-1] +
                    gm->A[i][j+1] + gm->A[i+1][j] + gm->A[i-1][j]);
                local_diff += fabs (gm->A[i][j] - temp);
            } /* end for */
        } /* end for */
        LOCK (gm->diff_lock);
        gm->diff += local_diff;
        UNLOCK (gm->diff_lock);
        BARRIER (gm->barrier, P);
        if (gm->diff/(n*n) < TOL) done = 1;
        BARRIER (gm->barrier, P); /* why? */
    } /* end while */
}
Mutual exclusion
Use LOCK/UNLOCK around critical sections
Updates to the shared variable diff must be sequential
Heavily contended locks may degrade performance
Try to minimize the use of critical sections: they are sequential anyway and will limit speedup
This is the reason for using a local_diff instead of accessing gm->diff every time
Also, minimize the size of the critical section because the longer you hold the lock, the longer the waiting time for other processors at lock acquire
LOCK optimization
Suppose each processor updates a shared variable holding a global cost value, only if its local cost is less than the global cost: found frequently in minimization problems

/* Naive version: may lead to heavy lock contention
   if everyone tries to update at the same time */
LOCK (gm->cost_lock);
if (my_cost < gm->cost) {
    gm->cost = my_cost;
}
UNLOCK (gm->cost_lock);

/* Optimized version: test before locking, then re-test */
if (my_cost < gm->cost) {
    LOCK (gm->cost_lock);
    if (my_cost < gm->cost) { /* make sure */
        gm->cost = my_cost;
    }
    UNLOCK (gm->cost_lock);
} /* this works because gm->cost is monotonically decreasing */
More synchronization
Global synchronization
Through barriers
Often used to separate computation phases
Point-to-point synchronization
A process directly notifies another about a certain event on which the latter was waiting
Producer-consumer communication pattern
Semaphores are used for concurrent programming on a uniprocessor through P and V functions
Normally implemented through flags on shared memory multiprocessors (busy wait or spin)

P0: A = 1; flag = 1;
P1: while (!flag); use (A);
Message passing
What is different from shared memory?
No shared variable: expose communication through send/receive
No lock or barrier primitive
Must implement synchronization through send/receive
Grid solver example
P0 allocates and initializes matrix A in its local memory
Then it sends the block rows, n, P to each processor, i.e., P1 waits to receive rows n/P to 2n/P-1 etc. (this is one-time)
Within the while loop the first thing that every processor does is to send its first and last rows to the upper and the lower processors (corner cases need to be handled)
Then each processor waits to receive the neighboring two rows from the upper and the lower processors
Message passing
At the end of the loop each processor sends its local_diff to P0 and P0 sends back the accumulated diff so that each processor can locally compute the done flag
Major changes

/* include files */
MAIN_ENV;
int P, n;
void Solve ();
struct gm_t {
    LOCKDEC (diff_lock);
    BARDEC (barrier);
    float **A, diff;
} *gm;

int main (int argc, char **argv)
{
    int i;
    MAIN_INITENV;
    gm = (struct gm_t*) G_MALLOC (sizeof (struct gm_t));
    LOCKINIT (gm->diff_lock);
    BARINIT (gm->barrier);
    n = atoi (argv[1]);
    P = atoi (argv[2]);
    /* [Slide annotation "Local Alloc." marks the following allocation] */
    gm->A = (float**) G_MALLOC ((n+2)*sizeof (float*));
    for (i = 0; i < n+2; i++) {
        gm->A[i] = (float*) G_MALLOC ((n+2)*sizeof (float));
    }
    Initialize (gm->A);
    for (i = 1; i < P; i++) { /* starts at 1 */
        CREATE (Solve);
    }
    Solve ();
    WAIT_FOR_END (P-1);
    MAIN_END;
}
Major changes
The slide repeats the shared memory Solve function shown earlier and marks where the message passing version changes:
if (pid) Recv rows, n, P (one-time, at entry)
Send up/down, Recv up/down (boundary rows, at the start of each iteration)
Send local_diff to P0 (after the local sweep)
Recv diff (to test convergence locally)
Message passing
This algorithm is deterministic
May converge to a different solution compared to the shared memory version if there are multiple solutions: why?
There is a fixed specific point in the program (at the beginning of each iteration) when the neighboring rows are communicated
This is not true for shared memory
Message Passing Grid Solver
MPI-like environment
MPI stands for Message Passing Interface
A C library that provides a set of message passing primitives (e.g., send, receive, broadcast etc.) to the user
PVM (Parallel Virtual Machine) is another well-known platform for message passing programming
Background in MPI is not necessary for understanding this lecture
Only need to know
When you start an MPI program every thread runs the same main function
We will assume that we pin one thread to one processor, just as we did in shared memory
Instead of using the exact MPI syntax we will use some macros that call the MPI functions
MAIN_ENV;
/* define message tags */
#define ROW 99
#define DIFF 98
#define DONE 97

int main (int argc, char **argv)
{
    int pid, P, done, i, j, N;
    float tempdiff, local_diff, temp, **A;
    MAIN_INITENV;
    GET_PID (pid);
    GET_NUMPROCS (P);
    N = atoi (argv[1]);
    tempdiff = 0.0;
    done = 0;
    A = (float **) malloc ((N/P+2) * sizeof (float *));
    for (i = 0; i < N/P+2; i++) {
        A[i] = (float *) malloc (sizeof (float) * (N+2));
    }
    initialize (A);
    while (!done) {
        local_diff = 0.0;
        /* MPI_CHAR means raw byte format */
        if (pid) { /* send my first row up */
            SEND (&A[1][1], N*sizeof (float), MPI_CHAR, pid-1, ROW);
        }
        if (pid != P-1) { /* recv last row */
            RECV (&A[N/P+1][1], N*sizeof (float), MPI_CHAR, pid+1, ROW);
        }
        if (pid != P-1) { /* send last row down */
            SEND (&A[N/P][1], N*sizeof (float), MPI_CHAR, pid+1, ROW);
        }
        if (pid) { /* recv first row from above */
            RECV (&A[0][1], N*sizeof (float), MPI_CHAR, pid-1, ROW);
        }
        for (i=1; i
Performance Issues

Agenda
Partitioning for performance
Data access and communication
Summary
Goal is to understand simple trade-offs involved in writing a parallel program keeping an eye on parallel performance
Getting good performance out of a multiprocessor is difficult
Programmers need to be careful
A little carelessness may lead to extremely poor performance
Partitioning for perf.
Partitioning plays an important role in the parallel performance
This is where you essentially determine the tasks
A good partitioning should practice
Load balance
Minimal communication
Low overhead to determine and manage task assignment (sometimes called extra work)
A well-balanced parallel program automatically has low barrier or point-to-point synchronization time
Ideally I want all the threads to arrive at a barrier at the same time
Load balancing
Achievable speedup is bounded above by
Sequential exec. time / Max. time for any processor
Thus speedup is maximized when the maximum time and minimum time across all processors are close (want to minimize the variance of parallel execution time)
This directly gets translated to load balancing
What leads to a high variance?
Ultimately all processors finish at the same time, but some do useful work all over this period while others may spend a significant time at synchronization points
This may arise from a bad partitioning
There may be other architectural reasons for load imbalance beyond the scope of a programmer, e.g., network congestion, unforeseen cache conflicts etc. (slows down a few threads)
Dynamic task queues
Introduced in the last lecture
Normally implemented as part of the parallel program
Two possible designs
Centralized task queue: a single queue of tasks; may lead to heavy contention because insertion and deletion to/from the queue must be critical sections
Distributed task queues: one queue per processor
Issue with distributed task queues
When the queue of a particular processor is empty what does it do? Task stealing
Task stealing
A processor may choose to steal tasks from another processor's queue if the former's queue is empty
How many tasks to steal? Whom to steal from?
The biggest question: how to detect termination? Really a distributed consensus!
Task stealing, in general, may increase overhead and communication, but a smart design may lead to excellent load balance (normally hard to design efficiently)
This is a form of a more general technique called Receiver Initiated Diffusion (RID) where the receiver of the task initiates the task transfer
In Sender Initiated Diffusion (SID) a processor may choose to insert into another processor's queue if the former's task queue is full above a threshold
Architect's job
Normally load balancing is a responsibility of the programmer
However, an architecture may provide efficient primitives to implement task queues and task stealing
For example, the task queue may be allocated in a special shared memory segment, accesses to which may be optimized by special hardware in the memory controller
But this may expose some of the architectural features to the programmer
There are multiprocessors that provide efficient implementations for certain synchronization primitives; this may improve load balance
Sophisticated hardware tricks are possible: dynamic load monitoring and favoring slow threads dynamically
Partitioning and communication
Need to reduce inherent communication
This is the part of communication determined by the assignment of tasks
There may be other communication traffic also (more later)
Goal is to assign tasks such that accessed data are mostly local to a process
Ideally I do not want any communication
But in life sometimes you need to talk to people to get some work done!
Domain decomposition
Normally applications show a local bias on data usage
Communication is short-range, e.g., nearest neighbor
Even if it is long-range it falls off with distance
View the dataset of an application as the domain of the problem, e.g., the 2-D grid in the equation solver
If you consider a point in this domain, in most of the applications it turns out that this point depends on points that are close by
Partitioning can exploit this property by assigning contiguous pieces of data to each process
Exact shape of the decomposed domain depends on the application and load balancing requirements
Comm-to-comp ratio
Surely, there could be many different domain decompositions for a particular problem
For the grid solver we may have a square block decomposition, block row decomposition or cyclic row decomposition
How to determine which one is good? Communication-to-computation ratio
Assume P processors and an N x N grid for the grid solver
[Diagram: square block decomposition for P = 16, the grid tiled P0 P1 P2 P3 / P4 P5 P6 P7 / ... / P15]
Size of each block: N/√P by N/√P
Communication (perimeter): 4N/√P
Computation (area): N^2/P
Comm-to-comp ratio = 4√P/N
Comm-to-comp ratio
For block row decomposition
Each strip has N/P rows
Communication (boundary rows): 2N
Computation (area): N^2/P (same as square block)
Comm-to-comp ratio: 2P/N
For cyclic row decomposition
Each processor gets N/P isolated rows
Communication: 2N^2/P
Computation: N^2/P
Comm-to-comp ratio: 2
Normally N is much larger than P
Asymptotically, square block yields the lowest comm-to-comp ratio
Comm-to-comp ratio
The idea is to measure the volume of inherent communication per computation
In most cases it is beneficial to pick the decomposition with the lowest comm-to-comp ratio
But it depends on the application structure, i.e., picking the lowest comm-to-comp ratio may have other problems
Normally this ratio gives you a rough estimate of the average communication bandwidth requirement of the application, i.e., how frequent communication is
But it does not tell you the nature of the communication, i.e., bursty or uniform
For the grid solver, comm. happens only at the start of each iteration; it is not uniformly distributed over computation
Thus the worst case BW requirement may exceed the average comm-to-comp ratio
Extra work
Extra work in a parallel version of a sequential program may result from
Decomposition
Assignment techniques
Management of the task pool etc.
Speedup is bounded above by
Sequential work / Max (Useful work + Synchronization + Comm. cost + Extra work)
where the Max is taken over all processors
But this is still incomplete
We have only considered communication cost from the viewpoint of the algorithm and ignored the architecture completely
Data access and communication
The memory hierarchy (caches and main memory) plays a significant role in determining communication cost
May easily dominate the inherent communication of the algorithm
For a uniprocessor, the execution time of a program is given by useful work time + data access time
Useful work time is normally called the busy time or busy cycles
Data access time can be reduced either by architectural techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality
Data access
In multiprocessors
Every processor wants to see the memory interface as its own local cache and the main memory
In reality it is much more complicated
If the system has a centralized memory (e.g., SMPs), there are still caches of other processors; if the memory is distributed then some part of it is local and some is remote
For shared memory, data movement from local or remote memory to cache is transparent, while for message passing it is explicit
View a multiprocessor as an extended memory hierarchy where the extension includes caches of other processors, remote memory modules and the network topology
Artifactual comm.
Communication caused by artifacts of the extended memory hierarchy
Data accesses not satisfied in the cache or local memory cause communication
Inherent communication is caused by data transfers determined by the program
Artifactual communication is caused by poor allocation of data across distributed memories, unnecessary data in a transfer, unnecessary transfers due to system-dependent transfer granularity, redundant communication of data, finite replication capacity (in cache or memory)
Inherent communication assumes infinite capacity and perfect knowledge of what should be transferred
Spatial locality
Consider a square block decomposition of the grid solver and a C-like row major layout, i.e., A[i][j] and A[i][j+1] have contiguous memory locations
[Diagram: memory allocation in which a page and a cache line straddle a partition boundary.]
The same page is local to a processor while remote to others; the same applies to straddling cache lines. Ideally, I want to have all pages within a partition local to a single processor. The standard trick is to convert the 2D array to 4D.
2D to 4D conversion
Essentially you need to change the way memory is allocated
The matrix A needs to be allocated in such a way that the elements falling within a partition are contiguous
The first two dimensions of the new 4D matrix are the block row and column indices, i.e., for the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16 processors)
The next two dimensions hold the data elements within that partition
Thus the 4D array may be declared as float B[√P][√P][N/√P][N/√P]
The element B[3][2][5][10] corresponds to the element in the 10th column, 5th row of the partition of P14
Now all elements within a partition have contiguous addresses
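A minimal sketch of this declaration and index mapping for 16 processors; N, RP and the helper function are illustrative assumptions:

#define N  1024           /* grid dimension (illustrative) */
#define RP 4              /* sqrt(P) for P = 16 processors */

float B[RP][RP][N/RP][N/RP];   /* one contiguous block per partition */

/* Global element (i, j) of the N x N grid lives in the partition of
   processor (i / (N/RP)) * RP + (j / (N/RP)). */
float get (int i, int j) {
    return B[i / (N/RP)][j / (N/RP)][i % (N/RP)][j % (N/RP)];
}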
Transfer granularity
How much data do you transfer in one communication?
For message passing it is explicit in the program
For shared memory this is really under the control of the cache coherence protocol: there is a fixed size for which transactions are defined (normally the block size of the outermost level of the cache hierarchy)
In shared memory you have to be careful
Since the minimum transfer size is a cache line you may end up transferring extra data, e.g., in the grid solver the elements of the left and right neighbors for a square block decomposition (you need only one element, but must transfer the whole cache line): no good solution
Worse: false sharing
If the algorithm is designed so poorly that
Two processors write to two different words within a cache line at the same time
The cache line keeps on moving between the two processors
The processors are not really accessing or updating the same element, but whatever they are updating happens to fall within a cache line: not true sharing, but false sharing
For shared memory programs false sharing can easily degrade performance by a lot
Easy to avoid: just pad up to the end of the cache line before starting the allocation of the data for the next processor (wastes memory, but improves performance)
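A sketch of the padding fix just described, assuming a 64-byte cache line (the line size and all names are illustrative):

#define LINE_SIZE 64
#define NTHREADS  8

/* Without the pad, several counters would share one cache line and
   ping-pong between processors; with it, each owns its own line. */
struct per_thread {
    long counter;
    char pad[LINE_SIZE - sizeof (long)];
};

struct per_thread counters[NTHREADS];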
Hot-spots
Avoid a location hot-spot by either staggering accesses to the same location or by designing the algorithm to exploit a tree-structured communication
Module hot-spot
Normally happens when a particular node saturates handling too many messages (need not be to the same memory location) within a short amount of time
The normal solution again is to design the algorithm in such a way that these messages are staggered over time
Rule of thumb: design the communication pattern such that it is not bursty; want to distribute it uniformly over time
Overlap
Increase overlap between communication and computation
Not much to do at the algorithm level unless the programming model and/or OS provide some primitives to carry out prefetching, block data transfer, non-blocking receive etc.
Normally, these techniques increase bandwidth demand because you end up communicating the same amount of data, but in a shorter amount of time (execution time hopefully goes down if you can exploit overlap)
Summary
Parallel programs introduce three overhead terms: busy overhead (extra work), remote data access time, and synchronization time
Goal of a good parallel program is to minimize these three terms
Goal of a good parallel computer architecture is to provide sufficient support to let programmers optimize these three terms (and this is the focus of the rest of the course)
Four organizations
Shared cache
The switch is a simple controller for granting access to cache banks
Interconnect is between the processors and the shared cache
Which level of the cache hierarchy is shared depends on the design: chip multiprocessors today normally share the outermost level (L2 or L3 cache)
The cache and memory are interleaved to improve bandwidth by allowing multiple concurrent accesses
Normally small scale due to heavy bandwidth demand on the switch and shared cache
[Diagram: P0..Pn connect through a SWITCH to an INTERLEAVED SHARED CACHE backed by INTERLEAVED MEMORY.]
Four organizations
Bus-based SMP
Interconnect is a shared bus located between the private cache hierarchies and the memory controller
Scalability is limited by the shared bus bandwidth
The most popular organization for small to medium-scale servers
Possible to connect 30 or so processors with smart bus design
Bus bandwidth requirement is lower compared to the shared cache approach
Why?
[Diagram: P0..Pn, each with a private CACHE, attached to a shared BUS along with MEM.]
Four organizations
Dancehall
Better scalability compared to the previous two designs
The difference from the bus-based SMP is that the interconnect is a scalable point-to-point network (e.g., crossbar or other topology)
Memory is still symmetric from all processors
Drawback: a cache miss may take a long time since all memory banks are too far off from the processors (may be several network hops)
[Diagram: P0..Pn with private CACHEs on one side of an INTERCONNECT, MEM banks on the other.]
Four organizations
Distributed shared memory
The most popular scalable organization
Each node now has local memory banks
Shared memory on other nodes must be accessed over the network: remote memory access
Non-uniform memory access (NUMA): latency to access local memory is much smaller compared to remote memory
Caching is very important to reduce remote memory accesses
[Figure: each node pairs a processor and its cache with local memory banks; the nodes are connected by a scalable interconnect]
Four organizations
In all four organizations caches play an important role in reducing latency and bandwidth requirements
If an access is satisfied in the cache, the transaction does not appear on the interconnect, so the bandwidth requirement on the interconnect is lower (a shared L1 cache does not have this advantage)
In distributed shared memory (DSM) the cache and the local memory should be used cleverly
Bus-based SMP and DSM are the two designs supported today by industry vendors
In bus-based SMP every cache miss is launched on the shared bus so that all processors can see all transactions
In DSM this is not the case
Hierarchical design
Possible to combine bus-based SMP and DSM to build hierarchical shared memory
Sun Wildfire connects four large SMPs (28 processors each) over a scalable interconnect to form a 112-processor multiprocessor
IBM POWER4 has two processors on chip with private L1 caches, but shared L2 and L3 caches (this is called a chip multiprocessor); connecting these chips over a network yields scalable multiprocessors
The next few lectures will focus on bus-based SMPs only
Cache Coherence
Intuitive memory model
For sequential programs we expect a memory location to return the latest value written to that location
For concurrent programs running as multiple threads or processes on a single processor we expect the same model to hold, because all threads see the same cache hierarchy (same as a shared L1 cache)
For multiprocessors there remains a danger of using a stale value: in SMP or DSM the caches are not shared and processors are allowed to replicate data independently in each cache; hardware must ensure that cached values are coherent across the system and satisfy the programmer's intuitive memory model
Example
Assume a write-through cache, i.e., every store updates the value in the cache as well as in memory
P0: reads x from memory, puts it in its cache, and gets the value 5
P1: reads x from memory, puts it in its cache, and gets the value 5
P1: writes x=7, updates its cached value and the memory value
P0: reads x from its cache and gets the (stale) value 5
P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is completely incoherent)
P2: writes x=10, updates its cached value and the memory value
What went wrong?
For the write-through cache
The memory value may be correct if the writes are correctly ordered
But the system allowed a store to proceed while there was already a cached copy elsewhere
Lesson learned: must invalidate all cached copies before allowing a store to proceed
Writeback cache
The problem is even more complicated: stores are no longer immediately visible to memory
Writeback order is important
Lesson learned: do not allow more than one copy of a cache line in M state
Definitions
Memory operation: a read (load), a write (store), or a read-modify-write
Assumed to take place atomically
A memory operation is said to issue when it leaves the issue queue and looks up the cache
A memory operation is said to perform with respect to a processor when its effect is fixed relative to that processor's other issued memory operations:
A read is said to perform with respect to a processor when subsequent writes issued by that processor cannot affect the returned read value
A write is said to perform with respect to a processor when a subsequent read from that processor to the same address returns the new value
Ordering memory operations
A memory operation is said to complete when it has performed with respect to all processors in the system
Assume that there is a single shared memory and no caches
Memory operations complete in shared memory when they access the corresponding memory locations
Operations from the same processor complete in program order: this imposes a partial order among the memory operations
Operations from different processors are interleaved in such a way that the program order is maintained for each processor: memory imposes some total order (many are possible)
Cache coherence
Formal definition
A memory system is coherent if the values returned by reads to a memory location during an execution of a program are such that all operations to that location can form a hypothetical total order that is consistent with the serial order and has the following two properties:
1. Operations issued by any particular processor perform according to the issue order
2. The value returned by a read is the value written to that location by the last write in the total order
Two necessary features that follow from the above:
A. Write propagation: writes must eventually become visible to all processors
B. Write serialization: every processor should see the writes to a location in the same order (if I see w1 before w2, you should not see w2 before w1)
Bus-based SMP
Extend the philosophy of uniprocessor bus transactions
Three phases: arbitrate for the bus, launch command (often called request) and address, transfer data
Every device connected to the bus can observe the transaction
The appropriate device responds to the request
In SMP, processors also observe the transactions and may take appropriate actions to guarantee coherence
The other device on the bus that will be of interest to us is the memory controller (the north bridge on standard motherboards)
Depending on the bus transaction, a cache block executes a finite state machine implementing the coherence protocol
Snoopy protocols
Cache coherence protocols implemented in bus-based machines are called snoopy protocols
The processors snoop or monitor the bus and take appropriate protocol actions based on the snoop results
The cache controller now receives requests both from the processor and from the bus
Since cache state is maintained on a per-line basis, that also dictates the coherence granularity
Cannot normally take a coherence action on parts of a cache line
The coherence protocol is implemented as a finite state machine on a per-cache-line basis
The snoop logic in each processor grabs the address from the bus and decides if any action should be taken on the cache line containing that address (only if the line is in the cache)
State transition
The finite state machine for each cache line:
On a write miss no line is allocated
The state remains at I: this is called write-through write-no-allocate
A/B means: A is generated by the processor, B is the resulting bus transaction (if any)
Changes for write-through write-allocate?
[State diagram: two states, I and V]
I --PrRd/BusRd--> V; I --PrWr/BusWr--> I (no allocation on a write miss)
V --PrRd/- --> V; V --PrWr/BusWr--> V (every write goes to the bus)
V --BusWr (snoop)--> I (another processor's write invalidates the copy)
Write-through is bad
High bandwidth requirement
Every write appears on the bus
Assume a 3 GHz processor running an application with 10% store instructions at a CPI of 1
If the application runs for 100 cycles it generates 10 stores; assuming each store is 4 bytes, 40 bytes are generated per 100/3 ns, i.e., a bandwidth of 1.2 GB/s (see the check below)
A 1 GB/s bus cannot even support one processor
There are multiple processors, and there are also read misses
Writeback caches absorb most of the write traffic
Writes that hit in the cache do not go on the bus (not visible to others)
The cost is a complicated coherence protocol with many choices
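A quick back-of-envelope check of the bandwidth estimate above (the program just restates the arithmetic):

    #include <stdio.h>

    int main(void) {
        double insns_per_sec = 3e9;  /* 3 GHz at CPI = 1 */
        double store_frac    = 0.10; /* 10% of instructions are stores */
        double bytes_per_st  = 4.0;  /* 4-byte stores */
        double bw = insns_per_sec * store_frac * bytes_per_st;  /* bytes/s */
        printf("write-through store bandwidth = %.1f GB/s\n", bw / 1e9);  /* 1.2 */
        return 0;
    }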
Memory consistency
Need a more formal description of memory ordering
How to establish the order between reads and writes from different processors?
The clearest way is to use synchronization:
P0: A=1; flag=1
P1: while (!flag); print A;
Another example (assume A=0, B=0 initially):
P0: A=1; print B;
P1: B=1; print A;
What do you expect? (A C11 rendering of the first example follows below)
The memory consistency model is a contract between the programmer and the hardware regarding memory ordering
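A hedged C11 rendering of the flag example (function names are invented; with the default sequentially consistent atomics, P1 is guaranteed to print 1 once it leaves its loop):

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int A = 0, flag = 0;

    void p0(void) {            /* P0: A=1; flag=1 */
        atomic_store(&A, 1);
        atomic_store(&flag, 1);
    }

    void p1(void) {            /* P1: while (!flag); print A */
        while (!atomic_load(&flag))
            ;                                 /* spin until the flag is visible */
        printf("A = %d\n", atomic_load(&A));  /* must print 1 */
    }
    /* p0 and p1 would run on two different threads/processors. */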
Consistency model
A multiprocessor normally advertises the supported memory consistency model
This essentially tells the programmer what the possible correct outcomes of a program could be when run on that machine
Cache coherence deals with memory operations to the same location, but not with different locations
Without a formally defined order across all memory operations it often becomes impossible to argue about what is correct and what is wrong in shared memory
There are various memory consistency models
Sequential consistency (SC) is the most intuitive one and we will focus on it now (more consistency models later)
OOO and SC
Consider a simple example (all variables are zero initially)
P0: x=w+1; r=y+1;
P1: y=2; w=y+1;
Suppose the load that reads w misses and so w is not ready for a long time; therefore x=w+1 cannot complete immediately; eventually w returns with the value 3
Inside the microprocessor r=y+1 completes (but does not commit) before x=w+1 and gets the old value of y (possibly from cache); eventually the instructions commit in order with x=4, r=1, y=2, w=3
So we have the following partial orders
P0: x=w+1 < r=y+1 and P1: y=2 < w=y+1
Cross-thread: w=y+1 < x=w+1 and r=y+1 < y=2
Combining these yields a cyclic, hence contradictory, total order
What went wrong? We will discuss it in detail later
SC example
Consider the following example
P0: A=1; print B;
P1: B=1; print A;
Possible outcomes for an SC machine:
(A, B) = (0,1); interleaving: B=1; print A; A=1; print B
(A, B) = (1,0); interleaving: A=1; print B; B=1; print A
(A, B) = (1,1); interleavings: A=1; B=1; print A; print B or A=1; B=1; print B; print A (among others)
(A, B) = (0,0) is impossible: the read of A must occur before the write of A and the read of B must occur before the write of B, i.e., print A < A=1 and print B < B=1; but program order gives A=1 < print B and B=1 < print A; thus print B < B=1 < print A < A=1 < print B, which implies print B < print B, a contradiction
(A small program that enumerates these interleavings follows below)
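To make the case analysis mechanical, here is a small hypothetical checker that enumerates every interleaving consistent with both program orders and records the possible printed (A, B) pairs; it reports (0,1), (1,0), and (1,1), but never (0,0):

    #include <stdio.h>

    int main(void) {
        /* Ops: 0 = A=1, 1 = print B (P0); 2 = B=1, 3 = print A (P1). */
        int seen[2][2] = {{0, 0}, {0, 0}};
        /* The 6 interleavings preserving program order correspond to the
           choices of the two slots (i < j) that P0's operations occupy. */
        for (int i = 0; i < 4; i++) {
            for (int j = i + 1; j < 4; j++) {
                int order[4], n0 = 0, n1 = 0;
                for (int k = 0; k < 4; k++)
                    order[k] = (k == i || k == j) ? n0++ : 2 + n1++;
                int A = 0, B = 0, pA = 0, pB = 0;
                for (int k = 0; k < 4; k++) {
                    if (order[k] == 0) A = 1;        /* A=1      */
                    else if (order[k] == 1) pB = B;  /* print B  */
                    else if (order[k] == 2) B = 1;   /* B=1      */
                    else pA = A;                     /* print A  */
                }
                seen[pA][pB] = 1;
            }
        }
        for (int a = 0; a < 2; a++)
            for (int b = 0; b < 2; b++)
                if (seen[a][b])
                    printf("(A, B) = (%d, %d) is possible under SC\n", a, b);
        return 0;
    }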
Implementing SC
Two basic requirements
Memory operations issued by a processor must become visible to others in program order
Need to make sure that all processors see the same total order of memory operations: in the previous example, for the (0,1) case both P0 and P1 should see the same interleaving: B=1; print A; A=1; print B
The tricky part is to make sure that writes become visible in the same order to all processors
Write atomicity: as if each write is an atomic operation
Otherwise, two processors may end up using different values (which may still be correct from the viewpoint of cache coherence, but will violate SC)
Write atomicity
Example (A=0, B=0 initially)
P0: A=1;
P1: while (!A); B=1;
P2: while (!B); print A;
A correct execution on an SC machine should print A=1
A=0 will be printed only if the write to A is not visible to P2, but it is clearly visible to P1 since P1 came out of its loop
Thus A=0 requires that P1 sees the order A=1 < B=1 while P2 sees the order B=1 < A=1, i.e., from the viewpoint of the whole system the write A=1 was not atomic
Without write atomicity P2 may proceed to print 0 with a stale value from its cache (a C11 version follows below)
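A hedged C11 rendering with real threads (this uses C11's optional <threads.h>; names are invented). Sequentially consistent atomics make the write to A atomic in the above sense, so the program must print 1:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int A = 0, B = 0;

    int p0(void *arg) { (void)arg; atomic_store(&A, 1); return 0; }

    int p1(void *arg) {
        (void)arg;
        while (!atomic_load(&A)) ;   /* wait until A=1 is visible here */
        atomic_store(&B, 1);
        return 0;
    }

    int p2(void *arg) {
        (void)arg;
        while (!atomic_load(&B)) ;   /* wait until B=1 is visible here */
        printf("A = %d\n", atomic_load(&A));   /* must print 1 */
        return 0;
    }

    int main(void) {
        thrd_t t0, t1, t2;
        thrd_create(&t0, p0, NULL);
        thrd_create(&t1, p1, NULL);
        thrd_create(&t2, p2, NULL);
        thrd_join(t0, NULL); thrd_join(t1, NULL); thrd_join(t2, NULL);
        return 0;
    }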
Summary of SC
Program order from each processor creates a partial order among memory operations
Interleaving these partial orders defines a total order
Sequential consistency: one of many possible total orders
A multiprocessor is said to be SC if every execution on this machine is SC compliant
Sufficient (but not necessary) conditions for SC:
Issue memory operations in program order
Every processor waits for its write to complete before issuing the next operation
Every processor waits for a read to complete, and for the write that produced the returned value to complete, before issuing the next operation (important for write atomicity)
Back to shared bus
A centralized shared bus makes it easy to support SC
Writes and reads are all serialized in a total order through the bus transaction ordering
If a read gets the value of a previous write, that write is guaranteed to be complete because its bus transaction is complete
The write order seen by all processors is the same in a write-through system because every write causes a transaction and hence is visible to all in the same order
In a nutshell, every processor sees the same total bus order for all memory operations, and therefore any bus-based SMP with write-through caches is SC
What about a multiprocessor with writeback caches? (No SMP uses a write-through protocol due to the high bandwidth demand)
Stores
Look at stores a little more closely
There are three situations at the time a store issues: the line is not in the cache, the line is in the cache in S state, or the line is in the cache in one of the M, E, and O states
If the line is in I state, the store generates a read-exclusive request on the bus and gets the line in M state
If the line is in S or O state, the processor only has read permission for that line; the store generates an upgrade request on the bus and the upgrade acknowledgment grants it write permission (this is a data-less transaction)
If the line is in M or E state, no bus transaction is generated; the cache already has write permission for the line (this is the case of a write hit; the previous two are write misses)
(A sketch of this decision follows below)
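A minimal sketch of this three-way decision as code (state and transaction names follow the slides; the enum and function names are invented):

    typedef enum { ST_I, ST_S, ST_E, ST_O, ST_M } line_state_t;
    typedef enum { TXN_NONE, TXN_BUSRDX, TXN_BUSUPGR } bus_txn_t;

    /* What a store issue puts on the bus, given the current line state. */
    bus_txn_t on_store(line_state_t st) {
        switch (st) {
        case ST_I:              return TXN_BUSRDX;   /* write miss: read-exclusive */
        case ST_S: case ST_O:   return TXN_BUSUPGR;  /* have data, need permission */
        case ST_E: case ST_M:   return TXN_NONE;     /* write hit: no transaction */
        }
        return TXN_NONE;
    }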
Invalidation vs. update
Two main classes of protocols: invalidation-based and update-based
This dictates what action should be taken on a write
Invalidation-based protocols invalidate sharers when a write miss (upgrade or readX) appears on the bus
Update-based protocols update the sharer caches with the new value on a write: this requires write transactions (carrying just the modified bytes) on the bus even on write hits (not very attractive with writeback caches)
Advantage of update-based protocols: sharers continue to hit in the cache, while in invalidation-based protocols sharers will miss the next time they try to access the line
Advantage of invalidation-based protocols: only write misses go on the bus (suited to writeback caches) and subsequent stores to the same line are cache hits
Which one is better?
Difficult to answer
Depends on program behavior and hardware cost
When is an update-based protocol good? For what sharing pattern? (large-scale producer/consumer)
Otherwise it would just waste bus bandwidth doing useless updates
When is an invalidation-based protocol good? For a sequence of multiple writes to a cache line: it saves the intermediate write transactions
Also think about the overhead of initiating small updates for every write in update protocols
Invalidation-based protocols are much more popular
Some systems support both, or a hybrid based on the dynamic sharing pattern of a cache line
MSI protocol
Forms the foundation of invalidation-based writeback protocols
Assumes only three supported cache line states: I, S, and M
There may be multiple processors caching a line in S state
There must be exactly one processor caching a line in M state, and it is the owner of the line
If none of the caches has the line, memory must have the most up-to-date copy of the line
Processor requests to the cache: PrRd, PrWr
Bus transactions: BusRd, BusRdX, BusUpgr, BusWB
State transition
[State diagram: three states, I, S, M]
I --PrRd/BusRd--> S; I --PrWr/BusRdX--> M
S --PrRd/- --> S; S --BusRd/- --> S; S --{BusRdX, BusUpgr}/- --> I; S --CacheEvict/- --> I; S --PrWr/BusUpgr--> M
M --PrRd/- --> M; M --PrWr/- --> M; M --BusRd/Flush--> S; M --BusRdX/Flush--> I; M --CacheEvict/BusWB--> I
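The same transitions can be written down as a per-line controller; a hedged sketch (names are invented, and the data movement on Flush/BusWB is elided):

    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
    typedef enum { ACT_NONE, ACT_BUSRD, ACT_BUSRDX, ACT_BUSUPGR,
                   ACT_BUSWB, ACT_FLUSH } action_t;

    /* Processor side: PrRd (is_write = 0) or PrWr (is_write = 1). */
    action_t on_proc(msi_state_t *st, int is_write) {
        switch (*st) {
        case MSI_I:
            *st = is_write ? MSI_M : MSI_S;
            return is_write ? ACT_BUSRDX : ACT_BUSRD;
        case MSI_S:
            if (is_write) { *st = MSI_M; return ACT_BUSUPGR; }
            return ACT_NONE;                  /* read hit */
        case MSI_M:
            return ACT_NONE;                  /* read or write hit */
        }
        return ACT_NONE;
    }

    /* Snoop side: another processor's transaction observed on the bus. */
    action_t on_snoop(msi_state_t *st, action_t bus) {
        if (*st == MSI_M && bus == ACT_BUSRD)  { *st = MSI_S; return ACT_FLUSH; }
        if (*st == MSI_M && bus == ACT_BUSRDX) { *st = MSI_I; return ACT_FLUSH; }
        if (*st == MSI_S && (bus == ACT_BUSRDX || bus == ACT_BUSUPGR))
            *st = MSI_I;
        return ACT_NONE;
    }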
MSI protocol
A few things to note
The Flush operation essentially launches the line on the bus
The processor with the cache line in M state is responsible for flushing the line on the bus whenever there is a BusRd or BusRdX transaction generated by some other processor
On BusRd the line transitions from M to S, but not from M to I. Why? Also, at this point both the requester and memory pick up the line from the bus; the requester puts the line in its cache in S state while memory writes the line back. Why does memory need to write back?
On BusRdX the line transitions from M to I and this time memory does not need to pick up the line from the bus. Only the requester picks up the line and puts it in M state in its cache. Why?
MSI example
Take the following example: P0 reads x, P1 reads x, P1 writes x, P0 reads x, P2 reads x, P3 writes x
Assume the state of the cache line containing the address of x is I in all processors
P0 generates BusRd, memory provides the line, P0 puts the line in S state
P1 generates BusRd, memory provides the line, P1 puts the line in S state
P1 generates BusUpgr, P0 snoops and invalidates its line, memory does not respond, P1 sets the state of its line to M
P0 generates BusRd, P1 flushes the line and goes to S state, P0 puts the line in S state, memory writes the line back
P2 generates BusRd, memory provides the line, P2 puts the line in S state
P3 generates BusRdX, P0, P1, and P2 snoop and invalidate, memory provides the line, P3 puts the line in its cache in M state
MESI protocol
The most popular invalidation-based protocol; appears, e.g., in Intel Xeon MP
Why is the E state needed?
The MSI protocol requires two transactions to go from I to M even if there are no intervening requests for the line: a BusRd followed by a BusUpgr
We can save one transaction by having the memory controller respond to the first BusRd with E state if there is no other sharer in the system
How to know if there is no other sharer? This needs a dedicated control wire that gets asserted by any sharer (wired OR); see the sketch below
A processor can write to a line in E state silently and take it to M state
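A tiny sketch of the read-miss decision (the wired-OR shared signal is modeled as a flag; all names are invented):

    typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state_t;

    /* On a BusRd from I: every current sharer asserts the shared wire during
       the snoop phase; the requester samples it to choose E or S. */
    mesi_state_t on_read_miss(int shared_wire_asserted) {
        return shared_wire_asserted ? MESI_S : MESI_E;
    }

    /* A later store to a line in E needs no bus transaction (silent E -> M),
       saving the BusUpgr that MSI would have required. */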
State transition
[State diagram: four states, I, S, E, M; (S)/(!S) denotes the shared signal on BusRd]
I --PrRd/BusRd(S)--> S; I --PrRd/BusRd(!S)--> E; I --PrWr/BusRdX--> M
E --PrRd/- --> E; E --PrWr/- --> M (silent); E --BusRd/Flush--> S; E --BusRdX/Flush--> I; E --CacheEvict/- --> I
S --PrRd/- --> S; S --PrWr/BusUpgr--> M; S --BusRd/Flush--> S; S --{BusRdX, BusUpgr}/Flush--> I; S --CacheEvict/- --> I
M --PrRd/- --> M; M --PrWr/- --> M; M --BusRd/Flush--> S; M --BusRdX/Flush--> I; M --CacheEvict/BusWB--> I
MESI protocol
If a cache line is in M state, the processor holding the line is definitely responsible for flushing it on the next BusRd or BusRdX transaction
If a line is not in M state, who is responsible? Memory, or another cache holding the line in S or E state?
The original Illinois MESI protocol assumed cache-to-cache transfer, i.e., any processor in E or S state is responsible for the flush