SE363
Computer Architecture
MIMD Parallel Processors
John Morris
Iolanthe II racing in Waitemata Harbour
MIMD Systems
• Recipe
  • Buy a few high performance commercial PEs
    • DEC Alpha
    • MIPS R10000
    • UltraSPARC
    • Pentium?
  • Put them together with some memory and peripherals on a common bus
  • Instant parallel processor!
• How to program it?
Programming Model
• Problem not unique to MIMD
  • Even sequential machines need one
    • von Neumann (stored program) model
• Parallel - Splitting the work load
  • Data
    • Distribute data to PEs
  • Instructions
    • Distribute tasks to PEs
  • Synchronization
    • Having divided the data & tasks, how do we synchronize tasks?
Programming Model
• Shared Memory Model
  • Flavour of the year
  • Generally thought to be simplest to manage
  • All PEs see a common (virtual) address space
  • PEs communicate by writing into the common address space
Data Distribution
• Trivial
  • All the data sits in the common address space
  • Any PE can access it!
• Uniform Memory Access (UMA) systems
  • All PEs access all data with the same access time (t_acc)
• Non-UMA (NUMA) systems
  • Memory is physically distributed
  • Some PEs are “closer” to some addresses
  • More later!
Synchronisation
• Read static shared data
  • No problem!
• Update problem
  • PE0 writes x
  • PE1 reads x
  • How to ensure that PE1 reads the last value written by PE0?
• Semaphores
  • Lock resources (memory areas or ...) while being updated by one PE
Synchronisation
• Semaphore
  • Data structure in memory
    • Count of waiters
      • -1 = resource free
      • >= 0 = resource in use
    • Pointer to list of waiters
  • Two operations
    • Wait
      • Proceed immediately if resource free (waiter count = -1)
    • Notify
      • Advise semaphore that you have finished with resource
      • Decrement waiter count
      • First waiter will be given control
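The counting convention above can be sketched as a small single-threaded model. This is only an illustration: the class name `CountingSemaphore` and its fields are hypothetical, nothing actually blocks, and it assumes that Wait always increments the count and queues the caller when the resource is in use (the slide does not spell out that path).

```python
from collections import deque

class CountingSemaphore:
    """Toy model of the semaphore described above (hypothetical names, no real blocking)."""

    def __init__(self):
        self.count = -1         # -1 = resource free, >= 0 = resource in use
        self.waiters = deque()  # list of waiters, served first come first served
        self.owner = None

    def wait(self, task):
        """Return True if the task may proceed immediately, False if it was queued."""
        self.count += 1
        if self.count == 0:     # count was -1, so the resource was free
            self.owner = task
            return True
        self.waiters.append(task)
        return False

    def notify(self):
        """Owner has finished: decrement the count and hand over to the first waiter."""
        self.count -= 1
        self.owner = self.waiters.popleft() if self.waiters else None
        return self.owner

sem = CountingSemaphore()
first = sem.wait("PE0")     # proceeds at once
second = sem.wait("PE1")    # queued behind PE0
next_owner = sem.notify()   # PE0 finishes, PE1 is given control
```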
Semaphores - Implementation
• Scenario
  • Semaphore free (-1)
  • PE0: wait ..
    • Resource free, so PE0 uses it (sets 0)
  • PE1: wait ..
    • Reads count (0)
    • Starts to increment it ..
  • PE0: notify ..
    • Gets bus and writes -1
  • PE1: (finishing wait)
    • Adds 1 to 0, writes 1 to count, adds PE1 TCB to list
• Stalemate!
  • Who issues notify to free the resource?
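The interleaving in this scenario can be replayed step by step. The sketch below simply executes the steps in the order listed; the variable names are illustrative only.

```python
count = -1                # semaphore free
waiters = []

# PE0: wait (runs to completion): -1 -> 0, PE0 now uses the resource
count = count + 1

# PE1: wait, first half: read the count
pe1_read = count          # reads 0

# PE0: notify gets the bus between PE1's read and PE1's write
count = count - 1         # writes -1, the resource is free again

# PE1: wait, second half: finish using the stale value it read earlier
count = pe1_read + 1      # adds 1 to 0 and writes 1
waiters.append("PE1")     # PE1 queues itself behind an owner that no longer exists
```

At the end the count is 1 and PE1 is on the wait list, yet no PE holds the resource, so nobody will ever issue the notify that PE1 is waiting for.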
Atomic Operations
• Problem
  • PE0 wrote a new value (-1) after PE1 had read the counter
  • PE1 increments the value it read (0) and writes it back
• Solution
  • PE1’s read and update must be atomic
    • No other PE may gain access to the counter while PE1 is updating it
  • Usually an architecture will provide
    • Test and set instruction
      • Read a memory location, test it, if it’s 0, write a new value, else do nothing
      • Atomic or indivisible .. No other PE can access the value until the operation is complete
Atomic Operations
• Test & Set
  • Read a memory location, test it, if it’s 0, write a new value, else do nothing
  • Can be used to guard a resource
    • When the location contains 0 - access to the resource is allowed
    • Non-zero value means the resource is locked
  • Semaphore:
    • Simple semaphore (no wait list)
      • Implement directly
      • Waiter “backs off” and tries again (rather than being queued)
    • Complex semaphore (with wait list)
      • Guards the wait counter
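A simple semaphore built directly on test and set might look like the sketch below. Python has no test-and-set instruction, so the class `Word` (a hypothetical name) simulates one: an internal `threading.Lock` stands in for the hardware guarantee that the read, test and write are indivisible. Threads play the role of PEs.

```python
import threading
import time

class Word:
    """One memory word with a simulated atomic test-and-set."""

    def __init__(self):
        self.value = 0
        self._bus = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self, new_value):
        """Read the word; if it is 0, write new_value. Return the value read."""
        with self._bus:
            old = self.value
            if old == 0:
                self.value = new_value
            return old

def acquire(lock_word):
    while lock_word.test_and_set(1) != 0:
        time.sleep(0)                  # locked: back off and try again

def release(lock_word):
    lock_word.value = 0                # 0 means access is allowed again

lock_word = Word()
counter = 0

def worker(iterations):
    global counter
    for _ in range(iterations):
        acquire(lock_word)
        counter += 1                   # guarded read-modify-write
        release(lock_word)

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Waiters are never queued here; each one just retries, which matches the "simple semaphore" case rather than the one with a wait list.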
Atomic Operations
• Processor must provide an atomic operation for
  • Multi-tasking or multi-threading on a single PE
    • Multiple processes
    • Interrupts occur at arbitrary points in time
      • including timer interrupts signaling end of time-slice
    • Any process can be interrupted in the middle of a read-modify-write sequence
  • Shared memory multi-processors
    • One PE can lose control of the bus after the read of a read-modify-write
  • Cache?
    • Later!
Atomic Operations
• Variations
  • Provide equivalent capability
  • Sometimes appear in strange guises!
• Read-modify-write bus transactions
  • Memory location is read, modified and written back as a single, indivisible operation
• Test and exchange
  • Check register’s value, if 0, exchange with memory
• Reservation Register (PowerPC)
  • lwarx - load word and reserve indexed
  • stwcx. - store word conditional indexed
  • Reservation register stores address of reserved word
  • Reservation and use can be separated by a sequence of instructions
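The reservation idea can be illustrated with a toy model. This is a loose sketch, not PowerPC semantics: the class `Memory` and the methods `load_reserve` and `store_conditional` are hypothetical names, and a single reservation register is shared by the whole model, which is a simplification.

```python
class Memory:
    """Toy memory with one reservation register (simplified, hypothetical model)."""

    def __init__(self):
        self.words = {}
        self.reservation = None          # (pe, address) or None

    def load_reserve(self, pe, addr):
        """Like lwarx: load the word and remember which address is reserved."""
        self.reservation = (pe, addr)
        return self.words.get(addr, 0)

    def store(self, addr, value):
        """Ordinary store: cancels a reservation held on the same address."""
        if self.reservation is not None and self.reservation[1] == addr:
            self.reservation = None
        self.words[addr] = value

    def store_conditional(self, pe, addr, value):
        """Like stwcx.: store only if this PE's reservation is still intact."""
        if self.reservation != (pe, addr):
            return False
        self.words[addr] = value
        self.reservation = None
        return True

def atomic_increment(mem, pe, addr, interfere=None):
    """Retry until the conditional store succeeds; return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_reserve(pe, addr)
        if interfere is not None and attempts == 1:
            interfere()                  # another PE writes between reserve and store
        if mem.store_conditional(pe, addr, old + 1):
            return attempts

mem = Memory()
mem.store(0x40, 0)
attempts = atomic_increment(mem, "PE1", 0x40, interfere=lambda: mem.store(0x40, -1))
```

The first attempt fails because the interfering store breaks the reservation; the second attempt re-reads the new value (-1) and increments that instead of the stale 0.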
Barriers
• In shared memory environment
  • PEs must know when another PE has produced a result
  • Simplest case: barrier for all PEs
  • Must be inserted by programmer
• Potentially expensive
  • All PEs stall and waste time in the barrier
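A barrier for all PEs can be sketched with Python's standard `threading.Barrier`, with threads standing in for PEs. The two-phase computation and the names `partial` and `totals` are made up for illustration.

```python
import threading

N_PES = 4
partial = [0] * N_PES                  # one result slot per PE, in shared memory
totals = [0] * N_PES
barrier = threading.Barrier(N_PES)

def pe(rank):
    partial[rank] = (rank + 1) * 10    # phase 1: produce a result
    barrier.wait()                     # stall until every PE has arrived
    totals[rank] = sum(partial)        # phase 2: now safe to read the others' results

threads = [threading.Thread(target=pe, args=(r,)) for r in range(N_PES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the `barrier.wait()` call a fast PE could sum the array before the slower ones had written their slots.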
Cache?
• What happens to cached locations?
Multiple Caches - Inconsistent states
• Coherence
  • PE A reads location x from memory
    • Copy in cache A
  • PE B reads location x from memory
    • Copy in cache B
  • PE A adds 1
    • A’s copy now 201
  • PE B reads location x
    • reads 200 from cache B
  • Caches and memory are now inconsistent or not coherent
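The sequence above can be replayed with two dictionaries standing in for the caches. The sketch assumes write-back caches with no coherence mechanism at all; the helper `read` is a hypothetical name.

```python
memory = {"x": 200}
cache_a, cache_b = {}, {}

def read(cache, addr):
    if addr not in cache:          # miss: fetch a copy from memory
        cache[addr] = memory[addr]
    return cache[addr]             # hit: use the local copy

read(cache_a, "x")                 # PE A reads x: copy in cache A
read(cache_b, "x")                 # PE B reads x: copy in cache B
cache_a["x"] += 1                  # PE A adds 1: only A's copy changes
stale = read(cache_b, "x")         # PE B reads x again and hits in cache B
```

Cache A holds 201 while cache B and memory still hold 200.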
Cache - Maintaining Coherence
• Invalidate on write
  • PE A reads location x from memory
    • Copy in cache A
  • PE B reads location x from memory
    • Copy in cache B
  • PE A adds 1
    • A’s copy now 201
    • Issues invalidate x
  • Cache B marks x invalid
  • Invalidate is address transaction only
Cache - Maintaining Coherence
• Reading the new value
  • PE B reads location x
    • Main memory is wrong also
  • PE A snoops read
    • Realises it has valid copy
  • PE A issues retry
  • PE A writes x back
    • Memory now correct
  • PE B reads location x again
    • Reads latest version
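Invalidate-on-write followed by a snooped read can be sketched as below. The classes `Bus` and `Cache` are hypothetical, and the retry is modelled in the simplest possible way: the snooping cache writes its modified copy back before the requester's read of memory completes.

```python
class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []

class Cache:
    def __init__(self, bus):
        self.lines = {}                      # addr -> [value, modified?]
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:           # miss: the read appears on the bus
            for other in self.bus.caches:
                if other is not self:
                    other.snoop_read(addr)   # holder of a modified copy writes it back
            self.lines[addr] = [self.bus.memory[addr], False]
        return self.lines[addr][0]

    def write(self, addr, value):
        for other in self.bus.caches:        # address-only invalidate transaction
            if other is not self:
                other.lines.pop(addr, None)
        self.lines[addr] = [value, True]     # memory is now out of date

    def snoop_read(self, addr):
        line = self.lines.get(addr)
        if line is not None and line[1]:     # we hold the only valid copy
            self.bus.memory[addr] = line[0]  # write x back so the retry sees it
            line[1] = False

memory = {"x": 200}
bus = Bus(memory)
cache_a, cache_b = Cache(bus), Cache(bus)
cache_a.read("x")
cache_b.read("x")
cache_a.write("x", cache_a.read("x") + 1)    # PE A adds 1, cache B's copy is invalidated
memory_before = memory["x"]                  # main memory is still wrong here
latest = cache_b.read("x")                   # PE B misses, PE A snoops and writes back
```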
Coherent Cache - Snooping
• SIU “snoops” bus for transactions
  • Addresses compared with local cache
  • Matches
    • Initiate retries
      • Local copy is modified
      • Local copy is written to bus
    • Invalidate local copies
      • Another PE is writing
    • Mark local copies shared
      • Second PE is reading same value
Coherent Cache - MESI protocol
• Cache line has 4 states
  • Invalid
  • Modified
    • Only valid copy
    • Memory copy is invalid
  • Exclusive
    • Only cached copy
    • Memory copy is valid
  • Shared
    • Multiple cached copies
    • Memory copy is valid
MESI State Diagram
• Note the number of bus transactions needed!
Legend:
WH   Write Hit
WM   Write Miss
RH   Read Hit
RMS  Read Miss Shared
RME  Read Miss Exclusive
SHW  Snoop Hit Write
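Since the diagram itself did not survive, the sketch below encodes a common textbook version of the MESI transitions as a lookup table, using the event names from the legend. It is an assumption that the original diagram matches this table exactly, and `SHR` (snoop hit on a read by another PE) is an extra event added here, not one from the legend. The boolean marks whether the transition needs a bus transaction from this cache.

```python
# (state, event) -> (next state, bus transaction needed?)
MESI = {
    ("I", "RMS"): ("S", True),    # line fill, another cache also holds it
    ("I", "RME"): ("E", True),    # line fill, no other cached copy
    ("I", "WM"):  ("M", True),    # fetch the line with intent to modify
    ("S", "RH"):  ("S", False),
    ("S", "WH"):  ("M", True),    # other caches must be told to invalidate
    ("S", "SHR"): ("S", False),
    ("S", "SHW"): ("I", False),
    ("E", "RH"):  ("E", False),
    ("E", "WH"):  ("M", False),   # only cached copy, so no notification needed
    ("E", "SHR"): ("S", False),
    ("E", "SHW"): ("I", False),
    ("M", "RH"):  ("M", False),
    ("M", "WH"):  ("M", False),
    ("M", "SHR"): ("S", True),    # push-out: write the modified line back
    ("M", "SHW"): ("I", True),    # push-out before the other PE writes
}

def run(events, state="I"):
    """Apply a sequence of events to one cache line; return (final state, bus transactions)."""
    bus = 0
    for event in events:
        state, needs_bus = MESI[(state, event)]
        bus += needs_bus
    return state, bus

final_state, bus_transactions = run(["RME", "WH", "SHR", "SHW"])
```

The example line goes I, E, M, S, I and costs two bus transactions from this cache: the initial fill and the push-out when another PE reads the modified line.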
Coherent Cache - The Cost
• Cache coherency transactions
  • Additional transactions needed
    • Shared
      • Write Hit
        • Other caches must be notified
    • Modified
      • Other PE read
        • Push-out needed
      • Other PE write
        • Push-out needed - writing one word of n-word line
    • Invalid - modified in other cache
      • Read or write
        • Wait for push-out
Clusters
• A bus which is too long becomes slow!
  • eg PCI is limited to 10 TTL loads
• Lots of processors?
  • On the same bus
    • Bus speed must be limited
    • Low communication rate
    • Better to use a single PE!
• Clusters
  • ~8 processors on a bus
Clusters
• 8 cache coherent (CC) processors on a bus
• Interconnect network
• ~100? clusters
Clusters
• Network Interface Unit (NIU)
  • Detects requests for “remote” memory
Clusters
• Message despatched to remote cluster’s NIU
  • Memory Request Message
Clusters - Shared Memory
• Non Uniform Memory Access
  • Access time to memory depends on location!
[Figure annotations: from the PEs in one cluster, "this memory" (in the same cluster) is much closer than "this one" (in another cluster)]
• Worse!
  • NIU needs to maintain cache coherence across the entire machine
Clusters - Maintaining Cache Coherence
• NIU (or equivalent) maintains directory
  • Directory Entries
    • All lines from local memory cached elsewhere
  • NIU software (firmware)
    • Checks memory requests against directory
    • Update directory
    • Send invalidate messages to other clusters
    • Fetch modified (dirty) lines from other clusters
  • Remote memory access cost
    • 100s of cycles!

Directory (Cluster 2)
Address  Status  Clusters
4340     S       1, 3, 8
5260     E       9
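The directory checks listed above can be sketched with the two example entries from cluster 2's table. The function names, the message tuples and the treatment of addresses as plain integers are all assumptions made for illustration; a real NIU would do this in firmware.

```python
# Directory held by cluster 2: lines of its local memory that are cached elsewhere
directory = {
    4340: {"status": "S", "clusters": {1, 3, 8}},
    5260: {"status": "E", "clusters": {9}},
}

def local_write(directory, addr):
    """A PE in the home cluster writes addr: every remote copy must be invalidated."""
    entry = directory.pop(addr, None)
    if entry is None:
        return []                                   # not cached elsewhere, no messages
    return [("invalidate", c, addr) for c in sorted(entry["clusters"])]

def remote_read(directory, addr, cluster):
    """Another cluster asks for addr: fetch a possibly dirty copy, then record the sharer."""
    entry = directory.get(addr)
    if entry is None:
        directory[addr] = {"status": "E", "clusters": {cluster}}
        return []
    if cluster in entry["clusters"]:
        return []                                   # requester is already recorded
    messages = []
    if entry["status"] == "E":                      # sole holder may have modified the line
        messages = [("fetch", c, addr) for c in sorted(entry["clusters"])]
    entry["clusters"].add(cluster)
    entry["status"] = "S"
    return messages

invalidates = local_write(directory, 4340)
fetches = remote_read(directory, 5260, 4)
```

Each message in these lists stands for a network round trip, which is where the cost of 100s of cycles comes from.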
Clusters - “Off the shelf”
• Commercial clusters
  • Provide page migration
    • Make copy of a remote page on the local PE
    • Programmer remains responsible for coherence
  • Don’t provide hardware support for cache coherence (across network)
  • Fully CC machines may never be available!
• Software Systems
  • ....
Shared Memory Systems
• Software Systems eg Treadmarks
  • Provide shared memory on page basis
  • Software
    • detects references to remote pages
    • moves copy to local memory
  • Reduces shared memory overhead
  • Provides some of the shared memory model convenience
    • Without swamping interconnection network with messages
  • Message overhead is too high for a single word!
  • Word basis is too expensive!!
Shared Memory Systems - Granularity
• Granularity
  • Word basis is too expensive!!
  • Sharing data at low granularity
    • Fine grain sharing
    • Access / sharing for individual words
  • Overheads too high
    • Number of messages
    • Message overhead is high for one word
  • Compare
    • Burst access to memory
    • Don’t fetch a single word -
      • Overhead (bus protocol) is too high
      • Amortize cost of access over multiple words
Shared Memory Systems - Granularity
• Coarse Grain Systems
  • Transferring data from cluster to cluster
    • Overhead
      • Messages
      • Updating directory
    • Amortise the overhead over a whole page
      • Lower relative overhead
  • Applies to thread size also
    • Split program into small threads of control
      • Parallel Overhead
        • cost of setting up & starting each thread
        • cost of synchronising at the end of a set of threads
      • Can be more efficient to run a single sequential thread!
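The amortisation argument can be made concrete with a little arithmetic. The cycle counts below are invented for illustration (a fixed 200-cycle message overhead and 2 cycles per word transferred); only the shape of the result matters.

```python
def cost_per_word(words, overhead_cycles=200, cycles_per_word=2):
    """Average cycles per word when one message carries `words` words (hypothetical costs)."""
    return (overhead_cycles + words * cycles_per_word) / words

single_word = cost_per_word(1)       # fine grain: the overhead dominates
whole_page = cost_per_word(1024)     # coarse grain: the overhead is spread over a page
```

With these numbers a single-word transfer costs 202 cycles per word, while a 1024-word page costs about 2.2 cycles per word.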
Coarse Grain Systems
• So far ...
  • Most experiments suggest that fine grain systems are impractical
  • Larger, coarser grain
    • Blocks of data
    • Threads of computation
  • are needed to reduce overall computation time by using multiple processors
• Too fine grain parallel systems
  • can run slower than a single processor!
Parallel Overhead
• Ideal
  • Time = 1/n
• Add Overhead
  • Time > optimal
  • No point in using more than 4 PEs!!
[Figure: execution time (0 to 1.2) versus number of PEs (0 to 12); two curves, "Ideal" and "+Parall O'head"]
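The effect of parallel overhead on execution time can be reproduced with a simple model. The linear overhead term and its coefficient below are assumptions chosen so that the minimum falls at 4 PEs; the data behind the original plot is not available.

```python
def execution_time(n, overhead_per_pe=0.0625):
    """Ideal time 1/n plus an assumed overhead that grows linearly with each extra PE."""
    return 1.0 / n + overhead_per_pe * (n - 1)

times = {n: execution_time(n) for n in range(1, 13)}
best_n = min(times, key=times.get)   # number of PEs giving the shortest execution time
```

Under this model the time falls until n = 4 and then rises again, so adding PEs beyond that point makes the program slower.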
Parallel Overhead
• Shared memory systems
  • Best results if you
    • Share on large block basis, eg page
    • Split program into coarse grain (long running) threads
    • Give away some parallelism to achieve any parallel speedup!
• Coarse grain
  • Data
  • Computation
• There’s parallelism at the instruction level too!
  • The instruction issue unit in a sequential processor is trying to exploit it!
Clusters - Improving multiple PE performance
• Bandwidth to memory
  • Cache reduces dependency on the memory-CPU interface
    • 95% cache hits, so only 5% of memory accesses cross the interface
  • but add
    • a few PEs and
    • a few CC transactions
  • and even if the interface was coping before, it won’t in a multiprocessor system!
• A major bottleneck!
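A rough estimate shows how quickly a shared memory interface saturates. Only the 95% hit rate comes from the slide; the access rate per PE and the bus occupancy per miss are hypothetical figures, and coherence transactions are not counted at all.

```python
def bus_utilisation(n_pes, accesses_per_cycle=0.25, hit_rate=0.95, bus_cycles_per_miss=8):
    """Fraction of bus cycles consumed by cache misses from n_pes processors (assumed rates)."""
    misses_per_cycle = n_pes * accesses_per_cycle * (1.0 - hit_rate)
    return misses_per_cycle * bus_cycles_per_miss

one_pe = bus_utilisation(1)       # roughly 10% of the bus
twelve_pes = bus_utilisation(12)  # demand exceeds what the bus can supply
```

With these assumptions a single PE keeps the interface about 10% busy, and around ten PEs would saturate it even before any CC transactions are added.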
Clusters - Improving multiple PE performance
• Bus protocols add to access time
  • Request / Grant / Release phases needed
• “Point-to-point” is faster!
  • Cross-bar switch interface to memory
  • No PE contends with any other for the common bus
• Cross-bar?
  • Name taken from old telephone exchanges!
Clusters - Memory Bandwidth
• Modern Clusters
  • Use “Point-to-point” X-bar interfaces to memory to get bandwidth!
• Cache coherence?
  • Now really hard!!
  • How does each cache snoop all transactions?
Programming Model
• Distributed Memory
  • Message passing
  • Alternative to shared memory
  • Each PE has own address space
  • PEs communicate with messages
  • Messages provide synchronisation
    • PE can block or wait for a message
Programming Model - Distributed Memory
• Distributed Memory Systems
  • Hardware is simple!
  • Network can be as simple as ethernet
  • Networks of Workstations model
    • Commodity (cheap!) PEs
    • Commodity Network
      • Standard
        • Ethernet
        • ATM
      • Proprietary
        • Myrinet
        • Achilles (UWA!)
Programming Model - Distributed Memory
• Distributed Memory Systems
  • Software is considered harder
  • Programmer responsible for
    • Distributing data to individual PEs
    • Explicit Thread control
      • Starting, stopping & synchronising
  • At least two commonly available systems
    • Parallel Virtual Machine (PVM)
    • Message Passing Interface (MPI)
  • Built on two operations
    • Send data, destPE, block | don’t block
    • Receive data, srcPE, block | don’t block
    • Blocking ensures synchronisation
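The two operations can be imitated with standard-library queues, one inbox per PE, and threads standing in for PEs. This is not the PVM or MPI API: `send`, `receive` and `mailboxes` are hypothetical names, sends never block, and `receive` does not filter by source PE as the slide's Receive would.

```python
import queue
import threading

N_PES = 3
mailboxes = [queue.Queue() for _ in range(N_PES)]   # one inbox per PE

def send(data, dest_pe):
    mailboxes[dest_pe].put(data)                    # non-blocking send

def receive(my_pe, block=True):
    try:
        return mailboxes[my_pe].get(block=block)    # blocking receive waits for a message
    except queue.Empty:
        return None                                 # non-blocking receive found nothing

results = {}

def worker(rank):
    chunk = receive(rank)                # blocks until PE0 has distributed the data
    send((rank, sum(chunk)), 0)          # return the partial sum to PE0

def master():
    data = list(range(10))
    send(data[:5], 1)                    # explicit data distribution
    send(data[5:], 2)
    for _ in range(2):
        rank, partial_sum = receive(0)   # blocking here synchronises with the workers
        results[rank] = partial_sum

threads = [threading.Thread(target=master)]
threads += [threading.Thread(target=worker, args=(r,)) for r in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No locks or barriers appear anywhere: the blocking receives are the only synchronisation.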
Programming Model - Distributed Memory
• Distributed Memory Systems
  • Performance generally better (versus shared memory)
    • Shared memory has hidden overheads
      • Grain size poorly chosen
        • eg data doesn’t fit into pages
      • Unnecessary coherence transactions
        • Updating a shared region (each page) before end of computation
        • MP system waits and updates page when computation is complete
Programming Model - Distributed Memory
• Distributed Memory Systems
  • Performance generally better (versus shared memory)
  • False sharing
    • Severely degrades performance
    • May not be apparent on superficial analysis
[Figure: one memory page; PEa accesses one part of the page and PEb another; the whole page ping-pongs between PEa and PEb]
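The ping-pong effect can be counted with a toy model of page-based sharing in which a page lives with whichever PE touched it last. The page size and the function name `page_transfers` are assumptions for illustration.

```python
PAGE_SIZE = 4096   # assumed page size in bytes

def page_transfers(accesses):
    """Count how often a page has to move, given a sequence of (pe, address) writes."""
    owner = {}                         # page number -> PE currently holding the page
    moves = 0
    for pe, addr in accesses:
        page = addr // PAGE_SIZE
        if owner.get(page, pe) != pe:  # page is held by the other PE: it must migrate
            moves += 1
        owner[page] = pe
    return moves

# PEa and PEb update different variables that happen to sit in the same page
same_page = [("PEa", 0), ("PEb", 2048)] * 100
# The same access pattern with PEb's variable placed in a different page
separate_pages = [("PEa", 0), ("PEb", 4096)] * 100

false_sharing_moves = page_transfers(same_page)
no_sharing_moves = page_transfers(separate_pages)
```

The two PEs never touch the same data, yet in the first layout the page moves on almost every access, while in the second it never moves.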
Distributed Memory - Summary
• Simpler (almost trivial) hardware
• Software
  • More programmer effort
  • Explicit data distribution
  • Explicit synchronisation
• Performance generally better
  • Programmer knows more about the problem
  • Communicates only when necessary
  • Communication grain size can be optimum
  • Lower overheads
Data Flow
• Conventional programming models are control driven
  • Instruction sequence is precisely specified
  • Sequence specifies control
    • which instruction the CPU will execute next
  • Execution rule:
    • Execute an instruction when its predecessor has completed
        s1: r = a*b;
        s2: s = c*d;
        s3: y = r + s;
    • s2 executes when s1 is complete
    • s3 executes when s2 is complete
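Data Flow
• Consider the calculation
  • y = a*b + c*d
• Represent it by a graph
  • Nodes represent computations
  • Data flows along arcs
• Execution rule:
  • Execute an instruction when its data is available
  • Data driven rule
[Figure: dataflow graph; a and b feed one multiply node, c and d feed another, and both products feed an add node that produces y]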
Data Flow
• Dataflow firing rule
  • An instruction fires (executes) when its data is available
  • Exposes all possible parallelism
    • Either multiplication can fire as soon as data arrives
    • Addition must wait
  • Data dependence analysis!
  • Instruction issue units:
    • Fire (issue) each instruction when its operands (registers) have been written
[Figure: the same dataflow graph for y = a*b + c*d]
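The firing rule can be demonstrated with a tiny interpreter for the graph of y = a*b + c*d. The node names r and s follow the earlier control-driven listing; the function `run` and the token format are hypothetical, and the nodes fire one at a time here rather than truly in parallel.

```python
import operator

# node -> (operation, names of its inputs); mirrors y = a*b + c*d
GRAPH = {
    "r": (operator.mul, ("a", "b")),
    "s": (operator.mul, ("c", "d")),
    "y": (operator.add, ("r", "s")),
}

def run(graph, arrivals):
    """Feed (name, value) tokens in arrival order; return (all values, firing order)."""
    values, fired = {}, []
    for name, value in arrivals:
        values[name] = value
        progress = True
        while progress:                 # fire every node whose data is now available
            progress = False
            for node, (op, inputs) in graph.items():
                if node not in values and all(i in values for i in inputs):
                    values[node] = op(*(values[i] for i in inputs))
                    fired.append(node)
                    progress = True
    return values, fired

# c and d arrive first, so the second multiplication fires before the first
values, fired = run(GRAPH, [("c", 4), ("d", 5), ("a", 2), ("b", 3)])
```

The firing order depends only on when data arrives, not on program order: s fires first, then r, and the addition waits until both products exist.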