Chap. 2 part 2
CIS*3090 Fall 2016
Fall 2016 CIS*3090 Parallel Programming 1
Why develop abstract models of computer architectures?
Don’t want to treat every computer as a special case for SW development!
Programming for speed is not just “hacking”; it’s based on algorithms
Have math basis for measuring time/space complexity of algorithms (Big-O notation)
Vast research into efficient algorithms
What exactly is Big-O counting?
Counts abstract “instructions,” like pseudocode
Assumption is that counting instructions is close enough to reality (=machine code)
Comes from traditional “RAM” model
“Any mem. location – whether inst. or data – can be referenced (read or written) in ‘unit’ time without regard to its location”
Memory is effectively unbounded
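What Big-O counting looks like in code: a sketch (function name mine) that tallies abstract unit-time operations while summing an array. Under the RAM model every memory reference costs the same, so the count grows linearly in n regardless of where the data lives.

```c
#include <stddef.h>

/* Sum n array elements while counting abstract RAM "instructions".
 * Under the RAM model every memory reference is unit time, so the
 * instruction count is ~2n: O(n). */
long sum_with_count(const int *a, size_t n, size_t *insts)
{
    long sum = 0;
    *insts = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];     /* one read + one add */
        *insts += 2;     /* two abstract unit-time operations */
    }
    return sum;
}
```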
Is RAM model accurate?
Defects
RAM model can’t account for caching and paging effects, which undermine the “unit time” assumption
VM illusion of unbounded memory
It’s “close enough”
If you care, you can program with cache/paging in mind…
And now your code may be computer- or OS-dependent!
Extending RAM to PRAM
Simple enhancements
Instead of one inst. execution unit, now N executing in lock step (as if global clock)
Superficially looks like multicore system
Still single (global) memory
Every PE still accesses it in unit time, as before
All PEs immediately see any change in the memory image
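A sequential sketch (function name mine) of how a PRAM-style parallel sum would run: imagine one PE per pair of elements, all stepping in lock step over a single global array. Each round of the outer loop is one synchronous “tick”; within a round every virtual PE does a unit-time read-add-write, so the whole sum finishes in ~log2(n) ticks.

```c
#include <stddef.h>

/* Simulated PRAM reduction: in each lock-step round, virtual PE i
 * handles index 2*stride*i, folding its partner element into its
 * own.  ~log2(n) rounds total; global[0] ends up holding the sum. */
int pram_sum(int *global, size_t n)
{
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* one synchronous round: all PEs "execute" together */
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            global[i] += global[i + stride];
    }
    return global[0];
}
```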
Defects in PRAM model
Simultaneous memory access issues: what happens on R+R, R+W, W+W to the same location?
Solved by different rules in PRAM variants
Fatal weaknesses
“Close enough” for dual core (since that’s what it approximates)
As no. of cores grows in SMP, HW cannot maintain unit-time uniform mem. image
For non-SMP (cluster), non-local memory orders of mag. slower than local memory
Unfortunate consequences
PRAM model not good at predicting performance of parallel algos!
Intelligent designers may be “led astray”
Valiant’s Algo. example
Naïve parallel programmers may unconsciously adopt this model
Helps explain why naïve parallel programs don’t give scalable performance unless by luck (chap. 1)
Better parallel model: CTA
Candidate Type Architecture
More complex than PRAM, but still much simpler than any real computer
Key characteristic: distinguishes local vs. non-local memory accesses
Allows you to differentiate the costs on a particular system
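The CTA cost model in one line of code (function name and λ value mine, for illustration): local refs cost 1 “tick”, non-local refs cost λ ticks. With λ in the thousands (Table 2.1), even a handful of non-local refs can dominate total time.

```c
/* Estimate CTA execution cost: local refs at unit cost, non-local
 * refs at lambda each.  lambda is per-architecture (see Table 2.1);
 * the value used in any analysis is an estimate, not a constant. */
long cta_cost(long local_refs, long nonlocal_refs, long lambda)
{
    return local_refs * 1 + nonlocal_refs * lambda;
}
```

For example, with λ = 100, just 10 non-local refs cost as much as 1000 local ones.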
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Figure 2.10
CTA components (Fig 2.10)
Pi = a single “RAM”-model PE: processor + local memory + network interface
Ps might be all equal, or one (P0) might be a “controller”
Local memory is for program and data
What is “node degree”?
Max. no. of nodes any node can directly comm. with at one time; could be just 1
Degree = a measure of network interconnectedness and effective bandwidth
No global memory!
P has two kinds of memory refs.
Own local memory
Non-local ref. over the network
But actual dual core and SMP processors do have global mem!
Will need to represent it within the model (stay tuned)
Recognizing memory ref. costs
Memory ref. time = “latency”
Local refs. considered “unit time” (1), like RAM model
Non-local refs symbolized with lambda (λ), can be measured and averaged (Tab 2.1)
λ increases with no. of P (but less than linearly)
Table 2.1 Estimates for λ for common architectures;
speeds generally do not include congestion or other
traffic delays.
One benefit of CTA model
Yields “Locality Rule” for fast programs
KEY POINT: maximize local memory refs., minimize non-local
Ergo, more efficient even to do redundant local calculations than to make many non-local refs.
Nice practical application for stochastic calculation (p53)
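The Locality Rule in numbers (function names and the example figures are mine): compare the wall-clock cost of “one PE computes, the other P−1 fetch the result non-locally” against “every PE redundantly recomputes it locally”. Whenever the computation is shorter than λ ticks, redundant local work wins.

```c
/* Strategy A: one PE computes, everyone else fetches (paying lambda
 * each, serialized here as a pessimistic upper bound). */
long share_cost(long compute_insts, long lambda, long p)
{
    return compute_insts + (p - 1) * lambda;
}

/* Strategy B: all PEs recompute in parallel; wall-clock cost is
 * just the local computation, with zero non-local refs. */
long recompute_cost(long compute_insts, long p)
{
    (void)p;              /* PEs work simultaneously */
    return compute_insts;
}
```

With a 500-instruction computation, λ = 5000, and 8 PEs, recomputing costs 500 ticks while sharing costs 35500, a 70x difference.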
How MIMD machines do non-local memory references
3 common mechanisms
Shared global memory (typical SMP)
One-sided communication
Message passing (typical HPC cluster)
CTA model needs to account for these
(1) Shared memory
CTA has no “shared memory” per se
Real meaning: Pi doesn’t have ref. in local cache already, must get from elsewhere
Incurs λ latency, in practice includes overhead of cache coherency mechanism, memory bus
Programmer’s standpoint
Convenient: read/write global variables
Tricky: still need sync. conventions (mutex)
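A minimal pthreads sketch of the “convenient but tricky” combination above (names mine): threads read/write one global variable, and every access is bracketed by a mutex, the sync convention the slide warns is still needed.

```c
#include <pthread.h>

/* Shared-memory style: a global variable visible to all threads,
 * protected by a mutex so concurrent increments don't race. */
static long shared_total = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_many(void *arg)
{
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&lock);    /* serialize the update */
        shared_total++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Spawn nthreads (<= 16) incrementers of iters each, return total. */
long run_shared_counter(int nthreads, long iters)
{
    pthread_t t[16];
    shared_total = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, add_many, &iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return shared_total;
}
```

Without the lock/unlock pair the final total would be unpredictable; with it, the result is always nthreads × iters.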
(2) One-sided communication
“Poor man’s” shared memory
Cray calls it “shmem”
Pi can get or put any location = explicitly transferring data between shared address space and local memory @ λ cost
No HW keeps each P’s view of a location coherent → cheaper to build, smaller λ
“Private” addresses reserved for each P
As with shared mem, need sync conventions
Easier to debug?
(3) Message passing
Primitive operations: send and receive
Considered “2-sided” since needs HW and/or OS cooperation at source & destination P’s
“Easier to debug” due to explicit comm.
Communication and synchronization are combined automatically vs. needing separate APIs
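Message passing in miniature, sketched with a POSIX pipe (names mine): the receive blocks until the matching send has delivered data, so communication and synchronization arrive bundled, exactly the property the slide describes.

```c
#include <unistd.h>

/* A one-direction "channel" built on a POSIX pipe.  chan_recv()
 * blocks until chan_send() has written data: the receive is both
 * a data transfer and a synchronization point. */
static int chan[2];   /* chan[0] = read end, chan[1] = write end */

int chan_init(void)
{
    return pipe(chan);            /* 0 on success */
}

void chan_send(int value)
{
    ssize_t n = write(chan[1], &value, sizeof value);
    (void)n;                      /* sketch: ignore short writes */
}

int chan_recv(void)
{
    int v = 0;
    ssize_t n = read(chan[0], &v, sizeof v);  /* blocks until data */
    (void)n;
    return v;
}
```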
Most popular mechanisms
For multicore/SMP: shared memory
For clusters: message passing
And… can build virtual layer for either!
Effect of shared memory implemented with message passing
Send/receive implemented via shared mem.
Occurs anyway when MPI processes happen to be on same node’s cores
Underlying HW used for non-local memory refs.
Bus, we know
Good: direct connect between origin/dest.
Bad: only one comm. at a time (serialized)
Crossbar, saw with SunFire
Good: direct connect and multiple comm.
Bad: most expensive, n² HW cost
Message forwarding (packets)
Good: less HW, multiple comm. possible
Bad: multiple “hops” pile up latency
Another wrinkle: “consistency”
Coherence = HW mechanism to sync views of multiple L1 caches
Saw MESI protocol, popular choice
When the coherence mechanism operates is determined by the consistency policy
Reducing sync operations…
Minimizes bus congestion
Makes attaching more cores practical before hitting memory bandwidth wall
Memory consistency models
Sequential (what we’re used to): if P1 sets X before P2 reads, P2 will get the new X
Guaranteed at cost of HW overhead
Relaxed: if P1 sets X before P2 reads, P2 may get the new X
Why? HW can be faster if P1’s store is buffered, lazily propagated to memory
Careful: e.g. louses up Dekker’s algo (p56)
For special occasions, HW provides a test-and-swap (atomic read-modify-write) instruction
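C11 exposes exactly this kind of atomic instruction via `<stdatomic.h>`; a sketch of a spinlock built on `atomic_flag_test_and_set`, which under the default seq_cst ordering also acts as the synchronization point relaxed hardware otherwise denies you:

```c
#include <stdatomic.h>

/* A minimal spinlock.  atomic_flag_test_and_set atomically sets the
 * flag and returns its OLD value, so the loop exits only for the one
 * caller that observed it clear. */
static atomic_flag busy = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    while (atomic_flag_test_and_set(&busy))
        ;  /* spin: someone else holds the lock */
}

void spin_unlock(void)
{
    atomic_flag_clear(&busy);
}
```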
Dekker’s Algorithm for Mutual Exclusion (wikipedia)
flag[0] := false   // flag[i] true => Pi wants to enter its critical section
flag[1] := false
turn := 0          // or 1 => whose turn it is in case both flags are true

p0: flag[0] := true
    while flag[1] = true {
        if turn ≠ 0 {
            flag[0] := false
            while turn ≠ 0 { }   // busy-wait for our turn
            flag[0] := true
        }
    }
    // critical section
    ...
    turn := 1
    flag[0] := false
    // remainder section

p1: flag[1] := true
    while flag[0] = true {
        if turn ≠ 1 {
            flag[1] := false
            while turn ≠ 1 { }   // busy-wait for our turn
            flag[1] := true
        }
    }
    // critical section
    ...
    turn := 0
    flag[1] := false
    // remainder section
Danger sequence, assuming processes are preemptible and stores are
lazily propagated (relaxed consistency):

p0: flag[0] := true          // store buffered, not yet visible to p1
p0: while flag[1] = true {   // false so far, but...
    <interrupt / reschedule>
p1: flag[1] := true
p1: while flag[0] = true { } // stale read: flag[0] still looks false
p1: // critical section -- p1 enters
    <interrupt / reschedule>
p0: }                        // stale read: flag[1] still looks false
p0: // critical section -- p0 enters too: mutual exclusion broken!
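Dekker’s algorithm can be rescued on relaxed-consistency hardware by making the shared variables C11 atomics: the default seq_cst ordering restores exactly the sequential consistency the algorithm assumes. A sketch (driver names and iteration count mine):

```c
#include <stdatomic.h>
#include <pthread.h>

/* Dekker's algorithm with seq_cst atomics.  With plain ints, the
 * buffered store of flag[self] might not be visible to the peer
 * (the "danger sequence"); _Atomic forbids that reordering. */
static _Atomic int flag[2];
static _Atomic int turn;
static long counter;                 /* protected by the lock */

static void dekker_lock(int self)
{
    int other = 1 - self;
    atomic_store(&flag[self], 1);
    while (atomic_load(&flag[other])) {
        if (atomic_load(&turn) != self) {
            atomic_store(&flag[self], 0);
            while (atomic_load(&turn) != self)
                ;                    /* busy-wait for our turn */
            atomic_store(&flag[self], 1);
        }
    }
}

static void dekker_unlock(int self)
{
    atomic_store(&turn, 1 - self);
    atomic_store(&flag[self], 0);
}

static void *bump(void *arg)
{
    int self = *(int *)arg;
    for (int i = 0; i < 50000; i++) {
        dekker_lock(self);
        counter++;                   /* critical section */
        dekker_unlock(self);
    }
    return NULL;
}

/* Two threads, 50000 protected increments each. */
long run_dekker(void)
{
    pthread_t t0, t1;
    int id0 = 0, id1 = 1;
    counter = 0;
    pthread_create(&t0, NULL, bump, &id0);
    pthread_create(&t1, NULL, bump, &id1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return counter;
}
```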
So our computers have sequential consistency, right?
No! Modern architectures…
Actually employ relaxed consistency
Make it look (to programmer) like sequential consistency
But! programmer has to play by the rules
Means careful use of mutexes for global variables is essential!
Lock/unlock secretly use memory “fence” to activate cache coherence mechanism for sync
Benefit of CTA model
Apply it back to “count 3s”
Since the single global count variable requires non-local refs., incurring λ, avoid it in favour of local count variables (e.g. on the stack)
Steers us to best solution by taking memory ref. time into account!
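The CTA-guided “count 3s” in pthreads form (names mine): each thread tallies into a stack-local counter at unit cost and touches the shared total exactly once at the end, paying λ once per thread instead of once per element.

```c
#include <pthread.h>
#include <stddef.h>

struct span { const int *a; size_t lo, hi; };

static long total;                   /* the one shared variable */
static pthread_mutex_t tlock = PTHREAD_MUTEX_INITIALIZER;

static void *count3s(void *arg)
{
    struct span *s = arg;
    long local = 0;                  /* stack-local: unit-cost refs */
    for (size_t i = s->lo; i < s->hi; i++)
        if (s->a[i] == 3)
            local++;
    pthread_mutex_lock(&tlock);      /* one non-local update, total */
    total += local;
    pthread_mutex_unlock(&tlock);
    return NULL;
}

/* Split a[0..n) across nthreads (<= 8) and count the 3s. */
long count3s_parallel(const int *a, size_t n, int nthreads)
{
    pthread_t t[8];
    struct span s[8];
    size_t chunk = (n + nthreads - 1) / nthreads;
    total = 0;
    for (int i = 0; i < nthreads; i++) {
        s[i].a = a;
        s[i].lo = (size_t)i * chunk;
        s[i].hi = s[i].lo + chunk < n ? s[i].lo + chunk : n;
        pthread_create(&t[i], NULL, count3s, &s[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return total;
}
```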
Figure 1.7 The first try at Count 3s
Model’s analysis not quite accurate
Points up problem of global variable because of high-cost non-local ref.
Major problem here was thread contention for single global variable + its lock
OK, “model” is supposed to abstract away low-level details of real computer
As long as it leads you to the right conclusion (even for wrong reason), it’s doing its job!