
Chap. 2 part 2

CIS*3090 Fall 2016


Why develop abstract models of computer architectures?

Don’t want to treat every computer as a special case for SW development!

Programming for speed is not just "hacking"; it's based on algorithms

Have math basis for measuring time/space complexity of algorithms (Big-O notation)

Vast research into efficient algorithms


What exactly is Big-O counting?

Counts abstract “instructions,” like pseudocode

Assumption: counting abstract instructions is close enough to counting real machine code (see the sketch below)

Comes from traditional “RAM” model

“Any mem. location – whether inst. or data – can be referenced (read or written) in ‘unit’ time without regard to its location”

Memory is effectively unbounded
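
As a sketch of what counting abstract instructions looks like (my example, not from the slides): summing an array costs roughly 1 initialization + (n+1) compares + n increments + n adds, about 3n + 2 "instructions" in all, hence O(n); the RAM model charges each a[i] access one unit no matter where it lives.

/* ~3n + 2 abstract instructions => O(n); every a[i] reference
   is "unit time" under the RAM model. */
long sum(const long *a, int n) {
    long s = 0;                      /* 1 instruction */
    for (int i = 0; i < n; i++)      /* n+1 compares, n increments */
        s += a[i];                   /* n loads + n adds */
    return s;
}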


Is RAM model accurate?

Defects

RAM model can't account for caching and paging effects, which break the "unit time" assumption

The "unbounded memory" is really a VM illusion (and paging is far from unit time)

It’s “close enough”

If you care, you can program with cache/paging in mind…

And now your code may be computer- or OS-dependent!


Extending RAM to PRAM

Simple enhancements

Instead of one inst. execution unit, now N executing in lock step (as if on a global clock; sketch below)

Superficially looks like multicore system

Still single (global) memory

Every PE still accesses it in unit time, as before

All PEs immediately see any change in the memory image
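
A sequential sketch of the lock-step idea (my illustration): N values are summed by tree reduction, the outer loop playing the global clock and the inner loop playing all PEs acting simultaneously, so the sum finishes in log2(N) ticks.

#include <stdio.h>

/* Simulate a PRAM tree-reduction sum on 8 "PEs". */
int main(void) {
    long a[8] = {3, 1, 4, 1, 5, 9, 2, 6};           /* one value per PE */
    int n = 8;
    for (int stride = 1; stride < n; stride *= 2)   /* one global clock tick */
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];   /* PE i reads a[i+stride] in unit time */
    printf("sum = %ld\n", a[0]);     /* log2(8) = 3 ticks total */
    return 0;
}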


Defects in PRAM model

Simultaneous memory access issues: what happens on R+R, R+W, W+W to the same location?

Solved by different rules in PRAM variants (EREW, CREW, CRCW)

Fatal weaknesses

“Close enough” for dual core (since that’s what it approximates)

As no. of cores grows in SMP, HW cannot maintain unit-time uniform mem. image

For non-SMP (cluster), non-local memory orders of mag. slower than local memory


Unfortunate consequences

PRAM model not good at predicting performance of parallel algos!

Intelligent designers may be “led astray”

Valiant’s Algo. example

Naïve parallel programmers may unconsciously adopt this model

Helps explain why naïve parallel programs don’t give scalable performance unless by luck (chap. 1)


Better parallel model: CTA

Candidate Type Architecture

More complex than PRAM, but still much simpler than any real computer

Key characteristic: distinguishes local vs. non-local memory accesses

Allows you to differentiate the costs on a particular system


Figure 2.10: the CTA model

CTA components (Fig 2.10)

Pi = a single "RAM"-model PE: processor + local memory + network interface

P’s might be all equal, or one (P0) might be “controller”

Local memory is for program and data

What is “node degree”?

Max. no. of nodes any node can directly comm. with at one time; could be just 1 (e.g., each node in a ring network has degree 2)

Degree = a measure of network interconnectedness and effective bandwidth


No global memory!

P has two kinds of memory refs.

Own local memory

Non-local ref. over the network

But actual dual core and SMP processors do have global mem!

Will need to represent it within the model (stay tuned)


Recognizing memory ref. costs

Memory ref. time = “latency”

Local refs. considered “unit time” (1), like RAM model

Non-local refs symbolized with lambda (λ), can be measured and averaged (Tab 2.1)

λ increases with no. of P (but less than linearly)
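
For illustration (λ value hypothetical): with λ = 100, an algorithm making 10^6 local refs and 10^4 non-local refs costs 10^6 + 10^4 × 100 = 2 × 10^6 time units, so half its time goes to communication even though non-local refs are only 1% of all refs.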


Table 2.1: Estimates for λ for common architectures; speeds generally do not include congestion or other traffic delays.

One benefit of CTA model

Yields “Locality Rule” for fast programs

KEY POINT: maximize local memory refs., minimize non-local

Ergo, more efficient even to do redundant local calculations than to make many non-local refs.

Nice practical application for stochastic calculation (p53)
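
A back-of-envelope sketch of the rule (all costs hypothetical, for illustration only):

/* Hypothetical costs, for illustration only. */
#define LAMBDA        500  /* cycles per non-local reference        */
#define LOCAL_RECALC   50  /* cycles to recompute the value locally */

/* Fetch a neighbour's precomputed value (1 non-local ref), or redundantly
   recompute it from data we already hold? Here 50 < 500, so recomputing
   wins by 10x despite the "wasted" duplicate work. */
int cheaper_to_recompute(void) { return LOCAL_RECALC < LAMBDA; }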


How MIMD machines do non-local memory references

3 common mechanisms

Shared global memory (typical SMP)

One-sided communication

Message passing (typical HPC cluster)

CTA model needs to account for these


(1) Shared memory

CTA has no “shared memory” per se

Real meaning: if Pi doesn't already have the ref. in its local cache, it must get it from elsewhere

Incurs λ latency; in practice this includes the overhead of the cache-coherency mechanism and memory bus

Programmer’s standpoint

Convenient: read/write global variables

Tricky: still need sync. conventions (mutex)
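
A minimal Pthreads sketch of these two points (names are mine): the global variable is convenient to reach from any thread, but every update still needs the mutex convention.

#include <pthread.h>
#include <stdint.h>

long total = 0;                                   /* shared global variable */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    long my_part = (intptr_t)arg;   /* purely local work: unit-time refs */
    pthread_mutex_lock(&lock);      /* sync convention: guard the global */
    total += my_part;               /* non-local ref., costs ~λ          */
    pthread_mutex_unlock(&lock);
    return NULL;
}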


(2) One-sided communication

“Poor man’s” shared memory

Cray calls it "shmem"

Pi can get or put any location, i.e., explicitly transfer data between the shared address space and local memory at λ cost

No HW keeps each P's view of a location coherent, so it's cheaper to build, with smaller λ

“Private” addresses reserved for each P

As with shared mem, need sync conventions

Easier to debug?
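
A sketch using MPI's one-sided API as one realization of get/put (Cray's shmem is another); run with at least two ranks. Rank 0 deposits a value directly into rank 1's exposed memory, and rank 1 never calls a receive.

#include <mpi.h>

int main(int argc, char **argv) {
    long cell = 0, payload = 42;
    MPI_Win win;
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&cell, sizeof cell, sizeof cell,   /* expose "cell"   */
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);                            /* sync convention */
    if (rank == 0)                                    /* put straight    */
        MPI_Put(&payload, 1, MPI_LONG,                /* into rank 1's   */
                1, 0, 1, MPI_LONG, win);              /* local memory    */
    MPI_Win_fence(0, win);             /* after this, rank 1's cell = 42 */
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}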

(3) Message passing

Primitive operations: send and receive

Considered "2-sided" since it needs HW and/or OS cooperation at both source and destination P's

“Easier to debug” due to explicit comm.

Communication and synchronization are combined automatically vs. needing separate APIs
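
A minimal MPI sketch of the 2-sided nature (run with two ranks): both ends participate, and the blocking receive delivers communication and synchronization in one call.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    long x = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                    /* sender cooperates...        */
        MPI_Send(&x, 1, MPI_LONG, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {             /* ...and so does the receiver */
        MPI_Recv(&x, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %ld\n", x);  /* recv blocks: comm + sync */
    }
    MPI_Finalize();
    return 0;
}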


Most popular mechanisms

For multicore/SMP: shared memory

For clusters: message passing

And… can build virtual layer for either!

Effect of shared memory implemented with message passing

Send/receive implemented via shared mem.

Occurs anyway when MPI processes happen to be on same node’s cores


Underlying HW used for non-local memory refs.

Bus, we know

Good: direct connect between origin/dest.

Bad: only one comm. at a time (serialized)

Crossbar, saw with SunFire

Good: direct connect and multiple comm.

Bad: most expensive, n² HW cost

Message forwarding (packets)

Good: less HW, multiple comm. possible

Bad: multiple "hops" pile up latency

Another wrinkle: “consistency”

Coherence = HW mechanism to sync views of multiple L1 caches

Saw MESI protocol, popular choice

When the coherence mechanism operates is determined by the consistency policy

Reducing sync operations…

Minimizes bus congestion

Makes attaching more cores practical before hitting memory bandwidth wall


Memory consistency models

Sequential (what we're used to): if P1 sets X before P2 reads, P2 will get the new X

Guaranteed at cost of HW overhead

Relaxed: if P1 sets X before P2 reads, P2 may get the new X

Why? HW can be faster if P1’s store is buffered, lazily propagated to memory

Careful: e.g. louses up Dekker’s algo (p56)

For special occasions, HW provides a test-and-swap (atomic) instruction
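
A small C11-atomics sketch of both points (my illustration, standing in for the HW features named above): a relaxed store makes no promise about when other cores see it, while an atomic test-and-set stays correct regardless.

#include <stdatomic.h>

atomic_int X;                            /* shared variable */
atomic_flag busy = ATOMIC_FLAG_INIT;     /* shared lock bit */

void relaxed_writer(void) {
    /* Relaxed store: another core may keep seeing the old X for a while,
       and nearby stores may be reordered around this one. */
    atomic_store_explicit(&X, 1, memory_order_relaxed);
}

void enter_critical(void) {
    /* "Test-and-swap"-style instruction: reads and sets the flag in one
       indivisible step, so it works even under relaxed consistency. */
    while (atomic_flag_test_and_set(&busy))
        ;                                /* spin until we own the flag */
}

void exit_critical(void) {
    atomic_flag_clear(&busy);
}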


Dekker’s Algorithm for Mutual Exclusion (wikipedia)

flag[0] := false   // flag[i] = true means process i wants to enter the critical section
flag[1] := false
turn := 0          // 0 or 1: whose turn it is in case both flags are true

p0:                                 p1:
  flag[0] := true                     flag[1] := true
  while flag[1] = true {              while flag[0] = true {
    if turn ≠ 0 {                       if turn ≠ 1 {
      flag[0] := false                    flag[1] := false
      while turn ≠ 0 { }                  while turn ≠ 1 { }
      flag[0] := true                     flag[1] := true
    }                                   }
  }                                   }
  // critical section                 // critical section
  ...                                 ...
  turn := 1                           turn := 0
  flag[0] := false                    flag[1] := false
  // remainder section                // remainder section

Danger sequence (under relaxed consistency), assuming processes are preemptible:

p0: flag[0] := true               // store sits in p0's write buffer
p0: while flag[1] = true {        // sees flag[1] = false, so falls through
      <interrupt / reschedule>
p1: flag[1] := true
p1: while flag[0] = true { }      // p0's store still not visible: falls through!
p1: // critical section
    ...
      <interrupt / reschedule>
p0: }
p0: // critical section           // both processes are now in the critical
    ...                           // section: mutual exclusion is broken
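
One way to close that window on modern hardware (a sketch of mine using C11 sequentially consistent atomics, not from the slides): making flag and turn atomic forces each store to become visible in program order. Shown for p0's entry; p1 is symmetric.

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];                  /* seq_cst by default: stores   */
atomic_int  turn;                     /* become visible, in order     */

void p0_enter(void) {
    atomic_store(&flag[0], true);     /* can no longer hide in a write buffer */
    while (atomic_load(&flag[1])) {
        if (atomic_load(&turn) != 0) {
            atomic_store(&flag[0], false);
            while (atomic_load(&turn) != 0)
                ;                     /* busy-wait for our turn */
            atomic_store(&flag[0], true);
        }
    }
    /* critical section */
}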


So our computers have sequential consistency, right?

No! Modern architectures…

Actually employ relaxed consistency

Make it look (to programmer) like sequential consistency

But! programmer has to play by the rules

Means careful use of mutexes for global variables is essential!

Lock/unlock secretly use a memory "fence" to activate the cache-coherence mechanism for sync


Benefit of CTA model

Apply it back to “count 3s”

Since a single global count variable requires non-local refs. and incurs λ, avoid it in favour of local count variables (e.g., on the stack)

Steers us to best solution by taking memory ref. time into account!
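
A sketch of where the model steers us (my Pthreads rendering; the input array is assumed defined elsewhere): each thread counts 3s into a stack-local variable and touches the locked global exactly once.

#include <pthread.h>

extern int array[];                  /* shared input, defined elsewhere */
long total = 0;                      /* the one global, guarded below   */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

struct range { int lo, hi; };        /* a thread's slice of the array */

void *count3s(void *arg) {
    struct range *r = arg;
    long count = 0;                  /* local (stack): unit-time refs */
    for (int i = r->lo; i < r->hi; i++)
        if (array[i] == 3)
            count++;                 /* no λ cost, no lock contention */
    pthread_mutex_lock(&lock);       /* one non-local update per thread, */
    total += count;                  /* not one per element              */
    pthread_mutex_unlock(&lock);
    return NULL;
}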


Figure 1.7: The first try at Count 3s

Model’s analysis not quite accurate

Points up problem of global variable because of high-cost non-local ref.

Major problem here was thread contention for single global variable + its lock

OK, “model” is supposed to abstract away low-level details of real computer

As long as it leads you to the right conclusion (even for wrong reason), it’s doing its job!
