Transcript of Chap. 2 part 2 (gardnerw/courses/cis3090/lectures/ch2-2.pdf)

Page 1:

Chap. 2 part 2

CIS*3090 Fall 2016


Page 2:

Why develop abstract models of computer architectures?

Don’t want to treat every computer as a special case for SW development!

Programming for speed is not just “hacking”; it’s based on algorithms

Have math basis for measuring time/space complexity of algorithms (Big-O notation)

Vast research into efficient algorithms


Page 3:

What exactly is Big-O counting?

Counts abstract “instructions,” like pseudocode

Assumption is that counting abstract instructions is close enough to reality (i.e., machine code)

Comes from traditional “RAM” model

“Any mem. location – whether inst. or data – can be referenced (read or written) in ‘unit’ time without regard to its location”

Memory is effectively unbounded
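To make instruction counting concrete, here is a minimal C sketch (mine, not the slides'): a linear scan whose unit-cost steps are tallied explicitly, the way a Big-O analysis would count them. The function name and step counter are illustrative only.

#include <stdio.h>

/* Count occurrences of key in a[0..n-1], tallying the abstract
 * "unit-cost" steps a Big-O analysis charges: under the RAM model,
 * every memory reference and comparison takes the same unit time. */
int count_key(const int *a, int n, int key, long *steps)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        (*steps)++;            /* one abstract step per element */
        if (a[i] == key)
            count++;
    }
    return count;              /* steps grows as n, hence O(n) */
}

int main(void)
{
    int a[] = {3, 1, 3, 7, 3};
    long steps = 0;
    int c = count_key(a, 5, 3, &steps);
    printf("count=%d steps=%ld\n", c, steps);   /* count=3 steps=5 */
    return 0;
}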


Page 4:

Is RAM model accurate?

Defects

RAM model can’t account for caching and paging effects, which undermine the “unit time” assumption

VM illusion of unbounded memory

It’s “close enough”

If you care, you can program with cache/paging in mind…

And now your code may be computer- or OS-dependent!
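For instance, here is a hedged illustration (not from the slides) of the “unit time” assumption failing: both loops below perform the same 16M additions over the same array, yet the row-major version is typically several times faster because it walks consecutive addresses that share cache lines.

#include <stdio.h>

#define N 4096

static int m[N][N];

/* Row-major traversal: consecutive addresses, cache-friendly. */
long sum_rows(void)
{
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: strides of N ints, so nearly every access
 * misses the cache -- identical work, very different running time. */
long sum_cols(void)
{
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void)
{
    printf("%ld %ld\n", sum_rows(), sum_cols());  /* time these separately */
    return 0;
}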


Page 5:

Extending RAM to PRAM

Simple enhancements

Instead of one inst. execution unit, now N executing in lockstep (as if driven by a global clock)

Superficially looks like multicore system

Still single (global) memory

Every PE still accesses it in unit time, as before

All PEs immediately see any change in the memory image
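A minimal sketch of a PRAM-style algorithm (my sequential simulation, not from the slides): N PEs advance in lockstep rounds over one shared array, and in each round PE i adds in the element one stride away. Since the model charges every access unit time, N PEs sum N values in O(log N).

#include <stdio.h>

#define N 8   /* number of PEs == number of elements (power of two) */

int main(void)
{
    int a[N] = {5, 2, 7, 1, 4, 9, 3, 6};
    /* Each outer iteration is one lockstep round of the global clock;
     * the inner loop plays the role of the PEs acting simultaneously. */
    for (int stride = 1; stride < N; stride *= 2)
        for (int i = 0; i < N; i += 2 * stride)
            a[i] += a[i + stride];     /* each PE touches distinct cells */
    printf("sum = %d\n", a[0]);        /* 37, after log2(N) = 3 rounds */
    return 0;
}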


Page 6:

Defects in PRAM model

Simultaneous memory access issues: R+R, R+W, W+W?

Solved by different rules in PRAM variants

Fatal weaknesses

“Close enough” for dual core (since that’s what it approximates)

As no. of cores grows in SMP, HW cannot maintain unit-time uniform mem. image

For non-SMP (cluster), non-local memory orders of mag. slower than local memory


Page 7:

Unfortunate consequences

PRAM model not good at predicting performance of parallel algos!

Intelligent designers may be “led astray”

Valiant’s Algo. example

Naïve parallel programmers may unconsciously adopt this model

Helps explain why naïve parallel programs don’t give scalable performance unless by luck (chap. 1)


Page 8:

Better parallel model: CTA

Candidate Type Architecture

More complex than PRAM, but still much simpler than any real computer

Key characteristic: distinguishes local vs. non-local memory accesses

Allows you to differentiate the costs on a particular system


Page 9:

Figure 2.10 (© 2009 Pearson Education, Inc., Pearson Addison-Wesley)

Page 10:

CTA components (Fig 2.10)

Pi = a single “RAM”-model PE: processor + local memory + network interface

P’s might be all equal, or one (P0) might be “controller”

Local memory is for program and data

What is “node degree”?

Max. no. of nodes any node can directly comm. with at one time; could be just 1

Degree = a measure of network interconnectedness and effective bandwidth


Page 11:

No global memory!

P has two kinds of memory refs.

Own local memory

Non-local ref. over the network

But actual dual core and SMP processors do have global mem!

Will need to represent it within the model (stay tuned)


Page 12:

Recognizing memory ref. costs

Memory ref. time = “latency”

Local refs. considered “unit time” (1), like RAM model

Non-local refs. symbolized by lambda (λ); can be measured and averaged (Table 2.1)

λ increases with no. of P (but less than linearly)
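As a back-of-envelope illustration (my own, with a placeholder λ rather than a value from Table 2.1), the CTA model turns these latencies into a running-time estimate like so:

/* CTA-style cost estimate: a local ref. costs 1 unit, a non-local ref.
 * costs lambda.  lambda = 1000 below is an assumed placeholder; real
 * values are system-specific (see Table 2.1). */
double cta_time(long local_refs, long nonlocal_refs, double lambda)
{
    return (double)local_refs + (double)nonlocal_refs * lambda;
}

/* cta_time(1000000, 1000, 1000.0) == 2000000: making just 0.1% of
 * references non-local doubles the predicted running time. */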


Page 13:

Table 2.1: Estimates of λ for common architectures; speeds generally do not include congestion or other traffic delays. (© 2009 Pearson Education, Inc.)

Page 14:

One benefit of CTA model

Yields “Locality Rule” for fast programs

KEY POINT: maximize local memory refs., minimize non-local

Ergo, more efficient even to do redundant local calculations than to make many non-local refs.

Nice practical application for stochastic calculation (p53)
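In that stochastic spirit (the sketch below is my reconstruction of the general idea, not the book's code from p. 53): rather than have P0 generate random numbers and ship each one out at λ cost, every process regenerates the identical stream locally from a shared seed.

#include <stdlib.h>

/* Redundant local calculation beats communication: a rand_r() call is a
 * few unit-time operations, while fetching the value from P0 would cost
 * lambda -- orders of magnitude more on a cluster (Table 2.1). */
void worker(int rank, unsigned shared_seed, int samples)
{
    unsigned seed = shared_seed;   /* same seed => same stream on every P */
    for (int i = 0; i < samples; i++) {
        double r = rand_r(&seed) / (double)RAND_MAX;
        (void)r;   /* ... use r in this P's share of the computation ... */
    }
    (void)rank;
}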


Page 15:

How MIMD machines do non-local memory references

3 common mechanisms

Shared global memory (typical SMP)

One-sided communication

Message passing (typical HPC cluster)

CTA model needs to account for these


Page 16:

(1) Shared memory

CTA has no “shared memory” per se

Real meaning: if Pi doesn’t already have the ref. in its local cache, it must get it from elsewhere

Incurs λ latency; in practice this includes the overhead of the cache-coherency mechanism and memory bus

Programmer’s standpoint

Convenient: read/write global variables

Tricky: still need sync. conventions (mutex)
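A minimal pthreads sketch of that convenient-but-tricky combination (my illustration, not from the slides): threads update one global variable, and a mutex supplies the required sync convention.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

long total = 0;                                    /* shared global variable */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* the sync convention */

void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* omit this and updates are lost */
        total++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld\n", total);   /* reliably 400000 */
    return 0;
}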


Page 17:

(2) One-sided communication

“Poor man’s” shared memory

Cray calls it “shmem”

Pi can get or put any location, explicitly transferring data between the shared address space and local memory at λ cost

No HW keeps each P’s view of a location coherent: cheaper to build, smaller λ

“Private” addresses reserved for each P

As with shared mem, need sync conventions

Easier to debug?
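One concrete realization of get/put (an MPI one-sided sketch of mine; the slides name Cray shmem, which is analogous): each rank exposes a window of its local memory, and rank 1 puts a value directly into rank 0's window with no receive on the other side.

#include <mpi.h>
#include <stdio.h>

/* One-sided communication: rank 1 transfers data into rank 0's exposed
 * memory; rank 0 never issues a matching receive.  The fences are the
 * sync convention the slide says is still needed. */
int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_create(&value, (MPI_Aint)sizeof value, sizeof value,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                      /* open access epoch */
    if (rank == 1) {
        int payload = 42;
        MPI_Put(&payload, 1, MPI_INT,
                0 /* target rank */, 0 /* displacement */,
                1, MPI_INT, win);               /* "put" at lambda cost */
    }
    MPI_Win_fence(0, win);                      /* close epoch: put visible */

    if (rank == 0)
        printf("rank 0 now holds %d\n", value); /* 42 */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}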

Page 18:

(3) Message passing

Primitive operations: send and receive

Considered “2-sided” since it needs HW and/or OS cooperation at both the source and destination P’s

“Easier to debug” due to explicit comm.

Communication and synchronization are combined automatically vs. needing separate APIs
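A minimal MPI sketch of the send/receive pairing (MPI is a typical realization; the slide itself names only the primitives): the blocking receive delivers the data and synchronizes in one step.

#include <mpi.h>
#include <stdio.h>

/* Two-sided message passing: both P's participate.  MPI_Recv blocks
 * until the message arrives, so communication and synchronization are
 * combined, exactly as the slide notes. */
int main(int argc, char **argv)
{
    int rank, data;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0 /* tag */, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 got %d\n", data);  /* 42, and rank 0 has sent it */
    }

    MPI_Finalize();
    return 0;
}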


Page 19:

Most popular mechanisms

For multicore/SMP: shared memory

For clusters: message passing

And… can build virtual layer for either!

Effect of shared memory implemented with message passing

Send/receive implemented via shared mem.

Occurs anyway when MPI processes happen to be on same node’s cores


Page 20:

Underlying HW used for non-local memory refs.

Bus, we know

Good: direct connect between origin/dest.

Bad: only one comm. at a time (serialized)

Crossbar, saw with SunFire

Good: direct connect and multiple comm.

Bad: most expensive, n² HW cost

Message forwarding (packets)

Good: less HW, multiple comm. possible

Bad: multiple “hops” pile up latency

Page 21:

Another wrinkle: “consistency”

Coherence = HW mechanism to sync views of multiple L1 caches

Saw MESI protocol, popular choice

When the coherence mechanism operates is determined by the consistency policy

Reducing sync operations…

Minimizes bus congestion

Makes attaching more cores practical before hitting memory bandwidth wall


Page 22:

Memory consistency models

Sequential (what we’re used to): if P1 sets X before P2 reads, P2 will get the new X

Guaranteed at cost of HW overhead

Relaxed: if P1 sets X before P2 reads, P2 may get the new X

Why? HW can be faster if P1’s store is buffered, lazily propagated to memory

Careful: e.g. louses up Dekker’s algo (p56)

For special occasions, HW provides a test-and-swap inst.
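A hedged sketch (mine, not the slides') of leaning on such an instruction from C11: a spinlock built on atomic compare-and-swap, whose default seq_cst ordering sidesteps the relaxed-consistency trap.

#include <stdatomic.h>

atomic_int lock_word = 0;    /* 0 = free, 1 = held */

void spin_lock(atomic_int *l)
{
    int expected = 0;
    /* Atomic test-and-swap: exactly one thread can flip 0 -> 1.
     * The operation defaults to memory_order_seq_cst, so it also
     * orders the surrounding loads and stores. */
    while (!atomic_compare_exchange_weak(l, &expected, 1))
        expected = 0;    /* CAS failed (or spurious failure): retry */
}

void spin_unlock(atomic_int *l)
{
    atomic_store(l, 0);  /* release the lock, again with seq_cst */
}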


Page 23:

Dekker’s Algorithm for Mutual Exclusion (Wikipedia)

flag[0] := false   // flag[i] true => Pi wants to enter its C/S
flag[1] := false
turn := 0          // 0 or 1 => whose turn it is in case both flags are true

p0: flag[0] := true
    while flag[1] = true {
        if turn ≠ 0 {
            flag[0] := false
            while turn ≠ 0 {
            }
            flag[0] := true
        }
    }
    // critical section
    ...
    turn := 1
    flag[0] := false
    // remainder section

p1: flag[1] := true
    while flag[0] = true {
        if turn ≠ 1 {
            flag[1] := false
            while turn ≠ 1 {
            }
            flag[1] := true
        }
    }
    // critical section
    ...
    turn := 0
    flag[1] := false
    // remainder section

Danger sequence, assuming processes are preemptible (the hazard: under relaxed consistency, p1 can read a stale flag[0]):

p0: flag[0] := true
    while flag[1] = true {     // flag[1] not yet set, so the test fails
        <Interrupt / Reschedule>
    }
p1: flag[1] := true
    while flag[0] = true {     // stale read: p1 may still see flag[0] = false!
    }
    // critical section
    ...
        <Interrupt / Reschedule>
p0: // critical section        // both processes are now in their C/S
    ...


Page 24:

So our computers have sequential consistency, right?

No! Modern architectures…

Actually employ relaxed consistency

Make it look (to programmer) like sequential consistency

But! programmer has to play by the rules

Means careful use of mutexes for global variables is essential!

Lock/unlock secretly use a memory “fence” to activate the cache-coherence mechanism for sync
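A minimal pthreads sketch of “playing by the rules” (my illustration): the writer publishes through the mutex, whose unlock acts as a fence, and the reader's lock acts as the matching fence, so no stale value can be observed.

#include <pthread.h>

int shared_data = 0;   /* ordinary globals: no atomics, no volatile */
int ready = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Writer: the unlock fences both stores out before any other
 * thread can acquire m. */
void publish(void)
{
    pthread_mutex_lock(&m);
    shared_data = 42;
    ready = 1;
    pthread_mutex_unlock(&m);
}

/* Reader: the lock fences our view in; if ready is seen as 1,
 * shared_data is guaranteed to be 42. */
int consume(void)
{
    int value = -1;
    pthread_mutex_lock(&m);
    if (ready)
        value = shared_data;
    pthread_mutex_unlock(&m);
    return value;
}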


Page 25:

Benefit of CTA model

Apply it back to “count 3s”

Since a single global count variable requires non-local refs. and incurs λ, avoid it in favour of local count variables (e.g., on the stack)

Steers us to best solution by taking memory ref. time into account!
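A minimal pthreads sketch of the CTA-guided fix (my reconstruction of the idea, not the book's exact code): each thread counts into a stack-local variable and touches the shared total, with its λ cost and lock contention, exactly once.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

int array[N];
long total = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *count3s(void *arg)
{
    long id = (long)arg;
    long local = 0;                       /* unit-cost local refs only */
    for (long i = id; i < N; i += NTHREADS)
        if (array[i] == 3)
            local++;
    pthread_mutex_lock(&lock);            /* one contended update in total */
    total += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    array[7] = array[42] = array[99] = 3; /* plant a few 3s */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, count3s, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("count = %ld\n", total);       /* 3 */
    return 0;
}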


Page 26:

Figure 1.7 The first try at Count 3s (© 2009 Pearson Education, Inc.), annotated with λ

Page 27:

Model’s analysis not quite accurate

Points up problem of global variable because of high-cost non-local ref.

Major problem here was thread contention for single global variable + its lock

OK, “model” is supposed to abstract away low-level details of real computer

As long as it leads you to the right conclusion (even for wrong reason), it’s doing its job!
