
Chap. 2 part 2

CIS*3090 Fall 2016


Why develop abstract models of computer architectures?

Don’t want to treat every computer as a special case for SW development!

Programming for speed is not just "hacking"; it's based on algorithms

Have math basis for measuring time/space complexity of algorithms (Big-O notation)

Vast research into efficient algorithms


What exactly is Big-O counting?

Counts abstract “instructions,” like pseudocode

Assumption: counting abstract instructions is close enough to counting real machine code (see the sketch below)

Comes from traditional “RAM” model

“Any mem. location – whether inst. or data – can be referenced (read or written) in ‘unit’ time without regard to its location”

Memory is effectively unbounded
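
As a sketch of what counting abstract instructions looks like (my example, not from the slides): summing an array costs roughly 1 initialization + (n+1) compares + n increments + n adds, about 3n + 2 "instructions" in all, hence O(n); the RAM model charges each a[i] access one unit no matter where it lives.

/* ~3n + 2 abstract instructions => O(n); every a[i] reference
   is "unit time" under the RAM model. */
long sum(const long *a, int n) {
    long s = 0;                      /* 1 instruction */
    for (int i = 0; i < n; i++)      /* n+1 compares, n increments */
        s += a[i];                   /* n loads + n adds */
    return s;
}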


Is RAM model accurate?

Defects

RAM model can't account for caching and paging effects, which break the "unit time" assumption

The "unbounded memory" is really a VM illusion (and paging is far from unit time)

It’s “close enough”

If you care, you can program with cache/paging in mind…

And now your code may be computer- or OS-dependent!


Extending RAM to PRAM

Simple enhancements

Instead of one inst. execution unit, now N executing in lock step (as if on a global clock; sketch below)

Superficially looks like multicore system

Still single (global) memory

Every PE still accesses it in unit time, as before

All PEs immediately see any change in the memory image
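
A sequential sketch of the lock-step idea (my illustration): N values are summed by tree reduction, the outer loop playing the global clock and the inner loop playing all PEs acting simultaneously, so the sum finishes in log2(N) ticks.

#include <stdio.h>

/* Simulate a PRAM tree-reduction sum on 8 "PEs". */
int main(void) {
    long a[8] = {3, 1, 4, 1, 5, 9, 2, 6};           /* one value per PE */
    int n = 8;
    for (int stride = 1; stride < n; stride *= 2)   /* one global clock tick */
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];   /* PE i reads a[i+stride] in unit time */
    printf("sum = %ld\n", a[0]);     /* log2(8) = 3 ticks total */
    return 0;
}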


Defects in PRAM model

Simultaneous memory access issues: what happens on R+R, R+W, W+W to the same location?

Solved by different rules in PRAM variants (EREW, CREW, CRCW)

Fatal weaknesses

“Close enough” for dual core (since that’s what it approximates)

As no. of cores grows in SMP, HW cannot maintain unit-time uniform mem. image

For non-SMP (cluster), non-local memory orders of mag. slower than local memory


Unfortunate consequences

PRAM model not good at predicting performance of parallel algos!

Intelligent designers may be “led astray”

Valiant’s Algo. example

Naïve parallel programmers may unconsciously adopt this model

Helps explain why naïve parallel programs don’t give scalable performance unless by luck (chap. 1)


Better parallel model: CTA

Candidate Type Architecture

More complex than PRAM, but still much simpler than any real computer

Key characteristic: distinguishes local vs. non-local memory accesses

Allows you to differentiate the costs on a particular system


Figure 2.10: the CTA model

CTA components (Fig 2.10)

Pi = a single "RAM"-model PE: processor + local memory + network interface

P’s might be all equal, or one (P0) might be “controller”

Local memory is for program and data

What is “node degree”?

Max. no. of nodes any node can directly comm. with at one time; could be just 1 (e.g., each node in a ring network has degree 2)

Degree = a measure of network interconnectedness and effective bandwidth


No global memory!

P has two kinds of memory refs.

Own local memory

Non-local ref. over the network

But actual dual core and SMP processors do have global mem!

Will need to represent it within the model (stay tuned)


Recognizing memory ref. costs

Memory ref. time = “latency”

Local refs. considered “unit time” (1), like RAM model

Non-local refs symbolized with lambda (λ), can be measured and averaged (Tab 2.1)

λ increases with no. of P (but less than linearly)
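
For illustration (λ value hypothetical): with λ = 100, an algorithm making 10^6 local refs and 10^4 non-local refs costs 10^6 + 10^4 × 100 = 2 × 10^6 time units, so half its time goes to communication even though non-local refs are only 1% of all refs.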


Table 2.1: Estimates for λ for common architectures; speeds generally do not include congestion or other traffic delays.

One benefit of CTA model

Yields “Locality Rule” for fast programs

KEY POINT: maximize local memory refs., minimize non-local

Ergo, more efficient even to do redundant local calculations than to make many non-local refs.

Nice practical application for stochastic calculation (p53)
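
A back-of-envelope sketch of the rule (all costs hypothetical, for illustration only):

/* Hypothetical costs, for illustration only. */
#define LAMBDA        500  /* cycles per non-local reference        */
#define LOCAL_RECALC   50  /* cycles to recompute the value locally */

/* Fetch a neighbour's precomputed value (1 non-local ref), or redundantly
   recompute it from data we already hold? Here 50 < 500, so recomputing
   wins by 10x despite the "wasted" duplicate work. */
int cheaper_to_recompute(void) { return LOCAL_RECALC < LAMBDA; }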


How MIMD machines do non-local memory references

3 common mechanisms

Shared global memory (typical SMP)

One-sided communication

Message passing (typical HPC cluster)

CTA model needs to account for these


(1) Shared memory

CTA has no “shared memory” per se

Real meaning: if Pi doesn't already have the ref. in its local cache, it must get it from elsewhere

Incurs λ latency; in practice this includes the overhead of the cache-coherency mechanism and memory bus

Programmer’s standpoint

Convenient: read/write global variables

Tricky: still need sync. conventions (mutex)
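
A minimal Pthreads sketch of these two points (names are mine): the global variable is convenient to reach from any thread, but every update still needs the mutex convention.

#include <pthread.h>
#include <stdint.h>

long total = 0;                                   /* shared global variable */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    long my_part = (intptr_t)arg;   /* purely local work: unit-time refs */
    pthread_mutex_lock(&lock);      /* sync convention: guard the global */
    total += my_part;               /* non-local ref., costs ~λ          */
    pthread_mutex_unlock(&lock);
    return NULL;
}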


(2) One-sided communication

“Poor man’s” shared memory

Cray calls it "shmem"

Pi can get or put any location, i.e., explicitly transfer data between the shared address space and local memory at λ cost

No HW keeps each P's view of a location coherent, so it's cheaper to build, with smaller λ

“Private” addresses reserved for each P

As with shared mem, need sync conventions

Easier to debug?
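
A sketch using MPI's one-sided API as one realization of get/put (Cray's shmem is another); run with at least two ranks. Rank 0 deposits a value directly into rank 1's exposed memory, and rank 1 never calls a receive.

#include <mpi.h>

int main(int argc, char **argv) {
    long cell = 0, payload = 42;
    MPI_Win win;
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&cell, sizeof cell, sizeof cell,   /* expose "cell"   */
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);                            /* sync convention */
    if (rank == 0)                                    /* put straight    */
        MPI_Put(&payload, 1, MPI_LONG,                /* into rank 1's   */
                1, 0, 1, MPI_LONG, win);              /* local memory    */
    MPI_Win_fence(0, win);             /* after this, rank 1's cell = 42 */
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}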

(3) Message passing

Primitive operations: send and receive

Considered "2-sided" since it needs HW and/or OS cooperation at both source and destination P's

“Easier to debug” due to explicit comm.

Communication and synchronization are combined automatically vs. needing separate APIs
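
A minimal MPI sketch of the 2-sided nature (run with two ranks): both ends participate, and the blocking receive delivers communication and synchronization in one call.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    long x = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                    /* sender cooperates...        */
        MPI_Send(&x, 1, MPI_LONG, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {             /* ...and so does the receiver */
        MPI_Recv(&x, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %ld\n", x);  /* recv blocks: comm + sync */
    }
    MPI_Finalize();
    return 0;
}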


Most popular mechanisms

For multicore/SMP: shared memory

For clusters: message passing

And… can build virtual layer for either!

Effect of shared memory implemented with message passing

Send/receive implemented via shared mem.

Occurs anyway when MPI processes happen to be on same node’s cores


Underlying HW used for non-local memory refs.

Bus, we know

Good: direct connect between origin/dest.

Bad: only one comm. at a time (serialized)

Crossbar, saw with SunFire

Good: direct connect and multiple comm.

Bad: most expensive, n² HW cost

Message forwarding (packets)

Good: less HW, multiple comm. possible

Bad: multiple "hops" pile up latency

Another wrinkle: “consistency”

Coherence = HW mechanism to sync views of multiple L1 caches

Saw MESI protocol, popular choice

When the coherence mechanism operates is determined by the consistency policy

Reducing sync operations…

Minimizes bus congestion

Makes attaching more cores practical before hitting memory bandwidth wall


Memory consistency models

Sequential (what we're used to): if P1 sets X before P2 reads, P2 will get the new X

Guaranteed at cost of HW overhead

Relaxed: if P1 sets X before P2 reads, P2 may get the new X

Why? HW can be faster if P1’s store is buffered, lazily propagated to memory

Careful: e.g. louses up Dekker’s algo (p56)

For special occasions, HW provides a test-and-swap (atomic) instruction
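
A small C11-atomics sketch of both points (my illustration, standing in for the HW features named above): a relaxed store makes no promise about when other cores see it, while an atomic test-and-set stays correct regardless.

#include <stdatomic.h>

atomic_int X;                            /* shared variable */
atomic_flag busy = ATOMIC_FLAG_INIT;     /* shared lock bit */

void relaxed_writer(void) {
    /* Relaxed store: another core may keep seeing the old X for a while,
       and nearby stores may be reordered around this one. */
    atomic_store_explicit(&X, 1, memory_order_relaxed);
}

void enter_critical(void) {
    /* "Test-and-swap"-style instruction: reads and sets the flag in one
       indivisible step, so it works even under relaxed consistency. */
    while (atomic_flag_test_and_set(&busy))
        ;                                /* spin until we own the flag */
}

void exit_critical(void) {
    atomic_flag_clear(&busy);
}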


Dekker’s Algorithm for Mutual Exclusion (wikipedia)

flag[0] := false   // flag[i] = true means process i wants to enter the critical section
flag[1] := false
turn := 0          // 0 or 1: whose turn it is in case both flags are true

p0:                                 p1:
  flag[0] := true                     flag[1] := true
  while flag[1] = true {              while flag[0] = true {
    if turn ≠ 0 {                       if turn ≠ 1 {
      flag[0] := false                    flag[1] := false
      while turn ≠ 0 { }                  while turn ≠ 1 { }
      flag[0] := true                     flag[1] := true
    }                                   }
  }                                   }
  // critical section                 // critical section
  ...                                 ...
  turn := 1                           turn := 0
  flag[0] := false                    flag[1] := false
  // remainder section                // remainder section

Danger sequence (under relaxed consistency), assuming processes are preemptible:

p0: flag[0] := true               // store sits in p0's write buffer
p0: while flag[1] = true {        // sees flag[1] = false, so falls through
      <interrupt / reschedule>
p1: flag[1] := true
p1: while flag[0] = true { }      // p0's store still not visible: falls through!
p1: // critical section
    ...
      <interrupt / reschedule>
p0: }
p0: // critical section           // both processes are now in the critical
    ...                           // section: mutual exclusion is broken
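
One way to close that window on modern hardware (a sketch of mine using C11 sequentially consistent atomics, not from the slides): making flag and turn atomic forces each store to become visible in program order. Shown for p0's entry; p1 is symmetric.

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];                  /* seq_cst by default: stores   */
atomic_int  turn;                     /* become visible, in order     */

void p0_enter(void) {
    atomic_store(&flag[0], true);     /* can no longer hide in a write buffer */
    while (atomic_load(&flag[1])) {
        if (atomic_load(&turn) != 0) {
            atomic_store(&flag[0], false);
            while (atomic_load(&turn) != 0)
                ;                     /* busy-wait for our turn */
            atomic_store(&flag[0], true);
        }
    }
    /* critical section */
}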


So our computers have sequential consistency, right?

No! Modern architectures…

Actually employ relaxed consistency

Make it look (to programmer) like sequential consistency

But! programmer has to play by the rules

Means careful use of mutexes for global variables is essential!

Lock/unlock secretly use a memory "fence" to activate the cache-coherence mechanism for sync


Benefit of CTA model

Apply it back to “count 3s”

Since a single global count variable requires non-local refs. and incurs λ, avoid it in favour of local count variables (e.g., on the stack)

Steers us to best solution by taking memory ref. time into account!
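
A sketch of where the model steers us (my Pthreads rendering; the input array is assumed defined elsewhere): each thread counts 3s into a stack-local variable and touches the locked global exactly once.

#include <pthread.h>

extern int array[];                  /* shared input, defined elsewhere */
long total = 0;                      /* the one global, guarded below   */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

struct range { int lo, hi; };        /* a thread's slice of the array */

void *count3s(void *arg) {
    struct range *r = arg;
    long count = 0;                  /* local (stack): unit-time refs */
    for (int i = r->lo; i < r->hi; i++)
        if (array[i] == 3)
            count++;                 /* no λ cost, no lock contention */
    pthread_mutex_lock(&lock);       /* one non-local update per thread, */
    total += count;                  /* not one per element              */
    pthread_mutex_unlock(&lock);
    return NULL;
}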


Figure 1.7: The first try at Count 3s

Model’s analysis not quite accurate

Points up problem of global variable because of high-cost non-local ref.

Major problem here was thread contention for single global variable + its lock

OK, “model” is supposed to abstract away low-level details of real computer

As long as it leads you to the right conclusion (even for wrong reason), it’s doing its job!
