Transcript of Lec01 - University of California, San Diego
cseweb.ucsd.edu/classes/wi11/cse160/Lectures/Lec01.pdf

Page 1:

Scott B. Baden

CSE 160

Introduction to parallel computing

Page 2:

Welcome to Parallel Computation!
•  Your instructor is Scott B. Baden
•  Your TA is Tan Nguyen
•  Office hours: Mon 3-4p, Thu 4-5p, or by appointment
•  Lab hours:
•  Section: Friday 2:00 to 2:50 pm, WLH 2204

No Section this week

Scott B. Baden / CSE 160 / Winter 2011 2

Page 3:

Content
•  Our home page is http://www-cse.ucsd.edu/classes/wi11/cse160
•  All class announcements will be made on-line, so check this web page frequently
•  Moodle
•  One required text: An Introduction to Parallel Programming, by Peter Pacheco, Morgan Kaufmann, 2011
•  Useful information on-line: http://www-cse.ucsd.edu/users/baden/Doc

Page 4:

Background
•  Prerequisite: CSE 100 / Math 176
•  C/C++ programming experience
•  Do you know about at least one of the following?
    Threads or other forms of parallel computation
    Cache memory hierarchies
•  Numerical analysis background not required but useful

Page 5:

Course Requirements
•  [4] Programming assignments (50%)
    Includes a lab writeup, which must be typed
    Assignments shall be done in teams of 2
•  Exams (35%)
    Midterm (15%)
    Final (20%)
•  [4 or 5] in-class pop quizzes (15%)

Page 6:

Policies
•  By taking this course, you implicitly agree to abide by the following course policies: http://www-cse.ucsd.edu/classes/wi11/cse160/Policies.html
•  Academic Honesty
    Do your own work
    Plagiarism and cheating will not be tolerated

Page 7:

Hardware and software platforms
•  Hardware
    Multi-core server: ieng6-203
•  Software
    Pthreads, OpenMP
    MPI

Page 8:

Course overview
•  Theory and practice of parallel computation
•  Emphasis on multi-core implementations, threads programming
•  Case studies to develop a toolbox of problem-solving and software techniques
•  Learn how to recognize an appropriate way to implement an application

Page 9:

Syllabus
•  Fundamentals
    Motivation, system organization, hardware execution models, limits to performance, program execution models, theoretical models
•  Software and programming
    Programming models and techniques: multithreading and message passing
    Architectural considerations: multicore primarily
    pthreads, OpenMP, MPI
•  Parallel algorithm design and implementation
    Case studies to develop a repertoire of problem-solving techniques
    Data structures and their efficient implementation: load balancing and performance
    Performance tradeoffs, evaluation, and tuning

Page 10:

What is parallel processing?
•  Decompose a workload onto simultaneously executing physical resources
•  Multiple processors co-operate to process a related set of tasks (tightly coupled)
•  Improve some aspect of performance
    Speedup: 100 processors run ×100 faster than one
    Capability: tackle a larger problem, more accurately
    Algorithmic, e.g. search
    Locality: more cache memory and bandwidth
•  Virtual or physical
•  Reliability is more of an issue at the high end or in critical applications

Page 11:

Parallel Processing, Concurrency & Distributed Computing
•  Parallel processing
    Performance (and capacity) is the main goal
    More tightly coupled than distributed computation
•  Concurrency
    Concurrency control: serialize certain computations to ensure correctness, e.g. database transactions
    Performance need not be the main goal
•  Distributed computation
    Geographically distributed
    Multiple resources computing & communicating unreliably
    “Cloud” or “Grid” computing, large amounts of storage
    Looser, coarser-grained communication and synchronization
•  May or may not involve separate physical resources, e.g. multitasking (“virtual parallelism”)

Page 12:

Why is parallel computation inevitable?
•  Physical limits on processor clock speed and heat dissipation
•  A parallel computer increases memory capacity and bandwidth as well as the computational rate

Graph: Average CPU clock speeds (http://www.pcpitstop.com/research/cpu.asp)

Page 13:

How does parallel computing relate to other branches of computer science?

•  Parallel processing generalizes problems we encounter on single processor computers

•  A parallel computer is just an extension of the traditional memory hierarchy

•  The need to preserve locality, which prevails in virtual memory, cache memory, and registers, also applies to a parallel computer

Page 14:

A Motivating Application - TeraShake
Simulates a magnitude 7.7 earthquake along the southern San Andreas fault near LA, using seismic, geophysical, and other data from the Southern California Earthquake Center

epicenter.usc.edu/cmeportal/TeraShake.html

Page 15:

How TeraShake Works
•  Divide up Southern California into blocks
•  For each block, get all the data about geological structures, fault information, …
•  Map the blocks onto processors of the supercomputer
•  Run the simulation using current information on fault activity and on the physics of earthquakes

SDSC Machine Room (DataCentral@SDSC)

Page 16:

Animation

Page 17:

The advance of technology

Page 18:

Today’s laptop would have been yesterday’s supercomputer


•  Cray-1 Supercomputer
    80 MHz processor
    8 Megabytes memory
    Water cooled
    1.8m H x 2.2m W
    4 tons
    Over $10M in 1976

•  MacBook
    2.4GHz Intel Core 2 Duo
    4 Gigabytes memory, 3 Megabytes shared cache
    NVIDIA GeForce 320M, 256MB shared DDR3 SDRAM
    Wireless networking
    Air cooled
    ~2.7 x 33 x 23 cm, 2.1 kg
    $1149 in Sept. 2010

Page 19:

Technological disruption
•  New capabilities → increased knowledge through improvements in computer modelling
•  Changes in the common wisdom for solving a problem, including the implementation

Images: Intel 48-core processor, 2009; Cray-1, 1976, 240 Megaflops; Nvidia Tesla, 4.14 Tflops, 2009; Sony PlayStation 3, 150 Gflops, 2006; Beowulf cluster, late 1990s; Tilera 100-core processor, 2009; ASCI Red, 1997, 1 Tflop; Connection Machine CM-2, 1987

Page 20:

The age of the multi-core processor

•  On-chip parallel computer
•  IBM Power4 (2001), many others follow (Intel, AMD, Tilera, Cell Broadband Engine)
•  First dual-core laptops (2005-6)
•  GPUs (nVidia, ATI): supercomputer on a desktop

Page 21:

The Impact
•  You are taking this class
•  A renaissance in parallel computation
•  Parallelism is no longer restricted to machine rooms; it is available to everyone
•  We all have a parallel computer at our fingertips
•  If we don’t use the parallelism, we lose it

Page 22:

The payoff
•  Capability
    We solved a problem that we couldn’t solve before, or under conditions that were not possible previously
•  Performance
    Solve the same problem in less time than before
    This can provide a capability if we are solving many problem instances
•  The result achieved must justify the effort
    Enable new scientific discovery
    Software costs must be reasonable

Page 23:

How hard is it?
•  Two types of users
    Enjoy the capabilities that parallelism provides without being aware of the details, e.g. Photoshop
    Get into the driver’s seat: write parallel programs, enjoy the benefits of customization, personal preferences
•  A well-behaved single processor algorithm may behave poorly on a parallel computer, and may need to be reformulated
•  There is no magic compiler that can turn a serial program into an efficient parallel program all the time and on all machines

Page 24:

What is involved?
•  Performance programming
    Low-level details: heavily application dependent
    Irregularity in the computation and its data structures forces us to think even harder
    Users don’t start from scratch; they reuse old code
    Beware of dirty rotten code in need of redesign!
•  Parallelism introduces many new tradeoffs
    Redesign the software
    Rethink the problem solving technique

Page 25:


Memory hierarchies and address space organization

Page 26:

The processor-memory gap
•  The result of technological trends
•  Difference in processing and memory speeds growing exponentially over time

Graph: Processor vs. Memory (DRAM) performance, 1980-2005, log scale

Page 27:

An important principle: locality
•  Programs generally exhibit two forms of locality in accessing memory
    Temporal locality (time)
    Spatial locality (space)
•  Often involves loops
•  Opportunities for reuse

for t = 0 to T-1
    for i = 1 to N-2
        u[i] = (u[i-1] + u[i+1]) / 2

Page 28:

Memory hierarchies
•  Exploit reuse through a hierarchy of smaller but faster memories
•  Put things in faster memory if we reuse them frequently

Level    Access time             Capacity
CPU      1 CP (1 word)
L1       2-3 CP (10 to 100 B)    32 to 64 KB
L2       O(10) CP (10 - 100 B)   256 KB to 4 MB
DRAM     O(100) CP               GB
Disk     O(10^6) CP              Many GB or TB

Page 29:

The Benefits of Cache Memory
•  Let’s say that we have a small, fast memory that is 10 times faster (access time) than main memory …
•  If we find what we are looking for 90% of the time (a hit), the access time approaches that of the fast memory
•  Taccess = 0.90 × 1 + (1 - 0.9) × 10 = 1.9
•  Memory appears to be about 5 times faster
•  We organize the references by blocks
•  We can have multiple levels of cache

Page 30:

Sidebar
•  If cache memory access time is 10 times faster than main memory …
•  Cache “hit time” Tcache = Tmain / 10
•  And if we find what we are looking for f × 100% of the time (the “cache hit rate”) …
•  Access time = f × Tcache + (1 - f) × Tmain
              = f × Tmain/10 + (1 - f) × Tmain
              = (1 - 9f/10) × Tmain
•  We are now 1/(1 - 9f/10) times faster
•  To simplify, we use Tcache = 1, Tmain = 10

Page 31:

Nehalem’s Memory Hierarchy
•  Source: Intel 64 and IA-32 Architectures Optimization Reference Manual, Table 2.7 (diagram: realworldtech.com)

                      L1             L2             L3
Latency (cycles)      4              10             35+
Associativity         8              8              16
Line size (bytes)     64             64             64
Write update policy   Writeback      Writeback      Writeback
Inclusive?            Non-inclusive  Non-inclusive  Inclusive

•  L3 is 4MB for Gainestown