Lec01 - University of California, San...
Scott B. Baden
CSE 160
Introduction to parallel computing
Welcome to Parallel Computation!
• Your instructor is Scott B. Baden
• Your TA is Tan Nguyen
• Lab hours:
• Office hours: Mon 3-4p, Thu 4-5p, or by appointment
• Section: Friday 2:00 to 2:50 pm, WLH 2204 (no section this week)
Scott B. Baden / CSE 160 / Winter 2011 2
Content
• Our home page is http://www-cse.ucsd.edu/classes/wi11/cse160
• All class announcements will be made on-line, so check this web page frequently
• Moodle
• One required text: An Introduction to Parallel Programming, by Peter Pacheco, Morgan Kaufmann, 2011
• Useful information on-line: http://www-cse.ucsd.edu/users/baden/Doc
Background
• Pre-requisite: CSE 100 / Math 176
• C/C++ programming experience
• Do you know about at least one of the following?
  Threads or another form of parallel computation
  Cache memory hierarchies
• Numerical analysis background not required, but useful
Course Requirements
• [4] Programming assignments (50%)
  Includes a lab writeup, which must be typed
  Assignments shall be done in teams of 2
• Exams (35%): Midterm (15%), Final (20%)
• [4 or 5] in-class pop quizzes (15%)
Policies
• By taking this course, you implicitly agree to abide by the course policies: http://www-cse.ucsd.edu/classes/wi11/cse160/Policies.html
• Academic Honesty
  Do your own work
  Plagiarism and cheating will not be tolerated
Hardware and software platforms
• Hardware: multi-core server ieng6-203
• Software: Pthreads, OpenMP, MPI
Course overview
• Theory and practice of parallel computation
• Emphasis on multi-core implementations and threads programming
• Case studies to develop a toolbox of problem-solving and software techniques
• Learn how to recognize an appropriate way to implement an application
Syllabus
• Fundamentals
  Motivation, system organization, hardware execution models, limits to performance, program execution models, theoretical models
• Software and programming
  Programming models and techniques: multithreading and message passing
  Architectural considerations: multicore primarily
  Pthreads, OpenMP, MPI
• Parallel algorithm design and implementation
  Case studies to develop a repertoire of problem-solving techniques
  Data structures and their efficient implementation: load balancing and performance
  Performance tradeoffs, evaluation, and tuning
What is parallel processing?
• Decompose a workload onto simultaneously executing physical resources
• Multiple processors co-operate to process a related set of tasks (tightly coupled)
• Improve some aspect of performance
  Speedup: 100 processors run ×100 faster than one
  Capability: tackle a larger problem, more accurately
  Algorithmic: e.g. search
  Locality: more cache memory and bandwidth
• Resources may be virtual or physical
• Reliability: more of an issue at the high end or in critical applications
Parallel Processing, Concurrency & Distributed Computing
• Parallel processing
  Performance (and capacity) is the main goal
  More tightly coupled than distributed computation
• Concurrency
  Concurrency control: serialize certain computations to ensure correctness, e.g. database transactions
  Performance need not be the main goal
• Distributed computation
  Geographically distributed; multiple resources computing & communicating unreliably
  "Cloud" or "Grid" computing, large amounts of storage
  Looser, coarser-grained communication and synchronization
• May or may not involve separate physical resources, e.g. multitasking ("virtual parallelism")
Why is parallel computation inevitable?
• Physical limits on processor clock speed and heat dissipation
• A parallel computer increases memory capacity and bandwidth as well as the computational rate
Average CPU clock speeds (source: http://www.pcpitstop.com/research/cpu.asp)
How does parallel computing relate to other branches of computer science?
• Parallel processing generalizes problems we encounter on single-processor computers
• A parallel computer is just an extension of the traditional memory hierarchy
• The need to preserve locality, which prevails in virtual memory, cache memory, and registers, also applies to a parallel computer
A Motivating Application: TeraShake
Simulates a magnitude 7.7 earthquake along the southern San Andreas fault near LA, using seismic, geophysical, and other data from the Southern California Earthquake Center
epicenter.usc.edu/cmeportal/TeraShake.html
How TeraShake Works
• Divide up Southern California into blocks
• For each block, get all the data about geological structures, fault information, …
• Map the blocks onto processors of the supercomputer
• Run the simulation using current information on fault activity and on the physics of earthquakes
SDSC Machine Room (DataCentral@SDSC; animation shown in lecture)
The advance of technology
Today’s laptop would have been yesterday’s supercomputer
• Cray-1 Supercomputer
  80 MHz processor
  8 Megabytes memory
  Water cooled
  1.8 m H x 2.2 m W, 4 tons
  Over $10M in 1976
• MacBook
  2.4 GHz Intel Core 2 Duo
  4 Gigabytes memory, 3 Megabytes shared cache
  NVIDIA GeForce 320M with 256 MB shared DDR3 SDRAM
  Wireless networking
  Air cooled
  ~2.7 x 33 x 23 cm, 2.1 kg
  $1149 in Sept. 2010
Technological disruption
• New capabilities → increased knowledge through improvements in computer modelling
• Changes in the common wisdom for solving a problem, including the implementation
Pictured:
• Cray-1, 1976, 240 Megaflops
• Connection Machine CM-2, 1987
• ASCI Red, 1997, 1 Tflop
• Beowulf cluster, late 1990s
• Sony Playstation 3, 150 Gflops, 2006
• Intel 48-core processor, 2009
• Nvidia Tesla, 4.14 Tflops, 2009
• Tilera 100-core processor, 2009
The age of the multi-core processor
• On-chip parallel computer
• IBM Power4 (2001); many others follow (Intel, AMD, Tilera, Cell Broadband Engine)
• First dual-core laptops (2005-6)
• GPUs (Nvidia, ATI): a supercomputer on a desktop
The Impact
• You are taking this class
• A renaissance in parallel computation
• Parallelism is no longer restricted to machine rooms; it is available to everyone
• We all have a parallel computer at our fingertips
• If we don't use the parallelism, we lose it
The payoff
• Capability
  We solved a problem that we couldn't solve before, or under conditions that were not possible previously
• Performance
  Solve the same problem in less time than before
  This can provide a capability if we are solving many problem instances
• The result achieved must justify the effort
  Enable new scientific discovery
  Software costs must be reasonable
How hard is it?
• Two types of users
  Those who enjoy the capabilities that parallelism provides without being aware of the details, e.g. in Photoshop
  Those who get into the driver's seat: write parallel programs and enjoy the benefits of customization and personal preferences
• A well-behaved single-processor algorithm may behave poorly on a parallel computer, and may need to be reformulated
• There is no magic compiler that can turn a serial program into an efficient parallel program all the time and on all machines
What is involved?
• Performance programming
  Low-level details: heavily application dependent
  Irregularity in the computation and its data structures forces us to think even harder
  Users don't start from scratch; they reuse old code
  Beware of dirty rotten code in need of redesign!
• Parallelism introduces many new tradeoffs
  Redesign the software
  Rethink the problem-solving technique
1/4/11
Memory hierarchies and address space organization
The processor-memory gap
• The result of technological trends
• The difference between processing and memory speeds grows exponentially over time
[Figure: processor vs. DRAM performance, 1980-2005, log scale; the processor curve pulls away from memory over time]
An important principle: locality
• Programs generally exhibit two forms of locality in accessing memory
  Temporal locality (time)
  Spatial locality (space)
• Often involves loops
• Opportunities for reuse

  for t = 0 to T-1
    for i = 1 to N-2
      u[i] = (u[i-1] + u[i+1]) / 2
Memory hierarchies
• Exploit reuse through a hierarchy of smaller but faster memories
• Put things in faster memory if we reuse them frequently
[Memory hierarchy diagram, redrawn as a table:]

  Level   Access time   Capacity
  CPU     1 CP          1 word
  L1      2-3 CP        32 to 64 KB
  L2      O(10) CP      256 KB to 4 MB
  DRAM    O(100) CP     GB
  Disk    O(10^6) CP    many GB or TB

  (Transfers between adjacent cache levels move lines of roughly 10-100 B.)
The Benefits of Cache Memory
• Let's say that we have a small fast memory that is 10 times faster (in access time) than main memory …
• If we find what we are looking for 90% of the time (a hit), the access time approaches that of the fast memory
• Taccess = 0.90 × 1 + (1 - 0.9) × 10 = 1.9
• Memory appears to be about 5 times faster
• We organize the references by blocks
• We can have multiple levels of cache
Sidebar
• If cache memory access time is 10 times faster than main memory …
  Cache "hit time": Tcache = Tmain / 10
• And if we find what we are looking for f × 100% of the time (the "cache hit rate") …
  Access time = f × Tcache + (1 - f) × Tmain
              = f × Tmain/10 + (1 - f) × Tmain
              = (1 - 9f/10) × Tmain
• We are now 1/(1 - 9f/10) times faster
• To simplify, we used Tcache = 1, Tmain = 10
Nehalem's Memory Hierarchy
• Source: Intel 64 and IA-32 Architectures Optimization Reference Manual, Table 2.7
realworldtech.com
                        L1 (data)       L2              L3
  Latency (cycles)      4               10              35+
  Associativity         8               8               16
  Line size (bytes)     64              64              64
  Write update policy   Writeback       Writeback       Writeback
  Inclusive?            Non-inclusive   Non-inclusive   Inclusive

  (L3 is 4 MB for Gainestown)