CS 320 Spring 2003 Introduction Laxmikant Kale

http://charm.cs.uiuc.edu

Course objectives and outline

• You will learn about:

– Parallel architectures overview

• Message passing support, routing, interconnection networks

• Cache-coherent scalable shared memory, synchronization

• Later

– Relaxed consistency models (?)

– Novel architectures: Tera, Blue Gene, Processors-in-memory

– Parallel programming models

• Emphasis on 3: message passing, shared memory, and shared objects

• Ongoing evaluation and comparison of models

– Commonly needed parallel algorithms/operations

• Analysis techniques

– Parallel application categories

– Performance analysis and optimization of parallel applications

– Parallel application case studies

Project and homeworks

• Significant (effort and grade percentage) course project

– Groups of 5 students

– Expect publication-quality results

• Homeworks/machine problems:

– weekly (sometimes biweekly)

• Parallel machines:

– NCSA Origin 2000, Turing Cluster, SUN cluster, SMP machine

– Possible: Large machines for evaluating scalability:

• 1000-processor NCSA cluster

• 3000-processor Lemieux machine at PSC

Resources

• Much of the course will be run via the web

– Lecture slides and assignments will be available on the course web page

• http://www-courses.cs.uiuc.edu/~cs320

– Most of the reading material (papers, manuals) will be on the web

– Projects will coordinate and submit information on the web

• Web pages for individual projects will be linked to the course web page

– Newsgroup: uiuc.class.ece392

• You are expected to read the newsgroup and web pages regularly

Advent of parallel computing

• “Parallel computing is necessary to increase speeds”

– Cry of the ‘70s

– Processors kept pace with Moore’s law:

• Doubling speeds every 18 months

• Now, finally, the time is ripe

– Uniprocessors are commodities (and processor speeds show signs of slowing down)

– Highly economical to build parallel machines

Why parallel computing

• It is the only way to increase speed beyond uniprocessors

– Except, of course, waiting for uniprocessors to become faster!

– Several applications require orders of magnitude higher performance than feasible on uniprocessors

• Cost effectiveness:

– older argument

– in 1985, a supercomputer cost 2000 times more than a desktop, yet was only 400 times faster.

– So: combine microcomputers to get speed at lower cost

– Incremental scalability:

• can get in-between performance points with 20, 50, 100, … processors

– But:

• You may get a speedup lower than 400 on 2000 processors!

• Microcomputers became faster, effectively killing supercomputers
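The cost-effectiveness argument above can be worked through as simple arithmetic. The sketch below uses only the two ratios quoted on the slide (2000 times the cost, 400 times the speed). The function names and the sample efficiency values are hypothetical, chosen for illustration, and the model assumes a parallel machine costs exactly as much as the sum of its desktops.

```python
# Ratios quoted on the slide for 1985.
COST_RATIO = 2000   # supercomputer cost / desktop cost
SPEED_RATIO = 400   # supercomputer speed / desktop speed


def break_even_efficiency(cost_ratio, speed_ratio):
    """Parallel efficiency at which `cost_ratio` desktops match the supercomputer."""
    return speed_ratio / cost_ratio


def speedup(processors, efficiency):
    """Speedup over one desktop, given a parallel efficiency between 0 and 1."""
    return processors * efficiency


eff = break_even_efficiency(COST_RATIO, SPEED_RATIO)
print(f"break-even parallel efficiency: {eff:.0%}")

# Illustrative (assumed) efficiencies: below break-even the supercomputer wins.
for e in (0.10, 0.20, 0.50):
    s = speedup(COST_RATIO, e)
    verdict = "beats" if s > SPEED_RATIO else "does not beat"
    print(f"efficiency {e:.0%}: speedup {s:.0f} {verdict} the supercomputer")
```

Under these assumptions, the "But" on the slide corresponds to the case where efficiency on 2000 processors falls below the break-even value.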

Technology Trends

The natural building block for multiprocessors is now also about the fastest!

[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors]

Economics

• Commodity microprocessors are not only fast but CHEAP

• Development cost is tens of millions of dollars ($5-100 million typical)

• BUT, many more are sold compared to supercomputers

– Crucial to take advantage of the investment, and use the commodity building block

– Exotic parallel architectures are no more than special-purpose

• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors

• Standardization by Intel makes small, bus-based SMPs commodity

• Desktop: few smaller processors versus one larger one?

– Multiprocessor on a chip

What to Expect?

• Parallel Machine classes:

– Cost and usage define a class! Architecture of a class may change.

– Desktops, engineering workstations, database/web servers, supercomputers

• Commodity (home/office) desktop:

– less than $10,000

– possible to provide 10-50 processors for that price!

– Driver applications:

• games, video/signal processing

• possibly “peripheral” AI: speech recognition, natural language understanding (?), smart spaces and agents

• New applications?

Engineering workstations

• Price: less than $100,000 (used to be):

– new acceptable price level may be $50,000

– 100+ processors, large memory

– Driver applications:

• CAD (Computer aided design) of various sorts

• VLSI

• Structural and mechanical simulations…

• Etc. (many specialized applications)

Commercial Servers

• Price range: variable ($10,000 - several hundreds of thousands)

– defining characteristic: usage

– Database servers, decision support (MIS), web servers, e-commerce

• High availability, fault tolerance are main criteria

• Trends to watch out for:

– Likely emergence of specialized architectures/systems

• E.g. Oracle’s “No Native OS” approach

• Currently dominated by database servers, and TPC benchmarks

– TPC: Transaction Processing Performance Council; its benchmarks measure transaction throughput

– But this may change to data mining and application servers, with corresponding impact on architecture.

Supercomputers

• “Definition”: expensive system?!

– Used to be defined by architecture (vector processors, ..)

– More than a million US dollars?

– Thousands of processors

• Driving applications

– Grand challenges in science and engineering:

– Global weather modeling and forecast

– Rational Drug design / molecular simulations

– Processing of genetic (genome) information

– Rocket simulation

– Airplane design (wings and fluid flow..)

– Operations research?? Not recognized yet

– Other non-traditional applications?

Consider Scientific Supercomputing

• Proving ground and driver for innovative architecture and techniques

– Market smaller relative to commercial as MPs become mainstream

– Dominated by vector machines starting in the 70s

– Microprocessors have made huge gains in floating-point performance

• high clock rates

• pipelined floating point units (e.g., multiply-add every cycle)

• instruction-level parallelism

• effective use of caches (e.g., automatic blocking)

– Plus economics

• Large-scale multiprocessors replace vector supercomputers

– Well under way already

– Except with the Earth Simulator: thousands of vector processors

Scientific Computing Demand

Engineering Computing Demand

• Large parallel machines a mainstay in many industries

– Petroleum (reservoir analysis)

– Automotive (crash simulation, drag analysis, combustion efficiency)

– Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)

– Computer-aided design

– Pharmaceuticals (molecular modeling)

– Visualization

• in all of the above

• entertainment (films like Toy Story)

• architecture (walk-throughs and rendering)

– Financial modeling (yield and derivative analysis)

– etc.

Applications: Speech and Image Processing

[Figure: required processing rate (1 MIPS to 10 GIPS) versus year (1980 to 1995) for speech and image processing applications: sub-band speech coding; 200-word isolated speech recognition; speaker verification; CELP speech coding; ISDN-CD stereo receiver; telephone number recognition; 1,000-word continuous speech recognition; CIF video; 5,000-word continuous speech recognition; HDTV receiver]

• Also CAD, Databases, . . .

• 100 processors get you 10 years, 1000 get you 20!

Learning Curve for Parallel Applications

• AMBER molecular dynamics simulation program

• Starting point was vector code for Cray-1

• 145 MFLOPS on Cray90, 406 for final version on 128-processor Paragon, 891 on 128-processor Cray T3D

Scalability Challenges

• Scalability Challenges

– Machines are getting bigger and faster

• But

– Communication speeds?

– Memory speeds?

"Now, here, you see, it takes all the running you can do to keep in the same place"

---Red Queen to Alice in “Through The Looking Glass”

– Further:

– Applications are getting more ambitious and complex

• Irregular structures and dynamic behavior

– Programming models?

Current Scenario: Machines

• Extremely high-performance machines abound

• Clusters in every lab

– GigaFLOPS per processor!

– 100 GFLOPS performance possible

• High End machines at centers and labs:

– Many thousand processors, multi-TF performance

– Earth Simulator, ASCI White, PSC Lemieux, ...

• Future Machines

– Blue Gene/L: 128k processors!

– Blue Gene Cyclops Design: 1M processors

• Multiple Processors per chip

• Low Memory to Processor Ratio

Communication Architecture

• On clusters:

– 100 Mb/s Ethernet

• 100 μs latency

– Myrinet switches

• User level memory-mapped communication

• 5-15 μs latency, 200 MB/s bandwidth

• Relatively expensive, when compared with cheap PCs

– VIA, Infiniband

• On high end machines:

– 5-10 μs latency, 300-500 MB/s bandwidth

– Custom switches (IBM, SGI, ...)

– Quadrics

• Overall:

– Communication speeds have increased but not as much as processor speeds
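To see how the latency and bandwidth figures above interact, a common first-order estimate is time = latency + size / bandwidth. The sketch below applies that model, which is an assumption here rather than something stated on the slide. The Ethernet bandwidth assumes the slide refers to 100 megabit-per-second Ethernet (about 12.5 MB/s), and the Myrinet and high-end entries use midpoints of the quoted ranges. All function and dictionary names are hypothetical.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_per_s):
    """Estimated one-way message time in microseconds (latency + size / bandwidth).

    With 1 MB taken as 10**6 bytes, 1 MB/s equals 1 byte per microsecond,
    so bytes / (MB/s) is already in microseconds.
    """
    return latency_us + message_bytes / bandwidth_mb_per_s


def half_bandwidth_bytes(latency_us, bandwidth_mb_per_s):
    """Message size at which latency and transmission time are equal."""
    return latency_us * bandwidth_mb_per_s


# (latency in microseconds, bandwidth in MB/s); midpoints are assumptions.
NETWORKS = {
    "Ethernet (assumed 100 Mb/s)": (100.0, 12.5),
    "Myrinet (midpoint latency)": (10.0, 200.0),
    "High end (midpoints)": (7.5, 400.0),
}

for name, (latency, bandwidth) in NETWORKS.items():
    small = transfer_time_us(100, latency, bandwidth)
    large = transfer_time_us(1_000_000, latency, bandwidth)
    print(f"{name}: 100 B in {small:.1f} us, 1 MB in {large:.1f} us")
```

In this model small messages are dominated by latency and large ones by bandwidth, which is one reason to batch many small messages into fewer large ones where an application allows it.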

Memory and Caches

• Bottom line again:

– Memories are faster, but not keeping pace with processors

– Deep memory hierarchies:

• On chip and off chip

– Must be handled almost explicitly in programs to get good performance

• A factor of 10 (or even 50) slowdown is possible with bad cache behavior

• Increase reuse of data: if the data is in cache, use it for as many different things as you need to do

• Blocking helps
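As an illustration of blocking, the sketch below restructures a matrix multiplication so that it works on small tiles, reusing each tile of the inputs several times before moving on. Matrix multiplication is chosen here as an assumed example and is not named on the slide. The function names and the default block size are hypothetical. Pure Python will not show the cache speedup itself; the point is only the loop structure, which carries over to compiled code.

```python
def matmul_naive(a, b):
    """Straightforward triple loop: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile to increase data reuse."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay within one tile.
    for ii in range(0, n, block):
        for jj in range(0, p, block):
            for kk in range(0, m, block):
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, p)):
                        partial = c[i][j]
                        for k in range(kk, min(kk + block, m)):
                            partial += a[i][k] * b[k][j]
                        c[i][j] = partial
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_blocked(a, b) == matmul_naive(a, b))
```

In a compiled language the block size would typically be tuned so that the tiles in use fit in cache together.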

Application Complexity is increasing

• Why?

– With more FLOPS, need better algorithms

• Not enough to just do more of the same

– Better algorithms lead to complex structure

– Example: Gravitational force calculation

• Direct all-pairs: O(N^2), but easy to parallelize

• Barnes-Hut: O(N log N), but more complex

– Multiple modules, dual time-stepping

– Adaptive and dynamic refinements

• Ambitious projects

– Projects with new objectives lead to dynamic behavior and multiple components
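The direct all-pairs method mentioned in the gravitational example can be sketched in a few lines, which is part of why it is easy to parallelize: every pair is independent. The code below is a minimal sequential sketch, assuming point masses in three dimensions at distinct positions. The function name and the default constant are illustrative. Barnes-Hut is not shown, since it additionally needs a spatial tree and an accuracy parameter.

```python
import math

G = 6.674e-11  # gravitational constant in SI units (approximate)


def all_pairs_forces(positions, masses, g=G):
    """Net gravitational force on each body, by visiting every pair once.

    positions: list of [x, y, z]; masses: list of floats.
    Assumes no two bodies share a position (that would divide by zero).
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):          # N(N-1)/2 pairs: O(N^2) work
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(x * x for x in d)
            r = math.sqrt(r2)
            magnitude = g * masses[i] * masses[j] / r2
            for k in range(3):
                # Equal and opposite forces along the line joining the pair.
                forces[i][k] += magnitude * d[k] / r
                forces[j][k] -= magnitude * d[k] / r
    return forces


# Two unit masses 2 units apart, with g=1 to keep the numbers readable.
print(all_pairs_forces([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], [1.0, 1.0], g=1.0))
```

The doubly nested loop is what gives the O(N^2) cost; a tree code such as Barnes-Hut reduces the work by approximating distant groups of bodies, at the price of the irregular, dynamic structure the slide refers to.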

Disparity between peak and attained speed

• As a combination of all of these factors:

– The attained performance of most real applications is substantially lower than the peak performance of machines

– Caution: Expecting to attain peak performance is a pitfall

• We don’t use such a metric for our internal combustion engines, for example

• But it gives us a metric to gauge how much improvement is possible
