CS 320 Spring 2003 Introduction Laxmikant Kale .
CS 320 Spring 2003 Introduction
Laxmikant Kale
http://charm.cs.uiuc.edu
2
Course objectives and outline
• You will learn about:
  – Parallel architectures overview
    • Message passing support, routing, interconnection networks
    • Cache-coherent scalable shared memory, synchronization
    • Later
      – Relaxed consistency models (?)
      – Novel architectures: Tera, Blue Gene, processors-in-memory
  – Parallel programming models
    • Emphasis on three: message passing, shared memory, and shared objects
    • Ongoing evaluation and comparison of models
  – Commonly needed parallel algorithms/operations
    • Analysis techniques
  – Parallel application categories
  – Performance analysis and optimization of parallel applications
  – Parallel application case studies
3
Project and homeworks
• Significant (effort and grade percentage) course project
  – Groups of 5 students
  – Expect publication-quality results
• Homeworks/machine problems:
  – Weekly (sometimes biweekly)
• Parallel machines:
  – NCSA Origin 2000, Turing cluster, Sun cluster, SMP machine
  – Possible: large machines for evaluating scalability:
    • 1000-processor NCSA cluster
    • 3000-processor Lemieux machine at PSC
4
Resources
• Much of the course will be run via the web
  – Lecture slides and assignments will be available on the course web page
    • http://www-courses.cs.uiuc.edu/~cs320
  – Most of the reading material (papers, manuals) will be on the web
  – Projects will coordinate and submit information on the web
    • Web pages for individual projects will be linked to the course web page
  – Newsgroup: uiuc.class.ece392
• You are expected to read the newsgroup and web pages regularly
5
Advent of parallel computing
• “Parallel computing is necessary to increase speeds”
  – Cry of the ‘70s
  – Processors kept pace with Moore’s law:
    • Doubling speeds every 18 months
• Now, finally, the time is ripe
  – Uniprocessors are commodities (and processor speeds show signs of slowing down)
  – Highly economical to build parallel machines
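The "doubling every 18 months" rule compounds quickly. A minimal sketch of the arithmetic, assuming growth is a clean exponential (the function name `speed_factor` is ours, not from the slides):

```python
def speed_factor(years: float, doubling_months: float = 18.0) -> float:
    """Relative speed after `years`, assuming speed doubles every `doubling_months`."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    for years in (1.5, 3.0, 6.0, 15.0):
        print(f"after {years:4.1f} years: {speed_factor(years):8.1f}x")
```

Changing `doubling_months` shows how sensitive long-range projections are to the assumed doubling period.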
6
Why parallel computing
• It is the only way to increase speed beyond uniprocessors
  – Except, of course, waiting for uniprocessors to become faster!
  – Several applications require orders of magnitude higher performance than is feasible on uniprocessors
• Cost effectiveness:
  – Older argument
  – In 1985, a supercomputer cost 2000 times more than a desktop, yet performed only 400 times faster
  – So: combine microcomputers to get speed at lower cost
  – Incremental scalability:
    • Can get in-between performance points with 20, 50, 100, … processors
  – But:
    • You may get speedup lower than 400 on 2000 processors!
    • Microcomputers became faster, effectively killing supercomputers
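The cost argument and its caveat can be put in numbers. The sketch below uses the 2000x cost and 400x speed figures from the slide. For the caveat it assumes Amdahl's law as the speedup model, which the slide does not name, and the serial fraction of 0.25% is a purely illustrative value.

```python
def amdahl_speedup(processors: int, serial_fraction: float) -> float:
    """Speedup under Amdahl's law (our assumed model, not stated in the slides)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# Figures quoted on the slide for 1985.
COST_RATIO = 2000.0   # supercomputer cost / desktop cost
SPEED_RATIO = 400.0   # supercomputer speed / desktop speed

# The supercomputer costs 5x more per unit of performance.
cost_per_unit_speed = COST_RATIO / SPEED_RATIO

# Hypothetical workload: 0.25% of the work cannot be parallelized.
speedup_2000 = amdahl_speedup(2000, 0.0025)

if __name__ == "__main__":
    print(f"supercomputer cost per unit speed: {cost_per_unit_speed:.1f}x a desktop")
    print(f"speedup on 2000 processors: {speedup_2000:.1f}")
```

Under this model, even a tiny serial fraction keeps 2000 microcomputers below the 400x mark, which is the pitfall the last bullets warn about.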
7
Technology Trends
The natural building block for multiprocessors is now also about the fastest!
[Figure: performance (axis ticks 0.1, 1, 10, 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors]
8
Economics
• Commodity microprocessors are not only fast but CHEAP
• Development cost is tens of millions of dollars ($5-100 million typical)
• BUT, many more are sold compared to supercomputers
  – Crucial to take advantage of the investment, and use the commodity building block
  – Exotic parallel architectures are no more than special-purpose
• Multiprocessors are being pushed by software vendors (e.g. database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs a commodity
• Desktop: a few smaller processors versus one larger one?
  – Multiprocessor on a chip
9
What to Expect?
• Parallel machine classes:
  – Cost and usage define a class! The architecture of a class may change.
  – Desktops, engineering workstations, database/web servers, supercomputers
• Commodity (home/office) desktop:
  – Less than $10,000
  – Possible to provide 10-50 processors for that price!
  – Driver applications:
    • Games, video/signal processing
    • Possibly “peripheral” AI: speech recognition, natural language understanding (?), smart spaces and agents
    • New applications?
10
Engineering workstations
• Price: less than $100,000 (used to be):
  – New acceptable price level may be $50,000
  – 100+ processors, large memory
  – Driver applications:
    • CAD (computer-aided design) of various sorts
    • VLSI
    • Structural and mechanical simulations…
    • Etc. (many specialized applications)
11
Commercial Servers
• Price range: variable ($10,000 to several hundreds of thousands)
  – Defining characteristic: usage
  – Database servers, decision support (MIS), web servers, e-commerce
• High availability and fault tolerance are the main criteria
• Trends to watch out for:
  – Likely emergence of specialized architectures/systems
    • E.g. Oracle’s “No Native OS” approach
• Currently dominated by database servers and TPC benchmarks
  – TPC: Transaction Processing Performance Council; its benchmarks report transaction throughput (e.g. transactions per minute for TPC-C)
  – But this may change to data mining and application servers, with corresponding impact on architecture
12
Supercomputers
• “Definition”: expensive system?!
  – Used to be defined by architecture (vector processors, ...)
  – More than a million US dollars?
  – Thousands of processors
• Driving applications
  – Grand challenges in science and engineering:
  – Global weather modeling and forecasting
  – Rational drug design / molecular simulations
  – Processing of genetic (genome) information
  – Rocket simulation
  – Airplane design (wings and fluid flow, ...)
  – Operations research?? Not recognized yet
  – Other non-traditional applications?
13
Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and techniques
  – Market smaller relative to commercial as MPs become mainstream
  – Dominated by vector machines starting in the 70s
  – Microprocessors have made huge gains in floating-point performance
    • High clock rates
    • Pipelined floating-point units (e.g., multiply-add every cycle)
    • Instruction-level parallelism
    • Effective use of caches (e.g., automatic blocking)
  – Plus economics
• Large-scale multiprocessors replace vector supercomputers
  – Well under way already
  – Except with the Earth Simulator: thousands of vector processors
14
Scientific Computing Demand
15
Engineering Computing Demand
• Large parallel machines are a mainstay in many industries
  – Petroleum (reservoir analysis)
  – Automotive (crash simulation, drag analysis, combustion efficiency)
  – Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
  – Computer-aided design
  – Pharmaceuticals (molecular modeling)
  – Visualization
    • In all of the above
    • Entertainment (films like Toy Story)
    • Architecture (walk-throughs and rendering)
  – Financial modeling (yield and derivative analysis)
  – Etc.
16
Applications: Speech and Image Processing
[Figure: processing requirement (1 MIPS to 10 GIPS) versus year (1980 to 1995) for: sub-band speech coding; 200-word isolated speech recognition; speaker verification; CELP speech coding; ISDN-CD stereo receiver; telephone number recognition; 1,000-word continuous speech recognition; 5,000-word continuous speech recognition; CIF video; HDTV receiver]
• Also CAD, databases, ...
• 100 processors gets you 10 years, 1000 gets you 20!
17
Learning Curve for Parallel Applications
• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOPS on Cray90, 406 for the final version on a 128-processor Paragon, 891 on a 128-processor Cray T3D
18
Scalability Challenges
• Machines are getting bigger and faster
• But:
  – Communication speeds?
  – Memory speeds?
"Now, here, you see, it takes all the running you can do to keep in the same place"
--- Red Queen to Alice in “Through the Looking-Glass”
• Further:
  – Applications are getting more ambitious and complex
    • Irregular structures and dynamic behavior
  – Programming models?
19
Current Scenario: Machines
• Extremely high-performance machines abound
• Clusters in every lab
  – GigaFLOPS per processor!
  – 100 GFLOPS performance possible
• High-end machines at centers and labs:
  – Many thousand processors, multi-TF performance
  – Earth Simulator, ASCI White, PSC Lemieux, ...
• Future machines
  – Blue Gene/L: 128k processors!
  – Blue Gene Cyclops design: 1M processors
    • Multiple processors per chip
    • Low memory-to-processor ratio
20
Communication Architecture
• On clusters:
  – 100 Mb/s Ethernet
    • 100 μs latency
  – Myrinet switches
    • User-level memory-mapped communication
    • 5-15 μs latency, 200 MB/s bandwidth
    • Relatively expensive when compared with cheap PCs
  – VIA, InfiniBand
• On high-end machines:
  – 5-10 μs latency, 300-500 MB/s bandwidth
  – Custom switches (IBM, SGI, ...)
  – Quadrics
• Overall:
  – Communication speeds have increased, but not as much as processor speeds
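To see what the latency and bandwidth figures mean for a program, a common first-order model charges each message a fixed latency plus its size divided by bandwidth. The sketch below is that model, not something stated on the slide. It assumes the Ethernet figure means 100 megabits per second (12.5 MB/s) and takes 10 μs as a midpoint of the quoted 5-15 μs Myrinet latency.

```python
def message_time(size_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """First-order cost of one message: fixed latency plus transfer time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def half_bandwidth_size(latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Message size at which latency and transfer time are equal."""
    return latency_s * bandwidth_bytes_per_s


# (latency in seconds, bandwidth in bytes per second)
ETHERNET = (100e-6, 12.5e6)   # assumption: 100 Mb/s = 12.5 MB/s, 100 us latency
MYRINET = (10e-6, 200e6)      # assumption: midpoint of 5-15 us, 200 MB/s

if __name__ == "__main__":
    for name, link in (("Ethernet", ETHERNET), ("Myrinet", MYRINET)):
        for size in (8, 1024, 1_000_000):
            t_us = message_time(size, *link) * 1e6
            print(f"{name:8s} {size:>9d} bytes: {t_us:10.1f} us")
        print(f"{name:8s} half-bandwidth size: {half_bandwidth_size(*link):.0f} bytes")
```

For small messages the latency term dominates on both networks, which is why many parallel programs try to send fewer, larger messages.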
21
Memory and Caches
• Bottom line again:
  – Memories are faster, but not keeping pace with processors
  – Deep memory hierarchies:
    • On chip and off chip
  – Must be handled almost explicitly in programs to get good performance
    • A factor of 10 (or even 50) slowdown is possible with bad cache behavior
    • Increase reuse of data: if the data is in cache, use it for as many of the things you need to do as possible
    • Blocking helps
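Blocking (tiling) restructures loops so that a small tile of data is reused many times while it is still in cache. The sketch below shows the loop structure for matrix multiplication. The function names and the default `block` size are illustrative choices, and in pure Python the cache benefit itself will not be visible: the point is only the loop ordering, which carries over to C or Fortran.

```python
def matmul_naive(a, b):
    """Plain triple loop: c = a * b for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile to increase data reuse."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay inside one tile.
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]  # reused across the whole j-tile
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    print(matmul_blocked(a, b))
```

In a compiled language, `block` would be tuned so that the tiles being worked on fit in the cache level of interest.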
22
Application Complexity is increasing
• Why?
  – With more FLOPS, need better algorithms
    • Not enough to just do more of the same
  – Better algorithms lead to complex structure
  – Example: gravitational force calculation
    • Direct all-pairs: O(N^2), but easy to parallelize
    • Barnes-Hut: O(N log N), but more complex
  – Multiple modules, dual time-stepping
  – Adaptive and dynamic refinements
• Ambitious projects
  – Projects with new objectives lead to dynamic behavior and multiple components
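The direct all-pairs method in the gravitational example is short enough to write out. This is a minimal sketch with illustrative names, using Newton's third law so that each pair is visited once (N(N-1)/2 interactions). Barnes-Hut, which replaces distant groups of bodies with a tree of approximations, is not shown.

```python
import math

G = 6.674e-11  # gravitational constant in SI units


def all_pairs_forces(positions, masses, g=G):
    """Return the net gravitational force vector on each body (direct O(N^2) method)."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Vector from body i to body j.
            delta = [positions[j][d] - positions[i][d] for d in range(3)]
            dist_sq = sum(c * c for c in delta)
            dist = math.sqrt(dist_sq)
            # Force magnitude divided by distance, so scale * delta is the force vector.
            scale = g * masses[i] * masses[j] / (dist_sq * dist)
            for d in range(3):
                forces[i][d] += scale * delta[d]   # i is pulled toward j
                forces[j][d] -= scale * delta[d]   # equal and opposite on j
    return forces


if __name__ == "__main__":
    pos = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    print(all_pairs_forces(pos, [1.0, 1.0], g=1.0))
```

Each iteration of the outer loop is independent apart from the force accumulation, which is what makes this version easy to parallelize despite its cost.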
23
Disparity between peak and attained speed
• As a combination of all of these factors:
  – The attained performance of most real applications is substantially lower than the peak performance of machines
  – Caution: expecting to attain peak performance is a pitfall
    • We don’t use such a metric for our internal combustion engines, for example
    • But it gives us a metric to gauge how much improvement is possible
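The gap can be expressed as a fraction of peak. The sketch below uses the 891 MFLOPS on 128 processors quoted earlier for AMBER on the Cray T3D. The per-processor peak of 150 MFLOPS is an assumption supplied for illustration, not a figure from these slides.

```python
def fraction_of_peak(attained_mflops: float, processors: int,
                     peak_mflops_per_processor: float) -> float:
    """Attained performance divided by the machine's aggregate peak."""
    aggregate_peak = processors * peak_mflops_per_processor
    return attained_mflops / aggregate_peak


# Assumption for illustration only: peak rate of one processor.
ASSUMED_PEAK_PER_PROCESSOR = 150.0

fraction = fraction_of_peak(891.0, 128, ASSUMED_PEAK_PER_PROCESSOR)

if __name__ == "__main__":
    print(f"attained fraction of peak: {fraction:.1%}")
```

Read this way, the fraction is less a grade than an upper bound on how much further tuning could still gain.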