Why Parallel/Distributed Computing
Why Parallel/Distributed Computing
Sushil K. Prasad, [email protected]
.
What is Parallel and Distributed Computing?
Solving a single problem faster using multiple CPUs
E.g. Matrix Multiplication C = A X B (see the sketch after this list)
Parallel = Shared Memory among all CPUs
Distributed = Local Memory/CPU
Common Issues: Partition, Synchronization, Dependencies, Load Balancing
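The matrix-multiplication example maps naturally onto multiple CPUs because every entry of C can be computed independently. Below is a minimal sketch in C with OpenMP (my own illustration, not from the slides; the matrix size N and the square layout are assumptions made only for illustration):

#include <stdio.h>

#define N 512   /* assumed matrix dimension, chosen only for illustration */

static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    /* Fill A and B with arbitrary values. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j;
        }

    /* C = A x B: different (i, j) entries are independent, so the rows of C
       can be partitioned across the available CPUs/threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}

Compile with an OpenMP-capable compiler (e.g. gcc -fopenmp); without OpenMP the pragma is ignored and the loop simply runs sequentially.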
.
ENIAC (350 op/s), 1946 (U.S. Army photo)
.
ASCI White (10 teraops/sec 2006)
Megaflops = 10^6 flops ≈ 2^20
Giga = 10^9 (billion) ≈ 2^30
Tera = 10^12 (trillion) ≈ 2^40
Peta = 10^15 (quadrillion) ≈ 2^50
Exa = 10^18 (quintillion) ≈ 2^60
.
65 Years of Speed Increases
ENIAC (1946): 350 flops
Today (2011): K computer, 8 petaflops (8 × 10^15 flops)
.
Why Parallel and Distributed Computing? Grand Challenge Problems
Weather Forecasting; Global Warming
Materials Design: superconducting material at room temperature; nano-devices; spaceships
Organ Modeling; Drug Discovery
.
Why Parallel and Distributed Computing? Physical Limitations of Circuits
Heat and light effects
Superconducting material to counter the heat effect
Speed-of-light effect: no solution!
.
Microprocessor Revolution
[Chart: speed (log scale) vs. time for supercomputers, mainframes, minis, and micros]
.
VLSI – Effect of Integration
1 M transistors enough for full functionality, e.g. DEC's Alpha (90's)
The rest must go into multiple CPUs per chip
Cost: multitudes of average CPUs give better FLOPS/$ compared to traditional supercomputers
Why Parallel and Distributed Computing?
.
Modern Parallel Computers
Caltech's Cosmic Cube (Seitz and Fox)
Commercial copy-cats:
nCUBE Corporation (512 CPUs)
Intel's Supercomputer Systems: iPSC1, iPSC2, Intel Paragon (512 CPUs)
Thinking Machines Corporation: CM2 (65K 4-bit CPUs), 12-dimensional hypercube, SIMD; CM5, fat-tree interconnect, MIMD
Tianhe-1A: 4.7 petaflops; 14K Xeon X5670 and 7,168 Nvidia Tesla M2050
K computer: 8 petaflops (8 × 10^15 FLOPS), 2011; 68K 2.0 GHz 8-core CPUs, 548,352 cores
.
Everyday Reasons
Available local networked workstations and Grid resources should be utilized
Solve compute-intensive problems faster
Make infeasible problems feasible
Reduce design time
Leverage large combined memory
Solve larger problems in the same amount of time
Improve answer's precision
Gain competitive advantage
Exploit commodity multi-core and GPU chips
Find jobs!
Why Parallel and Distributed Computing?
.
Why Shared Memory Programming?
Easier conceptual environment
Programmers typically familiar with concurrent threads and processes sharing an address space
CPUs within multi-core chips share memory
OpenMP: an application programming interface (API) for shared-memory systems (see the sketch after this list)
Supports higher-performance parallel programming of symmetric multiprocessors
Java threads
MPI for distributed-memory programming
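As a concrete taste of the shared-memory style listed above, here is a minimal OpenMP sketch in C (my own illustration, not from the slides): all threads read the same shared array, and the reduction clause handles the synchronization on the shared sum.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    enum { N = 1000000 };        /* assumed problem size, for illustration */
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    /* Every thread sees the same array a[] (shared memory); OpenMP splits
       the iterations and combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}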
.
Seeking Concurrency
Data dependence graphs
Data parallelism
Functional parallelism
Pipelining
.
Data Dependence Graph
Directed graph
Vertices = tasks
Edges = dependencies
(A small sketch of one possible representation follows below.)
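One simple way to hold such a graph in a program is an adjacency matrix over tasks; a minimal sketch in C, where the task names and the edge list are made up purely for illustration:

#include <stdio.h>

/* Hypothetical tasks: 0 = read input, 1 = compute A, 2 = compute B, 3 = combine. */
#define NTASKS 4

/* dep[i][j] = 1 means task j depends on (must run after) task i.
   Edges: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3. */
static const int dep[NTASKS][NTASKS] = {
    {0, 1, 1, 0},
    {0, 0, 0, 1},
    {0, 0, 0, 1},
    {0, 0, 0, 0},
};

int main(void)
{
    /* Tasks 1 and 2 have no edge between them, so they may run concurrently. */
    for (int i = 0; i < NTASKS; i++)
        for (int j = 0; j < NTASKS; j++)
            if (dep[i][j])
                printf("task %d -> task %d\n", i, j);
    return 0;
}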
.
Data Parallelism
Independent tasks apply the same operation to different elements of a data set
Okay to perform the operations concurrently
Speedup: potentially p-fold, where p = # of processors (a parallel C sketch follows the loop below)

for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
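In C with OpenMP the same loop can be parallelized directly, since each iteration writes a different element; a minimal sketch (the function name and the fixed array size of 100 are assumptions for illustration):

void add_arrays(double a[100], const double b[100], const double c[100])
{
    /* Every iteration touches a distinct a[i], so iterations are independent
       and can be divided among up to p processors/threads. */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        a[i] = b[i] + c[i];
}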
.
Functional Parallelism
Independent tasks apply different operations to different data elements
First and second statements can run concurrently
Third and fourth statements can run concurrently
Speedup: limited by the number of concurrent sub-tasks (a C sketch follows the statements below)

a ← 2
b ← 3
m ← (a + b) / 2
s ← (a^2 + b^2) / 2
v ← s - m^2
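The third and fourth statements both read a and b but do not depend on each other, so they can be computed as two different operations in parallel. A minimal sketch in C using OpenMP sections (my own illustration, not from the slides):

#include <stdio.h>

int main(void)
{
    double a = 2.0, b = 3.0;
    double m = 0.0, s = 0.0, v;

    /* m and s are independent of each other, so each section
       may be executed by a different thread. */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2.0;

        #pragma omp section
        s = (a * a + b * b) / 2.0;
    }

    /* v depends on both m and s, so it must wait for the sections to finish. */
    v = s - m * m;

    printf("m = %f, s = %f, v = %f\n", m, s, v);
    return 0;
}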
.
Pipelining
Divide a process into stages
Produce several items simultaneously
Speedup: limited by the number of concurrent sub-tasks = # of stages in the pipeline (a small sketch follows below)
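A minimal sketch of the idea in C (my own illustration): a 3-stage pipeline over a stream of items, where on each tick every stage works on a different item, so once the pipeline is full as many items are in flight as there are stages.

#include <stdio.h>

#define STAGES 3
#define ITEMS  6

int main(void)
{
    /* On tick t, stage s processes item (t - s), if that item exists.
       After the pipeline fills, all STAGES stages are busy on every tick. */
    for (int t = 0; t < ITEMS + STAGES - 1; t++) {
        printf("tick %d:", t);
        for (int s = 0; s < STAGES; s++) {
            int item = t - s;
            if (item >= 0 && item < ITEMS)
                printf("  stage %d -> item %d", s + 1, item);
        }
        printf("\n");
    }
    return 0;
}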