CS 420 - Design of Algorithms
Transcript of CS 420 - Design of Algorithms
Parallel Computer Architecture and Software Models
Parallel Computing – it's about performance

Greater performance is the reason for parallel computing. Many types of scientific and engineering programs are too large and too complex for traditional uniprocessors. Such large problems are common in ocean modeling, weather modeling, astrophysics, solid state physics, power systems, CFD…
FLOPS – a measure of performance

FLOPS – Floating Point Operations per Second… a measure of how much computation can be done in a certain amount of time.

- MegaFLOPS – MFLOPS – 10^6 FLOPS
- GigaFLOPS – GFLOPS – 10^9 FLOPS
- TeraFLOPS – TFLOPS – 10^12 FLOPS
- PetaFLOPS – PFLOPS – 10^15 FLOPS
How fast …

- Cray 1 – ~150 MFLOPS
- Pentium 4 – 3–6 GFLOPS
- IBM's BlueGene – 360+ TFLOPS
- PSC's Big Ben – 10 TFLOPS
- Humans – it depends:
  - as calculators – 0.001 MFLOPS
  - as information processors – 10 PFLOPS
FLOPS vs. MIPS

FLOPS is only concerned with floating-point calculations. Other performance issues:

- memory latency
- cache performance
- I/O capacity
- interconnect
See…

www.Top500.org – biannual performance reports and rankings of the fastest computers in the world
Performance

Speedup(n processors) = time(1 processor) / time(n processors)
** Culler, Singh and Gupta, Parallel Computer Architecture: A Hardware/Software Approach
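As a minimal sketch (the function name is illustrative), the speedup ratio above can be computed directly:

```python
def speedup(time_1proc, time_nproc):
    """Speedup(n processors) = time(1 processor) / time(n processors)."""
    return time_1proc / time_nproc

# A job that takes 100 s on one processor and 12.5 s on 16 processors
# achieves a speedup of 8 -- only half of the ideal 16.
print(speedup(100.0, 12.5))  # 8.0
```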
Consider…

(map of the Indian Ocean, from: www.lib.utexas.edu/maps/indian_ocean.html)
… a model of the Indian Ocean –

- 73,000,000 square kilometers of surface
- One data point per 100 meters – 7,300,000,000 surface points
- Need to model the ocean at depth – say every 10 meters down to 200 meters – 20 depth data points
- Every 10 minutes for 4 hours – 24 time steps
So –

73 × 10^6 (sq. km of surface) × 10^2 (points per sq. km) × 20 (depth points) × 24 (time steps) = 3,504,000,000,000 data points in the model grid

Suppose calculations of 100 instructions per grid point – 350,400,000,000,000 instructions in the model
Then –

Imagine that you have a computer that can run 1 billion (10^9) instructions per second: 3.504 × 10^14 / 10^9 = 350,400 seconds, or about 97 hours
But –

On a 10 teraflops computer: 3.504 × 10^14 / 10^13 ≈ 35 seconds
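The estimates on these slides can be rechecked with a few lines of arithmetic (variable names are mine):

```python
# Recomputing the Indian Ocean model estimates.
surface_km2 = 73_000_000            # ocean surface area, square kilometers
points_per_km2 = 100                # one data point per 100 m -> 10 x 10 per sq. km
depth_points = 20                   # every 10 m down to 200 m
time_steps = 24                     # every 10 minutes for 4 hours

grid_points = surface_km2 * points_per_km2 * depth_points * time_steps
instructions = grid_points * 100    # ~100 instructions per grid point

print(grid_points)                  # 3504000000000  (3.504 x 10^12 grid points)
print(instructions)                 # 350400000000000  (3.504 x 10^14 instructions)
print(instructions / 1e9 / 3600)    # hours at 10^9 instructions/s (~97.3)
print(instructions / 1e13)          # seconds at 10 TFLOPS (~35.04)
```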
Gaining performance

Pipelining
- More instructions – faster
- More instructions in execution at the same time in a single processor
- Not usually an attractive strategy these days – why?
Instruction Level Parallelism (ILP)

Based on the fact that many instructions do not depend on the instructions before them… The processor has extra hardware to execute several instructions at the same time… multiple adders…
Pipelining and ILP are not the solution to our problem – why?

- near-incremental improvements in performance
- been done already
- we need orders-of-magnitude improvements in performance
Gaining Performance

Vector Processors
- Scientific and engineering computations are often vector and matrix operations – e.g. graphic transformations: shift object x to the right
- Redundant arithmetic hardware and vector registers to operate on an entire vector in one step (SIMD)
Gaining Performance

Vector Processors
- Declining popularity for a while – hardware expensive
- Popularity returning – applications in science, engineering, cryptography, media/graphics
- Earth Simulator
- your computer?
Parallel Computer Architecture

- Shared Memory Architectures
- Distributed Memory
Shared Memory Systems

- Multiple processors connected to / sharing the same pool of memory
- SMP
- Every processor has, potentially, access to and control of every memory location
Shared Memory Computers

[Diagram: several processors all connected to a single shared memory]
Shared Memory Computers

[Diagram: several processors connected to several shared memory banks]
Shared Memory Computer

[Diagram: processors connected to multiple memory banks through a switch]
Shared Memory Computers

SGI Origin2000 at NCSA – Balder
- 256 250 MHz R10000 processors
- 128 Gbytes of memory
Shared Memory Computers

Rachel at PSC
- 64 1.15 GHz EV7 processors
- 256 Gbytes of shared memory
Distributed Memory Systems

- Multiple processors, each with their own memory
- Interconnected to share/exchange data and processing
- Modern architectural approach to supercomputers
- Supercomputers and clusters similar
- **Hybrid distributed/shared memory
Clusters – distributed memory

[Diagram: several nodes, each a processor with its own memory, connected by an interconnect]
Cluster – Distributed Memory with SMP

[Diagram: nodes of two processors (Proc1, Proc2) sharing node memory, nodes connected by an interconnect]
Distributed Memory Supercomputer

BlueGene/L – DOE/IBM
- 0.7 GHz PowerPC 440
- 131,072 processors (previously 32,768 processors)
- 367 teraflops (was 70 TFlops)
Distributed Memory Supercomputer

Thunder at LLNL
- Number 19 (was Number 5)
- 20 teraflops
- 1.4 GHz Itanium processors
- 4,096 processors
Earth Simulator

- Japan, built by NEC
- Number 14 (was Number 1)
- 40 TFlops
- 640 nodes, each node = 8 vector processors
- 640×640 full crossbar
Grid Computing Systems

What is a Grid? Means different things to different people.

Distributed processors
- around campus
- around the state
- around the world
Grid Computing Systems

- Widely distributed
- Loosely connected (i.e. the Internet)
- No central management
Grid Computing Systems

Connected clusters/other dedicated scientific computers – e.g. over I2/Abilene
Grid Computer Systems

[Diagram: a control/scheduler node distributing work over the Internet to harvested idle cycles]
Grid Computing Systems

Dedicated Grids
- TeraGrid
- Sabre
- NASA Information Power Grid

Cycle Harvesting Grids
- Condor
- *GlobalGridExchange (Parabon)
- Seti@home – http://setiathome.berkeley.edu/
- Einstein@home – http://einstein.phys.uwm.edu/
Flynn's Taxonomy

- Single Instruction/Single Data – SISD
- Multiple Instruction/Single Data – MISD
- Single Instruction/Multiple Data – SIMD
- Multiple Instruction/Multiple Data – MIMD
- *Single Program/Multiple Data – SPMD
SISD – Single Instruction Single Data

- Single instruction stream – "single instruction execution per clock cycle"
- Single data stream – one piece of data per clock cycle
- Deterministic
- Traditional CPU, most single-CPU PCs

Instruction stream:
Load x to A
Load y to B
Add B to A
Store A
Load x to A
…
Single Instruction Multiple Data

- One instruction stream
- Multiple data streams (partitions)
- A given instruction operates on multiple data elements
- Lockstep
- Deterministic
- Processor arrays, vector processors
- CM-2, Cray C90

PE-1: Load A(1); Load B(1); C(1)=A(1)*B(1); Store C(1)
PE-2: Load A(2); Load B(2); C(2)=A(2)*B(2); Store C(2)
PE-n: Load A(n); Load B(n); C(n)=A(n)*B(n); Store C(n)
Multiple Instruction Single Data

- Multiple instruction streams
- Operate on a single data stream
- Several instructions operate on the same data element – concurrently
- A bit strange – CMU
- Multi-pass filters
- Encryption – code cracking

PE-1: Load A(1); Load B(1); C(1)=A(1)*4; Store C(1)
PE-2: Load A(1); Load B(2); C(2)=A(1)*4; Store C(2)
PE-n: Load A(1); Load B(n); C(n)=A(1)*4; Store C(n)
Multiple Instruction Multiple Data

- Multiple instruction streams
- Multiple data streams
- Each processor has its own instructions and its own data
- Most supercomputers, clusters, grids

PE-1: Load A(1); Load B(1); C(1)=A(1)*4; Store C(1)
PE-2: Load G; A=SQRT(G); C=A*Pi; Store C
PE-n: Call func2(C,G); Load B; Call func1(B,C); Store G
Single Program Multiple Data

- Single code image/executable
- Each processor has its own data
- Instruction execution under program control
- DMC, SMP

PE-1: if PE=1 then…; Load A; Load B; C=A*B; Store C
PE-2: if PE=2 then…; Load A; Load B; C=A*B; Store C
PE-n: if PE=n then…; Load A; Load B; C=A*B; Store C
Multiple Program Multiple Data

- MPMD is like SPMD…
- …except each processor runs a separate, independent executable
- How to implement interprocess communications? Sockets; MPI-2 – more later

SPMD: every PE runs the same executable (ProgA, ProgA, ProgA, …)
MPMD: each PE runs its own executable (ProgA, ProgB, ProgC, ProgD, …)
UMA and NUMA

UMA – Uniform Memory Access
- All processors have equal access to memory
- Usually found in SMPs
- Identical processors
- Difficult to implement as the number of processors increases
- Good processor-to-memory bandwidth
- Cache coherency (CC) important – can be implemented in hardware
UMA and NUMA

NUMA – Non-Uniform Memory Access
- Access to memory differs by processor – local processor = good access, nonlocal processors = not-so-good access
- Usually multiple computers or multiple SMPs
- Memory access across the interconnect is slow
- Cache coherency (CC) can be done – usually not a problem
Let's revisit speedup…

We can achieve speedup (theoretically) by using more processors, but a number of factors may limit speedup:
- Interprocessor communications
- Interprocess synchronization
- Load balance
- Parallelizability of algorithms
Amdahl's Law

According to Amdahl's Law:

Speedup = 1 / (S + (1 − S)/N)

where S is the purely sequential part of the program and N is the number of processors.
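A quick sketch of the formula (function name is mine) makes the limit concrete:

```python
def amdahl_speedup(s, n):
    """Amdahl's Law: speedup = 1 / (S + (1 - S)/N),
    where S is the serial fraction and N the processor count."""
    return 1.0 / (s + (1.0 - s) / n)

# Even with 1000 processors, a 5% serial fraction caps speedup near 20:
print(round(amdahl_speedup(0.05, 1000), 1))  # 19.6
```

The serial fraction, not the processor count, dominates: with S = 0.05 the speedup can never exceed 1/S = 20 no matter how large N grows.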
Amdahl's Law

What does it mean?
- Part of a program is parallelizable
- Part of the program must be sequential (S)

Amdahl's law says: speedup is constrained by the portion of the program that must remain sequential, relative to the part that is parallelized.

Note: if S is very small – an "embarrassingly parallel" problem (sometimes, anyway!)
Software models for parallel computing

- Sockets and other P2P models
- Threads
- Shared Memory
- Message Passing
- Data Parallel
Sockets and others

TCP sockets
- establish TCP links among processes
- send messages through sockets

RPC, CORBA, DCOM; web services, SOAP…
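A minimal sketch of the socket model: two endpoints exchanging a message. Here `socket.socketpair()` stands in for a real TCP connect/accept pair, so the example stays self-contained:

```python
import socket

a, b = socket.socketpair()      # stand-in for a TCP connection between processes
a.sendall(b"result: 42")        # one side sends a message through its socket...
msg = b.recv(1024)              # ...the other side receives it
a.close(); b.close()
print(msg.decode())             # result: 42
```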
Threads

- A single executable runs…
- …at specific points in execution it launches new executables – threads…
- …threads can be launched on other PEs
- …threads close – control returns to the main program
- …fork and join
- POSIX, Microsoft
- OpenMP is implemented with threads

[Diagram: a main thread forking worker threads and joining, over time steps t0–t3]
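The fork-and-join pattern can be sketched with Python's threading module (the partitioning here is my own illustration):

```python
# Fork-and-join: the main program launches threads, each computes a piece,
# and control returns to main after join().
import threading

partial = [0, 0, 0, 0]

def work(i):
    partial[i] = sum(range(i * 10, (i + 1) * 10))  # each thread's piece

threads = [threading.Thread(target=work, args=(i,)) for i in range(4)]
for t in threads: t.start()   # fork
for t in threads: t.join()    # join -- control returns to the main program
print(sum(partial))           # 780, same as sum(range(40))
```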
Shared Memory

- Processes share a common memory space
- Data sharing via the common memory space
- A protocol is needed to "play nice" with memory
- OpenMP

[Diagram: several processors all connected to a single shared memory]
Distributed Memory – Message Passing

- Data messages are passed from PE to PE
- Message passing is explicit… under program control
- Parallelization is designed by the programmer…
- …and implemented by the programmer

[Diagram: several nodes, each a processor with its own memory, connected by an interconnect]
Message Passing

Message passing is usually implemented as a library – functions and subroutine calls. Most common: MPI – the Message Passing Interface.

Standards
- MPI-1
- MPI-2

Implementations
- MPICH
- OpenMPI
- MPICH-GM (Myrinet)
- MPICH-G2
Message Passing – Hybrid DM/SMP

How does it look from a message-passing perspective? How is MPI implemented?

[Diagram: nodes of two processors (Proc1, Proc2) sharing node memory, nodes connected by an interconnect]
Data Parallel

- Processes work concurrently on pieces of a single data structure
- SMP – each process works on a portion of the structure in common memory
- DMS – the data structure is partitioned, distributed, computed (and collected)

from: http://www.llnl.gov/computing/tutorials/parallel_comp/#Flynn
Data Parallel

- Can be done with calls to libraries, compiler directives…
- …can be automatic (sort of)
- High Performance Fortran (HPF)
- Fortran 95
Comments on Automatic Parallelization

- Some compilers can automatically parallelize portions of code (HPF)
- Usually loops are the target
- Essentially a serial algorithm with portions pushed out to other processors
- Problems: not a parallel algorithm, not under programmer control (at least partly), might be wrong, might result in slowdown
See…
http://www.llnl.gov/computing/tutorials/parallel_comp/