27. Parallel Programming I - ETH Zlec.inf.ethz.ch/DA/2017/slides/daLecture21.en.handout.pdf · 27....
27. Parallel Programming I
Moore’s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn’s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling
[Task-Scheduling: Cormen et al., Chap. 27]

The Free Lunch
The free lunch is over [35]

[35] Herb Sutter, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", Dr. Dobb’s Journal, 2005

Moore’s Law
Gordon E. Moore (born 1929)
Observation by Gordon E. Moore:
The number of transistors on integrated circuits doubles approximately every two years.

Moore’s Law
[Figure: Moore’s Law. By Wgsimon, https://commons.wikimedia.org/w/index.php?curid=15193542]

For a long time...
- sequential execution became faster (Instruction Level Parallelism, Pipelining, Higher Frequencies)
- more and smaller transistors = more performance
- programmers simply waited for the next processor generation

Today
- the frequency of processors does not increase significantly anymore (heat dissipation problems)
- the instruction level parallelism does not increase significantly anymore
- the execution speed is dominated by memory access times (but caches still become larger and faster)

Trends
[Figure: trends. Source: http://www.gotw.ca/publications/concurrency-ddj.htm]

Multicore
- Use transistors for more compute cores
- Parallelism in the software
- Programmers have to write parallel programs to benefit from new hardware

Forms of Parallel Execution
- Vectorization
- Pipelining
- Instruction Level Parallelism
- Multicore / Multiprocessing
- Distributed Computing

Vectorization
Parallel execution of the same operation on the elements of a vector (register)

- scalar: x + y
- vector: (x1, x2, x3, x4) + (y1, y2, y3, y4) = (x1 + y1, x2 + y2, x3 + y3, x4 + y4)
- vector fma: ⟨x, y⟩ for x = (x1, x2, x3, x4) and y = (y1, y2, y3, y4)

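A minimal C++ sketch of the two vector operations on this slide. The function names `add` and `inner_product_fma` are hypothetical, chosen only for this illustration. Whether the loops are actually mapped to vector registers depends on the compiler, its optimisation flags and the target CPU.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise sum z[i] = x[i] + y[i]: the same operation is applied to
// every element, which is the pattern a vectorizing compiler looks for.
std::vector<double> add(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<double> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        z[i] = x[i] + y[i];
    return z;
}

// Inner product <x, y>, accumulated with fused multiply-add:
// acc = x[i] * y[i] + acc in one rounding step.
double inner_product_fma(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        acc = std::fma(x[i], y[i], acc);
    return acc;
}
```

The element-wise loop has no dependency between iterations, so four (or more) additions can be executed by one vector instruction. The inner product carries a dependency through `acc`, which is why compilers are often more conservative with it.
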
Homework

More efficient

Pipeline

Throughput
- Throughput = input or output data rate
- Number of operations per time unit
- Larger throughput is better
- Approximation:

$$\text{throughput} = \frac{1}{\max(\text{computation time}(\text{stages}))}$$

  ignores lead-in and lead-out times

Latency
- Time to perform a computation
- Pipeline latency is only constant when the pipeline is balanced: sum of all operations over all stages
- Unbalanced pipeline:
  - First batch as with the balanced pipeline
  - In a balanced version, $\text{latency} = \#\text{stages} \cdot \max(\text{computation time}(\text{stages}))$

Homework Example
Washing $T_0 = 1\,\text{h}$, Drying $T_1 = 2\,\text{h}$, Ironing $T_2 = 1\,\text{h}$, Tidy up $T_3 = 0.5\,\text{h}$

Latency first batch: $L = T_0 + T_1 + T_2 + T_3 = 4.5\,\text{h}$

Latency second batch: $L = T_1 + T_1 + T_2 + T_3 = 5.5\,\text{h}$ (washing plus waiting for the dryer takes $T_1$)

In the long run: 1 batch every 2 h (0.5/h).

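The latencies of the homework example can be reproduced with a small simulation. This is only a sketch under stated assumptions: every stage handles one batch at a time, a batch enters the first stage as soon as that stage is free, and finished batches can wait between stages. The names `batch_latencies` and `stage_time` are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Latency of each batch in a linear pipeline.
// stage_time[s] is the duration of stage s; batches is the number of batches.
std::vector<double> batch_latencies(const std::vector<double>& stage_time,
                                    std::size_t batches) {
    assert(!stage_time.empty());
    // stage_free[s]: point in time at which stage s becomes idle again
    std::vector<double> stage_free(stage_time.size(), 0.0);
    std::vector<double> latency;
    for (std::size_t b = 0; b < batches; ++b) {
        const double start = stage_free[0];  // enter when the first stage is idle
        double t = start;
        for (std::size_t s = 0; s < stage_time.size(); ++s) {
            // wait until the stage is free, then process the batch
            t = std::max(t, stage_free[s]) + stage_time[s];
            stage_free[s] = t;
        }
        latency.push_back(t - start);
    }
    return latency;
}
```

Called as `batch_latencies({1.0, 2.0, 1.0, 0.5}, 3)`, the sketch gives 4.5 h for the first batch and 5.5 h for the second, as on the slide. Under the same assumptions the third batch needs 6.5 h: batches finish every 2 h, but the latency of this unbalanced pipeline keeps growing.
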
Throughput vs. Latency
- Increasing throughput can increase latency
- Stages of the pipeline need to communicate and synchronize: overhead

Pipelines in CPUs
Fetch → Decode → Execute → Data Fetch → Writeback

Multiple stages:
- Every instruction takes 5 time units (cycles)
- In the best case: 1 instruction per cycle, not always possible (“stalls”)

Parallelism (several functional units) leads to faster execution.

ILP – Instruction Level Parallelism
Modern CPUs provide several hardware units and execute independent instructions in parallel.

- Pipelining
- Superscalar CPUs (multiple instructions per cycle)
- Out-Of-Order Execution (programmer observes the sequential execution)
- Speculative Execution

27.2 Hardware Architectures
Shared vs. Distributed Memory
[Figure: Shared Memory: three CPUs attached to one common Mem. Distributed Memory: three CPUs, each with its own Mem, connected by an Interconnect.]

Shared vs. Distributed Memory Programming
Categories of programming interfaces:
- Communication via message passing
- Communication via memory sharing

It is possible:
- to program shared memory systems as distributed systems (e.g. with message passing, MPI)
- to program systems with distributed memory as shared memory systems (e.g. partitioned global address space, PGAS)

Shared Memory Architectures
- Multicore (Chip Multiprocessor - CMP)
- Symmetric Multiprocessor Systems (SMP)
- Simultaneous Multithreading (SMT = Hyperthreading)
  - one physical core, several instruction streams/threads: several virtual cores
  - between ILP (several units for a stream) and multicore (several units for several streams); limited parallel performance
- Non-Uniform Memory Access (NUMA)

Same programming interface

Overview
[Figure: CMP, SMP, NUMA]

An Example
AMD Bulldozer: between CMP and SMT
- 2x integer core
- 1x floating point core

[Figure. Source: Wikipedia]

Flynn’s Taxonomy
- SISD (single instruction, single data): Single-Core
- MISD (multiple instruction, single data): Fault-Tolerance
- SIMD (single instruction, multiple data): Vector Computing / GPU
- MIMD (multiple instruction, multiple data): Multi-Core

Massively Parallel Hardware
[General Purpose] Graphics Processing Units ([GP]GPUs)

- Revolution in High Performance Computing
  - Calculation: 4.5 TFlops vs. 500 GFlops
  - Memory bandwidth: 170 GB/s vs. 40 GB/s
- SIMD
  - High data parallelism
  - Requires its own programming model, e.g. CUDA / OpenCL

27.3 Multi-Threading, Parallelism and Concurrency
Processes and Threads
Process: instance of a program
- each process has a separate context, even a separate address space
- OS manages processes (resource control, scheduling, synchronisation)

Threads: threads of execution of a program
- threads share the address space
- fast context switch between threads

Why Multithreading?
- Avoid “polling” resources (files, network, keyboard)
- Interactivity (e.g. responsiveness of GUI programs)
- Several applications / clients in parallel
- Parallelism (performance!)

Multithreading conceptually
[Figure: execution of Thread 1, Thread 2 and Thread 3 on a Single Core and on a Multi Core]

Thread switch on one core (Preemption)
[Figure: timeline of thread 1 and thread 2 on one core]
1. thread 1 busy, thread 2 idle
2. Interrupt: Store State t1, Load State t2
3. thread 1 idle, thread 2 busy
4. Interrupt: Store State t2, Load State t1
5. thread 1 busy, thread 2 idle

Parallelism vs. Concurrency
- Parallelism: use extra resources to solve a problem faster
- Concurrency: correctly and efficiently manage access to shared resources
- The terms obviously overlap. Parallel computations almost always require synchronisation.

[Figure: Parallelism: Work and Resources. Concurrency: Requests and Resources.]

Thread Safety
Thread safety means that a program always yields the desired results, even when it is executed concurrently.

Many optimisations (hardware, compiler) are designed to preserve the correct execution of a sequential program.

Concurrent programs need an annotation that selectively switches off certain optimisations.

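In C++11, one such annotation is `std::atomic`. The following sketch increments a shared counter from several threads; the helper name `count_concurrently` and its parameters are hypothetical. It is meant as an illustration of the idea, not as a complete treatment of the C++ memory model.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Increments a shared counter from several threads and returns the final value.
// std::atomic marks the variable as shared between threads, so every increment
// is an indivisible read-modify-write operation.
int count_concurrently(int threads, int increments_per_thread) {
    assert(threads >= 0 && increments_per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i)
                ++counter;  // with a plain int this would be a data race
        });
    for (auto& th : pool)
        th.join();          // wait for all threads before reading the result
    return counter.load();
}
```

With a plain `int` instead of `std::atomic<int>`, the compiler and the hardware would be free to treat the loop as sequential code, and increments from different threads could be lost.
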
Example: Caches
- Access to registers is faster than to shared memory.
- Principle of locality.
- Use of caches (transparent to the programmer)

Whether and to what extent cache coherency is guaranteed depends on the system used.

27.4 Scalability: Amdahl and Gustafson
Scalability
In parallel programming:
- Speedup when increasing the number $p$ of processors
- What happens if $p \to \infty$?
- Program scales linearly: linear speedup.

Parallel Performance
Given a fixed amount of computing work $W$ (number of computing steps)
- Sequential execution time $T_1$
- Parallel execution time $T_p$ on $p$ CPUs
  - Perfection: $T_p = T_1/p$
  - Performance loss: $T_p > T_1/p$ (usual case)
  - Sorcery: $T_p < T_1/p$

Parallel Speedup
Parallel speedup $S_p$ on $p$ CPUs:

$$S_p = \frac{W/T_p}{W/T_1} = \frac{T_1}{T_p}.$$

- Perfection: linear speedup $S_p = p$
- Performance loss: sublinear speedup $T_p > T_1/p$ (the usual case)
- Sorcery: superlinear speedup $T_p < T_1/p$

Efficiency: $E_p = S_p/p$

Reachable Speedup?
Parallel program: parallel part 80%, sequential part 20%

$T_1 = 10$, $T_8 = ?$

$$T_8 = \frac{10 \cdot 0.8}{8} + 10 \cdot 0.2 = 1 + 2 = 3$$

$$S_8 = \frac{T_1}{T_8} = \frac{10}{3} \approx 3.33$$

Amdahl’s Law: Ingredients
Computational work $W$ falls into two categories:
- Parallelisable part $W_p$
- Not parallelisable, sequential part $W_s$

Assumption: $W$ can be processed sequentially by one processor in $W$ time units ($T_1 = W$):

$$T_1 = W_s + W_p$$

$$T_p \geq W_s + W_p/p$$

Amdahl’s Law
$$S_p = \frac{T_1}{T_p} \leq \frac{W_s + W_p}{W_s + \frac{W_p}{p}}$$

Amdahl’s Law
With sequential, not parallelizable fraction $\lambda$: $W_s = \lambda W$, $W_p = (1 - \lambda) W$:

$$S_p \leq \frac{1}{\lambda + \frac{1 - \lambda}{p}}$$

Thus

$$S_\infty \leq \frac{1}{\lambda}$$

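The bound can be evaluated directly. The following sketch uses the hypothetical function names `amdahl_speedup_bound` and `amdahl_limit`; `lambda` is the sequential fraction λ and `p` the number of processors.

```cpp
#include <cassert>

// Upper bound on the speedup with p processors when a fraction lambda
// of the work is sequential: 1 / (lambda + (1 - lambda) / p).
double amdahl_speedup_bound(double lambda, double p) {
    assert(lambda >= 0.0 && lambda <= 1.0 && p >= 1.0);
    return 1.0 / (lambda + (1.0 - lambda) / p);
}

// Limit of the bound for p -> infinity: 1 / lambda (requires lambda > 0).
double amdahl_limit(double lambda) {
    assert(lambda > 0.0 && lambda <= 1.0);
    return 1.0 / lambda;
}
```

For λ = 0.2 and p = 8 the bound evaluates to 1/(0.2 + 0.1) ≈ 3.33, the value of the 80%/20% example, and no number of processors can push the speedup of that program beyond 1/0.2 = 5.
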
Illustration Amdahl’s Law
[Figure: execution time t for p = 1, p = 2 and p = 4. $W_s$ takes the same time in all three cases, the time for $W_p$ shrinks with p; the total time for p = 1 is $T_1$.]

Amdahl’s Law is bad news
Every non-parallel part of a program limits the achievable speedup.

Gustafson’s Law
- Fix the time of execution
- Vary the problem size
- Assumption: the sequential part stays constant, the parallel part becomes larger

Illustration Gustafson’s Law
[Figure: fixed execution time T for p = 1, p = 2 and p = 4. $W_s$ stays the same, the amount of parallel work grows to $p \cdot W_p$.]

Gustafson’s Law
Work that can be executed by one processor in time $T$:

$$W_s + W_p = T$$

Work that can be executed by $p$ processors in time $T$:

$$W_s + p \cdot W_p = \lambda \cdot T + p \cdot (1 - \lambda) \cdot T$$

Speedup:

$$S_p = \frac{W_s + p \cdot W_p}{W_s + W_p} = p \cdot (1 - \lambda) + \lambda = p - \lambda (p - 1)$$

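As a small sketch, the scaled speedup can be computed as follows. The function name `gustafson_speedup` is an assumption made for this example; `lambda` is the sequential fraction λ of the fixed execution time.

```cpp
#include <cassert>

// Scaled speedup according to Gustafson's Law: S_p = p - lambda * (p - 1).
double gustafson_speedup(double lambda, double p) {
    assert(lambda >= 0.0 && lambda <= 1.0 && p >= 1.0);
    return p - lambda * (p - 1.0);
}
```

For λ = 0.25 and p = 4 this gives 4 - 0.25 · 3 = 3.25. In contrast to Amdahl's bound, the value grows without limit as p increases, because the problem size grows with the number of processors.
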
Amdahl vs. Gustafson
[Figure: Amdahl and Gustafson side by side, each for p = 4]

27.5 Task- and Data-Parallelism
Parallel Programming Paradigms
- Task Parallel: Programmer explicitly defines parallel tasks.
- Data Parallel: Operations applied simultaneously to an aggregate of individual items.

Example Data Parallel (OMP)
    double sum = 0, A[MAX];
    // each thread accumulates a private partial sum; OpenMP adds them up into sum
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < MAX; ++i)
        sum += A[i];
    return sum;

Example Task Parallel (C++11 Threads/Futures)
    double sum(Iterator from, Iterator to)
    {
        auto len = to - from;  // number of elements in [from, to)
        if (len > threshold) {
            // sum the first half asynchronously, the second half in this thread
            auto future = std::async(sum, from, from + len / 2);
            return sumS(from + len / 2, to) + future.get();
        }
        else
            return sumS(from, to);  // sequential sum for short ranges
    }

Work Partitioning and Scheduling
Partitioning of the work into parallel tasks (programmer or system)
- One task provides a unit of work
- Granularity?

Scheduling (runtime system)
- Assignment of tasks to processors
- Goal: full resource usage with little overhead

Example: Fibonacci P-Fib
    if n ≤ 1 then
        return n
    else
        x ← spawn P-Fib(n − 1)
        y ← spawn P-Fib(n − 2)
        sync
        return x + y

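One possible C++ rendering of P-Fib, given as a sketch: `std::async` plays the role of spawn and `future::get` the role of sync. The names `fib_seq` and `p_fib` and the `cutoff` parameter are assumptions of this example and not part of the pseudocode; the cutoff only keeps the number of threads small. Only the first recursive call is spawned here, the second one runs in the calling thread.

```cpp
#include <cassert>
#include <future>

// Sequential Fibonacci, used below the cutoff.
long fib_seq(int n) {
    return n <= 1 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

// Parallel Fibonacci following the structure of P-Fib.
long p_fib(int n, int cutoff = 20) {
    assert(n >= 0);
    if (n <= 1)
        return n;
    if (n <= cutoff)
        return fib_seq(n);  // small problems: no new tasks
    // spawn: compute P-Fib(n - 1) in another thread
    std::future<long> x = std::async(std::launch::async, p_fib, n - 1, cutoff);
    // compute P-Fib(n - 2) in the current thread
    long y = p_fib(n - 2, cutoff);
    // sync: wait for the spawned task before combining the results
    return x.get() + y;
}
```

Note that `std::launch::async` starts one thread per spawned call, which is far more expensive than a spawn in a task-based runtime system; this is the reason for the cutoff.
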
P-Fib Task Graph
Question
- Each node (task) takes 1 time unit.
- Arrows depict dependencies.
- Minimal execution time when the number of processors = ∞?

[Figure: task graph; label: critical path]

Performance Model
- $p$ processors
- Dynamic scheduling
- $T_p$: execution time on $p$ processors
- $T_1$: work: time for executing the total work on one processor
- $T_1/T_p$: speedup
- $T_\infty$: span: critical path, execution time on $\infty$ processors. Longest path from root to sink.
- $T_1/T_\infty$: parallelism: wider is better
- Lower bounds:
  - $T_p \geq T_1/p$ (work law)
  - $T_p \geq T_\infty$ (span law)

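Work and span of a task graph can be computed in one pass. The sketch below assumes unit-time tasks that are numbered in topological order (every edge leads from a smaller to a larger index); the type `WorkSpan` and the function `work_and_span` are names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct WorkSpan {
    int work;  // T_1: number of unit-time tasks
    int span;  // T_inf: number of tasks on a longest path
};

// succ[u] lists the tasks that depend on task u.
WorkSpan work_and_span(const std::vector<std::vector<int>>& succ) {
    const std::size_t n = succ.size();
    std::vector<int> longest(n, 1);  // longest path ending in each task
    int span = 0;
    for (std::size_t u = 0; u < n; ++u) {
        // all predecessors of u have smaller indices, so longest[u] is final
        span = std::max(span, longest[u]);
        for (int v : succ[u]) {
            assert(static_cast<std::size_t>(v) > u && static_cast<std::size_t>(v) < n);
            longest[v] = std::max(longest[v], longest[u] + 1);
        }
    }
    return {static_cast<int>(n), span};
}
```

For a diamond-shaped graph with edges 0→1, 0→2, 1→3, 2→3 the work is 4 and the span is 3, so the parallelism $T_1/T_\infty$ is 4/3.
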
Greedy Scheduler
Greedy scheduler: at each time step it schedules as many of the available tasks as possible.

Theorem
On an ideal parallel computer with $p$ processors, a greedy scheduler executes a multi-threaded computation with work $T_1$ and span $T_\infty$ in time

$$T_p \leq T_1/p + T_\infty$$

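A greedy scheduler for unit-time tasks can be simulated in a few lines. This is a sketch under the assumptions that all tasks take one time step and that the choice among ready tasks is arbitrary; the function name `greedy_time` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of time steps T_p a greedy scheduler needs on p processors.
// succ[u] lists the tasks that depend on task u.
int greedy_time(const std::vector<std::vector<int>>& succ, int p) {
    assert(p >= 1);
    const std::size_t n = succ.size();
    std::vector<int> indeg(n, 0);  // number of unfinished predecessors
    for (const auto& out : succ)
        for (int v : out)
            ++indeg[v];
    std::vector<int> ready;        // tasks whose predecessors are all done
    for (std::size_t u = 0; u < n; ++u)
        if (indeg[u] == 0)
            ready.push_back(static_cast<int>(u));
    int steps = 0;
    std::size_t done = 0;
    while (!ready.empty()) {
        std::vector<int> running;  // at most p ready tasks run in this step
        while (running.size() < static_cast<std::size_t>(p) && !ready.empty()) {
            running.push_back(ready.back());
            ready.pop_back();
        }
        for (int u : running)
            for (int v : succ[u])
                if (--indeg[v] == 0)
                    ready.push_back(v);  // v can run from the next step on
        done += running.size();
        ++steps;
    }
    assert(done == n);  // fails only if the graph contains a cycle
    return steps;
}
```

For the diamond-shaped graph 0→1, 0→2, 1→3, 2→3 (work 4, span 3) and p = 2 the simulation needs 3 steps, which respects the bound 4/2 + 3 = 5.
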
Example
Assume $p = 2$.

[Figure: two schedules of the same task graph, one with $T_p = 5$ and one with $T_p = 4$]

Proof of the Theorem
Assume that all tasks provide the same amount of work.

- Complete step: $p$ tasks are available.
- Incomplete step: fewer than $p$ tasks are available.

Assume that the number of complete steps is larger than $\lfloor T_1/p \rfloor$. Then the executed work is at least $(\lfloor T_1/p \rfloor + 1) \cdot p = T_1 - (T_1 \bmod p) + p > T_1$. Contradiction. Therefore there are at most $\lfloor T_1/p \rfloor$ complete steps.

Each incomplete step executes all available tasks $t$ with $\deg^-(t) = 0$ and thus decreases the length of the span. Otherwise the chosen span would not have been maximal. The number of incomplete steps is thus at most $T_\infty$.

Consequence
If $p \ll T_1/T_\infty$, i.e. $T_\infty \ll T_1/p$, then $T_p \approx T_1/p$.

Example Fibonacci
$T_1(n)/T_\infty(n) = \Theta(\phi^n/n)$. Already for moderate sizes of $n$ we can use a lot of processors and still obtain linear speedup.

Granularity: how many tasks?
- #Tasks = #Cores?
- Problem if a core cannot be fully used
- Example: 9 units of work, 3 cores. Scheduling of 3 sequential tasks.

Exclusive utilization: [Figure: s1 on P1, s2 on P2, s3 on P3]
Execution time: 3 units

Foreign thread disturbing: [Figure: s1 on P1 is interrupted and its remainder runs on P2 after s2; s3 on P3]
Execution time: 5 units

Granularity: how many tasks?
- #Tasks = Maximum?
- Example: 9 units of work, 3 cores. Scheduling of 9 sequential tasks.

Exclusive utilization: [Figure: tasks s1 to s9 distributed over P1, P2, P3]
Execution time: 3 + ε units

Foreign thread disturbing: [Figure: tasks s1 to s9 redistributed over P1, P2, P3]
Execution time: 4 units. Full utilization.

Granularity: how many tasks?
- #Tasks = Maximum?
- Example: $10^6$ tiny units of work.

[Figure: tiny tasks on P1, P2, P3]

Execution time: dominated by the overhead.

Granularity: how many tasks?
Answer: as many tasks as possible with a sequential cutoff such that the overhead can be neglected.

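A sequential cutoff can be sketched with a recursive sum over a vector. The names `Iterator`, `sum_with_cutoff` and the parameter `cutoff` are assumptions of this example; a suitable cutoff value depends on the machine and has to be found by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

using Iterator = std::vector<double>::const_iterator;

// Sums [from, to). A new task is created only while the range is longer than
// cutoff; shorter ranges are summed sequentially to keep task overhead small.
double sum_with_cutoff(Iterator from, Iterator to, std::ptrdiff_t cutoff) {
    assert(cutoff >= 1);
    const auto len = to - from;
    if (len <= cutoff)
        return std::accumulate(from, to, 0.0);  // sequential part
    const Iterator mid = from + len / 2;
    // left half in a new task, right half in the current thread
    auto left = std::async(std::launch::async, sum_with_cutoff, from, mid, cutoff);
    const double right = sum_with_cutoff(mid, to, cutoff);
    return left.get() + right;
}
```

With 100000 elements and a cutoff of 10000 the recursion creates 16 sequential pieces, several per core on a typical machine, while each piece is still large compared to the cost of starting a task.
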
Example: Parallelism of Mergesort
- Work (sequential runtime) of Mergesort: $T_1(n) = \Theta(n \log n)$
- Span: $T_\infty(n) = \Theta(n)$
- Parallelism: $T_1(n)/T_\infty(n) = \Theta(\log n)$ (maximally achievable speedup with $p = \infty$ processors)

[Figure: recursion with split and merge phases]
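The structure behind these bounds can be sketched as follows: the two halves are sorted in parallel, but the merge is sequential, so the merge at the top level alone already contributes Θ(n) to the span. The function name `parallel_mergesort` and the `cutoff` parameter are assumptions of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Sorts v[lo, hi). Ranges of at most cutoff elements are sorted sequentially.
void parallel_mergesort(std::vector<int>& v, std::size_t lo, std::size_t hi,
                        std::size_t cutoff = 1 << 12) {
    assert(lo <= hi && hi <= v.size());
    if (hi - lo <= cutoff || hi - lo < 2) {
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    // split: left half in a new task, right half in the current thread;
    // the two tasks work on disjoint parts of v
    auto left = std::async(std::launch::async, parallel_mergesort,
                           std::ref(v), lo, mid, cutoff);
    parallel_mergesort(v, mid, hi, cutoff);
    left.get();
    // merge: sequential, this step limits the parallelism
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}
```

A call such as `parallel_mergesort(v, 0, v.size())` sorts the whole vector. Obtaining more parallelism than Θ(log n) would require a parallel merge, which this sketch does not attempt.
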