This module created with support from NSF CDER Early Adopter Program
Module developed Fall 2014 by Apan Qasem

Parallel Performance: Analysis and Evaluation

Lecture TBD
Course TBD
Term TBD
Review
• Performance evaluation of parallel programs
Speedup
• Sequential Speedup: S_seq = Exec_orig / Exec_new
• Parallel Speedup: S_par = Exec_seq / Exec_par = Exec_1 / Exec_N
• Linear Speedup: S_par = N
• Superlinear Speedup: S_par > N
Amdahl’s Law for Parallel Programs
• Speedup is bounded by the amount of parallelism available in the program
• If the fraction of code that runs in parallel is p, then the maximum speedup that can be obtained with N processors is:

ExTime_new = (ExTime_seq × p × 1/N) + (ExTime_seq × (1 − p))
ExTime_par = ExTime_seq × ((1 − p) + p/N)

Speedup = ExTime_seq / ExTime_par
        = 1 / ((1 − p) + p/N)
        = N / (N(1 − p) + p)
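The bound can be checked numerically. A minimal sketch (the function name is ours, not from the slides):

```python
def amdahl_speedup(p, n):
    """Maximum speedup with parallel fraction p on n processors (Amdahl's Law)."""
    return 1.0 / ((1.0 - p) + p / n)

# A fully parallel program scales linearly with n:
assert amdahl_speedup(1.0, 8) == 8.0

# But a 10% sequential fraction caps speedup at 1/(1-p) = 10,
# no matter how many processors we add:
assert abs(amdahl_speedup(0.9, 10**6) - 10.0) < 0.01
```

Even with a million processors, 90% parallel code never exceeds a 10× speedup.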
[Chart: maximum theoretical speedup (Amdahl's Law) in relation to the number of processors]
Scalability
• Program continues to provide speedups as we add more processing cores
• Does Amdahl's Law hold for large values of N for a particular program?
• A parallel program's ability to scale depends on a number of interrelated factors
• The algorithm may have inherent limits to scalability
Strong and Weak Scaling
• Strong Scaling
  • Adding more cores allows us to solve the same problem faster
  • e.g., fold the same protein faster
• Weak Scaling
  • Adding more cores allows us to solve a larger problem
  • e.g., fold a bigger protein
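Weak scaling is commonly quantified with Gustafson's law (scaled speedup). This formula comes from the standard literature, not from these slides; a hedged sketch:

```python
def gustafson_speedup(p, n):
    """Scaled (weak-scaling) speedup: the problem size grows with n,
    so the parallel fraction p does n times as much work."""
    return (1.0 - p) + p * n

# A fully parallel workload scales linearly even under weak scaling:
assert gustafson_speedup(1.0, 16) == 16.0

# With p = 0.9 on 16 cores the scaled speedup is 0.1 + 0.9*16 = 14.5,
# far better than the strong-scaling (Amdahl) bound for the same p:
assert abs(gustafson_speedup(0.9, 16) - 14.5) < 1e-9
```

The contrast with Amdahl's Law is the point: when the problem grows with the machine, the sequential fraction hurts much less.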
The Road to High Performance
[Chart: Achieved Performance (GFLOPS), 1993–2013, "Celebrating 20 years"; gigaflop, teraflop, and petaflop milestones marked]
The Road to High Performance
[Chart: Fraction of Peak (Efficiency), 10%–100%, 1993–2013, "Celebrating 20 years"; annotation: multicores arrive]
Lost Performance
[Chart: Unexploited Performance (GFLOPS), 1993–2013, "Celebrating 20 years"]
Need More Than Performance
[Chart: MFLOPS/Watt, 2003–2012, "Celebrating 20 years"; annotation: GPUs arrive; no power data prior to 2003]
Communication Costs
Algorithms have two costs:
1. Arithmetic (FLOPS)
2. Communication: moving data between
   • levels of a memory hierarchy (sequential case)
   • processors over a network (parallel case)
[Diagram: sequential case — CPU with cache and DRAM; parallel case — multiple CPU+DRAM nodes connected over a network]
Slide source: Jim Demmel, UC Berkeley
Avoiding Communication
• Running time of an algorithm is the sum of 3 terms:
  • # flops × time_per_flop
  • # words moved / bandwidth   ← communication
  • # messages × latency        ← communication

Slide source: Jim Demmel, UC Berkeley
• Goal: organize code to avoid communication
  • Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network
• Not just hiding communication (overlapping it with arithmetic gives at most a 2× speedup); avoiding it makes arbitrary speedups possible
Annual improvements:

  Time_per_flop: 59%

            Bandwidth   Latency
  Network      26%        15%
  DRAM         23%         5%
Power Consumption in HPC Applications
[Pie chart] Memory: 27%; FP: 11%; INT ALU: 12%; Fetch: 13%; Decode: 8%; Reservation stations: 5%; Other: 24%
Data from NCOMMAS weather modeling applications on AMD Barcelona
Techniques For Improving Parallel Performance
• Data locality
• Thread affinity
• Energy
Memory Hierarchy: Single Processor

[Diagram: on-chip components (register file, instruction/data caches, ITLB/DTLB, control, datapath); second-level cache (L2); main memory (DRAM); secondary memory (disk)]

Speed (cycles):  ½   1's   10's   100's   10,000's
Size (bytes):    100's   10K's   M's   G's   T's
Cost per byte:   highest → lowest
Nothing gained without locality
Types of Locality
• Temporal Locality (locality in time)
  • If a memory location is referenced, it is likely to be referenced again soon
  → Keep most recently accessed data items closer to the processor
• Spatial Locality (locality in space)
  • If a memory location is referenced, locations with nearby addresses are likely to be referenced soon
  → Move blocks consisting of contiguous words closer to the processor
demo
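A rough sketch of the two access patterns (illustrative only: in C, where a 2D array's rows are contiguous in memory, the row-major walk has much better spatial locality; this Python version only mirrors the traversal orders):

```python
# Sum a 2D array in row-major vs. column-major order.
# Row-major order visits a[i][0], a[i][1], ... — consecutive addresses
# in a C array (good spatial locality). Column-major order strides by
# a whole row between accesses (poor spatial locality).
N = 512
grid = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(a):
    total = 0
    for i in range(len(a)):          # walk each row...
        for j in range(len(a[0])):   # ...element by element
            total += a[i][j]
    return total

def sum_col_major(a):
    total = 0
    for j in range(len(a[0])):       # walk each column: large stride
        for i in range(len(a)):
            total += a[i][j]
    return total

# Both orders compute the same result; only the locality differs.
assert sum_row_major(grid) == sum_col_major(grid)
```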
Shared Caches on Multicores

[Diagram: shared-cache organizations of Blue Gene/L, Tilera64, and Intel Core 2 Duo]
Data Parallelism
[Diagram: data set D partitioned into p chunks of size D/p]
Data Parallelism
[Diagram: data set D partitioned into p chunks of size D/p; threads spawn, typically running the same task on different parts of the data, then synchronize]
Shared Cache and Data Parallelization

[Diagram: data set D split into k chunks of size D/k; intra-core locality is preserved when D/k ≤ cache capacity]
Tiled Data Access
[Diagram: i-j-k iteration space, showing the block each individual thread works on]
• "unit" sweep: parallelization over i, j, k; no blocking
• "plane" sweep: parallelization over k; no blocking
• "beam" sweep: blocking of i and j; parallelization over ii and jj
Data Locality and Thread Granularity

[Diagram: i-j-k iteration space]
• Smaller working set per thread → reduced thread granularity, improved intra-core locality
• Reuse over time: multiple sweeps over the working set
Exploiting Locality With Tiling
Original:
```
// parallel region
thread_construct()
  ...
  // repeated access
  for j = 1, M
    ... a[i][j] ...
    ... b[i][j] ...
```

Tiled:
```
for j = 1, M, T
  // parallel region
  thread_construct()
    ...
    // repeated access
    for jj = j, j + T - 1
      ... a[i][jj] ...
      ... b[i][jj] ...
```
Exploiting Locality With Tiling
Original:
```
// parallel region
for i = 1, N
  ...
  // repeated access
  for j = 1, M
    ... a[i][j] ...
    ... b[i][j] ...
```

Tiled:
```
for j = 1, M, T
  // parallel region
  for i = 1, N
    ...
    // repeated access
    for jj = j, j + T - 1
      ... a[i][jj] ...
      ... b[i][jj] ...
```
demo
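The tiling transformation can be exercised in a runnable sketch (array contents and tile size are ours, for illustration). Tiling reorders the accesses so each T-wide strip of columns is reused across all rows while it is still cache-hot; it does not change what is computed:

```python
N, M, T = 4, 16, 4   # illustrative sizes; T is the tile size
a = [[i + j for j in range(M)] for i in range(N)]
b = [[i * j for j in range(M)] for i in range(N)]

def sweep_untiled():
    out = []
    for i in range(N):
        for j in range(M):               # full-row sweep per i
            out.append(a[i][j] + b[i][j])
    return out

def sweep_tiled():
    out = []
    for j in range(0, M, T):             # tile loop over j
        for i in range(N):
            for jj in range(j, j + T):   # reuse a T-wide strip across all i
                out.append(a[i][jj] + b[i][jj])
    return out

# Same work, different order: every (i, j) pair is visited exactly once.
assert sorted(sweep_untiled()) == sorted(sweep_tiled())
```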
Locality with Distribution
Original (one loop touching both arrays):
```
// parallel region
thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... a(i,j) b(i,j) ...
```

Distributed (one loop per array):
```
// parallel region
thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... a(i,j) ...

// parallel region
thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... b(i,j) ...
```
reduces thread granularity
improves intra-core locality
Locality with Fusion
Separate loops:
```
// parallel region
thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... a(i,j) ...

// parallel region
thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... b(i,j) ...
```

Fused:
```
// parallel region
thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... a(i,j) b(i,j) ...
```
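A runnable sketch of fusion (arrays and sizes are ours, for illustration): the two sweeps over a and b are merged into one, so the loop overhead is paid once and both arrays' cells at (i, j) are touched together:

```python
N, M = 8, 8
a = [[1] * M for _ in range(N)]
b = [[2] * M for _ in range(N)]

def separate_loops():
    sa = sb = 0
    for j in range(N):        # first sweep: a only
        for i in range(M):
            sa += a[i][j]
    for j in range(N):        # second sweep: b only
        for i in range(M):
            sb += b[i][j]
    return sa, sb

def fused_loop():
    sa = sb = 0
    for j in range(N):        # one sweep touches both arrays
        for i in range(M):
            sa += a[i][j]
            sb += b[i][j]
    return sa, sb

# Fusion changes the access pattern, not the result.
assert separate_loops() == fused_loop()
```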
Combined Tiling and Fusion
Separate loops:
```
// parallel region
thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... a(i,j) ...

// parallel region
thread_construct()
  ...
  for j = 1, N
    for i = 1, M
      ... b(i,j) ...
```

Tiled and fused:
```
for i = 1, M, T
  // parallel region
  thread_construct()
    for ii = i, i + T - 1
      ... = a(ii,j)
      ... = b(ii,j)
```
Pipelined Parallelism
• Pipelined parallelism can be used to parallelize applications that exhibit producer-consumer behavior
• Gained importance because of the low synchronization cost between cores on CMPs
  • Being used to parallelize programs that were previously considered sequential
• Arises in many different contexts
  • Optimization problems
  • Image processing
  • Compression
  • PDE solvers
Pipelined Parallelism
[Diagram: producer (P) and consumer (C) threads operating on a shared data set; the synchronization window is the region of the data set both may touch]

Applies to any streaming application, e.g., Netflix
Ideal Synchronization Window
[Diagram: producer (P) and consumer (C) positioned close together in the shared data set, yielding inter-core data locality]
Synchronization Window Bounds
[Diagram: three synchronization-window sizes, annotated "Bad", "Not as bad", and "Better?"]
Thread Affinity
• Binding a thread to a particular core
• Soft affinity
  • Affinity suggested by the programmer/software; may or may not be honored by the OS
• Hard affinity
  • Affinity set by system software/runtime system; honored by the OS
Thread Affinity and Performance
• Temporal Locality
  • A thread running on the same core throughout its lifetime will be able to exploit the cache
• Resource usage
  • Shared caches
  • TLBs
  • Prefetch units
  • …
Thread Affinity and Resource Usage
Key idea:
• If threads i and j have favorable resource usage, bind them to the same "cohort"
• If threads i and j have unfavorable resource usage, bind them to different "cohorts"
• A cohort is a group of cores that share resources
demo
Load Balancing
[Chart: work per thread; one thread's share dominates the total — "This one dominates!"]
Thread Affinity Tools
• GNU + OpenMP
  • Environment variable GOMP_CPU_AFFINITY
• Pthreads
  • pthread_setaffinity_np()
• Linux API
  • sched_setaffinity()
• Command-line tools
  • taskset
Power Consumption
• Improved power consumption does not always coincide with improved performance
• In fact, for many applications it is the opposite
P = C · V² · f   (C: capacitance, V: supply voltage, f: clock frequency)
• Need to account for power, explicitly
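The formula above makes the DVFS payoff easy to quantify. A small sketch (the numbers are illustrative; the assumption that voltage scales roughly with frequency under DVFS is ours, though it is the standard rule of thumb):

```python
def dynamic_power(c, v, f):
    """Dynamic power P = C * V^2 * f."""
    return c * v * v * f

# If DVFS scales voltage roughly with frequency, halving both cuts
# dynamic power by about 2 * 2^2 = 8x, while peak performance only
# halves -- the reason slowing down a subset of cores can pay off.
full = dynamic_power(1.0, 1.0, 2.0e9)   # nominal V and f
half = dynamic_power(1.0, 0.5, 1.0e9)   # half V, half f
assert full / half == 8.0
```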
Optimizations for Power
• Techniques are similar but objectives are different
  • Fuse code to get a better mix of instructions
  • Distribute code to separate INT- and FP-intensive tasks
• Can use affinity to reduce overall system power consumption
  • Bind hot-cold tasks to the same cohort
  • Distribute hot-hot tasks across multiple cohorts
• Techniques with hardware support
  • DVFS: slow down a subset of cores