Auto-tuning Performance on Multicore Computers
Samuel Williams
David Patterson (advisor and chair)
Kathy Yelick
Sara McMains
Ph.D. Dissertation Talk
Multicore Processors
Superscalar Era
Single thread performance scaled at 50% per year
Bandwidth increases much more slowly, but we could add additional bits or channels.
Lack of diversity in architecture = lack of individual tuning
Power wall has capped single thread performance
[Figure: log(performance) vs. year, 1990-2005, contrasting processor single-thread performance (~50%/year) with memory bandwidth and memory latency trends (~25%, ~12%, ~7%, and ~0% per year)]
Multicore is the agreed solution
[The same superscalar-era bullets and figure as the previous slide, with a different takeaway:]
But, no agreement on what it should look like
Multicore Era (Intel / AMD model)
Take existing power-limited superscalar processors, and add cores.
Serial code shows no improvement.
Parallel performance might improve at 25% per year = the improvement in power efficiency from process technology.
DRAM bandwidth is currently improving by only ~12% per year.
[Figure: log(performance) vs. year, 2003-2009: +0%/year (serial), +25%/year (parallel), ~12%/year (bandwidth)]
Multicore Era (Niagara, Cell, GPUs)
Radically different approach: give up superscalar out-of-order cores in favor of small, efficient, in-order cores.
Huge, abrupt, shocking drop in single-thread performance.
May eliminate the power wall, allowing many cores, and performance to scale at up to 40% per year.
Troublingly, the number of DRAM channels may need to double every three years.
[Figure: log(performance) vs. year, 2003-2009: +0%/year (serial), +40%/year (parallel), +12%/year (bandwidth)]
Multicore Era
Another option would be to reduce per-core performance (and power) every year.
Graceful degradation in serial performance.
Bandwidth is still an issue.
[Figure: log(performance) vs. year, 2003-2009: -12%/year (serial), +40%/year (parallel), +12%/year (bandwidth)]
Multicore Era
There are still many other architectural knobs:
Superscalar? Dual issue? VLIW?
Pipeline depth?
SIMD width?
Multithreading (vertical, simultaneous)?
Shared caches vs. private caches?
FMA vs. MUL+ADD vs. MUL or ADD?
Clustering of cores?
Crossbars vs. rings vs. meshes?
Currently, no consensus on the optimal configuration. As a result, there is a plethora of multicore architectures.
Computational Motifs
Computational Motifs
Evolved from Phil Colella's Seven Dwarfs of Parallel Computing: numerical methods common throughout scientific computing:
Dense and Sparse Linear Algebra
Computations on Structured and Unstructured Grids
Spectral Methods
N-Body / Particle Methods
Monte Carlo
Within each dwarf, there are a number of computational kernels.
The Berkeley View, and subsequently the Par Lab, expanded these to many other domains in computing (embedded, SPEC, DB, games): Graph Algorithms, Combinational Logic, Finite State Machines, etc., and rechristened them Computational Motifs.
Each could be black-boxed into libraries or frameworks by domain experts. But how do we get good performance given the diversity of architecture?
Auto-tuning
Auto-tuning (motivation)
Given the huge diversity of processor architectures, code hand-optimized for one architecture will likely deliver poor performance on another. Moreover, code optimized for one input data set may deliver poor performance on another.
We want a single code base that delivers performance portability across the breadth of architectures today and into the future.
Auto-tuners are composed of two principal components:
a code generator based on high-level functionality rather than parsing C, and
the auto-tuner proper, which searches for the optimal parameters for each optimization.
Auto-tuners don't invent or discover optimizations; they search through the parameter space for a variety of known optimizations.
Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).
Auto-tuning (code generation)
The code generator produces many code variants for some numerical kernel using known optimizations and transformations.
For example, cache blocking adds several parameterized loop nests; prefetching adds parameterized intrinsics to the code; loop unrolling and reordering explicitly unrolls the code, and each unrolling is a unique code variant.
Kernels can have dozens of different optimizations, some of which can produce hundreds of code variants.
The code generators used in this work are kernel-specific, and were written in Perl.
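As an illustration (hypothetical, not taken from the dissertation's Perl generators), here is the kind of single code variant a generator might emit for a simple scale-and-add loop; the unrolling depth is the parameter baked into this particular variant, and sibling variants would differ only in that parameter:

/* Hypothetical example of ONE generated code variant: a scale-and-add
 * (y[i] += alpha * x[i]) loop unrolled by a factor chosen by the code
 * generator.  A real generator would emit many such variants
 * (different unrolling depths, blockings, prefetch distances, ...).    */
#include <stddef.h>

#define UNROLL 4   /* parameter baked in by the generator for this variant */

void axpy_unroll4(size_t n, double alpha,
                  const double *restrict x, double *restrict y)
{
    size_t i;
    size_t limit = n - (n % UNROLL);
    for (i = 0; i < limit; i += UNROLL) {     /* fully unrolled body */
        y[i + 0] += alpha * x[i + 0];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
    for (; i < n; i++)                        /* cleanup loop */
        y[i] += alpha * x[i];
}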
Auto-tuning (search)
In this work, we use two search techniques:
Exhaustive: examine every combination of parameters for every optimization (often intractable).
Heuristics: use knowledge of the architecture or algorithm to restrict the search space.
[Figure: a 2-D search space, parameter space for Optimization A × parameter space for Optimization B]
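To make the exhaustive strategy concrete, a minimal sketch is shown below; run_kernel_variant is a hypothetical hook standing in for building and timing one generated code variant:

/* Minimal sketch of an exhaustive auto-tuning search over two parameters.
 * run_kernel_variant() is a hypothetical hook that runs one code variant
 * and returns its runtime in seconds; the best settings are recorded.    */
#include <stdio.h>
#include <float.h>

extern double run_kernel_variant(int cache_block, int unroll);  /* assumed hook */

void exhaustive_search(void)
{
    static const int blocks[]  = { 16, 32, 64, 128, 256 };
    static const int unrolls[] = { 1, 2, 4, 8 };
    double best_time = DBL_MAX;
    int best_block = 0, best_unroll = 0;

    for (size_t b = 0; b < sizeof blocks / sizeof blocks[0]; b++)
        for (size_t u = 0; u < sizeof unrolls / sizeof unrolls[0]; u++) {
            double t = run_kernel_variant(blocks[b], unrolls[u]);
            if (t < best_time) {              /* keep the fastest combination */
                best_time   = t;
                best_block  = blocks[b];
                best_unroll = unrolls[u];
            }
        }
    printf("best: cache block %d, unroll %d (%.3f s)\n",
           best_block, best_unroll, best_time);
}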
Overview
Multicore SMPs
The Roofline Model
Auto-tuning LBMHD
Auto-tuning SpMV
Summary
Future Work
Thesis Contributions
Introduced the Roofline Model
Extended auto-tuning to the structured grid motif (specifically LBMHD)
Extended auto-tuning to multicore
Fundamentally different from running auto-tuned serial code on multicore SMPs.
Apply the concept to LBMHD and SpMV.
Analyzed the breadth of multicore architectures in the context of auto-tuned SpMV and LBMHD
Discussed future directions in auto-tuning
Chapter 3: Multicore SMPs
Multicore SMP Systems
Intel Xeon E5345 (Clovertown), AMD Opteron 2356 (Barcelona), Sun UltraSPARC T2+ T5140 (Victoria Falls), IBM QS20 (Cell Blade)
[Block diagrams of the four dual-socket SMPs: Clovertown has 4 × (2 cores + shared 4MB L2) behind two 10.66 GB/s FSBs into an MCH with 4×64b channels of 667MHz FBDIMMs (21.33 GB/s read, 10.66 GB/s write); Barcelona has 2 × 4 Opteron cores with 512KB L2 each plus a 2MB shared victim cache, an SRI/crossbar, 2×64b controllers of 667MHz DDR2 (10.6 GB/s per socket), and HyperTransport links (4 GB/s each direction); Victoria Falls has 2 × 8 multithreaded SPARC cores per socket behind a crossbar (179 GB/s / 90 GB/s) and a 4MB shared L2 (16 way, 64b interleaved), 4 coherency hubs, 2×128b controllers of 667MHz FBDIMMs (21.33 GB/s read, 10.66 GB/s write per socket), and 8 × 6.4 GB/s links (one per hub per direction); the Cell Blade has 2 × (VMT PPE with 512KB L2 + 8 SPEs with 256KB local stores and MFCs) on the EIB ring network, XDR memory controllers to 512MB of XDR DRAM per socket (25.6 GB/s), and a BIF link (<20 GB/s each direction)]
Multicore SMP Systems (conventional cache-based memory hierarchy)
[Same four block diagrams, highlighting the conventional hardware-managed cache hierarchies of the Xeon, Opteron, and UltraSPARC systems]
Multicore SMP Systems (local store-based memory hierarchy)
[Same four block diagrams, highlighting the Cell SPEs' software-managed 256KB local stores and DMA engines (MFCs)]
Multicore SMP Systems (CMT = Chip MultiThreading)
[Same four block diagrams, highlighting the multithreaded cores: Victoria Falls' MT SPARC cores and the Cell's VMT PPEs]
Multicore SMP Systems (peak double-precision FLOP rates)
[Same four block diagrams, annotated with peak double-precision rates: Xeon E5345 74.66 GFLOP/s, Opteron 2356 73.60 GFLOP/s, UltraSPARC T2+ 18.66 GFLOP/s, QS20 Cell Blade 29.25 GFLOP/s]
Multicore SMP Systems (DRAM pin bandwidth)
[Same four block diagrams, annotated with DRAM pin bandwidths: Clovertown 21 GB/s read + 10 GB/s write, Barcelona 21 GB/s, Victoria Falls 42 GB/s read + 21 GB/s write, Cell Blade 51 GB/s]
Multicore SMP Systems (Non-Uniform Memory Access)
[Same four block diagrams; on the Opteron, Victoria Falls, and Cell systems each socket has its own memory controllers and locally attached DRAM, so memory access is non-uniform]
Chapter 4: The Roofline Model
Memory Traffic
Total bytes to/from DRAM
Can categorize into: compulsory misses, capacity misses, conflict misses, write allocations, …
Oblivious to a lack of sub-cache-line spatial locality
Arithmetic Intensity
For the purposes of this talk, we'll deal with floating-point kernels.
Arithmetic Intensity (AI) ≈ Total FLOPs / Total DRAM bytes (includes cache effects).
Many interesting problems have constant AI (w.r.t. problem size), which is bad given slowly increasing DRAM bandwidth.
Bandwidth and traffic are key optimizations.
[Figure: arithmetic intensity spectrum. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods]
Basic Idea
Synthesize communication, computation, and locality into a single visually-intuitive performance figure using bound and bottleneck analysis.
Given a kernel’s arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure, and bound performance.
Moreover, it provides insight into which optimizations will potentially be beneficial.
Attainable Performance(i,j) = min( FLOP/s with optimizations 1..i , AI × Bandwidth with optimizations 1..j )
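A minimal sketch of how this bound can be computed in code (the numbers in the example are illustrative, not measurements from the thesis):

#include <stdio.h>

/* Attainable GFLOP/s = min(in-core ceiling, AI * attained bandwidth).
 * peak_gflops and stream_bw_gbs would come from microbenchmarks (peak
 * FLOP and Stream-like measurements); ai is the kernel's FLOP:byte ratio. */
static double roofline_bound(double ai, double peak_gflops, double stream_bw_gbs)
{
    double bw_bound = ai * stream_bw_gbs;      /* memory-bound region */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void)
{
    /* Illustrative numbers only: AI = 0.7 FLOPs/byte on a machine with
     * ~73.6 GFLOP/s peak DP and ~16 GB/s attained Stream bandwidth.     */
    printf("bound = %.1f GFLOP/s\n", roofline_bound(0.7, 73.6, 16.0));
    return 0;
}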
Constructing a Roofline Model (computational ceilings)
Plot on a log-log scale. Given AI, we can easily bound performance, but architectures are much more complicated.
We will bound performance as we eliminate specific forms of in-core parallelism.
[Roofline chart: Opteron 2356 (Barcelona), attainable GFLOP/s vs. actual FLOP:byte ratio, bounded by peak DP and Stream bandwidth]
Constructing a Roofline Model (computational ceilings)
Opterons have dedicated multipliers and adders.
If the code is dominated by adds, then attainable performance is half of peak.
We call these ceilings; they act like constraints on performance.
[Roofline chart: Barcelona with a "mul/add imbalance" ceiling added below peak DP]
Constructing a Roofline Model (computational ceilings)
Opterons have 128-bit datapaths.
If instructions aren't SIMDized, attainable performance will be halved.
[Roofline chart: Barcelona with a "w/out SIMD" ceiling added]
Constructing a Roofline Model (computational ceilings)
On Opterons, floating-point instructions have a 4 cycle latency.
If we don’t express 4-way ILP, performance will drop by as much as 4x
[Roofline chart: Barcelona with a "w/out ILP" ceiling added; ceilings now: peak DP, mul/add imbalance, w/out SIMD, w/out ILP]
Constructing a Roofline Model (communication ceilings)
We can perform a similar exercise taking away parallelism from the memory subsystem
[Roofline chart: Barcelona with the Stream bandwidth diagonal and peak DP]
Constructing a Roofline Model (communication ceilings)
Explicit software prefetch instructions are required to achieve peak bandwidth
[Roofline chart: Barcelona with a "w/out SW prefetch" bandwidth ceiling added below Stream bandwidth]
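As an example of the kind of code an auto-tuner emits for this, here is a sketch using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance is exactly the sort of parameter the search explores (an illustration, not the thesis's prefetching code):

/* Streaming reduction with an explicit software prefetch issued a tunable
 * distance ahead of the current index (GCC/Clang __builtin_prefetch).    */
#define PF_DIST 64   /* elements ahead; a typical auto-tuning parameter */

double sum_with_prefetch(const double *x, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&x[i + PF_DIST], 0, 0);  /* read, low temporal locality */
        s += x[i];
    }
    return s;
}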
Constructing a Roofline Model (communication ceilings)
Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.
We could continue this by examining strided or random memory access patterns.
[Roofline chart: Barcelona with a "w/out NUMA" bandwidth ceiling added below "w/out SW prefetch"]
Constructing a Roofline Model (computation + communication)
We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.
[Roofline chart: Barcelona with all computational ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP) and bandwidth ceilings (Stream bandwidth, w/out SW prefetch, w/out NUMA)]
Constructing a Roofline Model (locality walls)
Remember, memory traffic includes more than just compulsory misses.
As such, actual arithmetic intensity may be substantially lower.
Walls are unique to the architecture-kernel combination
[Roofline chart: Barcelona with a locality wall at the arithmetic intensity implied by compulsory-miss traffic only]
AI = FLOPs / (compulsory misses)
Constructing a Roofline Model (locality walls, continued)
As successive classes of traffic are included, the locality wall moves to lower arithmetic intensity:
AI = FLOPs / (write-allocation + compulsory misses)
AI = FLOPs / (capacity-miss + write-allocation + compulsory misses)
AI = FLOPs / (conflict-miss + capacity-miss + write-allocation + compulsory misses)
[Roofline charts: Barcelona with walls added successively for write-allocation, capacity-miss, and conflict-miss traffic]
Roofline Models for SMPs
[Roofline charts for the five compute engines: Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), UltraSPARC T2+ T5140 (Victoria Falls), QS20 Cell Blade (PPEs), and QS20 Cell Blade (SPEs). Computational ceilings include peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, and (for Victoria Falls) 25%/12%/6% FP instruction-mix ceilings; bandwidth ceilings include Stream bandwidth, w/out SW prefetch, w/out NUMA, misaligned DMA, and (for Clovertown) bandwidth on small vs. large datasets]
Note that the multithreaded Niagara is limited by the instruction mix rather than by a lack of expressed in-core parallelism.
Clearly, some architectures are more dependent on bandwidth optimizations, while others depend on in-core optimizations.
Chapter 6: Auto-tuning Lattice-Boltzmann Magnetohydrodynamics (LBMHD)
Introduction to Lattice Methods
Structured grid code, with a series of time steps. Popular in CFD.
Allows for complex boundary conditions.
No temporal locality between points in space within one time step.
Higher-dimensional phase space.
Simplified kinetic model that maintains the macroscopic quantities.
Distribution functions (e.g. 5-27 velocities per point in space) are used to reconstruct the macroscopic quantities.
Significant memory capacity requirements.
[Figure: lattice of numbered velocity vectors (0-26) in 3-D (+X, +Y, +Z)]
LBMHD (general characteristics)
Plasma turbulence simulation.
Two distributions: a momentum distribution (27 scalar components) and a magnetic distribution (15 vector components).
Three macroscopic quantities: density, momentum (vector), magnetic field (vector).
Must read 73 doubles, and update 79 doubles, per point in space.
Requires about 1300 floating-point operations per point in space.
Just over 1.0 FLOPs/byte (ideal).
[Figure: the momentum distribution (27 components), the magnetic distribution (15 vector components), and the macroscopic variables on the 3-D lattice]
LBMHD (implementation details)
Data structure choices:
Array of Structures: no spatial locality, strided access.
Structure of Arrays: a huge number of memory streams per thread, but guarantees spatial locality and unit stride, and vectorizes well (see the sketch below).
Parallelization: the Fortran version used MPI to communicate between nodes = a bad match for multicore. The version in this work uses pthreads for multicore (this thesis is not about innovation in the threading model or programming language). MPI is not used when auto-tuning.
Two problem sizes: 64³ (~330MB) and 128³ (~2.5GB).
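A minimal sketch of the two layouts, using a hypothetical 9-component distribution rather than LBMHD's full 73/79-component state:

#define NPTS  (64 * 64)   /* points in a (flattened) grid  */
#define NVEL  9           /* distribution components per point */

/* Array of Structures: the components of one point are contiguous, so a
 * sweep over a single component is strided by NVEL doubles.             */
struct point_aos { double f[NVEL]; };
static struct point_aos grid_aos[NPTS];

/* Structure of Arrays: each component is its own unit-stride array,
 * giving many concurrent memory streams but perfect spatial locality.   */
struct grid_soa { double f[NVEL][NPTS]; };
static struct grid_soa grid;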
SOA Memory Access Pattern
Consider a simple D2Q9 lattice method using SoA: there are 9 read arrays and 9 write arrays, but all accesses are unit-stride.
LBMHD has 73 read and 79 write streams per thread.
[Figure: D2Q9 lattice velocities and the corresponding read_array[ ][ ] / write_array[ ][ ] accesses along the x dimension]
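A hedged sketch of the D2Q9-style SoA access pattern described above (simplified to a pure shifted copy; the real collision/stream operator performs far more arithmetic):

#define NX  128
#define NV  9

/* Simplified SoA streaming step along x: each velocity component is read
 * from its own array and written to its own array at a shifted index.
 * All 2*NV streams are traversed with unit stride in i.                 */
void stream_x(const int shift[NV],
              const double read_array[NV][NX],
              double write_array[NV][NX])
{
    for (int v = 0; v < NV; v++)                 /* 9 read + 9 write streams */
        for (int i = 0; i < NX; i++) {
            int j = i + shift[v];                /* e.g. -1, 0, or +1 */
            if (j >= 0 && j < NX)
                write_array[v][j] = read_array[v][i];
        }
}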
Roofline Models for SMPs
LBMHD has an AI of 0.7 on write-allocate architectures, and 1.0 on those with cache bypass or no write-allocate.
MUL/ADD imbalance.
Some architectures will be bandwidth-bound, while others will be compute-bound.
[Roofline charts for the five compute engines, as above, with LBMHD's arithmetic intensity marked]
LBMHD Performance (reference implementation)
The standard cache-based implementation can be easily parallelized with pthreads.
NUMA is implicitly exploited.
Although scalability looks good, is performance?
[Bar charts: performance vs. concurrency and problem size]
LBMHD Performance (reference implementation)
Superscalar performance is surprisingly good given the complexity of the memory access pattern.
Cell PPE performance is abysmal.
[Roofline charts for the five compute engines with the reference LBMHD performance plotted]
LBMHD Performance (array padding)
LBMHD touches >150 arrays.
Most caches have limited associativity, so conflict misses are likely.
Apply a heuristic to pad arrays (sketched below).
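A minimal sketch of the padding idea, assuming row-major arrays whose leading dimension is padded by a tunable number of doubles so that different arrays and rows stop mapping to the same cache sets:

#define NX   128
#define NY   128
#define PAD  8            /* doubles of padding; chosen by the auto-tuner */

/* Without padding, rows of 128 doubles (1 KB) tend to alias in a
 * low-associativity cache; padding the leading dimension to NX + PAD
 * staggers the addresses of corresponding elements in different rows.   */
static double a[NY][NX + PAD];
static double b[NY][NX + PAD];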
LBMHD Performance (vectorization)
LBMHD touches >150 arrays; most TLBs have far fewer than 128 entries.
The vectorization technique creates a vector of points that are being updated; loops are interchanged and strip-mined (sketched below).
Exhaustively search for the optimal "vector length" that balances page locality against L1 cache misses.
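A simplified sketch of the strip-mined, loop-interchanged structure (the arithmetic is a stand-in; VL is the tuned vector length):

#define NPOINTS 4096
#define NCOMP   9
#define VL      64     /* "vector length": tuned to balance TLB/page
                          locality against L1 cache misses              */

void update_strip_mined(const double f_in[NCOMP][NPOINTS],
                        double f_out[NCOMP][NPOINTS])
{
    for (int base = 0; base < NPOINTS; base += VL) {     /* strip-mine points */
        int len = (base + VL < NPOINTS) ? VL : NPOINTS - base;
        for (int v = 0; v < NCOMP; v++)                  /* interchanged loop */
            for (int i = 0; i < len; i++)
                f_out[v][base + i] = 0.9 * f_in[v][base + i];  /* stand-in math */
    }
}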
LBMHD Performance (other optimizations)
Heuristic-based software prefetching.
Exhaustive search for low-level optimizations: loop unrolling/reordering, SIMDization.
Cache bypass increases arithmetic intensity by 50% (sketched below).
Small TLB pages on Victoria Falls.
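On x86, cache bypass is typically implemented with SSE2 non-temporal ("streaming") stores; a hedged sketch, assuming 16-byte-aligned arrays whose length is a multiple of two:

#include <emmintrin.h>   /* SSE2: _mm_load_pd, _mm_mul_pd, _mm_stream_pd */

/* y[i] = alpha * x[i], written with non-temporal stores so the destination
 * cache lines are not read (write-allocated) into the cache first.
 * Assumes x and y are 16-byte aligned and n is a multiple of 2.           */
void scale_streaming(long n, double alpha, const double *x, double *y)
{
    __m128d va = _mm_set1_pd(alpha);
    for (long i = 0; i < n; i += 2) {
        __m128d vx = _mm_load_pd(&x[i]);
        _mm_stream_pd(&y[i], _mm_mul_pd(va, vx));
    }
    _mm_sfence();        /* make the streaming stores globally visible */
}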
LBMHD Performance (Cell SPE implementation)
We can write a local store implementation and run it on the Cell SPEs.
Ultimately, Cell’s weak DP hampers performance
LBMHD Performance (speedup for largest problem)
[Same performance charts as above]
Speedups over the reference implementation for the largest problem: 1.6×, 4×, 3×, and 130× (Clovertown, Barcelona, Victoria Falls, and Cell Blade, respectively).
LBMHD Performance (vs. Roofline)
Most architectures reach their roofline-bound performance.
Clovertown's snoop filter is ineffective.
Niagara suffers from instruction-mix issues.
Cell PPEs are latency-limited; Cell SPEs are compute-bound.
[Roofline charts for the five compute engines with auto-tuned LBMHD performance plotted]
LBMHD Performance (summary)
The reference code is clearly insufficient.
Portable C code is insufficient on Barcelona and Cell.
Cell gets all of its performance from the SPEs, despite only 2× the area and 2× the peak DP FLOPs.
Chapter 8: Auto-tuning Sparse Matrix-Vector Multiplication
Sparse Matrix-Vector Multiplication
Sparse matrix: most entries are 0.0. There is a performance advantage in only storing/operating on the nonzeros, but this requires significant metadata.
Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors.
Challenges: difficult to exploit ILP (bad for superscalar), difficult to exploit DLP (bad for SIMD), irregular memory access to the source vector, difficult to load balance.
[Figure: y = A·x]
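For concreteness, a textbook sketch (not the thesis code) of the compressed sparse row (CSR) kernel that underlies the reference formulation of y = Ax:

/* y = A*x for a matrix in CSR format:
 *   row_ptr[r] .. row_ptr[r+1]-1 index the nonzeros of row r,
 *   col_idx[k] and val[k] give each nonzero's column and value.         */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* irregular access to x */
        y[r] = sum;
    }
}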
Dataset (Matrices)
Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache. Subdivided them into 4 categories. Rank ranges from 2K to 1M.
[Figure: the 14 matrices. Dense: a 2K × 2K dense matrix stored in sparse format. Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship. Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, webbase. Extreme aspect ratio (linear programming): LP]
Roofline Models for SMPs
Reference SpMV implementation has an AI of 0.166, but can readily exploit FMA.
The best we can hope for is an AI of 0.25 for non-symmetric matrices.
All architectures are memory-bound, but some may need in-core optimizations
[Roofline charts for the five compute engines with SpMV's arithmetic intensity marked]
SpMV Performance (reference implementation)
The reference implementation is CSR.
Simple parallelization by rows, balancing nonzeros (sketched below).
No implicit NUMA exploitation.
Despite the superscalars' use of 8 cores, they see little speedup.
Niagara and the PPEs show near-linear speedups.
[Bar charts: performance vs. matrix]
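A simplified sketch of that partitioning step: rows are split into contiguous chunks so that each thread receives roughly nnz/nthreads nonzeros (illustrative, not the thesis code):

/* Split rows into nthreads contiguous chunks with roughly equal numbers
 * of nonzeros.  row_ptr is the CSR row-pointer array; rows row_start[t]
 * .. row_start[t+1]-1 are assigned to thread t (row_start has
 * nthreads+1 entries).                                                  */
void partition_rows(int nrows, const int *row_ptr, int nthreads, int *row_start)
{
    int nnz = row_ptr[nrows];
    int r = 0;
    row_start[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        long target = (long)nnz * t / nthreads;   /* desired nnz boundary */
        while (r < nrows && row_ptr[r] < target)
            r++;
        row_start[t] = r;
    }
    row_start[nthreads] = nrows;
}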
SpMV Performance (reference implementation, vs. Roofline)
Roofline for the dense matrix in sparse format.
The superscalars achieve bandwidth-limited performance.
Niagara comes very close to its bandwidth limit.
Clearly, NUMA and prefetching will be essential.
[Roofline charts for the five compute engines with the reference SpMV performance plotted]
SpMV Performance (NUMA and software prefetching)
NUMA-aware allocation is essential on memory-bound NUMA SMPs (a first-touch sketch follows).
Explicit software prefetching can boost bandwidth and change cache replacement policies.
Cell PPEs are likely latency-limited.
Used exhaustive search.
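On these systems, NUMA-aware allocation typically relies on the operating system's first-touch page placement: the thread that will later use a range of the data touches it first (and in practice is pinned to its core), so the pages land in its local DRAM. A hedged pthreads sketch (thread pinning omitted):

#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 8

typedef struct { double *data; long begin, end; } init_arg_t;

/* First-touch initialization: the thread that will later operate on this
 * range of the array touches it first, so (under a first-touch NUMA
 * policy) the backing pages are placed in that thread's local DRAM.     */
static void *first_touch(void *p)
{
    init_arg_t *a = p;
    for (long i = a->begin; i < a->end; i++)
        a->data[i] = 0.0;
    return NULL;
}

double *numa_aware_alloc(long n)
{
    double *data = malloc(n * sizeof *data);
    if (!data) return NULL;

    pthread_t tid[NTHREADS];
    init_arg_t arg[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        arg[t].data  = data;
        arg[t].begin = n *  t      / NTHREADS;
        arg[t].end   = n * (t + 1) / NTHREADS;
        pthread_create(&tid[t], NULL, first_touch, &arg[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return data;
}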
SpMV Performance (matrix compression)
After maximizing memory bandwidth, the only hope is to minimize memory traffic.
Exploit: register blocking, other formats, smaller indices.
Use a traffic-minimization heuristic rather than search.
The benefit is clearly matrix-dependent.
Register blocking enables efficient software prefetching (one prefetch per cache line); a 2×1 BCSR kernel is sketched below.
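A minimal sketch of a register-blocked (BCSR) kernel with 2×1 blocks: one column index is stored per two values, at the cost of explicit zeros in partially full blocks (illustrative, not the thesis code):

/* SpMV for a 2x1 register-blocked (BCSR) matrix: blocks span 2 rows and
 * 1 column.  brow_ptr indexes blocks per block-row, bcol_idx gives each
 * block's column, and val stores 2 doubles per block.                   */
void spmv_bcsr_2x1(int nbrows, const int *brow_ptr, const int *bcol_idx,
                   const double *val, const double *x, double *y)
{
    for (int br = 0; br < nbrows; br++) {
        double y0 = 0.0, y1 = 0.0;              /* the block row's 2 outputs */
        for (int k = brow_ptr[br]; k < brow_ptr[br + 1]; k++) {
            double xv = x[bcol_idx[k]];         /* one x load per block */
            y0 += val[2 * k + 0] * xv;
            y1 += val[2 * k + 1] * xv;
        }
        y[2 * br + 0] = y0;
        y[2 * br + 1] = y1;
    }
}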
SpMV Performance (cache and TLB blocking)
Based on limited architectural knowledge, create a heuristic to choose a good cache and TLB block size.
Hierarchically store the resultant blocked matrix.
The benefit can be significant on the most challenging matrix.
SpMV Performance (Cell SPE implementation)
Cache blocking can be easily transformed into local store blocking.
With a few small tweaks for DMA, we can run a simplified version of the auto-tuner on Cell: BCOO only, 2×1 and larger blocks, always blocked.
SpMV Performance (median speedup)
[Same performance charts as above]
Median speedups over the reference implementation: 2.8×, 2.6×, 1.4×, and 15× on the four systems (likely Clovertown, Barcelona, Victoria Falls, and Cell Blade, respectively).
SpMV Performance (vs. Roofline)
Roofline for the dense matrix in sparse format.
Compression improves AI.
Auto-tuning can allow us to slightly exceed Stream bandwidth (but not pin bandwidth).
Cell PPEs perennially deliver poor performance.
[Roofline charts for the five compute engines with auto-tuned SpMV performance plotted]
SpMV Performance (summary)
Median SpMV performance aside, unlike LBMHD, SSE was unnecessary to achieve good performance.
Cell still requires a non-portable, ISA-specific implementation to achieve good performance.
Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.
Summary
Introduced the Roofline Model: apply bound-and-bottleneck analysis; performance and the requisite optimizations are inferred visually.
Extended auto-tuning to multicore: fundamentally different from running auto-tuned serial code on multicore SMPs; applied the concept to LBMHD and SpMV.
Auto-tuning LBMHD and SpMV: multicore has had a transformative effect on auto-tuning (a move from latency-limited to bandwidth-limited). Maximizing memory bandwidth and minimizing memory traffic is key. Compilers are reasonably effective at in-core optimizations, but totally ineffective at cache and memory issues; a library or framework is a necessity for managing these issues.
Comments on architecture: ultimately, machines are bandwidth-limited without new algorithms. Architectures with caches required significantly more tuning than the local store-based Cell.
Chapter 9: Future Directions in Auto-tuning
Future Work (Roofline)
Automatic generation of Roofline figures: kernel-oblivious; select the computational metric of interest; select the communication channel of interest; designate common "optimizations"; requires a benchmark.
Using performance counters to generate runtime Roofline figures: given a real kernel, we wish to understand the bottlenecks to performance; a much friendlier visualization of performance-counter data.
Future Work (making search tractable)
Given the explosion in optimizations, exhaustive search is clearly not tractable. Moreover, heuristics require extensive architectural knowledge.
In our SC08 work, we tried a greedy approach (one optimization at a time).
We could make it iterative, or we could make it look like steepest descent (with some local search).
[Figure: parameter space for Optimization A × parameter space for Optimization B]
Future Work (auto-tuning motifs)
We could certainly auto-tune other individual kernels in any motif, but this requires building a kernel-specific auto-tuner.
However, we should strive for motif-wide auto-tuning. Moreover, we want to decouple the data type (e.g. double precision) from the parallelization structure.
1. A motif description or pattern language for each motif: e.g. a taxonomy of structured grids plus a code snippet for the stencil; the auto-tuner parses these and produces optimized code.
2. A series of DAG rewrite rules for each motif. Rules allow: insertion of additional nodes, duplication of nodes, and reordering.
Rewrite Rules
Consider SpMV: in floating point, each node in the DAG is a MAC.
The DAG makes locality explicit (e.g. local store blocking).
BCSR adds zeros to the DAG.
We can cut edges and reconnect them to enable parallelization.
We can reorder operations.
Any other data type/node type conforming to these rules can reuse all of our auto-tuning efforts.
[Figure: DAG for y = Ax, with explicit zeros added by BCSR blocking]
Acknowledgments
Berkeley ParLab: Thesis Committee (David Patterson, Kathy Yelick, Sara McMains); BeBOP group (Jim Demmel, Kaushik Datta, Shoaib Kamil, Rich Vuduc, Rajesh Nishtala, etc.); and the rest of ParLab.
Lawrence Berkeley National Laboratory: FTG group (Lenny Oliker, John Shalf, Jonathan Carter, …).
Hardware donations and remote access: Sun Microsystems, IBM, AMD, FZ Jülich, Intel.
This research was supported by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231, by Microsoft and Intel funding through award #20080469, and by matching funding by U.C. Discovery through award #DIG07-10227.
Questions?