Transcript of S1 0950 Luszczek presentation (archive.ll.mit.edu)
Thinking Outside of the Tera-Scale Box
Piotr Luszczek
Brief History of Tera-flop: 1997
• 1997: ASCI Red
Brief History of Tera-flop: 2007
• 1997: ASCI Red
• 2007: Intel Polaris
Brief History of Tera-flop: GPGPU
• 1997: ASCI Red
• 2007: Intel Polaris
• 2008: GPGPU
Tera-flop: a Dictionary Perspective
• Belongs to a supercomputer expert
  – Consumes 500 kW
  – Costs over $1M
  – Requires distributed memory programming
  – Memory: 1 GiB or more
• Belongs to a gaming geek
  – Consumes under 500 W
  – Costs under $200
  – Requires only a CUDA kernel
  – Memory: 1 GiB or more
Double-Dip Recession?
[Chart: Productivity vs. Time]
• Multicore (2005): Cilk, Intel TBB, Intel Ct, Apple GCD, MS ConcRT, …
• Hybrid Multicore (2007): HMPP, PGI Accelerator, PathScale?
Locality Space in HPC Challenge
[Diagram: HPC Challenge kernels placed in the locality space: HPL, Mat-Mat-Mul, Parallel-Transpose, STREAM, FFT, RandomAccess]
Locality Space in HPC Challenge: Bounds
[Diagram: the same kernels annotated with their performance bounds; AORSA2D and PCIe also marked]
• Compute-bound: HPL, Mat-Mat-Mul
• Bandwidth-bound: STREAM, Parallel-Transpose
• Latency-bound: FFT, RandomAccess
CUDA vs. OpenCL: Similar Software Stacks
Matrix-Matrix Multiply: the Simple Form

  for i = 1:M
    for j = 1:N
      for k = 1:T
        C(i, j) += A(i, k) * B(k, j)
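As a point of reference, here is a minimal CUDA sketch of this simple form: one thread per C(i, j), row-major storage, no blocking or texture use. The kernel name and launch configuration are illustrative, not code from the presentation.

  // Minimal, untuned CUDA sketch of the triple loop above: each thread
  // computes one C(i, j). Illustrative only.
  __global__ void matmul_naive(int M, int N, int T,
                               const float *A,   // M x T, row-major
                               const float *B,   // T x N, row-major
                               float *C)         // M x N, row-major
  {
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < M && j < N) {
          float acc = 0.0f;
          for (int k = 0; k < T; ++k)
              acc += A[i * T + k] * B[k * N + j];
          C[i * N + j] += acc;
      }
  }

  // Example launch: 16x16 thread blocks covering the C matrix.
  // dim3 block(16, 16);
  // dim3 grid((N + 15) / 16, (M + 15) / 16);
  // matmul_naive<<<grid, block>>>(M, N, T, dA, dB, dC);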
Tuned implementations compared: CUBLAS, Volkov et al., MAGMA, Nakasato
Fermi C2050: from CUDA to OpenCL
[Chart: Performance (Gflop/s, 0 to 700) vs. problem size; series: CUDA with texture, CUDA without texture, OpenCL unchanged, OpenCL with good unrolling, OpenCL with bad unrolling]
Fermi C2050: Simple OpenCL Porting
[Chart: Gflop/s vs. problem size (192 to 5952); series: Native OpenCL Mat-mul, Ported OpenCL Mat-mul]
ATI 5870: Simple OpenCL Porting
[Chart: Gflop/s (0 to 1600) vs. problem size (192 to 5952); series: Native OpenCL Mat-mul, Ported OpenCL Mat-mul (inlined indexing), Ported OpenCL Mat-mul]
Making Case for Autotuning: Use of Texture
[Chart: Gflop/s vs. problem size; series: Fermi C2050 OpenCL with texture, Fermi C2050 OpenCL without texture]
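For illustration, the texture path can be sketched with the Fermi-era CUDA texture-reference API; the measurement above is for the analogous OpenCL image path, so this is only a sketch of the technique with illustrative names, not the measured code.

  // Legacy CUDA texture reference (Fermi-era API): B is read through the
  // texture cache, A through ordinary global loads. Illustrative only.
  texture<float, cudaTextureType1D, cudaReadModeElementType> texB;

  __global__ void matmul_texB(int M, int N, int T, const float *A, float *C)
  {
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < M && j < N) {
          float acc = 0.0f;
          for (int k = 0; k < T; ++k)
              acc += A[i * T + k] * tex1Dfetch(texB, k * N + j);
          C[i * N + j] += acc;
      }
  }

  // Host side: bind the device array holding B before launching.
  // cudaBindTexture(0, texB, dB, (size_t)T * N * sizeof(float));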
Making Case for Autotuning: Blocking Factors
[Chart of blocking factors; annotation: provided by NVIDIA]
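The blocking factor is exactly the kind of parameter an autotuner sweeps. Below is a minimal CUDA sketch of a tiled kernel with the blocking factor as a compile-time parameter; it assumes a square n-by-n problem with n a multiple of TILE, and it is not the NVIDIA-provided or MAGMA kernel.

  // Tiled kernel with the blocking factor TILE as a template parameter.
  // Each block stages a TILE x TILE tile of A and B in shared memory.
  template <int TILE>
  __global__ void matmul_tiled(int n, const float *A, const float *B, float *C)
  {
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];

      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float acc = 0.0f;

      for (int t = 0; t < n; t += TILE) {
          // Stage one TILE x TILE block of A and of B in shared memory.
          As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
          Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
          __syncthreads();

          for (int k = 0; k < TILE; ++k)
              acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
          __syncthreads();
      }
      C[row * n + col] += acc;
  }

An autotuner would instantiate and time matmul_tiled<8>, matmul_tiled<16>, matmul_tiled<32>, and so on, keeping the best value per device; the best choice generally differs between, say, the Fermi C2050 and the ATI 5870, which is the case for autotuning.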
Mat-Mul Portability Summary

Achieved performance:

                          Single precision    Double precision
  ATI OpenCL              52%                 57%
  ATI IL                  74%                 87%
  NVIDIA CUDA             63%                 58%
  NVIDIA OpenCL           57%                 49%
  Multicore (vendor)      79%                 79%
  Multicore (autotuned)   46%                 40%
Now, let's think outside of the box
Recipe for Performance Before Multicore
• Load/Store
• Instr. scheduling
• Wider superscalar instr. issue
• Increase clock frequency
• Increase number of transistors
Recipe for Performance After Multicore
More cores
Parallel software
Recipe for Future Performance
More cores
Better sequential software
DAG Runtime
From Assembly to DAG: PLASMA
• Typical loop in assembly (a sequential instruction stream, handled by instr. scheduling):
  – load R0, #1000
  – divf F1, F2, F3
  – load R1, #1000
  – divf F2, F4, F5
  – dec R1
  – jump …
• LU factorization (a sequential task stream, handled by DAG scheduling):
  – for i = 1:N
    – GETRF2(A[i, i])
    – for j = i+1:N
      – TRSM(A[i, i], A[i, j])
      – for k = i+1:N
        – GEMM(A[k, i], A[i, j], A[k, j])
• Resulting DAG: each GETRF2 feeds TRSM and GEMM tasks, which in turn feed the next GETRF2
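The same task stream can be written as ordinary host code. The sketch below uses hypothetical tile and task names (Tile, getrf2_tile, trsm_tile, gemm_tile) as stand-ins for the real PLASMA kernels; in PLASMA this loop nest is handed to a runtime that records read/write dependencies on each tile and executes the resulting DAG out of order, rather than running the tasks in sequence.

  // Hypothetical tile type and task bodies, declared only so the sketch
  // is self-contained; they stand in for the real PLASMA kernels.
  struct Tile { float *data; int nb; };
  void getrf2_tile(Tile &diag);
  void trsm_tile(const Tile &diag, Tile &panel);
  void gemm_tile(const Tile &a, const Tile &b, Tile &c);

  // Sequential task stream for tile LU on an N x N grid of tiles A[i][j];
  // a DAG runtime would trace these calls instead of executing them in order.
  void lu_task_stream(int N, Tile **A)
  {
      for (int i = 0; i < N; ++i) {
          getrf2_tile(A[i][i]);                          // panel factorization
          for (int j = i + 1; j < N; ++j) {
              trsm_tile(A[i][i], A[i][j]);               // triangular solve on row i
              for (int k = i + 1; k < N; ++k)
                  gemm_tile(A[k][i], A[i][j], A[k][j]);  // trailing-matrix update
          }
      }
  }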
Scalability of LU Factorization on Multicore
[Charts: Gflop/s vs. matrix size (1000 to 10000) on AMD Istanbul 2.8 GHz; panels for 6 cores (1 socket), 24 cores (4 sockets), and 48 cores (8 sockets); series: PLASMA, MKL 11.0, LAPACK]
Combining Multicore and GPU for QR Fact.
[Chart: Tflop/s (0 to 0.6) vs. problem size (3840 to 19200) on AMD Istanbul 2.8 GHz, 24 cores, with a Fermi GTX 480; series: CPUs + GPU, CPUs, GPU]
Performance on Distributed Memory
[Charts: Tflop/s vs. number of cores (108 to 3072) for QR, LU, and Cholesky; series: Peak, DPLASMA, Cray LibSci]
ORNL Kraken, Cray XT5, AMD Istanbul 2.6 GHz, dual-socket 6-core nodes, SeaStar 3D-torus interconnect
People and Projects
• People: Jack Dongarra, Mathieu Faverge, Jakub Kurzak, Hatem Ltaief, Stan Tomov, Asim YarKhan, Peng Du, Rick Weber
• Projects: MAGMA, PLASMA, DPLASMA and DAGuE
• DPLASMA and DAGuE: George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier