Transcript of S1 0950 Luszczek presentation (archive.ll.mit.edu)
Thinking Outside of the Tera-Scale Box
Piotr Luszczek
Brief History of Tera-flop: 1997
• 1997: ASCI Red
Brief History of Tera-flop: 2007
• 1997: ASCI Red
• 2007: Intel Polaris
Brief History of Tera-flop: GPGPU
• 1997: ASCI Red
• 2007: Intel Polaris
• 2008: GPGPU
Tera-flop: a Dictionary Perspective
• Belongs to a supercomputer expert
  – Consumes 500 kW
  – Costs over $1M
  – Requires distributed memory programming
  – Memory: 1 GiB or more
• Belongs to a gaming geek
  – Consumes under 500 W
  – Costs under $200
  – Requires only a CUDA kernel
  – Memory: 1 GiB or more
Double-Dip Recession?
[Chart: Productivity vs. Time]
• Multicore (2005): Cilk, Intel TBB, Intel Ct, Apple GCD, MS ConcRT, …
• Hybrid Multicore (2007): HMPP, PGI Accelerator, PathScale?
Locality Space in HPC Challenge
[Diagram: HPC Challenge kernels placed in the locality space: HPL, Mat-Mat-Mul, Parallel-Transpose, STREAM, FFT, RandomAccess]
Locality Space in HPC Challenge: Bounds
[Diagram: the same kernels annotated with their performance bounds; AORSA2D and PCIe also marked]
• Compute-bound: HPL, Mat-Mat-Mul
• Bandwidth-bound: STREAM, Parallel-Transpose
• Latency-bound: FFT, RandomAccess
CUDA vs. OpenCL: Similar Software Stacks
Matrix-Matrix Multiply: the Simple Form

  for i = 1:M
    for j = 1:N
      for k = 1:T
        C(i, j) += A(i, k) * B(k, j)
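As a point of reference, here is a minimal CUDA sketch of this simple form: one thread per C(i, j), row-major storage, no blocking or texture use. The kernel name and launch configuration are illustrative, not code from the presentation.

  // Minimal, untuned CUDA sketch of the triple loop above: each thread
  // computes one C(i, j). Illustrative only.
  __global__ void matmul_naive(int M, int N, int T,
                               const float *A,   // M x T, row-major
                               const float *B,   // T x N, row-major
                               float *C)         // M x N, row-major
  {
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < M && j < N) {
          float acc = 0.0f;
          for (int k = 0; k < T; ++k)
              acc += A[i * T + k] * B[k * N + j];
          C[i * N + j] += acc;
      }
  }

  // Example launch: 16x16 thread blocks covering the C matrix.
  // dim3 block(16, 16);
  // dim3 grid((N + 15) / 16, (M + 15) / 16);
  // matmul_naive<<<grid, block>>>(M, N, T, dA, dB, dC);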
Tuned implementations compared: CUBLAS, Volkov et al., MAGMA, Nakasato
Fermi C2050: from CUDA to OpenCL
[Chart: Performance (Gflop/s, 0 to 700) vs. problem size; series: CUDA with texture, CUDA without texture, OpenCL unchanged, OpenCL with good unrolling, OpenCL with bad unrolling]
Fermi C2050: Simple OpenCL Porting
[Chart: Gflop/s vs. problem size (192 to 5952); series: Native OpenCL Mat-mul, Ported OpenCL Mat-mul]
ATI 5870: Simple OpenCL Porting
[Chart: Gflop/s (0 to 1600) vs. problem size (192 to 5952); series: Native OpenCL Mat-mul, Ported OpenCL Mat-mul (inlined indexing), Ported OpenCL Mat-mul]
Making Case for Autotuning: Use of Texture
[Chart: Gflop/s vs. problem size; series: Fermi C2050 OpenCL with texture, Fermi C2050 OpenCL without texture]
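For illustration, the texture path can be sketched with the Fermi-era CUDA texture-reference API; the measurement above is for the analogous OpenCL image path, so this is only a sketch of the technique with illustrative names, not the measured code.

  // Legacy CUDA texture reference (Fermi-era API): B is read through the
  // texture cache, A through ordinary global loads. Illustrative only.
  texture<float, cudaTextureType1D, cudaReadModeElementType> texB;

  __global__ void matmul_texB(int M, int N, int T, const float *A, float *C)
  {
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < M && j < N) {
          float acc = 0.0f;
          for (int k = 0; k < T; ++k)
              acc += A[i * T + k] * tex1Dfetch(texB, k * N + j);
          C[i * N + j] += acc;
      }
  }

  // Host side: bind the device array holding B before launching.
  // cudaBindTexture(0, texB, dB, (size_t)T * N * sizeof(float));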
Making Case for Autotuning: Blocking Factors
[Chart of blocking factors; annotation: provided by NVIDIA]
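The blocking factor is exactly the kind of parameter an autotuner sweeps. Below is a minimal CUDA sketch of a tiled kernel with the blocking factor as a compile-time parameter; it assumes a square n-by-n problem with n a multiple of TILE, and it is not the NVIDIA-provided or MAGMA kernel.

  // Tiled kernel with the blocking factor TILE as a template parameter.
  // Each block stages a TILE x TILE tile of A and B in shared memory.
  template <int TILE>
  __global__ void matmul_tiled(int n, const float *A, const float *B, float *C)
  {
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];

      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float acc = 0.0f;

      for (int t = 0; t < n; t += TILE) {
          // Stage one TILE x TILE block of A and of B in shared memory.
          As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
          Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
          __syncthreads();

          for (int k = 0; k < TILE; ++k)
              acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
          __syncthreads();
      }
      C[row * n + col] += acc;
  }

An autotuner would instantiate and time matmul_tiled<8>, matmul_tiled<16>, matmul_tiled<32>, and so on, keeping the best value per device; the best choice generally differs between, say, the Fermi C2050 and the ATI 5870, which is the case for autotuning.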
Mat-Mul Portability Summary

Achieved performance:

                          Single precision    Double precision
  ATI OpenCL              52%                 57%
  ATI IL                  74%                 87%
  NVIDIA CUDA             63%                 58%
  NVIDIA OpenCL           57%                 49%
  Multicore (vendor)      79%                 79%
  Multicore (autotuned)   46%                 40%
Now, let's think outside of the box
Recipe for Performance Before Multicore
• Load/Store
• Instr. scheduling
• Wider superscalar instr. issue
• Increase clock frequency
• Increase number of transistors
Recipe for Performance After Multicore
More cores
Parallel software
Recipe for Future Performance
More cores
Better sequential software
DAG Runtime
From Assembly to DAG: PLASMA
• Typical loop in assembly (a sequential instruction stream, handled by instr. scheduling):
  – load R0, #1000
  – divf F1, F2, F3
  – load R1, #1000
  – divf F2, F4, F5
  – dec R1
  – jump …
• LU factorization (a sequential task stream, handled by DAG scheduling):
  – for i = 1:N
    – GETRF2(A[i, i])
    – for j = i+1:N
      – TRSM(A[i, i], A[i, j])
      – for k = i+1:N
        – GEMM(A[k, i], A[i, j], A[k, j])
• Resulting DAG: each GETRF2 feeds TRSM and GEMM tasks, which in turn feed the next GETRF2
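The same task stream can be written as ordinary host code. The sketch below uses hypothetical tile and task names (Tile, getrf2_tile, trsm_tile, gemm_tile) as stand-ins for the real PLASMA kernels; in PLASMA this loop nest is handed to a runtime that records read/write dependencies on each tile and executes the resulting DAG out of order, rather than running the tasks in sequence.

  // Hypothetical tile type and task bodies, declared only so the sketch
  // is self-contained; they stand in for the real PLASMA kernels.
  struct Tile { float *data; int nb; };
  void getrf2_tile(Tile &diag);
  void trsm_tile(const Tile &diag, Tile &panel);
  void gemm_tile(const Tile &a, const Tile &b, Tile &c);

  // Sequential task stream for tile LU on an N x N grid of tiles A[i][j];
  // a DAG runtime would trace these calls instead of executing them in order.
  void lu_task_stream(int N, Tile **A)
  {
      for (int i = 0; i < N; ++i) {
          getrf2_tile(A[i][i]);                          // panel factorization
          for (int j = i + 1; j < N; ++j) {
              trsm_tile(A[i][i], A[i][j]);               // triangular solve on row i
              for (int k = i + 1; k < N; ++k)
                  gemm_tile(A[k][i], A[i][j], A[k][j]);  // trailing-matrix update
          }
      }
  }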
Scalability of LU Factorization on Multicore
[Charts: Gflop/s vs. matrix size (1000 to 10000) on AMD Istanbul 2.8 GHz; panels for 6 cores (1 socket), 24 cores (4 sockets), and 48 cores (8 sockets); series: PLASMA, MKL 11.0, LAPACK]
Combining Multicore and GPU for QR Fact.
[Chart: Tflop/s (0 to 0.6) vs. problem size (3840 to 19200) on AMD Istanbul 2.8 GHz, 24 cores, with a Fermi GTX 480; series: CPUs + GPU, CPUs, GPU]
Performance on Distributed Memory
[Charts: Tflop/s vs. number of cores (108 to 3072) for QR, LU, and Cholesky; series: Peak, DPLASMA, Cray LibSci]
ORNL Kraken, Cray XT5, AMD Istanbul 2.6 GHz, dual-socket 6-core nodes, SeaStar 3D-torus interconnect
People and Projects
• People: Jack Dongarra, Mathieu Faverge, Jakub Kurzak, Hatem Ltaief, Stan Tomov, Asim YarKhan, Peng Du, Rick Weber
• Projects: MAGMA, PLASMA, DPLASMA and DAGuE
• DPLASMA and DAGuE: George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier