Polly-ACC: Transparent Compilation to...

26
spcl.inf.ethz.ch @spcl_eth Polly-ACC: Transparent Compilation to Heterogeneous Hardware Torsten Hoefler (with Tobias Grosser) 1

Transcript of Polly-ACC: Transparent Compilation to...

Page 1: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

Polly-ACC: Transparent Compilation to Heterogeneous Hardware

Torsten Hoefler (with Tobias Grosser)

1

Page 2: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

2

Evading various “ends” – the hardware view

Page 3: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

3

row = 0;

output_image_ptr = output_image;

output_image_ptr += (NN * dead_rows);

for (r = 0; r < NN - KK + 1; r++) {

output_image_offset = output_image_ptr;

output_image_offset += dead_cols;

col = 0;

for (c = 0; c < NN - KK + 1; c++) {

input_image_ptr= input_image;

input_image_ptr+= (NN * row);

kernel_ptr = kernel;

S0: *output_image_offset = 0;

for (i = 0; i < KK; i++) {

input_image_offset = input_image_ptr;

input_image_offset += col;

kernel_offset = kernel_ptr;

for (j = 0; j < KK; j++) {

S1: temp1 = *input_image_offset++;

S1: temp2 = *kernel_offset++;

S1: *output_image_offset += temp1 * temp2;

}

kernel_ptr += KK;

input_image_ptr+= NN;

}

S2: *output_image_offset = ((*output_image_offset)/

normal_factor);

output_image_offset++ ;

col++;

}

output_image_ptr += NN;

row++;

}

}

Fortran

C/C++CPU

CPU

CPU

CPU

CPU

CPU

CPU

CPU

Multi-Core CPU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

GP

UG

PU

Accelerator

Sequential Software Parallel Hardware

Page 4: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

Non-Goal:

Algorithmic Changes

4

Design Goals

Automatic

“Regression Free” High Performance

Automatic accelerator mapping

-

How close can we get?

Page 5: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

Tool: Polyhedral Modeling

Iteration Space

0 1 2 3 4 5

j

i

5

4

3

2

1

0

N = 4

j ≤ i

i ≤ N = 4

0 ≤ j

0 ≤ i

D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

(i, j) = (0,0)(1,0)(1,1)(2,0)(2,1)

Program Code

(2,2)(3,0)(3,1)(3,2)(3,3)(4,0)(4,1)(4,2)(4,3)(4,4)

for (i = 0; i <= N; i++)

for (j = 0; j <= i; j++)

S(i,j);

4

Polly -- Performing Polyhedral

Optimizations on a Low-Level

Intermediate Representation

Tobias Grosser et al,

Parallel Processing Letter, 2012

Page 6: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

6

Mapping Computation to Device

0

1

2

0

1

2

0 1 2 3 0 1 2 3

0

0 1

1

Device Blocks & ThreadsIteration Space

𝐵𝐼𝐷 = { 𝑖, 𝑗 →𝑖

4% 2,

𝑗

3% 2 }

0

1

10

i

j

𝑇𝐼𝐷 = { 𝑖, 𝑗 → 𝑖 % 4, 𝑗 % 3 }

Page 7: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

7

Memory Hierarchy of a Heterogeneous System

Page 8: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

8

Host-device date transfers

Page 9: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

9

Host-device date transfers

Page 10: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

10

Mapping onto fast memory

Page 11: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

11

Mapping onto fast memory

Polyhedral parallel code generation for CUDA, Verdoolaege, Sven et. al, ACM

Transactions on Architecture and Code Optimization, 2013

Page 12: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

Profitability Heuristic

Trivial

Unsuitable

Insufficient Compute

static dynamic

Modeling Execution

All

Loop

Nests

GPU

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 13: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

13

From kernels to program – data transfers

void heat(int n, float A[n], float hot, float cold) {

float B[n] = {0};

initialize(n, A, cold);

setCenter(n, A, hot, n/4);

for (int t = 0; t < T; t++) {

average(n, A, B);

average(n, B, A);

printf("Iteration %d done", t);

}

}

Page 14: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

14

Data Transfer – Per Kernel

Host Memory

initialize()

setCenter()

average()

average()

average()

D → 𝐻

D → 𝐻

𝐻 → 𝐷 𝐷 → 𝐻

tim

e

𝐻 → 𝐷 𝐷 → 𝐻

𝐻 → 𝐷 𝐷 → 𝐻

Device Memory

void heat(int n, float A[n], ...) {

initialize(n, A, cold);

setCenter(n, A, hot, n/4);

for (int t = 0; t < T; t++) {

average(n, A, B);

average(n, B, A);

printf("Iteration %d done", t);

} }

Page 15: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

15

Data Transfer – Inter Kernel Caching

Host Memory

𝐷 → 𝐻

Host Memory

initialize()

setCenter()

average()

average()

average()

tim

e

𝐻 → 𝐷

Device Memory

void heat(int n, float A[n], ...) {

initialize(n, A, cold);

setCenter(n, A, hot, n/4);

for (int t = 0; t < T; t++) {

average(n, A, B);

average(n, B, A);

printf("Iteration %d done", t);

} }

Page 16: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

16

EvaluationEvaluation

Workstation: 10 core SandyBridge NVIDIA Titan Black (Kepler)

Mobile: 4 core Haswell NVIDIA GT730M (Kepler)

Page 17: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

1

10

100

1000

10000

SCoPs 0-dim 1-dim 2-dim 3-dim

No Heuristics Heuristics

17

LLVM Nightly Test Suite

# C

om

pute

Regio

ns /

Kern

els

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 18: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

18

Some results: Polybench 3.2

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

arithmean: ~30x

geomean: ~6x

Xeon E5-2690 (10 cores, 0.5Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7Tflop)

Speedup over icc –O3

Page 19: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

0:00

1:12

2:24

3:36

4:48

6:00

7:12

8:24

Mobile Workstation

icc icc -openmp clang Polly ACC

19

Compiles all of SPEC CPU 2006 – Example:

LBM

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Ru

ntim

e (

m:s

)

Xeon E5-2690 (10 cores, 0.5Tflop) vs.

Titan Black Kepler GPU (2.9k cores, 1.7Tflop)

essentially my 4-core x86 laptop

with the (free) GPU that’s in there

~20%

~4x

Page 20: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

20

Cactus ADM (SPEC 2006)

Work

sta

tion

Mobile

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 21: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

21

Cactus ADM (SPEC 2006) - Data Transfer

Work

sta

tion

Mobile

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 22: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

Polly-ACC

22

Automatic

“Regression Free” High Performance

http://spcl.inf.ethz.ch/Polly-ACC

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16

Page 23: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

Unfortunately not …

Limited to affine code regions

Maybe generalizes to control-restricted programs

No distributed anything!!

Good news:

Much of traditional HPC fits that model

Infrastructure is coming along

Bad news:

Modern data-driven HPC and Big Data fits less well

Need a programming model for distributed heterogeneous machines!

23

Brave new compiler world!?

Page 24: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

How do we program GPUs today?

ld

ld

ld

ld

st

st

st

st

device compute core active thread instruction latency

CUDA• over-subscribe hardware• use spare parallel slack for latency

hiding

MPI• host controlled• full device

synchronization

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 25: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

Latency hiding at the cluster level?

ld

ld

ld

ld

device compute core active thread instruction latency

dCUDA (distributed CUDA)• unified programming model for GPU clusters• avoid unnecessary device synchronization to enable system wide latency hiding

st

put

st

put

ld

ld

ld

ld

st

st

st

st

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 26: Polly-ACC: Transparent Compilation to …spcl.inf.ethz.ch/.../.pdf/hoefler-grosser-polly-acc-sc16.pdfWorkstation: 10 core SandyBridge NVIDIA Titan Black (Kepler) Mobile: 4 core Haswell

spcl.inf.ethz.ch

@spcl_eth

Tobias Gysi, Jeremiah Baer, TH: “dCUDA: Hardware Supported

Overlap of Computation and Communication”

Wednesday, Nov. 16th

4:00-4:30pm

Room 355-D

26

Talk on Wednesday