LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication...

37
LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao (Rice University) Michael Ferguson (Cray Inc.) Vivek Sarkar (Rice University) 2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15 1

Transcript of LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication...

Page 1: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

LLVM-based Communication Optimizationsfor PGAS Programs Akihiro Hayashi (Rice University)

Jisheng Zhao (Rice University) Michael Ferguson (Cray Inc.) Vivek Sarkar (Rice University)

2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15

1

Page 2: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

A Big Picture

2Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/

© Argonne National Lab.

©RIKENAICS

©Berkeley Lab.

X10, Habanero-UPC++,…

Communication

Optimization

Page 3: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

PGAS Languages q High-productivity features:

§ Global-View §  Task parallelism §  Data Distribution §  Synchronization

3

X10Photo Credits : http://chapel.cray.com/logo.html, http://upc.lbl.gov/

CAFHabanero-UPC++

Page 4: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Communication is implicit in some PGAS Programming Models q Global Address Space

§ Compiler and Runtime is responsible for performing communications across nodes

4

1: var x = 1; // on Node 0 2: on Locales[1] {// on Node 1 3: … = x; // DATA ACCESS 4: }

RemoteDataAccessinChapel

Page 5: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Communication is Implicit in some PGAS Programming Models (Cont’d)

5

1: var x = 1; // on Node 0 2: on Locales[1] {// on Node 1 3: … = x; // DATA ACCESS

RemoteDataAccess

if (x.locale == MYLOCALE) { *(x.addr) = 1; } else { gasnet_get(…);}

Run>meaffinityhandling

1: var x = 1; 2: on Locales[1] { 3: … = 1;

CompilerOp>miza>on

OR!

Page 6: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Communication Optimizationis Important

6Asynthe>cChapelprogram

onIntelXeonCPUX5660ClusterswithQDRInifiniband

1!

10!

100!

1000!

10000!

100000!

1000000!

10000000!

Latency(m

s)

TransferredByte

Optimized (Bulk Transfer) Unoptimized

1,500x !

59x !

Lower is better

Page 7: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

PGAS Optimizations are language-specific

7Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/

X10, Habanero-UPC++,…

© Argonne National Lab.

©Berkeley Lab.

©RIKENAICS

Chapel Compiler

UPC Compiler

X10 Compiler Habanero-C Compiler

Page 8: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Our goal

8Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/

© Argonne National Lab.

©RIKENAICS

©Berkeley Lab.

X10, Habanero-UPC++,…

Page 9: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Why LLVM? q Widely used language-agnostic compiler

9

LLVM Intermediate Representation (LLVM IR)

C/C++Frontend

Clang

C/C++, Fortran, Ada, Objective-CFrontend

dragonegg Chapel

Frontend

x86 backend

Power PCbackend

ARMbackend

PTXbackend

Analysis & Optimizations

UPC++Frontend

x86 Binary PPC Binary ARM Binary GPU Binary

Page 10: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Summary & Contributions

q Our Observations : §  Many PGAS languages share semantically similar constructs §  PGAS Optimizations are language-specific

q Contributions: §  Built a compilation framework that can uniformly optimize

PGAS programs (Initial Focus : Communication) ü Enabling existing LLVM passes for communication optimizations ü PGAS-aware communication optimizations

10Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,

Page 11: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Overview of our framework

11

Chapel Programs

UPC++ Programs

X10 Programs

Chapel-LLVM

frontend UPC++-

LLVM frontend

X10-LLVM frontend

LLVM IR LLVM-based

Communication Optimization

passes

Lowering Pass

CAF Programs

CAF-LLVM frontend

1. Vanilla LLVM IR 2. use address space feature to express communications

Need to be implemented when supporting a new language/runtime

Generally language-agnostic

Page 12: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

How optimizations work

12

store i64 1, i64 addrspace(100)* %x, …

// x is possibly remotex = 1;

Chapelshared_var<int> x;x = 1;

UPC++

1.Existing LLVM Optimizations

2.PGAS-aware Optimizations

treat remote access as if it

were local access Runtime-Specific Lowering"

Address space-aware Optimizations

Communication API Calls!

Page 13: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

LLVM-based Communication Optimizations for Chapel 1.  Enabling Existing LLVM passes

§  Loop invariant code motion (LICM) §  Scalar replacement, …

2.  Aggregation §  Combine sequences of loads/stores on

adjacent memory location into a single memcpy

13These are already implemented in the standard Chapel compiler

Page 14: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

An optimization example:LICM for Communication Optimizations

14LICM = Loop Invariant Code Motion

for i in 1..100 { %x = load i64 addrspace(100)* %xptr A(i) = %x;}

LICM by LLVM

Page 15: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

An optimization example:Aggregation

15

// p is possibly remotesum = p.x + p.y;

llvm.memcpy(…);

load i64 addrspace(100)* %pptr+0load i64 addrspace(100)* %pptr+4

x! y!

GET! GET!

GET!

Page 16: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

LLVM-based Communication Optimizations for Chapel 3.  Locality Optimization

§  Infer the locality of data and convert possibly-remote access to definitely-local access at compile-time if possible

4.  Coalescing §  Remote array access vectorization

16These are implemented, but not in the standard Chapel compiler

Page 17: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

An Optimization example:Locality Optimization

17

1: proc habanero(ref x, ref y, ref z) {2: var p: int = 0;3: var A:[1..N] int;4: local { p = z; }5: z = A(0) + z; 6:}

2.pandzaredefinitelylocal

3.Definitely-localaccess!(avoidrun@meaffinitychecking)

1.Aisdefinitely-local

Page 18: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

An Optimization example:Coalescing

18

1:for i in 1..N {2: … = A(i);3:}

1:localA = A;2:for i in 1..N {3: … = localA(i);4:}

Performbulktransfer

Convertedtodefinitely-local

access

BeforeAUer

Page 19: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Performance Evaluations:Benchmarks

19

Application Size

Smith-Waterman 185,600 x 192,000

Cholesky Decomp 10,000 x 10,000

NPB EP CLASS = D

Sobel 48,000 x 48,000

SSCA2 Kernel 4 SCALE = 16

Stream EP 2^30

Page 20: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Performance Evaluations:Platforms q Cray XC30™ Supercomputer @ NERSC

§  Node ü  Intel Xeon E5-2695 @ 2.40GHz x 24 cores ü 64GB of RAM

§  Interconnect ü Cray Aries interconnect with Dragonfly topology

q Westmere Cluster @ Rice §  Node

ü  Intel Xeon CPU X5660 @ 2.80GHz x 12 cores ü 48 GB of RAM

§  Interconnect ü Quad-data rated infiniband

20

Page 21: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Performance Evaluations:Details of Compiler & Runtime q Compiler

§  Chapel Compiler version 1.9.0 §  LLVM 3.3

q Runtime : §  GASNet-1.22.0

ü Cray XC : aries ü Westmere Cluster : ibv-conduit

§  Qthreads-1.10 ü Cray XC: 2 shepherds, 24 workers / shepherd ü Westmere Cluster : 2 shepherds, 6 workers / shepherd

21

Page 22: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

BRIEF SUMMARY OF PERFORMANCE EVALUATIONS

Performance Evaluation

22

Page 23: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

0.0

1.0

2.0

3.0

4.0

5.0

SW Cholesky Sobel StreamEP EP SSCA2

Perf

orm

ance

Impr

ovem

ent

over

LLV

M-u

nopt Coalescing Locality Opt

Aggregation Existing

Results on the Cray XC (LLVM-unopt vs. LLVM-allopt)

23

ü  4.6x performance improvement relative to LLVM-unopt on the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)

2.1x

19.5x

1.1x

2.4x 1.4x 1.3x

Higher is better

Page 24: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

0.0

1.0

2.0

3.0

4.0

5.0

SW Cholesky Sobel StreamEP EP SSCA2

Perf

orm

ance

Impr

ovem

ent

over

LLV

M-u

nopt

Coalescing Locality Opt Aggregation Existing

Results on Westmere Cluster(LLVM-unopt vs. LLVM-allopt)

24

2.3x

16.9x

1.1x

2.5x

1.3x 2.3x

ü  4.4x performance improvement relative to LLVM-unopt on the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)

Page 25: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

DETAILED RESULTS & ANALYSIS OF CHOLESKY DECOMPOSITION

Performance Evaluation

25

Page 26: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Cholesky Decomposition

26

dependencies0001122233

001122233

01122233

1122233

122233

22233

2233

233

33 3

Node0

Node1

Node2

Node3

Page 27: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Metrics 1.  Performance & Scalability

§  Baseline (LLVM-unopt) §  LLVM-based Optimizations (LLVM-allopt)

2.  The dynamic number of communication API calls

3.  Analysis of optimized code 4.  Performance comparison

§  Conventional C-backend vs. LLVM-backend 27

Page 28: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Performance Improvement by LLVM (Cholesky on the Cray XC)

28

1

0.1 0.1 0.2 0.2 0.3

2.6 2.7

3.7 4.1 4.3 4.5

0

1

2

3

4

5

1 locale! 2 locales! 4 locales! 8 locales! 16 locales! 32 locales!

Spee

dup

over

LLV

M-u

nopt

1l

ocal

e

LLVM-unopt LLVM-allopt

ü  LLVM-based communication optimizations show scalability

Page 29: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Communication API calls elimination by LLVM (Cholesky on the Cray XC)

29

100.0% 100.0% 100.0% 100.0%

12.1% 0.2%

89.2% 100.0%

0

0.2

0.4

0.6

0.8

1

1.2

LOCAL_GET REMOTE_GET LOCAL_PUT REMOTE_PUT

Dyn

amic

num

ber o

f co

mm

unic

atio

n AP

I cal

ls

(nor

mal

ized

to L

LVM

-uno

pt)

LLVM-unopt LLVM-allopt

8.3x improvement

500x improvement

1.1x improvement

Page 30: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Analysis of optimized code

30

for jB in zero..tileSize-1 { for kB in zero..tileSize-1 { 4GETS for iB in zero..tileSize-1 { 9GETS + 1PUT }}}

1.ALLOCATE LOCAL BUFFER 2.PERFORM BULK TRANSFER for jB in zero..tileSize-1 { for kB in zero..tileSize-1 { 1GET for iB in zero..tileSize-1 { 1GET + 1PUT }}}

LLVM-unopt LLVM-allopt

Page 31: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Performance comparison with C-backend

31

2.8

0.1 0.1 0.1 0.3 0.7

0.4 1

0.1 0.1 0.2 0.2 0.3 0.4

2.6 2.7

3.7 4.1 4.3 4.5 4.5

0

1

2

3

4

5

1locale 2locales 4locales 8locales 16locales 32locales 64locales

Spee

dup

over

LLV

M-u

nopt

1l

ocal

e

C-backend LLVM-unopt LLVM-allopt

C-backend is faster!

Page 32: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Current limitation

q  In LLVM 3.3, many optimizations assume that the pointer size is the same across all address spaces

32

Locale(16bit)

addr(48bit)

For LLVM Code Generation : 64bit packed pointer

For C Code Generation : 128bit struct pointer

ptr.locale;ptr.addr;

ptr >> 48 ptr | 48BITS_MASK;

1. Needs more instructions 2. Lose opportunities for Alias analysis

Page 33: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Conclusions q LLVM-based Communication optimizations for

PGAS Programs §  Promising way to optimize PGAS programs in a

language-agnostic manner §  Preliminary Evaluation with 6 Chapel applications

ü Cray XC30 Supercomputer –  4.6x average performance improvement

ü Westmere Cluster –  4.4x average performance improvement

33

Page 34: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Future work q Extend LLVM IR to support parallel programs with

PGAS and explicit task parallelism §  Higher-level IR

34

Parallel Programs

(Chapel, X10, CAF, HC, …)

1.RI-PIR Gen 2.Analysis

3.Transformation

1.RS-PIR Gen 2.Analysis

3.Transformation LLVM

Runtime-Independent Optimizations

e.g. Task Parallel Construct

LLVM Runtime-Specific

Optimizations e.g. GASNet API

Binary

Page 35: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Acknowledgements q Special thanks to

§  Brad Chamberlain (Cray) §  Rafael Larrosa Jimenez (UMA) §  Rafael Asenjo Plaza (UMA) § Habanero Group at Rice

35

Page 36: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Backup slides

36

Page 37: LLVM-based Communication Optimizations for PGAS Programs · 2015-11-30 · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao

Compilation Flow

37

Chapel Programs

AST Generation and Optimizations

C-code Generation

LLVM Optimizations Backend Compiler’s Optimizations (e.g. gcc –O3)

LLVM IR C Programs

LLVM IR Generation

Binary Binary