Transcript of "Challenges and Opportunities in Heterogeneous Multi-core Era"
R. Govindarajan, HPC Lab, SERC, IISc (govind@serc.iisc.ernet.in)
Garuda-PM, July 2011. (C) RG@SERC, IISc

Page 1:


Challenges and Opportunities in Heterogeneous Multi-core Era

R. Govindarajan, HPC Lab, SERC, IISc

govind@serc.iisc.ernet.in

Page 2:

Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions

Page 3:

Moore’s Law : Transistors

Page 4:

Moore’s Law : Performance

Processor performance doubles every 1.5 years

Page 5:

Overview

• Introduction
• Trends in Processor Architecture
  – Instruction Level Parallelism
  – Multi-Cores
  – Accelerators
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions

Page 6:

Progress in Processor Architecture

• More transistors → New architecture innovations
  – Multiple Instruction Issue processors
    • VLIW
    • Superscalar
    • EPIC
  – More on-chip caches, multiple levels of cache hierarchy, speculative execution, …

Era of Instruction Level Parallelism

Page 7:

Moore’s Law: Processor Architecture Roadmap (Pre-2000)

[Figure: processor architecture roadmap, from the first microprocessors through RISC, VLIW, Superscalar, and EPIC ILP processors]

Page 8:

Multicores: The Right Turn

[Figure: the performance curve takes a "right turn": instead of pushing frequency (1 GHz, 3 GHz, 6 GHz single-core designs), performance now scales by core count at a fixed 3 GHz (2, 4, and 16 cores)]

Page 9:

Era of Multicores (Post 2000)

• Multiple cores on a single die
• Early efforts utilized multiple cores for multiple programs
• Throughput-oriented rather than speedup-oriented!

Page 10:

Moore’s Law: Processor Architecture Roadmap (Post-2000)

[Figure: the roadmap extended past 2000: first microprocessors, RISC, VLIW, Superscalar, EPIC, and now Multi-cores]

Page 11:

Progress in Processor Architecture

• More transistors → New architecture innovations
  – Multiple Instruction Issue processors
  – More on-chip caches
  – Multi-cores
  – Heterogeneous cores and accelerators
    • Graphics Processing Units (GPUs)
    • Cell BE, ClearSpeed
    • Larrabee or SCC
    • Reconfigurable accelerators, …

Era of Heterogeneous Accelerators

Page 12:

Moore’s Law: Processor Architecture Roadmap (Post-2000)

[Figure: the same roadmap with a further branch after Multi-cores: Accelerators]

Page 13:

Accelerators

Page 14:

Accelerators: Hype or Reality?

Some Top500 Systems (June 2011 List)

Rank | System          | Description                          | # Procs.        | R_max (TFLOPS)
-----|-----------------|--------------------------------------|-----------------|---------------
  2  | Tianhe          | Xeon + Nvidia Tesla C2050 GPUs       | 186,368         | 2,566
  4  | Nebulae-Dawning | Intel X5650, Nvidia Tesla C2050 GPU  | 55,680 + 64,960 | 1,271
  7  | Tsubame         | Xeon + Nvidia GPU                    | 73,278          | 1,192
 10  | Roadrunner      | Opteron + Cell BE                    | 6,480 + 12,960  | 1,105

Page 15:

Accelerators – Cell BE

Page 16:

Accelerator – Fermi S2050

Page 17:

How is the GPU connected?

[Figure: CPU cores C0–C3 with a shared last-level cache (SLC) and an integrated memory controller (IMC) attached to memory, illustrating where a discrete GPU must connect]

Page 18:

Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions

Page 19:

Handling the Multi-Core Challenge

• Shared and Distributed Memory Programming Languages
  – OpenMP
  – MPI
• Other Parallel Languages (Partitioned Global Address Space languages)
  – X10, UPC, Chapel, …
• Emergence of Programming Languages for GPUs
  – CUDA
  – OpenCL


Page 20:

GPU Programming: Good News

• Emergence of Programming Languages for GPUs
  – CUDA
  – OpenCL (Open Standard)
• Growing collection of code base
  – CUDA Zone
  – Packages supporting GPUs from ISVs
• Impressive performance – Yes!
• What about Programmer Productivity?


Page 21:

GPU Programming: Boon or Bane?

• Challenges in GPU programming
  – Managing parallelism across SMs and SPMD cores
  – Transfer of data between CPU and GPU
  – Managing CPU-GPU memory bandwidth efficiently
  – Efficient use of the different levels of memory (GPU memory, shared memory, constant and texture memory, …)
  – Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
  – Identifying an appropriate execution configuration for efficient execution
  – Synchronization across multiple SMs


Page 22:

The Challenge : Data Parallelism

[Figure: data-parallel GPU programming interfaces: Cg, CUDA, OpenCL, AMD CAL]

Page 23:

Challenge – Level 2

[Figure: a CPU with cores C0–C3, memory controller (MC), and shared last-level cache (SLC) attached to memory, alongside a GPU]

Synergistic Execution on CPU & GPU Cores

Page 24:

Challenge – Level 3

Synergistic Execution on Multiple Nodes

[Figure: multiple nodes, each with memory and a NIC, connected through a network switch]

Page 25:

Data, Thread, & Task Level Parallelism

[Figure: programming interfaces spanning data, thread, and task level parallelism: MPI, OpenMP, CUDA, OpenCL, SSE, and CAL/PTX, …]

Page 26:

Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
  – CUDA for Clusters
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions

Page 27:

Synergistic Execution on Multiple Heterogeneous Cores

Our Approach

[Figure: streaming languages, parallel languages (MPI, OpenMP, CUDA/OpenCL), and array languages (MATLAB) feed a compiler/runtime system that targets multi-cores (SSE), GPUs, Cell BE, and other accelerators]

Page 28:

CUDA on Other Platforms?

• CUDA programs are data parallel
• Clusters of multi-cores are typically used for task-parallel applications
• CUDA for multi-core SMPs (M-CUDA)
• CUDA on a cluster of multi-core nodes?

Page 29:

CUDA on Clusters: Motivation

Motivation:
• CUDA, a natural way to write data-parallel programs
• Large code base (CUDA Zone)
• Emergence of OpenCL
• How can we leverage these on (existing) clusters without GPUs?

Challenges:
• Mismatch in parallelism granularity
• Synchronization and context switches can be more expensive!

Page 30:

CUDA on Multi-Cores (M-CUDA)

• Fuses the threads in a thread block into a single thread
• OpenMP parallel for executes thread blocks in parallel
• Proper synchronization across kernels
• Loop transformation techniques for efficient code
• Shared memory and synchronization are inherent!

Page 31:

CUDA-For-Clusters

• CUDA kernels naturally express parallelism
  – Block-level and thread-level
• Multiple levels of parallelism in hardware too!
  – Across multiple nodes
  – And across multiple cores within a node
• A runtime system to perform this work partitioning at multiple granularities
• Little communication between blocks (in data-parallel programs)
• Need some mechanism for
  – "Global shared memory" across multiple nodes!
  – Synchronization across nodes
• A 'lightweight' software distributed shared memory layer to provide the necessary abstraction

Page 32:

CUDA-For-Clusters

• Consistency only at kernel boundaries
• Relaxed consistency model for global data
• Lazy update
• CUDA memory management functions are linked to the DSM functions
• Runtime system to ensure partitioning of tasks across cores
• OpenMP and MPI

[Figure: compilation and runtime flow: a CUDA program passes through source-to-source transforms into a C program with block-level code specification, which runs on an MPI + OpenMP work-distribution runtime over a software distributed shared memory layer; across the interconnect, nodes N1–N4 (each with processors P1, P2 and local RAM) execute thread blocks B0–B7, two blocks per node]

Page 33:

Experimental Setup

• Benchmarks from Parboil and Nvidia CUDA SDK suite

• 8-node Xeon cluster, each node having dual-socket quad-core processors (2 x 4 cores)

• Baseline M-CUDA on single node (8-cores)

Page 34:

Experimental Results

With Lazy Update

Page 35:

Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
  – Stream Programs on GPU
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions

Page 36:

Stream Programming Model

• Higher-level programming model where nodes represent computation and channels represent communication (producer/consumer relations) between them
• Exposes pipeline parallelism and task-level parallelism
• Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
• Compiling techniques for achieving rate-optimal, buffer-optimal, software-pipelined schedules
• Mapping applications to accelerators such as GPUs and the Cell BE

Page 37:

StreamIt Example Program

[Figure: stream graph of a 2-band equalizer: a Signal Source feeds a duplicating Splitter; each branch applies a Bandpass Filter + Amplifier; a Combiner merges the two bands]

Page 38:

Challenges on GPUs

• Work distribution between the multiprocessors
  – GPUs have hundreds of processors (SMs and SIMD units)!
• Exploiting task-level and data-level parallelism
  – Scheduling across the multiprocessors
  – Multiple concurrent threads in an SM to exploit DLP
• Determining the execution configuration (number of threads for each filter) that minimizes execution time
• Lack of synchronization mechanisms between the multiprocessors of the GPU
• Managing CPU-GPU memory bandwidth efficiently

Page 39:

Stream Graph Execution

[Figure: a stream graph with filters A, B, C, D and its software-pipelined execution on SMs 1–4; filter instances A1–A4, B1–B4, C1–C4, and D1–D4 overlap across time steps 0–7, exhibiting data parallelism (instances of one filter), task parallelism (C alongside D), and pipeline parallelism (stages overlapped in time)]

Page 40:

Our Approach

• Multithreading
  – Identify a good execution configuration to exploit the right amount of data parallelism
• Memory
  – Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
• Task partitioning between GPU and CPU cores
• Work scheduling and processor (SM) assignment problem
  – Takes into account communication bandwidth restrictions

Page 41:

Compiler Framework

[Figure: compiler flow: StreamIt program → generate code for profiling → execute profile runs → configuration selection → task partitioning (ILP partitioner or heuristic partitioner) → instance partitioning → modulo scheduling → code generation → CUDA code + C code]

Page 42:

Experimental Results on Tesla

[Chart: speedups from 0 to 20x for three configurations (Mostly GPU, Syn-Heur, Syn-ILP) on Bitonic, Bitonic-Rec, Channelvocoder, DCT, DES, FFT-C, FFT-F, Filterbank, FM, MatrixMult, MPEG-Subset, TDE-PP, and the Geometric Mean; a few bars are clipped, exceeding 52x, 32x, and 65x]

Page 43:

Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
  – MATLAB on CPU/GPU
• Conclusions

Page 44:

Compiling MATLAB to Heterogeneous Machines

• MATLAB is an array language extensively used for scientific computation

• Expresses data parallelism – Well suited for acceleration on GPUs

• Current solutions (Jacket, GPUmat) require user annotation to identify “GPU friendly” regions

• Our compiler, MEGHA (MATLAB Execution on GPU-based Heterogeneous Architectures), is fully automatic

Page 45:

Compiler Overview

• Frontend constructs an SSA intermediate representation (IR) from the input MATLAB code
• Type inference is performed on the SSA IR
  – Needed because MATLAB is dynamically typed
• Backend identifies "GPU-friendly" kernels, decides where to run them, and inserts required data transfers

[Figure: compiler phases. Frontend: code simplification, SSA construction, type inference. Backend: kernel identification, mapping and scheduling, data transfer insertion, code generation]

Page 46:

Backend : Kernel Identification

• Kernel identification identifies sets of IR statements (kernels) on which mapping and scheduling decisions are made
• Modeled as a graph clustering problem
• Takes into account several costs and benefits while forming kernels
  – Register utilization (cost)
  – Memory utilization (cost)
  – Intermediate array elimination (benefit)
  – Kernel invocation overhead reduction (benefit)

Page 47:

Backend: Scheduling and Transfer Insertion

• Assignment and scheduling are performed using a heuristic algorithm
  – Assigns each kernel to the processor that minimizes its completion time
  – Uses a variant of list scheduling
• Data transfers for dependencies within a basic block are inserted during scheduling
• Transfers for inter-basic-block dependencies are inserted using a combination of data-flow analysis and edge splitting

Page 48:

Results

bscholes clos fdtd nb1d nb3d0

2

4

6

8

10

12

14

GPUmatCPU onlyCPU+ GPU

Benchmarks

Spee

dup

Ove

r MA

TLA

B

113 64

Benchmarks were run on a machine with an Intel Xeon 2.83GHz quad-core processor and a Tesla S1070. Generated code was compiled using nvcc version 2.3. MATLAB version 7.9 was used.

Page 49:

Overview

• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions

Page 50:

Summary

• Multi-cores and heterogeneous accelerators present tremendous research opportunities in
  – Architecture
  – High Performance Computing
  – Compilers
  – Programming Languages & Models
• The challenge is to exploit Performance
• while keeping programmer Productivity high and
• ensuring Portability of code across platforms

PPP is important!

Page 51:

Thank You !!