A Comparison of GPU Execution Time Prediction using Machine Learning and Analytical Modeling


Ph.D. candidate (CS): Marcos Amarís González

Advisor: Dr. Alfredo Goldman vel Lejbman

Co-advisor: Dr. Raphael Yokoingawa de Camargo

December, 2016

(gold, amaris)@ime.usp.br (IME - USP)


Timeline

1 Introduction and Motivation

2 Parallel Programming Models
   BSP-based Analytical Model for GPUs

3 Machine Learning Techniques

4 Comparison
   Methodology
   Results
   Conclusions and Future Work


Games and Video Cards

1980s: first video driver

3D games evolved, making it necessary to apply textures, lighting, shadows, reflections, etc.

This also required more computing power.

As a result, video cards became more flexible and powerful.


Graphics Processing Units - GPUs

The term GPU was popularized by NVIDIA in 1999, when it marketed the GeForce 256 as the first GPU in the world.

In 2002 the first General Purpose GPU was launched. The term GPGPU was coined by Mark Harris.

The main manufacturers of GPUs are NVIDIA and AMD. NVIDIA introduced CUDA in 2006.

Deep Learning, Virtual Reality.


General Purpose GPU - GPGPU

The main program executes on the CPU (host) and is responsible for starting the execution on the GPU (device).

GPUs have their own memory hierarchy, and data must be transferred over PCI Express.


GPU Versus CPU

Nowadays GPUs can perform many computing operations far more efficiently than multicore CPUs.


CUDA, GPUs and Memory spaces

A GPU has many processors P; all processors have the same clock rate R, and they are grouped into multiprocessors.

A CUDA kernel can be composed of thousands or even millions of threads t.

Type       On Chip  Cacheable  Instructions  Visibility  g Latency
Registers  Yes      No         Load/Store    Thread      1 cycle
Shared-L1  Yes      No         Load/Store    Block       5 cycles
Constant   No       Yes        Load          Kernels     100 cycles
Texture    No       Yes        Load/Store    Kernel      100 cycles
Local      No       Yes        Load/Store    Thread      100 cycles
Cache L2   No       Yes        Load/Store    Kernel      250 cycles
Global     No       Yes        Load/Store    Kernel      500 cycles

Table: Memory types in GPUs supported by CUDA


Roadmap of NVIDIA GPU Architectures

In modern GPUs, energy consumption is an important constraint.

GPU designs are generally highly scalable.


Roadmap of NVIDIA GPU Architectures

Compute Capability differentiates the architectures and models of NVIDIA GPUs.


CUDA - Compute Unified Device Architecture

CUDA is an extension of the C language. It lets the programmer control the execution of grids on a GPU and manage its memory.


GPU Programming Model

A GPU application is organized in grids, blocks, and threads. Threads are grouped into blocks, and blocks are grouped into a grid.

A linear translation gives the ID of a thread within the grid.
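The linear translation can be sketched in plain Python, mirroring the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`. The function names and the row-major layout of the 2D case are assumptions made for illustration; they are not taken from the slides.

```python
def global_thread_id_1d(block_idx, block_dim, thread_idx):
    """Linear ID of a thread in a 1D grid of 1D blocks."""
    return block_idx * block_dim + thread_idx


def global_thread_id_2d(block_idx, block_dim, thread_idx, grid_dim_x):
    """Row-major linear ID of a thread in a 2D grid of 2D blocks.

    block_idx, block_dim and thread_idx are (x, y) tuples;
    grid_dim_x is the number of blocks along x.
    """
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    width = grid_dim_x * block_dim[0]  # threads in one row of the grid
    return row * width + col


# Thread 5 of block 2, with 256 threads per block.
print(global_thread_id_1d(block_idx=2, block_dim=256, thread_idx=5))  # 517
```

In a real kernel each thread evaluates this expression itself to decide which element of the input it processes.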


Top 500 Supercomputers

Intel Core i7 990X: 6 cores, US$ 1000. Theoretical maximum performance: 0.4 TFLOPS

GTX680: 1500 cores and 2 GB, price US$ 500. Theoretical maximum performance: 3.0 TFLOPS

Accelerators and co-processors in the Top 500 ranking of the most powerful supercomputers in the world


Top 500 Green Supercomputers ≫ $$$$$$

Ranking of the most energy-efficient supercomputers in the world.


Amdahl’s law and Flynn’s Taxonomy

Flynn’s Taxonomy - 1966

               Single Instruction    Multiple Instruction
Single Data    SISD - Sequential     MISD
Multiple Data  SIMD [SIMT] - GPU     MIMD - Multicore

Amdahl’s law - 1967

Amdahl’s law gives the theoretical speedup in the execution of a task at a fixed workload that can be expected of a system whose resources are improved.

Speedup, where S = speedup, P = number of processors, and T = time:

$$S_p = \frac{T_1}{T_p} \qquad (1)$$
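As a small worked example, the sketch below evaluates Equation (1) for made-up timings and also the standard closed form of Amdahl's law, S(p) = 1 / ((1 - f) + f / p), where f is the parallelizable fraction. The closed form and all numbers are illustrative additions, not values from the slides.

```python
def speedup(t1, tp):
    """Measured speedup S_p = T_1 / T_p (Equation 1)."""
    return t1 / tp


def amdahl_speedup(parallel_fraction, p):
    """Upper bound on the speedup with p processors when only
    `parallel_fraction` of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / p)


# Hypothetical timings: 120 s on one processor, 15 s on eight.
print(speedup(t1=120.0, tp=15.0))         # 8.0
# With 95% parallel work, eight processors give less than 8x.
print(round(amdahl_speedup(0.95, 8), 2))  # 5.93
```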


Parallel Random Access Machine (PRAM)

Figure: PRAM Model

The PRAM model ignores lower-level architectural constraints and details, such as memory access contention and overhead, synchronization overhead, interconnection network throughput, connectivity, speed limits, and link bandwidths.


Bulk Synchronous Parallel Model

Figure: Super-step in the BSP model

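The cost to execute the i-th super-step is then given by:

$$w_i + g h_i + L \qquad (2)$$

where $w_i$ is the maximum local computation time, $h_i$ is the maximum number of messages sent or received by a processor, $g$ is the cost of communicating one message, and $L$ is the cost of the barrier synchronization.

The total execution time of the application is given by:

$$T = W + gH + LS \qquad (3)$$

where $W = \sum_i w_i$, $H = \sum_i h_i$, and $S$ is the number of super-steps.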
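Equations (2) and (3) can be checked with a short sketch. The super-step list and the values of g and L below are hypothetical and serve only to show that summing the per-step costs gives the same result as the closed form.

```python
def superstep_cost(w, h, g, L):
    """Cost of one super-step: w_i + g * h_i + L (Equation 2)."""
    return w + g * h + L


def bsp_total_time(supersteps, g, L):
    """Total time T = W + g * H + L * S (Equation 3).

    `supersteps` is a list of (w_i, h_i) pairs.
    """
    W = sum(w for w, _ in supersteps)
    H = sum(h for _, h in supersteps)
    S = len(supersteps)
    return W + g * H + L * S


# Hypothetical program with three super-steps, given as (w_i, h_i).
steps = [(100, 10), (250, 4), (80, 0)]
total = bsp_total_time(steps, g=5, L=20)
print(total)  # 560
```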


Bulk Synchronous Parallel Model

Bulk Synchronous Parallel (BSP) was introduced by Valiant in 1990; he received the Turing Award in 2010.

High-level model for parallelism

It models the computation and communication of a kernel function

We included neither the synchronization step nor the communication with host memory

Optimization aspects are modeled by adjusting a single parameter λ

Leslie Valiant


Analytical Model Published

Divergence, optimizations in the communication, and differences between architectures are adjusted by one parameter, λ [1].

$$T_k = \frac{t \cdot (Comp + Comm_{SM} + Comm_{GM})}{R \cdot P \cdot \lambda} \qquad (4)$$

$$Comm_{GM} = (ld_1 + st_1 - L1 - L2) \cdot g_{GM} + L1 \cdot g_{L1} + L2 \cdot g_{L2} \qquad (5)$$

$$Comm_{SM} = (ld_0 + st_0) \cdot g_{SM} \qquad (6)$$

Comp, ld_0, st_0, ld_1, and st_1 are obtained from the source code. L1 and L2 cache hits are captured by profiling.

[1] M. Amaris, D. Cordeiro, A. Goldman, and R. Y. Camargo, “A simple BSP-based model to predict execution time in GPU applications,” in 22nd Int’l Conference on HPC, December 2015.
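A minimal sketch of Equations (4) to (6) follows. The default g values are an assumption: they reuse the latencies listed in the memory table earlier in the deck (5 cycles for shared memory and L1, 250 for L2, 500 for global memory). The example kernel parameters, clock rate, core count, and λ are hypothetical.

```python
def comm_gm(ld1, st1, l1_hits, l2_hits, g_gm, g_l1, g_l2):
    """Global-memory communication cost (Equation 5)."""
    misses = ld1 + st1 - l1_hits - l2_hits
    return misses * g_gm + l1_hits * g_l1 + l2_hits * g_l2


def comm_sm(ld0, st0, g_sm):
    """Shared-memory communication cost (Equation 6)."""
    return (ld0 + st0) * g_sm


def kernel_time(t, comp, ld0, st0, ld1, st1, l1_hits, l2_hits,
                R, P, lam, g_gm=500, g_l1=5, g_l2=250, g_sm=5):
    """Predicted kernel time T_k (Equation 4).

    t: number of threads, R: clock rate in Hz, P: number of cores,
    lam: the adjustment parameter lambda. The default g_* values are
    assumed latencies in cycles, not values confirmed by the slides.
    """
    cycles = (comp
              + comm_sm(ld0, st0, g_sm)
              + comm_gm(ld1, st1, l1_hits, l2_hits, g_gm, g_l1, g_l2))
    return t * cycles / (R * P * lam)


# Hypothetical element-wise kernel: 2 loads and 1 store per thread.
seconds = kernel_time(t=2**20, comp=24, ld0=0, st0=0, ld1=2, st1=1,
                      l1_hits=0, l2_hits=0, R=1.0e9, P=1024, lam=1.0)
print(seconds)
```

Because λ sits in the denominator, a larger λ lowers the predicted time, which is how a single parameter can absorb optimization effects that the counts do not capture.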


Machine Learning Techniques

The theoretical subject of “learning” is related to prediction.

Supervised Learning

Unsupervised Learning

3 different Machine Learning Techniques

Simple Linear Regression (LR)

Support Vector Machines (SVM)

Random Forest (RF)

In this work, we wanted to use simple models to show that they achieve reasonable predictions.

Fair comparison: (input data - profile information).


Linear Regression (LR)

It assumes that there is approximately a linear relationship between each $X_p$ and $Y$. Mathematically, we can write the multiple linear regression model as

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon \qquad (7)$$

where $X_p$ represents the p-th predictor and $\beta_p$ quantifies the association between that variable and the response.
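To make Equation (7) concrete, the sketch below estimates the coefficients by ordinary least squares through the normal equations, using only the standard library. It is an illustrative implementation with synthetic data, not the code used in the work; the helper names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(X, y):
    """Return [beta_0, beta_1, ..., beta_p] minimizing the squared error."""
    rows = [[1.0] + list(x) for x in X]  # leading 1 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, x):
    """Evaluate beta_0 + beta_1 * x_1 + ... + beta_p * x_p."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


# Synthetic data generated from Y = 1 + 2 * X1 + 3 * X2 (no noise).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
beta = fit_linear(X, y)
print([round(b, 6) for b in beta])
```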

Figure: Example of a Linear Regression


Support Vector Machines (SVM)

SVM belongs to the general category of kernel methods, which are algorithms that depend on the data only through dot products. The dot product can be replaced by a kernel function that computes a dot product in some possibly high-dimensional feature space Z, into which the input vector x is mapped.

Figure: Example of linear and nonlinear kernels for SVM in classification
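The kernel idea can be illustrated with the degree-2 polynomial kernel K(x, z) = (x · z)^2, chosen here only as an example (the slides do not say which kernels were used). For 2D inputs it equals an ordinary dot product after the explicit mapping φ(x) = (x1^2, √2 x1 x2, x2^2), so the feature space never has to be built.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed directly in the input space."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map for 2D inputs matching poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, 0.5)
# Both values agree up to floating-point rounding.
print(poly2_kernel(x, z))
print(dot(phi(x), phi(z)))
```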


Figure: Example of linear and nonlinear kernels for SVM in regression


Random Forest (RF)

Random Forests belong to the family of decision tree methods and can perform both regression and classification tasks.

Figure: Diagram of a decision tree


GPUs of the Testbed

Model         C.C.  Memory  Bus      Bandwidth   L2       Cores/SM  Clock
GTX-680       3.0   2 GB    256-bit  192.2 GB/s  0.5 MB   1536/8    1058 MHz
Tesla-K40     3.5   12 GB   384-bit  276.5 GB/s  1.5 MB   2880/15   745 MHz
Tesla-K20     3.5   4 GB    320-bit  200 GB/s    1 MB     2496/13   706 MHz
Titan Black   3.5   6 GB    384-bit  336 GB/s    1.5 MB   2880/15   980 MHz
Titan         3.5   6 GB    384-bit  288.4 GB/s  1.5 MB   2688/14   876 MHz
Quadro K5200  3.5   8 GB    256-bit  192.2 GB/s  1 MB     2304/12   771 MHz
Titan X       5.2   12 GB   384-bit  336.5 GB/s  3 MB     3072/24   1076 MHz
GTX-980       5.2   4 GB    256-bit  224.3 GB/s  2 MB     2048/16   1216 MHz
GTX-970       5.2   4 GB    256-bit  224.3 GB/s  1.75 MB  1664/13   1279 MHz

Table: Hardware specifications of the GPUs in the testbed


Algorithm Testbed

9 different applications

Matrix Multiplication in 4 different optimizations:

* Global Memory - MMGU
* Global Memory with coalesced accesses - MMGC
* Global and Shared Memory - MMSU
* Global and Shared Memory with coalesced accesses - MMSC

Matrix Addition in 2 different optimizations:

* Global Memory - MAU
* Global Memory with coalesced accesses - MAC

Dot Product - dotP

Vector Addition - vAdd

Maximum Subarray Problem - MSA


Dataset

Each sample was executed 10 times, with a confidence interval of 95%.

First Scenario - Machine Learning vs. Machine Learning

MMSC with block sizes 4^2, 8^2, 12^2, 16^2, 20^2, 24^2, 28^2, and 32^2. 256 samples per GPU. More than 2000 samples.

Second Scenario - Analytical Model vs. Machine Learning

Analytical Model

1D applications with input sizes from 2^18 to 2^27. 10 per GPU. 90 samples.

2D applications with input sizes from 2^8 to 2^13. 6 per GPU. 54 samples.

Machine Learning - block sizes 8^2, 16^2, and 32^2.

1D applications with input sizes from 2^18 to 2^27. 207 per GPU. 1863 samples.

2D applications with input sizes from 2^8 to 2^13. 96 per GPU. 864 samples.

MSA with block size 128. 96 samples per GPU. 864 samples.


Features of the Machine Learning Techniques

13 features were used to feed the machine learning techniques.

Feature             Description
num of cores        Number of cores per GPU
max clock rate      GPU maximum clock rate
Bandwidth           Theoretical bandwidth
Input Size          Size of the problem
totalLoadGM         Load transactions in global memory
totalStoreGM        Store transactions in global memory
TotalLoadSM         Load transactions in shared memory
TotalStoreSM        Store transactions in shared memory
FLOPS SP            Floating-point operations in single precision
BlockSize           Number of threads per block
GridSize            Number of blocks in the kernel
No. threads         Number of threads in the application
Achieved Occupancy  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor


Use Cases of the Analytical Model

Par.   Matrix Multiplication (MMGU, MMGC, MMSU, MMSC)   Matrix Addition (MAU, MAC)   vAdd   dotP   MSA

comp   N · FMA    1 · 24    1 · 96    (N/t) · 100

ld1    2 · N    2    2    N/t

st1    1    1    1    5

ld0    0    2 · N    0    0    N/t

st0    0    1    0    1 + log(t)    5

Figure: Lambda values of each one of the applications (x-axis: Applications MMGU, MMGC, MMSU, MMSC, MAU, MAC, dotP, vAdd, MSA; y-axis: Lambda Values, 0 to 130)


Log transformation

We first transformed the data to a log2 scale and, after performing the learning and predictions, we returned to the original scale using a 2^pred transformation [2], reducing the non-linearity effects.
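The effect of the transformation can be seen on synthetic data. In the sketch below, hypothetical execution times grow with the cube of the input size, which is strongly nonlinear in the original scale but becomes a straight line in the log2 scale; a line is fitted there and the prediction is mapped back with 2^pred. The data, the constant 3e-9, and the helper `fit_line` are illustrative assumptions.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical timings in seconds that grow as n^3.
sizes = [256, 512, 1024, 2048]
times = [3e-9 * n ** 3 for n in sizes]

# Learn in the log2 scale.
log_sizes = [math.log2(n) for n in sizes]
log_times = [math.log2(t) for t in times]
slope, intercept = fit_line(log_sizes, log_times)

# Predict for an unseen size, then return to the original scale.
pred_log = intercept + slope * math.log2(4096)
pred = 2 ** pred_log
print(round(slope, 3), pred)
```

The fitted slope recovers the exponent of the growth law, which is why a linear model becomes adequate after the transformation.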

Figure: Quantile-Quantile Analysis of the generated models

[2] B. J. Barnes, et al. “A regression-based approach to scalability prediction,” in Proceedings of the 22nd Annual Int’l Conference on Supercomputing, ser. ICS ’08. New York, NY, USA.


Results Machine Learning - 1st Scenario

Figure: Accuracy of the machine learning algorithms for matMul-SM-Coalesced (MMSC) with many samples. Panels: Linear Regression of MMSC, Support Vector Machines of MMSC, Random Forest of MMSC. x-axis: GPUs (Tesla K40, Tesla K20, Quadro, Titan, TitanBlack, TitanX, GTX 680, GTX 980, GTX 970); y-axis: Accuracy T_k / T_m, from 0.0 to 2.5.
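The accuracy axis in these plots is labeled T_k / T_m. Assuming that T_k is the predicted kernel time and T_m the measured time, the metric can be sketched as below; a value of 1.0 is a perfect prediction, values below 1.0 underestimate, and values above 1.0 overestimate. All timings are hypothetical.

```python
def accuracy(predicted, measured):
    """Ratio T_k / T_m of predicted to measured execution time."""
    return predicted / measured


# Hypothetical timings in seconds for three kernel runs.
measured = [1.20, 0.45, 3.10]
predicted = [1.08, 0.50, 3.10]

ratios = [accuracy(p, m) for p, m in zip(predicted, measured)]
print([round(r, 2) for r in ratios])  # [0.9, 1.11, 1.0]
```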


Results Machine Learning VS Analytical Model

Figure: Accuracy of the compared techniques. Panels: Analytical, LM, RF, SVM. Rows: MMGU, MMGC, MMSU, MMSC, MAU, MAC, dotP, vAdd, MSA. x-axis: GPUs (Tesla-K40, Tesla-K20, Quadro, Titan, TitanBlack, TitanX, GTX-680, GTX-980, GTX-970); y-axis: Accuracy T_k / T_m, from 0.0 to 2.5.


Conclusions

Fair comparison.

The analytical model requires calculations

Machine learning provides more flexibility and generalization

Linear Regression can make reasonable predictions

However, ML requires many labeled samples


Future Work

Irregular benchmarks (Rodinia, SHOC).

Multiple kernels or GPUs and global synchronization

One extra memory level, the CPU RAM.

Feature extraction.


Thanks for your attention

Repository of the work: https://github.com/marcosamaris/svm-gpuperf
