A Comparison of GPU Execution Time Prediction using Machine Learning and Analytical Modeling


Ph.D.(c) CS Marcos Amarís González

Advisor: Dr. Alfredo Goldman vel Lejbman

Co-advisor: Dr. Raphael Yokoingawa de Camargo

December, 2016

(gold, amaris)@ime.usp.br (IME - USP) | BSP-based model vs. Machine Learning | December, 2016


Timeline

1 Introduction and Motivation

2 Parallel Programming Models
   BSP-based Analytical Model for GPUs

3 Machine Learning Techniques

4 Comparison
   Methodology
   Results
   Conclusions and Future Works






Games and Video Cards

1980s: first video driver

Evolution of 3D games: it became necessary to apply textures, lighting, shadows, reflections, etc.

More computing power was also needed

As a result, video cards became more flexible and powerful



Graphics Processing Units - GPUs

The term GPU was popularized by NVIDIA in 1999, when it marketed the GeForce 256 as the world's first GPU.

In 2002 the first general-purpose GPU was launched. The term GPGPU was coined by Mark Harris.

The main manufacturers of GPUs are NVIDIA and AMD. In 2006 NVIDIA introduced CUDA.

Deep Learning, Virtual Reality.



General Purpose GPU - GPGPU

The main program executes on the CPU (host) and is responsible for starting the execution on the GPU (device). GPUs have their own memory hierarchy, and data must be transferred over PCI Express.



GPU Versus CPU

Nowadays, GPUs can perform computing operations much more efficiently than multicore CPUs.



CUDA, GPUs and Memory spaces

A GPU has many processors P; all processors have the same clock rate R and are grouped into multiprocessors. A CUDA kernel can be composed of thousands or even millions of threads t.

Type        On Chip   Cacheable   Instructions   Visibility   Latency (g)
Registers   Yes       No          Load/Store     Thread       1 cycle
Shared-L1   Yes       No          Load/Store     Block        5 cycles
Constant    No        Yes         Load           Kernel       100 cycles
Texture     No        Yes         Load/Store     Kernel       100 cycles
Local       No        Yes         Load/Store     Thread       100 cycles
Cache L2    No        Yes         Load/Store     Kernel       250 cycles
Global      No        Yes         Load/Store     Kernel       500 cycles

Table: Memory types in GPUs supported by CUDA



Roadmap of NVIDIA GPU Architectures

In modern GPUs, energy consumption is an important constraint. GPU designs are generally highly scalable.



Roadmap of NVIDIA GPU Architectures

Compute Capability differentiates the architectures and models of NVIDIA GPUs.



Compute Unified Device Architecture

CUDA - Compute Unified Device Architecture

CUDA is an extension of the C language that allows controlling the execution of grids on a GPU and managing its memory.



GPU Programming Model

A GPU application is organized in grids, blocks and threads. Threads are grouped into blocks, and blocks are grouped into a grid. A linear translation gives the ID of a thread within a grid.
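The linear translation from block and thread coordinates to a single thread ID can be sketched as follows. This is a minimal Python illustration of the usual CUDA indexing arithmetic (blockIdx * blockDim + threadIdx); the function names and the row-major ordering chosen for the 2D case are assumptions made for this example, not taken from the slides.

```python
def global_thread_id_1d(block_idx, block_dim, thread_idx):
    """Global index of a thread in a 1D grid of 1D blocks."""
    return block_idx * block_dim + thread_idx


def global_thread_id_2d(block_idx, block_dim, thread_idx, grid_dim):
    """Row-major linear ID of a thread in a 2D grid of 2D blocks.

    Every argument is an (x, y) tuple, mirroring CUDA's
    blockIdx, blockDim, threadIdx and gridDim.
    """
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    width = grid_dim[0] * block_dim[0]  # total number of threads along x
    return row * width + col


if __name__ == "__main__":
    # Thread 5 of block 2, with 256 threads per block.
    print(global_thread_id_1d(2, 256, 5))
    # Thread (3, 2) of block (1, 0), 16x16 blocks in a 4x4 grid.
    print(global_thread_id_2d((1, 0), (16, 16), (3, 2), (4, 4)))
```

In a real kernel the same arithmetic is usually written inline, for example to map one thread to one array element.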



Top 500 Supercomputers

Intel Core i7 990X: 6 cores, price US$ 1000. Theoretical maximum performance: 0.4 TFLOPS

GTX 680: 1500 cores and 2 GB, price US$ 500. Theoretical maximum performance: 3.0 TFLOPS

Accelerators and co-processors in the TOP500 ranking of the world's most powerful supercomputers



Top 500 Green Supercomputers ≫ $$$$$$

Ranking of the world's most energy-efficient supercomputers.










Amdahl’s law and Flynn’s Taxonomy

Flynn’s Taxonomy - 1966

                Single Instruction     Multiple Instruction
Single Data     SISD - Sequential      MISD
Multiple Data   SIMD [SIMT] - GPU      MIMD - Multicore

Amdahl’s law - 1967

Amdahl’s law gives the theoretical speedup in the execution of a task at a fixed workload that can be expected of a system whose resources are improved.

Speedup:

S = Speedup
P = Number of processors
T = Time

$$ S_p = \frac{T_1}{T_p} \tag{1} $$
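A small sketch of the two quantities named on this slide. The first function is the speedup of Equation (1). The second is the usual closed form of Amdahl's law, S = 1 / ((1 - f) + f / p), where the parallel fraction f is a symbol introduced here for illustration and does not appear on the slide.

```python
def speedup(t1, tp):
    """Measured speedup S_p = T_1 / T_p (Equation 1)."""
    return t1 / tp


def amdahl_speedup(parallel_fraction, p):
    """Theoretical speedup bound from Amdahl's law.

    parallel_fraction: share of the sequential run time that can be
    parallelized (0..1). p: number of processors.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / p)


if __name__ == "__main__":
    print(speedup(100.0, 25.0))      # time on 1 processor vs. on p processors
    print(amdahl_speedup(0.9, 10))   # 90% parallel code on 10 processors
    print(amdahl_speedup(0.9, 10**6))  # approaches 1 / 0.1 = 10
```

The last call illustrates why the serial fraction dominates: with 10% serial code the speedup stays below 10 regardless of the number of processors.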



Parallel Random Access Machine (PRAM)

Figure: PRAM Model

It ignores lower-level architectural constraints and details, such as memory access contention and overhead, synchronization overhead, interconnection network throughput, connectivity, speed limits and link bandwidths, etc.



Bulk Synchronous Parallel Model

Figure: Super-step in the BSP model

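The cost to execute the i-th super-step is then given by:

$$ w_i + g h_i + L \tag{2} $$

The total execution time of the application is given by:

$$ T = W + gH + LS \tag{3} $$

where W = \sum_i w_i is the total computation, H = \sum_i h_i is the total communication, and S is the number of super-steps.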
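Equations (2) and (3) can be checked against each other with a short sketch. It assumes the standard BSP reading in which W and H are the sums of the per-super-step values w_i and h_i and S is the number of super-steps; the function names and the sample numbers are hypothetical.

```python
def superstep_cost(w, h, g, L):
    """Cost of one super-step: w_i + g * h_i + L (Equation 2)."""
    return w + g * h + L


def bsp_total_time(supersteps, g, L):
    """Total time T = W + g*H + L*S (Equation 3).

    supersteps: list of (w_i, h_i) pairs, one per super-step.
    """
    W = sum(w for w, _ in supersteps)
    H = sum(h for _, h in supersteps)
    S = len(supersteps)
    return W + g * H + L * S


if __name__ == "__main__":
    steps = [(100, 10), (200, 20), (50, 5)]  # made-up (w_i, h_i) values
    g, L = 4, 30
    total = bsp_total_time(steps, g, L)
    # Summing Equation (2) over all super-steps gives the same value.
    assert total == sum(superstep_cost(w, h, g, L) for w, h in steps)
    print(total)
```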



Bulk Synchronous Parallel Model

Bulk Synchronous Parallel (BSP) was introduced by Valiant in 1990 (Turing Award, 2010).

High-level model for parallelism

Models the computation and communication of a kernel function

We included neither the synchronization step nor communication with host memory

Optimization aspects are modeled by adjusting a single parameter λ

Leslie Valiant



Analytical Model Published

Divergence, communication optimizations, and differences between architectures are adjusted by a single parameter, λ.¹

$$ T_k = \frac{t \cdot (\mathit{Comp} + \mathit{Comm}_{SM} + \mathit{Comm}_{GM})}{R \cdot P \cdot \lambda} \tag{4} $$

$$ \mathit{Comm}_{GM} = (ld_1 + st_1 - L1 - L2) \cdot g_{GM} + L1 \cdot g_{L1} + L2 \cdot g_{L2} \tag{5} $$

$$ \mathit{Comm}_{SM} = (ld_0 + st_0) \cdot g_{SM} \tag{6} $$

Comp, ld_0, st_0, ld_1 and st_1 are obtained from the source code. L1 and L2 cache hits are captured by profiling.

¹ M. Amaris, D. Cordeiro, A. Goldman, and R. Y. Camargo, “A simple BSP-based model to predict execution time in GPU applications,” in 22nd Int’l Conference on HPC, December 2015.
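Equations (4) to (6) translate directly into a few lines of code. The sketch below is only an illustration: the default latencies (500, 250 and 5 cycles) are an assumption taken from the memory-type table earlier in the deck, and the function and parameter names are hypothetical. With R given in cycles per second, the result is in seconds.

```python
def comm_gm(ld1, st1, l1_hits, l2_hits, g_gm=500, g_l1=5, g_l2=250):
    """Global-memory communication cost in cycles (Equation 5)."""
    misses = ld1 + st1 - l1_hits - l2_hits
    return misses * g_gm + l1_hits * g_l1 + l2_hits * g_l2


def comm_sm(ld0, st0, g_sm=5):
    """Shared-memory communication cost in cycles (Equation 6)."""
    return (ld0 + st0) * g_sm


def predict_kernel_time(t, comp, ld0, st0, ld1, st1,
                        l1_hits, l2_hits, R, P, lam):
    """Predicted kernel time T_k (Equation 4).

    t: number of threads, R: clock rate, P: number of cores,
    lam: the adjustment parameter lambda.
    """
    cycles_per_thread = (comp
                         + comm_sm(ld0, st0)
                         + comm_gm(ld1, st1, l1_hits, l2_hits))
    return t * cycles_per_thread / (R * P * lam)


if __name__ == "__main__":
    # Made-up inputs, chosen only to exercise the formula.
    print(predict_kernel_time(t=1000, comp=10, ld0=1, st0=1, ld1=2, st1=1,
                              l1_hits=0, l2_hits=1, R=1000, P=10, lam=2))
```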










Machine Learning Techniques

The theoretical subject of “learning” is related to prediction.

Supervised Learning

Unsupervised Learning

3 different Machine Learning Techniques

Simple Linear Regression (LR)

Support Vector Machines (SVM)

Random Forest (RF)

In this work, we wanted to use simple models to show that they achieve reasonable predictions. Fair comparison: (Data Input - Profile Information).



Linear Regression (LR)

It assumes that there is approximately a linear relationship between each predictor X_j and Y. Mathematically, we can write the multiple linear regression model as

$$ Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon \tag{7} $$

where X_j represents the j-th predictor and β_j quantifies the association between that variable and the response.

Figure: Example of a Linear Regression
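As an illustration of Equation (7), the coefficients β can be estimated by ordinary least squares. The sketch below solves the normal equations with plain Python so that it stays self-contained. It is not the tooling used in this work, and the function names and toy data are assumptions made for the example.

```python
def fit_linear_regression(X, y):
    """Least-squares estimate of [b0, b1, ..., bp] for y ~ b0 + sum(bj * xj)."""
    rows = [[1.0] + [float(v) for v in x] for x in X]  # prepend intercept
    n = len(rows[0])
    # Augmented normal equations (A^T A | A^T y).
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(M[k][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for k in range(col + 1, n):
            factor = M[k][col] / M[col][col]
            for c in range(col, n + 1):
                M[k][c] -= factor * M[col][c]
    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (M[i][n] - known) / M[i][i]
    return beta


def predict(beta, x):
    """Evaluate the fitted model for one sample x = (x1, ..., xp)."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


if __name__ == "__main__":
    # Toy data generated from y = 1 + 2*x1 + 3*x2 (no noise).
    X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
    y = [9, 8, 19, 18, 26]
    beta = fit_linear_regression(X, y)
    print(beta, predict(beta, (6, 7)))
```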




Support Vector Machines (SVM)

SVM belongs to the general category of kernel methods, which are algorithms that depend on the data only through dot products. The dot product can be replaced by a kernel function, which computes a dot product in some possibly high-dimensional feature space Z into which the input vector x is mapped.

Figure: Example of linear and non-linear kernels for SVM in classification
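The statement that a kernel function computes a dot product in a feature space can be verified on a small case. The sketch below uses the homogeneous polynomial kernel of degree 2 on 2-D inputs, chosen here only because its feature map is easy to write out; the slides do not say which kernel was used in this work.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def poly2_kernel(a, b):
    """Degree-2 polynomial kernel: k(a, b) = (a . b)^2."""
    return dot(a, b) ** 2


def poly2_feature_map(x):
    """Explicit map of a 2-D vector into the 3-D feature space Z."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


if __name__ == "__main__":
    a, b = (1.0, 2.0), (3.0, 4.0)
    # Both values agree: the kernel never builds the feature vectors.
    print(poly2_kernel(a, b))
    print(dot(poly2_feature_map(a), poly2_feature_map(b)))
```

The kernel evaluates the feature-space dot product without ever constructing the mapped vectors, which is what makes high-dimensional feature spaces affordable.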





Figure: Example of linear and non-linear kernels for SVM in regression



Random Forest (RF)

Random Forests belong to the decision tree methods and are capable of performing both regression and classification tasks.

Figure: Diagram of a decision tree










GPUs of the Testbed

Model          C.C.   Memory   Bus       Bandwidth    L2        Cores/SM   Clock
GTX-680        3.0    2 GB     256-bit   192.2 GB/s   0.5 MB    1536/8     1058 MHz
Tesla-K40      3.5    12 GB    384-bit   276.5 GB/s   1.5 MB    2880/15    745 MHz
Tesla-K20      3.5    4 GB     320-bit   200 GB/s     1 MB      2496/13    706 MHz
Titan Black    3.5    6 GB     384-bit   336 GB/s     1.5 MB    2880/15    980 MHz
Titan          3.5    6 GB     384-bit   288.4 GB/s   1.5 MB    2688/14    876 MHz
Quadro K5200   3.5    8 GB     256-bit   192.2 GB/s   1 MB      2304/12    771 MHz
Titan X        5.2    12 GB    384-bit   336.5 GB/s   3 MB      3072/24    1076 MHz
GTX-980        5.2    4 GB     256-bit   224.3 GB/s   2 MB      2048/16    1216 MHz
GTX-970        5.2    4 GB     256-bit   224.3 GB/s   1.75 MB   1664/13    1279 MHz

Table: Hardware specifications of the GPUs in the testbed



Algorithm Testbed

9 different applications

Matrix Multiplication in 4 different optimizations:

* Global Memory - MMGU
* Global Memory with coalesced accesses - MMGC
* Global and Shared Memory - MMSU
* Global and Shared Memory with coalesced accesses - MMSC

Matrix Addition in 2 different optimizations:

* Global Memory - MAU
* Global Memory with coalesced accesses - MAC

Dot Product - dotP

Vector Addition - vAdd

Maximum Subarray Problem - MSA



Dataset

Each sample was executed 10 times, with a confidence interval of 95%.

First Scenario - Machine Learning Vs Machine Learning

MMSC with block sizes 4^2, 8^2, 12^2, 16^2, 20^2, 24^2, 28^2, and 32^2. 256 samples per GPU. More than 2000 samples.

Second Scenario - Analytical Model Vs Machine Learning

Analytical Model

1D App. with input sizes from 2^18 to 2^27. 10 per GPU. 90 samples.

2D App. with input sizes from 2^8 to 2^13. 6 per GPU. 54 samples.

Machine Learning - Block sizes 8^2, 16^2 and 32^2.

1D App. with input sizes from 2^18 to 2^27. 207 per GPU. 1863 samples.

2D App. with input sizes from 2^8 to 2^13. 96 per GPU. 864 samples.

MSA with block size 128. 96 samples per GPU. 864 samples.
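The 95% confidence interval over the 10 repetitions of each sample can be sketched as below. The slides do not state how the interval was computed; this example assumes a Student t interval, using the two-sided 95% critical value for 9 degrees of freedom (2.262), and the timing values are made up.

```python
import math
import statistics

# Two-sided 95% Student t critical value for 9 degrees of freedom,
# i.e. valid only for exactly 10 measurements.
T_CRITICAL_95_DF9 = 2.262


def confidence_interval_95(times):
    """Return (low, high) of the 95% CI of the mean of 10 run times."""
    if len(times) != 10:
        raise ValueError("the critical value used here assumes 10 samples")
    mean = statistics.mean(times)
    std_err = statistics.stdev(times) / math.sqrt(len(times))
    half_width = T_CRITICAL_95_DF9 * std_err
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    runs = [9.0, 11.0] * 5  # hypothetical kernel times
    print(confidence_interval_95(runs))
```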



Features of the Machine Learning Techniques

Thirteen features were used to feed the machine learning techniques.

Feature              Description
num of cores         Number of cores per GPU
max clock rate       GPU maximum clock rate
Bandwidth            Theoretical bandwidth
Input Size           Size of the problem
totalLoadGM          Load transactions in global memory
totalStoreGM         Store transactions in global memory
TotalLoadSM          Load transactions in shared memory
TotalStoreSM         Store transactions in shared memory
FLOPS SP             Floating-point operations in single precision
BlockSize            Number of threads per block
GridSize             Number of blocks in the kernel
No. threads          Number of threads in the application
Achieved Occupancy   Ratio of the average active warps per active cycle to the
                     maximum number of warps supported on a multiprocessor



Use Cases of the Analytical Model

Par.   Matrix Multiplication        Matrix Addition   vAdd   dotP   MSA
       MMGU  MMGC  MMSU  MMSC       MAU  MAC

comp N · FMA 1 · 24 1 · 96 (N/t) · 100

ld1 2 ·N 2 2 N/t

st1 1 1 1 5

ld0 0 2 ·N 0 0 N/t

st0 0 1 0 1 + log(t) 5

Figure: Lambda values of each application. x-axis: applications (MMGU, MMGC, MMSU, MMSC, MAU, MAC, dotP, vAdd, MSA); y-axis: lambda values, from 0 to 130.



Log transformation

We first transformed the data to a log2 scale and, after performing the learning and predictions, returned to the original scale using a 2^pred transformation², reducing the effects of non-linearity.

Figure: Quantile-Quantile Analysis of the generated models

² B. J. Barnes et al., “A regression-based approach to scalability prediction,” in Proceedings of the 22nd Annual Int’l Conference on Supercomputing, ser. ICS ’08. New York, NY, USA.
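The log2 transformation and the 2^pred back-transformation described above can be sketched with a one-predictor least-squares fit. The single feature (input size), the function names, and the synthetic timings are assumptions made for this example; the actual work used the 13 profile features.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b


def fit_log2_model(sizes, times):
    """Fit in log2 space: log2(time) = a + b * log2(size)."""
    return fit_line([math.log2(s) for s in sizes],
                    [math.log2(t) for t in times])


def predict_time(model, size):
    """Predict in log2 space, then return to the original scale (2^pred)."""
    a, b = model
    return 2.0 ** (a + b * math.log2(size))


if __name__ == "__main__":
    sizes = [256, 512, 1024, 2048]
    times = [3.0 * s ** 2 for s in sizes]  # synthetic quadratic growth
    model = fit_log2_model(sizes, times)
    print(model, predict_time(model, 4096))
```

A power-law relation such as time = c * size^k becomes a straight line in log2 space, which is why a linear model fits it well after the transformation.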



Results Machine Learning - 1st Scenario

Figure: Accuracy (Tk/Tm) of the machine learning algorithms for matMul-SM-Coalesced (MMSC) with many samples. Panels: Linear Regression, Support Vector Machines, and Random Forest. x-axis: GPUs (Tesla K40, Tesla K20, Quadro, Titan, TitanBlack, TitanX, GTX 680, GTX 980, GTX 970); y-axis: accuracy Tk/Tm, from 0.0 to 2.5.
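The accuracy axis of these plots is labelled Tk/Tm. Assuming that Tk is the predicted time and Tm the measured time, so that a value of 1.0 means a perfect prediction, the metric can be sketched as follows; the function names and sample values are hypothetical.

```python
def accuracy(predicted, measured):
    """Accuracy ratio Tk / Tm: 1.0 is perfect, above 1.0 overestimates."""
    return predicted / measured


def accuracy_summary(pairs):
    """Return (min, mean, max) of the ratio over (predicted, measured) pairs."""
    ratios = [accuracy(p, m) for p, m in pairs]
    return min(ratios), sum(ratios) / len(ratios), max(ratios)


if __name__ == "__main__":
    # Made-up (predicted, measured) times for three samples.
    samples = [(0.9, 1.0), (2.0, 2.0), (3.3, 3.0)]
    print(accuracy_summary(samples))
```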



Results Machine Learning VS Analytical Model

Figure: Accuracy (Tk/Tm) of the compared techniques. Panel columns: Analytical, LM, RF, and SVM; panel rows: MMGU, MMGC, MMSU, MMSC, MAU, MAC, dotP, vAdd, and MSA; y-axis: accuracy Tk/Tm, from 0.0 to 2.5; legend (GPUs): Tesla-K40, Tesla-K20, Quadro, Titan, TitanBlack, TitanX, GTX-680, GTX-980, GTX-970.










Conclusions

Fair comparison.

Analytical model requires calculations

Machine learning provides more flexibility and generalization

Linear regression can make reasonable predictions

However, ML requires a large number of labeled samples



Future Work

Irregular benchmarks (Rodinia, SHOC).

Multiple kernels or GPUs and global synchronization

One extra memory level, the CPU RAM.

Feature extraction.



Thanks for your attention

Repository of the work: https://github.com/marcosamaris/svm-gpuperf

