Introduction and Motivation Parallel Programming Models Machine Learning Techniques Comparison
A Comparison of GPU Execution Time Prediction using Machine Learning and Analytical Modeling
Ph.D(c) CS Marcos Amarís González
Advisor: Dr. Alfredo Goldman vel Lejbman
Co-advisor: Dr. Raphael Yokoingawa de Camargo
December, 2016
(gold, amaris)@ime.usp.br (IME - USP) BSP-based model Vs. Machine Learning December, 2016 1 / 32
Timeline
1 Introduction and Motivation
2 Parallel Programming Models
   BSP-based Analytical Model for GPUs
3 Machine Learning Techniques
4 Comparison
   Methodology
   Results
   Conclusions and Future Works
Games and Video Cards
1980s - First video driver
Evolution of 3D games: it became necessary to apply textures, lights, shadows, reflections, etc.
This also required more computing power.
As a result, video cards became more flexible and powerful.
Graphics Processing Units - GPUs
The term GPU was popularized by NVIDIA in 1999, when it marketed the GeForce 256 as the first GPU in the world.
In 2002 the first general-purpose GPU was launched. The term GPGPU was coined by Mark Harris.
The main manufacturers of GPUs are NVIDIA and AMD. NVIDIA announced CUDA in 2006.
Deep Learning, Virtual Reality.
General Purpose GPU - GPGPU
The main program executes on the CPU (host) and is responsible for starting the execution on the GPU (device).
These GPUs have their own memory hierarchy, and data must be transferred through the PCI Express bus.
GPU Versus CPU
Nowadays GPUs can perform computing operations much more efficiently than multicore CPUs.
CUDA, GPUs and Memory spaces
A GPU has many processors P; all processors have the same clock rate R and they are divided into multiprocessors.
A CUDA kernel can be composed of thousands or even millions of threads t.

Type        On Chip  Cacheable  Instructions  Visibility  Latency (g)
Registers   Yes      No         Load/Store    Thread      1 cycle
Shared-L1   Yes      No         Load/Store    Block       5 cycles
Constant    No       Yes        Load          Kernel      100 cycles
Texture     No       Yes        Load/Store    Kernel      100 cycles
Local       No       Yes        Load/Store    Thread      100 cycles
Cache L2    No       Yes        Load/Store    Kernel      250 cycles
Global      No       Yes        Load/Store    Kernel      500 cycles

Table: Memory types in GPUs supported by CUDA
Roadmap of NVIDIA GPU architectures
In modern GPUs, energy consumption is an important constraint.
GPU designs are generally highly scalable.
Compute Capability differentiates the architectures and models of NVIDIA GPUs.
CUDA - Compute Unified Device Architecture
CUDA is an extension of the C language; it allows controlling the execution of grids on a GPU and managing its memory.
GPU Programming Model
A GPU application is organized in grids, blocks and threads. Threads are grouped into blocks, and blocks are grouped into a grid.
A linear translation gives the ID of a thread within a grid.
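The linear translation can be sketched in plain Python for the one-dimensional case. This is an illustrative host-side sketch, not CUDA code: the function names are hypothetical, and the arguments merely mirror CUDA's built-in variables blockIdx.x, blockDim.x and threadIdx.x.

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """Linear (1D) index of a thread inside the grid.

    Mirrors the usual CUDA expression
    blockIdx.x * blockDim.x + threadIdx.x (names here are illustrative).
    """
    if not 0 <= thread_idx < block_dim:
        raise ValueError("thread_idx must lie in [0, block_dim)")
    return block_idx * block_dim + thread_idx


def grid_thread_ids(grid_dim, block_dim):
    """Enumerate the ids of all threads of a 1D grid, block by block."""
    return [
        global_thread_id(b, block_dim, t)
        for b in range(grid_dim)
        for t in range(block_dim)
    ]


if __name__ == "__main__":
    # Thread 5 of block 2, with 256 threads per block.
    print(global_thread_id(2, 256, 5))
```

Each thread obtains a distinct id, so it can be used directly as an index into the input and output arrays of a 1D application.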
Top 500 Supercomputers
Intel Core i7 990X: 6 cores, US$ 1000. Theoretical maximum performance: 0.4 TFLOPS
GTX680: 1500 cores and 2 GB, price US$ 500. Theoretical maximum performance: 3.0 TFLOPS
Accelerators and co-processors in the Top 500 ranking of the most powerful supercomputers in the world
Top 500 Green Supercomputers ≫ $$$$$$
Ranking of the most energy-efficient supercomputers in the world.
Amdahl’s law and Flynn’s Taxonomy
Flynn’s Taxonomy - 1966
                 Single Instruction    Multiple Instruction
Single Data      SISD - Sequential     MISD
Multiple Data    SIMD [SIMT] - GPU     MIMD - Multicore
Amdahl’s law - 1967
Amdahl’s law gives the theoretical speedup in the execution of a task at a fixed workload that can be expected of a system whose resources are improved.
Speedup, where S = speed-up, P = number of processors, T = time:
$S_p = \frac{T_1}{T_p}$ (1)
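A small Python sketch of equation (1), together with the usual closed form of Amdahl's law. The parallel fraction f and the function names are assumptions introduced here for illustration; the slide itself only gives the speedup ratio.

```python
def speedup(t1, tp):
    """Measured speedup S_p = T_1 / T_p, as in equation (1)."""
    if tp <= 0:
        raise ValueError("tp must be positive")
    return t1 / tp


def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup under Amdahl's law: 1 / ((1 - f) + f / p).

    parallel_fraction (f) is the share of the work that can be
    parallelized; it is an assumption of this sketch, not a value
    taken from the slides.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)


if __name__ == "__main__":
    print(speedup(120.0, 30.0))
    print(amdahl_speedup(0.9, 10))
```

Even with 90% of the work parallelized, the serial 10% keeps the bound well below the number of processors.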
Parallel Random Access Machine (PRAM)
Figure: PRAM Model
It ignores lower-level architectural constraints and details, such as memory access contention and overhead, synchronization overhead, interconnection network throughput, connectivity, speed limits and link bandwidths, etc.
Bulk Synchronous Parallel Model
Figure: Super-step in the BSP model
The cost to execute the i-th super-step is then given by:
$w_i + g h_i + L$ (2)
The total execution time of the application is given by:
$T = W + gH + LS$ (3)
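Equations (2) and (3) can be checked numerically with a short Python sketch. It assumes the standard BSP reading, which the slide does not spell out: W and H are the sums of the per-super-step values w_i and h_i, and S is the number of super-steps. Function names and the example numbers are hypothetical.

```python
def superstep_cost(w, h, g, L):
    """Cost of one super-step: w_i + g * h_i + L (equation 2)."""
    return w + g * h + L


def bsp_total_cost(supersteps, g, L):
    """Total cost T = W + g * H + L * S (equation 3).

    supersteps is a list of (w_i, h_i) pairs. Assumption of this sketch:
    W = sum of w_i, H = sum of h_i, S = number of super-steps.
    """
    W = sum(w for w, _ in supersteps)
    H = sum(h for _, h in supersteps)
    S = len(supersteps)
    return W + g * H + L * S


if __name__ == "__main__":
    steps = [(100, 10), (50, 20)]  # hypothetical (w_i, h_i) values
    per_step = sum(superstep_cost(w, h, g=2, L=5) for w, h in steps)
    print(per_step, bsp_total_cost(steps, g=2, L=5))
```

Summing equation (2) over all super-steps gives the same value as equation (3), which is what the two printed numbers show.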
Bulk Synchronous Parallel Model
Bulk Synchronous Parallel (BSP), introduced by Valiant in 1990 (Turing Award 2010).
High-level model for parallelism
Computation and communication of a kernel function
We did not include the synchronization step, nor communication with host memory
Optimization aspects are modeled by adjusting a single parameter λ
Leslie Valiant
Analytical Model Published
Divergence, optimizations in the communication and differences between architectures are adjusted by one parameter, λ [1]
$T_k = \frac{t \cdot (Comp + Comm_{SM} + Comm_{GM})}{R \cdot P \cdot \lambda}$ (4)
$Comm_{GM} = (ld_1 + st_1 - L1 - L2) \cdot g_{GM} + L1 \cdot g_{L1} + L2 \cdot g_{L2}$ (5)
$Comm_{SM} = (ld_0 + st_0) \cdot g_{SM}$ (6)
Comp, ld_0, st_0, ld_1 and st_1 are obtained from the source code.
L1 and L2 cache hits are captured by profiling.
[1] M. Amaris, D. Cordeiro, A. Goldman, and R. Y. Camargo, “A simple BSP-based model to predict execution time in GPU applications,” in 22nd Int’l Conference on HPC, December 2015
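Equations (4) to (6) translate almost directly into code. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, the example latencies are taken from the memory table earlier in the slides, and R is assumed to be in Hz so that the result is in seconds (the slides do not state the units).

```python
def comm_gm(ld1, st1, l1_hits, l2_hits, g_gm, g_l1, g_l2):
    """Global-memory communication cost per thread (equation 5)."""
    return (ld1 + st1 - l1_hits - l2_hits) * g_gm + l1_hits * g_l1 + l2_hits * g_l2


def comm_sm(ld0, st0, g_sm):
    """Shared-memory communication cost per thread (equation 6)."""
    return (ld0 + st0) * g_sm


def predicted_kernel_time(t, comp, comm_sm_cost, comm_gm_cost, R, P, lam):
    """Predicted kernel time T_k (equation 4).

    t: number of threads, R: clock rate, P: number of processors,
    lam: the adjustment parameter lambda. Units are an assumption of
    this sketch: with R in Hz and costs in cycles, T_k is in seconds.
    """
    return t * (comp + comm_sm_cost + comm_gm_cost) / (R * P * lam)


if __name__ == "__main__":
    # Hypothetical kernel: 2 global loads, 1 global store, no cache hits,
    # no shared-memory traffic, 24 cycles of computation per thread.
    gm = comm_gm(ld1=2, st1=1, l1_hits=0, l2_hits=0, g_gm=500, g_l1=5, g_l2=250)
    sm = comm_sm(ld0=0, st0=0, g_sm=5)
    tk = predicted_kernel_time(t=2**20, comp=24, comm_sm_cost=sm,
                               comm_gm_cost=gm, R=1058e6, P=1536, lam=1.0)
    print(gm, sm, tk)
```

In practice λ would be fitted per application against measured times; the value 1.0 above is a placeholder.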
Machine Learning Techniques
The theoretical subject of “learning” is related to prediction.
Supervised Learning
Unsupervised Learning
3 different Machine Learning Techniques
Simple Linear Regression (LR)
Support Vector Machines (SVM)
Random Forest (RF)
In this work, we wanted to use simple models to show that they achieve reasonable predictions.
Fair comparison: (Data Input - Profile Information).
Linear Regression (LR)
It assumes that there is an approximately linear relationship between each X_p and Y. Mathematically, we can write the multiple linear regression model as
$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon$ (7)
where X_p represents the pth predictor and β_p quantifies the association between that variable and the response.
Figure: Example of a Linear Regression
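As a minimal illustration of equation (7), the sketch below fits the single-predictor case (p = 1) by ordinary least squares using closed-form formulas. It is a simplification: the work described here uses several predictors, and the function names are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates (beta0, beta1) for y ~ beta0 + beta1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def predict(beta0, beta1, x):
    """Predicted response for a new predictor value x."""
    return beta0 + beta1 * x


if __name__ == "__main__":
    b0, b1 = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
    print(b0, b1, predict(b0, b1, 5))
```

With more than one predictor the same idea applies, but the coefficients come from solving a linear system rather than from these two formulas.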
Support Vector Machines (SVM)
SVM belongs to the general category of kernel methods, which are algorithms that depend on the data only through dot products. The dot product can be replaced by a kernel function, which computes a dot product in some possibly high-dimensional feature space Z. It maps the input vector x into the feature space Z.
Figure: Example of linear and nonlinear kernels for SVM in classification
Figure: Example of linear and nonlinear kernels for SVM in regression
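The statement that a kernel function computes a dot product in a feature space Z can be verified on a small example. The Python sketch below uses the homogeneous quadratic kernel on 2D inputs, whose explicit feature map is known; the choice of kernel and all names are assumptions for illustration, not the kernels used in the experiments.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def quadratic_kernel(x, z):
    """k(x, z) = (x . z)^2, computed in the input space."""
    return dot(x, z) ** 2


def quadratic_feature_map(x):
    """Explicit map of a 2D vector into the 3D feature space Z.

    phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2), so that
    phi(x) . phi(z) equals (x . z)^2.
    """
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel, a common nonlinear choice."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, 4.0)
    print(quadratic_kernel(x, z))
    print(dot(quadratic_feature_map(x), quadratic_feature_map(z)))
```

Both printed values agree up to floating-point rounding, although the kernel never builds the feature vectors explicitly.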
Random Forest (RF)
Random Forests belong to the family of decision tree methods and are capable of performing both regression and classification tasks.
Figure: Diagram of a decision tree
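To make the idea concrete, here is a deliberately reduced Python sketch: depth-1 regression trees (stumps) on a single feature, each trained on a bootstrap sample, with predictions averaged. This is plain bagging, a simplification of a random forest (no per-split feature subsampling, no deep trees); all names and data are hypothetical.

```python
import random


def fit_stump(points):
    """Fit a depth-1 regression tree on (x, y) pairs.

    Returns (threshold, left_mean, right_mean); threshold is None when
    the sample contains a single distinct x value.
    """
    xs = sorted(set(x for x, _ in points))
    overall = sum(y for _, y in points) / len(points)
    best = (float("inf"), None, overall, overall)
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in points if x <= thr]
        right = [y for x, y in points if x > thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        # Sum of squared errors of the two leaves.
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    thr, lm, rm = stump
    if thr is None:
        return lm
    return lm if x <= thr else rm


def fit_forest(points, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    return [
        fit_stump([rng.choice(points) for _ in points])
        for _ in range(n_trees)
    ]


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


if __name__ == "__main__":
    data = [(1, 10.0), (2, 10.0), (3, 10.0), (10, 50.0), (11, 50.0), (12, 50.0)]
    forest = fit_forest(data)
    print(predict_forest(forest, 1), predict_forest(forest, 12))
```

Averaging many trees trained on resampled data is what reduces the variance of a single decision tree.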
GPUs of the Testbed
Model          C.C.  Memory  Bus      Bandwidth   L2       Cores/SM  Clock
GTX-680        3.0   2 GB    256-bit  192.2 GB/s  0.5 MB   1536/8    1058 MHz
Tesla-K40      3.5   12 GB   384-bit  276.5 GB/s  1.5 MB   2880/15   745 MHz
Tesla-K20      3.5   4 GB    320-bit  200 GB/s    1 MB     2496/13   706 MHz
Titan Black    3.5   6 GB    384-bit  336 GB/s    1.5 MB   2880/15   980 MHz
Titan          3.5   6 GB    384-bit  288.4 GB/s  1.5 MB   2688/14   876 MHz
Quadro K5200   3.5   8 GB    256-bit  192.2 GB/s  1 MB     2304/12   771 MHz
Titan X        5.2   12 GB   384-bit  336.5 GB/s  3 MB     3072/24   1076 MHz
GTX-980        5.2   4 GB    256-bit  224.3 GB/s  2 MB     2048/16   1216 MHz
GTX-970        5.2   4 GB    256-bit  224.3 GB/s  1.75 MB  1664/13   1279 MHz

Table: Hardware specifications of the GPUs in the testbed
Algorithm Testbed
9 different applications
Matrix Multiplication in 4 different optimizations:
* Global Memory - MMGU
* Global Memory with coalesced accesses - MMGC
* Global and Shared Memory - MMSU
* Global and Shared Memory with coalesced accesses - MMSC
Matrix Addition in 2 different optimizations:
* Global Memory - MAU
* Global Memory with coalesced accesses - MAC
Dot Product - dotP
Vector Addition - vAdd
Maximum Subarray Problem - MSA
Dataset
Each sample was executed 10 times, with a confidence interval of 95%.
First Scenario - Machine Learning vs. Machine Learning
MMSC with block sizes 4^2, 8^2, 12^2, 16^2, 20^2, 24^2, 28^2, and 32^2. 256 samples per GPU. More than 2000 samples.
Second Scenario - Analytical Model vs. Machine Learning
Analytical Model
1D applications with input sizes from 2^18 to 2^27. 10 per GPU. 90 samples.
2D applications with input sizes from 2^8 to 2^13. 6 per GPU. 54 samples.
Machine Learning - block sizes 8^2, 16^2 and 32^2.
1D applications with input sizes from 2^18 to 2^27. 207 per GPU. 1863 samples.
2D applications with input sizes from 2^8 to 2^13. 96 per GPU. 864 samples.
MSA with block size 128. 96 samples per GPU. 864 samples.
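The repeated measurements can be summarized as a mean with a 95% confidence interval. The Python sketch below assumes a Student t interval over the 10 runs of one sample; the slides do not say which interval was used, and the timings shown are made-up illustrative values.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 9 degrees of freedom
# (10 runs). Assumption of this sketch: a t interval was used.
T_CRIT_95_DF9 = 2.262


def mean_confidence_interval(samples, t_crit=T_CRIT_95_DF9):
    """Return (mean, lower, upper) for a list of repeated timings.

    The default t_crit is only valid for exactly 10 samples; pass the
    appropriate critical value for other sample counts.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Hypothetical kernel times in milliseconds for one configuration.
    runs = [1.00, 1.20, 0.90, 1.10, 1.00, 1.05, 0.95, 1.00, 1.10, 0.90]
    print(mean_confidence_interval(runs))
```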
Features of the Machine Learning Techniques
13 features were used to feed the Machine Learning techniques.

Feature              Description
num of cores         Number of cores per GPU
max clock rate       GPU max clock rate
Bandwidth            Theoretical bandwidth
Input Size           Size of the problem
totalLoadGM          Load transactions in Global Memory
totalStoreGM         Store transactions in Global Memory
TotalLoadSM          Load transactions in Shared Memory
TotalStoreSM         Store transactions in Shared Memory
FLOPS SP             Floating-point operations in single precision
BlockSize            Number of threads per block
GridSize             Number of blocks in the kernel
No. threads          Number of threads in the application
Achieved Occupancy   Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
Use Cases of the Analytical Model
Par.  | Matrix Multiplication (MMGU, MMGC, MMSU, MMSC) | Matrix Addition (MAU, MAC) | vAdd | dotP | MSA
comp  | N · FMA | 1 · 24 | 1 · 96 | (N/t) · 100
ld1   | 2 · N | 2 | 2 | N/t
st1   | 1 | 1 | 1 | 5
ld0   | 0 | 2 · N | 0 | 0 | N/t
st0   | 0 | 1 | 0 | 1 + log(t) | 5
Figure: Lambda values of each one of the applications (x-axis: applications MMGU, MMGC, MMSU, MMSC, MAU, MAC, dotP, vAdd, MSA; y-axis: lambda values, from 0 to 130)
Log transformation
We first transformed the data to a log2 scale and, after performing the learning and predictions, we returned to the original scale using a 2^pred transformation [2], reducing the non-linearity effects.
Figure: Quantile-Quantile Analysis of the generated models
[2] B. J. Barnes, et al., “A regression-based approach to scalability prediction,” in Proceedings of the 22nd Annual Int’l Conference on Supercomputing, ser. ICS ’08. New York, NY, USA.
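The log2 transformation and the return to the original scale can be sketched in a few lines of Python. The sketch assumes a single predictor (input size) and a least-squares line in log space; the helper names and the synthetic timings are hypothetical and only serve to show the round trip.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for y ~ b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def fit_log2_model(sizes, times):
    """Fit the line on log2-transformed inputs and responses."""
    log_sizes = [math.log2(s) for s in sizes]
    log_times = [math.log2(t) for t in times]
    return fit_line(log_sizes, log_times)


def predict_time(model, size):
    """Predict in log2 space, then return to the original scale (2^pred)."""
    b0, b1 = model
    pred_log = b0 + b1 * math.log2(size)
    return 2.0 ** pred_log


if __name__ == "__main__":
    # Synthetic data following a power law: time = 3e-9 * size^2.
    sizes = [256, 512, 1024, 2048]
    times = [3e-9 * s ** 2 for s in sizes]
    model = fit_log2_model(sizes, times)
    print(model, predict_time(model, 4096))
```

A power-law relationship between input size and time becomes a straight line after the transformation, which is why a linear model can fit it.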
Results Machine Learning - 1st Scenario
Figure: Accuracy (Tk/Tm) of the Machine Learning algorithms for matMul-SM-Coalesced (MMSC) with many samples. Panels: Linear Regression, Support Vector Machines, Random Forest. x-axis: GPUs (Tesla K40, Tesla K20, Quadro, Titan, TitanBlack, TitanX, GTX 680, GTX 980, GTX 970); y-axis: accuracy Tk/Tm, from 0.0 to 2.5.
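The accuracy plotted on the y-axis can be computed as a simple ratio. The Python sketch below assumes that Tk is the predicted time and Tm the measured time, so a value of 1.0 means a perfect prediction; the function names and the example pairs are hypothetical.

```python
def accuracy_ratio(predicted, measured):
    """Accuracy Tk / Tm: above 1.0 over-predicts, below 1.0 under-predicts.

    Assumption of this sketch: Tk is the predicted time and Tm the
    measured time.
    """
    if measured <= 0:
        raise ValueError("measured time must be positive")
    return predicted / measured


def summarize_accuracy(pairs):
    """Return (min, mean, max) of Tk / Tm over (predicted, measured) pairs."""
    ratios = [accuracy_ratio(p, m) for p, m in pairs]
    return min(ratios), sum(ratios) / len(ratios), max(ratios)


if __name__ == "__main__":
    # Hypothetical (predicted, measured) kernel times in milliseconds.
    results = [(1.2, 1.0), (0.5, 1.0), (2.0, 2.0)]
    print(summarize_accuracy(results))
```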
Results Machine Learning VS Analytical Model
Figure: Accuracy of the compared techniques. Panels: Analytical, LM, RF and SVM for each application (MMGU, MMGC, MMSU, MMSC, MAU, MAC, dotP, vAdd, MSA). x-axis: GPUs (Tesla-K40, Tesla-K20, Quadro, Titan, TitanBlack, TitanX, GTX-680, GTX-980, GTX-970); y-axis: accuracy Tk/Tm, from 0.0 to 2.5.
Conclusions
Fair comparison.
Analytical model requires calculations
Machine learning provides more flexibility and generalization
Linear Regression can make reasonable predictions
However, ML requires a lot of labeled samples
Future Works
Irregular benchmarks (Rodinia, SHOC).
Multiple kernels or GPUs and global synchronization
One extra memory level, the CPU RAM.
Feature extraction.
Thanks for your attention
Repository of the work: https://github.com/marcosamaris/svm-gpuperf