Pruning Strategies in Adaptive Off-line Auto-tuningLu Li , Usman Dastgeer, Christoph Kessler...

Pruning Strategies

in Adaptive O�-line Auto-tuning

Lu Li, Usman Dastgeer, Christoph Kessler

Linköping University

1 / 40

Agenda

1 Motivation

2 Approach

3 Results

4 Conclusion

2 / 40

Problem Background

Application needs tuned for optimal performance

Performance tuning is challenging

Heterogeneous processorsCon�guration diversity

Auto-tuning needed

Performance Portability

Approach: Implementation selection for a multi-variant

component

3 / 40

Component Interface and Implementations

Implementation variantPlatforms (CPU, acceleratorcores)

Algorithms

Tunable parameter settings

Compiler transformations

Meta-dataDependencies

Resource requirements

Deployment descriptors

Performance prediction models

4 / 40

Staged Composition

5 / 40

Guiding Composition: Empirical Models

Analytical model

Empirical model

No or little understanding of the target architectureO�-line Empirical Models: Feed sampling data (trainingexamples) into a prediction model, e.g. SVM, C4.5

6 / 40

O�-line Empirical Models: Pros and Cons

+ Avoid �cold start�

+ Controllable tuning process

- Tuning overhead

�Cold start� E�ect

Example from Dastgeer et al. (ParCo'2011) on SkePU/StarPU integration,

Coulombic potential grid execution, with 3 successive executions

7 / 40

Zoom into the Cons

A closer look at o�-line overhead

Exhaustive execution is not feasiblePrevious work: Stargazer (random sampling)

Training examples (Sampling data) is vital for performance

prediction models

8 / 40

Attack the Cons: Observations

In the context of performance tuning: a concrete example

(Matrix-matrix multiplication)

Smart sampling by heuristics

9 / 40

Adaptive Sampling

Sample only vertices of a space

Recursive decomposition (cutting) for open spaces, controlled

by maximum depth, etc

10 / 40

Adaptive Sampling

10 / 40

Adaptive Sampling

10 / 40

Adaptive Sampling

10 / 40

Adaptive Sampling

10 / 40

Adaptive Sampling

10 / 40

Adaptive Sampling

10 / 40

Adaptive Sampling

10 / 40

Adaptive Sampling

10 / 40

Adaptive Sampling

10 / 40

Formalization: Runtime context (e.g. Input Size)

Context Property Value Space (PVS) is a n-dimensional�nite space

A run-time context instance consists of n context propertyvalues maps to a point in the PVSPVS has 2n vertices (corner points)

Closed and open subspaces

Closed subspace: a subspace where a variant wins on allvertices. It heuristically approximate �uninteresting� subspacesOpen subspace: otherwise

Recursive space decomposition

11 / 40

Convexity Assumption from the Observation

�In real life, the world is smoothly changing, instances close by

most of the time have the same labels, and we need not worry

about all possible labelings.�

Ethem Alpaydin in �Introduction to Machine Learning, second edition�

Our Convexity Assumption

If all vertices of a subspace share the same winner, then thewinner wins in all points of the subspace statistically

12 / 40

Adaptive Sampling In the Big Picture

O�-line: adaptive sampling, tree construction

On-line (Run-time)

Load the treePrediction

Closed space: look up winner on one vertexOpen space: Euclidean-based predictor

13 / 40

Adaptive Sampling In the Big Picture

O�-line: adaptive sampling, tree construction

On-line (Run-time)Load the treePrediction

Closed space: look up winner on one vertexOpen space: Euclidean-based predictor

13 / 40

Techniques for Adaptive Sampling

Light oversampling

Sample one extra point in the middle

Detect holes

Small increase in overhead

May increase prediction accuracy

14 / 40

Further Techniques Based on Adaptive Sampling

Thresholding

Relative threshold: abs(vi − vmin)/vmin <= θStop splitting early

Reduce training overhead

May decrease prediction accuracy

15 / 40

Thresholding

Relative threshold: abs(vi − vmin)/vmin <= θStop splitting early

Reduce training overhead

May decrease prediction accuracy

15 / 40

Implementation Pruning

Only winners of the vertices will involve in the future sampling

of the subspace

Reduce overhead remarkably if many implementation variants

are available for an interface

May lead to loss of optimization potential

16 / 40

of the subspace

Try 10 variantsfor each vertex

16 / 40

of the subspace

16 / 40

of the subspace

16 / 40

of the subspace

Try 2 winner variantsfor each untested vertex

16 / 40

Platform

Name CPU GPU OS CompilerCora 16 Intel(R) Xeon(R)

CPU X5550 @2.67GHz

3 GPUs: two nVidiaTesla C2050 andone Tesla C1060

RHEL5.6

gcc 4.1.2and nvccV0.2.1221

Fermi 8 Intel(R) Xeon(R)CPU E5520 @2.27GHz

two Tesla M2050GPUs

3.2.1-2-ARCH

gcc 4.6.2and nvccV0.2.1221

17 / 40

Benchmarks

Benchmark Feature modeling Range Implementation variantsMatrix-matrixmultipli-cation(MM)

row size, columnsize of �rst ma-trix, column sizeof second matrix

(30, 30, 30) to(300, 300, 300)

Sequential implementation anda variant by loop rearrange-ment, CUDA impl., BLAS impl.,Pthread impl. and �ve of its vari-ants from loop rearrangement

Sorting(ST)

array size; discre-tization of arrayvalues distribu-tion (samplednumber of inver-sions)

(1,0) to(10k,10)

bubble sort, insertion sort,merge sort, quick sort

Path �nder(PF)

row; column (1,1) to(10k,20k)

OpenMP implementation,CUDA implementation

Backpropagation(BP)

array size (1k) to (100k) OpenMP implementation,CUDA implementation

18 / 40

Prediction Accuracy (%) on Cora: Base-line adaptive o�-line sampling

19 / 40

Prediction Accuracy (%) on Fermi: Base-line adaptive o�-line sampling

20 / 40

Other Metrics

Absolute runtime overhead: 4 - 23 µs

Relative runtime overhead: 0.2%

Average sampling rate: 0.053%

21 / 40

Test against Convexity Assumption

Accuracy(%) on closed space. Pecentage(%) on closed space. 22 / 40

Accuracy(%) on closed space. Pecentage(%) on closed space. 23 / 40

Accuracy(%) on closed space. Pecentage(%) on closed space.

24 / 40

Thresholding E�ect: Training Time

BP: Depth 4 MM: Depth 4

PF: Depth 1 ST: Depth 4 25 / 40

Thresholding E�ect: Prediction Accuracy

BP: Depth 4 MM: Depth 4

PF: Depth 1 ST: Depth 426 / 40

BP: Depth 4

MM: Depth 4

PF: Depth 1 ST: Depth 4 27 / 40

BP: Depth 4

28 / 40

Oversampling E�ect: Prediction Accuracy

Adaptive Sampling. Adaptive Sampling with light oversampling.

29 / 40

30 / 40

31 / 40

Oversampling E�ect: Training Time

32 / 40

Implementation Pruning E�ect: Training Time

Adaptive Sampling. Adaptive Sampling with impl pruning.

33 / 40

Implementation Pruning E�ect: Prediction Accuracy

Adaptive Sampling. Adaptive Sampling with impl pruning.

34 / 40

Combo E�ect: Training Time

PF STAdaptive Sampling. Adaptive Sampling with the combo of oversampling

and impl pruning.

35 / 40

Combo E�ect: Prediction Accuracy

PF STAdaptive Sampling. Adaptive Sampling with combo of oversampling

and impl pruning.36 / 40

Adaptive Sampling vs Random Sampling

PF STRandom Sampling Adaptive Sampling Adaptive Sampling with the combo

of oversampling and impl pruning.

37 / 40

PF STRandom Sampling Adaptive Sampling Adaptive Sampling with the combo

of oversampling and impl pruning.38 / 40

MMRandom Sampling Adaptive Sampling Adaptive Sampling with the combo

of oversampling and impl pruning.

39 / 40

Conclusion

Convexity Assumption holds for all benchmarks used, more to

Three techniques help to decrease training time or increaseprediction accuracy

ThresholdingLight OversamplingImplementation pruning

The right combination can combine the advantages

Adaptive sampling shows bene�ts against random sampling

More questions please contact me: lu.li@liu.se, +46704759692

40 / 40

Pruning Strategies in Adaptive Off-line Auto-tuningLu Li , Usman Dastgeer, Christoph Kessler...

Documents

Transcript of Pruning Strategies in Adaptive Off-line Auto-tuningLu Li , Usman Dastgeer, Christoph Kessler...

Linköping, Sweden - Accueil - Inria

Linköping ethnography lecture

ModelicaML - Linköping University

Linköping University Post Print Interaction of (CH2OH) …160524/FULLTEXT01.pdf · Department of Physics, Chemistry and Biology, IFM, Linköping University, S-581 83 Linköping,

Multimodal research in Linköping

Core-based SoCs Testing Julien Pouget Embedded Systems Laboratory (ESLAB) Linköping University Julien Pouget Embedded Systems Laboratory (ESLAB) Linköping.

NFSUN - 2011 Linköping Universitet

Welcome to Linköping University

HPC @ HP LCSC Linköping

Mr Hamid Jalil Ms Seema Ghani Ms Salwa Dastgeer Mr Ramesh ...

Linköping University

Linköping University Post Print

Fluid and Mecahtronic Systems Linköping universitycisb.org.br/wiefp2014/presentations/Session 1_Petter_Krus... · 2014-09-09 · Fluid and Mecahtronic Systems Linköping university

European Wenture Group Linköping Sweden

Kvalitet, introduktion (förmiddag) Linköping 18 maj

City of Linköping

Unit 10: Integration and Verification - Linköping University and... · Unit 10: Integration and Verification - Linköping University ... rtl ...

FACULTY OF EDUCATIONAL SCIENCES Monolingual Policy,1512697/... · 2020. 12. 28. · Monolingual Policy, Bilingual Interaction Linköping Studies in Pedagogic Practices No. 40 Linköping

Linköping University One university – two cities

Service Design Research at Linköping University