
ExaNIML: An Exascale Library for Numerically Inspired Machine Learning
Severin Reiz§, Hasan Ashraf§, Tobias Neckel§, George Biros†, Hans-Joachim Bungartz§
§ Technical University of Munich, † University of Texas at Austin
[email protected], [email protected], [email protected], [email protected], [email protected]

Introduction

Motivation
• Significant gap between the Machine Learning (ML) and high-performance computing (HPC) communities
• ML needs considerable computing power → we need adequate software!
• ExaNIML: a library of algorithms
  – with modern applications from the ML community
  – with enough concurrency for next-generation distributed computing systems

Approach
• Methods from the scientific computing domain
  – "Undervalued" near-linear-complexity methods (fast multipole methods) [1]
  – Adaptive sparse grids to mitigate the curse of dimensionality [2]
• HPC: exploit the potential of supercomputers
  – Concurrency: choose algorithms suitable for parallel computing
  – Extract computational bottlenecks as low-level drivers in C++ or Kokkos
  – Performance portability in light of upcoming new GPU and CPU architectures

Classification with Kernel Methods

Kernel Matrix
Occurs in many domains:
• multi-class classification
• model-order reduction
• uncertainty quantification
• partial differential equations

Example: Binary classification (ridge regression)
• N data points x_i ∈ R^d and N binary labels y_i
• Classifier: f(x) = sign( Σ_{i=1}^{N} k(x, x_i) w_i ); on test data, u = f(x_test) = K * w
• Solving the linear system: the kernel matrix K is often ill-conditioned (nearly singular), so solve with the regularized matrix K̃ = K + λI instead
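The whole pipeline fits in a few lines of dense linear algebra. The sketch below is a minimal NumPy illustration of this classifier, not ExaNIML code; the Gaussian kernel, bandwidth h, and regularization value λ are illustrative assumptions.

```python
# Minimal sketch of the binary kernel ridge classifier described above.
# Kernel choice, bandwidth, and lambda are assumptions for the demo.
import numpy as np

def gaussian_kernel(X, Y, h=1.0):
    """Dense kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 h^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * h * h))

def fit(X, y, lam=1e-3, h=1.0):
    """Solve the regularized system (K + lam * I) w = y."""
    K = gaussian_kernel(X, X, h)                      # dense N-by-N matrix
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, w, X_test, h=1.0):
    """u = K(test, train) @ w, then take the sign for binary labels."""
    return np.sign(gaussian_kernel(X_test, X_train, h) @ w)

# Toy usage: two Gaussian blobs with labels -1 / +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
w = fit(X, y)
print(predict(X, w, X[:5]))
```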

Our approach: Kernel Matrix Approximation
• K is usually a dense N-by-N matrix; this quadratic cost is often the computational bottleneck
• Reaching O(N) algorithms requires approximation
• For the majority of applications, the off-diagonal blocks of K admit good low-rank approximations

Many ML libraries offer kernel methods; to our knowledge, none of them offer kernel matrix approximation.
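To illustrate the low-rank property of off-diagonal blocks (a stand-alone demonstration, not ExaNIML's GOFMM code), the following sketch compresses the kernel block between two well-separated point clusters with a truncated SVD; cluster geometry and rank are assumptions.

```python
# Off-diagonal kernel blocks between well-separated clusters have rapidly
# decaying singular values, so a truncated SVD gives a good rank-r surrogate.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(0.0, 1.0, (500, 3))        # cluster around the origin
B = rng.normal(10.0, 1.0, (500, 3))       # well-separated cluster

d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
K_off = np.exp(-0.5 * d2)                 # off-diagonal Gaussian kernel block

U, s, Vt = np.linalg.svd(K_off, full_matrices=False)
r = 10                                    # keep only the leading rank-r part
K_r = (U[:, :r] * s[:r]) @ Vt[:r]

rel_err = np.linalg.norm(K_off - K_r) / np.linalg.norm(K_off)
print(f"rank-{r} relative error: {rel_err:.2e}")   # typically very small
```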

Kernel Methods: Key Computational Bottlenecks

Geometry-oblivious Fast Multipole Method [1]
• Hierarchically off-diagonal low-rank representation
• Speeds up algebraic operations
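As rough intuition for why a hierarchically off-diagonal low-rank representation speeds up algebra, here is a toy one-level matvec: diagonal blocks stay dense while off-diagonal blocks are replaced by rank-r factors. This is a drastically simplified stand-in for GOFMM's multilevel, distributed algorithm; all sizes and ranks are assumptions.

```python
# Toy one-level HODLR-style matrix-vector product: y = K x with the two
# off-diagonal blocks of K replaced by rank-r factors U V^T.
import numpy as np

def compress(block, r):
    """Rank-r factors (U, V) such that block ≈ U @ V.T (truncated SVD)."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r].T

def hodlr_matvec(D1, D2, U12, V12, U21, V21, x):
    """y = [[D1, U12 V12^T], [U21 V21^T, D2]] @ x using the low-rank factors."""
    n1 = D1.shape[0]
    x1, x2 = x[:n1], x[n1:]
    y1 = D1 @ x1 + U12 @ (V12.T @ x2)
    y2 = U21 @ (V21.T @ x1) + D2 @ x2
    return np.concatenate([y1, y2])

# Demo: Gaussian kernel matrix for two well-separated clusters.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-4, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
n1 = 200
U12, V12 = compress(K[:n1, n1:], r=8)
U21, V21 = compress(K[n1:, :n1], r=8)
x = rng.normal(size=400)
y_approx = hodlr_matvec(K[:n1, :n1], K[n1:, n1:], U12, V12, U21, V21, x)
print(np.linalg.norm(K @ x - y_approx) / np.linalg.norm(K @ x))
```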

[Figure: Dependency graph for asynchronous task analysis — tasks include iSend/iRecv, telescoping skeleton weights, packing/unpacking skeleton weights, telescoping skeleton potentials, and reducing skeleton potentials.]

[Figure: 4-process distributed H-matrix compression. Mixed-colored sections and factors are shared between distributed nodes.]

First scalability results

Multiplication [1]

[Figure: Time for compression (left) and multiplication (right) for a 6-d Gaussian kernel matrix of size 64M-by-64M, on 192 to 6144 cores; synchronous vs. asynchronous execution of the neighbor, tree, and skeletonization phases, with efficiencies ranging roughly from 12% to 29%.]

O(N) linear solver [3]

[Figure: Time for ULV factorization (left) and forward solve (right) on 192 to 6144 cores; factorization time drops from 1.41s to 0.10s and solve time from 11.07s to 0.68s, with parallel efficiencies of roughly 15–30% (factorize) and 7–17% (solve).]

Dimensionality Reduction

[Figure: Relative singular values versus number of components (left); the data in the embedded space, first versus second component, colored by class labels 0, 1, 2 (right).]

Reduce the dimensionality of the dataset.

Manifold Learning Algorithms
• (Kernel) Principal Component Analysis
• Isomap algorithm
• Hessian local eigenmaps, ...
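Of the algorithms listed, plain PCA via the SVD is the easiest to sketch; the snippet below is an illustration, not the library's implementation, and computes a 2-D embedding plus the relative singular values like those plotted in the left panel.

```python
# Minimal PCA-via-SVD sketch of a dimensionality reduction step.
import numpy as np

def pca_embed(X, n_components=2):
    """Project mean-centered data onto its leading principal components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    rel_singular_values = s / s[0]       # the decay shown in the left panel
    return Xc @ Vt[:n_components].T, rel_singular_values

# Toy usage: embed 60-dimensional data into a 2-D "embedded space".
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 60))
Z, rel_sv = pca_embed(X)                 # Z[:, 0] vs. Z[:, 1]: the scatter plot
print(Z.shape, rel_sv[:5])
```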

Classification on Embedded Space
• Example forced to a 2D manifold (for plotting)
• Classification on the lower-dimensional manifold
→ Sparse grid classification [4]

Approximation with Sparse Grids

Sparse grids
• Reduce the number of grid points
• One approach: the combination technique — combine full grids (red and blue, cf. figure on the right; see the sketch below)
• Suitable for 5 to 20 dimensions
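As a sketch of the combination technique's bookkeeping (conventions for level indexing vary; this follows one classical formulation in the spirit of Bungartz & Griebel [2] and is not the library's code), the snippet below enumerates the anisotropic full grids and the ±1 coefficients with which they are combined.

```python
# One classical formulation of the combination technique: the level-n
# sparse-grid approximation is a signed sum of anisotropic full-grid
# solutions u_l whose level vectors l lie on d adjacent diagonals of the
# level lattice.  This sketch assumes levels l_i >= 1.
from itertools import product
from math import comb

def combination_grids(n, d=2):
    """Yield (coefficient, level vector) pairs for the level-n combination."""
    for q in range(d):
        coeff = (-1) ** q * comb(d - 1, q)
        for levels in product(range(1, n + 1), repeat=d):
            if sum(levels) == n + (d - 1) - q:
                yield coeff, levels

# For n = 4 in 2-D this lists the full grids (red and blue in the figure)
# and their +1 / -1 coefficients.
for coeff, levels in combination_grids(4, d=2):
    print(f"{coeff:+d} * full grid with levels {levels}")
```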

Sparse Grids in Embedded Space
1. Manifold learning algorithm for a coarse embedded space
2. Fine approximation in the embedded space with sparse grids

Synergy between Point-based and Grid-based Methods

[Diagram: Sparse grids interface with kernel matrix approximation, manifold learning, deep learning (ResNet, MobileNet, ...), the sparse grid library, and data mining with sparse grid density estimation.]

Conclusion
• Method design
  – Run prominent models from current machine learning peers
  – Combine models with hierarchical kernel and sparse grid methods
• Library design
  – Community/reproducibility: the ExaNIML library for others to play with

References

[1] C. D. Yu, S. Reiz, and G. Biros, "Distributed-memory hierarchical compression of dense SPD matrices," in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '18), Piscataway, NJ, USA, pp. 15:1–15:15, IEEE Press, 2018.

[2] H.-J. Bungartz and M. Griebel, "Sparse grids," Acta Numerica, vol. 13, pp. 147–269, 2004.

[3] C. D. Yu, S. Reiz, and G. Biros, "Distributed O(N) linear solver for dense symmetric hierarchical semi-separable matrices," in 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp. 1–8, IEEE, 2019.

[4] B. Peherstorfer, D. Pflüger, and H.-J. Bungartz, "Density estimation with adaptive sparse grids for large data sets," in Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 443–451, SIAM, 2014.