
Hardware Accelerated Learning at the Edge

Sascha Mücke, Nico Piatkowski, and Katharina Morik

TU Dortmund, AI Group
http://www-ai.cs.tu-dortmund.de

Abstract. Specialized hardware for machine learning allows us to train highly accurate models in hours which would otherwise take days or months of computation time. The advent of recent deep learning techniques can largely be explained by the fact that their training and inference rely heavily on fast matrix algebra that can be accelerated easily via programmable graphics processing units (GPU). Thus, vendors praise the GPU as the hardware for machine learning. However, those accelerators have an energy consumption of several hundred Watts. In distributed learning, each node has to meet resource constraints that exceed those of an ordinary workstation, especially when learning is performed at the edge, i.e., close to the data source. The energy consumption is typically highly restricted, and relying on high-end CPUs and GPUs is thus not a viable option. In this work, we present our new quantum-inspired machine learning hardware accelerator. More precisely, we explain how our hardware approximates the solution to several NP-hard data mining and machine learning problems, including k-means clustering, maximum-a-posteriori prediction, and binary support vector machine learning. Our device has a worst-case energy consumption of about 1.5 Watts and is thus especially well suited for distributed learning at the edge.

Keywords: Hardware acceleration · Machine Learning · Edge device

1 Introduction

Hardware acceleration for machine learning usually refers to GPU implementations that can do fast linear algebra to enhance the speed of numerical computations. This, however, rests on two implicit assumptions: 1) the learning problem can actually benefit from fast linear algebra, i.e., the most complex parts of learning and inference can be phrased in the language of matrix-vector calculus, and 2) learning is carried out in an environment where energy supply, size, and weight of the system are mostly unrestricted. The latter assumption is indeed violated when learning has to be carried out at the edge, that is, on the device that actually measures the data.

Especially in the distributed or federated learning settings, edge devices are subject to strong resource constraints. Communication efficiency [6] and computational burden [7, 3] must be reduced in order to get along with the available hardware. One way to address these issues is efficient decentralized learning


[Bar chart on a logarithmic scale, power consumption in W: CPU 95, GPU 260, QA 25,000, FPGA 1.41]

Fig. 1. Power consumption of different hardware solutions for Machine Learning; CPU: Intel Core i7-9700K, GPU: Nvidia GEFORCE RTX 2080 Ti, QA (Quantum Annealer): D-Wave 2000Q, FPGA: Kintex-7 KC705

schemes [5]. However, the resource consumption of state-of-the-art hardware accelerators is often out of reach for edge devices.

We thus present a novel hardware architecture that can be used as a solver at the core of various data mining and machine learning techniques, some of which we will explain in the following sections. Our work is inspired by the so-called quantum annealer, a hardware architecture for solving discrete optimization problems by exploiting quantum mechanical effects. In contrast to GPUs and quantum annealers, our device has a highly reduced resource consumption. The power consumption of four machine learning accelerators is shown in Fig. 1. For the CPU and GPU, we provide the thermal design power (TDP), whereas the FPGA's value is the total on-chip power, calculated using the Vivado Design Suite (footnote 1). We see that the actual peak consumption of CPUs and GPUs exceeds the energy consumption of our device by several orders of magnitude. Moreover, we provide the estimated energy consumption of the D-Wave 2000Q quantum annealer. The annealer optimizes the exact same objective function as our device, but the cooling and magnetic shielding required for its operation lead to an enormous energy consumption, which is very impractical for real applications at its current stage. Hence, with its low resource requirements and versatility, our device is the ideal hardware accelerator for distributed learning at the edge.

Our approach is different to GPU programming in that our hardware is designed to solve a fixed class C of parametric optimization problems. "Programming" our device is then realized by reducing a learning problem to a member

1 https://www.xilinx.com/products/design-tools/vivado.html


Fig. 2. Exemplary visualization of the demo setup; left: multiple FPGAs (Kintex-7, Artix-7) with accompanying visualizations of board configuration and convergence results; right: Kintex-7 KC705 Evaluation Kit

of C and transferring only the resulting coefficients β. The optimization step, e.g. model training, is performed entirely on the board, without any additional communication cost.

In our demo setup (shown in Fig. 2) we will showcase several machine learning tasks in a live setting on multiple devices, accompanied by live visualizations of the learning progress and the results.

The underlying idea of using a non-universal compute architecture for machine learning is indeed not new: state-of-the-art quantum annealers rely on the very same problem formulation. There, optimization problems are encoded as potential energy between qubits; the global minimum of a loss function can be interpreted as the quantum state of lowest energy [4]. The fundamentally non-deterministic nature of quantum methods makes the daunting task of traversing an exponentially large solution space feasible. However, their practical implementation is a persisting challenge, and the development of actual quantum hardware is still in its infancy. The latest flagship, the D-Wave 2000Q, can handle problems with 64 fully connected bits (footnote 2), which is by far not sufficient for realistic problem sizes.

Nevertheless, the particular class of optimization problems that quantum annealers can solve is well understood, which motivates its use for hardware accelerators outside of the quantum world.

2 Boolean Optimization

A pseudo-Boolean function (PBF) is any function f : B^n → R that assigns a real value to a fixed-length binary vector. Every PBF on n binary variables can be uniquely expressed as a polynomial of some degree d ≤ n with real-valued coefficients [2]. Quadratic Unconstrained Binary Optimization (QUBO) is the problem of finding an assignment x* of n binary variables that is minimal with respect to a second-degree Boolean polynomial:

$$x^* = \arg\min_{x \in \mathbb{B}^n} \sum_{i=1}^{n} \sum_{j=1}^{i} \beta_{ij}\, x_i x_j$$

2 https://www.dwavesys.com/sites/default/files/mwj_dwave_qubits2018.pdf
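As a small, arbitrarily chosen illustration of the polynomial representation, a PBF on two variables with values f(0,0) = 1, f(1,0) = 3, f(0,1) = 0 and f(1,1) = 5 is uniquely written as the quadratic polynomial

$$f(x_1, x_2) = 1 + 2x_1 - x_2 + 3x_1 x_2 .$$

Since x_i² = x_i for binary variables, the linear terms correspond to the diagonal coefficients β_ii, and the constant offset does not affect the minimizer, so such a polynomial already has the form required by the QUBO objective above.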


It has been shown that all higher-degree pseudo-Boolean optimization problems can be reduced to quadratic problems [2]. For this reason, a variety of well-known optimization problems like (Max-)3SAT and prime factorization, but also ML-related problems like clustering, maximum-a-posteriori (MAP) estimation in Markov Random Fields and binary constrained SVM learning, can be reduced to QUBO or its Ising variant (where x ∈ {−1,+1}^n). In our demo, we will explain the impact of different reductions in terms of runtime and quality for different learning tasks.
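To make the objective above concrete, the following minimal Python sketch evaluates the QUBO objective and, for a toy instance with arbitrarily chosen coefficients, finds the minimizer by exhaustive search; it illustrates the problem our device solves, not the FPGA implementation itself.

```python
import numpy as np
from itertools import product

def qubo_objective(beta, x):
    """Evaluate sum_{i=1..n} sum_{j=1..i} beta[i, j] * x[i] * x[j]
    for a lower-triangular coefficient matrix beta and a binary vector x."""
    return float(x @ np.tril(beta) @ x)

def brute_force_qubo(beta):
    """Exhaustively search all 2^n binary vectors (feasible for toy sizes only)."""
    n = beta.shape[0]
    best_x, best_val = None, np.inf
    for bits in product((0, 1), repeat=n):
        x = np.array(bits)
        val = qubo_objective(beta, x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Arbitrary 4-variable example instance (lower triangle holds the coefficients)
beta = np.array([[-2.0,  0.0,  0.0,  0.0],
                 [ 1.0, -1.0,  0.0,  0.0],
                 [ 0.5,  2.0, -3.0,  0.0],
                 [ 1.5, -0.5,  1.0, -1.0]])
x_star, f_star = brute_force_qubo(beta)
print("x* =", x_star, "objective =", f_star)
```

Exhaustive search is of course infeasible beyond toy sizes, which is precisely why heuristic solvers such as the one described in the next section are needed.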

3 Evolutionary QUBO Solver

If no specialized algorithm is known for a particular hard combinatorial optimization problem, randomized search heuristics, like simulated annealing or evolutionary algorithms (EA), provide a generic way to generate good solutions.

Inspired by biological evolution, EAs employ recombination and mutation on a set of "parent" solutions to produce a set of "offspring" solutions. A loss function, also called fitness function in the EA context, is used to select those solutions which will constitute the next parent generation. This process is repeated until convergence or until a pre-specified time budget is exhausted [8].

Motivated by the inherently parallel nature of digital circuits, we developed a highly customizable (µ+λ)-EA architecture on FPGA hardware, implemented using the VHDL language (footnote 3). Here, "customizable" implies that different types of FPGA hardware, from small to large, can be used. This is done by allowing the end-user to customize the maximal problem dimension n, the number of parent solutions µ, the number of offspring solutions λ, and the number of bits per coefficient βij. In case of a low-budget FPGA, this allows us to either allocate more FPGA resources for parallel computation (µ and λ) or for the problem size (n and β). We will show how to generate and run chip designs for low-budget as well as high-end devices in our demo.
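As a software analogue of the on-chip search (not a model of the VHDL design), the following Python sketch shows a plain (µ+λ)-EA with per-bit mutation probability κ on the QUBO objective; the default κ = 1/n matches the smallest of the mutation rates compared in Fig. 3.

```python
import numpy as np

def evolve_qubo(beta, mu=4, lam=8, kappa=None, generations=1000, seed=None):
    """Minimize x^T tril(beta) x over x in {0,1}^n with a (mu + lambda)-EA.
    kappa is the per-bit mutation probability (defaults to 1/n)."""
    rng = np.random.default_rng(seed)
    n = beta.shape[0]
    kappa = kappa if kappa is not None else 1.0 / n
    tril = np.tril(beta)
    fitness = lambda x: float(x @ tril @ x)   # lower is better

    parents = rng.integers(0, 2, size=(mu, n))   # random initial population
    for _ in range(generations):
        # offspring: copy random parents, then flip each bit with probability kappa
        offspring = parents[rng.integers(0, mu, size=lam)].copy()
        offspring[rng.random((lam, n)) < kappa] ^= 1
        # (mu + lambda) selection: keep the mu best of parents and offspring
        pool = np.vstack([parents, offspring])
        pool = pool[np.argsort([fitness(x) for x in pool])]
        parents = pool[:mu]
    return parents[0], fitness(parents[0])
```

On the FPGA, the fitness evaluations of all offspring can be carried out in parallel, which is where the hardware implementation gains its speed over such a sequential sketch.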

4 Exemplary Learning Tasks

"Programming" our devices reduces to determining the corresponding coefficients β and uploading them to the FPGA via our Python interface. We will show how this is done for various machine learning tasks, two of which we describe below:

A prototypical data mining problem is k-means clustering, which is already NP-hard for k = 2. To derive β for a 2-means clustering problem, we use the method devised in [1], where each bit indicates whether the corresponding data point belongs to cluster 1 or cluster 2; the problem dimension is thus n = |D|. The coefficients are then derived from the centered Gramian G over the mean-adjusted data. To keep as much precision as possible, we stretch the parameters to use the full range of b bits before rounding, so the final formula is $\beta_{ij} = \lfloor \alpha G_{ij} + 0.5 \rfloor$ with $\alpha = (2^{b-1} - 1) / \max_{i,j} |G_{ij}|$.

3 https://www.ics.uci.edu/~jmoorkan/vhdlref/Synario%20VHDL%20Manual.pdf


Exemplary results on the UCI data sets Iris and Sonar are shown in Fig. 3 (top).
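A minimal host-side sketch of this coefficient derivation might look as follows, assuming the Gram matrix of the mean-adjusted data is used directly as described above; the exact reduction of [1] (in particular its sign conventions) is not reproduced here.

```python
import numpy as np

def two_means_qubo_coefficients(X, b=16):
    """Derive b-bit integer QUBO coefficients from the Gram matrix of the
    mean-adjusted data, stretched to the full b-bit range before rounding."""
    Xc = X - X.mean(axis=0)        # mean-adjusted data, one row per data point
    G = Xc @ Xc.T                  # Gramian, shape (n, n) with n = |D|
    alpha = (2 ** (b - 1) - 1) / np.abs(G).max()
    return np.floor(alpha * G + 0.5).astype(np.int64)
```

Each bit of a minimizing vector then indicates to which of the two clusters the corresponding data point is assigned.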

[Figure 3: four panels plotting the QUBO objective over time (ms) for mutation rates κ = 1/n, 2/n, ..., 10/n]

Fig. 3. QUBO loss value over time with different mutation rates, each averaged over 10 runs. Uncertainty indicated by transparent areas. Top-left: 2-means on Iris (n = 150, d = 4). Top-right: 2-means on Sonar (n = 208, d = 61). Bottom-left: MRF-MAP with edge encoding. Bottom-right: MRF-MAP with vertex encoding.

Another typical NP-hard ML problem is to determine the most likely configuration of variables in a Markov Random Field, known as the MAP prediction problem [9]. Just as programs for classical universal computers can differ in efficiency, different QUBO encodings of the same problem have implications for the efficiency of our device. To demonstrate this effect, we will perform a live comparison of different encodings in terms of convergence behavior.

One possible solution for the MRF-MAP problem is to encode the assignments of all X_v as a concatenation of one-hot encodings

$$(X_1 = x_1^{i_1}, \ldots, X_m = x_m^{i_m}) \;\mapsto\; \underbrace{0 \ldots 010 \ldots 0}_{|X_1|} \;\ldots\; \underbrace{0 \ldots 010 \ldots 0}_{|X_m|},$$

where m = |V| and x_k^i is the i-th value in X_k. The weights −θ_{uv=xy} are encoded into the quadratic coefficients; if two different bits belong to the same variable encoding, a penalty weight is added between them to maintain a valid one-hot encoding. The negative sign is added to turn MAP into a minimization problem.
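The following sketch illustrates how such a QUBO matrix could be assembled from pairwise MRF weights under this vertex (one-hot) encoding; the dictionary format for the weights and the fixed penalty constant are illustrative assumptions, not the exact construction used on our device.

```python
import numpy as np

def mrf_map_qubo_onehot(domains, theta, penalty=10.0):
    """Build QUBO coefficients for MRF-MAP with a one-hot (vertex) encoding.

    domains : list of domain sizes |X_1|, ..., |X_m|
    theta   : dict mapping (u, v, x, y) -> weight theta_{uv=xy}
    penalty : weight added between bits of the same variable's block
              to discourage invalid one-hot assignments (illustrative value)
    """
    offsets = np.cumsum([0] + list(domains))   # start index of each variable's block
    n = offsets[-1]
    beta = np.zeros((n, n))

    # negated weights on the quadratic coefficients (turns MAP into minimization)
    for (u, v, x, y), w in theta.items():
        i, j = offsets[u] + x, offsets[v] + y
        i, j = max(i, j), min(i, j)            # store in the lower triangle
        beta[i, j] += -w

    # penalties between different bits within the same variable's block
    for v, size in enumerate(domains):
        for a in range(size):
            for c in range(a):
                beta[offsets[v] + a, offsets[v] + c] += penalty
    return beta
```

The penalties discourage more than one active bit per block, so the blockwise positions of the active bits in a good solution can be read off as an assignment of the variables X_v.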

For a different possible solution, we may assign bits b_{uv=xy} to all non-zero weights θ_{uv=xy} between specific values x, y of two variables X_u, X_v, indicating that X_u = x and X_v = y. Again, to avoid multiple assignments of the same


variable, we introduce penalty weights between pairs of edges. We can see in Fig. 3 (bottom) that the two approaches lead to different convergence behavior.

In addition to k-means and MRF-MAP, the demo will include binary SVM learning, binary MRF parameter learning, and others.

A video demonstrating how to use our system by solving a clustering problem can be found here:

https://youtu.be/Xj5xx-eO1Mk

Acknowledgments This research has been funded by the Federal Ministry of Education and Research of Germany as part of the competence center for machine learning ML2R (01S18038A).

References

1. Bauckhage, C., Ojeda, C., Sifa, R., Wrobel, S.: Adiabatic quantum computing for kernel k = 2 means clustering. In: Proceedings of the LWDA 2018. pp. 21–32 (2018)

2. Boros, E., Hammer, P.L.: Pseudo-Boolean optimization. Discrete Applied Mathematics 123(1-3), 155–225 (2002)

3. Caldas, S., Konecny, J., McMahan, H.B., Talwalkar, A.: Expanding the reach of federated learning by reducing client resource requirements. CoRR abs/1812.07210 (2018), http://arxiv.org/abs/1812.07210

4. Kadowaki, T., Nishimori, H.: Quantum annealing in the transverse Ising model. Physical Review E 58(5), 5355 (1998)

5. Kamp, M., Adilova, L., Sicking, J., Huger, F., Schlicht, P., Wirtz, T., Wrobel, S.: Efficient decentralized deep learning by dynamic model averaging. In: Machine Learning and Knowledge Discovery in Databases (ECML PKDD). pp. 393–409 (2018). https://doi.org/10.1007/978-3-030-10925-7_24

6. Konecny, J., McMahan, H.B., Yu, F.X., Richtarik, P., Suresh, A.T., Bacon, D.: Federated learning: Strategies for improving communication efficiency. CoRR abs/1610.05492 (2016), http://arxiv.org/abs/1610.05492

7. Piatkowski, N., Lee, S., Morik, K.: Integer undirected graphical models for resource-constrained systems. Neurocomputing 173, 9–23 (2016)

8. Schwefel, H.P., Rudolph, G.: Contemporary evolution strategies. In: European Conference on Artificial Life. pp. 891–907. Springer (1995)

9. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1(1–2), 1–305 (2008)