Why does deep and cheap learning work so well?∗

Henry W. Lin, Max Tegmark, and David Rolnick
Dept. of Physics, Harvard University, Cambridge, MA 02138

Dept. of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139 and Dept. of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139

(Dated: July 21, 2017)

We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through “cheap learning” with exponentially fewer parameters than generic ones. We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine-learning, a deep neural network can be more efficient than a shallow one. We formalize these claims using information theory and discuss the relation to the renormalization group. We prove various “no-flattening theorems” showing when efficient linear deep networks cannot be accurately approximated by shallow ones without efficiency loss; for example, we show that n variables cannot be multiplied using fewer than 2^n neurons in a single hidden layer.

I. INTRODUCTION

Deep learning works remarkably well, and has helped dramatically improve the state-of-the-art in areas ranging from speech recognition, translation and visual object recognition to drug discovery, genomics and automatic game playing [1, 2]. However, it is still not fully understood why deep learning works so well. In contrast to GOFAI (“good old-fashioned AI”) algorithms that are hand-crafted and fully understood analytically, many algorithms using artificial neural networks are understood only at a heuristic level, where we empirically know that certain training protocols employing large data sets will result in excellent performance. This is reminiscent of the situation with human brains: we know that if we train a child according to a certain curriculum, she will learn certain skills — but we lack a deep understanding of how her brain accomplishes this.

This makes it timely and interesting to develop new analytic insights on deep learning and its successes, which is the goal of the present paper. Such improved understanding is not only interesting in its own right, and for potentially providing new clues about how brains work, but it may also have practical applications. Better understanding the shortcomings of deep learning may suggest ways of improving it, both to make it more capable and to make it more robust [3].

∗ Published in Journal of Statistical Physics: https://link.springer.com/article/10.1007/s10955-017-1836-5

A. The swindle: why does “cheap learning” work?

Throughout this paper, we will adopt a physics perspective on the problem, to prevent application-specific details from obscuring simple general results related to dynamics, symmetries, renormalization, etc., and to exploit useful similarities between deep learning and statistical mechanics.

The task of approximating functions of many variables is central to most applications of machine learning, including unsupervised learning, classification and prediction, as illustrated in Figure 1. For example, if we are interested in classifying faces, then we may want our neural network to implement a function where we feed in an image represented by a million greyscale pixels and get as output the probability distribution over a set of people that the image might represent.

When investigating the quality of a neural net, there are several important factors to consider:

• Expressibility: What class of functions can the neural network express?

• Efficiency: How many resources (neurons, parameters, etc.) does the neural network require to approximate a given function?

• Learnability: How rapidly can the neural network learn good parameters for approximating a function?

This paper is focused on expressibility and efficiency, and more specifically on the following well-known [4–6] problem: How can neural networks approximate functions well in practice, when the set of possible functions is exponentially larger than the set of practically possible networks?


[Figure 1: schematic panels for unsupervised learning, generation, and classification, labeled by the distributions p(x), p(x|y), and p(y|x); see the caption below.]

FIG. 1: In this paper, we follow the machine learning convention where y refers to the model parameters and x refers to the data, thus viewing x as a stochastic function of y (please beware that this is the opposite of the common mathematical convention that y is a function of x). Computer scientists usually call y a category when it is discrete and a parameter vector when it is continuous. Neural networks can approximate probability distributions. Given many samples of random vectors y and x, unsupervised learning attempts to approximate the joint probability distribution of y and x without making any assumptions about causality. Classification involves estimating the probability distribution for y given x. The opposite operation, estimating the probability distribution of x given y, is often called prediction when y causes x by being earlier data in a time sequence; in other cases where y causes x, for example via a generative model, this operation is sometimes known as probability density estimation. Note that in machine learning, prediction is sometimes defined not as outputting the probability distribution, but as sampling from it.

For example, suppose that we wish to classify megapixel greyscale images into two categories, e.g., cats or dogs. If each pixel can take one of 256 values, then there are 256^1,000,000 possible images, and for each one, we wish to compute the probability that it depicts a cat. This means that an arbitrary function is defined by a list of 256^1,000,000 probabilities, i.e., way more numbers than there are atoms in our universe (about 10^78).

Yet neural networks with merely thousands or millions of parameters somehow manage to perform such classification tasks quite well. How can deep learning be so “cheap”, in the sense of requiring so few parameters?

We will see below that neural networks perform a combinatorial swindle, replacing exponentiation by multiplication: if there are say n = 10^6 inputs taking v = 256 values each, this swindle cuts the number of parameters from v^n to v×n times some constant factor. We will show that the success of this swindle depends fundamentally on physics: although neural networks only work well for an exponentially tiny fraction of all possible inputs, the laws of physics are such that the data sets we care about for machine learning (natural images, sounds, drawings, text, etc.) are also drawn from an exponentially tiny fraction of all imaginable data sets. Moreover, we will see that these two tiny subsets are remarkably similar, enabling deep learning to work well in practice.
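To make the counting concrete, here is a minimal arithmetic sketch (ours, not from the paper; the values of n and v simply mirror the megapixel example above):

```python
from math import log10

n, v = 10**6, 256                 # a megapixel input with 256 grey levels per pixel
digits_generic = n * log10(v)     # number of decimal digits in v**n, the generic parameter count
swindle = v * n                   # parameter count after the combinatorial swindle
print(f"generic: ~10^{digits_generic:,.0f} parameters; swindle: {swindle:,}")
```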

The rest of this paper is organized as follows. In Section II, we present results for shallow neural networks with merely a handful of layers, focusing on simplifications due to locality, symmetry and polynomials. In Section III, we study how increasing the depth of a neural network can provide polynomial or exponential efficiency gains even though it adds nothing in terms of expressivity, and we discuss the connections to renormalization, compositionality and complexity. We summarize our conclusions in Section IV.

II. EXPRESSIBILITY AND EFFICIENCY OF SHALLOW NEURAL NETWORKS

Let us now explore what classes of probability distributions p are the focus of physics and machine learning, and how accurately and efficiently neural networks can approximate them. We will be interested in probability distributions p(x|y), where x ranges over some sample space and y will be interpreted either as another variable being conditioned on or as a model parameter. For a machine-learning example, we might interpret y as an element of some set of animals {cat, dog, rabbit, ...} and x as the vector of pixels in an image depicting such an animal, so that p(x|y) for y = cat gives the probability distribution of images of cats with different coloring, size, posture, viewing angle, lighting condition, electronic camera noise, etc. For a physics example, we might interpret y as an element of some set of metals {iron, aluminum, copper, ...} and x as the vector of magnetization values for different parts of a metal bar. The prediction problem is then to evaluate p(x|y), whereas the classification problem is to evaluate p(y|x).

Because of the above-mentioned “swindle”, accurate approximations are only possible for a tiny subclass of all probability distributions. Fortunately, as we will explore below, the function p(x|y) often has many simplifying features enabling accurate approximation, because it follows from some simple physical law or some generative model with relatively few free parameters: for example, its dependence on x may exhibit symmetry, locality and/or be of a simple form such as the exponential of a low-order polynomial. In contrast, the dependence of p(y|x) on y tends to be more complicated; it makes no sense to speak of symmetries or polynomials involving a variable y = cat.

Let us therefore start by tackling the more complicated case of modeling p(y|x). This probability distribution p(y|x) is determined by the hopefully simpler function p(x|y) via Bayes’ theorem:

p(y|x) = p(x|y) p(y) / Σ_{y′} p(x|y′) p(y′),    (1)


Physics                      | Machine learning
-----------------------------|-----------------------------------
Hamiltonian                  | Surprisal −ln p
Simple H                     | Cheap learning
Quadratic H                  | Gaussian p
Locality                     | Sparsity
Translationally symmetric H  | Convnet
Computing p from H           | Softmaxing
Spin                         | Bit
Free energy difference       | KL-divergence
Effective theory             | Nearly lossless data distillation
Irrelevant operator          | Noise
Relevant operator            | Feature

TABLE I: Physics-ML dictionary.

where p(y) is the probability distribution over y (animals or metals, say) a priori, before examining the data vector x.

A. Probabilities and Hamiltonians

It is useful to introduce the negative logarithms of two of these probabilities:

H_y(x) ≡ −ln p(x|y),
μ_y ≡ −ln p(y).    (2)

Statisticians refer to −ln p as “self-information” or “surprisal”, and statistical physicists refer to H_y(x) as the Hamiltonian, quantifying the energy of x (up to an arbitrary and irrelevant additive constant) given the parameter y. Table I is a brief dictionary translating between physics and machine-learning terminology. These definitions transform equation (1) into the Boltzmann form

p(y|x) = (1/N(x)) e^{−[H_y(x)+μ_y]},    (3)

where

N(x) ≡ Σ_y e^{−[H_y(x)+μ_y]}.    (4)

This recasting of equation (1) is useful because the Hamiltonian tends to have properties making it simple to evaluate. We will see in Section III that it also helps us understand the relation between deep learning and renormalization [7].

B. Bayes’ theorem as a softmax

Since the variable y takes one of a discrete set of values, we will often write it as an index instead of as an argument, as p_y(x) ≡ p(y|x). Moreover, we will often find it convenient to view all values indexed by y as elements of a vector, written in boldface, thus viewing p_y, H_y and μ_y as elements of the vectors p, H and µ, respectively. Equation (3) thus simplifies to

p(x) = (1/N(x)) e^{−[H(x)+µ]},    (5)

using the standard convention that a function (in this case exp) applied to a vector acts on its elements.

We wish to investigate how well this vector-valued function p(x) can be approximated by a neural net. A standard n-layer feedforward neural network maps vectors to vectors by applying a series of linear and nonlinear transformations in succession. Specifically, it implements vector-valued functions of the form [1]

f(x) = σ_n A_n ⋯ σ_2 A_2 σ_1 A_1 x,    (6)

where the σ_i are relatively simple nonlinear operators on vectors and the A_i are affine transformations of the form A_i x = W_i x + b_i for matrices W_i and so-called bias vectors b_i. Popular choices for these nonlinear operators σ_i include

• Local function: apply some nonlinear function σ to each vector element,

• Max-pooling: compute the maximum of all vector elements,

• Softmax: exponentiate all vector elements and normalize them to sum to unity,

  σ̂(y) ≡ e^y / Σ_i e^{y_i}.    (7)

(We use σ̂ to indicate the softmax function and σ to indicate an arbitrary non-linearity, optionally with certain regularity requirements.)

This allows us to rewrite equation (5) as

p(x) = σ̂[−H(x) − µ].    (8)

This means that if we can compute the Hamiltonian vector H(x) with some n-layer neural net, we can evaluate the desired classification probability vector p(x) by simply adding a softmax layer. The µ-vector simply becomes the bias term in this final layer.
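As an illustrative numerical check (ours, not the paper’s), the following sketch verifies on a small made-up distribution that the softmax of −[H(x) + µ] in equation (8) reproduces Bayes’ theorem (1):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_x = 3, 5
p_x_given_y = rng.random((n_classes, n_x))
p_x_given_y /= p_x_given_y.sum(axis=1, keepdims=True)     # rows are p(x|y) for each class y
p_y = np.array([0.5, 0.3, 0.2])                           # prior p(y)

x = 2                                                     # an observed data value
bayes = p_x_given_y[:, x] * p_y
bayes /= bayes.sum()                                      # eq. (1)

H = -np.log(p_x_given_y[:, x])                            # Hamiltonian vector, eq. (2)
mu = -np.log(p_y)
softmax = np.exp(-(H + mu)) / np.exp(-(H + mu)).sum()     # eq. (8)

print(np.allclose(bayes, softmax))                        # True
```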

C. What Hamiltonians can be approximated by feasible neural networks?

It has long been known that neural networks are universal¹ approximators [8, 9], in the sense that networks with virtually all popular nonlinear activation functions σ(x) can approximate any smooth function to any desired accuracy — even using merely a single hidden layer. However, these theorems do not guarantee that this can be accomplished with a network of feasible size, and the following simple example explains why they cannot: there are 2^{2^n} different Boolean functions of n variables, so a network implementing a generic function in this class requires at least 2^n bits to describe, i.e., more bits than there are atoms in our universe if n > 260.

¹ Neurons are universal analog computing modules in much the same way that NAND gates are universal digital computing modules: any computable function can be accurately evaluated by a sufficiently large network of them. Just as NAND gates are not unique (NOR gates are also universal), nor is any particular neuron implementation — indeed, any generic smooth nonlinear activation function is universal [8, 9].

The fact that neural networks of feasible size are nonetheless so useful therefore implies that the class of functions we care about approximating is dramatically smaller. We will see below in Section II D that both physics and machine learning tend to favor Hamiltonians that are polynomials² — indeed, often ones that are sparse, symmetric and low-order. Let us therefore focus our initial investigation on Hamiltonians that can be expanded as a power series:

H_y(x) = h + Σ_i h_i x_i + Σ_{i≤j} h_{ij} x_i x_j + Σ_{i≤j≤k} h_{ijk} x_i x_j x_k + ⋯ .    (9)

If the vector x has n components (i = 1, ..., n), then there are (n + d)!/(n! d!) terms of degree up to d.
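For a sense of scale, a short sketch (illustrative values, ours) of how quickly the coefficient count (n + d)!/(n! d!) grows:

```python
from math import comb

for n, d in [(10, 2), (100, 4), (10**6, 2)]:
    print(n, d, comb(n + d, d))    # (n + d)! / (n! d!) terms of degree up to d
```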

1. Continuous input variables

If we can accurately approximate multiplication using a small number of neurons, then we can construct a network efficiently approximating any polynomial H_y(x) by repeated multiplication and addition. We will now see that we can, using any smooth but otherwise arbitrary non-linearity σ that is applied element-wise. The popular logistic sigmoid activation function σ(x) = 1/(1 + e^{−x}) will do the trick.

Theorem: Let f be a neural network of the form f = A_2 σ A_1, where σ acts elementwise by applying some smooth non-linear function σ to each element. Let the input layer, hidden layer and output layer have sizes 2, 4 and 1, respectively. Then f can approximate a multiplication gate arbitrarily well.

² The class of functions that can be exactly expressed by a neural network must be invariant under composition, since adding more layers corresponds to using the output of one function as the input to another. Important such classes include linear functions, affine functions, piecewise linear functions (generated by the popular Rectified Linear Unit “ReLU” activation function σ(x) = max[0, x]), polynomials, continuous functions and smooth functions whose nth derivatives are continuous. According to the Stone-Weierstrass theorem, both polynomials and piecewise linear functions can approximate continuous functions arbitrarily well.

To see this, let us first Taylor-expand the function σ around the origin:

σ(u) = σ_0 + σ_1 u + σ_2 u²/2 + O(u³).    (10)

Without loss of generality, we can assume that σ_2 ≠ 0: since σ is non-linear, it must have a non-zero second derivative at some point, so we can use the biases in A_1 to shift the origin to this point to ensure σ_2 ≠ 0. Equation (10) now implies that

m(u, v) ≡ [σ(u+v) + σ(−u−v) − σ(u−v) − σ(−u+v)] / (4σ_2) = uv [1 + O(u² + v²)],    (11)

where we will term m(u, v) the multiplication approximator. Taylor’s theorem guarantees that m(u, v) is an arbitrarily good approximation of uv for arbitrarily small |u| and |v|. However, we can always make |u| and |v| arbitrarily small by scaling A_1 → λA_1 and then compensating by scaling A_2 → λ^{−2}A_2. In the limit that λ → 0, this approximation becomes exact. In other words, arbitrarily accurate multiplication can always be achieved using merely 4 neurons. Figure 2 illustrates such a multiplication approximator. (Of course, a practical algorithm like stochastic gradient descent cannot achieve arbitrarily large weights, though a reasonably good approximation can be achieved already for λ^{−1} ∼ 10.)
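A minimal numerical sketch of this construction (ours, not the paper’s): the four sigmoid evaluations of equation (11), biased to a point c where σ″(c) ≠ 0 and with inputs scaled by a small λ, approximate the product uv.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))          # logistic sigmoid

def sigma_pp(c):
    s = sigma(c)
    return s * (1 - s) * (1 - 2 * s)         # its second derivative

def mult_gate(u, v, lam=0.01, c=1.0):
    """Four-neuron multiplication approximator m(u, v) of eq. (11)."""
    a, b = lam * (u + v), lam * (u - v)
    num = sigma(c + a) + sigma(c - a) - sigma(c + b) - sigma(c - b)
    return num / (4 * sigma_pp(c) * lam**2)

print(mult_gate(1.3, -2.7), 1.3 * -2.7)      # agreement improves as lam -> 0
```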

Corollary: For any given multivariate polynomial and any tolerance ε > 0, there exists a neural network of fixed finite size N (independent of ε) that approximates the polynomial to accuracy better than ε. Furthermore, N is bounded by the complexity of the polynomial, scaling as the number of multiplications required times a factor that is typically slightly larger than 4.³

This is a stronger statement than the classic universal approximation theorems for neural networks [8, 9], which guarantee that for every ε there exists some N(ε), but allow for the possibility that N(ε) → ∞ as ε → 0. An approximation theorem in [10] provides an ε-independent bound on the size of the neural network, but at the price of choosing a pathological function σ.

³ In addition to the four neurons required for each multiplication, additional neurons may be deployed to copy variables to higher layers, bypassing the nonlinearity in σ. Such linear “copy gates” implementing the function u → u are of course trivial to implement using a simpler version of the above procedure: using A_1 to shift and scale down the input to fall in a tiny range where σ′(u) ≠ 0, and then scaling it up and shifting accordingly with A_2.


[Figure 2: network diagrams of the continuous multiplication gate (left) and the binary multiplication gate (right); see the caption below.]

FIG. 2: Multiplication can be efficiently implemented by simple neural nets, becoming arbitrarily accurate as λ → 0 (left) and β → ∞ (right). Squares apply the function σ, circles perform summation, and lines multiply by the constants labeling them. The “1” input implements the bias term. The left gate requires σ″(0) ≠ 0, which can always be arranged by biasing the input to σ. The right gate requires the sigmoidal behavior σ(x) → 0 and σ(x) → 1 as x → −∞ and x → ∞, respectively.

2. Discrete input variables

For the simple but important case where x is a vector of bits, so that x_i = 0 or x_i = 1, the fact that x_i² = x_i makes things even simpler. This means that only terms where all variables are different need be included, which simplifies equation (9) to

H_y(x) = h + Σ_i h_i x_i + Σ_{i<j} h_{ij} x_i x_j + Σ_{i<j<k} h_{ijk} x_i x_j x_k + ⋯ .    (12)

The infinite series of equation (9) thus gets replaced by a finite series with 2^n terms, ending with the term h_{1...n} x_1 ⋯ x_n. Since there are 2^n possible bit strings x, the 2^n h-parameters in equation (12) suffice to exactly parametrize an arbitrary function H_y(x).
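As a small self-contained check (ours), solving for the 2^n coefficients of equation (12) indeed reproduces an arbitrary function on bit strings exactly, here for n = 3:

```python
import numpy as np
from itertools import combinations, product

n = 3
bitstrings = list(product([0, 1], repeat=n))
subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]   # the 2^n monomials

# Design matrix: value of the monomial prod_{i in S} x_i for every bit string x
A = np.array([[int(all(x[i] for i in S)) for S in subsets] for x in bitstrings])

rng = np.random.default_rng(0)
H = rng.standard_normal(2 ** n)      # an arbitrary function H_y(x) on the 2^n bit strings
h = np.linalg.solve(A, H)            # the 2^n h-parameters of eq. (12)
print(np.allclose(A @ h, H))         # True: exact parametrization
```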

The efficient multiplication approximator above multiplied only two variables at a time, thus requiring multiple layers to evaluate general polynomials. In contrast, H(x) for a bit vector x can be implemented using merely three layers as illustrated in Figure 2, where the middle layer evaluates the bit products and the third layer takes a linear combination of them. This is because bits allow an accurate multiplication approximator that takes the product of an arbitrary number of bits at once, exploiting the fact that a product of bits can be trivially determined from their sum: for example, the product x_1 x_2 x_3 = 1 if and only if the sum x_1 + x_2 + x_3 = 3. This sum-checking can be implemented using one of the most popular choices for a nonlinear function σ: the logistic sigmoid σ(x) = 1/(1 + e^{−x}), which satisfies σ(x) ≈ 0 for x ≪ 0 and σ(x) ≈ 1 for x ≫ 1. To compute the product of some set of k bits described by the set K (for our example above, K = {1, 2, 3}), we let A_1 and A_2 shift and stretch the sigmoid to exploit the identity

∏_{i∈K} x_i = lim_{β→∞} σ[−β(k − 1/2 − Σ_{i∈K} x_i)].    (13)

Since σ decays exponentially fast toward 0 or 1 as β is increased, modestly large β-values suffice in practice; if, for example, we want the correct answer to D = 10 decimal places, we merely need β > D ln 10 ≈ 23. In summary, when x is a bit string, an arbitrary function p_y(x) can be evaluated by a simple 3-layer neural network: the middle layer uses sigmoid functions to compute the products from equation (12), and the top layer performs the sums from equation (12) and the softmax from equation (8).
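A quick sketch (ours) verifying the bit-product gate of equation (13) with β = 25 for all 2³ inputs:

```python
import numpy as np
from itertools import product

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def bit_product(bits, beta=25.0):
    """Eq. (13): a single shifted, stretched sigmoid computes the product of k bits."""
    k = len(bits)
    return sigma(-beta * (k - 0.5 - sum(bits)))

for bits in product([0, 1], repeat=3):
    assert round(bit_product(bits)) == np.prod(bits)
print("all 8 bit triples verified")
```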

D. What Hamiltonians do we want to approximate?

We have seen that polynomials can be accurately approximated by neural networks using a number of neurons scaling either as the number of multiplications required (for the continuous case) or as the number of terms (for the binary case). But polynomials per se are no panacea: with binary input, all functions are polynomials, and with continuous input, there are (n + d)!/(n!d!) coefficients in a generic polynomial of degree d in n variables, which easily becomes unmanageably large. We will now discuss situations in which exceptionally simple polynomials that are sparse, symmetric and/or low-order play a special role in physics and machine-learning.

1. Low polynomial order

The Hamiltonians that show up in physics are not random functions, but tend to be polynomials of very low order, typically of degree ranging from 2 to 4. The simplest example is of course the harmonic oscillator, which is described by a Hamiltonian that is quadratic in both position and momentum. There are many reasons why low-order polynomials show up in physics. One of the most important is that a phenomenon can sometimes be studied perturbatively, in which case Taylor’s theorem suggests that we can get away with a low-order polynomial approximation. A second reason is renormalization: higher-order terms in the Hamiltonian of a statistical field theory tend to be negligible if we only observe macroscopic variables.

At a fundamental level, the Hamiltonian of the standard model of particle physics has d = 4. There are many approximations of this quartic Hamiltonian that are accurate in specific regimes, for example the Maxwell equations governing electromagnetism, the Navier-Stokes equations governing fluid dynamics, the Alfvén equations governing magnetohydrodynamics and various Ising models governing magnetization — all of these approximations have Hamiltonians that are polynomials in the field variables, of degree d ranging from 2 to 4.

This means that the number of polynomial coefficients in many examples is not infinite as in equation (9) or exponential in n as in equation (12), but merely of order O(n⁴).

There are additional reasons why we might expect low-order polynomials. Thanks to the Central Limit Theorem [11], many probability distributions in machine-learning and statistics can be accurately approximated by multivariate Gaussians, i.e., of the form

p(x) = e^{h + Σ_i h_i x_i − Σ_{ij} h_{ij} x_i x_j},    (14)

which means that the Hamiltonian H = −ln p is a quadratic polynomial. More generally, the maximum-entropy probability distribution subject to constraints on some of the lowest moments, say expectation values of the form ⟨x_1^{α_1} x_2^{α_2} ⋯ x_n^{α_n}⟩ for some integers α_i ≥ 0, would lead to a Hamiltonian of degree no greater than d ≡ Σ_i α_i [12].

Image classification tasks often exploit invariance under translation, rotation, and various nonlinear deformations of the image plane that move pixels to new locations. All such spatial transformations are linear functions (d = 1 polynomials) of the pixel vector x. Functions implementing convolutions and Fourier transforms are also d = 1 polynomials.

Of course, such arguments do not imply that we should expect to see low-order polynomials in every application. If we consider some data set generated by a very simple Hamiltonian (say the Ising Hamiltonian), but then discard some of the random variables, the resulting distribution will in general become quite complicated. Similarly, if we do not observe the random variables directly, but observe some generic functions of the random variables, the result will generally be a mess. These arguments, however, might indicate that the probability of encountering a Hamiltonian described by a low-order polynomial in some application might be significantly higher than what one might expect from some naive prior. For example, a uniform prior on the space of all polynomials of degree N would suggest that a randomly chosen polynomial would almost always have degree N, but this might be a bad prior for real-world applications.

We should also note that even if a Hamiltonian is described exactly by a low-order polynomial, we would not expect the corresponding neural network to reproduce a low-order polynomial Hamiltonian exactly in any practical scenario, for a host of possible reasons including limited data, the requirement of infinite weights for infinite accuracy, and the failure of practical algorithms such as stochastic gradient descent to find the global minimum of a cost function in many scenarios. So looking at the weights of a neural network trained on actual data may not be a good indicator of whether or not the underlying Hamiltonian is a polynomial of low degree.

2. Locality

One of the deepest principles of physics is locality: that things directly affect only what is in their immediate vicinity. When physical systems are simulated on a computer by discretizing space onto a rectangular lattice, locality manifests itself by allowing only nearest-neighbor interaction. In other words, almost all coefficients in equation (9) are forced to vanish, and the total number of non-zero coefficients grows only linearly with n. For the binary case of equation (9), which applies to magnetizations (spins) that can take one of two values, locality also limits the degree d to be no greater than the number of neighbors that a given spin is coupled to (since all variables in a polynomial term must be different).

Again, the applicability of these considerations to particular machine learning applications must be determined on a case by case basis. Certainly, an arbitrary transformation of a collection of local random variables will result in a non-local collection. (This might ruin locality in certain ensembles of images, for example.) But there are certainly cases in physics where locality is still approximately preserved: for example, in the simple block-spin renormalization group, spins are grouped into blocks, which are then treated as random variables. To a high degree of accuracy, these blocks are only coupled to their nearest neighbors. Such locality is famously exploited by both biological and artificial visual systems, whose first neuronal layer performs merely fairly local operations.

3. Symmetry

Whenever the Hamiltonian obeys some symmetry (is invariant under some transformation), the number of independent parameters required to describe it is further reduced. For instance, many probability distributions in both physics and machine learning are invariant under translation and rotation. As an example, consider a vector x of air pressures x_i measured by a microphone at times i = 1, ..., n. Assuming that the Hamiltonian describing it has d = 2 reduces the number of parameters N from ∞ to (n + 1)(n + 2)/2. Further assuming locality (nearest-neighbor couplings only) reduces this to N = 2n, after which requiring translational symmetry reduces the parameter count to N = 3. Taken together, the constraints on locality, symmetry and polynomial order reduce the number of continuous parameters in the Hamiltonian of the standard model of physics to merely 32 [13].
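The parameter counts quoted above can be tabulated directly; a trivial sketch with an illustrative n (our choice):

```python
n = 10_000                                     # number of microphone samples (illustrative)
generic_quadratic = (n + 1) * (n + 2) // 2     # arbitrary d = 2 Hamiltonian
local = 2 * n                                  # adding locality: nearest-neighbor couplings only
local_and_symmetric = 3                        # adding translational symmetry as well
print(generic_quadratic, local, local_and_symmetric)
```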

Symmetry can reduce not merely the parameter count, but also the computational complexity. For example, if a linear vector-valued function f(x) mapping a set of n variables onto itself happens to satisfy translational symmetry, then it is a convolution (implementable by a convolutional neural net; “convnet”), which means that it can be computed with n log₂ n rather than n² multiplications using the Fast Fourier Transform.
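A minimal sketch (ours, using NumPy’s FFT) of this speed-up: a translation-invariant linear map is a circulant matrix, and applying it via the FFT gives the same result as the dense n² matrix-vector product.

```python
import numpy as np

n = 64
rng = np.random.default_rng(0)
kernel = rng.standard_normal(n)     # defines the translation-invariant map
x = rng.standard_normal(n)

# Dense form: W[i, j] = kernel[(i - j) % n], applied with ~n^2 multiplications
W = np.array([[kernel[(i - j) % n] for j in range(n)] for i in range(n)])
direct = W @ x

# Same map as a circular convolution via the FFT: O(n log n) multiplications
via_fft = np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(x)).real

print(np.allclose(direct, via_fft))   # True
```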

III. WHY DEEP?

Above we investigated how probability distributions from physics and computer science applications lent themselves to “cheap learning”, being accurately and efficiently approximated by neural networks with merely a handful of layers. Let us now turn to the separate question of depth, i.e., the success of deep learning: what properties of real-world probability distributions cause efficiency to further improve when networks are made deeper? This question has been extensively studied from a mathematical point of view [14–16], but mathematics alone cannot fully answer it, because part of the answer involves physics. We will argue that the answer involves the hierarchical/compositional structure of generative processes together with the inability to efficiently “flatten” neural networks reflecting this structure.

A. Hierarchical processes

One of the most striking features of the physical world is its hierarchical structure. Spatially, it is an object hierarchy: elementary particles form atoms which in turn form molecules, cells, organisms, planets, solar systems, galaxies, etc. Causally, complex structures are frequently created through a distinct sequence of simpler steps.

Figure 3 gives two examples of such causal hierarchies generating data vectors y_0 ↦ y_1 ↦ ... ↦ y_n that are relevant to physics and image classification, respectively. Both examples involve a Markov chain⁴ where the probability distribution p(y_i) at the i-th level of the hierarchy is determined from its causal predecessor alone:

p_i = M_i p_{i−1},    (15)

where the probability vector p_i specifies the probability distribution of p(y_i) according to (p_i)_y ≡ p(y_i) and the Markov matrix M_i specifies the transition probabilities between two neighboring levels, p(y_i|y_{i−1}). Iterating equation (15) gives

p_n = M_n M_{n−1} ⋯ M_1 p_0,    (16)

so we can write the combined effect of the entire generative process as a matrix product.

⁴ If the next step in the generative hierarchy requires knowledge not merely of the present state but also of information about the past, the present state can be redefined to include this information as well, thus ensuring that the generative process is a Markov process.
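As a toy illustration of equation (16) (our own made-up dimensions, loosely echoing the cat/dog example below), composing column-stochastic Markov matrices yields the distribution of the generated data:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_markov(rows, cols):
    """Column-stochastic matrix M with M[i, j] = p(new state i | old state j)."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0)

p0 = np.array([0.5, 0.5])     # e.g. prior over "cat" vs "dog"
M1 = random_markov(4, 2)      # first generative step
M2 = random_markov(16, 4)     # second generative step
p2 = M2 @ M1 @ p0             # eq. (16): the whole hierarchy as one matrix product
print(p2.shape, round(p2.sum(), 12))   # (16,) 1.0 -- still a normalized distribution
```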

In our physics example (Figure 3, left), a set of cosmological parameters y_0 (the density of dark matter, etc.) determines the power spectrum y_1 of density fluctuations in our universe, which in turn determines the pattern of cosmic microwave background radiation y_2 reaching us from our early universe, which gets combined with foreground radio noise from our Galaxy to produce the frequency-dependent sky maps (y_3) that are recorded by a satellite-based telescope that measures linear combinations of different sky signals and adds electronic receiver noise. For the recent example of the Planck Satellite [17], these data sets y_1, y_2, ... contained about 10^1, 10^4, 10^8, 10^9 and 10^12 numbers, respectively.

More generally, if a given data set is generated by a (classical) statistical physics process, it must be described by an equation in the form of equation (16), since dynamics in classical physics is fundamentally Markovian: classical equations of motion are always first-order differential equations in the Hamiltonian formalism. This technically covers essentially all data of interest in the machine learning community, although the fundamental Markovian nature of the generative process of the data may be an inefficient description.

Our toy image classification example (Figure 3, right) is deliberately contrived and over-simplified for pedagogy: y_0 is a single bit signifying “cat or dog”, which determines a set of parameters determining the animal’s coloration, body shape, posture, etc. using appropriate probability distributions, which determine a 2D image via ray-tracing, which is scaled and translated by random amounts before a randomly generated background is added.

In both examples, the goal is to reverse this generative hierarchy to learn about the input y ≡ y_0 from the output y_n ≡ x, specifically to provide the best possible estimate of the probability distribution p(y|x) = p(y_0|y_n) — i.e., to determine the probability distribution for the cosmological parameters and to determine the probability that the image is a cat, respectively.

B. Resolving the swindle

This decomposition of the generative process into a hierarchy of simpler steps helps resolve the “swindle” paradox from the introduction: although the number of parameters required to describe an arbitrary function of the input data y is beyond astronomical, the generative process can be specified by a more modest number of parameters, because each of its steps can. Whereas specifying an arbitrary probability distribution over multi-megapixel images x requires far more bits than there are atoms in our universe, the information specifying how to compute the probability distribution p(x|y) for a microwave background map fits into a handful of published journal articles or software packages [18–24].


[Figure 3: two causal hierarchies y_0 ↦ y_1 ↦ y_2 ↦ y_3 ↦ y_4 = x with Markov steps M_1, ..., M_4 and distillation maps ŷ_i = T_i(x). Left (physics): cosmological parameters → power spectrum → CMB sky map → multi-frequency maps → telescope data, via steps labeled “generate fluctuations”, “simulate sky map”, “add foregrounds”, and “take linear combinations, add noise”. Right (image classification): category label (“cat or dog?”) → SOLIDWORKS parameters → ray-traced object → transformed object → final image, via steps labeled “select color, shape & posture”, “ray trace”, “scale & translate”, and “select background”. See the caption below.]

FIG. 3: Causal hierarchy examples relevant to physics (left) and image classification (right). As information flows down the hierarchy y_0 → y_1 → ... → y_n = x, some of it is destroyed by random Markov processes. However, no further information is lost as information flows optimally back up the hierarchy as ŷ_{n−1} → ... → ŷ_0. The right example is deliberately contrived and over-simplified for pedagogy; for example, translation and scaling are more naturally performed before ray tracing, which in turn breaks down into multiple steps.

For a megapixel image of a galaxy, its entire probability distribution is defined by the standard model of particle physics with its 32 parameters [13], which together specify the process transforming primordial hydrogen gas into galaxies.

The same parameter-counting argument can also be applied to all artificial images of interest to machine learning: for example, giving the simple low-information-content instruction “draw a cute kitten” to a random sample of artists will produce a wide variety of images y with a complicated probability distribution over colors, postures, etc., as each artist makes random choices at a series of steps. Even the pre-stored information about cat probabilities in these artists’ brains is modest in size.

Note that a random resulting image typically contains much more information than the generative process creating it; for example, the simple instruction “generate a random string of 10^9 bits” contains far fewer than 10^9 bits. Not only are the typical steps in the generative hierarchy specified by a non-astronomical number of parameters, but as discussed in Section II D, it is plausible that neural networks can implement each of the steps efficiently.⁵

⁵ Although our discussion is focused on describing probability distributions, which are not random, stochastic neural networks can generate random variables as well. In biology, spiking neurons provide a good random number generator, and in machine learning, stochastic architectures such as restricted Boltzmann machines [25] do the same.


A deep neural network stacking these simpler networks on top of one another would then implement the entire generative process efficiently. In summary, the data sets and functions we care about form a minuscule minority, and it is plausible that they can also be efficiently implemented by neural networks reflecting their generative process. So what is the remainder? Which are the data sets and functions that we do not care about?

Almost all images are indistinguishable from random noise, and almost all data sets and functions are indistinguishable from completely random ones. This follows from Borel’s theorem on normal numbers [26], which states that almost all real numbers have a string of decimals that would pass any randomness test, i.e., are indistinguishable from random noise. Simple parameter counting shows that deep learning (and our human brains, for that matter) would fail to implement almost all such functions, and training would fail to find any useful patterns. To thwart pattern-finding efforts, cryptography therefore aims to produce random-looking patterns. Although we might expect the Hamiltonians describing human-generated data sets such as drawings, text and music to be more complex than those describing simple physical systems, we should nonetheless expect them to resemble the natural data sets that inspired their creation much more than they resemble random functions.

C. Sufficient statistics and hierarchies

The goal of deep learning classifiers is to reverse the hierarchical generative process as well as possible, to make inferences about the input y from the output x. Let us now treat this hierarchical problem more rigorously using information theory.

Given P(y|x), a sufficient statistic T(x) is defined by the equation P(y|x) = P(y|T(x)) and has played an important role in statistics for almost a century [27]. All the information about y contained in x is contained in the sufficient statistic. A minimal sufficient statistic [27] is some sufficient statistic T∗ which is a sufficient statistic for all other sufficient statistics. This means that if T(x) is sufficient, then there exists some function f such that T∗(x) = f(T(x)). As illustrated in Figure 3, T∗ can be thought of as an information distiller, optimally compressing the data so as to retain all information relevant to determining y and discarding all irrelevant information.


The sufficient statistic formalism enables us to state some simple but important results that apply to any hierarchical generative process cast in the Markov chain form of equation (16).

Theorem 2: Given a Markov chain described by our notation above, let T_i be a minimal sufficient statistic of P(y_i|y_n). Then there exist functions f_i such that T_i = f_i ∘ T_{i+1}. More casually speaking, the generative hierarchy of Figure 3 can be optimally reversed one step at a time: there are functions f_i that optimally undo each of the steps, distilling out all information about the level above that was not destroyed by the Markov process. Here is the proof. Note that for any k ≥ 1, the “backwards” Markov property P(y_i|y_{i+1}, y_{i+k}) = P(y_i|y_{i+1}) follows from the Markov property via Bayes’ theorem:

P(y_i | y_{i+k}, y_{i+1}) = P(y_{i+k} | y_i, y_{i+1}) P(y_i | y_{i+1}) / P(y_{i+k} | y_{i+1})
                         = P(y_{i+k} | y_{i+1}) P(y_i | y_{i+1}) / P(y_{i+k} | y_{i+1})
                         = P(y_i | y_{i+1}).    (17)

Using this fact, we see that

P(y_i | y_n) = Σ_{y_{i+1}} P(y_i | y_{i+1}, y_n) P(y_{i+1} | y_n)
             = Σ_{y_{i+1}} P(y_i | y_{i+1}) P(y_{i+1} | T_{i+1}(y_n)).    (18)

Since the above equation depends on y_n only through T_{i+1}(y_n), this means that T_{i+1} is a sufficient statistic for P(y_i|y_n). But since T_i is the minimal sufficient statistic, there exists a function f_i such that T_i = f_i ∘ T_{i+1}.

Corollary 2: With the same assumptions and notation as Theorem 2, define the function f_0(T_0) = P(y_0|T_0) and f_n = T_{n−1}. Then

P(y_0|y_n) = (f_0 ∘ f_1 ∘ ⋯ ∘ f_n)(y_n).    (19)

The proof is easy. By induction,

T_0 = f_1 ∘ f_2 ∘ ⋯ ∘ T_{n−1},    (20)

which implies the corollary.

Roughly speaking, Corollary 2 states that the structure of the inference problem reflects the structure of the generative process. In this case, we see that the neural network trying to approximate P(y|x) must approximate a compositional function. We will argue below in Section III F that in many cases, this can only be accomplished efficiently if the neural network has ≳ n hidden layers.

In neuroscience parlance, the functions f_i compress the data into forms with ever more invariance [28], containing features invariant under irrelevant transformations (for example background substitution, scaling and translation).


Let us denote the distilled vectors ŷ_i ≡ f_i(ŷ_{i+1}), where ŷ_n ≡ x. As summarized by Figure 3, as information flows down the hierarchy y = y_0 → y_1 → ... → y_n = x, some of it is destroyed by random processes. However, no further information is lost as information flows optimally back up the hierarchy as x → ŷ_{n−1} → ... → ŷ_0.

D. Approximate information distillation

Although minimal sufficient statistics are often difficult to calculate in practice, it is frequently possible to come up with statistics which are nearly sufficient in a certain sense which we now explain.

An equivalent characterization of a sufficient statistic is provided by information theory [29, 30]. The data processing inequality [30] states that for any function f and any random variables x, y,

I(x, y) ≥ I(x, f(y)), (21)

where I is the mutual information:

I(x, y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ].    (22)

A sufficient statistic T(x) is a function f(x) for which “≥” gets replaced by “=” in equation (21), i.e., a function retaining all the information about y.
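A small numerical illustration (ours) of the data processing inequality (21): coarse-graining y with a non-invertible function f can only decrease the mutual information.

```python
import numpy as np

def mutual_information(pxy):
    """I(x; y) for a joint distribution given as an array pxy[i, j] = p(x=i, y=j)."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
pxy = rng.random((4, 6))
pxy /= pxy.sum()                              # a toy joint distribution p(x, y)

pxf = pxy.reshape(4, 3, 2).sum(axis=2)        # p(x, f(y)) with f(y) = y // 2

print(mutual_information(pxy) >= mutual_information(pxf))   # True, as eq. (21) requires
```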

Even information distillation functions f that are not strictly sufficient can be very useful as long as they distill out most of the relevant information and are computationally efficient. For example, it may be possible to trade some loss of mutual information for a dramatic reduction in the complexity of the Hamiltonian; e.g., H_y(f(x)) may be considerably easier to implement in a neural network than H_y(x). Precisely this situation applies to the physical example described in Figure 3, where a hierarchy of efficient near-perfect information distillers f_i has been found, with the numerical cost of f_3 [23, 24], f_2 [21, 22], f_1 [19, 20] and f_0 [17] scaling with the number of input parameters n as O(n), O(n^{3/2}), O(n²) and O(n³), respectively. More abstractly, the procedure of renormalization, ubiquitous in statistical physics, can be viewed as a special case of approximate information distillation, as we will now describe.

E. Distillation and renormalization

The systematic framework for distilling out desired information from unwanted “noise” in physical theories is known as Effective Field Theory [31]. Typically, the desired information involves relatively large-scale features that can be experimentally measured, whereas the noise involves unobserved microscopic scales. A key part of this framework is known as the renormalization group (RG) transformation [31, 32]. Although the connection between RG and machine learning has been studied or alluded to repeatedly [7, 33–36], there are significant misconceptions in the literature concerning the connection, which we will now attempt to clear up.

Let us first review a standard working definition of what renormalization is in the context of statistical physics, involving three ingredients: a vector x of random variables, a coarse-graining operation R and a requirement that this operation leaves the Hamiltonian invariant except for parameter changes. We think of x as the microscopic degrees of freedom — typically physical quantities defined at a lattice of points (pixels or voxels) in space. Its probability distribution is specified by a Hamiltonian H_y(x), with some parameter vector y. We interpret the map R : x → x as implementing a coarse-graining⁶ of the system. The random variable R(x) also has a Hamiltonian, denoted H′(R(x)), which we require to have the same functional form as the original Hamiltonian H_y, although the parameters y may change. In other words, H′(R(x)) = H_{r(y)}(R(x)) for some function r. Since the domain and the range of R coincide, this map R can be iterated n times, R^n = R ∘ R ∘ ⋯ ∘ R, giving a Hamiltonian H_{r^n(y)}(R^n(x)) for the repeatedly renormalized data. Similar to the case of sufficient statistics, P(y|R^n(x)) will then be a compositional function.

Contrary to some claims in the literature, effective field theory and the renormalization group have little to do with the idea of unsupervised learning and pattern-finding. Instead, the standard renormalization procedures in statistical physics are essentially a feature extractor for supervised learning, where the features typically correspond to long-wavelength/macroscopic degrees of freedom. In other words, effective field theory only makes sense if we specify what features we are interested in. For example, if we are given data x about the positions and momenta of particles inside a mole of some liquid and are tasked with predicting from this data whether or not Alice will burn her finger when touching the liquid, a (nearly) sufficient statistic is simply the temperature of the object, which can in turn be obtained from some very coarse-grained degrees of freedom (for example, one could use the fluid approximation instead of working directly from the positions and momenta of ∼ 10^23 particles). But without specifying what we wish to predict (long-wavelength physics), there is nothing natural about an effective field theory approximation.

⁶ A typical renormalization scheme for a lattice system involves replacing many spins (bits) with a single spin according to some rule. In this case, it might seem that the map R could not possibly map its domain onto itself, since there are fewer degrees of freedom after the coarse-graining. On the other hand, if we let the domain and range of R differ, we cannot easily talk about the Hamiltonian as having the same functional form, since the renormalized Hamiltonian would have a different domain than the original Hamiltonian. Physicists get around this by taking the limit where the lattice is infinitely large, so that R maps an infinite lattice to an infinite lattice.

To be more explicit about the link between renormalization and deep learning, consider a toy model for natural images. Each image is described by an intensity field φ(r), where r is a 2-dimensional vector. We assume that an ensemble of images can be described by a quadratic Hamiltonian of the form

H_y(φ) = ∫ [ y_0 φ² + y_1 (∇φ)² + y_2 (∇²φ)² + ⋯ ] d²r.    (23)

Each parameter vector y defines an ensemble of images; we could imagine that the fictitious classes of images that we are trying to distinguish are all generated by Hamiltonians H_y with the same above form but different parameter vectors y. We further assume that the function φ(r) is specified on pixels that are sufficiently close that derivatives can be well-approximated by differences. Derivatives are linear operations, so they can be implemented in the first layer of a neural network. The translational symmetry of equation (23) allows it to be implemented with a convnet. It can be shown [31] that for any coarse-graining operation that replaces each block of b×b pixels by its average and divides the result by b², the Hamiltonian retains the form of equation (23) but with the parameters y_i replaced by

y′_i = b^{2−2i} y_i.    (24)

This means that all parameters y_i with i ≥ 2 decay exponentially with b as we repeatedly renormalize and b keeps increasing, so that for modest b, one can neglect all but the first few y_i’s. What would have taken an arbitrarily large neural network can now be computed on a neural network of finite and bounded size, assuming that we are only interested in classifying the data based on the coarse-grained variables. These insufficient statistics will still have discriminatory power if we are only interested in discriminating Hamiltonians which all differ in their first few C_k. In this example, the parameters y_0 and y_1 are termed “relevant operators” by physicists and “signal” by machine-learners, whereas the remaining parameters are termed “irrelevant operators” by physicists and “noise” by machine-learners.
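The flow of equation (24) is easy to tabulate; a tiny sketch (ours) with all couplings initialized to 1 and block size b = 2:

```python
b = 2                                    # coarse-grain by averaging b x b pixel blocks
y = [1.0, 1.0, 1.0, 1.0]                 # y_0, y_1, y_2, y_3 of eq. (23)
for step in range(1, 6):
    y = [b**(2 - 2 * i) * yi for i, yi in enumerate(y)]
    print(step, ["%.3g" % yi for yi in y])
# y_0 grows (relevant), y_1 is marginal, y_2 and y_3 decay exponentially (irrelevant)
```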

The fixed point structure of the transformation in this example is very simple, but one can imagine that in more complicated problems the fixed point structure of various transformations might be highly non-trivial. This is certainly the case in statistical mechanics problems where renormalization methods are used to classify various phases of matter; the point here is that the renormalization group flow can be thought of as solving the pattern-recognition problem of classifying the long-range behavior of various statistical systems.

In summary, renormalization can be thought of as a type of supervised learning⁷, where the large-scale properties of the system are considered the features. If the desired features are not large-scale properties (as in most machine learning cases), one might still expect a generalized formalism of renormalization to provide some intuition for the problem by replacing a scale transformation with some other transformation. But calling some procedure renormalization or not is ultimately a matter of semantics; what remains to be seen is whether or not semantics has teeth, namely, whether the intuition about fixed points of the renormalization group flow can provide concrete insight into machine learning algorithms. In many numerical methods, the purpose of the renormalization group is to efficiently and accurately evaluate the free energy of the system as a function of macroscopic variables of interest such as temperature and pressure. Thus we can only sensibly talk about the accuracy of an RG-scheme once we have specified what macroscopic variables we are interested in.

⁷ A subtlety regarding the above statements is presented by the Multi-scale Entanglement Renormalization Ansatz (MERA) [37]. MERA can be viewed as a variational class of wave functions whose parameters can be tuned to match a given wave function as closely as possible. From this perspective, MERA is an unsupervised machine learning algorithm, where classical probability distributions over many variables are replaced with quantum wavefunctions. Due to the special tensor network structure found in MERA, the resulting variational approximation of a given wavefunction has an interpretation as generating an RG flow. Hence this is an example of an unsupervised learning problem whose solution gives rise to an RG flow. This is only possible due to the extra mathematical structure in the problem (the specific tensor network found in MERA); a generic variational Ansatz does not give rise to any RG interpretation and vice versa.

F. No-flattening theorems

Above we discussed how Markovian generative models cause p(x|y) to be a composition of a number of simpler functions f_i. Suppose that we can approximate each function f_i with an efficient neural network for the reasons given in Section II. Then we can simply stack these networks on top of each other, to obtain a deep neural network that efficiently approximates p(x|y).

But is this the most efficient way to represent p(x|y)? Since we know that there are shallower networks that accurately approximate it, are any of these shallow networks as efficient as the deep one, or does flattening necessarily come at an efficiency cost?

To be precise, for a neural network f defined by equation (6), we will say that the neural network f_ℓ^ε is the flattened version of f if its number ℓ of hidden layers is smaller and f_ℓ^ε approximates f within some error ε (as measured by some reasonable norm).

7 A subtlety regarding the above statements is presented by the Multi-scale Entanglement Renormalization Ansatz (MERA) [37]. MERA can be viewed as a variational class of wave functions whose parameters can be tuned to match a given wave function as closely as possible. From this perspective, MERA is an unsupervised machine learning algorithm, where classical probability distributions over many variables are replaced with quantum wavefunctions. Due to the special tensor network structure found in MERA, the resulting variational approximation of a given wavefunction has an interpretation as generating an RG flow. Hence this is an example of an unsupervised learning problem whose solution gives rise to an RG flow. This is only possible due to the extra mathematical structure in the problem (the specific tensor network found in MERA); a generic variational Ansatz does not give rise to any RG interpretation and vice versa.



We say that f_ℓ^ε is a neuron-efficient flattening if the sum of the dimensions of its hidden layers (sometimes referred to as the number of neurons N_n) is less than that of f. We say that f_ℓ^ε is a synapse-efficient flattening if the number N_s of non-zero entries (sometimes called synapses) in its weight matrices is less than that of f. This lets us define the flattening cost of a network f as the two functions

C_n(f, \ell, \epsilon) \equiv \min_{f_\ell^\epsilon} \frac{N_n(f_\ell^\epsilon)}{N_n(f)},    (25)

C_s(f, \ell, \epsilon) \equiv \min_{f_\ell^\epsilon} \frac{N_s(f_\ell^\epsilon)}{N_s(f)},    (26)

specifying the factor by which optimal flattening increases the neuron count and the synapse count, respectively. We refer to results where C_n > 1 or C_s > 1 for some class of functions f as “no-flattening theorems”, since they imply that flattening comes at a cost and efficient flattening is impossible. A complete list of no-flattening theorems would show exactly when deep networks are more efficient than shallow networks.
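For concreteness, here is a small sketch of how N_n and N_s can be counted, assuming a layered linear network stored as a list of weight matrices, each of shape (outputs, inputs); the helper name network_size and the toy widths are ours:

import numpy as np

def network_size(weight_matrices):
    """Count N_n (total hidden-layer neurons) and N_s (nonzero synapses)."""
    n_neurons = sum(W.shape[0] for W in weight_matrices[:-1])   # hidden layers only
    n_synapses = sum(int(np.count_nonzero(W)) for W in weight_matrices)
    return n_neurons, n_synapses

# a toy 3-layer net (hidden widths 8 and 4) versus its flattened single-matrix product
deep = [np.random.randn(8, 16), np.random.randn(4, 8), np.random.randn(2, 4)]
flat = [deep[2] @ deep[1] @ deep[0]]
print(network_size(deep), network_size(flat))   # (12, 168) versus (0, 32)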

There has already been very interesting progress in this spirit, but crucial questions remain. On one hand, it has been shown that deep is not always better, at least empirically for some image classification tasks [38]. On the other hand, many functions f have been found for which the flattening cost is significant. Certain deep Boolean circuit networks are exponentially costly to flatten [39]. Two families of multivariate polynomials with an exponential flattening cost C_n are constructed in [14]. [6, 15, 16] focus on functions that have a tree-like hierarchical compositional form, concluding that the flattening cost C_n is exponential for almost all functions in Sobolev space. For the ReLU activation function, [40] finds a class of functions that exhibit exponential flattening costs; [41] studies a tailored complexity measure of deep versus shallow ReLU networks. [42] shows that given weak conditions on the activation function, there always exists at least one function that can be implemented in a 3-layer network which has an exponential flattening cost. Finally, [43, 44] study the differential geometry of shallow versus deep networks, and find that flattening is exponentially neuron-inefficient. Further work elucidating the cost of flattening various classes of functions will clearly be highly valuable.

G. Linear no-flattening theorems

In the meantime, we will now see that interesting no-flattening results can be obtained even in the simpler-to-model context of linear neural networks [45], where the σ operators are replaced with the identity and all biases are set to zero, so that the A_i are simply linear operators (matrices). Every map is then specified by a matrix of real (or complex) numbers, and composition is implemented by matrix multiplication.

One might suspect that such a network is so simple that the questions concerning flattening become entirely trivial: after all, successive multiplication by n different matrices is equivalent to multiplying by a single matrix (their product). While the effect of flattening is indeed trivial for expressibility (f can express any linear function, independently of how many layers there are), this is not the case for learnability, which involves nonlinear and complex dynamics despite the linearity of the network [45]. We will show that the efficiency of such linear networks is also a very rich question.
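The following toy sketch (our own illustration, not the analysis of [45]) makes the point explicit: a two-layer linear network can only ever express a linear map, yet its gradient-descent updates couple the two weight matrices nonlinearly:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 200))                 # 10 inputs, 200 samples
T = rng.standard_normal((5, 10)) @ X               # targets from a linear "teacher"

# two-layer linear net y = W2 @ W1 @ x
W1 = 0.1 * rng.standard_normal((7, 10))
W2 = 0.1 * rng.standard_normal((5, 7))
lr = 0.05
for _ in range(3000):
    E = W2 @ W1 @ X - T                            # residual
    gW2 = E @ (W1 @ X).T / X.shape[1]              # gradients couple W1 and W2,
    gW1 = W2.T @ E @ X.T / X.shape[1]              #   so the dynamics are nonlinear
    W2 -= lr * gW2
    W1 -= lr * gW1
print(np.linalg.norm(W2 @ W1 @ X - T) / np.linalg.norm(T))  # shrinks toward zero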

Neuronal efficiency is trivially attainable for linear networks, since all hidden-layer neurons can be eliminated without accuracy loss by simply multiplying all the weight matrices together. We will instead consider the case of synaptic efficiency and set ℓ = ε = 0.

Many divide-and-conquer algorithms in numerical linear algebra exploit some factorization of a particular matrix A in order to yield significant reductions in complexity. For example, when A represents the discrete Fourier transform (DFT), the fast Fourier transform (FFT) algorithm makes use of a sparse factorization of A which contains only O(n log n) non-zero matrix elements, instead of the naive single-layer implementation, which contains n^2 non-zero matrix elements. As first pointed out in [46], this is an example where depth helps and, in our terminology, of a linear no-flattening theorem: fully flattening a network that performs an FFT of n variables increases the synapse count N_s from O(n log n) to O(n^2), i.e., incurs a flattening cost C_s = O(n/log n). This argument also applies to many variants and generalizations of the FFT, such as the Fast Wavelet Transform and the Fast Walsh-Hadamard Transform.
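As an illustration, the sketch below builds the standard radix-2 Cooley–Tukey factorization of the DFT matrix as a product of log_2 N sparse “layers” plus a bit-reversal permutation, and compares its nonzero count with the N^2 nonzeros of the dense matrix (the helper names are ours; the factorization itself is the textbook one):

import numpy as np

def dft_matrix(N):
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.exp(-2j * np.pi * i * j / N)

def butterfly_factors(N):
    """Sparse factors A_t, ..., A_1, P with F_N = A_t @ ... @ A_1 @ P."""
    t = int(np.log2(N))
    assert 2 ** t == N, "N must be a power of two"
    P = np.zeros((N, N))                                   # bit-reversal permutation
    for i in range(N):
        P[i, int(format(i, f"0{t}b")[::-1], 2)] = 1
    factors = [P]
    for s in range(1, t + 1):
        L = 2 ** s
        D = np.diag(np.exp(-2j * np.pi * np.arange(L // 2) / L))
        B = np.block([[np.eye(L // 2), D], [np.eye(L // 2), -D]])
        factors.append(np.kron(np.eye(N // L), B))         # butterfly layer A_s
    return factors[::-1]

N = 16
factors = butterfly_factors(N)
print(np.allclose(np.linalg.multi_dot(factors), dft_matrix(N)))     # True
print(sum(np.count_nonzero(A) for A in factors), N ** 2)            # ~2N log2 N + N vs N^2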

Another important example illustrating the subtlety of linear networks is matrix multiplication. More specifically, take the input of a neural network to be the entries of a matrix M and the output to be NM, where both M and N have size n × n. Since matrix multiplication is linear, this can be exactly implemented by a 1-layer linear neural network. Amazingly, the naive algorithm for matrix multiplication, which requires n^3 multiplications, is not optimal: the Strassen algorithm [47] requires only O(n^ω) multiplications (synapses), where ω = log_2 7 ≈ 2.81, and recent work has cut this scaling exponent down to ω ≈ 2.3728639 [48]. This means that fully optimized matrix multiplication on a deep neural network has a flattening cost of at least C_s = O(n^{0.6271361}).
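For reference, a compact recursive implementation of Strassen's seven-multiplication scheme [47] (the leaf cutoff below which we fall back to ordinary multiplication is an arbitrary choice of ours):

import numpy as np

def strassen(A, B, leaf=64):
    """Multiply square matrices using 7 recursive sub-multiplications per level."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

n = 256
A, B = np.random.randn(n, n), np.random.randn(n, n)
print(np.allclose(strassen(A, B), A @ B))   # True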

Low-rank matrix multiplication gives a more elementary no-flattening theorem. If A is an n × n matrix of rank k, we can factor it as A = BC, where B is an n × k matrix and C is a k × n matrix. Hence the number of synapses is n^2 for an ℓ = 0 network and 2nk for an ℓ = 1 network, giving a flattening cost C_s = n/2k > 1 as long as the rank k < n/2.
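A two-line check of this count (the values n = 64, k = 8 are arbitrary examples):

import numpy as np

n, k = 64, 8
B = np.random.randn(n, k)          # first layer:  k neurons, n*k synapses
C = np.random.randn(k, n)          # second layer: n outputs, k*n synapses
A = B @ C                          # the rank-k map a flat network must store densely
print(A.size, B.size + C.size)     # n^2 = 4096  versus  2nk = 1024
print(A.size / (B.size + C.size))  # flattening cost C_s = n/(2k) = 4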



Finally, let us consider flattening a network f = AB, where A and B are random sparse n × n matrices such that each element is 1 with probability p and 0 with probability 1 − p. Flattening the network results in a matrix F_{ij} = \sum_k A_{ik} B_{kj}, so the probability that F_{ij} = 0 is (1 − p^2)^n. Hence the number of non-zero components will on average be [1 − (1 − p^2)^n] n^2, so

C_s = \frac{\left[1 - (1-p^2)^n\right] n^2}{2 n^2 p} = \frac{1 - (1-p^2)^n}{2p}.    (27)

Note that C_s ≤ 1/(2p) and that this bound is asymptotically saturated for n ≫ 1/p^2. Hence in the limit where n is very large, flattening multiplication by sparse matrices (p ≪ 1) is horribly inefficient.
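A quick Monte Carlo check of equation (27) (the parameter values are arbitrary; the empirical ratio uses the realized synapse counts of A and B rather than their expectation 2n^2 p):

import numpy as np

def empirical_flattening_cost(n, p, trials=20):
    """Nonzeros of the flattened product AB divided by nonzeros of A and B."""
    ratios = []
    for _ in range(trials):
        A = (np.random.rand(n, n) < p).astype(int)
        B = (np.random.rand(n, n) < p).astype(int)
        F = A @ B
        ratios.append(np.count_nonzero(F) / (np.count_nonzero(A) + np.count_nonzero(B)))
    return np.mean(ratios)

n, p = 400, 0.05
predicted = (1 - (1 - p ** 2) ** n) / (2 * p)
print(empirical_flattening_cost(n, p), predicted)   # the two values should be close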

H. A polynomial no-flattening theorem

In Section II, we saw that multiplication of two variables could be implemented by a flat neural network with 4 neurons in the hidden layer, using equation (11) as illustrated in Figure 2. In Appendix A, we show that equation (11) is merely the n = 2 special case of the formula

\prod_{i=1}^n x_i = \frac{1}{2^n} \sum_{\{s\}} s_1 \cdots s_n\, \sigma(s_1 x_1 + \cdots + s_n x_n),    (28)

where the sum is over all possible 2^n configurations of s_1, ..., s_n, where each s_i can take on the values ±1. In other words, multiplication of n variables can be implemented by a flat network with 2^n neurons in the hidden layer. We also prove in Appendix A that this is the best one can do: no neural network can implement an n-input multiplication gate using fewer than 2^n neurons in the hidden layer. This is another powerful no-flattening theorem, telling us that polynomials are exponentially expensive to flatten. For example, if n is a power of two, then the monomial x_1 x_2 ... x_n can be evaluated by a deep network using only 4n neurons, with copies of the multiplication gate from Figure 2 arranged in a binary tree with log_2 n layers (the 5th neuron at the top of Figure 2 need not be counted, as it is the input to whatever computation comes next). In contrast, a functionally equivalent flattened network requires a whopping 2^n neurons. For example, a deep neural network can multiply 32 numbers using 4n = 128 neurons, while a shallow one requires 2^32 = 4,294,967,296 neurons. Since a broad class of real-world functions can be well approximated by polynomials, this helps explain why many useful neural networks cannot be efficiently flattened.
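The construction can be checked numerically. The sketch below follows the weight choice (A4) of Appendix A with σ = exp (so σ_n = 1/n!) and rescales the inputs by a small λ so that Taylor terms of degree greater than n become negligible; the function name, the choice of σ, and the parameter values are ours:

import itertools
from math import factorial
import numpy as np

def shallow_product(x, lam=1e-2):
    """Approximate prod(x) with a single hidden layer of 2^n neurons,
    one per sign pattern, using the weights of equation (A4)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sigma_n = 1.0 / factorial(n)            # n-th Taylor coefficient of exp
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        s = np.array(signs)
        total += np.prod(s) * np.exp(lam * np.dot(s, x))   # one hidden neuron
    return total / (2 ** n * factorial(n) * sigma_n * lam ** n)

x = [0.3, -1.2, 0.7, 2.0]
print(shallow_product(x), np.prod(x))    # agree up to small higher-order corrections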

IV. CONCLUSIONS

We have shown that the success of deep and cheap (low-parameter-count) learning depends not only on mathematics but also on physics, which favors certain classes of exceptionally simple probability distributions that deep learning is uniquely suited to model. We argued that the success of shallow neural networks hinges on symmetry, locality, and polynomial log-probability in data from or inspired by the natural world, which favors sparse low-order polynomial Hamiltonians that can be efficiently approximated. These arguments should be particularly relevant for explaining the success of machine-learning applications to physics, for example using a neural network to approximate a many-body wavefunction [49]. Whereas previous universality theorems guarantee that there exists a neural network that approximates any smooth function to within an error ε, they cannot guarantee that the size of the neural network does not grow to infinity with shrinking ε or that the activation function σ does not become pathological. We show constructively that given a multivariate polynomial and any generic non-linearity, a neural network with a fixed size and a generic smooth activation function can indeed approximate the polynomial highly efficiently.

Turning to the separate question of depth, we have argued that the success of deep learning depends on the ubiquity of hierarchical and compositional generative processes in physics and other machine-learning applications. By studying the sufficient statistics of the generative process, we showed that the inference problem requires approximating a compositional function of the form f_1 ◦ f_2 ◦ f_3 ◦ · · · that optimally distills out the information of interest from irrelevant noise in a hierarchical process that mirrors the generative process. Although such compositional functions can be efficiently implemented by a deep neural network as long as their individual steps can, it is generally not possible to retain the efficiency while flattening the network. We extend existing “no-flattening” theorems [14–16] by showing that efficient flattening is impossible even for many important cases involving linear networks. In particular, we prove that flattening polynomials is exponentially expensive, with 2^n neurons required to multiply n numbers using a single hidden layer, a task that a deep network can perform using only ∼ 4n neurons.

Strengthening the analytic understanding of deep learning may suggest ways of improving it, both to make it more capable and to make it more robust. One promising area is to prove sharper and more comprehensive no-flattening theorems, placing lower and upper bounds on the cost of flattening networks implementing various classes of functions.

Acknowledgements: This work was supported by the Foundational Questions Institute http://fqxi.org/, the Rothberg Family Fund for Cognitive Science and NSF grant 1122374. We thank Scott Aaronson, Frank Ban, Yoshua Bengio, Rico Jonschkowski, Tomaso Poggio, Bart Selman, Viktoriya Krakovna, Krishanu Sankar and Boya Song for helpful discussions and suggestions, Frank Ban,



Fernando Perez, Jared Jolton, and the anonymous referee for helpful corrections, and the Center for Brains, Minds, and Machines (CBMM) for hospitality.

Appendix A: The polynomial no-flattening theorem

We saw above that a neural network can compute polynomials accurately and efficiently at linear cost, using only about 4 neurons per multiplication. For example, if n is a power of two, then the monomial \prod_{i=1}^n x_i can be evaluated using 4n neurons arranged in a binary tree network with log_2 n hidden layers. In this appendix, we will prove a no-flattening theorem demonstrating that flattening polynomials is exponentially expensive:

Theorem: Suppose we are using a generic smooth activation function σ(x) = \sum_{k=0}^\infty \sigma_k x^k, where σ_k ≠ 0 for 0 ≤ k ≤ n. Then for any desired accuracy ε > 0, there exists a neural network that can implement the function \prod_{i=1}^n x_i using a single hidden layer of 2^n neurons. Furthermore, this is the smallest possible number of neurons in any such network with only a single hidden layer.

This result may be compared to problems in Boolean circuit complexity, notably the question of whether TC^0 = TC^1 [50]. Here circuit depth is analogous to the number of layers, and the number of gates is analogous to the number of neurons. In both the Boolean circuit model and the neural network model, one is allowed to use neurons/gates which have an unlimited number of inputs. The constraint in the definition of TC^i that each of the gate elements be from a standard universal library (AND, OR, NOT, Majority) is analogous to our constraint to use a particular nonlinear function. Note, however, that our theorem is weaker in that it applies only to depth 1, while TC^0 includes all circuits of depth O(1).

1. Proof that 2^n neurons are sufficient

A neural network with a single hidden layer of m neurons that approximates a product gate for n inputs can be formally written as a choice of constants a_{ij} and w_j satisfying

\sum_{j=1}^m w_j\, \sigma\!\left(\sum_{i=1}^n a_{ij} x_i\right) \approx \prod_{i=1}^n x_i.    (A1)

Here, we use ≈ to denote that the two sides of (A1) have identical Taylor expansions up to terms of degree n; as we discussed earlier in our construction of a product gate for two inputs, this enables us to achieve arbitrary accuracy ε by first scaling down the factors x_i, then approximately multiplying them, and finally scaling up the result.

We may expand (A1) using the definition σ(x) = \sum_{k=0}^\infty \sigma_k x^k and drop terms of the Taylor expansion with degree greater than n, since they do not affect the approximation. Thus, we wish to find the minimal m such that there exist constants a_{ij} and w_j satisfying

\sigma_n \sum_{j=1}^m w_j \left(\sum_{i=1}^n a_{ij} x_i\right)^n = \prod_{i=1}^n x_i,    (A2)

\sigma_k \sum_{j=1}^m w_j \left(\sum_{i=1}^n a_{ij} x_i\right)^k = 0,    (A3)

for all 0 ≤ k ≤ n − 1. Let us set m = 2^n, and enumerate the subsets of {1, . . . , n} as S_1, . . . , S_m in some order. Define a network of m neurons in a single hidden layer by setting a_{ij} equal to the function s_i(S_j), which is −1 if i ∈ S_j and +1 otherwise, and setting

w_j \equiv \frac{1}{2^n n!\,\sigma_n} \prod_{i=1}^n a_{ij} = \frac{(-1)^{|S_j|}}{2^n n!\,\sigma_n}.    (A4)

In other words, up to an overall normalization constant, all coefficients a_{ij} and w_j equal ±1, and each weight w_j is simply the product of the corresponding a_{ij}.

We must prove that this network indeed satisfies equations (A2) and (A3). The essence of our proof will be to expand the left-hand side of equation (A1) and show that all monomial terms except x_1 · · · x_n come in pairs that cancel. To show this, consider a single monomial p(x) = x_1^{r_1} · · · x_n^{r_n}, where r_1 + . . . + r_n = r ≤ n.

If p(x) ≠ \prod_{i=1}^n x_i, then we must show that the coefficient of p(x) in \sigma_r \sum_{j=1}^m w_j (\sum_{i=1}^n a_{ij} x_i)^r is 0. Since p(x) ≠ \prod_{i=1}^n x_i, there must be some i_0 such that r_{i_0} = 0. In other words, p(x) does not depend on the variable x_{i_0}. Since the sum in equation (A1) is over all combinations of ± signs for all variables, every term will be canceled by another term where the (non-present) x_{i_0} has the opposite sign and the weight w_j has the opposite sign:

\sigma_r \sum_{j=1}^m w_j \left(\sum_{i=1}^n a_{ij} x_i\right)^r
= \frac{\sigma_r}{2^n n!\,\sigma_n} \sum_{S_j} (-1)^{|S_j|} \left(\sum_{i=1}^n s_i(S_j)\, x_i\right)^r

= \frac{\sigma_r}{2^n n!\,\sigma_n} \sum_{S_j \not\ni i_0} \left[ (-1)^{|S_j|} \left(\sum_{i=1}^n s_i(S_j)\, x_i\right)^r + (-1)^{|S_j \cup \{i_0\}|} \left(\sum_{i=1}^n s_i(S_j \cup \{i_0\})\, x_i\right)^r \right]

= \frac{\sigma_r}{2^n n!\,\sigma_n} \sum_{S_j \not\ni i_0} (-1)^{|S_j|} \left[ \left(\sum_{i=1}^n s_i(S_j)\, x_i\right)^r - \left(\sum_{i=1}^n s_i(S_j \cup \{i_0\})\, x_i\right)^r \right].



Observe that the coefficient of p(x) is the same in (\sum_{i=1}^n s_i(S_j) x_i)^r and (\sum_{i=1}^n s_i(S_j \cup \{i_0\}) x_i)^r, since r_{i_0} = 0. Therefore, the overall coefficient of p(x) in the above expression must vanish, which implies that (A3) is satisfied.

If instead p(x) = \prod_{i=1}^n x_i, then the coefficient of p(x) in (\sum_{i=1}^n a_{ij} x_i)^n is n! \prod_{i=1}^n a_{ij} = (−1)^{|S_j|} n!, because all n! contributing terms are identical and there is no cancelation. Hence, the coefficient of p(x) on the left-hand side of (A2) is

\sigma_n \sum_{j=1}^m \frac{(-1)^{|S_j|}}{2^n n!\,\sigma_n}\, (-1)^{|S_j|}\, n! = 1,

completing our proof that this network indeed approximates the desired product gate.

From the standpoint of group theory, our construction involves a representation of the group G = Z_2^n, acting upon the space of polynomials in the variables x_1, x_2, . . . , x_n. The group G is generated by elements g_i such that g_i flips the sign of x_i wherever it occurs. Then, our construction corresponds to the computation

f(x_1, . . . , x_n) = (1 − g_1)(1 − g_2) \cdots (1 − g_n)\, \sigma(x_1 + x_2 + \cdots + x_n).

Every monomial of degree at most n, with the exception of the product x_1 · · · x_n, is sent to 0 by (1 − g_i) for at least one choice of i. Therefore, f(x_1, . . . , x_n) approximates a product gate (up to a normalizing constant).

2. Proof that 2^n neurons are necessary

Suppose that S is a subset of {1, . . . , n} and consider taking the partial derivatives of (A2) and (A3), respectively, with respect to all the variables {x_h}_{h∈S}. Then we obtain

\frac{n!\,\sigma_n}{(n-|S|)!} \sum_{j=1}^m w_j \prod_{h\in S} a_{hj} \left(\sum_{i=1}^n a_{ij} x_i\right)^{n-|S|} = \prod_{h\notin S} x_h,    (A5)

\frac{k!\,\sigma_k}{(k-|S|)!} \sum_{j=1}^m w_j \prod_{h\in S} a_{hj} \left(\sum_{i=1}^n a_{ij} x_i\right)^{k-|S|} = 0,    (A6)

for all 0 ≤ k ≤ n − 1. Let A denote the 2^n × m matrix with elements

A_{S,j} \equiv \prod_{h\in S} a_{hj}.    (A7)

We will show that A has full row rank. Suppose, towards contradiction, that c^t A = 0 for some non-zero vector c. Specifically, suppose that there is a linear dependence between rows of A given by

\sum_{\ell=1}^r c_\ell A_{S_\ell, j} = 0,    (A8)

where the S_ℓ are distinct and c_ℓ ≠ 0 for every ℓ. Let s be the maximal cardinality of any S_ℓ. Defining the vector d whose components are

d_j \equiv w_j \left(\sum_{i=1}^n a_{ij} x_i\right)^{n-s},    (A9)

taking the dot product of equation (A8) with d gives

0 = c^t A\, d = \sum_{\ell=1}^r c_\ell \sum_{j=1}^m w_j \prod_{h\in S_\ell} a_{hj} \left(\sum_{i=1}^n a_{ij} x_i\right)^{n-s}

= \sum_{\ell\,:\,|S_\ell|=s} c_\ell \sum_{j=1}^m w_j \prod_{h\in S_\ell} a_{hj} \left(\sum_{i=1}^n a_{ij} x_i\right)^{n-|S_\ell|}    (A10)

+ \sum_{\ell\,:\,|S_\ell|<s} c_\ell \sum_{j=1}^m w_j \prod_{h\in S_\ell} a_{hj} \left(\sum_{i=1}^n a_{ij} x_i\right)^{(n+|S_\ell|-s)-|S_\ell|}.

Applying equation (A6) (with k = n + |S_ℓ| − s) shows that the second term vanishes. Substituting equation (A5) now simplifies equation (A10) to

0 = \sum_{\ell\,:\,|S_\ell|=s} c_\ell\, \frac{(n-|S_\ell|)!}{n!\,\sigma_n} \prod_{h\notin S_\ell} x_h,    (A11)

i.e., to a statement that a set of monomials is linearly dependent. Since distinct monomials are in fact linearly independent, this contradicts our assumption that the S_ℓ are distinct and the c_ℓ are nonzero. We conclude that A has full row rank, and therefore that m ≥ 2^n, which concludes the proof.
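For small n, the rank statement can be confirmed directly. The sketch below builds the 2^n × 2^n matrix A_{S,j} = \prod_{h∈S} a_{hj} for the weight choice a_{ij} = s_i(S_j) of the sufficiency proof and checks that it has full rank (the variable names are ours):

import itertools
import numpy as np

n = 4
subsets = [set(c) for r in range(n + 1) for c in itertools.combinations(range(n), r)]
m = len(subsets)                      # 2^n columns, one per neuron / sign pattern
# a[i, j] = s_i(S_j): -1 if i is in S_j, +1 otherwise
a = np.array([[-1 if i in Sj else 1 for Sj in subsets] for i in range(n)])
# A[S, j] = product of a[h, j] over h in S (empty product = 1)
A = np.array([[np.prod(a[list(S), j]) if S else 1 for j in range(m)] for S in subsets])
print(A.shape, np.linalg.matrix_rank(A))   # (16, 16) 16: full row rank, so m >= 2^n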

[1] Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
[2] Y. Bengio, Foundations and Trends in Machine Learning 2, 1 (2009).
[3] S. Russell, D. Dewey, and M. Tegmark, AI Magazine 36 (2015).
[4] R. Herbrich and R. C. Williamson, Journal of Machine Learning Research 3, 175 (2002).
[5] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony, IEEE Transactions on Information Theory 44, 1926 (1998).
[6] T. Poggio, F. Anselmi, and L. Rosasco, Tech. Rep., Center for Brains, Minds and Machines (CBMM) (2015).
[7] P. Mehta and D. J. Schwab, ArXiv e-prints (2014), 1410.3831.
[8] K. Hornik, M. Stinchcombe, and H. White, Neural Networks 2, 359 (1989).
[9] G. Cybenko, Mathematics of Control, Signals and Systems 2, 303 (1989).
[10] A. Pinkus, Acta Numerica 8, 143 (1999).
[11] B. Gnedenko, A. Kolmogorov, B. Gnedenko, and A. Kolmogorov, Amer. J. Math. 105, 28 (1954).
[12] E. T. Jaynes, Physical Review 106, 620 (1957).
[13] M. Tegmark, A. Aguirre, M. J. Rees, and F. Wilczek, Physical Review D 73, 023505 (2006).
[14] O. Delalleau and Y. Bengio, in Advances in Neural Information Processing Systems (2011), pp. 666–674.
[15] H. Mhaskar, Q. Liao, and T. Poggio, ArXiv e-prints (2016), 1603.00988.
[16] H. Mhaskar and T. Poggio, arXiv preprint arXiv:1608.03287 (2016).
[17] R. Adam, P. Ade, N. Aghanim, Y. Akrami, M. Alves, M. Arnaud, F. Arroja, J. Aumont, C. Baccigalupi, M. Ballardini, et al., arXiv preprint arXiv:1502.01582 (2015).
[18] U. Seljak and M. Zaldarriaga, arXiv preprint astro-ph/9603033 (1996).
[19] M. Tegmark, Physical Review D 55, 5895 (1997).
[20] J. Bond, A. H. Jaffe, and L. Knox, Physical Review D 57, 2117 (1998).
[21] M. Tegmark, A. de Oliveira-Costa, and A. J. Hamilton, Physical Review D 68, 123523 (2003).
[22] P. Ade, N. Aghanim, C. Armitage-Caplan, M. Arnaud, M. Ashdown, F. Atrio-Barandela, J. Aumont, C. Baccigalupi, A. J. Banday, R. Barreiro, et al., Astronomy & Astrophysics 571, A12 (2014).
[23] M. Tegmark, The Astrophysical Journal Letters 480, L87 (1997).
[24] G. Hinshaw, C. Barnes, C. Bennett, M. Greason, M. Halpern, R. Hill, N. Jarosik, A. Kogut, M. Limon, S. Meyer, et al., The Astrophysical Journal Supplement Series 148, 63 (2003).
[25] G. Hinton, Momentum 9, 926 (2010).
[26] M. Emile Borel, Rendiconti del Circolo Matematico di Palermo (1884-1940) 27, 247 (1909).
[27] R. A. Fisher, Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222, 309 (1922).
[28] M. Riesenhuber and T. Poggio, Nature Neuroscience 3, 1199 (2000).
[29] S. Kullback and R. A. Leibler, Ann. Math. Statist. 22, 79 (1951), URL http://dx.doi.org/10.1214/aoms/1177729694.
[30] T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012).
[31] M. Kardar, Statistical Physics of Fields (Cambridge University Press, 2007).
[32] J. Cardy, Scaling and Renormalization in Statistical Physics, vol. 5 (Cambridge University Press, 1996).
[33] J. K. Johnson, D. M. Malioutov, and A. S. Willsky, ArXiv e-prints (2007), 0710.0013.
[34] C. Beny, ArXiv e-prints (2013), 1301.3124.
[35] S. Saremi and T. J. Sejnowski, Proceedings of the National Academy of Sciences 110, 3071 (2013), URL http://www.pnas.org/content/110/8/3071.abstract.
[36] E. Miles Stoudenmire and D. J. Schwab, ArXiv e-prints (2016), 1605.05775.
[37] G. Vidal, Physical Review Letters 101, 110501 (2008), quant-ph/0610099.
[38] J. Ba and R. Caruana, in Advances in Neural Information Processing Systems (2014), pp. 2654–2662.
[39] J. Hastad, in Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing (ACM, 1986), pp. 6–20.
[40] M. Telgarsky, arXiv preprint arXiv:1509.08101 (2015).
[41] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, in Advances in Neural Information Processing Systems (2014), pp. 2924–2932.
[42] R. Eldan and O. Shamir, arXiv preprint arXiv:1512.03965 (2015).
[43] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, ArXiv e-prints (2016), 1606.05340.
[44] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, ArXiv e-prints (2016), 1606.05336.
[45] A. M. Saxe, J. L. McClelland, and S. Ganguli, arXiv preprint arXiv:1312.6120 (2013).
[46] Y. Bengio, Y. LeCun, et al., Large-scale Kernel Machines 34, 1 (2007).
[47] V. Strassen, Numerische Mathematik 13, 354 (1969).
[48] F. Le Gall, in Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation (ACM, 2014), pp. 296–303.
[49] G. Carleo and M. Troyer, arXiv preprint arXiv:1606.02318 (2016).
[50] H. Vollmer, Introduction to Circuit Complexity: A Uniform Approach (Springer Science & Business Media, 2013).