
Analog Integrated Circuits and Signal Processing, 13, 195–209 (1997). © 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Analog VLSI Stochastic Perturbative Learning Architectures∗

GERT [email protected]

Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218

Received May 20, 1996; Accepted October 17, 1996

Abstract. We present analog VLSI neuromorphic architectures for a general class of learning tasks, which include supervised learning, reinforcement learning, and temporal difference learning. The presented architectures are parallel, cellular, sparse in global interconnects, distributed in representation, and robust to noise and mismatches in the implementation. They use a parallel stochastic perturbation technique to estimate the effect of weight changes on network outputs, rather than calculating derivatives based on a model of the network. This “model-free” technique avoids errors due to mismatches in the physical implementation of the network, and more generally makes it possible to train networks of which the exact characteristics and structure are not known. With additional mechanisms of reinforcement learning, networks of fairly general structure are trained effectively from an arbitrarily supplied reward signal. No prior assumptions are required on the structure of the network, nor on the specifics of the desired network response.

Key Words: Neural networks, neuromorphic engineering, reinforcement learning, stochastic approximation

1. Introduction

Carver Mead introduced “neuromorphic engineering” [1] as an interdisciplinary approach to the design of biologically inspired neural information processing systems, whereby neurophysiological models of perception and information processing in living organisms are mapped onto analog VLSI systems that not only emulate their functions but also resemble their structure [2].

Essential to neuromorphic systems are mechanisms of adaptation and learning, modeled after neural “plasticity” in neurobiology [3], [4]. Learning can be broadly defined as a special case of adaptation whereby past experience is used effectively in readjusting the network response to previously unseen, although similar, stimuli. Based on the nature and availability of a training feedback signal, learning algorithms for artificial neural networks fall under three broad categories: unsupervised, supervised, and reward/punishment (reinforcement). Physiological experiments have revealed plasticity mechanisms in biology that correspond to Hebbian unsupervised learning [5], and classical (Pavlovian) conditioning [6], [7] characteristic of reinforcement learning.

∗This work was supported by ARPA/ONR under MURI grant N00014-95-1-0409. Chip fabrication was provided through MOSIS.


Mechanisms of adaptation and learning also provide a means to compensate for analog imperfections in the physical implementation of neuromorphic systems, and fluctuations in the environment in which they operate. Examples of early implementations of analog VLSI neural systems with integrated adaptation and learning functions can be found in [8]. While off-chip learning can be effective as long as training is performed with the chip “in the loop”, chip I/O bandwidth limitations make this approach impractical for networks with a large number of weight parameters. Furthermore, on-chip learning provides autonomous, self-contained systems able to adapt continuously in the environment in which they operate.

On-chip learning in analog VLSI has proven to be a challenging task for several reasons. The least of the problems is the need for local analog storage of the learned parameters, for which adequate solutions exist in various VLSI technologies [34]–[39]. More significantly, learning algorithms that are efficiently implemented on general-purpose digital computers do not necessarily map efficiently onto analog VLSI hardware. Second, even if the learning algorithm supports a parallel and scalable architecture suitable for analog VLSI implementation, inaccuracies in the implementation of the learning functions may significantly affect the performance of the trained system. Third, the learning can only effectively compensate for inaccuracies in the network implementation when their physical sources are contained directly inside the learning feedback loop. Algorithms which assume a particular model for the underlying characteristics of the system being trained are expected to perform worse than algorithms which directly probe the response of the system to external and internal stimuli. Finally, typical learning algorithms make assumptions on prior knowledge which place heavy constraints on practical use in hardware. This is particularly the case for physical systems of which neither the characteristics nor the optimization objectives are properly defined.

This paper addresses these challenges by using stochastic perturbative algorithms for model-free estimation of gradient information [9], in a general framework that includes reinforcement learning under delayed and discontinuous rewards [10]–[14]. In particular, we extend earlier work on stochastic error-descent architectures for supervised learning [15] to include computational primitives implementing reinforcement learning. The resulting analog VLSI architecture retains desirable properties of a modular and cellular structure, model-free distributed representation, and robustness to noise and mismatches in the implementation. In addition, the architecture is applicable to the most general of learning tasks, where an unknown “black-box” dynamical system is adapted using an external “black-box” reinforcement-based delayed and possibly discrete reward signal. As a proof of principle, we apply the model-free, training-free adaptive techniques to blind optimization of a second-order noise-shaping modulator for oversampled data conversion, controlled by a neural classifier. The only evaluative feedback used in training the classifier is a discrete failure signal which indicates when some of the integrators in the modulation loop saturate.

Section 2 reviews supervised learning and stochastic perturbative techniques, and presents a corresponding architecture for analog VLSI implementation. The following section covers a generalized form of reinforcement learning, and introduces a stochastic perturbative analog VLSI architecture for reinforcement learning. Neuromorphic implementations in analog VLSI and their equivalents in biology are the subject of Section 4. Finally, Section 5 summarizes the findings.

2. Supervised Learning

In a metaphorical sense, supervised learning assumes the luxury of a committed “teacher”, who constantly evaluates and corrects the network by continuously feeding it target values for all network outputs. Supervised learning can be reformulated as an optimization task, where the network parameters (weights) are adjusted to minimize the distance between the targets and actual network outputs. Generalization and overtraining are important issues in supervised learning, and are beyond the scope of this paper.

Let y(t) be the vector of network outputs with components y_i(t), and correspondingly let y^{target}(t) be the supplied target output vector. The network contains adjustable parameters (or weights) p with components p_k, and state variables x(t) with components x_j(t) (which may contain external inputs). Then the task is to minimize the scalar error index

E(p; t) = \sum_i \left| y_i^{target}(t) - y_i(t) \right|^\nu    (1)

in the parameters p_k, using a distance metric with norm ν > 0.
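For concreteness, the error index (1) is easy to state in code. The following is a minimal sketch (the array names and the choice ν = 2 are illustrative assumptions, not from the paper); ν = 2 recovers the familiar sum-of-squared-errors metric.

```python
import numpy as np

def error_index(y, y_target, nu=2.0):
    """Scalar error index E(p; t) of equation (1): summed nu-norm
    distance between target and actual network outputs."""
    return np.sum(np.abs(y_target - y) ** nu)

# Example: three outputs, nu = 2 gives the usual sum of squared errors.
y = np.array([0.1, -0.4, 0.9])
y_target = np.array([0.0, -0.5, 1.0])
print(error_index(y, y_target))  # 0.03
```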

2.1. Gradient Descent

Gradient descent is the most common optimization technique for supervised learning in neural networks, and includes the widely used technique of backpropagation (or “dynamic feedback”) [16] for gradient derivation, applicable to general feedforward multilayered networks.

In general terms, gradient descent minimizes the scalar performance index E by specifying incremental updates in the parameter vector p according to the error gradient ∇_p E:

p(t+1) = p(t) - \eta \, \nabla_p E(t).    (2)

One significant problem with gradient descent and its variants for on-line supervised learning is the complexity of calculating the error gradient components ∂E/∂p_k from a model of the system. This is especially so for complex systems involving internal dynamics in the state variables x_j(t):

\frac{\partial E}{\partial p_k} = \sum_{i,j} \frac{\partial E(t)}{\partial y_i} \cdot \frac{\partial y_i(t)}{\partial x_j} \cdot \frac{\partial x_j(t)}{\partial p_k}    (3)


where derivation of the dependencies ∂x_j/∂p_k over time constitutes a significant amount of computation that typically scales super-linearly with the dimension of the network [15]. Furthermore, the derivation of the gradient in (3) assumes accurate knowledge of the model of the network (y(t) as a function of x(t), and recurrence relations in the state variables x(t)). Accurate model knowledge cannot be assumed for analog VLSI neural hardware, due to mismatches in the physical implementation which cannot be predicted at the time of fabrication. Finally, a model for the system being optimized may often not be readily available, or may be too complicated for practical (real-time) evaluation. In such cases, a black-box approach to optimization is more effective in every regard. This motivates the use of the well-known technique of stochastic approximation [17] for blind optimization in analog VLSI systems. We apply this technique to supervised learning as well as to more advanced models of “reinforcement” learning under discrete delayed rewards. The connection between stochastic approximation techniques and principles of neuromorphic engineering will be illustrated further below, in contrast with gradient descent.

2.2. Stochastic Approximation Techniques

Stochastic approximation algorithms [17] have long been known as effective tools for constrained and unconstrained optimization under noisy observations of system variables [18]. Applied to on-line minimization of an error index E(p), the algorithms avoid the computational burden of gradient estimation by directly observing the dependence of the index E on randomly applied perturbations in the parameter values. Variants on the Kiefer-Wolfowitz algorithm for stochastic approximation [17], essentially similar to random-direction finite-difference gradient descent, have been formulated for blind adaptive control [19], neural networks [9], [20], and the implementation of learning functions in VLSI hardware [21], [22], [15], [23].

The broader class of neural network learning algorithms under this category exhibits the desirable property that the functional form of the parameter updates is “model-free”, i.e., independent of the model specifics of the network or system under optimization. The model-free techniques for on-line supervised learning are directly applicable to almost any observable system with deterministic, slowly varying, and possibly unknown characteristics. Parallel implementation of the stochastic approximation algorithms results in efficient and modular learning architectures that map well onto VLSI hardware. Since those algorithms use only direct function evaluations and no derivative information, they are functionally simple, and their implementation is independent of the structure of the system under optimization. They exhibit robust convergence properties in the presence of noise in the system and model mismatches in the implementation.

A brief description of the stochastic error-descent algorithm follows below, as introduced in [15] for efficient supervised learning in analog VLSI. The integrated analog VLSI continuous-time learning system used in [24], [25] forms the basis for the architectures outlined in the sections that follow.

2.3. Stochastic Supervised Learning

Let E(p) be the error functional to be minimized, with E a scalar deterministic function in the parameter (or weight) vector p with components p_i. The stochastic algorithm specifies incremental updates in the parameters p_i as with gradient descent (2), although using a stochastic approximation to the true gradient

\left[ \frac{\partial E(t)}{\partial p_i} \right]^{est} = \pi_i(t) \cdot \hat{E}(t)    (4)

where the differentially perturbed error

\hat{E}(t) = \frac{1}{2\sigma^2} \left( E(p(t) + \pi(t)) - E(p(t) - \pi(t)) \right)    (5)

is obtained from two direct observations of E under complementary activation of a parallel random perturbation vector π(t), with components π_i(t), onto the parameter vector p(t). The perturbation components π_i(t) are fixed in amplitude and random in sign, π_i(t) = ±σ with equal probabilities for both polarities. The algorithm essentially performs gradient descent in random directions in the parameter space, as defined by the position of the perturbation vector.
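The complete update rule fits in a few lines of code. Below is a minimal sketch, not from the paper: the parallel perturbation π with components ±σ is applied in its two complementary polarities, the differential error Ê(t) is formed as in (5), and every parameter moves by −η π_i Ê(t) following (2) and (4). The quadratic toy objective and all names are illustrative assumptions.

```python
import numpy as np

def stochastic_error_descent(E, p, eta=0.02, sigma=1e-3, steps=5000, seed=0):
    """Model-free stochastic error descent, equations (2), (4) and (5).
    E : callable returning the scalar error for a given parameter vector.
    Uses two function evaluations per step and no derivatives of E."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        pi = sigma * rng.choice([-1.0, 1.0], size=p.shape)  # pi_i = +/- sigma
        E_hat = (E(p + pi) - E(p - pi)) / (2 * sigma**2)    # eq. (5)
        p = p - eta * pi * E_hat                            # eqs. (2), (4)
    return p

# Toy objective (illustrative): quadratic bowl with minimum at p_star.
p_star = np.array([0.3, -0.7, 1.2])
E = lambda p: np.sum((p - p_star) ** 2)
print(stochastic_error_descent(E, np.zeros(3)))  # approaches p_star
```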

As with exact gradient descent, iteration of the updates using (4) converges to the close proximity of a (local) minimum of E, provided the perturbation amplitude σ is sufficiently small. The rate of convergence is necessarily slower than gradient descent, since every observation (5) only reveals scalar information about the gradient vector in one dimension. However, the amount of computation required to compute the gradient at every update may outweigh the higher convergence rate offered by gradient descent, depending on the model complexity of the system under optimization. When applied to on-line supervised learning in recurrent dynamical systems, the stochastic algorithm provides a net computational efficiency rivaling that of exact gradient descent. Computational efficiency is defined in terms of the total number of operations required to converge, i.e., reach a certain level of E. A formal derivation of the convergence properties is presented in [15].

2.4. Supervised Learning Architecture

While alternative optimization techniques based on higher-order extensions of gradient descent will certainly offer superior convergence rates, the above stochastic method achieves its relative efficiency at a much reduced complexity of implementation. The only global operations required are the evaluations of the error function in (5), which are obtained from direct observations on the system under complementary activation of the perturbation vector. The operations needed to generate and apply the random perturbations, and to perform the parameter update increments, are strictly local and identical for each of the parameter components. The functional diagram of the local parameter processing cell, embedded in the system under optimization, is shown in Figure 1. The complementary perturbations and the corresponding error observations are performed in two separate phases on the same system, rather than concurrently on separate replications of the system. The sequential activation of the complementary perturbations and corresponding evaluations of E are synchronized and coordinated with a three-phase clock:

\phi_0 : E(p, t)
\phi_+ : E(p + \pi, t)    (6)
\phi_- : E(p - \pi, t).

This is represented schematically in Figure 1 by a modulation signal φ(t), taking values {−1, 0, 1}. The extra phase φ_0 (φ(t) = 0) is not strictly needed to compute (5)—it is useful otherwise, e.g., to compute finite difference estimates of second-order derivatives for dynamic optimization of the learning rate η(t).

The local operations are further simplified owing to the binary nature of the perturbations, reducing the multiplication in (4) to an exclusive-or logical operation, and the modulation by φ(t) to binary multiplexing. Besides efficiency of implementation, this has a beneficial effect on the overall accuracy of the implemented learning system, as will be explained in the context of VLSI circuit implementation below.

2.5. Supervised Learning in Dynamical Systems

In the above, it was assumed that the error functional E(p) is directly observable on the system by applying the parameter values p_i. In the context of on-line supervised learning in dynamical systems, the error functional takes the form of the average distance of norm ν between the output and target signals over a moving time window,

E(p; t_i, t_f) = \int_{t_i}^{t_f} \sum_i \left| y_i^{target}(t') - y_i(t') \right|^\nu dt',    (7)

which implicitly depends on the training sequence y^{target}(t) and on initial conditions on the internal state variables of the system. An on-line implementation prohibits simultaneous observation of the error measure (7) under different instances of the parameter vector p, as would be required to evaluate (5) for construction of the parameter updates. However, when the training signals are periodic and the interval T = t_f − t_i spans an integer multiple of periods, the measure (7) under constant parameter values is approximately invariant to time. In that case, the two error observations needed in (5) can be performed in sequence, under complementary piecewise constant activation of the perturbation vector.
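In discrete time, the sequential two-sided evaluation of (7) over one period of the training signal can be sketched as below. The toy system, a linear readout of a sinusoidal state trajectory, and all names are illustrative assumptions, not the paper's network.

```python
import numpy as np

def windowed_error(run, p, y_target, nu=2.0):
    """Discrete-time version of the windowed error (7): run(p) simulates
    the system over one full period of the periodic training signal and
    returns the output trajectory; the integral becomes a sum."""
    return np.sum(np.abs(y_target - run(p)) ** nu)

def perturbed_error_pair(run, p, pi, y_target):
    """Two sequential observations of (7) under complementary, piecewise
    constant perturbations; valid because the windowed error of a periodic
    system under constant parameters is approximately time-invariant."""
    return (windowed_error(run, p + pi, y_target),
            windowed_error(run, p - pi, y_target))

# Toy system (illustrative): scaled, shifted sinusoid over one period.
T = 100
x = np.sin(2 * np.pi * np.arange(T) / T)
run = lambda p: p[0] * x + p[1]
y_target = 0.8 * x - 0.1
print(windowed_error(run, np.array([0.8, -0.1]), y_target))  # 0.0 at optimum
```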

Stochastic error descent with the dynamic error index in (7) has been used in [24], [25] to implement the supervised learning functions in an integrated network of six fully interconnected continuous-time neurons with 42 parameters. The analog VLSI chip, dissipating 1.2 mW from a 5 V supply, learned to generate given periodic signals at two of the outputs, representing a quadrature-phase oscillator, in fewer than 1500 training cycles of 60 msec each.

In the context of on-line supervised learning in dynamical systems, the requirement of periodicity on the training signals is a major limitation of the stochastic error-descent algorithm. Next, this requirement will be relaxed, along with some more stringent assumptions on the nature of supervised learning. In particular, a target training signal will no longer be necessary. Instead, learning is based on an external reward signal that conveys only partial and delayed information about the performance of the network.

Fig. 1. Architecture implementing stochastic error-descent supervised learning. The learning cell is locally embedded in the network. The differential error index is evaluated globally, and communicated to all cells.

3. Learning From Delayed and Discrete Rewards

Supervised learning methods rely on a continuous training signal that gives constant feedback about the direction in which to adapt the weights of the network to improve its performance. This continuous signal is available in the form of target values y^{target}(t) for the network outputs y(t). More widely applicable but also more challenging are learning tasks where target outputs or other continuous teacher feedback are not available, but instead only a non-steady, delayed reward (or punishment) signal is available to evaluate the quality of the outputs (or actions) produced by the network. The difficulty lies in assigning proper credit for the reward (or punishment) to actions that were produced by the network in the past, and adapting the network accordingly in such a way as to reinforce the network actions leading to reward (and conversely, avoid those leading to punishment).

3.1. Reinforcement Learning Algorithms

Several iterative approaches to dynamic programming have been applied to solve the credit assignment problem for training a neural network with delayed rewards [10]–[14]. They all invoke an “adaptive critic element” which is trained along with the network to predict the future reward signal from the present state of the network. We define “reinforcement learning” essentially as given in [11], which includes as special cases “temporal difference learning” or TD(λ) [12], and, to some extent, Q-learning [13] and “advanced heuristic dynamic programming” [14]. The equations are listed below in general form to clarify the similarity with the above supervised perturbative learning techniques. It will then be shown how the above architectures are extended to allow learning from delayed and/or impulsive rewards.

Let r(t) be the discrete delayed reward signal for state vector x(t) of the system (components x_j(t)); r(t) is zero when no signal is available, and is negative for a punishment. Let y(t) be the (scalar) output of the network in response to an input (or state) x(t), and q(t) the predicted future reward (or “value function”) associated with state x(t) as produced by the adaptive critic element. The action taken by the system is determined by the polarity of the network output, sign(y(t)). For example, in the pole balancing experiment of [11], y(t) is hard-limited and controls the direction of the fixed-amplitude force exerted on the moving cart. Finally, let w and v (components w_i and v_i) be the weights of the network and the adaptive critic element, respectively. Then the weight updates are given by

\Delta w_i(t) = w_i(t+1) - w_i(t) = \alpha \, \hat{r}(t) \cdot e_i(t)    (8)

\Delta v_i(t) = v_i(t+1) - v_i(t) = \beta \, \hat{r}(t) \cdot d_i(t)    (9)

where the “eligibility” functions e_i(t) and d_i(t) are updated as

e_i(t+1) = \delta e_i(t) + (1 - \delta) \, \mathrm{sign}(y(t)) \frac{\partial y(t)}{\partial w_i}    (10)

d_i(t+1) = \lambda d_i(t) + (1 - \lambda) \frac{\partial q(t)}{\partial v_i}

and the reinforcement \hat{r}(t) is given by

\hat{r}(t) = r(t) + \gamma q(t) - q(t-1).    (11)

The parameters δ and λ define the time span of credit assigned by e_i(t) and d_i(t) to actions in the past, weighting recent actions stronger than past actions:

e_i(t) = (1 - \delta) \sum_{t' = -\infty}^{t-1} \delta^{t - t' - 1} \, \mathrm{sign}(y(t')) \frac{\partial y(t')}{\partial w_i}    (12)

d_i(t) = (1 - \lambda) \sum_{t' = -\infty}^{t-1} \lambda^{t - t' - 1} \frac{\partial q(t')}{\partial v_i}.

Similarly, the parameter γ defines the time span for the prediction of future reward by q(t), at convergence:

q(t) \approx \sum_{t' = t+1}^{\infty} \gamma^{t' - t - 1} r(t').    (13)

For γ = 1 and y ≡ q, the equations reduce to TD(λ). Convergence of TD(λ) with probability one has been proven in the general case of linear networks of the form q = \sum_i v_i x_i [26].
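The update cycle (8)–(11) is compact enough to state directly in code. The sketch below assumes, for illustration only, a linear actor y = w·x and linear critic q = v·x, so that the gradients in (10) reduce to the state x itself; the driver loop with its sparse punishment signal is likewise a hypothetical stand-in, not the paper's experiment.

```python
import numpy as np

def reinforcement_step(x, q_prev, w, v, e, d, r,
                       alpha=0.05, beta=0.001, delta=0.8, lam=0.7, gamma=0.9):
    """One cycle of the reinforcement learning rules (8)-(11), for a linear
    actor y = w.x and adaptive critic q = v.x (so dy/dw = dq/dv = x)."""
    y, q = w @ x, v @ x
    r_hat = r + gamma * q - q_prev                   # eq. (11)
    w = w + alpha * r_hat * e                        # eq. (8)
    v = v + beta * r_hat * d                         # eq. (9)
    e = delta * e + (1 - delta) * np.sign(y) * x     # eq. (10)
    d = lam * d + (1 - lam) * x
    return w, v, e, d, q

# Illustrative driver: random states, punishment only on rare "failures".
rng = np.random.default_rng(1)
w, v, e, d = (np.zeros(4) for _ in range(4))
q_prev = 0.0
for _ in range(100):
    x = rng.standard_normal(4)
    r = -10.0 if np.abs(x).max() > 2.5 else 0.0      # sparse reward signal
    w, v, e, d, q_prev = reinforcement_step(x, q_prev, w, v, e, d, r)
```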

Learning algorithms of this type are neuromorphic in the sense that they emulate classical (Pavlovian) conditioning in pattern association as found in biological systems [6] and their mathematical and cognitive models [27], [7]. Furthermore, as shown below, the algorithms lend themselves to analog VLSI implementation in a parallel distributed architecture which, unlike more complicated gradient-based schemes, resembles the general structure and connectivity of biological neural systems.

3.2. Reinforcement Learning Architecture

While reinforcement learning does not perform gradient descent of a (known) error functional, the eligibility functions e_i(t) and d_i(t) used in the weight updates are constructed from derivatives of output functions with respect to the weights. The eligibility functions in equation (10) can be explicitly restated as (low-pass filtered) gradients of an error function

E(t) = |y(t)| + q(t)    (14)

with

e_i(t+1) = \delta e_i(t) + (1 - \delta) \frac{\partial E(t)}{\partial w_i}    (15)

d_i(t+1) = \lambda d_i(t) + (1 - \lambda) \frac{\partial E(t)}{\partial v_i}.

Rather than calculating the gradients in (15) from the network model, we can again apply stochastic perturbative techniques to estimate the gradients from direct evaluations on the network. In doing so, all properties of robustness, scalability and modularity that apply to stochastic error-descent supervised learning apply here as well. As in (4), stochastic approximation estimates of the gradient components in (15) are

\left[ \frac{\partial E(t)}{\partial w_i} \right]^{est} = \omega_i(t) \, \hat{E}(t)    (16)

\left[ \frac{\partial E(t)}{\partial v_i} \right]^{est} = \upsilon_i(t) \, \hat{E}(t)

where the differential perturbed error

\hat{E}(t) = \frac{1}{2\sigma^2} \left( E(w + \omega, v + \upsilon, t) - E(w - \omega, v - \upsilon, t) \right)    (17)

is obtained from two-sided parallel random perturbations w ± ω simultaneous with v ± υ (|ω_i| = |υ_i| = σ).

A side benefit of the low-pass filtering of the gradient in (15) is an improvement of the stochastic gradient estimate (16) through averaging. As with stochastic error-descent supervised learning, averaging reduces the variance of the gradient estimate and produces learning increments that are less stochastic in nature, although this is not essential for convergence of the learning process [15].
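A minimal sketch of the perturbative eligibility update follows, combining (15)–(17); the callable E(w, v), standing for one evaluation of |y| + q at the present input, and all names are illustrative assumptions.

```python
import numpy as np

def perturbative_eligibility_step(E, w, v, e, d, sigma=0.01,
                                  delta=0.8, lam=0.7, rng=None):
    """Eligibility update (15) using the perturbative estimates (16)-(17).
    E : callable E(w, v) returning |y| + q for the present input, eq. (14).
    Network and critic weights are perturbed simultaneously, and a single
    differential observation E_hat serves all eligibility components."""
    rng = rng or np.random.default_rng(0)
    omega = sigma * rng.choice([-1.0, 1.0], size=w.shape)
    upsilon = sigma * rng.choice([-1.0, 1.0], size=v.shape)
    E_hat = (E(w + omega, v + upsilon)
             - E(w - omega, v - upsilon)) / (2 * sigma**2)   # eq. (17)
    e = delta * e + (1 - delta) * omega * E_hat              # eqs. (15), (16)
    d = lam * d + (1 - lam) * upsilon * E_hat
    return e, d
```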

Figure 2 shows the block diagram of a reinforcement learning cell and an adaptive critic cell, with stochastic perturbative estimation of the gradient according to (16). LP_δ and LP_λ denote first-order low-pass filters (15) with time constants determined by δ and λ. Other than the low-pass filtering and the global multiplicative factor \hat{r}(t), the architecture is identical to that of stochastic error-descent learning in Figure 1. As before, the estimation of \hat{E}(t) does not require the separate instances of perturbed and non-perturbed networks shown in Figure 2, and can be computed sequentially by evaluating the output of the network and adaptive critic in three phases for every cycle of t:

\phi_0 : E(w, v, t)
\phi_+ : E(w + \omega, v + \upsilon, t)    (18)
\phi_- : E(w - \omega, v - \upsilon, t).

In systems with a continuous-time output response, we assume that the time lag between consecutive observations of the three phases of E is not an issue, which amounts to choosing an appropriate sampling rate for t in relation to the bandwidth of the system.

Fig. 2. General architecture implementing reinforcement learning using stochastic gradient approximation. (a) Reinforcement learning cell. (b) Adaptive critic cell.

Similarities between the above cellular architectures for supervised learning and reinforcement learning are apparent: both correlate local perturbation values π_i, ω_i or υ_i with a global scalar index \hat{E} that encodes the differential effect of the perturbations on the output, and both incrementally update the weights p_i, w_i or v_i accordingly. The main difference in reinforcement learning is the additional gating of the correlate product with a global reinforcement signal \hat{r} after temporal filtering. For many applications, the extra overhead that this implies in hardware resources is more than compensated by the utility of the reward-based credit assignment mechanism, which does not require an external teacher. An example is given below in the case of oversampled A/D conversion.

3.3. Simulation Results

We evaluated both exact gradient and stochastic perturbative embodiments of the reinforcement learning algorithms on an adaptive neural classifier controlling a higher-order noise-shaping modulator used for oversampled A/D data conversion [30]. The order-n modulator comprises a cascade of n integrators x_i(t) operating on the difference between the analog input u(t) and the binary modulated output y(t):

x_0(t+1) = x_0(t) + a \, (u(t) - y(t))    (19)

x_i(t+1) = x_i(t) + a \, x_{i-1}(t), \quad i = 1, \ldots, n-1

where a = 0.5. The control objective is to choose the binary sequence y(t) such as to keep the excursion of the integration variables within bounds, |x_i(t)| < x_sat [29].

For the adaptive classifier, we specify a one-hidden-layer neural network, with inputs x_i(t) and output y(t). The network has m hidden units, a tanh(.) sigmoidal nonlinearity in the hidden layer, and a linear output layer. For the simulations we set n = 2 and m = 5. The case n = 2 is equivalent to the single pole-balancing problem [11].

The only evaluative feedback signal used during learning is a failure signal which indicates when one or more integration variables saturate, |x_i(t)| ≥ x_sat. In particular, the signal r(t) counts the number of integrators in saturation:

r(t) = -b \sum_i H(|x_i(t)| - x_{sat})    (20)

where b = 10, and where H(.) denotes a step function (H(x) = 1 if x > 0 and 0 otherwise). The adaptive critic q(t) is implemented with a neural network of identical structure as that for y(t).
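The modulator dynamics (19) and the failure signal (20) are easy to reproduce in simulation. The sketch below is illustrative only: the control policy shown is an untrained stand-in (sign of the last integrator), not the paper's trained classifier, and the reset-on-failure follows the description in the text.

```python
import numpy as np

def modulator_step(x, u, y, a=0.5):
    """One update of the order-n noise-shaping modulator, eq. (19):
    a cascade of integrators driven by the input/output difference."""
    x_new = np.empty_like(x)
    x_new[0] = x[0] + a * (u - y)
    x_new[1:] = x[1:] + a * x[:-1]
    return x_new

def failure_signal(x, x_sat=1.0, b=10.0):
    """Reward r(t) of eq. (20): minus b times the number of integrators
    in saturation; zero while all variables stay within bounds."""
    return -b * np.sum(np.abs(x) > x_sat)

# Illustrative run with n = 2 and a placeholder (untrained) policy.
rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(1000):
    u = rng.uniform(-0.5, 0.5)
    y = 1.0 if x[-1] >= 0 else -1.0        # stand-in for the neural classifier
    x = modulator_step(x, u, y)
    if failure_signal(x) < 0:
        x[:] = 0.0                         # reset integrators on failure
```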


Fig. 3. Simulated performance of stochastic perturbative (◦) and gradient-based (×) reinforcement learning in a second-order noise-shaping modulator. Time between failures for consecutive trials from zero initial conditions.

The learning parameters in (8), (10) and (11) are δ = 0.8, λ = 0.7, γ = 0.9, α = 0.05 and β = 0.001. These values are consistent with [11], adapted to accommodate differences in the time scale of the dynamics (19). The perturbation strength in the stochastic version is σ = 0.01.

Figure 3 shows the learning performance for several trials of both versions of reinforcement learning, using exact and stochastic gradient estimates. During learning, the input sequence u(t) is random, uniform in the range −0.5 ... 0.5. Initially, and every time failure occurs (r(t) < 0), the integration variables x_i(t) and eligibilities e_k(t) and d_k(t) are reset to zero. Qualitative differences observed between the exact and stochastic versions in Figure 3 are minor. Further, in all but one of the 20 cases tried, learning completed (i.e., subsequent failure is not observed in finite time) in fewer than 20 consecutive trial-and-error iterations. Notice that a non-zero r(t) is only generated at failure, i.e., fewer than 20 times, and no other external evaluative feedback is needed for learning.

Figure 4 quantifies the effect of stochastic perturbative estimation of the gradients (15) on the quality of reinforcement learning. The correlation index c(t) measures the degree of conformity in the eligibilities (both e_i(t) and d_i(t)) between stochastic and exact versions of reinforcement learning. Correlation is expressed as usual on a scale from −1 to 1, with c(t) = 1 indicating perfect coincidence. While c(t) is considerably less than 1 in all cases, c(t) > 0 about 95% of the time, meaning that on average the sign of the parameter updates (8) is consistent between the exact and stochastic versions in at least 95% of the cases. The scatter plot of c(t) vs. \hat{r}(t) also illustrates how the adaptive critic produces a positive reinforcement \hat{r}(t) in most of the cases, even though the “reward” signal r(t) is never positive by construction. Positive reinforcement \hat{r}(t) under idle conditions of r(t) is desirable for stability. Notice that the failure-driven punishment points (where r(t) < 0) are off the scale of the graph and strongly negative.

Fig. 4. Effect of stochastic perturbative gradient estimation on reinforcement learning. c(t) quantifies the degree of conformity in the eligibilities e_i(t) and d_i(t) between exact and stochastic versions.

We tried reinforcement learning on higher-order modulators, n = 3 and higher. Both exact and stochastic versions were successful for n = 3 in the majority of cases, but failed to converge for n = 4 with the same parameter settings. In itself, this is not surprising, since higher-order delta-sigma modulators tend to become increasingly prone to instabilities and sensitive to small changes in parameters with increasing order n, which is why they are almost never used in practice [30]. It is possible that more advanced reinforcement learning techniques such as “advanced heuristic dynamic programming” (AHDP) [14] would converge for orders n > 3. AHDP offers improved learning efficiency using a more advanced, gradient-based adaptive critic element for prediction of reward, although it is not clear at present how to map the algorithm efficiently onto analog VLSI.

The above stochastic perturbative architectures for both supervised and reinforcement learning support common “neuromorphs” and corresponding analog VLSI implementations. Neuromorphs of learning in analog VLSI are the subject of the next section.

4. Neuromorphic Analog VLSI Learning

Neuromorphic engineering [1] combines inspiration from neurobiology to pursue human-engineered, human-like VLSI information processing systems at all levels of the design: device physics and technology, circuit topologies, and the overall architecture organizing the flow of information. We address these levels in the context of learning as defined above.

4.1. Subthreshold MOS Technology

MOS transistors operating in the subthreshold region [31] are attractive for use in medium-speed, medium-accuracy analog VLSI processing, because of the low current levels and the exponential current-voltage characteristics that span a wide dynamic range of currents [32] (roughly from 100 fA to 100 nA for a square device in 2 µm CMOS technology). Furthermore, subthreshold MOS transistors provide a clear “neuromorph” [1], since their exponential I-V characteristics closely resemble the carrier transport through cell membranes in biological neural systems, as governed by the same Boltzmann statistics [33]. The exponential characteristics provide a variety of subthreshold MOS circuit topologies that serve as useful computational primitives (such as nonlinear conductances, sigmoid nonlinearities, etc.) for compact analog VLSI implementation of neural systems [2]. Of particular interest are translinear subthreshold MOS circuits, derived from similar bipolar circuits [32]. They are based on the exponential nature of current-voltage relationships, and offer attractive compact implementations of product and division operations in VLSI.
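The exponential law is commonly written I = I_0 exp(V_gs / (n U_T)) in saturation. The short sketch below only illustrates the dynamic range quoted above; I_0 and the slope factor n are plausible assumed values, not measured ones.

```python
import numpy as np

# Subthreshold MOS saturation current, I = I0 * exp(Vgs / (n * UT)).
I0 = 1e-16    # A, pre-exponential constant (assumed, device dependent)
n = 1.5       # subthreshold slope factor (assumed)
UT = 0.0258   # V, thermal voltage kT/q at room temperature

def subthreshold_current(Vgs):
    return I0 * np.exp(Vgs / (n * UT))

# A gate swing of ~0.5 V spans the ~100 fA to 100 nA range cited above.
for Vgs in (0.27, 0.40, 0.54, 0.67, 0.80):
    print(f"Vgs = {Vgs:.2f} V -> I = {subthreshold_current(Vgs):.1e} A")
```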

4.2. Adaptation and Memory

Fig. 5. Adaptation and memory in analog VLSI: storage cell with charge buffer.

Learning in analog VLSI systems is inherently coupled with the problem of storage of analog information, since after learning it is most often desirable to retain the learned weights for an extended period of time. The same is true for biological neural systems, where the mechanisms of plasticity for short-term and long-term synaptic storage are not yet fully understood. In VLSI, analog weights are conveniently stored as charge or voltage on a capacitor. A capacitive memory is generically depicted in Figure 5. The stored weight charge is preserved when brought in contact with the gate of an MOS transistor, which serves as a buffer between weight storage and the implementation of the synaptic function. An adaptive element in contact with the capacitor updates the stored weight in the form of discrete charge increments

V_{stored}(t+1) = V_{stored}(t) + \frac{1}{C} \Delta Q_{adapt}(t)    (21)

or, equivalently, a continuous current supplying a derivative

\frac{d}{dt} V_{stored}(t) = \frac{1}{C} I_{adapt}(t)    (22)

where \Delta Q_{adapt}(t) = \int_t^{t+1} I_{adapt}(t') \, dt'.
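As a worked example of (21)–(22) (the current and pulse duration below are illustrative assumptions; the 0.5 pF capacitance is the value measured later in this section): a 1 pA adaptation current enabled for 50 µs transfers 0.05 fC and moves the stored weight voltage by 0.1 mV.

```python
# Capacitive weight storage, eqs. (21)-(22): dV = dQ / C, with dQ = I * dt.
C = 0.5e-12        # F, storage capacitance (value used in Section 4.3)
I_adapt = 1e-12    # A, subthreshold adaptation current (illustrative)
dt = 50e-6         # s, duration of the enabled update pulse (illustrative)

dQ = I_adapt * dt                 # charge increment: 0.05 fC (sub-fC range)
dV = dQ / C                       # voltage increment on the stored weight
print(f"dV = {dV * 1e3:.3f} mV")  # dV = 0.100 mV
```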

In itself, a floating-gate capacitor is a near-perfect memory. However, leakage and spontaneous decay of the weights result when the capacitor is in volatile contact with the adaptive element, such as through drain or source terminals of MOS transistors. Non-volatile memories contain adaptive elements that interface with the floating-gate capacitor by capacitive coupling across an insulating oxide. Charge transport through the oxide is controlled by tunneling and hot electron injection [34] or UV-activated conduction [35], [36]. Volatile memories are adequate for short-term storage, or require an active refresh mechanism for long-term storage [37], [38].

Fig. 6. Charge-pump adaptive element implementing a volatile synapse.

4.3. Adaptive Circuits

Charge-Pump Adaptive Element. Figure 6 shows the circuit diagram of a charge-pump adaptive element implementing a volatile synapse. The circuit is a simplified version of the charge pump used in [38] and [25]. When enabled by ENn and ENp (at GND and Vdd potentials, respectively), the circuit generates an incremental update (21) or (22) of which the polarity is determined by POL. The amplitude of the current supplying the incremental update is determined by the gate voltages Vbn and Vbp, biased deep in subthreshold to allow fine (sub-fC) increments if needed. The increment amplitude is also determined by the duration of the enabled current, controlled by the timing of ENn and ENp. When both ENn and ENp are set midway between GND and Vdd, the current output is disabled. Notice that the switch-off transient is (virtually) free of clock feedthrough charge injection, because the current-supplying transistors are switched from their source terminals, with the gate terminals being kept at constant voltage [38].

Stochastic Perturbative Learning Cell. The circuit schematic of a learning cell implementing stochastic error descent is given in Figure 7, adapted from [24], [25] in simplified form. The incremental update −ηπ_i\hat{E} specified by (2) and (4) is first decomposed in amplitude and sign components. This allows for a hybrid digital-analog implementation of the learning cell, in which the amplitudes of certain operands are processed in analog format, and their polarities implemented in logic. Since |π_i| ≡ 1, the amplitude η|\hat{E}| is conveniently communicated as a global signal to all cells, in the form of the two gate voltages Vbn and Vbp. The (inverted) polarity POL is obtained as the (inverted) exclusive-or combination of the perturbation π_i and the polarity of \hat{E}. The decomposition of sign and amplitude ensures proper convergence of the learning increments in the presence of mismatches and offsets in the physical implementation of the learning cell. This is a consequence of the fact that the polarities of the increments are implemented accurately by logic-controlled circuitry, regardless of analog mismatches in the implementation.

The perturbation π_i is applied to p_i in the three phases (6) by capacitive coupling onto the storage node C. The binary state of the local perturbation π_i selects one of two global perturbation signals to couple onto C. The perturbation signals (V_{+σ} and its complement V_{−σ}) globally control the three phases φ_0, φ_+ and φ_− of (6), and set the perturbation amplitude σ. The simple configuration using a one-bit multiplexer is possible because each perturbation component can only take one of the two values ±σ.
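The sign/amplitude decomposition can be made concrete with a short sketch (a hypothetical software analogue, not the circuit itself): the global analog amplitude η|Ê| is shared by all cells, while each cell derives its local polarity bit from an exclusive-or of sign bits.

```python
import numpy as np

def decompose_update(pi, E_hat, eta):
    """Hybrid digital-analog decomposition of the update -eta * pi_i * E_hat:
    a single global amplitude, with per-cell polarity from an XOR of the
    local perturbation bit and the global error sign bit."""
    amplitude = eta * abs(E_hat)            # global analog quantity
    sign_pi = pi > 0                        # local perturbation bits
    sign_E = E_hat > 0                      # global error sign bit
    pol = np.logical_xor(sign_pi, sign_E)   # True -> positive increment
    return np.where(pol, amplitude, -amplitude)

# Example: two cells perturbed oppositely receive opposite increments.
print(decompose_update(np.array([+1.0, -1.0]), E_hat=0.2, eta=0.1))
```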

Analog Storage. Because of the volatile nature of the adaptive element used, an active refresh mechanism may be required if long-term local storage of the weight values after learning is desired [37]. A robust and efficient mechanism is “partial incremental refresh” [38], which can be directly implemented using the adaptive element in Figure 7 by driving POL with a binary function of the weight value [39]. As in [39], the binary quantization function can be multiplexed over an array of storage cells, and can be implemented by retaining the LSB from A/D/A conversion [40] of the value to be stored.

A non-volatile equivalent of the adaptive element in Figure 6 is described in [34]. A non-volatile learning cell performing stochastic error descent can be obtained by substituting [34] as the core adaptive element in Figure 7. Likewise, more intricate volatile and non-volatile circuits implementing stochastic reinforcement learning can be derived from extensions on Figure 7 and [34].

Fig. 7. Circuit schematic of a learning cell implementing stochastic error descent, using the charge-pump adaptive element.

Measurements. The model-free nature of the stochastic perturbative learning algorithms does not impose any particular conditions on the implementation of the computational functions required for learning. By far the most critical element limiting learning performance is the quality of the parameter update increments and decrements, in particular the correctness of their polarity. Relative fluctuations in the amplitude of the learning updates do not affect the learning process to first order, since their effect is equivalent to relative fluctuations in the learning rate. On the other hand, errors in the polarity of the learning updates might adversely affect learning performance even at small update amplitudes. Therefore, we investigate the practical limits imposed by the binary controlled charge-pump adaptive element of Figure 6, which controls the polarity and amplitude of the updates for learning as well as dynamic refresh.

Measurements on a charge pump with C = 0.5 pF fabricated in a 2 µm CMOS process are shown in Figure 8. Under pulsed activation of ENn and ENp, the resulting voltage increments and decrements are recorded as a function of the gate bias voltages Vbn and Vbp, for both polarities of POL, and for three different values of the pulse width Δt (23 µsec, 1 msec and 40 msec). In all tests, the pulse period extends 2 msec beyond the pulse width. The exponential subthreshold characteristics are evident from Figure 8, with increments and decrements spanning four orders of magnitude in amplitude. The lower limit is mainly determined by junction diode leakage currents, as shown in Figure 8(a) for Δt = 0 (0.01 mV per 2 msec interval at room temperature). This is more than adequate to accommodate learning over a typical range of learning rates. Also, the binary control POL of the polarity of the update is effective for increments and decrements down to 0.05 mV in amplitude, corresponding to charge transfers of only a few hundred electrons.

4.4. Architecture

The general structure of neuromorphic information processing systems has some properties differentiating it from more conventional human-engineered computing machinery, which is typically optimized for general-purpose sequential digital programming. Some of the desirable properties for neuromorphic architectures are: fault-tolerance and robustness to noise through a redundant distributed representation, robustness to changes in operating conditions through on-line adaptation, real-time bandwidth through massive parallelism, and modularity as a result of locality in space and time. We illustrate these properties in the two architectures for supervised and reinforcement learning in Figures 1 and 2. Since both architectures are similar, a distinction between them will not explicitly be made.

Fig. 8. Measured characteristics of the charge-pump adaptive element in 2 µm CMOS with C = 0.5 pF. (a) n-type decrements (POL = 0); (b) p-type increments (POL = 1).

Fault-Tolerance Through Statistical Averaging in a Distributed Representation. Direct implementation of gradient descent, based on an explicit model of the network, is prone to errors due to unaccounted discrepancies in the network model and mismatches in the physical implementation of the gradient. This is due to the localized representation in the computation of the gradient as calculated from the model, in which any discrepancy in one part may drastically affect the final result. For this and other reasons, it is unlikely that biology performs gradient calculation on complex systems such as recurrent neural networks with continuous-time dynamics. Stochastic error descent avoids errors of various kinds by physically probing the gradient on the system rather than deriving it. Using simultaneous and uncorrelated parallel perturbations of the weights, the effect of a single error on the outcome is thus significantly reduced, by virtue of the statistical nature of the computation. Critical to the accuracy of the implemented learning system, however, is the precise derivation and faithful distribution of the global learning signals \hat{E}(t) and \hat{r}(t). Strictly speaking, it is essential only to guarantee the correct polarity, and not the exact amplitude, of the global learning signals, as implemented in Figure 7.

Robustness to Changes in the Environment Through On-Line Adaptation. This property is inherent to the on-line incremental nature of the studied supervised and reinforcement learning algorithms, which track structural changes in E(t) or r(t) on a characteristic time scale determined by learning rate constants such as η or α and β. Learning rates can be reduced as convergence is approached, as in the popular notion in cognitive neuroscience that neural plasticity decreases with age and experience.

Real-Time Bandwidth Through Parallelism. All learning operations are performed in parallel, with the exception of the three-phase perturbation scheme (5) or (17) to obtain the differential index \hat{E} under sequential activation of complementary perturbations π and −π. We note that the synchronous three-phase scheme is not essential and could be replaced by an asynchronous perturbation scheme as in [9] and [41]. While this probably resembles biology more closely, the synchronous gradient estimate (4) using complementary perturbations is computationally more efficient, as it cancels error terms up to second order in the perturbation strength σ [17]. In the asynchronous scheme, one could envision the role of random noise naturally present in biological systems as a source of perturbations, although it is not clear how noise sources can be effectively isolated to produce the correlation measures necessary for gradient estimation.

Modular Architecture with Local Connectivity. The learning operations are local in the sense that a need for excessive global interconnects between distant cells is avoided. The global signals are few in number and common for all cells, which implies that no signal interconnects are required between cells across the learning architecture; all global signals are communicated uniformly across cells instead. This makes it possible to embed the learning cells directly into the network (or adaptive critic) architecture, where they interface physically with the synapses they adapt, as in biological systems. The common global signals include the differential index \hat{E}(t) and reinforcement signal \hat{r}(t), besides common bias levels and timing signals. \hat{E}(t) and \hat{r}(t) are obtained by any global mechanism that quantifies the “fitness” of the network response in terms of teacher target values or discrete rewards (punishments). Physiological experiments support evidence of local (Hebbian [5]) and sparsely globally interconnected (reinforcement [6]) mechanisms of learning and adaptation in biological neural systems [3], [4].

5. Conclusion

Neuromorphic analog VLSI architectures for a general class of learning tasks have been presented, along with key components in their analog VLSI circuit implementation. The architectures make use of distributed stochastic techniques for robust estimation of gradient information, accurately probing the effect of parameter changes on the performance of the network. Two architectures have been presented: one implementing stochastic error descent for supervised learning, and the other implementing a novel stochastic variant on a generalized form of reinforcement learning. The two architectures are similar in structure, and both are suitable for scalable and robust analog VLSI implementation.

While both learning architectures can operate on (and be integrated in) arbitrary systems of which the characteristics and structure do not need to be known, the reinforcement learning architecture additionally supports a more general form of learning, using an arbitrary, externally supplied, reward or punishment signal. This allows the development of more powerful, generally applicable devices for “black-box” sensor-motor control which make no prior assumptions on the structure of the network and the specifics of the desired network response.

We presented simulation results that demonstrate the effectiveness of perturbative stochastic gradient estimation for reinforcement learning, applied to adaptive oversampled data conversion. The neural classifier controlling the second-order noise-shaping modulator was trained for optimal performance with no more evaluative feedback than a discrete failure signal indicating whenever any of the modulation integrators saturate. The critical part in the VLSI implementation of adaptive systems of this type is the precision of the polarity, rather than the amplitude, of the implemented weight parameter updates. Measurements on a binary controlled charge pump in 2 µm CMOS have demonstrated voltage increments and decrements of precise polarity spanning four orders of magnitude in amplitude, with charge transfers down to a few hundred electrons.

References

1. C. A. Mead, “Neuromorphic electronic systems.” Proceedings of the IEEE 78(10), pp. 1629–1639, 1990.

2. C. A. Mead, Analog VLSI and Neural Systems. Addison-Wesley: Reading, MA, 1989.

3. G. M. Shepherd, The Synaptic Organization of the Brain, 3rd ed. Oxford Univ. Press: New York, NY, 1992.

4. P. S. Churchland and T. J. Sejnowski, The Computational Brain. MIT Press: Cambridge, MA, 1990.

5. S. R. Kelso and T. H. Brown, “Differential conditioning of associative synaptic enhancement in hippocampal brain slices.” Science 232, pp. 85–87, 1986.

6. R. D. Hawkins, T. W. Abrams, T. J. Carew, and E. R. Kandel, “A cellular mechanism of classical conditioning in Aplysia: activity-dependent amplification of presynaptic facilitation.” Science 219, pp. 400–405, 1983.

7. P. R. Montague, P. Dayan, C. Person, and T. J. Sejnowski, “Bee foraging in uncertain environments using predictive Hebbian learning.” Nature 377(6551), pp. 725–728, 1995.

8. C. A. Mead and M. Ismail, Eds., Analog VLSI Implementation of Neural Systems. Kluwer: Norwell, MA, 1989.

9. A. Dembo and T. Kailath, “Model-free distributed learning.” IEEE Transactions on Neural Networks 1(1), pp. 58–70, 1990.

10. S. Grossberg, “A neural model of attention, reinforcement, and discrimination learning.” International Review of Neurobiology 18, pp. 263–327, 1975.

11. A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems.” IEEE Transactions on Systems, Man, and Cybernetics 13(5), pp. 834–846, 1983.

12. R. S. Sutton, “Learning to predict by the methods of temporal differences.” Machine Learning 3, pp. 9–44, 1988.

13. C. Watkins and P. Dayan, “Q-learning.” Machine Learning 8, pp. 279–292, 1992.

14. P. J. Werbos, “A menu of designs for reinforcement learning over time,” in Neural Networks for Control (W. T. Miller, R. S. Sutton, and P. J. Werbos, eds.). MIT Press: Cambridge, MA, 1990, pp. 67–95.

15. G. Cauwenberghs, “A fast stochastic error-descent algorithm for supervised learning and optimization,” in Advances in Neural Information Processing Systems, vol. 5. Morgan Kaufmann: San Mateo, CA, 1993, pp. 244–251.

16. P. Werbos, “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences,” Ph.D. dissertation, 1974. Reprinted in P. Werbos, The Roots of Backpropagation. Wiley: New York, 1993.

17. H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag: New York, NY, 1978.

18. H. Robbins and S. Monro, “A stochastic approximation method.” Annals of Mathematical Statistics 22, pp. 400–407, 1951.

19. J. C. Spall, “A stochastic approximation technique for generating maximum likelihood parameter estimates.” Proceedings of the 1987 American Control Conference, Minneapolis, MN, 1987.

20. M. A. Styblinski and T.-S. Tang, “Experiments in nonconvex optimization: Stochastic approximation with function smoothing and simulated annealing.” Neural Networks 3(4), pp. 467–483, 1990.

21. M. Jabri and B. Flower, “Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayered networks.” IEEE Transactions on Neural Networks 3(1), pp. 154–157, 1992.

22. J. Alspector, R. Meir, B. Yuhas, and A. Jayakumar, “A parallel gradient descent method for learning in analog VLSI neural networks,” in Advances in Neural Information Processing Systems, vol. 5. Morgan Kaufmann: San Mateo, CA, 1993, pp. 836–844.

23. B. Flower and M. Jabri, “Summed weight neuron perturbation: An O(n) improvement over weight perturbation,” in Advances in Neural Information Processing Systems, vol. 5. Morgan Kaufmann: San Mateo, CA, 1993, pp. 212–219.

24. G. Cauwenberghs, “A learning analog neural network chip with continuous-recurrent dynamics,” in Advances in Neural Information Processing Systems, vol. 6. Morgan Kaufmann: San Mateo, CA, 1994, pp. 858–865.

25. G. Cauwenberghs, “An analog VLSI recurrent neural network learning a continuous-time trajectory.” IEEE Transactions on Neural Networks 7(2), March 1996.

26. F. Pineda, “Mean-field theory for batched-TD(λ),” submitted to Neural Computation, 1996.

27. S. Grossberg and D. S. Levine, “Neural dynamics of attentionally modulated Pavlovian conditioning: Blocking, inter-stimulus interval, and secondary reinforcement.” Applied Optics 26, pp. 5015–5030, 1987.

28. E. Niebur and C. Koch, “A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons.” Journal of Computational Neuroscience 1, pp. 141–158, 1994.

29. G. Cauwenberghs, “Reinforcement learning in a nonlinear noise shaping oversampled A/D converter.” To appear in Proc. Int. Symp. Circuits and Systems, Hong Kong, June 1997.
