Neural control of dopamine neurotransmission: implications for reinforcement learning

Mayank Aggarwal,1 Brian I. Hyland2 and Jeffery R. Wickens1

1Neurobiology Research Unit, Okinawa Institute of Science and Technology, 1919-1, Tancha, Onna-Son, Kunigami, Okinawa 904-0412, Japan
2Department of Physiology, Medical School, Dunedin, New Zealand

Keywords: basal ganglia, GABA, striatum, temporal difference

Abstract

In the past few decades there has been a remarkable convergence of machine learning with neurobiological understanding of reinforcement learning mechanisms, exemplified by temporal difference (TD) learning models. The anatomy of the basal ganglia provides a number of potential substrates for instantiation of the TD mechanism. In contrast to the traditional concept of direct and indirect pathway outputs from the striatum, we emphasize that projection neurons of the striatum are branched, and individual striatofugal neurons innervate both the globus pallidus externa and the globus pallidus interna/substantia nigra (GPi/SNr). This suggests that the GPi/SNr has the necessary inputs to operate as the source of a TD signal. We also discuss the mechanism for the timing processes necessary for learning in the TD framework. The TD framework has been particularly successful in analysing electrophysiological recordings from dopamine (DA) neurons during learning, in terms of reward prediction error. However, present understanding of the neural control of DA release is limited, and hence the neural mechanisms involved are incompletely understood. Inhibition is conspicuously present among the inputs to the DA neurons, with inhibitory synapses accounting for the majority of synapses on DA neurons. Furthermore, synchronous firing of the DA neuron population requires disinhibition and excitation to occur together in a coordinated manner. We conclude that the inhibitory circuits impinging directly or indirectly on the DA neurons play a central role in the control of DA neuron activity, and further investigation of these circuits may provide important insight into the biological mechanisms of reinforcement learning.

Introduction

Understanding of the neural mechanisms for reinforcement learning took a leap forward in the late 1990s, due to the convergence of results from machine learning, recordings from dopamine (DA) neurons, and DA-dependent synaptic plasticity in the corticostriatal pathway. This convergence led to the concept of the DA signal as a reward prediction error, which could be calculated by the temporal difference (TD) learning algorithm. In the decades that followed, TD learning proved to be a powerful concept for interpreting DA cell activity, and has been applied to the analysis of neurophysiological recordings and data from imaging studies with remarkable success. Although it was not originally introduced as a biological model, the model is predictive and can be instantiated in a biologically plausible architecture. This has led to a common view that to understand TD learning is to understand DA cell firing. Here we consider the validity of the TD model as a biological model. After outlining the assumptions of the TD framework, and putative biological substrates, we examine the biological reality of such a proposal. Although it is possible to fit some parts of the model, there are some important assumptions that do not fit with current biological knowledge. We consider the implications of the biological reality for the future development of models to explain DA cell activity.

The traditional TD learning framework

We define the traditional TD learning framework as exemplified by the description in Montague et al. (1996). Figure 1 shows this framework adapted from Pan et al. (2005); we will use the terminology in this figure. All TD learning models have certain common features that can be mapped on to a network, with nodes and connections. Here, a node is a processing point at which connections converge and/or diverge. A connection transmits information from one node to another and has states associated with it, for example an eligibility state. TD learning is not intended as a one-to-one biological model. However, as applied to the DA system, the TD learning framework makes a number of assumptions about the architecture and operations of the neural system that performs TD learning.

In the simplest possible instantiation, a node might be equated with a neuron. However, in general, a node is more like a nucleus or network of many neurons that performs an operation. Similarly, a

Correspondence: Dr J. R. Wickens, as above. E-mail: [email protected]

Received 17 November 2011, revised 22 January 2012, accepted 30 January 2012

European Journal of Neuroscience, Vol. 35, pp. 1115–1123, 2012 doi:10.1111/j.1460-9568.2012.08055.x

© 2012 The Authors. European Journal of Neuroscience © 2012 Federation of European Neuroscience Societies and Blackwell Publishing Ltd

connection could be a synapse or multiple synapses together with associated axons and synaptic specializations, such as dendritic spines.

The TD learning model could be regarded as no more than a description of DA neuron firing activity. However, we would like to examine its potential as a model of the mechanism for controlling DA neuron activity. For this to have validity, at some level of abstraction, the nodes and connections used by TD learning should correspond to biological entities that have the inputs and the outputs assumed by TD learning.

The basal ganglia have emerged as a candidate host of TD learning in the brain. We examine the possibility of a correspondence between the nodes in the TD learning framework and neurons, or networks, in the basal ganglia. First, we review the assumptions of TD learning in terms of nodes and connections. These assumptions and the corresponding biological equivalents are summarized in Table 1.

Figure 1 shows the nodes and associated variables that are pertinent to the instantiation of the temporal difference model, i.e. stimulus x(t), input weights W(i,t), reward r(t), P(t), P(t − 1), and the temporal difference signal TD(t) calculated from P(t) − P(t − 1). The first node, P, receives the converging input from the entire representation of the input vector. The converging input includes the current input vector and a time-shifted representation of previous input vectors; in other words, a two-dimensional array with the dimensions of the input vector and the number of time steps in each trial. The output of P is the predicted reward at all future times based on the current input to P. This is the inner product of the entire input matrix and the input weight matrix. The biological equivalent of node P is not usually defined, but Montague et al. (1996) suggested the striatum and amygdala as possibilities. The weight matrix in such a case would be stored in the input synapses to these nuclei.
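The computation attributed to node P can be sketched in a few lines. The following Python fragment is purely illustrative and not part of any published model; the array shapes and the np.roll-based register are our own simplifications. It maintains a time-shifted representation of past input vectors and computes P(t) as the inner product of the full input matrix with the weight matrix:

```python
import numpy as np

def shift_inputs(register, x_t):
    """Push the current input vector into the time-shifted representation.

    register has shape (n_stimuli, n_timesteps): column j holds the input
    vector as it appeared j time steps ago (the input shift register).
    """
    register = np.roll(register, 1, axis=1)  # every stored vector ages one step
    register[:, 0] = x_t                     # current input enters at j = 0
    return register

def predict(register, weights):
    """P(t): inner product of the full input matrix with the weight matrix."""
    return float(np.sum(register * weights))
```

With unit weights, P(t) simply counts the active register elements; learning (below) shapes the weights so that P(t) approximates predicted future reward.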

The second node is the temporal difference node, TD. The temporal difference signal, TD(t), is the negative of the time derivative of P, weighted more towards the past than the present, depending on a discount factor. In the algorithm it is calculated by subtracting P(t) from P(t − 1), where P(t − 1) is obtained from a delay node. It represents the reward predicted within a particular time step. The biological equivalent of the TD node is also not usually defined, but we will offer some speculations later.

The third node is the prediction error node, which represents the difference between actual and predicted reward at any given moment. This is a scalar, and in the TD learning framework it can have positive or negative values and varies with each time step. This is the node that is often equated with DA neurons, on the basis of the correlation between DA neuron firing rate and reward prediction error. This variable is shown in Fig. 1 as DA(t). As the firing rate cannot be negative, it can be assumed that negative values are firing rates below a tonic firing rate.

The inputs X(i, j) to the system warrant additional comment. Each

input is, as stated above, a vector of input values given by the environment. The system stores the history of inputs in a time-shifted representation, Input(i, j). This is a register of all past inputs from the environment, in which the time dimension is represented by a spatial index through a cascading sequence of arrays. For convenience we refer to this as the input shift register, by analogy to digital electronics. The biological equivalent of this may be cortical regions that project to the striatum and amygdala (Montague et al., 1996). This is the simplest form of temporal representation; there are alternatives, such as the spectral timing model of Grossberg & Schmajuk (1989).

The input weight matrix W(i, j) is an array of values of input effectiveness that stores the result of learning, with a value for each element in the shift register. The biological equivalent of the input weights is probably the synaptic efficacy of the synapses connecting the cortical inputs to the P nodes.

Fig. 1. Traditional TD model: X refers to the input vector from the environment; Input(i,j) refers to the complete input matrix, with the j dimension representing the shift register; w(i,j) is the input weight matrix, each element of which has a one-to-one correspondence with the elements of the input matrix; k(i,j) is the eligibility trace matrix, where each element of the input matrix has a corresponding element in the eligibility trace matrix; P receives the input matrix along with the weight matrix, and the inputs to P hold the eligibility traces; P(t) is the prediction of all future reward at time t; P(t − 1) is the prediction of all future rewards at time t − 1; TD is the temporal difference node whose output is TD(t) = P(t − 1) − γP(t), where γ is the discount factor; DA is the DA neuron, which receives the input TD(t) and the actual reward signal r(t); DA(t) is the reward prediction error signal, given by DA(t) = r(t) − TD(t).
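The equations in the Fig. 1 caption can be written out directly. This hypothetical helper (the function name and default discount factor are illustrative choices of ours) computes TD(t) = P(t − 1) − γP(t) and DA(t) = r(t) − TD(t):

```python
def prediction_error(r_t, p_t, p_prev, gamma=0.98):
    """Reward prediction error DA(t) from the Fig. 1 caption equations."""
    td = p_prev - gamma * p_t   # TD(t) = P(t-1) - gamma * P(t)
    return r_t - td             # DA(t) = r(t) - TD(t)
```

An unexpected reward (r = 1, no prediction) gives +1; an omitted predicted reward (r = 0, P(t − 1) = 1, P(t) = 0) gives −1; a fully predicted reward gives 0, matching the classic phasic DA response pattern.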

There is also an array of eligibility traces. This is a decaying representation of the activity of each input matrix element, and is also a matrix, k(i,j). The eligibility trace is used during updating of the input weight matrix, where it serves as a multiplier of the prediction error signal in adjusting the weights. No biological equivalent of the eligibility trace has been experimentally identified, but it may be some kind of chemical trace (Izhikevich, 2007) or alternatively some decaying reverberating activity of the cortical inputs (Ludvig et al., 2008).
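The role of the eligibility trace in weight updating can be sketched as follows. The decay factor lam and learning rate alpha correspond to the lambda and alpha entries of Table 1; the function and its default values are illustrative, not claims about any specific biological quantity:

```python
import numpy as np

def update_weights(weights, elig, active_inputs, delta, alpha=0.1, lam=0.9):
    """One learning step: traces decay by lam and accumulate current input
    activity; each weight then receives the prediction error delta, gated
    by how eligible (recently active) its input was."""
    elig = lam * elig + active_inputs       # decaying memory of recent inputs
    weights = weights + alpha * delta * elig
    return weights, elig
```

An input active two steps before the error signal still receives a fraction lam² of the update, which is how credit reaches synapses active before the reward.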

Phenomenology of the DA signal

The phenomenology of DA cell firing activity has recently been reviewed in the context of TD learning in several key articles (Glimcher, 2011). Here we consider DA cell firing activity patterns that depart from the classical TD model in ways that constrain the possible underlying mechanisms. In classical conditioning experiments in which a neutral cue (a tone) is repeatedly paired with a reward, DA responses to reward remain long after new responses to the predictive cue have developed (Pan et al., 2005). This can happen in the TD learning model if a large value is assumed for lambda and a small value is assumed for alpha. The acquisition of responses to the cue in this case would be fast because of the large lambda, whereas the loss of response to the actual reward would be slow because of the small alpha. A prediction of the biological instantiation of TD is therefore that there is a different time course in the acquisition of the DA response to the cue compared with the decrease of the response to the actual reward (Pan et al., 2005).
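This qualitative behaviour can be reproduced in a toy simulation of standard TD(λ) with a tapped-delay-line stimulus representation. This is a sketch under assumed parameters, not a fit to the recordings of Pan et al. (2005): with a large lam the value at cue onset is acquired across the whole interval, while the prediction error at reward time declines gradually at a rate set by the small alpha:

```python
import numpy as np

def train(n_trials=2000, T=10, alpha=0.05, lam=0.9, gamma=1.0):
    """TD(lambda) over a trial of T time steps: a cue starts the trial and a
    reward of 1.0 arrives at the last step. Returns the learned weights and
    the reward-time prediction error on the first and last trials."""
    w = np.zeros(T)                      # one weight per post-cue time step
    first_delta = last_delta = 0.0
    for trial in range(n_trials):
        e = np.zeros(T)                  # eligibility traces
        for t in range(T):
            v_now = w[t]
            v_next = w[t + 1] if t + 1 < T else 0.0   # trial ends after reward
            r = 1.0 if t == T - 1 else 0.0
            delta = r + gamma * v_next - v_now        # prediction error
            e *= gamma * lam                          # traces decay ...
            e[t] += 1.0                               # ... and mark this step
            w += alpha * delta * e
        if trial == 0:
            first_delta = delta
        last_delta = delta
    return w, first_delta, last_delta
```

After training, the reward-time error has decayed towards zero while the cue-onset value w[0] (the learned prediction that produces a response at the cue) has grown, illustrating the two time courses.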

Late in training, when DA responses to the cues are well developed, the responses to primary reward decrease and eventually seem to cease. At this stage, however, presenting the reward without the cue causes a strong response in the DA neurons (Pan et al., 2005). This indicates that such learning does not modify the strength of the synaptic inputs by which reward activates the DA neurons, because they are still effective in trials with no cue. Rather, it points towards direct synaptic inhibition of DA neurons at the time of predicted reward, which opposes the excitation produced by the reward and also depresses the response of the DA neurons when the predicted reward is omitted. Even when electrical stimulation is used to activate DA neurons, as in brain stimulation reward, the DA response to the reward decreases as the reward becomes predictable (Owesson-White et al., 2008), indicating an active inhibitory process acting at the level of the DA neuron.

The foregoing observations suggest that different mechanisms may be involved in the acquisition of DA responses to the cue and the loss of DA responses to the reward. Strictly speaking, this is not a mathematical requirement, but there is other evidence pointing in this direction. For example, the differential sensitivity of DA responses to cues and rewards depending on the cue–reward interval also supports the existence of independent mechanisms (Fiorillo et al., 2008).

For the acquisition of new DA responses to cues, the TD learning model provides a plausible and intuitively appealing biological explanation in terms of plasticity of the input synapses to the P neurons, although it is necessary to assume the existence of eligibility traces, which remain speculative at present. In such a model the cue initiates neural activity that results in DA neuron firing in response to the cue. How this happens will not be considered further here. On the other hand, the inhibition of responses to reward does not immediately suggest a mechanism. In the rest of this section we consider at some length the mechanism for inhibition of responses to reward, because this has major implications for the instantiation of the TD learning model in the brain, and the circuitry involved is complex and not completely understood.

Inhibition of DA neurons at the time of predicted reward implies a process that is initiated or triggered by the cue and involves some kind of timer, which is time-locked to and dependent on the cue. Additional insight is provided by observation of DA neuron activity when two cues are given in sequence, prior to a reward. Using a cue1–cue2–reward paradigm, Pan et al. (2005) showed that there was a slight inhibition of DA neuron activity at the time of cue2 in trials when cue2 was omitted. In this case, the results suggest that the timing process that results in inhibition at the time of cue2 is triggered by cue1. In addition, it was found that the DA response to reward is restored by the omission of cue2. This suggests that a second timing process, resulting in inhibition of DA neurons at the time of primary reward, is triggered by cue2. Another piece of the puzzle is that the effectiveness of the inhibition of DA neurons is less when delays are longer. Fiorillo et al. (2008) found that DA responses to rewards were very sensitive to the cue–reward interval. From this we can conclude that the effectiveness of the timed inhibition process decreases as the time elapsed between the cue and reward increases. This cannot be attributed to a loss of accuracy over the longer time intervals, because the inhibition occurs at the correct time point even though it is reduced in strength.

What is the mechanism for the timer process, initiated by the cue, that results in inhibition of the DA neurons at the time of expected reward?

Table 1. Biological equivalents of the TD model nodes and connections

TD model concept: Biological equivalent

P: Nucleus or network that is afferent to the temporal difference node.
Input state at time t: Action potential or period of increased firing on an afferent axon to a synapse on P, within one time step, caused by a single input event that generates an eligibility trace and contributes to the output of P.
Input state shift register from t to t − x: Shift register of input events, with an independent input for each time step, each component of the input vector, and the corresponding eligibility state.
Weight vector: Synaptic strength of inputs to P.
Reward prediction error: Dopamine cell output.
Temporal difference: Nucleus or network that is afferent to dopamine cells, representing reward prediction at time t (calculates a time derivative of P); not modulated by dopamine input.
Reward: Excitatory input to the dopamine cell.
Eligibility trace: Decaying representation of a previously active synaptic input.
Lambda: Non-zero time constant of the eligibility trace decay rate.
Alpha: The learning rate for the weight vector.
Gamma: A discount factor for future rewards.

We propose that the cue initiates neural activity patterns in a neural network that can potentially inhibit the DA neurons (for example, such a network might be in the dorsal striatum or nucleus accumbens). These are dynamic activity patterns of neural populations, which continue to remain active and to change after the cue. Although they change over time, they exhibit a characteristic sequence of activity, which is repeated in response to later presentations of the same cue. Such patterns have been suggested by several authors (Carrillo-Reid et al., 2008; Ponzi & Wickens, 2010). The pattern started by the cue is assumed to inhibit the DA neurons. An experimental prediction of this model is that striatal neurons should be activated in stereotyped spatiotemporal sequences during the interval between the cue and the reward (Ponzi & Wickens, 2010). Such a prediction could be tested by searching for excessively repeating patterns in multi-unit recordings, or alternatively by experimentally disrupting such sequences with electrical stimulation or optogenetic excitation and inhibition.

Given the above mechanism for a time delay, it is also necessary to postulate a way to learn the correct timing of the inhibition. At the time of the reward, we propose that the particular pattern is reinforced by the DA released at the time of reward, on the basis of DA-mediated synaptic plasticity at the corticostriatal synapses (Reynolds & Wickens, 2002; Wickens, 2009). At the same time, it is plausible to propose DA-dependent strengthening of the synapses connecting the striatum to the DA neurons. As the pattern that is reinforced by DA is only present at the time of the reward, that pattern would be strengthened, leading to DA neuron suppression at the time of actual (i.e. predicted) reward. As the strengthening depends on the DA signal at reward time, it automatically reaches an asymptote when the DA signal is completely cancelled. This speculative model makes a clear prediction that is testable using optogenetic methods (Tsai et al., 2009; Adamantidis et al., 2011). If the DA neuron response to reward is suppressed artificially during training, according to the model, inhibition of DA neurons at the time of the primary reward should not be acquired. Conversely, if the DA response to the cue is suppressed during training and testing, this is predicted to have no effect on the suppression of DA neurons at the time of predicted reward.

Extinction of conditioned responses in the absence of rewards indicates the importance of inhibition. In extinction trials, along with a reduction in response to the conditioned cue, an inhibitory response to the omitted reward develops in the DA neurons (Pan et al., 2008). Activation of gamma-aminobutyric acid (GABA)A receptors in the ventral tegmental area (VTA) is required for such extinction, indicating that it is mediated by synaptic inhibition.

In summary, we argue for the existence of a timing mechanism that is triggered by an established cue and synaptically inhibits the DA neurons at the time of predicted reward. We suggest that neural networks in the striatum produce dynamic activity patterns that provide a possible timing mechanism for inhibition at the time of reward. We now consider the control of the DA signal.

Control of the DA signal

The DA neuron plays a central role in all learning models, and so it is very important to examine its properties, control mechanisms and population behaviour. We now consider the anatomy and physiology of the inhibition of DA neurons, from the perspective of inhibitory control of DA neurons by a timing sequence in the striatum.

Physiology of the DA neuron

DA neurons in the intact brain exhibit different modes of firing, which may be called tonic and phasic. By tonic firing we mean firing at a low, irregular rate. This is produced by an intrinsic mechanism (Lacey et al., 1989; Wilson & Callaway, 2000; Grillner & Mercuri, 2002) that by itself produces clock-like firing patterns, which have been observed in anaesthetized (Paladini & Tepper, 1999) and, less commonly, in conscious rats (Hyland et al., 2002). It is driven by a slow, large-amplitude pacemaker depolarization (Grace & Bunney, 1983; Grace & Onn, 1989) ending in a single spike and followed by an afterhyperpolarization. This pacemaker-like activity underlies the highly regular firing pattern observed in DA neurons in the in vitro brain-slice preparation (Grace & Onn, 1989) and is presumably the underlying dynamic of the DA neuron on which excitatory and inhibitory inputs act. Inhibitory input results in delay or prevention of action potential firing by causing a reset of the neuron's membrane potential.

Although regular firing is present in a proportion of neurons, in the majority of neurons the regular firing pattern is perturbed by synaptic inputs (Hyland et al., 2002). In anaesthetized animals a burst-firing mode is seen (Grace & Bunney, 1984), which is presumably due to spontaneous synaptic input activity, because it is not normally seen in vitro except after some rather drastic manipulations (Johnson et al., 1992).

Inhibitory control is extremely important and may be more important than excitatory control. Single DAergic neurons in the substantia nigra pars compacta (SNc), which have aspiny, smooth dendrites, receive about 8000–9000 synapses, of which approximately 65–70% are GABAergic and 30% are glutamatergic (Tepper & Lee, 2007). This proportion of inhibitory synapses is much higher than in most other parts of the brain. In addition to extrinsic inputs from various nuclei to DA neurons, there exist GABAergic interneurons in both the VTA and SNc, which are inhibitory in nature (Yanagihara & Hessler, 2006; Hara et al., 2007; Hazy et al., 2010). However, the interneurons account for a relatively small fraction of the GABAergic synapses on DA neurons, with the majority arising from extrinsic sources.

Direct electrophysiological stimulation of the habenula nucleus produces pauses in tonic DA firing (Shepard et al., 2006), and this effect appears to be mediated via glutamatergic projections onto GABAergic interneurons of the VTA and SNc (Shepard et al., 2006). Midbrain GABAergic neurons are known from other studies to play a role in controlling tonic DA firing (Floresco et al., 2003). These neurons project to the prefrontal cortex (PFC) and the nucleus accumbens (NAc) and receive inputs from the PFC (the PFC projects to mesoaccumbal GABAergic neurons) (Korotkova et al., 2004). They also receive inhibitory inputs from the NAc (Steffensen et al., 1998) and other sources, such as hippocampal CA3 via the lateral septum (e.g. Luo et al., 2011).

The excitatory synapses on DA and GABA VTA neurons are different: synapses on DA neurons exhibit depression in response to repetitive activation, whereas synapses on GABA neurons show facilitation (Korotkova et al., 2004). In addition, nicotinic receptors on GABAergic neurons show faster desensitization than those on DA neurons in response to nicotine doses (Yin & French, 2000). GABAergic neurons in the SNc and VTA display spike durations shorter than those of DA neurons, fire regularly (no burst firing), and fire at a high frequency (19 Hz) during depolarizing current steps (Korotkova et al., 2003).

Consistent with the strong representation of inhibition, physiological studies show that burst firing in DA neurons does not occur unless inhibition is reduced. Activation of excitatory afferents to the VTA results in burst firing, but only in those DA neurons that are already spontaneously active; glutamate agonists cause burst firing only in neurons that are spontaneously active (Lodge & Grace, 2006). In neurons that are inactive, presumably due to GABA-mediated inhibition, activation of glutamatergic afferents

has little or no effect (Lodge & Grace, 2006). Conversely, a decrease in tonic GABAergic transmission to the VTA results in a significant and selective increase in DA neuron population activity, corresponding to increases in tonic DA efflux in the NAc (Floresco et al., 2003). It has been suggested that only neurons not under GABA-mediated hyperpolarization are capable of entering a burst-firing mode in response to a glutamatergic input (Lodge & Grace, 2006).

On the other hand, reducing inhibition by itself is not sufficient to produce burst firing. The increase in DA neuron population activity that results from a decrease in tonic GABAergic transmission does not in itself cause an increased firing rate or burst firing (Floresco et al., 2003). Conversely, local application of N-methyl-d-aspartate receptor (NMDAR) antagonists, but not a-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid receptor (AMPAR) antagonists, significantly reduces burst firing, suggesting that the excitatory drive of glutamate is NMDAR-mediated (Engberg et al., 1993; Paladini & Tepper, 1999; Tepper & Lee, 2007). Thus, excitatory input is necessary for burst firing. Taken together, these findings indicate that DA neuron burst firing requires two events to be coordinated: an increase in excitatory input and a decrease in inhibitory input. At the level of individual DA neurons these two events may occur together by chance rather than by active coordination, but this will not result in synchronized activation across a population of DA neurons. We now consider what causes the burst firing observed in awake, behaving animals, for example after an unexpected reward.
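The distinction between chance coincidence and active coordination can be made concrete with a back-of-envelope probability calculation; the event probabilities here are arbitrary illustrative values, not measured quantities:

```python
def p_chance_synchrony(p_excite, p_disinhibit, n_neurons):
    """Probability that every neuron in a population happens to receive
    excitation AND disinhibition in the same time window, if the two
    events occur independently and without coordination across neurons."""
    p_burst = p_excite * p_disinhibit   # both events needed for one burst
    return p_burst ** n_neurons         # all neurons bursting by coincidence
```

With, say, a 10% chance of each event per window, a single neuron bursts by chance in 1% of windows, but a population of 20 neurons synchronizes by chance with probability on the order of 10⁻⁴⁰, i.e. effectively never; synchronized population bursts therefore imply coordinated disinhibition and excitation.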

Modulation of DA neuron firing by actual reward

The previous section has highlighted the importance of inhibitory control over DA neurons in the suppression of DA neuron firing at the time of actual reward, when the reward is predictable. We now present arguments to take this a step further, to invoke a reduction of inhibitory GABAergic inputs in the activation of DA neurons. In particular, we will show that the activation of DA neurons by unexpected rewards involves disinhibition by inhibition of GABAergic inputs, which is necessary in conjunction with excitation by glutamatergic inputs.

As we have noted, reducing inhibition is necessary for DA neurons to switch into a burst-firing mode. Conversely, burst firing can be prevented by increasing inhibition. Burst firing induced by NMDA agonists was abolished by bath application of GABAA agonists (Tepper & Lee, 2007). This effect could not be reversed by depolarizing current injection to counter the GABA-induced hyperpolarization, suggesting that GABAAR-mediated conductances may be a critical factor. This latter observation emphasizes the importance of direct inhibitory synaptic inputs to the DA neurons.

Recent experiments suggest that reward may be signalled by synchrony rather than by an average increase in firing rate. The average response rate and signal correlations of DA neurons increase for both rewarding and aversive stimuli, whereas the noise correlation, which takes into account the temporal dynamics of the firing pattern, increases only for rewarding stimuli and the reward-predicting cue (Joshua et al., 2009). Thus, it appears that the firing patterns of the DA neurons as a group are different for rewarding (and reward-predicting) and aversive (and aversion-predicting) stimuli, which could lead to different levels of DA release and different DA dynamics in the extracellular space of the targets (different uptake vs. diffusion dynamics). The TD model assumes that rewards evoke synchronous firing. The requirements for synchrony of DA neurons across a population are probably even more severe than the requirements for bursting in a single neuron. The level of temporal correlation required to signal rewarding events, along with the findings of receptor-level modulation, leads to the hypothesis that rewards cause both disinhibition and excitation of the DA neurons. This implies that both a GABAAR-mediated decrease in input resistance and NMDAR activation are necessary to induce highly synchronous burst firing of the DA neurons at the population level.

Assumptions required for biological instantiation of TD learning

The assumptions required for biological instantiation of TD learning can now be considered in light of the above discussion. There are a number of assumptions, both explicit and implicit, required for biological instantiation of TD learning. Not all of the assumptions of TD learning are a good fit with the biology, without special pleading. Although some of these assumptions may be met by the biology, the places where the fit is not good suggest that the biology has the potential for more interesting computations. The assumptions are enumerated below, first stating the assumption of the TD model and then considering whether the biology is compatible with such an assumption.

Assumption 1. The DA signal is scalar

The TD model assumes that the DA signal is scalar. This follows from assuming only one DA neuron in the model. If, as in the real brain, there is more than one DA neuron, then to be consistent with the model they must all be doing the same thing at the same time. That is, their firing must be completely synchronized on all timescales to meet this assumption. With respect to rewards, for which the TD model makes this assumption, there is considerable synchrony of firing rate increases. However, the DA neuron population is capable of responding to stimuli in a non-synchronized manner as well (Joshua et al., 2009). Also, in electrophysiological studies, the proportion of responsive neurons is < 100%. The scalar nature of the DA signal assumed in the model also does not take into account the spatial variation in DA concentration brought about by differences in release and reuptake at target sites. On the micrometre scale within the striatum, a homogeneous distribution of DA is supported given reasonable assumptions (Arbuthnott & Wickens, 2007). However, there are regional variations in DA release. With microdialysis, Bassareo et al. (2011) found that Pavlovian stimuli conditioned to food elicited an increase in DA in the core, but not the shell, of the NAc. They and others also found that drug-conditioned stimuli differentially elicit DA release in subregions of the striatum (Ito et al., 2004; Bassareo et al., 2011). Zhang et al. (2009) reported that phasic DA release in response to unexpected food rewards was very sparse in the dorsolateral striatum compared with the NAc. Brown et al. (2011) reported similar differences between the NAc shell and the dorsomedial and dorsolateral striatum, showing that phasic DA release is not uniformly broadcast to all striatal regions in response to reward, but is selectively evoked in distinct regions.

Assumption 2. The DA signal at target sites corresponds directly to the error signal, faithfully tracking the error with unlimited temporal resolution

In reality the DA signal includes the transformations in the steps between DA neuron firing and signal transduction at the receiving neurons. The time course of DA concentration changes depends not only on release but also on reuptake. There is considerable variation in the density of innervation between different striatal regions. In the dorsolateral striatum the density of DA varicosities is 1.1 × 10^8 per mm^3, compared with 0.6 × 10^8 in the ventral striatum (Doucet et al., 1986). The density of DA transporters varies over a wider range because the number of DA transporters per synapse is lower in the less densely innervated regions. As a consequence there are different rate constants for the release and uptake of DA in different regions (Garris & Wightman, 1994). Furthermore, a negative prediction error, expressed as a pause or phasic reduction in firing, is assumed to cause a corresponding decrease in concentration, when in reality the time course depends on diffusion and reuptake mechanisms as well as on neuron firing rate. These observations suggest that the DA signal is transformed in different ways in different receiving areas.

© 2012 The Authors. European Journal of Neuroscience © 2012 Federation of European Neuroscience Societies and Blackwell Publishing Ltd. European Journal of Neuroscience, 35, 1115–1123
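The consequence of region-specific rate constants can be sketched with a minimal release-and-uptake model in the spirit of the voltammetric literature (Garris & Wightman, 1994): each action potential adds a fixed concentration increment, and uptake follows Michaelis-Menten kinetics. All parameter values here are illustrative assumptions, not measurements.

```python
# Minimal extracellular DA model: each spike adds a fixed concentration
# increment; uptake follows Michaelis-Menten kinetics. All parameter
# values are illustrative assumptions, not measurements.

def simulate_da(spike_times, c_per_spike, vmax, km=0.2, dt=0.001, t_end=2.0):
    """Return the DA concentration trace evoked by a spike train."""
    spike_steps = {int(t / dt) for t in spike_times}
    c, trace = 0.0, []
    for step in range(int(t_end / dt)):
        if step in spike_steps:
            c += c_per_spike                     # release per action potential
        c -= dt * vmax * c / (km + c + 1e-12)    # Michaelis-Menten uptake
        trace.append(c)
    return trace

burst = [0.5 + 0.01 * k for k in range(10)]      # a 10-spike burst at 100 Hz

# Densely innervated 'dorsal' site: more release, faster uptake.
dorsal = simulate_da(burst, c_per_spike=0.10, vmax=4.0)
# Sparsely innervated 'ventral' site: less release, slower uptake.
ventral = simulate_da(burst, c_per_spike=0.05, vmax=1.0)
print(max(dorsal), max(ventral))   # same firing, different transients
```

The same spike train produces transients with different peaks and decay times at the two sites, which is the sense in which the DA signal is transformed differently at different targets.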

Assumption 3. The reward prediction error can be positive or negative

The physiological interpretation of this, given that DA concentration or DA neuron firing rate can only be positive or zero, is that there is a set point above which firing activity encodes a positive value and below which it encodes a negative value. However, because the firing rate of DA neurons is low, there is a floor effect: the range over which firing rate can change in the downward direction is smaller than that available in the upward direction.
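A minimal sketch of this floor effect, assuming an illustrative tonic set point and gain (the model itself specifies neither):

```python
# Sketch of the floor effect: prediction errors are encoded as deviations
# from a tonic baseline rate, but the rate cannot fall below zero.
# The baseline and gain values are illustrative assumptions.

BASELINE_HZ = 5.0   # assumed tonic set point
GAIN = 20.0         # assumed Hz per unit of prediction error

def rate_from_error(delta):
    """Map a signed prediction error onto a non-negative firing rate."""
    return max(0.0, BASELINE_HZ + GAIN * delta)

def decoded_error(rate):
    """What a downstream reader would recover from that rate."""
    return (rate - BASELINE_HZ) / GAIN

# Positive errors have headroom; negative errors saturate at the floor.
print(rate_from_error(+1.0))                 # 25.0 Hz: fully expressed
print(rate_from_error(-1.0))                 # 0.0 Hz: clipped at the floor
print(decoded_error(rate_from_error(-1.0)))  # -0.25, not -1.0: truncated
```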

Assumption 4. The TD signal can be positive or negative

This condition can be relaxed if there is a point above which firing activity encodes a positive value and below which it encodes a negative value. However, to encode the TD signal in this way, the sign of the TD input to the DA neurons must be negative. This is because the positive TD signal must silence the DA neurons at the time of actual reward, and the negative TD signal must activate the DA neurons in response to the cue. Thus, the TD model assumes that the cue causes the DA neurons to fire through an immediate disinhibition, without invoking the excitatory inputs to the DA neurons.

Assumption 5. There is an eligibility trace if lambda is non-zero

If lambda is assumed to be zero, the TD model predicts a stepwise shift in the timing of the prediction error signal, so that it moves gradually in latency from the reward towards the cue (Montague et al., 1996; Schultz et al., 1997). However, such a gradual shift in the latency of DA neuron activity has not been seen experimentally. Rather, new conditioned responses develop at a constant, short latency after the predictive cues within a few trials (Pan et al., 2005). Using a model with a non-zero, high value of lambda, it has been shown that a DA response can develop at the time of the cue without a gradual shift from the time of actual reward. In such a case prediction errors still occur between the cue and the reward, but their magnitude is so low that they may go undetected in experimental studies (Pan et al., 2005). Thus, eligibility traces are a necessary assumption to fit the TD model to reality.
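The effect of lambda can be illustrated with the standard tabular TD(lambda) update over the time steps of a trial, using accumulating eligibility traces; the parameters below are illustrative.

```python
# Tabular TD(lambda) over a trial in which a cue at step 0 is followed by
# reward at the final step. With lam = 0 the prediction error propagates
# back by one step per trial; with a high lam the eligibility trace lets
# a single error update all preceding steps. Parameters are illustrative.

def run_trials(lam, n_trials=5, n_steps=11, alpha=0.3, gamma=1.0):
    v = [0.0] * (n_steps + 1)            # value per time step; terminal = 0
    for _ in range(n_trials):
        e = [0.0] * (n_steps + 1)        # eligibility traces, reset per trial
        for t in range(n_steps):
            r = 1.0 if t == n_steps - 1 else 0.0   # reward at end of trial
            delta = r + gamma * v[t + 1] - v[t]    # prediction error
            e[t] += 1.0                            # current state is eligible
            for s in range(n_steps):
                v[s] += alpha * delta * e[s]
                e[s] *= gamma * lam                # traces decay each step
    return v

v0 = run_trials(lam=0.0)
v9 = run_trials(lam=0.9)
print(v0[0], v9[0])   # cue-time value: 0.0 for TD(0), positive for lam = 0.9
```

After five trials, TD(0) has propagated value only part of the way back from the reward and the cue-time value is still exactly zero, whereas the high-lambda trace has carried credit to the cue from the very first rewarded trial, without any gradual latency shift.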

Assumption 6. Eligibility requires input but not output activity, and hence weight changes require only two factors, DA and input

In the original formulation for prediction of reward by an 'adaptive critic element', Barto et al. (1983) defined eligibility in a manner that did not depend on the output activity of the neuron-like element, which in the TD framework of Fig. 1 is the P node. For description, we call this the 'type 1' eligibility rule. As the P node is simply trying to predict reward on the basis of sensory input, a type 1 rule is sufficient for the eligibility state to hold a short-term memory of the recent input, on which the prediction error (DA) operates to improve predictions. In biological terms, a type 1 rule implies that presynaptic activity plus DA would be sufficient to change weights. However, under such conditions, experimental evidence shows that corticostriatal synapses do not change (Reynolds & Wickens, 2002), suggesting that a type 1 rule is non-biological.

A type 1 rule also runs into theoretical problems if there is more than one threshold-driven P-node-level input neuron. Thresholds are a biological reality, and have the effect that there is no output from a cell in response to subthreshold inputs. The theoretical problem stems from the fact that a given input may not fire the corresponding P-level input neuron, yet will still leave an eligibility trace. This is problematic because a DA signal produced by the activity of other P-level neurons may then act on the eligibility trace of neurons that did not contribute to the DA signal, which would impair learning.

In situations in which a selection of output is being learnt by an 'actor', such as control of a system, the eligibility trace should help to assign credit for causing the rewarded outcome. In that case, eligibility should depend on the output activity as well as the input. In relation to action learning, Barto et al. (1983) defined a second type of eligibility rule, which we term the 'type 2' eligibility rule, as the trace of a product of both the input and the output of the element. The combination of a type 2 eligibility rule with DA leads to the 'three-factor' learning rule (Wickens, 1993), for which there is some experimental support (Wickens et al., 1996; Reynolds et al., 2001). Related to this is a formulation based on spike-time-dependent plasticity (Izhikevich, 2007). At present there is no experimental evidence to support the existence of a synaptic eligibility trace of either type.
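The difference between the two rules can be stated in a few lines of code; the binary activity variables and unit DA signal are illustrative simplifications.

```python
# Contrast of the two eligibility rules discussed above. Under the
# 'type 1' rule, presynaptic input alone creates eligibility, so DA can
# modify synapses of neurons that never fired. Under the 'type 2'
# (three-factor) rule, eligibility requires input AND output, so credit
# goes only to synapses that contributed. Values are illustrative.

def weight_change(pre_active, post_fired, da, rule):
    if rule == "type1":
        eligibility = 1.0 if pre_active else 0.0
    elif rule == "type2":
        eligibility = 1.0 if (pre_active and post_fired) else 0.0
    else:
        raise ValueError(rule)
    return eligibility * da          # DA gates the eligible synapses

# A neuron receives subthreshold input (active input, no output spike)
# while other neurons produce the DA burst.
da_burst = 1.0
print(weight_change(True, False, da_burst, "type1"))  # 1.0: spurious change
print(weight_change(True, False, da_burst, "type2"))  # 0.0: no credit given
```

Under the type 1 rule, the subthreshold neuron's synapse is modified by a DA signal it did not help to produce; the type 2 rule assigns it no credit.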

Assumption 7. There is a delay register for the states, provided by a buffered memory for each stimulus that lasts as long as a trial

The problem with this is the large amount of memory consumed. Although in principle an arbitrary number of time steps could be stored in this way, this discrete-element solution requires a set of neuron equivalents that are sequentially activated, forming a two-dimensional array whose number of elements must increase according to the trial duration divided by the temporal resolution. It is difficult to imagine how such an array could be flexibly implemented in neural tissue so as to deal with a variety of different types of tasks. Nevertheless, a timing mechanism is required for any learning model. We have suggested that such a mechanism might be provided by dynamic activity patterns of neural populations, which remain active and continue to change after the cue (Carrillo-Reid et al., 2008; Ponzi & Wickens, 2010). However, although this could be used to time the inhibition of DA neurons to prevent their activation by predicted reward, it does not map onto the architecture of the TD learning model.
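The memory cost of the delay-register assumption is easy to make explicit; the trial duration, resolution and stimulus count below are arbitrary examples.

```python
# Memory cost of the delay-register ('complete serial compound')
# assumption: one unit per time step per stimulus. The trial duration,
# resolution and stimulus count below are arbitrary examples.

def delay_line_units(trial_s, resolution_s, n_stimuli):
    """Units needed if every stimulus has its own tapped delay line."""
    steps_per_stimulus = round(trial_s / resolution_s)
    return steps_per_stimulus * n_stimuli

def shift_register(register, new_input):
    """One time step: each stored marker moves one slot along the line."""
    return [new_input] + register[:-1]

# A 10 s trial at 100 ms resolution with 5 stimuli needs 500 units.
print(delay_line_units(10.0, 0.1, 5))   # 500

# A cue entered at one step sits at slot 3 after three further steps.
reg = [0] * delay_line_units(10.0, 0.1, 1)
reg = shift_register(reg, 1)
for _ in range(3):
    reg = shift_register(reg, 0)
print(reg.index(1))   # 3
```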

Discussion

We have argued for the existence of a timing mechanism that is triggered by an established cue and that synaptically inhibits the DA neurons at the time of predicted reward. We suggested that neural networks in the striatum produce dynamic activity patterns that provide a possible timing mechanism for inhibition at the time of reward. Learning to inhibit the DA neurons at the time of primary reward may be based on DA-dependent plasticity in the synapses connecting the cerebral cortex to the striatum, and the striatum to the midbrain DA neurons.

Our proposal relates particularly to the ventral striatum and the control of DA neuron firing, which in reinforcement learning terms would be the 'critic'. A similar timing process may well take place in parallel and synchronously in the dorsal striatum, triggered by the same cues; in reinforcement learning terms the dorsal striatum would then be functioning as the 'actor'. We have not considered the dorsal striatum further in this review.

We have considered the anatomy and physiology of the inputs to DA neurons from the perspective of inhibitory control. This showed that an inverse coordination of synaptic excitation and inhibition is required for burst firing at the level of individual neurons, and for synchronization at the network level.

An important biological observation to emphasize is that the input to DA neurons is predominantly inhibitory. This inhibitory network controlling the DA neurons is very complex and involves interconnections with many afferent structures in the basal ganglia and beyond. The striatum (dorsal striatum along with the NAc) seems to play a major part in controlling this network via these afferent structures, as shown in Fig. 2.

Traditional concepts of the basal ganglia circuit emphasize direct and indirect pathways. According to this model, direct pathway neurons project to the globus pallidus interna/substantia nigra (GPi/SNr) (entopeduncular nucleus in the rat), and indirect pathway neurons project to the globus pallidus externa (GPe). Recent evidence in mice (Kravitz et al., 2010) and rats (Ferguson et al., 2011) using genetic targeting of cell populations supports this distinction at the macroscopic level. However, at the level of single neurons, individual spiny neurons branch to multiple sites (Kawaguchi et al., 1990; Wu et al., 2000; Levesque & Parent, 2005). In particular, the same spiny neurons may branch and project direct connections to both direct and indirect pathways. This architecture (see Fig. 3) provides a possible site for the differentiation of the predicted reward signal from the striatum with respect to time; that is, it provides an inverting sign step with a time delay that can sum with a non-inverting step. Such a circuit will detect changes in the predicted reward value. When these are positive, the output will be inhibitory to the DA neurons, as required by the TD model. Thus, looking at these connections in isolation provides a possible mapping of the TD model onto the biology. However, such a mapping ignores the complexity of the circuitry available to control DA neurons. Future models of reinforcement learning may be able to explain the advantages of the dynamic processing possibilities of this complex network better than TD learning.

Fig. 2. Complexity of the inhibitory control of the midbrain DA neurons. This shows the direct GABAergic inputs to the DA neurons from other parts of the brain, and an example of input via GABAergic interneurons. The supporting evidence for each numbered connection is cited as follows: 1, Shepard et al. (2006), Christoph et al. (1986); 2, Herkenham & Nauta (1977); 3, Parent et al. (1989); 4, DeVito et al. (1980), Hassani et al. (1997), Gauthier et al. (1999); 5, DeVito & Anderson (1982); 6, Deniau et al. (1982); 7, DeVito & Anderson (1982); 8, DeVito & Anderson (1982); 9, Parent et al. (1989); 10, Deniau et al. (1996, 1982); 11, Heimer et al. (1991), Groenewegen et al. (1999); 12, Jimenez-Castellanos & Graybiel (1987); 13, Voorn et al. (1986), Jimenez-Castellanos & Graybiel (1987); 14, Heimer et al. (1991), Groenewegen et al. (1999); 15, Heimer et al. (1991), Usuda et al. (1998), Brog et al. (1993); 16, Heimer et al. (1991), Usuda et al. (1998); 17, Smith et al. (2009); 18, Smith et al. (2009), Beckstead et al. (1979); 19, DeVito et al. (1980); 20, Hassani et al. (1997), Gauthier et al. (1999).

Fig. 3. Network for the TD node. Individual spiny neurons branch to provide inhibitory input to both GPe and GPi/SNr/EP neurons. The output from the GPe neurons has an inhibitory effect on neurons in the subthalamic nucleus (STN). From the STN there is an excitatory projection to GPi/SNr/EP. Thus, at the level of the GPi/SNr/EP there is convergence of a direct inhibitory input from the striatum with an indirect, excitatory input via the subthalamic nucleus, both originating in the same neuron. The STN circuit may introduce a time delay, so that the combination of a comparator and time delay performs the mathematical operation of differentiation of the striatal output. This circuitry may perform the TD operation required for TD learning.

The biological instantiation of the TD model, as commonly imagined, assumes that the input to the DA neurons from reward centres is mediated by excitatory synapses on DA neurons. We have shown that it may be necessary to take into account the inhibitory synapses and the associated dynamic of disinhibition, because of the experimentally determined neurobiology of the control of DA neurons. What advantage does such an arrangement have? Perhaps it is the brain's mechanism for dealing with the variety of different goals that must compete for control of behaviour. It also provides a way in which various stimuli may activate DA neurons in quite different ways, for example by modulating the overall synchrony of the DA neuron population via the inhibitory inputs. Another potential advantage is that interposing an additional stage between the reward centres and the DA neurons provides more control options. The comparison between the predicted and the actual reward can be carried out through competition between disinhibition by the actual reward and inhibition by the predicted reward (the striatal patterns). It also provides a way to shift the locus of competition to different nuclei projecting to the DA neurons.

In conclusion, we suggest that the inhibitory circuits impinging directly or indirectly on the DA neurons play a central role in the control of DA neuron activity, and further investigation of these circuits may provide important insight into the biological mechanisms of reinforcement learning.
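The comparator-plus-time-delay operation proposed for the circuit of Fig. 3 can be written as a discrete-time differentiator: an immediate, sign-inverting copy of the striatal prediction signal sums at GPi/SNr with a delayed, non-inverting copy arriving via GPe and the STN. The single-step delay and the signed output are illustrative simplifications.

```python
# Discrete-time sketch of the comparator-plus-delay circuit of Fig. 3:
# the striatal prediction signal P reaches GPi/SNr twice, once directly
# (inhibitory, sign-inverting) and once via GPe and STN (net excitatory,
# delayed by the extra synapses). Their sum approximates the negative
# temporal derivative of P. The one-step delay is an assumption.

def gpi_output(p, delay=1):
    """Delayed non-inverted copy plus immediate inverted copy of p."""
    out = []
    for t in range(len(p)):
        delayed = p[t - delay] if t >= delay else 0.0   # via GPe -> STN
        direct = -p[t]                                  # direct striatal input
        out.append(delayed + direct)                    # comparator at GPi/SNr
    return out

# A prediction signal that steps up and later back down.
p = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print(gpi_output(p))   # -1 at the upward step, +1 at the downward step
```

The output deflects only when the prediction changes, which is the change-detecting behaviour the text attributes to this convergence of pathways.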

Abbreviations

DA, dopamine; EP, entopeduncular nucleus; GABA, gamma-aminobutyric acid; GPe, globus pallidus externa; GPi/SNr, globus pallidus interna/substantia nigra; LH, lateral habenula; MB, midbrain; NAc, nucleus accumbens; NMDAR, N-methyl-D-aspartate receptor; PFC, prefrontal cortex; SNc, substantia nigra pars compacta; STN, subthalamic nucleus; TD, temporal difference; VP, ventral pallidum; VTA, ventral tegmental area.

References

Adamantidis, A.R., Tsai, H.C., Boutrel, B., Zhang, F., Stuber, G.D., Budygin, E.A., Tourino, C., Bonci, A., Deisseroth, K. & de Lecea, L. (2011) Optogenetic interrogation of dopaminergic modulation of the multiple phases of reward-seeking behavior. J. Neurosci., 31, 10829–10835.

Arbuthnott, G.W. & Wickens, J. (2007) Space, time and dopamine. Trends Neurosci., 30, 62–69.

Barto, A.G., Sutton, R.S. & Anderson, C.W. (1983) Neuronlike elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cyber., 15, 835–846.

Bassareo, V., Musio, P. & Di Chiara, G. (2011) Reciprocal responsiveness of nucleus accumbens shell and core dopamine to food- and drug-conditioned stimuli. Psychopharmacology, 214, 687–697.

Beckstead, R.M., Domesick, V.B. & Nauta, W.J.H. (1979) Efferent connections of the substantia nigra and ventral tegmental area in the rat. Brain Res., 175, 191–217.

Brog, J.S., Salyapongse, A., Deutch, A.Y. & Zahm, D.S. (1993) The patterns of afferent innervation of the core and shell in the 'accumbens' part of the rat ventral striatum: immunohistochemical detection of retrogradely transported fluoro-gold. J. Comp. Neurol., 338, 255–278.

Brown, H.D., McCutcheon, J.E., Cone, J.J., Ragozzino, M.E. & Roitman, M.F. (2011) Primary food reward and reward-predictive stimuli evoke different patterns of phasic dopamine signaling throughout the striatum. Eur. J. Neurosci., 34, 1997–2006.

Carrillo-Reid, L., Tecuapetla, F., Tapia, D., Hernandez-Cruz, A., Galarraga, E., Drucker-Colin, R. & Bargas, J. (2008) Encoding network states by striatal cell assemblies. J. Neurophysiol., 99, 1435–1450.

Christoph, G.R., Leonzio, R.J. & Wilcox, K.S. (1986) Stimulation of the lateral habenula inhibits dopamine-containing neurons in the substantia nigra and ventral tegmental area of the rat. J. Neurosci., 6, 613–619.

Deniau, J.M., Kitai, S.T., Donoghue, J.P. & Grofova, I. (1982) Neuronal interactions in the substantia nigra pars reticulata through axon collaterals of the projection neurons. An electrophysiological and morphological study. Exp. Brain Res., 47, 105–113.

Deniau, J.M., Menetrey, A. & Charpier, S. (1996) The lamellar organization of the rat substantia nigra pars reticulata: segregated patterns of striatal afferents and relationship to the topography of corticostriatal projections. Neuroscience, 73, 761–781.

DeVito, J.L. & Anderson, M.E. (1982) An autoradiographic study of efferent connections of the globus pallidus in Macaca mulatta. Exp. Brain Res., 46, 107–117.

DeVito, J.L., Anderson, M.E. & Walsh, K.E. (1980) A horseradish peroxidase study of afferent connections of the globus pallidus in Macaca mulatta. Exp. Brain Res., 38, 65–73.

Doucet, G., Descarries, L. & Garcia, S. (1986) Quantification of the dopamine innervation in adult rat neostriatum. Neuroscience, 19, 427–445.

Engberg, G., Kling-Petersen, T. & Nissbrandt, H. (1993) GABAB-receptor activation alters the firing pattern of dopamine neurons in the rat substantia nigra. Synapse, 15, 229–238.

Ferguson, S.M., Eskenazi, D., Ishikawa, M., Wanat, M.J., Phillips, P.E., Dong, Y., Roth, B.L. & Neumaier, J.F. (2011) Transient neuronal inhibition reveals opposing roles of indirect and direct pathways in sensitization. Nat. Neurosci., 14, 22–24.

Fiorillo, C.D., Newsome, W.T. & Schultz, W. (2008) The temporal precision of reward prediction in dopamine neurons. Nat. Neurosci., 11, 966–973.

Floresco, S.B., West, A.R., Ash, B., Moore, H. & Grace, A.A. (2003) Afferent modulation of dopamine neuron firing differentially regulates tonic and phasic dopamine transmission. Nat. Neurosci., 6, 968–973.

Garris, P.A. & Wightman, R.M. (1994) Different kinetics govern dopaminergic neurotransmission in the amygdala, prefrontal cortex, and striatum: an in vivo voltammetric study. J. Neurosci., 14, 442–450.

Gauthier, J., Parent, M., Levesque, M. & Parent, A. (1999) The axonal arborization of single nigrostriatal neurons in rats. Brain Res., 834, 228–232.

Glimcher, P.W. (2011) Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proc. Natl Acad. Sci. USA, 108(Suppl. 3), 15647–15654.

Grace, A.A. & Bunney, B.S. (1983) Intracellular and extracellular electrophysiology of nigral dopaminergic neurons—1. Identification and characterization. Neuroscience, 10, 301–315.

Grace, A.A. & Bunney, B.S. (1984) The control of firing pattern in nigral dopamine neurons: burst firing. J. Neurosci., 4, 2877–2890.

Grace, A.A. & Onn, S.P. (1989) Morphology and electrophysiological properties of immunocytochemically identified rat dopamine neurons recorded in vitro. J. Neurosci., 9, 3463–3481.

Grillner, P. & Mercuri, N.B. (2002) Intrinsic membrane properties and synaptic inputs regulating the firing activity of the dopamine neurons. Behav. Brain Res., 130, 149–169.

Groenewegen, H.J., Wright, C.I., Beijer, A.V. & Voorn, P. (1999) Convergence and segregation of ventral striatal inputs and outputs. Ann. N. Y. Acad. Sci., 877, 49–63.

Grossberg, S. & Schmajuk, N. (1989) Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Netw., 2, 79–102.

Hara, E., Kubikova, L., Hessler, N.A. & Jarvis, E.D. (2007) Role of the midbrain dopaminergic system in modulation of vocal brain activation by social context. Eur. J. Neurosci., 25, 3406–3416.


Hassani, O.K., Francois, C., Yelnik, J. & Feger, J. (1997) Evidence for a dopaminergic innervation of the subthalamic nucleus in the rat. Brain Res., 749, 88–94.

Hazy, T.E., Frank, M.J. & O'Reilly, R.C. (2010) Neural mechanisms of acquired phasic dopamine responses in learning. Neurosci. Biobehav. Rev., 34, 701–720.

Heimer, L., Zahm, D.S., Churchill, L., Kalivas, P.W. & Wohltmann, C. (1991) Specificity in the projection patterns of accumbal core and shell in the rat. Neuroscience, 41, 89–125.

Herkenham, M. & Nauta, W.J. (1977) Afferent connections of the habenular nuclei in the rat. A horseradish peroxidase study, with a note on the fiber-of-passage problem. J. Comp. Neurol., 173, 123–146.

Hyland, B.I., Reynolds, J.N., Hay, J., Perk, C.G. & Miller, R. (2002) Firing modes of midbrain dopamine cells in the freely moving rat. Neuroscience, 114, 475–492.

Ito, R., Robbins, T.W. & Everitt, B.J. (2004) Differential control over cocaine-seeking behavior by nucleus accumbens core and shell. Nat. Neurosci., 7, 389–397.

Izhikevich, E.M. (2007) Solving the distal reward problem through linkage of STDP and dopamine signaling. Cereb. Cortex, 17, 2443–2452.

Jimenez-Castellanos, J. & Graybiel, A.M. (1987) Subdivisions of the dopamine-containing A8-A9-A10 complex identified by their differential mesostriatal innervation of striosomes and extrastriosomal matrix. Neuroscience, 23, 223–242.

Johnson, S.W., Seutin, V. & North, R.A. (1992) Burst firing in dopamine neurons induced by N-methyl-D-aspartate: role of electrogenic sodium pump. Science, 258, 665–667.

Joshua, M., Adler, A., Prut, Y., Vaadia, E., Wickens, J.R. & Bergman, H. (2009) Synchronization of midbrain dopaminergic neurons is enhanced by rewarding events. Neuron, 62, 695–704.

Kawaguchi, Y., Wilson, C.J. & Emson, P.C. (1990) Projection subtypes of rat neostriatal matrix cells revealed by intracellular injection of biocytin. J. Neurosci., 10, 3421–3438.

Korotkova, T.M., Sergeeva, O.A., Eriksson, K.S., Haas, H.L. & Brown, R.E. (2003) Excitation of ventral tegmental area dopaminergic and nondopaminergic neurons by orexins/hypocretins. J. Neurosci., 23, 7–11.

Korotkova, T.M., Ponomarenko, A.A., Brown, R.E. & Haas, H.L. (2004) Functional diversity of ventral midbrain dopamine and GABAergic neurons. Mol. Neurobiol., 29, 243–259.

Kravitz, A.V., Freeze, B.S., Parker, P.R., Kay, K., Thwin, M.T., Deisseroth, K. & Kreitzer, A.C. (2010) Regulation of parkinsonian motor behaviours by optogenetic control of basal ganglia circuitry. Nature, 466, 622–626.

Lacey, M.G., Mercuri, N.B. & North, R.A. (1989) Two cell types in rat substantia nigra zona compacta distinguished by membrane properties and the actions of dopamine and opioids. J. Neurosci., 9, 1233–1241.

Levesque, M. & Parent, A. (2005) The striatofugal fiber system in primates: a reevaluation of its organization based on single-axon tracing studies. Proc. Natl Acad. Sci. USA, 102, 11888–11893.

Lodge, D.J. & Grace, A.A. (2006) The hippocampus modulates dopamine neuron responsivity by regulating the intensity of phasic neuron activation. Neuropsychopharmacology, 31, 1356–1361.

Ludvig, E.A., Sutton, R.S. & Kehoe, E.J. (2008) Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Comput., 20, 3034–3054.

Luo, A.H., Tahsili-Fahadan, P., Wise, R.A., Lupica, C.R. & Aston-Jones, G. (2011) Linking context with reward: a functional circuit from hippocampal CA3 to ventral tegmental area. Science, 333, 353–357.

Montague, P.R., Dayan, P. & Sejnowski, T.J. (1996) A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci., 16, 1936–1947.

Owesson-White, C.A., Cheer, J.F., Beyene, M., Carelli, R.M. & Wightman, R.M. (2008) Dynamic changes in accumbens dopamine correlate with learning during intracranial self-stimulation. Proc. Natl Acad. Sci. USA, 105, 11957–11962.

Paladini, C.A. & Tepper, J.M. (1999) GABA(A) and GABA(B) antagonists differentially affect the firing pattern of substantia nigra dopaminergic neurons in vivo. Synapse, 32, 165–176.

Pan, W.X., Schmidt, R., Wickens, J.R. & Hyland, B.I. (2005) Dopamine cells respond to predicted events during classical conditioning: evidence for eligibility traces in the reward-learning network. J. Neurosci., 25, 6235–6242.

Pan, W.X., Schmidt, R., Wickens, J.R. & Hyland, B.I. (2008) Tripartite mechanism of extinction suggested by dopamine neuron activity and temporal difference model. J. Neurosci., 28, 9619–9631.

Parent, A., Smith, Y., Filion, M. & Dumas, J. (1989) Distinct afferents to internal and external pallidal segments in the squirrel monkey. Neurosci. Lett., 96, 140–144.

Ponzi, A. & Wickens, J. (2010) Sequentially switching cell assemblies in random inhibitory networks of spiking neurons in the striatum. J. Neurosci., 30, 5894–5911.

Reynolds, J.N. & Wickens, J.R. (2002) Dopamine-dependent plasticity of corticostriatal synapses. Neural Netw., 15, 507–521.

Reynolds, J.N.J., Hyland, B.I. & Wickens, J.R. (2001) A cellular mechanism of reward-related learning. Nature, 413, 67–70.

Schultz, W., Dayan, P. & Montague, P.R. (1997) A neural substrate of prediction and reward. Science, 275, 1593–1599.

Shepard, P.D., Holcomb, H.H. & Gold, J.M. (2006) The presence of absence: habenular regulation of dopamine neurons and the encoding of negative outcomes. Schizophr. Bull., 32, 417–421.

Smith, K.S., Tindell, A.J., Aldridge, J.W. & Berridge, K.C. (2009) Ventral pallidum roles in reward and motivation. Behav. Brain Res., 196, 155–167.

Steffensen, S.C., Svingos, A.L., Pickel, V.M. & Henriksen, S.J. (1998) Electrophysiological characterization of GABAergic neurons in the ventral tegmental area. J. Neurosci., 18, 8003–8015.

Tepper, J.M. & Lee, C.R. (2007) GABAergic control of substantia nigra dopaminergic neurons. Prog. Brain Res., 160, 189–208.

Tsai, H.C., Zhang, F., Adamantidis, A., Stuber, G.D., Bonci, A., de Lecea, L. & Deisseroth, K. (2009) Phasic firing in dopaminergic neurons is sufficient for behavioral conditioning. Science, 324, 1080–1084.

Usuda, I., Tanaka, K. & Chiba, T. (1998) Efferent projections of the nucleus accumbens in the rat with special reference to subdivision of the nucleus: biotinylated dextran amine study. Brain Res., 797, 73–93.

Voorn, P., Jorritsmabyham, B., Vandijk, C. & Buijs, R.M. (1986) The dopaminergic innervation of the ventral striatum in the rat: a light-microscopic and electron-microscopic study with antibodies against dopamine. J. Comp. Neurol., 251, 84–99.

Wickens, J.R. (1993) A Theory of the Striatum. Pergamon Press, Oxford.

Wickens, J.R. (2009) Synaptic plasticity in the basal ganglia. Behav. Brain Res., 199, 119–128.

Wickens, J.R., Begg, A.J. & Arbuthnott, G.W. (1996) Dopamine reverses the depression of rat cortico-striatal synapses which normally follows high-frequency stimulation of cortex in vitro. Neuroscience, 70, 1–5.

Wilson, C.J. & Callaway, J.C. (2000) Coupled oscillator model of the dopaminergic neuron of the substantia nigra. J. Neurophysiol., 83, 3084–3100.

Wu, Y., Richard, S. & Parent, A. (2000) The organization of the striatal output system: a single-cell juxtacellular labeling study in the rat. Neurosci. Res., 38, 49–62.

Yanagihara, S. & Hessler, N.A. (2006) Modulation of singing-related activity in the songbird ventral tegmental area by social context. Eur. J. Neurosci., 24, 3619–3627.

Yin, R. & French, E.D. (2000) A comparison of the effects of nicotine on dopamine and non-dopamine neurons in the rat ventral tegmental area: an in vitro electrophysiological study. Brain Res. Bull., 51, 507–514.

Zhang, L., Doyon, W.M., Clark, J.J., Phillips, P.E. & Dani, J.A. (2009) Controls of tonic and phasic dopamine transmission in the dorsal and ventral striatum. Mol. Pharmacol., 76, 396–404.
