
Deep and Beautiful.

The Reward Prediction Error Hypothesis of Dopamine

Abstract

According to the reward-prediction error hypothesis (RPEH) of dopamine, the phasic

activity of dopaminergic neurons in the midbrain signals a discrepancy between the predicted and

currently experienced reward of a particular event. It can be claimed that this hypothesis is deep,

elegant and beautiful, representing one of the largest successes of computational neuroscience. This

paper examines this claim, making two contributions to existing literature. First, it draws a

comprehensive historical account of the main steps that led to the formulation and subsequent

success of the RPEH. Second, in light of this historical account, it explains in which sense the

RPEH is explanatory and under which conditions it can be justifiably deemed deeper than the

incentive salience hypothesis of dopamine, which is arguably the most prominent contemporary

alternative to the RPEH.

Keywords: Dopamine; Reward-Prediction Error; Explanatory Depth; Incentive Salience;

Reinforcement Learning

1. Introduction

According to the reward-prediction error hypothesis of dopamine (RPEH), the phasic activity of

dopaminergic neurons in specific regions in the midbrain signals a discrepancy between the

predicted and currently experienced reward of a particular event. The RPEH is widely regarded as

one of the largest successes of computational neuroscience. Terrence Sejnowski, a pioneer in

computational neuroscience and a prominent cognitive scientist, pointed to the RPEH when, in 2012,

he was invited by the online magazine Edge.org to answer the question “What is your favorite deep,

elegant, or beautiful explanation?” Several researchers in cognitive and brain sciences would agree

that this hypothesis “has become the standard model [for explaining dopaminergic activity and

reward-based learning] within neuroscience” (Caplin & Dean, 2008, p. 663). Even among critics,


the “stunning elegance” and the “beautiful rigor” of the RPEH are recognized (Berridge, 2007, p.

399 and p. 403).

However, the type of information coded by dopaminergic transmission—along with its

functional role in cognition and behaviour—is very likely to go beyond reward-prediction error.

The RPEH is not the only available hypothesis about what type of information is encoded by

dopaminergic activity in the midbrain (cf., Berridge, 2007; Friston, Shiner, FitzGerald, Galea,

Adams et al., 2012; Graybiel, 2008; Wise, 2004). Current evidence does not speak univocally in

favour of this hypothesis, and disagreement remains about the extent to which the RPEH is supported by

available evidence (Dayan & Niv, 2008; O’Doherty, 2012; Redgrave & Gurney, 2006). On the one

hand, it has been claimed that “to date no alternative has mustered as convincing and

multidirectional experimental support as the prediction-error theory of dopamine” (Niv &

Montague, 2009, p. 342); on the other hand, the counter-claims have been put forward that the

RPEH is an “elegant illusion” and that “[s]o far, incentive salience predictions [that is, predictions

of an alternative hypothesis about dopamine] appear to best fit the data from situations that

explicitly pit the dopamine hypotheses against each other” (Berridge, 2007, p. 424).

How has the RPEH become so successful then? What does it explain exactly? And, granted

that it is at least intuitively uncontroversial that the RPEH is beautiful and elegant, in which sense

can it be justifiably deemed deeper than alternatives? The present paper addresses these questions

by first reconstructing the main historical events that led to the formulation and subsequent

success of the RPEH (Section 2).

With this historical account in the background, the paper elucidates what the RPEH explains and how, contrasting it with the incentive salience hypothesis—arguably its most prominent current

alternative. It is clarified that both hypotheses are concerned only with what type of information is

encoded by dopaminergic activity. Specifically, the RPEH has the dual role of accurately describing

the dynamic profile of phasic dopaminergic activity in the midbrain during reward-based learning


and decision-making, and of explaining this profile by citing the representational role of

dopaminergic phasic activity. If the RPEH is true, then a mechanism composed of midbrain

dopaminergic neurons and their phasic activity carries out the task of learning what to do in the face

of expected rewards, generating decisions accordingly (Section 3).

The paper finally explicates under which conditions some explanation of learning,

motivation or decision-making phenomena based on the RPEH can be justifiably deemed deeper

than some alternative explanation based on the incentive salience hypothesis. Two accounts of

explanatory depth are considered. According to one account, deeper explanatory generalizations

have wider scope (e.g., Hempel, 1959); according to the other, deeper explanatory generalizations

show more degrees of invariance (e.g., Woodward and Hitchcock, 2003). It is argued that, although

it is premature to maintain that explanations based on the RPEH are actually deeper—in either of

these two senses of explanatory depth—than alternative explanations based on the incentive

salience hypothesis, relevant available evidence indicates that they may well be (Section 4). The

contribution of the paper to existing literature is summarised in the conclusion.

2. Reward-Prediction Error Meets Dopamine

Dopamine is a neurotransmitter in the brain.1 It has significant effects on many aspects of cognition

and behaviour, including motor control, learning, attention, motivation, decision-making and mood


regulation. Dopamine is implicated in pathologies such as Parkinson’s disease, schizophrenia,

attention deficit hyperactivity disorder (ADHD) and addiction. These are some of the reasons why so much work has been directed at understanding the type of information carried by neurons that utilize dopamine as a neurotransmitter, as well as their functional roles in cognition and behaviour.

1 Neurotransmitters are chemicals that carry information from one neuron to another across synapses. Synapses are structures connecting neurons that allow one nerve cell to pass an electrical or chemical signal to one or more cells. Synapses consist of a presynaptic nerve ending, which can contain neurotransmitters, a postsynaptic nerve ending, which can contain receptor sites for neurotransmitters, and the synaptic cleft, which is a physical gap between the presynaptic and the postsynaptic ending. After neurotransmitters are released by a presynaptic ending, they diffuse across the synaptic cleft and then bind with receptors on the postsynaptic ending, which alters the state of the postsynaptic neuron.

Neurons that use dopamine as a neurotransmitter to communicate information are called

dopamine or dopaminergic neurons. Such neurons are phylogenetically old, and found in all

mammals, birds, reptiles and insects. Dopaminergic neurons are localized in several brain networks

in the diencephalon (a.k.a. interbrain), mesencephalon (a.k.a. midbrain) and olfactory bulb

(Björklund & Dunnett, 2007). Approximately 90% of all dopaminergic neurons are located

in the ventral part of the midbrain, which comprises different dopaminergic networks with separate

pathways. One of these pathways is the nigrostriatal pathway. It links the substantia nigra, a

structure in the midbrain, with the striatum, which is the largest nucleus of the basal ganglia in the

forebrain and has two components: the putamen and the caudate nucleus. Another pathway is the

mesolimbic, which links the ventral tegmental area in the midbrain to structures in the forebrain,

external to the basal ganglia, such as the amygdala and the medial prefrontal cortex.

Dopamine neurons show two main patterns of firing activity, which modulate the level of extracellular dopamine: tonic and phasic activity (Grace, 1991). Tonic activity consists of regular firing patterns of ~1-6 Hz that maintain a slowly changing base level of extracellular dopamine in target brain structures. Phasic activity consists of a sudden change in the firing rate

of dopamine neurons, which can increase up to ~20 Hz, causing a transient increase of extracellular

dopamine concentrations.


The discovery that neurons can communicate by releasing chemicals was due to the

German-born pharmacologist Otto Loewi—winner of the Nobel Prize in Physiology or Medicine along

with co-recipient Sir Henry Dale—in 1921 (cf., Loewi, 1936). The discovery of dopamine as a

neurotransmitter in the brain dates to 1957, and was due to the Swedish pharmacologist Arvid Carlsson—Nobel Prize in Physiology or Medicine in 2000 along with co-recipients Eric Kandel

and Paul Greengard (cf., Carlsson, 2003). Carlsson’s work in the 1950s and 1960s paved the way to

the findings that the basal ganglia contain the highest dopamine concentrations, that dopamine

depletion is likely to impair motor function and that patients with Parkinson’s disease have

markedly reduced concentrations of dopamine in the caudate and putamen (cf. Carlsson, 1959;

1966).

The search for the mechanisms of reward-based learning and motivation has been under way since at least the 1950s. James Olds and Peter Milner set out to investigate how electrical

stimulation of certain brain areas could reinforce behaviour. They implanted electrodes in different

areas of rats’ brains and allowed them to move about a Skinner box. Rats received stimulation

whenever they pressed a lever in the box. When this stimulation was targeted at the ventral

tegmental area and basal forebrain, the rats showed signs of positive reinforcement, as they would

repeatedly press the lever up to 2,000 times per hour. These results suggested to Olds and Milner

that they had “perhaps located a system within the brain whose peculiar function is to produce a

rewarding effect on behavior” (Olds & Milner, 1954, p. 426).

The notion of “reward” here is to be understood within Thorndike’s (1911) and Skinner’s

(1938) theories of learning. As Olds and Milner put it: “In its reinforcing capacity, a stimulus

increases, decreases, or leaves unchanged the frequency of preceding responses, and accordingly it

is called a reward, a punishment, or a neutral stimulus” (Olds & Milner, 1954, p. 419). So, some

brain stimulation or some environmental stimulus is “rewarding” if animals learn to perform actions

that are reliably followed by that stimulation or stimulus.


Later experiments confirmed that electrical self-stimulation of specific brain regions has the

same impact on motivation as natural rewards, like food or water for hungry or thirsty animals

(Trowill, Panksepp, & Gandelman, 1969; Crow 1972). The idea that some neurotransmitter could

be a relevant causal component of some mechanism of reward-based learning and motivation was

substantiated by pharmacological studies (Stein, 1968; 1969). Based on subsequent

pharmacological (Fibiger, 1978) and anatomical research (Lindvall & Björklund, 1974), hypotheses

about the involvement of dopaminergic neurons in this mechanism began to be formulated. In Roy

Wise’s (1978) words: “[from the available evidence] it can be concluded that dopamine plays a

specialized role in reward processes… It does seem to be the case that a dopaminergic system forms

a critical link in the neural circuitry which confers rewarding qualities on intracranial stimulation…

and on intravenous stimulant injections” (Wise, 1978, pp. 237-238).

Wise (1982) put forward one of the first hypotheses about dopamine function in cognition

and behaviour that aimed to explain a set of relevant data from anatomy, pharmacology, brain self-

stimulation, pathology and lesion studies. It was called the anhedonia hypothesis, and was advanced

based on pharmacological evidence that moderate doses of neuroleptic drugs (i.e. dopamine

antagonists)2 can disrupt behavioural phenomena during reinforcement tasks without severely

compromising motor function (cf., Costall & Naylor, 1978). The anhedonia hypothesis was

proposed as an alternative to simple motor hypotheses that claimed that the dopamine system is a

mechanism of motor control and that dopaminergic impairment causes only motor deficits (see,

e.g., Koob, 1982).

2 Drugs that block the effects of dopamine by binding to and occluding the dopamine receptor are called dopamine antagonists.

The anhedonia hypothesis stated that “the normal functioning of some unidentified

dopaminergic substrate (it could be one or more of several dopaminergic projections in the brain)


and its efferent connections are necessary for the motivational phenomena of reinforcement and

incentive motivation and for the subjective experience of pleasure that usually accompanies these

phenomena” (Wise, 1982, p. 53).3 This hypothesis is committed to the claims that some network of

dopaminergic neurons to be specified is a causally relevant component of the mechanism of

reinforcement, that some network of dopaminergic neurons is necessary to feel pleasure, and that

pleasure is a necessary correlate of reinforcement.

3 'Incentive motivation' is synonymous with 'secondary reinforcement' (or 'conditioned reinforcement'), which refers to a stimulus or situation that has acquired its function as a reinforcer after it has been associated with a primary reinforcer such as water or food. When a stimulus acquires incentive properties, it acquires not only the ability to elicit and maintain instrumental behaviour, but also to attract approach and to elicit consummatory behaviour (cf., Bindra, 1974).

The explanatory link between dopamine and pleasure was superficial, however. Testing the

effects of selective lesion of the mesostriatal dopamine system on rats’ reactions to different tastes,

Berridge, Venier, and Robinson (1989) observed that the predictions of both the anhedonia and the motor hypotheses were not borne out. It was found that the subjective experience of pleasure is not a

necessary correlate of reinforcement and that dopaminergic neurons are not necessary for pleasure

(see Wise, 2004, for a later reassessment of the evidence).

On the basis of taste-reactivity data and of psychopharmacological findings on drug addiction, and drawing upon earlier theories of incentive motivation (e.g., Bindra, 1974; Toates,

1986), Kent Berridge and colleagues put forward the incentive salience hypothesis of dopamine

(ISH). According to this hypothesis, dopamine release by mesencephalic structures such as the

ventral tegmental area assigns “incentive value” to objects or behavioural acts. Incentive salience is

a motivational, “magnet-like” property, which makes external stimuli or internal representations

more salient, and more likely to be wanted, approached or consumed. Attribution of incentive

3 ‘Incentive motivation’ is synonymous with ‘secondary reinforcement’ (or ‘conditioned

reinforcement’), which refers to a stimulus or situation that has acquired its function as a reinforcer

after it has been associated with a primary reinforcer such as water or food. When a stimulus

acquires incentive properties, it acquires not only the ability to elicit and maintain instrumental

behaviour, but also to attract approach and to elicit consummatory behaviour (cf., Bindra, 1974).

8

salience to a stimulus that predicts some reward makes both the stimulus and the reward “wanted”

(Robinson & Berridge, 1993; Berridge & Robinson, 1998). Since the ISH has been claimed to be

the foremost contemporary alternative to the RPEH (e.g., Berridge, 2007), the sections below

consider it more closely, and compare it along two dimensions of explanatory depth with the RPEH.

For now, let us move to the next steps on the road to the RPEH.

In the 1980s, the role of dopamine in motor function remained an active topic of

research (e.g., Beninger, 1983; Stricker & Zigmond, 1986; White, 1986; see Dunnett & Robbins,

1992, for a later review). This interest was justified by earlier findings that Parkinsonian patients

display a drastic reduction of dopamine in the striatum (Ehringer & Hornykiewicz,

1960; Hornykiewicz, 1966), associated with symptoms like tremor, hypokinesia and rigidity.

Wolfram Schultz was among the neuroscientists working on the relationship between dopamine

depletion, motor function and Parkinson’s disease (Schultz, 1983). As a way to assess this

relationship, he used single-cell recordings of dopaminergic neurons in awake monkeys while they

were performing reaching movements for food reward in response to auditory or visual stimuli

(Schultz, Ruffieux, & Aebischer, 1983; Schultz, 1986). Phasic activity of midbrain dopamine

neurons was found to be associated with the presentation of the visual or auditory stimuli that

would be followed by the food reward. Some such neurons showed phasic changes in activity also

at the time the reward was obtained. Execution of reaching movements was less significantly

associated with dopaminergic activity, indicating that activity of midbrain dopaminergic neurons

does not encode specific movement parameters. Schultz and colleagues hypothesised that such

activity carried out some more general function having to do with a change in the level of

behavioural reactivity triggered by stimuli leading to a reward.

In the following ten years, Schultz and colleagues carried out similar single-cell recording

experiments from midbrain dopaminergic neurons in the ventral tegmental area and substantia nigra


of awake monkeys while they repeatedly performed an instrumental or Pavlovian conditioning task4

(Schultz & Romo, 1988; Romo & Schultz, 1990; Ljungberg, Apicella & Schultz, 1992; Schultz,

Apicella & Ljungberg, 1993; Schultz, Mirenowicz & Schultz, 1994). In a typical experiment a

thirsty monkey was seated before two levers. After a visual stimulus was displayed (e.g. a light

flashing), the monkey had to press the left but not the right lever in order to receive the juice

reward. A distinctive pattern of dopaminergic activity was observed during this experiment.

During the early phase of learning—when the monkey was behaving erratically—dopamine

neurons displayed a phasic burst of activity only when the reward was obtained. After a number of

trials, as the monkey had learnt the correct stimulus-action-reward association, the response of the

neurons to the reward disappeared. Now, whenever the visual stimulus was displayed, the monkey

began to show anticipatory licking behaviour, and its dopaminergic neurons showed phasic bursts

of activity associated with the presentation of the visual stimulus. If an expected juice reward was

omitted, the neurons responded with a dip of activity, below basal firing rate, at the time at which

reward would have been delivered, which suggested that dopaminergic activity is sensitive to both

the occurrence and time of the reward.

4 In instrumental (or operant) conditioning, animals learn to respond to specific stimuli in such a way as to obtain rewards and avoid punishments. In Pavlovian (or classical) conditioning, no response is required to get rewards and avoid punishments, since rewards and punishments come after specific stimuli independently of the animal's behaviour.

The pattern of dopaminergic activity observed in these types of tasks was explained in terms

of generic “attentional and motivational processes underlying learning and cognitive behavior”

(Schultz et al., 1993, p. 900). Schultz and colleagues did not refer to previous research by Wise and

others about the involvement of dopamine in the mechanisms of reward, motivation and learning,

nor did they refer to the growing literature on reinforcement learning from psychology and artificial


intelligence. Thus, in the early 1990s, the questions of what type of information dopaminergic activity encodes, and of what its causal role is in the mechanism of reward-based learning and motivation, remained open.

Meanwhile, by the late 1980s, Reinforcement Learning (RL) had been established as one of

the most popular computational frameworks in machine learning and artificial intelligence. RL

offers a collection of algorithms to solve the problem of learning what to do in the face of rewards

and punishments received by taking different actions in an unfamiliar environment (Sutton & Barto,

1998). One widely used RL algorithm is the temporal difference (TD) learning algorithm, whose

development is most closely associated with Rich Sutton (1988). The development of the TD-

algorithm was influenced by earlier theories of animal learning in mathematical psychology,

especially by a seminal paper by Bush and Mosteller (1951), which offered a formal account of how rewards increase the probability of a given behavioural response during instrumental

conditioning tasks. Bush and Mosteller’s account was extended by Rescorla and Wagner (1972),

whose model provided the basis for the TD-learning algorithm.

The Rescorla-Wagner model is a formal model of instrumental and Pavlovian conditioning

that describes the underlying changes in associative strength between a signal (e.g., a conditioned

stimulus) and a subsequent stimulus (e.g., an unconditioned stimulus). The basic insight is similar

to the one informing the Bush-Mosteller model: learning depends on error in prediction. As

Rescorla and Wagner put it: “Organisms only learn when events violate their expectations. Certain

expectations are built up about the events following a stimulus complex; expectations initiated by

the complex and its component stimuli are then only modified when consequent events disagree

with the composite expectation” (Rescorla & Wagner, 1972, p. 75). Accordingly, learning is driven

by prediction errors, and the basic unit of learning is the conditioning trial. Change in associative

strength between a conditioned stimulus and an unconditioned stimulus is a function of differences

between what was predicted (i.e. the animal’s expectation of the unconditioned stimulus, given all

11

the conditioned stimuli present on the trial) and what actually happened (i.e. the unconditioned

stimulus) in a conditioning trial.
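To make the error-driven character of this account concrete, here is a minimal Python sketch of a Rescorla-Wagner-style update (an illustration written for this discussion, not code from any of the cited papers; the parameter names alpha and lambda_reward are my own). Run on a blocking design, it also shows why a stimulus paired with an already-predicted reward acquires little associative strength, a phenomenon that returns in Section 3.

```python
# Minimal sketch of a Rescorla-Wagner-style update (not code from the paper).
# Associative strength V for each conditioned stimulus (CS) changes on every
# trial in proportion to the prediction error: the reward actually delivered
# minus the summed prediction of all CSs present on the trial.

def rescorla_wagner(trials, alpha=0.3, lambda_reward=1.0):
    """trials: list of (set_of_present_CSs, reward_delivered) pairs."""
    V = {}  # associative strength per CS, initially 0
    for present, reward in trials:
        prediction = sum(V.get(cs, 0.0) for cs in present)
        delta = (lambda_reward if reward else 0.0) - prediction  # prediction error
        for cs in present:
            V[cs] = V.get(cs, 0.0) + alpha * delta  # error-driven update
    return V

# Blocking: CS "A" is trained alone first, so when "A" and "B" are later paired
# with the same reward, the compound is already predicted and "B" learns little.
pretraining = [({"A"}, True)] * 40
compound = [({"A", "B"}, True)] * 40
V = rescorla_wagner(pretraining + compound)
print(V)  # V["A"] is close to 1.0, V["B"] stays close to 0.0
```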

TD-learning extends the Rescorla-Wagner model by taking account of the timing of

different stimuli within learning trials, which in fact influences how associative strength changes.

TD-learning is driven by the difference between temporally successive estimates (or predictions) of

a certain quantity—for example, the total amount of reward expected over the future (i.e. value). At

any given time step, the estimate of this quantity is updated to bring it closer to the estimate at the

next time step. The TD-learning algorithm makes predictions about what will happen. Then it

compares these predictions with what actually happens. If the prediction is wrong, then the

difference between what was predicted and what actually happened is used for learning. This core

of TD-learning is captured by two equations. The first is an update rule:

[1] V(S)_new = V(S)_old + η δ(t_outcome),

where V(S) denotes the value of a chosen option S, η is a learning rate parameter, and δ(t_outcome) is the temporal-difference reward-prediction error computed at each of two consecutive time steps (t_stimulus and t_outcome = t_stimulus + 1). The second equation defines the reward-prediction error at time t as:

[2] δ(t) = r(t) + V(t) - V(t - 1),

where V(t) is the predicted value of some option at time t, and r(t) is the reward outcome obtained at time t. The reward-prediction error at t_outcome is used to update V(S), that is, the value of the chosen

option. The potential of TD-learning, and of RL more generally, to build neural network models

and help interpret some results in brain science was clear from the 1980s. As Sutton and Barto

(1998, p. 22) recall, “some neuroscience models developed at that time are well interpreted in terms

of temporal-difference learning (Hawkins and Kandel, 1984; Byrne, Gingrich, and Baxter, 1990;

Gelperin, Hopfield, and Tank, 1985; Tesauro, 1986)." However, the connection between dopamine and TD-learning had yet to be made explicitly.
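A hedged sketch may help to see how equations [1] and [2] work together in the simple two-time-step case just described (illustrative code, not the authors' implementation; the learning rate and reward magnitude are arbitrary). Over repeated cue-reward pairings the prediction error δ migrates from the time of the reward to the time of the cue, and omitting an expected reward yields a negative error, the pattern that, as discussed above, Schultz and colleagues observed in dopaminergic firing.

```python
# Minimal two-time-step TD sketch following equations [1] and [2] in the text
# (illustrative code, not the authors' implementation). A cue arrives at
# t_stimulus and a juice reward r at t_outcome; V_S is the value predicted when
# the cue appears, and there is no prediction before the cue or after the reward.

eta = 0.2      # learning rate
r = 1.0        # magnitude of the juice reward
V_S = 0.0      # learned value of the cue, initially zero

def trial(V_S, reward_delivered):
    """Return (delta at cue, delta at outcome, updated V_S) for one trial."""
    # Equation [2] at t_stimulus: no reward yet, and the prediction one step earlier is zero.
    delta_cue = 0.0 + V_S - 0.0
    # Equation [2] at t_outcome: reward (or its omission) against the cue's prediction.
    r_t = r if reward_delivered else 0.0
    delta_outcome = r_t + 0.0 - V_S
    # Equation [1]: the outcome-time error updates the value of the chosen option.
    V_S = V_S + eta * delta_outcome
    return delta_cue, delta_outcome, V_S

for t in range(1, 31):
    delta_cue, delta_outcome, V_S = trial(V_S, reward_delivered=True)
    if t in (1, 30):
        print(f"trial {t}: delta at cue = {delta_cue:.2f}, at reward = {delta_outcome:.2f}")

# After learning, omitting the reward yields a negative error at the expected
# time of reward, the analogue of the dip in dopaminergic firing.
delta_cue, delta_outcome, V_S = trial(V_S, reward_delivered=False)
print(f"omission: delta at cue = {delta_cue:.2f}, at reward time = {delta_outcome:.2f}")
```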


In the early 1990s, Read Montague and Peter Dayan were working in Terry Sejnowski’s

Computational Neurobiology lab at the Salk Institute in San Diego. Dayan’s PhD at the University

of Edinburgh, in artificial intelligence and computer science, focused on RL, while some of

Montague’s work, both as a graduate student and postdoc in biophysics and neuroscience, focused

on models of self-organization and learning in neural networks. They both approached questions

about brains and neural circuits by asking what computational functions they carry out (cf., Dayan,

1994). One spring day in 1991 TD-learning and dopamine got connected (Montague, 2007, pp. 108-

109). Dayan came across one of Schultz and colleagues’ articles, which presented data from

recordings of dopamine neurons’ activity during an instrumental learning task. By examining the

plots in the article showing the firing patterns of the monkeys’ dopamine neurons, Dayan and

Montague recognized the signature of TD-learning. The similarity between the reward-prediction

error signal used in TD-learning algorithms and Schultz and colleagues’ recordings was striking.

Activity of dopamine neurons appeared to encode a reward-prediction error signal.

Montague, Dayan and Sejnowski began writing a paper that interpreted Schultz and

colleagues’ results within the computational framework of RL. Their project was to provide a

unifying TD-learning model that could explain the neurophysiological and behavioural regularities

observed in Schultz and colleagues’ experiments. In a short abstract, Quartz, Dayan, Montague and

Sejnowski (1992) laid out the insight that on-going activity in dopaminergic neurons in the

midbrain can encode comparisons of expected and actual reward outcomes, which drive learning

and decision-making. The insight was articulated by Montague, Dayan, Nowlan, Pouget, and

Sejnowski (1993), and presented at the Conference on Neural Information Processing Systems

(NIPS)—an annual event bringing together researchers interested in biological and artificial

learning systems. In that paper a statement of the connection between TD-learning and Schultz and

colleagues’ pattern of results was made: Some “diffuse modulatory systems… appear to deliver

reward and/or salience signals to the cortex and other structures to influence learning in the adult.


Recent data (Ljunberg et al. 1992) suggest that this latter influence is qualitatively similar to that

predicted by Sutton and Barto’s (1981, 1987) classical conditioning theory” (Montague et al., 1993,

p. 970). However, theoretical neuroscience—computational neuroscience, particularly—was in its

infancy at the time, and had not yet been recognized as an integral part of neuroscience (cf.,

Abbott, 2008). The paper that Montague, Dayan and Sejnowski had originally set out to publish in

1991 was getting rejected by every major journal in neuroscience, partly because the field was

dominated by experimentalists (Montague, 2007, p. 285).

Dayan and Montague approached the issue from a different angle. The foraging behaviour of

honeybees was known to follow the pattern of TD-learning; and evidence from single-cell

recordings and from intracellular current injections indicated that bees use a neurotransmitter,

octopamine, to achieve TD-learning (Hammer, 1993). Motivated by these findings, Dayan and

Montague developed a model of the foraging behaviour of honeybees (Montague, Dayan, Person, &

Sejnowski, 1995). First, they identified a type of neural architecture, which might be common to

both vertebrates and invertebrates, and could implement TD-learning. Second, they argued for the

biological feasibility of this neurocomputational architecture, noting that bees’ diffuse

octopaminergic system is suited to carry out TD-learning. Finally, they showed that a version of the

TD-learning algorithm running on the neurocomputational architecture they specified could produce

some learning and decision-making phenomena displayed by bees during foraging. Montague and

colleagues highlighted that: “There is good evidence for similar predictive responses in primate

mesencephalic dopaminergic systems. Hence, dopamine delivery in primates may be used by target

neurons to guide action selection and learning, suggesting the conservation of an important


functional principle, albeit differing in its detailed implementation” (Montague et al., 1995, p.

728).5

5 This conclusion might engender confusion, as it is not obvious how the suggestion that a 'functional principle' is 'conserved' across evolution should be understood. When two unrelated organisms share some trait, this is often taken as evidence that the trait is homologous (i.e., derived from a common ancestor). But this is never sufficient evidence of homology. A proper phylogenetic reconstruction of the trait (involving more than two species at the very least) is necessary for establishing homology. Given the available evidence, Montague and colleagues' suggestion is better understood in terms of analogy rather than homology. Two (probably more) species—including honeybees, macaque monkeys and other primates—might have independently evolved distinct diffuse neurotransmitter systems that have analogous (similar) functional properties. The similarity is not due to a common ancestor. Rather, the similarity is due to convergent evolution: these species faced similar environmental challenges and selective pressures, which suggests that TD-learning is an adaptive strategy to solve a certain class of learning and decision-making problems that recur across species. I am grateful to an anonymous referee for drawing my attention to this point.

By the mid-1990s, other research groups recognized that the activity of dopaminergic

neurons in tasks of the sort used by Schultz and colleagues could be accurately described as

implementing some reward-prediction error algorithm. Friston, Tononi, Reeke, Sporns, and

Edelman (1994) considered value-dependent plasticity in the brain in the context of evolution and

adaptive behaviour. They hypothesised that the ascending neuromodulatory systems, and—in light

of Ljungberg et al.’s (1992) findings—the dopaminergic system in particular, are core components of

some value-based mechanism, whose processes are selective for rewarding stimuli. Houk, Adams

and Barto (1995) put forward a hypothesis about the computational architecture of the basal


ganglia, where dopaminergic neurons would control learning and bias the selection of actions by

computing reward-prediction errors.

The neuroscience community started to pay closer attention to the relationship between TD-

learning and dopamine. After five years, Montague, Dayan and Sejnowski had their original paper

published in The Journal of Neuroscience (Montague, Dayan, & Sejnowski, 1996). In this paper,

after having noted, with Wise (1982), that dopamine neurons are involved in a number of cognitive

and behavioural functions, they examined Schultz and colleagues’ results. These results indicated

that whatever is encoded in dopaminergic signals should be capable of explaining four sets of data.

“(1) The activities of these neurons do not code simply for the time and magnitude of reward

delivery. (2) Representations of both sensory stimuli (lights, tones) and rewarding stimuli (juice)

have access to driving the output of dopamine neurons. (3) The drive from both sensory and reward

representations to dopamine neurons is modifiable. (4) Some of these neurons have access to a

representation of the expected time of reward delivery” (Montague et al., 1996, p. 1938).

Montague, Dayan and Sejnowski (1996) emphasised an underappreciated aspect of Schultz

and colleagues’ results: dopaminergic neurons are sensitive not only to the expected and actual

experienced magnitude of reward, but also to the precise temporal relationships between the

occurrence of a reward-predictor and the occurrence of the actual reward. This aspect was crucial to

draw the connection between TD-computation and dopaminergic activity. For it suggested that

dopamine neurons should be able to represent relationships between reward-predictors, predictions

of both the likely time and magnitude of a future reward, and the actual experienced time and

magnitude of the reward. The core of Montague and colleagues’ (1996) paper consists in laying out the

computational framework of reinforcement learning and bringing it to bear on neurophysiological

and behavioural evidence related to dopamine so as to connect neural function to cognitive

function. By means of modelling and computer simulations, they showed that the type of algorithm

that can solve learning tasks of a certain kind could accurately and compactly describe the


behaviour of many dopaminergic neurons in the midbrain: “the fluctuating delivery of dopamine

from the VTA [i.e., ventral tegmental area] to cortical and subcortical target structures in part

delivers information about prediction errors between the expected amount of reward and the actual

reward” (Ibid., p. 1944, emphasis in original). One year later, in 1997, Montague and Dayan

published a similar paper in Science, co-authored with Schultz (Schultz, Dayan, & Montague, 1997), which has remained the standard reference for the RPEH.
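To convey what "accurately and compactly describe" amounts to in practice, the following is a hedged sketch of the kind of simulation at issue (the tapped-delay-line stimulus representation, the trial timing and the parameter values are assumptions made for illustration, not the settings reported by Montague et al., 1996). Trained with equations [1] and [2] on repeated cue-reward trials, the simulated error signal reproduces the three signatures of the recordings discussed above: a response to an unpredicted reward early in learning, a transfer of that response to the reward-predicting cue after learning, and a dip below baseline at the expected time of an omitted reward.

```python
import numpy as np

# Sketch of a within-trial TD simulation (illustrative assumptions: a "tapped
# delay line" representation of the cue, 25 time steps per trial, cue at step 5,
# reward at step 15; these are not the settings of the published model).
T = 25
t_cue, t_reward = 5, 15
eta = 0.1
n_trials = 200

w = np.zeros(T)  # one weight per delay since cue onset; V(t) = w[t - t_cue]

def value(w, t):
    return w[t - t_cue] if t >= t_cue else 0.0

def run_trial(w, reward_delivered=True):
    """Run one trial and return the within-trial prediction-error trace delta(t)."""
    delta = np.zeros(T)
    for t in range(1, T):
        r_t = 1.0 if (reward_delivered and t == t_reward) else 0.0
        # Equation [2]: error = reward + current prediction - previous prediction
        delta[t] = r_t + value(w, t) - value(w, t - 1)
        # Equation [1]: the error adjusts the prediction made at the previous step
        if t - 1 >= t_cue:
            w[t - 1 - t_cue] += eta * delta[t]
    return delta

for _ in range(n_trials):
    trace = run_trial(w)

print("after learning, the error peaks at the cue:", int(np.argmax(trace)) == t_cue)
omission = run_trial(w.copy(), reward_delivered=False)  # copy: leave training weights intact
print("an omitted reward produces a dip at its expected time:", omission[t_reward] < -0.5)
```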

3. Reward-Prediction Error and Incentive Salience: What Do They Explain?

In light of Montague et al. (1996) and Schultz et al. (1997), the RPEH can now be more precisely

characterised. The hypothesis states that the phasic firing of dopaminergic neurons in the ventral

tegmental area and substantia nigra “in part” encodes reward-prediction errors. Montague and

colleagues did not claim that all types of activity in all dopaminergic neurons encode only (or in all

circumstances) reward-prediction errors. Their hypothesis is about “a particular relationship

between the causes and effects of mesencephalic dopaminergic output on learning and behavioural

control” (Montague, Dayan, Sejnowski, 1996, p. 1944). This relationship may hold for a certain

type of activity of some dopaminergic neurons during certain kinds of learning and decision-making

tasks. The claim is not that dopaminergic neurons encode only reward-prediction errors. The claim

is neither that prediction errors can only be computed by dopaminergic activity, nor that all learning

and action selection is carried out through reward-prediction errors or is dependent on dopaminergic

activity.

The RPEH relates dynamic patterns of activity in specific structures of the brain to a precise

computational function. As reward-prediction errors are differences between experienced and

expected rewards, whether or not dopamine neurons respond to a particular reward depends on

whether or not this reward was expected at all, on its expected magnitude, and on the expected time

of its delivery. Thus, the hypothesis relates dopaminergic responses to two types of variables:


reward and belief (or expectation) about the magnitude and time of delivery of a reward that may be

obtained in a situation. Accordingly, the RPEH can be understood as relating dopaminergic

responses to probability distributions over prizes (or lotteries), from which a prize with a certain

magnitude is obtained at a given time (Caplin & Dean, 2008; Caplin, Dean, Glimcher, & Rutledge,

2010).

The hypothesis has the dual role of accurately describing the dynamic profile of phasic

dopaminergic activity in the midbrain during reward-based learning and decision-making, and of

explaining this profile by citing the representational role of dopaminergic phasic activity. Thus, the

RPEH addresses two distinct questions. First, how are some of the regularities in dopaminergic

neurons’ firing patterns accurately and compactly described? Second, what is the computational

function carried out by those firing patterns? By answering the second question, the RPEH

furnishes the basis for a neurocomputational explanation of reward-based learning and decision-

making.

Neurocomputational explanations explain cognitive phenomena and behaviour (e.g.,

blocking and second-order conditioning6) by identifying and describing relevant mechanistic

components (e.g., dopaminergic neurons), their organized activities (e.g., dopaminergic neurons’

phasic firings), the computational routines they perform (e.g., computations of reward-prediction

errors) and the informational architecture of the system that carries out those computations (e.g., the

actor-critic architecture, which implements TD-learning and maps onto separable neural

components, see, e.g., Joel, Niv, & Ruppin, 2002; Balleine, Daw, & O’Doherty, 2009).

6 In classical conditioning, blocking is the phenomenon that little or no conditioning occurs to a new stimulus if it is combined with a previously conditioned stimulus during the conditioning process. Second-order conditioning is a phenomenon where a conditional response is acquired by a neutral stimulus, when the latter is paired with a stimulus that has previously been conditioned.

Neural


computation can be understood as the transformation—via sensory input and patterns of activity of

other neural populations—of neural firings according to algorithmic rules that are sensitive only to

certain properties of neural firings (Churchland & Sejnowski, 1992; Colombo, 2013; Piccinini &

Bahar, 2012).
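Since the actor-critic architecture is doing explanatory work here, a small sketch may help fix ideas (a generic, textbook-style actor-critic written for illustration; the toy task, parameters and update rules are my assumptions, not a model taken from Joel et al., 2002, or Balleine et al., 2009). A single TD error trains the critic's value estimates and, at the same time, teaches the actor which actions to prefer, mirroring the dual learning and action-selection role that the RPEH assigns to phasic dopaminergic activity.

```python
import numpy as np

# Minimal actor-critic sketch (illustrative; task and parameters are assumed).
# The critic learns state values from the TD error; the actor uses the same
# error to adjust action preferences.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
V = np.zeros(n_states)                 # critic: state values
H = np.zeros((n_states, n_actions))    # actor: action preferences
alpha_v, alpha_h, gamma = 0.1, 0.1, 0.9

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(state, action):
    """Toy environment: action 0 in state 0 pays off; everything else does not."""
    reward = 1.0 if (state == 0 and action == 0) else 0.0
    next_state = rng.integers(n_states)
    return reward, next_state

state = 0
for _ in range(5000):
    action = rng.choice(n_actions, p=softmax(H[state]))
    reward, next_state = step(state, action)
    delta = reward + gamma * V[next_state] - V[state]   # TD error (critic)
    V[state] += alpha_v * delta                          # critic update
    H[state, action] += alpha_h * delta                  # actor update
    state = next_state

print(softmax(H[0]))  # action 0 should now be strongly preferred in state 0
```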

If the RPEH is true, then a neurocomputational mechanism composed of midbrain dopaminergic

neurons and their phasic activity carries out the task of learning what to do in the face of expected

rewards and punishments, generating decisions accordingly. Currently, several features of this

mechanism remain to be identified. So, RPEH-based explanations of reward-learning and decision-

making are currently gappy (Dayan & Niv, 2008; O’Doherty, 2012). Nonetheless, several cognitive

and brain scientists would agree that some RPEH-based neurocomputational mechanism to be

specified can adequately explain many learning and decision-making phenomena in a manner not

only more beautiful but also deeper than available alternatives.

The most prominent available alternative to the RPEH is probably the incentive

salience hypothesis (ISH). This hypothesis states that firing of dopaminergic neurons in a larger

mesocorticolimbic system mediates only incentive salience attribution. In Berridge’s words:

“Dopamine mediates only a ‘wanting’ component, by mediating the dynamic attribution of

incentive salience to reward-related stimuli, causing them and their associated reward to become

motivationally ‘wanted’” (Berridge, 2007, p. 408).

The ISH relates dopaminergic activations to a psychological construct: incentive salience

(a.k.a. “wanting”). Thereby, it answers the question what the causal role of dopamine is in reward-

related behaviour. By answering this question, the ISH furnishes the basis for a neuropsychological

explanation of reward-based motivation and decision-making. The ISH is committed to the claim

that dopaminergic firing codes incentive salience that bestows stimuli or internal representations

with the properties of being appetitive and attention-grabbing. Incentive salience attributions need

not be conscious and need not involve feelings of pleasure (a.k.a. “liking”). Dopaminergic activity


is necessary to motivate actions aimed at some goal, as it would be a core component of a

mechanism of motivation (or mechanism of “wanting”). Dopaminergic neurons are not relevant

components of the mechanism of reward-based learning: “to say dopamine acts as a prediction error

to cause new learning may be to make a causal mistake about dopamine’s role in learning: it

might… be called a ‘dopamine prediction error’” (Berridge, 2007, p. 399). So, the ISH can be

considered an alternative to the RPEH because it denies two central claims of the RPEH: first, it

denies that dopamine encodes reward-prediction errors; second, it denies that dopamine is a core

component of a mechanism of reward-based learning.

Similarly to the RPEH, the explanation grounded in the ISH is gappy: it includes dopamine

as a core component of the mechanism of incentive salience attribution and motivation, but it leaves

several explanatory gaps (see, e.g., Berridge, 2007, note 8). Particularly, the hypothesis is under-

constrained in at least three ways that make it less precise than the RPEH. First, it does not precisely

identify the relevant anatomical location of the dopaminergic components; second, it is

uncommitted as to possible different roles of phasic and tonic dopaminergic signals; finally, it is not

formalised by a single computational model that could yield quantitative predictions.

As they are formulated, both the RPEH and ISH are only concerned with what type of

information is encoded by dopaminergic activity. Nonetheless, putting forward different claims

about what dopaminergic neurons do, these hypotheses motivate different dopamine-centred

explanations of phenomena related to learning, motivation and decision-making. Although these

dopamine-centred explanations are currently tentative and incomplete, which makes it premature

to argue that one explanation is actually deeper than the other, it is worthwhile explicating under

which conditions a RPEH-based explanation can be justifiably deemed deeper than an alternative

ISH-based explanation, while pointing at relevant available evidence.

4. Explanatory Depth, Reward-Prediction Error and Incentive Salience


A number of accounts of explanatory depth have recently been proposed in philosophy of science

(e.g., Woodward & Hitchcock, 2003; Strevens, 2009; Weslake, 2010). While significantly different,

these accounts agree that explanatory depth is a feature of generalizations that express the

relationship between an explanans and an explanandum.

According to Woodward and Hitchcock (2003), in order to be genuinely explanatory, a

generalization should exhibit patterns of counterfactual dependence relating the explanans to the

explanandum. Explanatory generalizations need not be laws or exceptionless regularities. They

should enable us to answer what-if-things-had-been-different questions that show what the

explanandum phenomenon depends upon. These questions concern the ways in which the

explanandum would change under changes or interventions on its explanans, where the key feature

of such interventions is that they do not causally affect the explanandum except through their effect

on the explanans (Woodward, 2003). The degree of depth of an explanatory generalization is a

function of the range of the counterfactual questions concerning possible changes in the target

system that it can answer. Given two competing explanatory generalizations G1 and G2, if G1 is invariant (or continues to hold) under a wider range of possible interventions or changes than G2, then G1 is deeper than G2.7 Call this the “invariance account of explanatory depth.”

7 Woodward and Hitchcock (2003, sec. 3) distinguish a number of ways in which a generalization may be more invariant than another. For the purposes of this paper, it suffices to point out that what they share is that they spell out different ways in which an explanatory generalization enables us to answer what-if-things-had-been-different questions.

According to a different view (cf., Hempel, 1959), explanatory generalizations should

enable us to track pervasive uniformities in nature by being employable to account for a wide range

of phenomena displayed by several types of possible systems. The depth of an explanatory


generalization is a function of the range of possible systems to which it can apply.8 For a hypothesis

to apply to a target system is for the hypothesis to accurately describe the relevant structures and

dynamics of the system—where what is relevant and what is not is jointly determined by the causal

structure of the real-world system under investigation, the scientists’ varying epistemic interests and

purposes in relation to that system, and the scientists’ audience. Given two competing explanatory generalizations G1 and G2, if G1 can be applied to a wider range of possible systems or phenomena than G2, then G1 is deeper than G2. So, deeper explanatory generalizations have wider scope. Call this the “scope account of explanatory depth.”9

8 Kitcher (1989) put forward a similar idea. His view, however, is that depth is a function of the range of actual situations to which a generalization can apply. For a discussion of some of the problems raised by this view, see Woodward and Hitchcock (2003, sec. 4).

9 It bears mention that the jury is still out on how these two views of depth relate to one another (see, e.g., Strevens, 2004, for a discussion of this issue).

4.1. Depth as scope, reward-prediction error and incentive salience

If some RPEH-based explanatory generalization can be applied to a wider range of possible

phenomena or systems than some alternative ISH-based explanatory generalization, then the RPEH-

based generalization is deeper according to the scope account of explanatory depth. What available evidence is relevant to assessing this claim?

ISH-based explanations have been most directly applied to rats’ behaviour and to the

phenomenon of addiction in rodents and humans. In the late 1980s and early 1990s, incentive salience was invoked to explain the differential effects on “liking” (i.e. the experience of pleasure) and

“wanting” (i.e. incentive salience) of pharmacological manipulations of dopamine in rats during

taste-reactivity tasks (Berridge et al., 1989). Since then incentive salience has been used to explain


results from electrophysiological and pharmacological experiments that manipulated dopaminergic

activity in mesocorticolimbic areas of rats performing Pavlovian or instrumental conditioning tasks

(cf. Berridge & Robinson, 1998; Peciña, Cagniard, Berridge, Aldridge, & Zhuang, 2003; Tindell,

Berridge, Zhang, Peciña, & Aldridge, 2005; Wyvell, & Berridge, 2000).

Most ISH-based explanations applied to humans concern a relatively small set of phenomena

observed in addiction and Parkinson’s disease (Robinson & Berridge, 2008; O’Sullivan et al.,

2011). From the viewpoint of incentive salience, addiction to certain substances or behaviours is

caused by over-attribution of incentive salience. Compulsive behaviour would depend on an

excessive attribution of incentive salience to drug-rewards and their cues, due to hypersensitivity or

“sensitization” (i.e. an increase in a drug effect caused by repeated drug administration) in

mesocortical dopaminergic projections. Sensitized dopaminergic systems would then cause

pathological incentive motivation for drugs or other stimuli.

It may appear that a RPEH-based explanation has obviously wider scope than an ISH-based

explanation. For TD-learning has been applied to many biological and artificial systems (see e.g.

Sutton & Barto, 1998, ch. 11). TD-learning seems to be widespread in nature. For instance, recall

that while Montague et al. (1995) argued that release of octopamine by a specific neuron in the

honeybee brain may signal a reward-prediction error, they also suggested that the same type of

“functional principle” guiding learning and action selection may well be conserved across species.

However, if honeybees, primates and other species share an analogous TD-learning

mechanism, or if many artificial systems implement TD-learning, this is not evidence for the wider

explanatory scope of a RPEH-based explanation. Rather, it is evidence for the wider explanatory

scope of RL, and particularly of TD-learning. The RPEH and the ISH are about dopamine. So,

relevant evidence for wider scope should involve dopaminergic neurons and their activity.

RPEH-based explanations of learning and decision-making apply at least to rats, monkeys,

and humans. The RPEH was formulated by comparing monkey electrophysiological data during


instrumental and Pavlovian conditioning tasks to the dynamics of a TD reward-prediction error

signal (Montague et al., 1996; Schultz et al., 1997). Since then, single-cell experiments with

monkeys have strengthened the case for a quantitatively accurate correspondence between phasic

dopaminergic firings in the midbrain and TD reward-prediction errors (Bayer & Glimcher, 2005;

Bayer, Lau & Glimcher, 2007). Recordings from the ventral tegmental area of rats that performed a

dynamic odour-discrimination task indicate that the RPEH generalizes to that species as well

(Roesch, Calu, & Schoenbaum, 2007). Finally, a growing number of studies using functional

magnetic resonance imaging (fMRI) in humans engaged in decision-making and learning tasks have shown

that activity in dopaminergic target areas such as the striatum and the orbitofrontal cortex correlates

with reward-prediction errors of TD-models (Berns, McClure, Pagnoni, & Montague, 2001;

Knutson, Adams, Fong, & Hommer, 2001; McClure, Berns, & Montague, 2003a; O’Doherty,

Dayan, Friston, Critchley, & Dolan, 2003). These findings are consistent with the RPEH,

since fMRI measurements seem to reflect the incoming information that an area is processing, and

striatal and cortical areas such as the orbitofrontal cortex are primary recipients of dopaminergic

input from the ventral tegmental area (cf., McClure & D’Ardenne, 2009; Niv & Schoenbaum,

2008).10

10 It is currently not possible to use fMRI data to assess the dopaminergic status of the MRI signal. Reliable information about the precise source of the fMRI signal is difficult to gather for the cortex, let alone the basal ganglia, particularly because neuromodulators can be vasoactive themselves (see Kishida et al., 2011, and Zaghloul et al., 2009, on methodologies for the measurement of sub-second dopamine release in humans).

Some RPEH-based explanation is employable to account for many phenomena related to

learning and decision-making. Among the cognitive phenomena and behaviour for which some

RPEH-based explanation has been put forward are: habitual vs. goal-directed behaviour (Daw, Niv,


& Dayan, 2005; Tricomi, Balleine, & O’Doherty, 2009), working memory (O’Reilly & Frank,

2006), performance monitoring (Holroyd & Coles, 2002), pathological gambling (Ross, 2010), and

a variety of psychiatric conditions including depression (e.g., Huys, Vogelstein, & Dayan, 2008; for

a review of computational psychiatry see Montague, Dolan, Friston, & Dayan, 2012).

Most relevant here, a RPEH-based explanation of incentive salience itself has also been

proposed, which indicates that different hypotheses about dopamine may well cohere to a much

greater extent than might be supposed once they are properly formalized (McClure, Daw, &

Montague, 2003b). According to this proposal, incentive salience corresponds to expected future

reward, and dopamine—as suggested by the RPEH—serves the dual role of learning to predict

future reward and of biasing action selection towards stimuli predictive of reward. McClure and

colleagues demonstrated that some of the phenomena explained by the ISH, such as the dissociation between wanted and liked objects, directly follow from the role in biasing action selection that

dopamine possesses according to the RPEH. Dopamine release would assign incentive salience to

stimuli or actions by increasing the likelihood of choosing some action that leads to reward. So,

dopamine receptor antagonism would reduce the probability of selecting any action, because

estimated values for each available option would also decrease.
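A toy calculation can make this last claim explicit (a hedged illustration, not McClure and colleagues' published model: the softmax choice rule, the zero-valued "do nothing" option and the multiplicative antagonism factor are assumptions introduced here). If incentive salience is identified with the learned value of each option, then uniformly blunting those values, as a dopamine antagonist is taken to do, lowers the probability of selecting any reward-directed action at all.

```python
import numpy as np

# Illustrative sketch of the idea that incentive salience corresponds to expected
# future reward (values V), so that choice probabilities track V. The softmax rule,
# the zero-valued "do nothing" option and the multiplicative "antagonism" factor
# are assumptions made for illustration, not McClure and colleagues' actual model.

def choice_probabilities(values, antagonism=1.0, temperature=1.0):
    """Probability of choosing each reward-predicting option vs. doing nothing."""
    scaled = np.append(antagonism * np.asarray(values), 0.0)  # last entry: no action
    e = np.exp(scaled / temperature)
    return e / e.sum()

values = [2.0, 1.0]           # learned values ("wanting") of two cues
print(choice_probabilities(values))                  # both cues are chosen often
print(choice_probabilities(values, antagonism=0.3))  # blunted values: "do nothing"
                                                     # captures a larger share
```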

If this proposal correctly captures the concept of incentive salience—which has been

debated (Zhang, Berridge, Tindell, Smith, & Aldridge, 2009)—then there would be compelling reason to

believe that some RPEH-based explanation has indeed wider scope than some ISH-based

explanation. We would be in a position to make a direct comparison between them, and the ISH-based explanation would be entailed by a more general RPEH-based explanation. So, for any possible target system to which the ISH-based explanation applies, there will be a RPEH-based explanation that applies to the same system, but not vice versa.

4.2. Depth as invariance, reward-prediction error and incentive salience


If some RPEH-based generalization is invariant (or continues to hold) under a wider range of

possible interventions or changes than an alternative ISH-based generalization, then the RPEH-

based generalization is deeper according to the invariance account of explanatory depth. These

interventions—recall—should not causally affect the explanandum phenomena except through their

effect on the dopamine-centred mechanism whose behaviour is described by the explanatory

generalization.

In order to assess the relative degree of depth of alternative RPEH-based and ISH-based

explanations, relevant interventions should be on particular mechanisms found in a particular

biological lineage, such as some dopamine-centred mechanism found in primates. Interventions on

merely analogous mechanisms found across biological lineages will not provide evidence relevant

for depth-as-invariance.

It should also be noted that the degree of precision of the RPEH is higher than that of the

ISH. Unlike the ISH, the RPEH makes claims specific to dopaminergic phasic activity, and to

dopaminergic neurons in the ventral tegmental area and in the substantia nigra. It may be thought

that this means that the range of interventions on dopaminergic activity relevant to assess depth is

narrower for a RPEH-based explanation than for an ISH-based explanation: While for the ISH-

based explanation relevant interventions may be on both phasic and tonic activity, or on

dopaminergic neurons in mesocorticolimbic areas besides the ventral tegmental area and the

substantia nigra, for the RPEH-based explanation relevant interventions will not concern tonic

activity or dopaminergic neurons in any mesocorticolimbic circuit. This thought is mistaken,

however. For the ISH remains silent about the specific range of interventions relevant to assess the

invariance of an ISH-based explanation. So, the range of interventions relevant to assess the depth

of an ISH-based explanation does not strictly contain the range of interventions relevant to assess

the depth of a RPEH-based explanation.


Finally, unlike the RPEH, the ISH lacks an uncontroversial formalization that could be

implemented in the design of an experiment and yield precise, quantitative predictions. So, a

RPEH-based explanation may be deeper than an alternative ISH-based explanation, even if they are

both invariant under interventions on e.g. phasic dopaminergic activity in the ventral tegmental

area. For the RPEH-based explanation will yield more accurate answers about how the

explanandum phenomenon will change.
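For reference, the formal core standardly associated with the RPEH is the temporal-difference error of TD learning, which in its simplest form (following Sutton & Barto, 1998) reads

    \[
    \delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t), \qquad
    \hat{V}(s_t) \leftarrow \hat{V}(s_t) + \alpha\, \delta_t,
    \]

where $r_t$ is the reward received at time $t$, $\hat{V}$ is the learned value estimate, $\gamma$ is a temporal discount factor and $\alpha$ is a learning rate; it is the signal $\delta_t$ that, on the RPEH, phasic dopaminergic activity is hypothesized to report.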

One set of available relevant evidence concerns dopamine manipulations by drugs. If some

dopamine-centred explanatory generalization, based on either the RPEH or ISH, correctly gives

information about how some target behaviour would change, had dopaminergic signalling been

enhanced or reduced, then the generalization will show some degree of depth-as-invariance. Before

illustrating this idea with two examples, some caveats are in order. One caveat is that

neuromodulators like dopamine have multiple, complex, and poorly understood effects on target

neural circuits and on cognition at different spatial and temporal scales (Dayan, 2012). Knowledge

is lacking about the precise effects on neurophysiology and behaviour of drugs that intervene on

dopaminergic signals. Moreover—as mentioned above—RPEH-based and ISH-based explanations

are currently tentative and gappy, partly because the effects of interventions on dopaminergic

systems are not yet well understood. Because such explanations are tentative and gappy, it remains

controversial to what extent they are indeed always fundamentally different (cf. e.g. McClure et al.,

2003b; Niv & Montague, 2009, pp. 341-342).

To probe the explanatory link between dopaminergic signalling and reward-based learning

and decision-making, Pessiglione, Seymour, Flandin, Dolan, and Frith (2006) used an instrumental

learning task that involved monetary gains and losses, in combination with a pharmacological

manipulation of dopaminergic signalling, as well as computational and functional imaging

techniques, in healthy humans. Either haloperidol (an antagonist of dopamine receptors) or L-

DOPA (a metabolic precursor of dopamine) was administered to different groups of participants.


The effects of these manipulations were examined on both brain activity and choice behaviour. It

was found that L-DOPA enhanced activity in the striatum—a main target of dopaminergic

signalling—while haloperidol diminished it, which suggested that the magnitude of reward-

prediction error signals targeting the striatum was enhanced (or diminished) by treatment with L-

DOPA (or haloperidol). Choice behaviour was found to be systematically modulated by these

manipulations. L-DOPA improved learning performance towards monetary gains, while haloperidol

decreased it; that is: participants treated with L-DOPA were more likely than participants treated

with haloperidol to choose stimuli associated with greater reward. Computational modelling results

demonstrated that differences in reward-prediction error magnitude were sufficient for a TD-

learning model to predict the effects of the manipulations on choice behaviour.
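To indicate the style of modelling at issue, here is a minimal sketch of an instrumental-learning simulation of this kind. It is not Pessiglione et al.'s actual model; the reward probabilities, learning rate, softmax temperature and the rpe_scale parameter (standing in for an enhanced or blunted prediction-error signal) are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_gain_condition(rpe_scale, alpha=0.3, beta=3.0, n_trials=200):
        # Two-option instrumental task with monetary gains: option 0 pays off with
        # probability 0.8, option 1 with probability 0.2. rpe_scale stands in for an
        # enhanced (L-DOPA-like) or blunted (haloperidol-like) prediction-error signal.
        q = np.zeros(2)                       # learned action values
        p_reward = np.array([0.8, 0.2])
        better_choices = 0
        for _ in range(n_trials):
            probs = np.exp(beta * q)
            probs = probs / probs.sum()       # softmax choice rule
            action = rng.choice(2, p=probs)
            reward = float(rng.random() < p_reward[action])
            delta = reward - q[action]        # reward-prediction error
            q[action] += alpha * rpe_scale * delta
            better_choices += int(action == 0)
        return better_choices / n_trials      # fraction of choices of the better-paying option

    print(simulate_gain_condition(rpe_scale=1.5))   # larger prediction errors: learning improves
    print(simulate_gain_condition(rpe_scale=0.5))   # smaller prediction errors: learning degrades

Under these assumptions, scaling the prediction error alone shifts how often the higher-paying option ends up being chosen, which is the qualitative pattern reported for the gain condition.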

Some of the caveats spelled out above apply to Pessiglione et al.’s study. Pessiglione and

colleagues acknowledged that it was not obvious how the drugs they administered affected different

aspects of dopaminergic signalling with respect to e.g. tonic versus phasic firing, or distinct

dopamine receptors. Notably, they did not consider whether their interventions might have affected

learning behaviour (i.e. one of their target explananda) through effects on motivation or attention

as well. Nonetheless, evidence of the sort they provided is relevant to assess the relative depth of

RPEH-based explanatory generalizations. For it can demonstrate that reward-based learning and

decision-making are often modulated by reward-prediction errors encoded by dopaminergic

activity. And it may indicate that the relation between RPEs and dopaminergic activity on the one

hand, and choice behaviour during some reinforcement learning tasks, on the other, shows some

degree of invariance.

Fewer human studies have probed the link between pharmacological manipulation of dopamine,

on the one hand, and incentive salience attribution and motivation, on the other. One such study concerned the mechanism

of sexual motivation. Oei, Rombouts, Soeter, van Gerven, and Both (2012) investigated how

dopamine modulates activation in the ventral striatum during subconscious processing of sexual

stimuli; the ventral striatum, together with its dopaminergic pathways, is suggested to be a

component of a larger mesocorticolimbic mechanism of incentive salience.

Incentive salience is thought to be an essential property of sexual stimuli that would

motivate behavioural approach tendencies, capture attention, and elicit urges to pursue sex. An ISH-

based explanation of sexual motivation would claim that sexual desire is produced by processes in a

large mesocorticolimbic network driven by the release of dopamine into striatal targets.

Dopaminergic activations would bestow incentive salience on sexual cues and sexual unconditioned

stimuli, making these stimuli “wanted” and attention-grabbing.

Oei and colleagues used fMRI combined with dopaminergic manipulations through

administration of L-DOPA and haloperidol in healthy participants to probe the amplification (or

depression) of the incentive salience of unconsciously perceived sexual cues. It was found that L-

DOPA significantly enhanced activation in the ventral striatum and dorsal anterior cingulate—a

brain region involved in cognitive control, action selection, emotional processing and motor

control—when sexual stimuli were subliminally presented, in contrast to emotionally neutral and

emotionally negative stimuli. Haloperidol, instead, decreased activation in those areas when the

sexual stimuli were presented. It was concluded that the processing of sexual incentive stimuli is

sensitive to pharmacological manipulations of dopamine levels in the midbrain.

These findings provide some evidence relevant to assess the degree of depth of an ISH-

based explanation of sexual motivation because they would indicate that such an explanation might be

invariant over changes in the magnitude of dopaminergic signals in the mesocorticolimbic network,

which would enable sexual motivation, as well as over changes in conscious perception of the

sexual incentive stimuli. However—as Oei and colleagues acknowledged—these results did not

speak to whether dopamine-dependent regulation of incentive salience attribution is related to

increases (or decreases) in sexual desire or behavioural approach tendencies. Neither did they

discriminate whether or not the dopaminergic changes they observed could be predicted by the


reward-prediction signals in a TD-learning algorithm. Hence, studies such as this one leave open the

possibilities that the dopaminergic intervention did not in fact affect attention or motivation to

pursue sexual behaviour (i.e. target explananda phenomena of an ISH-based explanation of sexual

motivation), and that the intervention could have affected sexual motivation through the effects of

reward-prediction error neural computations underlying reinforcement learning.

5. Conclusion

This paper has made two types of contributions to existing literature, which should be of interest to

both historians and philosophers of cognitive science. First, the paper has provided a comprehensive

historical overview of the main steps that have led to the formulation of the RPEH. Second, in light

of this historical overview, it has made explicit what precisely the RPEH and the ISH explain, and

under which circumstances neurocomputational explanations of learning and decision-making

phenomena based on the RPEH can be justifiably deemed deeper than explanations based on the

ISH.

From the historical overview, it emerges that the formulation and subsequent success of the

RPEH depend, at least in part, on its capacity to combine several threads of research across

psychology, neuroscience and machine learning. By bringing the computational framework of RL

to bear on the neurophysiological and behavioural data that have been gathered about

dopaminergic neurons since the 1960s, the RPEH connects dopamine’s neural function to cognitive

function in a quantitatively precise and compact fashion.

It should now be clear that the RPEH and the ISH, which is arguably its current main

alternative, are hypotheses about the type of information encoded by dopaminergic activity. As

such, they do not explain by themselves why or how people and other animals display certain types

of phenomena related to learning, decision-making or motivation. Nonetheless, by putting forward

different claims about what dopaminergic neurons encode, these hypotheses furnish the basis for

distinct dopamine-centred explanations of those phenomena.

The paper has examined some such explanations, contrasting them along two dimensions of

explanatory depth. It has not been established that RPEH-based explanations are actually deeper—

in either of the two senses of explanatory depth considered—than alternative explanations based on

the ISH. For the dopamine-centred explanations that the two hypotheses motivate are currently

tentative and incomplete. Nonetheless, from the relevant available evidence discussed in the paper,

there are grounds to tentatively believe that currently, for at least some phenomenon related to

learning, decision-making or motivation, some RPEH-based explanation has wider scope or a

higher degree of invariance than some ISH-based alternative explanation.

Acknowledgements I am sincerely grateful to Aistis Stankevicius, Charles Rathkopf, Peter Dayan,

and especially to Gregory Radick, editor of this journal, and to two anonymous referees, for their

encouragement, constructive criticisms and helpful suggestions. The work on this project was

supported by the Deutsche Forschungsgemeinschaft (DFG) as part of the priority program “New

Frameworks of Rationality” ([SPP 1516]). The usual disclaimers about any remaining error or

mistake in the paper apply.

References

Abbott, L.F. (2008). Theoretical neuroscience rising. Neuron, 60, 489-495.

Balleine, B.W., Daw, N.D., & O’Doherty J.P. (2009). Multiple forms of value learning and the function of

dopamine. In Neuroeconomics: Decision Making and the Brain, ed. Paul W. Glimcher, Colin F.

Camerer, Ernst Fehr, and Russell A. Poldrack, 367-388, New York: Academic Press.


Bayer, H.M., & Glimcher, P.W. (2005). Midbrain dopamine neurons encode a quantitative reward prediction

error signal. Neuron, 47(1), 129–141.

Bayer, H. M., Lau, B., & Glimcher, P. W. (2007). Statistics of midbrain dopamine neuron spike trains in the

awake primate. Journal of Neurophysiology, 98(3), 1428–1439.

Berns, G. S., McClure, S. M., Pagnoni, G., & Montague, P. R. (2001). Predictability modulates human brain

response to reward. Journal of Neuroscience, 21(8), 2793–2798.

Berridge, K. C. (2007) The debate over dopamine’s role in reward: the case for incentive salience.

Psychopharmacology (Berl), 191, 391-431.

Berridge, K.C., and Robinson, T.E., (1998). What is the role of dopamine in reward: hedonic impact, reward

learning, or incentive salience? Brain Research Reviews, 28, 309-369.

Berridge, K.C., Venier, I.L., & Robinson, T.E. (1989). Taste reactivity analysis of 6-hydroxydopamine-induced

aphagia: Implications for arousal and anhedonia hypotheses of dopamine function. Behavioral

Neuroscience, 103, 36-45.

Bindra, D.A. (1974). A motivational view of learning, performance, and behavior modification.

Psychological Review, 81, 199-213.

Björklund, A., & Dunnett, S.B. (2007). Dopamine neuron systems in the brain: an update. Trends in

Neurosciences, 30(5), 194-202.

Bush, R.R., & Mosteller, F. (1951). A mathematical model for simple learning. Psychological Review, 58,

313–323.

Caplin, A., Dean, M., Glimcher, P.W., & Rutledge, R.B. (2010). Measuring beliefs and rewards: A

neuroeconomic approach. Quarterly Journal of Economics, 125(3), 923-960.

Caplin, A., & Dean, M. (2008). Dopamine, Reward Prediction Error, and Economics. Quarterly Journal of

Economics, 123(2), 663-701.

Carlsson A. (2003) A half-century of neurotransmitter research: impact on neurology and psychiatry, in

Nobel Lectures. Physiology or Medicine, 1996–2000 (Jörnvall H ed) pp 308-309, World Scientific

Publishing Co., Singapore.

Carlsson, A. (1966). Morphologic and dynamic aspects of dopamine in the central nervous system. In: Costa

E., Côté L.J., Yahr M.D., editors. Biochemistry and pharmacology of the basal ganglia. Hewlett, NY:

Raven Press, pp. 107–113.

Carlsson A (1959). The occurrence, distribution, and physiological role of catecholamines in the nervous

system. Pharmacological Reviews, 11, 490-493.

Churchland, P.S. & Sejnowski, T.J. (1992). The Computational Brain. Cambridge, MA, MIT Press.


Colombo, M. (2013). Constitutive relevance and the personal/subpersonal distinction. Philosophical

Psychology, 26, 547-570.

Costall, B. & Naylor, R. J. (1979). Behavioural aspects of dopamine agonists and antagonists. In A. S. Horn,

J. Korf and B.H.C. Westerink (Eds.), The Neurobiology of Dopamine, Academic Press, London, pp.

555-576.

Crow, T.J. (1972). A map of the rat mesencephalon for electrical self-stimulation. Brain Research, 36, 265-

273.

Dayan, P. & Niv, Y. (2008). Reinforcement learning: The good, the bad and the ugly. Current Opinion in

Neurobiology, 18, 185-196.

Dayan P. (1994) Computational modelling. Current Opinion in Neurobiology 4(2):212-217.

Dayan, P. (2012). Twenty-five lessons from computational neuromodulation. Neuron,76, 240-256.

Daw ND, Niv Y, Dayan P (2005) Uncertainty-based competition between prefrontal and dorsolateral striatal

systems for behavioral control. Nature Neuroscience 8:1704–1711

Dunnett, S.B., & Robbins, T.W. (1992). The functional role of mesotelencephalic dopamine systems.

Biological reviews of the Cambridge Philosophical Society, 67(4), 491-518.

Ehringer, H., & Hornykiewicz, O. (1960). Verteilung von Noradrenalin und Dopamin (3-Hydroxytyramin) im

Gehirn des Menschen und ihr Verhalten bei Erkrankungen des extrapyramidalen Systems [Distribution of

noradrenaline and dopamine (3-hydroxytyramine) in the human brain and their behaviour in diseases of

the extrapyramidal system]. Klinische Wochenschrift, 38, 1236-1239.

Fibiger, H.C. (1978). Drugs and reinforcement mechanisms: A critical review of the catecholamine theory.

Annual Review of Pharmacology & Toxicology, 18, 37-56.

Friston K.J., Shiner T., FitzGerald T., Galea J.M., Adams R., Brown H., Dolan R.J., Moran R., Stephan K.E.,

& Bestmann, S. (2012). Dopamine, affordance and active inference. PLoS Computational Biology

8:e1002327. doi: 10.1371/journal.pcbi.1002327.

Friston, K.J., Tononi, G., Reeke, G. N., Sporns, O. & Edelman, G. M. (1994). Value-dependent selection in

the brain: simulation in a synthetic neural model. Neuroscience, 59, 229-243.

Glimcher, P.W. (2011). Understanding dopamine and reinforcement learning: The dopamine reward

prediction error hypothesis. Proceedings of the National Academy of Sciences USA, 108(Suppl. 3),

15647–15654.

Grace, A.A. (1991). Phasic versus tonic dopamine release and the modulation of dopamine system

responsivity: a hypothesis for the etiology of schizophrenia. Neuroscience, 41(1), 1-24.

Graybiel, A.M. (2008). Habits, rituals and the evaluative brain. Annual Review of Neuroscience, 31, 359-387.

Hammer, M. (1993). An identified neuron mediates the unconditioned stimulus in associative olfactory

learning in honeybees. Nature, 366, 59-63.


Hempel, C.G. (1959). The Logic of Functional Analysis. In Symposium on Sociological Theory, ed. L.

Gross, 271–87. New York: Harper & Row. Repr. with revisions. 1965. In Aspects of Scientific

Explanation and Other Essays in the Philosophy of Science, 297–330. New York: Free Press.

Holroyd, C.B., & Coles, M.G.H. (2002). The neural basis of human error processing: Reinforcement

learning, dopamine, and the error-related negativity. Psychological Review, 109(4), 679–709.

Hornykiewicz, O. (1966). Dopamine (3-hydroxytyramine) and brain function. Pharmacological Reviews, 18,

925-964.

Huys QJM, Vogelstein J & Dayan P (2008). Psychiatry: Insights into depression through normative decision-

making models. NIPS 2008.

Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: new anatomical and

computational perspectives, Neural Networks, 15, 535-47.

Kishida, K.T., Sandberg, S.S., Lohrenz, T., Comair, Y.G., Saez, I.G., Phillips, P.E.M., & Montague, P.R.

(2011). Sub-Second Dopamine Detection in Human Striatum. PLoS ONE, 6(8), e23291.

Kitcher, P. (1989). Explanatory Unification and the Causal Structure of the World. In Scientific Explanation,

ed. P. Kitcher and W. Salmon, Minneapolis: University of Minnesota Press, pp. 410–505.

Knutson, B., Adams, C. M., Fong, G. W., & Hommer, D. (2001). Anticipation of increasing monetary

reward selectively recruits nucleus accumbens. Journal of Neuroscience, 21(16), RC159.

Koob, G.F. (1982). The dopamine anhedonia hypothesis: a pharmacological phrenology. Behavioral and Brain

Sciences, 5, 63-64.

Lindvall, O., & Björklund, A. (1974). The organization of the ascending catecholamine neuron systems in the

rat brain as revealed by the glyoxylic acid fluorescence method. Acta Physiologica Scandinavica Suppl.

412, 1-48.

Ljungberg T, Apicella P, & Schultz W (1992). Responses of monkey dopamine neurons during learning of

behavioral reactions. Journal of Neurophysiology, 67, 145-163.

Loewi, O. (1936) The chemical transmission of nerve action. Nobel Lecture. Reprinted in Nobel Lectures,

Physiology or Medicine, vol. 2 (1922–1941), pp. 416–432. Amsterdam: Elsevier, 1965. Available

online at: URL <http://www.nobelprize.org/nobel_prizes/medicine/laureates/1936/loewi-lecture.html>

McClure, S.M., & D’Ardenne, K. (2009). Computational neuroimaging: monitoring reward learning with

blood flow. In: Dr. Jean-Claude Dreher and Léon Tremblay, editors, Handbook of Reward and

Decision Making. Oxford: Academic Press, 2009, pp. 229-247.

McClure, S.M., Berns, G. S., & Montague, P. R. (2003a). Temporal prediction errors in a passive learning

task activate human striatum. Neuron, 38(2), 339–346.


McClure, S.M., Daw, N.D., & Montague, P.R., (2003b). A computational substrate for incentive salience.

Trends in Neuroscience 26(8), 423-428.

Mirenowicz J., & Schultz W. (1994). Importance of unpredictability for reward responses in primate

dopamine neurons. Journal of Neurophysiology, 72(2), 1024-1027.

Montague, P.R. (2007). Your Brain is Almost Perfect: How we make Decisions. New York: Plume.

Montague, P.R., Dolan, R.J., Friston, K.J., & Dayan, P. (2012). Computational psychiatry. Trends in

Cognitive Sciences, 16, 72-80.

Montague, P.R., Dayan, P., & Sejnowski, T.J. (1996). A framework for mesencephalic dopamine systems

based on predictive Hebbian learning. Journal of Neuroscience, 16(5): 1936-1947.

Montague, P.R., Dayan, P, Person, C, & Sejnowski, T.J. (1995). Bee foraging in uncertain environments

using predictive Hebbian learning. Nature 377, 725-728.

Montague, PR, Dayan, P, Nowlan, SJ, Pouget, A & Sejnowski, TJ (1993). Using aperiodic reinforcement for

directed self-organization. In Advances in Neural Information Processing Systems 5, SJ Hanson, JD

Cowan, CL Giles (Eds), San Mateo (CA): Morgan Kaufmann 969-977.

Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology 53(3): 139-154.

Niv, Y., & Montague, P.R. (2009) Theoretical and empirical studies of learning. In Neuroeconomics:

Decision Making and the Brain, eds Glimcher PW, et al. (Academic Press, New York), pp 329–249.

Niv, Y. & Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences, 12(7), 265-

272.

O’Doherty, J. P. (2012). Beyond simple reinforcement learning: the computational neurobiology of reward-

learning and valuation. The European Journal of Neuroscience, 35(7), 987-990.

O’Doherty, J., Dayan, P., Friston, K., Critchley, H., & Dolan, R. (2003). Temporal difference learning model

accounts for responses in human ventral striatum and orbitofrontal cortex during Pavlovian appetitive

learning. Neuron, 38, 329-337.

Oei, N.Y., Rombouts, S.A., Soeter, S.P., van Gerven, J.M., & Both, S. (2012). Dopamine modulates reward

system activity during subconscious processing of sexual stimuli. Neuropsychopharmacology, 37,

1729-1737.

Olds, J., & Milner, P.M. (1954). Positive reinforcement produced by electrical stimulation of septal area and

other regions of rat brain. Journal of Comparative & Physiological Psychology, 47, 419-427.

O’Reilly, R. C., & Frank, M. J. (2006). Making working memory work: A computational model of learning

in prefrontal cortex and basal ganglia. Neural Computation, 18, 283-328.


O’Sullivan, S.S., Wu, K., Politis, M., Lawrence, A.D., Evans, A.H., Bose, S.K., Djamshidian, A., Lees, A.J.

& Piccini, P. (2011) Cue-induced striatal dopamine release in Parkinson’s disease-associated

impulsive-compulsive behaviours. Brain, 134, 969-997.

Peciña, S., Cagniard, B., Berridge, K.C., Aldridge, J.W. & Zhuang, X. (2003). Hyperdopaminergic mutant

mice have higher ‘wanting’ but not ‘liking’ for sweet rewards. Journal of Neuroscience, 23, 9395–

9402

Pessiglione, M., Seymour, B., Flandin, G., Dolan, R.J., & Frith, C.D. (2006). Dopamine-dependent

prediction errors underpin reward-seeking behaviour in humans. Nature, 442, 1042-1045.

Piccinini, G., & Bahar, S. (2012). Neural Computation and the Computational Theory of Cognition.

Cognitive Science.

Quartz SR, Dayan P, Montague PR, & Sejnowski TJ (1992). Expectation learning in the brain using diffuse

ascending projections. Society for Neuroscience Abstracts 18:1210.

Redgrave, P. & Gurney, K. (2006). The short-latency dopamine signal: a role in discovering novel actions?

Nature Reviews Neuroscience, 7:967-975.

Rescorla R.A., & Wagner A.R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of

reinforcement and nonreinforcement. In: Classical Conditioning II: Current Research and Theory

(Eds Black AH, Prokasy WF) New York: Appleton Century Crofts, pp. 64-99.

Roesch, M. R., Calu, D. J., & Schoenbaum, G. (2007). Dopamine neurons encode the better option in rats

deciding between differently delayed or sized rewards. Nature Neuroscience, 10(12), 1615–1624.

Robinson, T.E. & Berridge, K.C. (1993). The neural basis of drug craving. An incentive-sensitization theory

of addiction. Brain Research Reviews, 18, 247-291.

Robinson, T.E., & Berridge, K.C. (2008). The incentive sensitization theory of addiction: some current

issues. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1507), 3137–

3146.

Robinson, S., Sandstrom, S.M., Denenberg, V.H., & Palmiter, R.D. (2005) Distinguishing whether dopamine

regulates liking, wanting, and/or learning about rewards. Behavioral Neuroscience, 119, 5–15.

Romo, R., & Schultz W. ( 1990). Dopamine neurons of the monkey midbrain: contingencies of responses to

active touch during self-initiated arm movements. Journal of Neurophysiology, 63, 592-606.

Ross, D. (2010). Economic Models of Pathological Gambling. In D. Ross, H. Kincaid, D. Spurrett, & P.

Collins (Eds.), What is Addiction? (pp. 131-158). Cambridge (MA): MIT Press.

Schultz, W., Dayan, P., & Montague, P.R. (1997). A neural substrate of prediction and reward. Science, 275,

1593-1599.


Schultz, W., Apicella P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and

conditioned stimuli during successive steps of learning a delayed response task. Journal of

Neuroscience, 13, 900-913.

Schultz W., & Romo, R. (1990). Dopamine neurons of the monkey midbrain: contingencies of responses to

stimuli eliciting immediate behavioral reactions. Journal of Neurophysiology, 63, 607-624.

Schultz W. (1986). Responses of midbrain dopamine neurons to behavioral trigger stimuli in the monkey.

Journal of Neurophysiology, 56, 1439 -1461.

Schultz, W., Ruffieux, A., & Aebischer, P. (1983). The activity of pars compacta neurons of the monkey

substantia nigra in relation to motor activation. Experimental Brain Research, 51, 377-387.

Skinner, B.F. (1938). The behavior of organisms. New York: D. Appleton-Century.

Stein, L. (1969). Chemistry of purposive behavior. In J. T. Tapp (Ed.), Reinforcement and Behavior,

Academic Press, New York, pp. 328-355.

Stein, L. (1968). Chemistry of reward and punishment. In: Proceedings of the American College of

Neuropsychopharmacology (Efron, D.H., Ed.) (U.S. Government Printing Office: Washington, DC), pp.

105-123.

Strevens, M. (2004). The causal and unification accounts of explanation unified—causally. Noûs, 38, 154-

176.

Strevens, M. (2009). Depth: An Account of Scientific Explanation. Cambridge, MA: Harvard University

Press.

Stricker, E. M., & Zigmond, M. J. (1986). Brain monoamines, homeostasis, and adaptive behavior. In

Handbook of physiology, Vol. IV: Intrinsic regulatory systems of the brain (pp. 677-696). Bethesda,

MD: American Physiological Society.

Sutton, R.S. (1988). Learning to Predict by the Method of Temporal Differences. Machine Learning, 3, 9-44.

Sutton, R.S., & Barto, A.G. (1998). Reinforcement Learning. An Introduction. Cambridge, MA, MIT Press.

Sutton, R.S., & Barto, A.G. (1981). Toward a modern theory of adaptive networks: Expectation and

prediction. Psychological Review, 88(2), 135-170.

Sutton, R.S. & Barto, A.G. (1987). A temporal-difference model of classical conditioning. Proceedings of

the Ninth Annual Conference of the Cognitive Science Society. Seattle, WA.

Thorndike, E. (1911). Animal Intelligence. New York: MacMillan.

Tindell, A.J., Berridge, K.C., Zhang, J., Peciña, S. & Aldridge, J.W. (2005). Ventral pallidal neurons code

incentive motivation: amplification by mesolimbic sensitization and amphetamine. European Journal

of Neuroscience, 22, 2617-2634.


Toates, F. (1986). Motivational Systems. Cambridge University Press, Cambridge.

Tricomi, E., Balleine, B., & O’Doherty, J. (2009). A specific role for posterior dorsolateral striatum in

human habit learning. European Journal of Neuroscience, 29, 2225–2232.

Trowill, J. A., Panksepp, J., & Gandelman, R. (1969). An incentive model of rewarding brain stimulation.

Psychological Review, 76, 264-281.

Weslake, B. (2010). Explanatory Depth. Philosophy of Science, 77(2), 273-294.

White, N. M. (1986). Control of sensorimotor function by dopaminergic nigrostriatal neurons: Influences of

eating and drinking. Neuroscience and Biobehavioral Review, 10, 15-36.

Wise, R.A. (1982). Neuroleptics and operant behavior: the anhedonia hypothesis. Behavioral and Brain

Sciences, 5, 39-88.

Wise, R.A. (2004). Dopamine, learning and motivation. Nature Reviews Neuroscience, 5, 483-494.

Woodward, J. (2003). Making Things Happen: A Theory of Causal Explanation. New York: Oxford

University Press.

Woodward, J., & Hitchcock, C. (2003). Explanatory Generalizations, pt. 2, Plumbing Explanatory Depth.

Noûs, 37, 181–99.

Wyvell, C.L., & Berridge, K.C. (2000) Intra-accumbens amphetamine increases the conditioned incentive

salience of sucrose reward: enhancement of reward “wanting” without enhanced “liking” or response

reinforcement. Journal of Neuroscience, 20, 8122-8130.

Zaghloul, K.A., Blanco, J.A., Weidemann, C.T., McGill, K., Jaggi, J.L., Baltuch, G.H., & Kahana, M.J.

(2009). Human substantia nigra neurons encode unexpected financial rewards. Science 323(5920):

1496-1499.

Zhang J., Berridge K.C., Tindell, A.J., Smith KS, & Aldridge JW (2009) A neural computational model of

incentive salience. PLoS Computational Biology, 5:e1000437.