
Progress in Neurobiology 68 (2002) 113–143

Functional integration and inference in the brain

Karl Friston∗

The Wellcome Department of Imaging Neuroscience, Institute of Neurology, University College London, 12 Queen Square, London WC1N 3BG, UK

∗ Tel.: +44-207-833-7454; fax: +44-207-813-1445. E-mail address: [email protected] (K. Friston).

Received 17 October 2001; accepted 16 August 2002

Abstract

Self-supervised models of how the brain represents and categorises the causes of its sensory input can be divided into two classes: those that minimise the mutual information (i.e. redundancy) among evoked responses and those that minimise the prediction error. Although these models have similar goals, the way they are attained, and the functional architectures employed, can be fundamentally different. This review describes the two classes of models and their implications for the functional anatomy of sensory cortical hierarchies in the brain. We then consider how empirical evidence can be used to disambiguate between architectures that are sufficient for perceptual learning and synthesis.

Most models of representational learning require prior assumptions about the distribution of sensory causes. Using the notion of empirical Bayes, we show that these assumptions are not necessary and that priors can be learned in a hierarchical context. Furthermore, we try to show that learning can be implemented in a biologically plausible way. The main point made in this review is that backward connections, mediating internal or generative models of how sensory inputs are caused, are essential if the process generating inputs cannot be inverted. Because these processes are dynamical in nature, sensory inputs correspond to a non-invertible nonlinear convolution of causes. This enforces an explicit parameterisation of generative models (i.e. backward connections) to enable approximate recognition and suggests that feedforward architectures, on their own, are not sufficient. Moreover, nonlinearities in generative models that induce a dependence on backward connections require these connections to be modulatory, so that estimated causes in higher cortical levels can interact to predict responses in lower levels. This is important in relation to functional asymmetries in forward and backward connections that have been demonstrated empirically.

To ascertain whether backward influences are expressed functionally requires measurements of functional integration among brain systems. This review summarises approaches to integration in terms of effective connectivity and proceeds to address the question posed by the theoretical considerations above. In short, it will be shown that functional neuroimaging can be used to test for interactions between bottom–up and top–down inputs to an area. The conclusion of these studies points toward the prevalence of top–down influences and the plausibility of generative models of sensory brain function.

© 2002 Elsevier Science Ltd. All rights reserved.

Contents

1. Introduction
2. Functional specialisation and integration
   2.1. Background
   2.2. Functional specialisation and segregation
   2.3. The anatomy and physiology of cortico-cortical connections
   2.4. Functional integration and effective connectivity
3. Representational learning
   3.1. The nature of representations
   3.2. Supervised models
      3.2.1. Category specificity and connectionism
      3.2.2. Implementation
   3.3. Information theoretic approaches
      3.3.1. Efficiency, redundancy and information
      3.3.2. Implementation



   3.4. Predictive coding and the inverse problem
      3.4.1. Implementation
      3.4.2. Predictive coding and Bayesian inference
   3.5. Cortical hierarchies and empirical Bayes
      3.5.1. Empirical Bayes in the brain
   3.6. Generative models and representational learning
      3.6.1. Density estimation and EM
      3.6.2. Supervised representational learning
      3.6.3. Information theory
      3.6.4. Predictive coding
   3.7. Summary
4. Generative models and the brain
   4.1. Context, causes and representations
   4.2. Neuronal responses and representations
      4.2.1. Examples from electrophysiology
5. Functional architectures assessed with brain imaging
   5.1. Context-sensitive specialisation
      5.1.1. Categorical designs
      5.1.2. Multifactorial designs
      5.1.3. Psychophysiological interactions
   5.2. Effective connectivity
      5.2.1. Effective connectivity and Volterra kernels
      5.2.2. Nonlinear coupling among brain areas
6. Functional integration and neuropsychology
   6.1. Dynamic diaschisis
      6.1.1. An empirical demonstration
7. Conclusion
Acknowledgements
References

1. Introduction

In concert with the growing interest in contextual and extra-classical receptive field effects in electrophysiology (i.e. how the receptive fields of sensory neurons change according to the context a stimulus is presented in), a similar paradigm shift is emerging in imaging neuroscience; namely, the appreciation that functional specialisation exhibits similar extra-classical phenomena, in which a cortical area may be specialised for one thing in one context but something else in another. These extra-classical phenomena have implications for theoretical ideas about how the brain might work. This review uses the relationship among theoretical models of representational learning as a vehicle to illustrate how imaging can be used to address important questions about functional brain architectures.

We start by reviewing two fundamental principles of brain organisation, namely functional specialisation and functional integration, and how they rest upon the anatomy and physiology of cortico-cortical connections in the brain. Section 3 deals with the nature and learning of representations from a theoretical or computational perspective. This section reviews supervised (e.g. connectionist) approaches, information theoretic approaches and those predicated on predictive coding, and reprises their heuristics and motivation using the framework of generative models.

The key focus of this section is on the functional architectures implied by each model of representational learning. Information theory can, in principle, proceed using only forward connections. However, it turns out that this is only possible when processes generating sensory inputs are invertible and independent. Invertibility is precluded when the cause of a percept and the context in which it is engendered interact. These interactions create a problem of contextual invariance that can only be solved using internal or generative models. Contextual invariance is necessary for categorisation of sensory input (e.g. category-specific responses) and represents a fundamental problem in perceptual synthesis. Generative models based on predictive coding solve this problem with hierarchies of backward and lateral projections that prevail in the real brain. In short, generative models of representational learning are a natural choice for understanding real functional architectures and, critically, confer a necessary role on backward connections.

Empirical evidence, from electrophysiological studies of animals and functional neuroimaging studies of human subjects, is presented in Sections 4 and 5 to illustrate the context-sensitive nature of functional specialisation and how its expression depends upon integration among remote cortical areas. Section 4 looks at extra-classical effects in electrophysiology, in terms of the predictions afforded by generative models of brain function.


The theme of context-sensitive evoked responses is generalised to a cortical level and human functional neuroimaging studies in the subsequent section. The critical focus of this section is evidence for the interaction of bottom–up and top–down influences in determining regional brain responses. These interactions can be considered signatures of backward connections. The final section reviews some of the implications of the foregoing sections for lesion studies and neuropsychology. 'Dynamic diaschisis' is described, in which aberrant neuronal responses can be observed as a consequence of damage to distal brain areas providing enabling or modulatory afferents. This section uses neuroimaging in neuropsychological patients and discusses the implications for constructs based on the lesion-deficit model.

2. Functional specialisation and integration

2.1. Background

The brain appears to adhere to two fundamental principles of functional organisation, functional integration and functional specialisation, where the integration within and among specialised areas is mediated by effective connectivity. The distinction relates to that between 'localisationism' and '(dis)connectionism' that dominated thinking about cortical function in the nineteenth century. Since the early anatomic theories of Gall, the identification of a particular brain region with a specific function has become a central theme in neuroscience. However, functional localisation per se was not easy to demonstrate: for example, a meeting that took place on 4 August 1881 addressed the difficulties of attributing function to a cortical area, given the dependence of cerebral activity on underlying connections (Phillips et al., 1984). This meeting was entitled "Localisation of function in the cortex cerebri". Goltz, although accepting the results of electrical stimulation in dog and monkey cortex, considered that the excitation method was inconclusive, in that the behaviours elicited might have originated in related pathways, or current could have spread to distant centres. In short, the excitation method could not be used to infer functional localisation because localisationism discounted interactions, or functional integration, among different brain areas. It was proposed that lesion studies could supplement excitation experiments. Ironically, it was observations on patients with brain lesions some years later (see Absher and Benson, 1993) that led to the concept of 'disconnection syndromes' and the refutation of localisationism as a complete or sufficient explanation of cortical organisation. Functional localisation implies that a function can be localised in a cortical area, whereas specialisation suggests that a cortical area is specialised for some aspects of perceptual or motor processing, where this specialisation can be anatomically segregated within the cortex. The cortical infrastructure supporting a single function may then involve many specialised areas whose union is mediated by the functional integration among them. Functional specialisation and integration are not exclusive; they are complementary. Functional specialisation is only meaningful in the context of functional integration and vice versa.

2.2. Functional specialisation and segregation

The functional role, played by any component (e.g. cortical area, sub-area, neuronal population or neuron) of the brain, is defined largely by its connections. Certain patterns of cortical projections are so common that they could amount to rules of cortical connectivity. "These rules revolve around one, apparently, overriding strategy that the cerebral cortex uses—that of functional segregation" (Zeki, 1990). Functional segregation demands that cells with common functional properties be grouped together. This architectural constraint in turn necessitates both convergence and divergence of cortical connections. Extrinsic connections, between cortical regions, are not continuous but occur in patches or clusters. This patchiness has, in some instances, a clear relationship to functional segregation. For example, the secondary visual area V2 has a distinctive cytochrome oxidase architecture, consisting of thick stripes, thin stripes and inter-stripes. When recordings are made in V2, directionally selective (but not wavelength or colour selective) cells are found exclusively in the thick stripes. Retrograde (i.e. backward) labelling of cells in V5 is limited to these thick stripes. All the available physiological evidence suggests that V5 is a functionally homogeneous area that is specialised for visual motion. Evidence of this nature supports the notion that patchy connectivity is the anatomical infrastructure that underpins functional segregation and specialisation. If it is the case that neurons in a given cortical area share a common responsiveness (by virtue of their extrinsic connectivity) to some sensorimotor or cognitive attribute, then this functional segregation is also an anatomical one. Challenging a subject with the appropriate sensorimotor attribute or cognitive process should lead to activity changes in, and only in, the areas of interest. This is the model upon which the search for regionally specific effects with functional neuroimaging is based.

2.3. The anatomy and physiology of cortico-cortical connections

If specialisation rests upon connectivity then important organisational principles should be embodied in the neuroanatomy and physiology of extrinsic connections. Extrinsic connections couple different cortical areas whereas intrinsic connections are confined to the cortical sheet. There are certain features of cortico-cortical connections that provide strong clues about their functional role. In brief, there appears to be a hierarchical organisation that rests upon the distinction between forward and backward connections. The designation of a connection as forward or backward depends primarily on its cortical layers of origin and termination.


Table 1. Some key characteristics of extrinsic cortico-cortical connections in the brain

Hierarchical organisation
• The organisation of the visual cortices can be considered as a hierarchy (Felleman and Van Essen, 1991)
• The notion of a hierarchy depends upon a distinction between forward and backward extrinsic connections
• This distinction rests upon different laminar specificity (Rockland and Pandya, 1979; Salin and Bullier, 1995)
• Backward connections are more numerous and transcend more levels
• Backward connections are more divergent than forward connections (Zeki and Shipp, 1988)

Forward connections | Backward connections
Sparse axonal bifurcations | Abundant axonal bifurcation
Topographically organised | Diffuse topography
Originate in supragranular layers | Originate in bilaminar/infragranular layers
Terminate largely in layer IV | Terminate predominantly in supragranular layers
Postsynaptic effects through fast AMPA (1.3–2.4 ms decay) and GABA-A (6 ms decay) receptors | Modulatory afferents activate slow (50 ms decay) voltage-sensitive NMDA receptors

Some characteristics of cortico-cortical connections are presented below and are summarised in Table 1. The list is not exhaustive, nor properly qualified, but serves to introduce some important principles that have emerged from empirical studies of visual cortex.

• Hierarchical organisation: The organisation of the visual cortices can be considered as a hierarchy of cortical levels with reciprocal extrinsic cortico-cortical connections among the constituent cortical areas (Felleman and Van Essen, 1991). The notion of a hierarchy depends upon a distinction between forward and backward extrinsic connections.

• Forwards and backwards connections—laminar specificity: Forwards connections (from a low to a high level) have sparse axonal bifurcations and are topographically organised, originating in supragranular layers and terminating largely in layer IV. Backward connections, on the other hand, show abundant axonal bifurcation and a diffuse topography. Their origins are bilaminar/infragranular and they terminate predominantly in supragranular layers (Rockland and Pandya, 1979; Salin and Bullier, 1995).

• Forward connections are driving and backward connections are modulatory: Reversible inactivation (e.g. Sandell and Schiller, 1982; Girard and Bullier, 1989) and functional neuroimaging (e.g. Büchel and Friston, 1997) studies suggest that forward connections are driving, whereas backward connections can be modulatory. The notion that forward connections are concerned with the promulgation and segregation of sensory information is consistent with: (i) their sparse axonal bifurcation; (ii) patchy axonal terminations; and (iii) topographic projections. In contradistinction, backward connections are generally considered to have a role in mediating contextual effects and in the co-ordination of processing channels. This is consistent with: (i) their frequent bifurcation; (ii) diffuse axonal terminations; and (iii) non-topographic projections (Salin and Bullier, 1995; Crick and Koch, 1998).

• Modulatory connections have slow time constants: Forward connections mediate their post-synaptic effects through fast AMPA (1.3–2.4 ms decay) and GABA-A (6 ms decay) receptors. Modulatory afferents activate NMDA receptors. NMDA receptors are voltage-sensitive, showing nonlinear and slow dynamics (50 ms decay). They are found predominantly in supragranular layers where backward connections terminate (Salin and Bullier, 1995). These slow time constants again point to a role in mediating contextual effects that are more enduring than phasic sensory-evoked responses.

• Backwards connections are more divergent than forward connections: Extrinsic connections show an orderly convergence and divergence of connections from one cortical level to the next. At a macroscopic level, one point in a given cortical area will connect to a region 5–8 mm in diameter in another. An important distinction between forward and backward connections is that backward connections are more divergent. For example, the divergence region of a point in V5 (i.e. the region receiving backward afferents from V5) may include thick and inter-stripes in V2, whereas its convergence region (i.e. the region providing forward afferents to V5) is limited to the thick stripes (Zeki and Shipp, 1988). Reciprocal interactions between two levels, in conjunction with the divergence of backward connections, render any area sensitive to the vicarious influence of other regions at the same hierarchical level, even in the absence of direct lateral connections.

• Backward connections are more numerous and transcend more levels: Backward connections are more abundant than forward connections. For example, the ratio of forward efferent connections to backward afferents in the lateral geniculate is about 1:10 to 1:20. Another important distinction is that backward connections will traverse a number of hierarchical levels, whereas forward connections are more restricted. For example, there are backward connections from TE and TEO to V1 but no monosynaptic connections from V1 to TE or TEO (Salin and Bullier, 1995).


In summary, the anatomy and physiology of cortico-cortical connections suggest that forward connections are driving and commit cells to a pre-specified response given the appropriate pattern of inputs. Backward connections, on the other hand, are less topographic and are in a position to modulate the responses of lower areas to driving inputs from either higher or lower areas (see Table 1). Backwards connections are abundant in the brain and are in a position to exert powerful effects on evoked responses, in lower levels, that define the specialisation of any area or neuronal population. The idea pursued below is that specialisation depends upon backwards connections and, due to the greater divergence of the latter, can embody contextual effects. Appreciating this is important for understanding how functional integration can dynamically reconfigure the specialisation of brain areas that mediate perceptual synthesis.

2.4. Functional integration and effective connectivity

Electrophysiology and imaging neuroscience have firmly established functional specialisation as a principle of brain organisation in man. The functional integration of specialised areas has proven more difficult to assess. Functional integration refers to the interactions among specialised neuronal populations and how these interactions depend upon the sensorimotor or cognitive context. Functional integration is usually assessed by examining the correlations among activity in different brain areas, or trying to explain the activity in one area in relation to activities elsewhere. Functional connectivity is defined as correlations between remote neurophysiological events. However, correlations can arise in a variety of ways. For example, in multi-unit electrode recordings they can result from stimulus-locked transients evoked by a common input or reflect stimulus-induced oscillations mediated by synaptic connections (Gerstein and Perkel, 1969). Integration within a distributed system is usually better understood in terms of effective connectivity. Effective connectivity refers explicitly to the influence that one neuronal system exerts over another, either at a synaptic (i.e. synaptic efficacy) or population level (Friston, 1995a). It has been proposed that "the (electrophysiological) notion of effective connectivity should be understood as the experiment- and time-dependent, simplest possible circuit diagram that would replicate the observed timing relationships between the recorded neurons" (Aertsen and Preißl, 1991). This speaks to two important points: (i) effective connectivity is dynamic, i.e. activity- and time-dependent; and (ii) it depends upon a model of the interactions. An important distinction, among models employed in functional neuroimaging, is whether these models are linear or nonlinear. Recent characterisations of effective connectivity have focussed on nonlinear models that accommodate the modulatory or nonlinear effects mentioned above. A more detailed discussion of these models is provided in Section 5.2, after the motivation for their application is established in the next section. In this review the terms modulatory and nonlinear are used almost synonymously. Modulatory effects imply that the post-synaptic response evoked by one input is modulated by, or interacts with, another. By definition this interaction must depend on nonlinear synaptic mechanisms.

In summary, the brain can be considered as an ensemble of functionally specialised areas that are coupled in a nonlinear fashion by effective connections. Empirically, it appears that connections from lower to higher areas are predominantly driving, whereas backwards connections, that mediate top–down influences, are more diffuse and are capable of exerting modulatory influences. In the next section we describe a theoretical perspective, provided by 'generative models', that highlights the functional importance of backwards connections and nonlinear interactions.

3. Representational learning

This section compares and contrasts the heuristics behind three prevalent computational approaches to representational learning and perceptual synthesis: supervised learning, and two forms of self-supervised learning based on information theory and predictive coding. These approaches will then be reconciled within the framework of generative models. This article restricts itself to sensory processing in cortical hierarchies. This precludes a discussion of other important ideas (e.g. reinforcement learning (Sutton and Barto, 1990; Friston et al., 1994), neuronal selection (Edelman, 1993) and dynamical systems theory (Freeman and Barrie, 1994)).

The relationship between model and real neuronal architectures is central to cognitive neuroscience. We address this relationship, in terms of representations, starting with an overview of representations in which the distinctions among various approaches can be seen clearly. An important focus of this section is the interaction among 'causes' of sensory input. These interactions pose the problem of contextual invariance. In brief, it will be shown that the problem of contextual invariance points to the adoption of generative models, where interactions among causes of a percept are modelled explicitly. Within the class of self-supervised models, we will compare classical information theoretic approaches and predictive coding. These two schemes use different heuristics which imply distinct architectures that are sufficient for their implementation. The distinction rests on whether an explicit model, of the way sensory inputs are generated, is necessary for representational learning. If this model is instantiated in backwards connections, then theoretical distinctions may shed light on the functional role of backward and lateral connections that are so prevalent in the brain.

3.1. The nature of representations

What is a representation? Here a representation is taken to be a neuronal event that represents some 'cause' in the sensorium. Causes are simply the states of the process generating sensory data. It is not easy to ascribe meaning to these states without appealing to the way that we categorise things, perceptually or conceptually.


High-level conceptual causes may be categorical in nature, such as the identity of a face in the visual field or the semantic category a perceived object belongs to. In a hierarchical setting, high-level causes may induce priors on lower-level causes that are more parametric in nature. For example, the perceptual cause "moving quickly" may show a one-to-many relationship with over-complete representations of different velocities in V5 (MT) units. An essential aspect of causes is their relationship to each other (e.g. 'is part of') and, in particular, their hierarchical structure. This ontology is often attended by ambiguous many-to-one and one-to-many mappings (e.g. a table has legs but so do horses; a wristwatch is a watch irrespective of the orientation of its hands). This ambiguity can render the problem of inferring causes from sensory information ill-posed (as we will see below).

Even though causes may be difficult to describe, they are easy to define operationally. Causes are the variables or states that are necessary to specify the products of a process (or model of that process) generating sensory information. In very general terms, let us frame the problem of representing real world causes s(t) in terms of the system of deterministic equations

$$\dot{x} = f(x, s), \qquad u = g(x) \tag{1}$$

where s is a vector of underlying causes in the environment (e.g. the velocity of a particular object, direction of radiant light, etc.) and u represents sensory inputs. ẋ is the rate of change of x, which here denotes some unobserved states of the world that form our sensory impression of it. The functions f and g can be highly nonlinear and allow for both the current state of the world and the causes of changes in those states to interact when evoking responses in sensory units. Sensory input can be shown to be a function of, and only of, the causes and their recent history:

$$u = G(s) = \sum_{i=1}^{\infty} \int_0^t \cdots \int_0^t \frac{\partial^i u(t)}{\partial s(t-\sigma_1)\cdots\partial s(t-\sigma_i)}\; s(t-\sigma_1)\cdots s(t-\sigma_i)\, d\sigma_1 \cdots d\sigma_i \tag{2}$$

G(s) is a functional (function of a function) that generates inputs from the causes. Eq. (2) is simply a functional Taylor expansion covering dynamical systems of the sort implied by Eq. (1). This expansion is called a Volterra series and can be thought of as a nonlinear convolution of the causes to give the inputs (see Box 1). Convolution is like smoothing, in this instance over time. A key aspect of this expansion is that it does not refer to the many hidden states of the world, only the causes of changes in states, that we want to represent. Furthermore, Eq. (1) does not contain any noise or error. This is because Eqs. (1) and (2) describe a real world process. There is no distinction between deterministic and stochastic behaviour until that process is observed. At the point the process is modelled, this distinction is invoked through notions of deterministic or observation noise. This section deals with how the brain might construct such models.

The importance of this formulation is that it highlights: (i) the dynamical aspects of sensory input; and (ii) the role of interactions among the causes of the sensory input. Dynamic aspects imply that the current state of the world, registered through our sensory receptors, depends not only on the extant causes but also on their history. Interactions among these causes, at any time in the past, can influence what is currently sensed. The second-order terms with i = 2 in Eq. (2) represent pairwise interactions among the causes. These interactions are formally identical to interaction terms in conventional statistical models of observed data and can be viewed as contextual effects, where the expression of a particular cause depends on the context induced by another. For example, the extraction of motion from the visual field depends upon there being sufficient luminance or wavelength contrast to define the moving surface. Another ubiquitous example, from early visual processing, is the occlusion of one object by another. In the absence of interactions, we would see a linear superposition of both objects, but the visual input caused by the nonlinear mixing of these two causes renders one occluded by the other. At a more cognitive level, the cause associated with the word 'HAMMER' will depend on the semantic context (that determines whether the word is a verb or a noun). These contextual effects are profound and must be discounted before the representations of the underlying causes can be considered veridical.
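To make the dynamical and interaction terms concrete, the sketch below simulates a discrete-time caricature of Eq. (2): two causes are passed through first-order (driving) kernels, plus a second-order (i = 2) term in which one cause modulates the input evoked by the other. All kernels, waveforms and weights are invented for this illustration; the paper treats G(s) purely abstractly.

```python
import numpy as np

# Discrete-time caricature of Eq. (2): first-order (driving) terms plus a
# second-order (i = 2) interaction between two causes. The kernels and
# cause waveforms below are invented for this illustration.

T, lags = 200, 30
t = np.arange(T)
s1 = np.sin(2 * np.pi * t / 50)           # cause 1 (e.g. a moving object)
s2 = (t > 100).astype(float)              # cause 2 (a contextual change)

k1 = np.exp(-np.arange(lags) / 5.0)       # first-order kernel
k2 = np.exp(-np.arange(lags) / 10.0)      # kernel carrying the interaction

# First-order terms: a linear convolution of each cause.
u_lin = np.convolve(s1, k1)[:T] + np.convolve(s2, k1)[:T]

# Second-order term: the input evoked by s1 depends on the recent history
# of s2, so the same cause is expressed differently in different contexts.
u_int = np.convolve(s1 * np.convolve(s2, k2)[:T], k1)[:T]

u = u_lin + 0.8 * u_int

# s1 is identical in both windows; only the context (s2) differs.
print(f"input excursions before context change: {u[50:100].std():.3f}")
print(f"input excursions after  context change: {u[150:200].std():.3f}")
```

The recognition problem discussed next is to recover s1 and s2 from u alone, discounting this context-dependence.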

The problem the brain has to contend with is to find a function of the input u(t) that recognises or represents the underlying causes. To do this, the brain must effectively undo the convolution and interactions to expose contextually invariant causes. In other words, the brain must perform some form of nonlinear unmixing of 'causes' and 'context' without knowing either. The key point here is that this nonlinear mixing may not be invertible and that the estimation of causes from input may be fundamentally ill-posed. For example, no amount of unmixing can discern the parts of an object that are occluded by another. The mapping u = s² provides a trivial example of this non-invertibility: knowing u does not uniquely determine s.

Nonlinearities are not the only source of non-invertibility. Because sensory inputs are convolutions of causes, there is a potential loss of information during the convolution or smoothing that may have been critical for a unique determination of the causes. The convolution implied by Eq. (2) means the brain has to de-convolve the inputs to obtain these causes. In estimation theory this problem is sometimes called 'blind de-convolution' because the estimation is blind to the underlying causes that are convolved to give the observed variables. To simplify the presentation of the ideas below we will assume that the vectors of causes s, and their estimates v, include a sufficient history to accommodate the dynamics implied by Eq. (1).
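A minimal numerical illustration of both sources of non-invertibility, with invented signals and a simple moving-average kernel standing in for the convolution in Eq. (2):

```python
import numpy as np

# 1. Nonlinearity: u = s**2 maps s and -s to the same input, so u cannot
#    uniquely determine s.
t = np.linspace(0, 1, 200)
s = np.sin(2 * np.pi * 3 * t)
assert np.allclose(s ** 2, (-s) ** 2)

# 2. Convolution: smoothing discards high-frequency detail, so two quite
#    different causes can evoke nearly indistinguishable inputs.
kernel = np.ones(20) / 20.0                       # invented 'sensory' kernel
s_a = np.sin(2 * np.pi * 3 * t)
s_b = s_a + 0.5 * np.sin(2 * np.pi * 40 * t)      # same slow cause + fast detail

u_a = np.convolve(s_a, kernel, mode="same")
u_b = np.convolve(s_b, kernel, mode="same")

print(f"max cause difference: {np.abs(s_a - s_b).max():.3f}")   # large
print(f"max input difference: {np.abs(u_a - u_b).max():.3f}")   # small
```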



Box 1. Dynamical systems and Volterra kernels.

Input-state–output systems and Volterra series

Neuronal systems are inherently nonlinear and lend themselves to modelling by nonlinear dynamical systems. However, due to the complexity of biological systems it is difficult to find analytic equations that describe them adequately. Even if these equations were known the state variables are often not observable. An alternative approach to identification is to adopt a very general model (Wray and Green, 1994) and focus on the inputs and outputs. Consider the single input–single output (SISO) system

$$\dot{x}(t) = f(x(t), u(t)), \qquad y(t) = g(x(t))$$

The Fliess fundamental formula (Fliess et al., 1983) describes the causal relationship between the outputs and the recent history of the inputs. This relationship can be expressed as a Volterra series, in which the output y(t) conforms to a nonlinear convolution of the inputs u(t), critically without reference to the state variables x(t). This series is simply a functional Taylor expansion of y(t):

$$y(t) = \sum_{i=1}^{\infty} \int_0^t \cdots \int_0^t \kappa_i(\sigma_1, \ldots, \sigma_i)\, u(t-\sigma_1)\cdots u(t-\sigma_i)\, d\sigma_1 \cdots d\sigma_i$$

$$\kappa_i(\sigma_1, \ldots, \sigma_i) = \frac{\partial^i y(t)}{\partial u(t-\sigma_1)\cdots\partial u(t-\sigma_i)}$$

where κ_i(σ_1, ..., σ_i) is the ith-order kernel. Volterra series have been described as a 'power series with memory' and are generally thought of as a high-order or 'nonlinear convolution' of the inputs to provide an output. See Bendat (1990) for a fuller discussion. This expansion is used in a number of places in the main text. When the inputs and outputs are measured neuronal activity the Volterra kernels have a special interpretation.

Volterra kernels and effective connectivity

Volterra kernels are useful for characterising the effective connectivity or influences that one neuronal system exerts over another because they represent the causal characteristics of the system in question. Neurobiologically they have a simple and compelling interpretation—they are synonymous with effective connectivity.

$$\kappa_1(\sigma_1) = \frac{\partial y(t)}{\partial u(t-\sigma_1)}, \qquad \kappa_2(\sigma_1, \sigma_2) = \frac{\partial^2 y(t)}{\partial u(t-\sigma_1)\,\partial u(t-\sigma_2)}, \qquad \ldots$$

It is evident that the first-order kernel embodies the response evoked by a change in input at t − σ_1. In other words it is a time-dependent measure of driving efficacy. Similarly the second-order kernel reflects the modulatory influence of the input at t − σ_1 on the response evoked at t − σ_2. And so on for higher orders.
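As a rough illustration of Box 1 (a forward evaluation with invented kernels, not an estimation scheme), the sketch below computes a truncated discrete-time Volterra series; κ1 plays the role of driving efficacy and κ2 of a modulatory interaction between inputs at two lags.

```python
import numpy as np

# Truncated discrete-time Volterra series with invented kernels.

T, L = 300, 40
rng = np.random.default_rng(1)
u = rng.standard_normal(T)                         # input time series

sig = np.arange(1, L + 1)
k1 = np.exp(-sig / 8.0)                            # kappa_1: driving efficacy
k2 = 0.05 * np.outer(np.exp(-sig / 12.0), np.exp(-sig / 12.0))  # kappa_2

def volterra(u, k1, k2):
    """y(t) = sum_s1 k1(s1) u(t-s1) + sum_s1,s2 k2(s1,s2) u(t-s1) u(t-s2)."""
    n, L = len(u), len(k1)
    y = np.zeros(n)
    for t in range(L, n):
        past = u[t - L:t][::-1]                    # u(t-1), ..., u(t-L)
        y[t] = k1 @ past + past @ k2 @ past
    return y

# The kappa_2 term makes the response to input at t - s1 depend on the
# input at t - s2: the modulatory component of effective connectivity.
y_full = volterra(u, k1, k2)
y_drive = volterra(u, k1, np.zeros_like(k2))
print(f"output variance, first-order only: {y_drive.var():.3f}")
print(f"output variance, with kappa_2    : {y_full.var():.3f}")
```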

All the schemas considered below can be construed as trying to effect a blind de-convolution of sensory inputs to estimate the causes with a recognition function

$$v = R(u, \phi, \theta) \tag{3}$$

Here v represents an estimate of the causes and could correspond to the activity of neuronal units (i.e. neurons or populations of neurons) in the brain. The parameters φ and θ determine the transformations that sensory input is subject to and can be regarded as specifying the connection strengths and architecture of a neuronal network model or effective connectivity (see Box 1). For reasons that will become apparent later, we make a distinction between parameters for forward connections φ and backward connections θ.

The problem of recognising causes reduces to finding the right parameters such that the activity of the representational units v has some clearly defined relationship to the causes s. More formally, one wants to find the parameters that maximise the mutual information or statistical dependence between the dynamics of the representations and their causes. Models of neuronal computation try to solve this problem in the hope that the ensuing parameters can be interpreted in relation to real neuronal infrastructures. The greater the biological validity of the constraints under which these solutions are obtained, the more plausible this relationship becomes. In what follows, we will consider three modelling approaches: (i) supervised models; (ii) models based on information theory; and (iii) those based on predictive coding. The focus will be on the sometimes hidden constraints imposed on the parameters and the ensuing implications for connectivity architectures and the representational properties of the units. In particular, we will ask whether backward connections, corresponding to the parameters θ, are necessary. And if so, what is their role? The three approaches are reprised at the end of this section by treating them as special cases of generative models. Each subsection below provides the background and heuristics for each approach and describes its implementation using the formalism above. Fig. 1 provides a graphical overview of the three schemes.


Fig. 1. Schematic illustrating the architectures implied by supervised, information theory-based approaches and predictive coding. The circles represent nodes in a network and the arrows represent a few of the connections. See the main text for an explanation of the equations and designation of the variables each set of nodes represents. The light grey boxes encompass connections and nodes within the model. Connection strengths are determined by the free parameters of the model φ (forward connections) and θ (backward connections). Nonlinear effects are implied when one arrow connects with another. Nonlinearities can be construed as the modulation of responsiveness to one input by another (see Box 1 for a more formal account). The broken arrow in the lower panel denotes connections that convey an error signal to the higher level from the input level.

3.2. Supervised models

Connectionism is an approach that has proved very useful in relating putative cognitive architectures to neuronal ones and, in particular, modelling the impact of brain lesions on cognitive performance. Connectionism is used here as a well-known example of supervised learning in cognitive neuroscience. We start by reviewing the role played by connectionist models in the characterisation of brain systems underlying cognitive functions.

3.2.1. Category specificity and connectionism

Semantic memory impairments can result from a variety of pathophysiological insults, including Alzheimer's disease, encephalitis and cerebrovascular accidents (e.g. Nebes, 1989; Warrington and Shallice, 1984). The concept of category specificity stems from the work of Warrington and colleagues (Warrington and McCarthy, 1983; Warrington and Shallice, 1984) and is based on the observation that patients with focal brain lesions have difficulties in recognising or naming specific categories of objects.


Patients can exhibit double dissociations in terms of their residual semantic capacity. For example, some patients can name artifacts but have difficulty with animals, whereas others can name animals with more competence than artifacts. These findings have engendered a large number of studies, all pointing to impairments in perceptual synthesis, phonological or lexico-semantic analysis that are specific for certain categories of stimuli. There are several theories that have been posited to account for category specificity. Connectionist models have been used to adjudicate among some of them.

Connectionist (e.g. parallel distributed processing or PDP) techniques use model neuronal architectures that can be lesioned to emulate neuropsychological deficits. This involves modelling semantic networks using connected units or nodes and suitable learning algorithms to determine a set of connection strengths (Rumelhart and McClelland, 1986). Semantic memory impairments are then simulated by lesioning the model to establish the nature of the interaction between neuropathology and cognitive deficit (e.g. Hinton and Shallice, 1991; Plaut and Shallice, 1993). A compelling example of this sort of approach is the connectionist model of Farah and McClelland (1991): patterns of category-specific deficits led Warrington and McCarthy (1987) to suggest that an animate/inanimate distinction could be understood in terms of a differential dependence on functional and structural (perceptual) features for recognition. For example, tools have associated motor acts whereas animals do not, or tools are easier to discriminate based upon their structural descriptions than four-legged animals. Farah and McClelland (1991) incorporated this difference in terms of the proportion of the two types of semantic featural representations encoding a particular object, with perceptual features dominating for animate objects and both represented equally for artifacts. Damage to visual features led to impairment for natural kinds and, conversely, damage to functional features impaired the output for artifacts. Critically, the model exhibited category-specific deficits in the absence of any category-specific organisation. The implication here is that an anatomical segregation of structural and functional representations is sufficient to produce category-specific deficits following focal brain damage. This example serves to illustrate how the connectionist paradigm can be used to relate neuronal and cognitive domains. In this example, connectionist models were able to posit a plausible anatomical infrastructure wherein the specificity of deficits, induced by lesions, is mediated by differential dependence on either the functional or structural attributes of an object and not by any (less plausible) category-specific anatomical organisation per se.
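The logic of this argument can be caricatured in a few lines of code. The sketch below is emphatically not the Farah and McClelland model: the feature statistics, the linear least-squares 'network' and the lesion procedure are all invented stand-ins. It does, however, reproduce the qualitative point that lesioning one feature pool differentially impairs one category, with no category-specific organisation anywhere in the model.

```python
import numpy as np

# Invented caricature of the Farah & McClelland (1991) argument.

rng = np.random.default_rng(0)
n_items, n_perc, n_func = 40, 30, 30
animate = np.arange(n_items) < 20       # first 20 items animate, rest artifacts

perc = rng.standard_normal((n_items, n_perc))
func = rng.standard_normal((n_items, n_func))
func[animate] *= 0.2                    # animate items rely mainly on perceptual features
X = np.hstack([perc, func])             # semantic feature patterns
Y = np.eye(n_items)                     # one 'name' unit per item

# Least squares stands in for learning the feature-to-name mapping.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def naming_accuracy(features, lesion=None):
    x = features.copy()
    if lesion is not None:
        x[:, lesion] = 0.0              # 'lesion': silence a feature pool
    correct = np.argmax(x @ W, axis=1) == np.arange(n_items)
    return correct[animate].mean(), correct[~animate].mean()

lesion_perc = np.arange(n_perc)         # destroy the perceptual pool
print("intact            (animate, artifact):", naming_accuracy(X))
print("perceptual lesion (animate, artifact):", naming_accuracy(X, lesion_perc))
```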

3.2.2. Implementation

In connectionist models causes or 'concepts' like "TABLE" are induced by patterns of activation over units encoding semantic primitives (e.g. structural—"has four legs" or functional—"can put things on it"). These primitives are simple localist representations "that are assumed to be encoded by larger pools of neurons in the brain" (Devlin et al., 1998). Irrespective of their theoretical bias, connectionist models assume the existence of fixed representations (i.e. units that represent a structural, phonological or lexico-semantic primitive) that are activated by some input. These representational attributions are immutable where each unit has its 'label'. The representation of a concept, object or 'cause' in the sensorium is defined in terms of which primitives are active.

Connectionist models employ some form of supervised learning where the model parameters (connection strengths or biases) change to minimise the difference between the observed and required output. This output is framed in terms of a distributed profile or pattern of activity over the (output) units v = R(u, φ), which arises from sensory input u corresponding to activity in (input) primitives associated with the stimulus being simulated. There are often hidden units interposed between the input and output units. The initial input (sometimes held constant or 'clamped' for a while) is determined by a generative function of the ith stimulus or cause u_i = G(s_i). Connectionist models try to find the free parameters φ that minimise some function or potential V of the error or difference between the output obtained and that desired

$$\phi = \min_\phi V(\varepsilon, \phi), \qquad \varepsilon_i = R(u_i, \phi) - s_i \tag{4}$$

The potential is usually the (expected) sum of squared differences. Although the connectionist paradigm has been very useful in relating cognitive science and neuropsychology, it has a few limitations in the context of understanding how the brain learns to represent things (a toy illustration of Eq. (4) is sketched after this list):

• First, one has to know the underlying cause s_i and the generative function, whereas the brain does not. This is the conventional criticism of supervised algorithms as a model of neuronal computation. Neural networks, of the sort used in connectionism, are well known to be flexible nonlinear function approximators. In this sense they can be used to approximate the inverse of any generative function u_i = G(s_i) to give model architectures that can be lesioned. However, representational learning in the brain has to proceed without any information about the processes generating inputs and the ensuing architectures cannot be ascribed to connectionist mechanisms.

• Secondly, the generative mapping u_i = G(s_i) precludes nonlinear interactions among stimuli or causes, dynamic or static. This is a fundamental issue because one of the main objectives of neuronal modelling is to see how representations emerge with the nonlinear mixing and contextual effects prevalent in real sensory input. Omitting interactions among the causes circumvents one of the most important questions that could have been asked; namely, how does the brain unmix sensory inputs to discount contextual effects and other aspects of nonlinear mixing?


In short, the same inputs are activated by a given cause, irrespective of the context. This compromises the plausibility of connectionist models when addressing the emergence of representations.
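For concreteness, here is the toy instance of Eq. (4) referred to above: gradient descent on a squared-error potential, with an invented generative function G and invented sizes and learning rate. Note that the scheme needs the true causes s_i which, as the first bullet point observes, the brain does not have.

```python
import numpy as np

# Toy supervised scheme of Eq. (4): minimise V(eps, phi) where
# eps_i = R(u_i, phi) - s_i. G and all sizes are invented for this sketch.

rng = np.random.default_rng(0)
n, d_s, d_u = 500, 3, 8

S = rng.standard_normal((n, d_s))             # 'true' causes s_i (known here!)
A = rng.standard_normal((d_s, d_u))
U = np.tanh(S @ A)                            # inputs u_i = G(s_i), no interactions

phi = np.zeros((d_u, d_s))                    # forward connection strengths
lr = 0.1
for epoch in range(200):
    eps = U @ phi - S                         # eps_i = R(u_i, phi) - s_i
    V = (eps ** 2).sum() / n                  # potential: mean squared error
    phi -= lr * (U.T @ eps) / n               # gradient step on V

# A linear R can only approximately invert the nonlinear G, but the
# potential falls far below its starting value.
print(f"final potential V = {V:.4f}")
```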

In summary, connectionist models specify distributed profiles of activity over (semantic) primitives that are induced by (conceptual) causes and try to find connectivity parameters that emulate the inverse of these mappings. They have been used to understand how the performance (storage and generalisation) of a network responds to simulated damage, after learning is complete. However, connectionism has a limited role in understanding representational learning per se. In the next subsection we will look at self-supervised approaches that do not require the causes for learning.

3.3. Information theoretic approaches

There have been many compelling developments in theoretical neurobiology that have used information theory (e.g. Barlow, 1961; Optican and Richmond, 1987; Linsker, 1990; Oja, 1989; Foldiak, 1990; Tovee et al., 1993; Tononi et al., 1994). Many appeal to the principle of maximum information transfer (e.g. Linsker, 1990; Atick and Redlich, 1990; Bell and Sejnowski, 1995). This principle has proven extremely powerful in predicting some of the basic receptive field properties of cells involved in early visual processing (e.g. Atick and Redlich, 1990; Olshausen and Field, 1996). This principle represents a formal statement of the common sense notion that neuronal dynamics in sensory systems should reflect, efficiently, what is going on in the environment (Barlow, 1961). In the present context, the principle of maximum information transfer (infomax; Linsker, 1990) suggests that a model's parameters should maximise the mutual information between the sensory input u and the evoked responses or outputs v = R(u, φ). This maximisation is usually considered in the light of some sensible constraints, e.g. the presence of noise in sensory input (Atick and Redlich, 1990) or dimension reduction (Oja, 1989), given the smaller number of divergent outputs from a neuronal population than convergent inputs (Friston et al., 1992).

Intuitively, mutual information is like the covariance or correlation between two variables but extended to cover multivariate observations. It is a measure of statistical dependence. In a similar way, entropy can be regarded as the uncertainty or variability of an observation (cf. variance of a univariate observation). The mutual information between inputs and outputs under φ is given by

$$I(u, v; \phi) = H(u) + H(v; \phi) - H(u, v; \phi) = H(v; \phi) - H(v|u) \tag{5}$$

where H(v|u) is the conditional entropy or uncertainty in the output, given the input. For a deterministic system there is no such uncertainty and this term can be discounted (see Bell and Sejnowski, 1995). More generally

$$\frac{\partial}{\partial \phi} I(u, v; \phi) = \frac{\partial}{\partial \phi} H(v; \phi) \tag{6}$$

It follows that maximising the mutual information is the same as maximising the entropy of the responses. The infomax principle (maximum information transfer) is closely related to the idea of efficient coding. Generally speaking, redundancy minimisation and efficient coding are all variations on the same theme and can be considered as the infomax principle operating under some appropriate constraints or bounds. Clearly it would be trivial to conform to the infomax principle by simply multiplying the inputs by a very large number. What we would like to do is to capture the information in the inputs using a small number of output channels operating in some bounded way. The key thing that distinguishes among the various information theoretic schemas is the nature of the constraints under which entropy is maximised. These constraints render infomax a viable approach to recovering the original causes of data, if one can enforce the outputs to conform to the same distribution as the causes (see Section 3.3.1). One useful way of looking at constraints is in terms of efficiency.

3.3.1. Efficiency, redundancy and information

The efficiency of a system can be considered as the complement of redundancy (Barlow, 1961): the less redundant, the more efficient a system will be. Redundancy is reflected in the dependencies or mutual information among the outputs (cf. Gawne and Richmond, 1993)

I(v; φ) = ∑ᵢ H(vᵢ; φ) − H(v; φ)    (7)

Here H(vᵢ; φ) is the entropy of the ith output. Eq. (7) implies that redundancy is the difference between the joint entropy and the sum of the entropies of the individual units (component entropies). Intuitively this expression makes sense if one considers that the variability in activity of any single unit corresponds to its entropy. Therefore, an efficient neuronal system represents its inputs with minimal excursions from baseline firing rates. Another way of thinking about Eq. (7) is to note that maximising efficiency is equivalent to minimising the mutual information among the outputs. This is the basis of approaches that seek to de-correlate or orthogonalise the outputs. To minimise redundancy one can either minimise the entropy of the output units or maximise their joint entropy, while ensuring the other is bounded in some way. Olshausen and Field (1996) present a very nice analysis based on sparse coding. Sparse coding minimises redundancy using single units with low entropy: units that fire very sparsely will, generally, not be firing, so one can be relatively certain about their (quiescent) state, conferring low entropy on them.

Approaches that seek to maximise the joint entropy of the units include principal component analysis (PCA) learning algorithms, which sample the subspace of the inputs that has the highest entropy (e.g. Foldiak, 1990), and independent component analysis (ICA). In PCA the component entropies are bounded by scaling the connection strengths of a simple recognition model v = R(u, φ) = φu so that the sum of the variances of vᵢ is constant. ICA finds nonlinear functions of the inputs that maximise the joint entropy (Comon, 1994; Bell and Sejnowski, 1995). The component entropies are constrained by passing the outputs through a sigmoid squashing function v = R(u, φ) = σ(φu) so that the outputs lie in a bounded interval (hypercube). See Section 3.6.1 for a different perspective on ICA in which the outputs are not bounded but forced to have cumulative density functions that conform to the squashing function.
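To make the squashed-output ICA variant concrete, the following sketch implements it with the natural-gradient form of the Bell and Sejnowski (1995) infomax update. The two-dimensional linear mixture, Laplace-distributed (heavy-tailed) causes and learning rate are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    n, T = 2, 20_000
    s = rng.laplace(size=(n, T))        # independent heavy-tailed causes
    A = rng.normal(size=(n, n))         # unknown mixing enacted by the world
    u = A @ s                           # observed sensory inputs

    W = np.eye(n)                       # forward (recognition) parameters
    lr = 1e-3
    for t in range(T):
        a = W @ u[:, t:t + 1]
        y = 1.0 / (1.0 + np.exp(-a))    # sigmoid squashing of the outputs
        # natural-gradient infomax rule: ascend the joint entropy of y
        W += lr * (np.eye(n) + (1.0 - 2.0 * y) @ a.T) @ W

    print(W @ A)    # approximately a scaled permutation if unmixing worked

Note that the scheme is purely feedforward: veridical recovery of the causes rests entirely on the mixing being linear, invertible and driven by independent sources.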

An important aspect of the infomax principle is that it goes a long way to explaining functional segregation in the cortex. One perspective on functional segregation is that each cortical area is segregating its inputs into relatively independent functional outputs. This is exactly what infomax predicts. See Friston (2000, and references therein) for an example of how infomax can be used to predict the segregation of processing streams from V2 to specialised motion, colour and form areas in extrastriate cortex.

3.3.2. Implementation

In terms of the above formulation, information theoretic approaches can be construed as finding the parameters of a forward recognition function that maximise the efficiency or minimise the redundancy

φ = min_φ I(v; φ),    v = R(u, φ)    (8)

But when are the outputs of an infomax model veridical estimates of the causes of its inputs? This is assured when: (i) the generating process is invertible; and (ii) the real world causes are independent, such that H(s) = ∑ᵢ H(sᵢ). This can be seen by noting

I(v; φ) = ∑ᵢ H(vᵢ; φ) − H(v; φ) = ∑ᵢ H(Rᵢ(G(s), φ)) − ∑ᵢ H(sᵢ) − ⟨ln |∂R(G(s), φ)/∂s|⟩ ≥ 0    (9)

with equality when v = R(u, φ) = G⁻¹(u) = s. Compared to the connectionist scheme, this has the fundamental advantage that the algorithm is unsupervised, by virtue of the fact that the causes and generating process are not needed by Eq. (8). Note that the architectures in Fig. 1, depicting connectionist and infomax schemes, are identical apart from the nodes representing desired output (unfilled circles in the upper panel). However, there are some outstanding problems:

• First, infomax recovers causes only when the generating process is invertible. However, as we have seen above, the nonlinear convolution of causes generating inputs may not be invertible. This means that the recognition enacted by forward connections may not be defined in relation to the generation of inputs.

• Second, we have to assume that the causes are independent. While this may be sensible for simple systems, it is certainly not appropriate for more realistic hierarchical processes that generate sensory inputs (see Section 3.5.1). This is because correlations among causes at any level are induced by, possibly independent, causal changes at supraordinate levels.

Finally, the dynamical nature of evoked neuronal transients is lost in many information theoretic formulations, which treat the inputs as a stationary stochastic process rather than as the products of a dynamical system. This is because the mutual information and entropy measures that govern learning pertain to probability distributions, and these densities do not embody information about the temporal evolution of states if they simply describe the probability that the system will be found in a particular state when sampled over time. Indeed, in many instances, the connection strengths are identifiable given just the densities of the inputs, without any reference to the fact that they were generated dynamically or constituted a time-series (cf. principal component learning algorithms that need only the covariances of the inputs). Discounting dynamics is not, however, fundamental to infomax schemas. For example, our own work using ICA referred to above (Friston, 2000) expanded inputs using temporal basis functions to model the functional segregation of motion, colour and form in V2. This segregation emerged as a consequence of maximising the information transfer between spatio-temporal patterns of visual inputs and V2 outputs.
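One simple way to expose such a static scheme to spatio-temporal structure is to augment each input with lagged copies of itself (or with projections onto temporal basis functions) before unmixing. The helper below is a hypothetical illustration of this kind of temporal expansion, not the specific basis set used in the work cited:

    import numpy as np

    def temporal_expand(u, lags=3):
        # stack lagged copies of a (channels x time) series so that each
        # column is a spatio-temporal pattern spanning `lags` time steps
        T = u.shape[1] - lags + 1
        return np.vstack([u[:, i:i + T] for i in range(lags)])

    # e.g. feed temporal_expand(u) rather than u to the infomax rule above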

In summary, ICA and like-minded approaches, which try to find some deterministic function of the inputs that maximises information transfer, impose some simplistic and strong constraints on the generating process that must be met before veridical representations emerge. In the final approach considered here, we discuss predictive coding models that do not require invertibility or independence and, consequently, suggest a more natural form for representational learning.

3.4. Predictive coding and the inverse problem

Over the past years predictive coding and generative models have supervened over other modelling approaches to brain function and represent one of the most promising avenues, offered by computational neuroscience, to understanding neuronal dynamics in relation to perceptual categorisation. In predictive coding the dynamics of units in a network are trying to predict the inputs. As with infomax schemas, the representational aspects of any unit emerge spontaneously as the capacity to predict improves with learning. There is no a priori 'labelling' of the units or any supervision in terms of what a correct response should be (cf. connectionist approaches). The only correct response is one in which the implicit internal model of the causes and their nonlinear mixing is sufficient to predict the input with minimal error.

Conceptually, predictive coding and generative models (see below) are related to 'analysis-by-synthesis' (Neisser, 1967). This approach to perception, from cognitive psychology, involves adapting an internal model of the world to match sensory input and was suggested by Mumford (1992) as a way of understanding hierarchical neuronal processing. The idea is reminiscent of MacKay's epistemological automata (MacKay, 1956), which perceive by comparing expected and actual sensory input (Rao, 1999). These models emphasise the role of backward connections in mediating the prediction, at lower or input levels, based on the activity of units in higher levels. The connection strengths of the model are changed so as to minimise the error between the predicted and observed inputs at any level. This is in direct contrast to connectionist approaches, where connection strengths change to minimise the error between the observed and desired output. In predictive coding there is no 'output', because the representational meaning of the units is not pre-specified but emerges during learning.

Predictive coding schemes can also be regarded as arising from the distinction between forward and inverse models adopted in machine vision (Ballard et al., 1983; Kawato et al., 1993). Forward models generate inputs from causes, whereas inverse models approximate the reverse transformation of inputs to causes. This distinction embraces the non-invertibility of generating processes and the ill-posed nature of inverse problems. As with all underdetermined inverse problems, the role of constraints becomes central. In the inverse literature, a priori constraints usually enter in terms of regularised solutions. For example: "Descriptions of physical properties of visible surfaces, such as their distance and the presence of edges, must be recovered from the primary image data. Computational vision aims to understand how such descriptions can be obtained from inherently ambiguous and noisy data. A recent development in this field sees early vision as a set of ill-posed problems, which can be solved by the use of regularisation methods" (Poggio et al., 1985). The architectures that emerge from these schemes suggest that "feedforward connections from the lower visual cortical area to the higher visual cortical area provides an approximated inverse model of the imaging process (optics), while the backprojection connection from the higher area to the lower area provides a forward model of the optics" (Kawato et al., 1993).

3.4.1. Implementation

Predictive, or more generally generative, models turn the inverse problem on its head. Instead of trying to find functions of the inputs that predict the causes, they find functions of causal estimates that predict the inputs. As in approaches based on information theory, the causes do not enter into the learning rules, which are therefore unsupervised. Furthermore, they do not require the convolution of causes engendering the inputs to be invertible. This is because the generative or forward model is instantiated explicitly. Here the forward model is the nonlinear mixing of causes that, by definition, must exist. The estimation of the causes still rests upon constraints, but these are now framed in terms of the forward model and have a much more direct relationship to causal processes in the real world. The ensuing mirror symmetry between the real generative process and its forward model is illustrated in the architecture in Fig. 1. Notice that the connections within the model are now going backwards. In the predictive coding scheme these backward connections, parameterised by θ, form predictions from some estimate of the causes v to provide a prediction error. The parameters now change to minimise some function of the prediction error (cf. Eq. (4))

θ = min_θ V(ε, θ),    ε = u − G(v, θ)    (10)

The differences between Eqs. (10) and (4) are that the errors are at the input level, as opposed to the output level, and the parameters now pertain to a forward model instantiated in backward connections. This minimisation scheme eschews the real causes s, but where do their estimates come from? These causal estimates or representations change in the same way as the other free parameters of the model: they change to minimise prediction error, subject to some a priori constraint modelled by a regularisation term λ(v, θ), usually through gradient ascent (for simplicity, time constants have been omitted from expressions describing the ascent of states or parameters on objective functions).

v̇ = −∂V(ε, θ)/∂v + ∂λ(v, θ)/∂v    (11)

The error is conveyed from the input layer to the output layer by forward connections that are rendered as a broken line in the lower panel of Fig. 1. This component of the predictive coding scheme has a principled (Bayesian) motivation that is described in the next subsection. For the moment, consider what would transpire after training, when prediction error is largely eliminated. This implies that the brain's nonlinear convolution of the estimated causes recapitulates the real convolution of the real causes. In short, there is a veridical (or at least sufficient) representation of both the causes and the dynamical structure of their mixing through the backward connections θ.
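Eqs. (10) and (11) can be simulated directly. The sketch below assumes a linear forward model G(v, θ) = θv, a quadratic error potential V = ½εᵀε and a sparse (Laplace-like) regulariser, so that ∂λ/∂v ∝ −sign(v); the dimensions, learning rates and iteration counts are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(2)
    theta_true = rng.normal(size=(4, 2))      # real-world mixing of causes
    theta = 0.1 * rng.normal(size=(4, 2))     # backward connections, theta
    lr_v, lr_th = 0.1, 0.01

    for _ in range(5_000):
        s = rng.laplace(size=(2, 1))          # hidden cause (never used below)
        u = theta_true @ s                    # sensory input
        v = np.zeros((2, 1))                  # estimate of the causes
        for _ in range(50):                   # fast dynamics, Eq. (11)
            eps = u - theta @ v               # prediction error
            v += lr_v * (theta.T @ eps - 0.1 * np.sign(v))
        theta += lr_th * eps @ v.T            # slow learning, Eq. (10)

Note that the causes s never enter the updates; after training, θ comes to emulate the real mixing (up to a scaling and permutation of the causes).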

The dynamics of representational units or populations implied by Eq. (11) represent the essential difference between this class of approaches and those considered above. Only in predictive coding are the dynamics changing to minimise the same objective function as the parameters. In both the connectionist and infomax schemes the representations of a given cause can only be changed vicariously, through the connection parameters. Predictive coding is a strategy that has some compelling (Bayesian) underpinnings (see below) and is not simply using a connectionist architecture in auto-associative mode or using error minimisation to maximise information transfer. It is a real-time, dynamical scheme that embeds two concurrent processes: (i) the parameters of the generative or forward model change to emulate the real-world mixing of causes, using the current estimates; and (ii) these estimates change to best explain the observed inputs, using the current forward model. Both the parameters and the states change in an identical fashion to minimise prediction error. The predictive coding scheme eschews the problems associated with earlier schemes. It can easily accommodate nonlinear mixing of causes in the real world. It does not require this mixing to be invertible and needs only the sensory inputs. However, there is an outstanding problem:

• To finesse the inverse problem posed by non-invertible generative models, regularisation constraints are required. These resolve the problem of non-invertibility that confounds simple infomax schemes, but introduce a new problem: one needs to know the prior distribution of the causes because, as shown next, the regularisation constraints are based on these priors.

In summary, predictive coding treats representational learning as an ill-posed inverse problem and uses an explicit parameterisation of a forward model to generate predictions of the observed input. The ensuing error is then used to refine the forward model. This component of representational learning is dealt with below (Section 3.6). The predictions are based on estimated causes that also minimise prediction error, under some constraints that resolve the generally ill-posed estimation problem. We now consider these constraints from a Bayesian point of view.

3.4.2. Predictive coding and Bayesian inference

One important aspect of predictive coding and generative models (see below) is that they portray the brain as an inferential machine (Dayan et al., 1995). From this perspective, functional architectures exist not to filter the input to obtain the causes, but to estimate causes and test the predictions against the observed input. A compelling aspect of predictive coding schemas is that they lend themselves to Bayesian treatment. This is important because it can be extended using empirical Bayes and hierarchical models. In what follows we shall first describe the Bayesian view of regularisation in terms of priors on the causes. We then consider hierarchical models in which priors can be derived empirically. The key implication, for neuronal implementations of predictive coding, is that empirical priors eschew assumptions about the independence of causes (cf. infomax schemes) or the form of constraints in regularised inverse solutions.

Suppose we knew the a priori distribution of the causes p(v), but wanted the best estimate given the input. This maximum a posteriori (MAP) estimate maximises the posterior p(v|u). The two probabilities are related through Bayes rule, which states that the probability of the cause and input occurring together is the probability of the cause given the input times the probability of the input. This, in turn, is the same as the probability of the input given the causes times the prior probability of the causes.

p(u, v) = p(v|u)p(u) = p(u|v)p(v)    (12)

The MAP estimator of the causes is the most likely given the data.

vₘ = max_v ln p(v|u) = max_v [ln p(u|v) + ln p(v)]    (13)

The first term on the right is known as the log likelihood or likelihood potential and the second is the prior potential. A gradient ascent to find vₘ would take the form

v̇ = ∂ℓ/∂v,    ℓ(u) = ln p(u|v; θ) + ln p(v; θ)    (14)

where the dependence of the likelihood and priors on the model parameters has been made explicit. The likelihood is defined by the forward model u = G(v, θ) + ε, where p(u|v; θ) ∝ exp(−V(ε, θ)); V now plays the role of a Gibbs potential that specifies one's distributional assumptions about the prediction error. Now we have

v̇ = −∂V(ε, θ)/∂v + ∂ ln p(v; θ)/∂v    (15)

This is formally identical to the predictive coding scheme of Eq. (11), in which the regularisation term λ(v, θ) = ln p(v; θ) becomes a log prior that renders the ensuing estimation Bayesian. In this formulation the state of the brain changes, not to minimise error per se, but to attain an estimate of the causes that maximises both the likelihood of the input given that estimate and the prior probability of the estimate being true. The implicit Bayesian estimation can be formalised from a number of different perspectives. Rao and Ballard (1998) give a very nice example using the Kalman filter that goes some way to dealing with the dynamical aspect of real sensory inputs.
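For Gaussian likelihood and prior, the ascent in Eq. (15) has a closed-form fixed point, which makes the role of the prior transparent. A minimal sketch, assuming a hypothetical one-input, two-cause linear model (so that the likelihood alone is underdetermined) with illustrative variances:

    import numpy as np

    theta = np.array([[1.0, 2.0]])    # forward model: u = theta @ v + noise
    sig2, tau2 = 0.1, 1.0             # assumed noise and prior variances
    u = np.array([[3.0]])             # a single observed input

    # MAP fixed point of Eq. (15): maximise ln p(u|v) + ln p(v)
    A = theta.T @ theta / sig2 + np.eye(2) / tau2
    v_map = np.linalg.solve(A, theta.T @ u / sig2)
    print(v_map)   # the prior selects one of infinitely many ML solutions

Without the prior term, the matrix A would be singular, reflecting the ill-posed nature of the inverse problem.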

3.5. Cortical hierarchies and empirical Bayes

The problem with Eq. (15) is that the brain cannot construct priors de novo. They have to be learned along with the forward model. In Bayesian estimation priors are estimated from data using empirical Bayes. Empirical Bayes harnesses the hierarchical structure of a forward model, treating the estimates of causes at one level as prior expectations for the subordinate level (Efron and Morris, 1973). This provides a natural framework within which to treat cortical hierarchies in the brain, each level providing constraints on the level below. Fig. 2 depicts a hierarchical architecture that is described in more detail below. This extension models the world as a hierarchy of (dynamical) systems where supraordinate causes induce, and moderate, changes in subordinate causes. For example, the presence of a particular object in the visual field changes the incident light falling on a particular part of the retina. A more abstract example, which illustrates the brain's inferential capacities, is presented in Fig. 3.

Fig. 2. Schematic depicting a hierarchical extension to the predictive coding architecture, using the same format as Fig. 1. Here hierarchical arrangements within the model serve to provide predictions or priors to representations in the level below. The open circles are the error units and the filled circles are the representations of causes in the environment. These representations change to minimise both the discrepancy between their predicted value and the mismatch incurred by their own prediction of the representations in the level below. These two constraints correspond to prior and likelihood potentials, respectively (see main text).

On reading the first sentence 'Jack and Jill went up the hill' we perceive the word 'event' as 'went'. In the absence of any hierarchical inference, the best explanation for the pattern of visual stimulation incurred by the text is 'event'. This would correspond to the maximum likelihood estimate of the word and would be the most appropriate in the absence of prior information about which is the most likely word. However, within hierarchical inference the semantic context provides top–down predictions to which the posterior estimate is accountable. When this prior biases in favour of 'went', we tolerate a small error at a lower level of visual analysis to minimise the overall prediction error at the visual and lexical levels. This illustrates the role of higher-level estimates in providing predictions or priors for subordinate levels. These priors offer contextual guidance towards the most likely cause of the input. Note that predictions at higher levels are subject to the same constraints; only the highest level, if there is one in the brain, is free to be directed solely by bottom–up influences (although there are always implicit priors). If the brain has evolved to recapitulate the causal structure of its environment, in terms of its sensory infrastructures, it is interesting to reflect on the possibility that our visual cortices reflect the hierarchical causal structure of our environment.

Fig. 3. Schematic illustrating the role of priors in biasing towards one representation of an input or another. Left: the word 'event' is selected as the most likely cause of the visual input. Right: the word 'went' is selected as the most likely word that is: (i) a reasonable explanation for the sensory input; and (ii) conforms to prior expectations induced by semantic context.

The hierarchical structure of the real world is literally reflected by the hierarchical architectures trying to minimise prediction error, not just at the level of sensory input but at all levels of the hierarchy (notice the deliberate mirror symmetry in Fig. 2). The nice thing about this architecture is that the dynamics of causal representations at the ith level, vᵢ, require only the error for the current level and the immediately preceding level. This follows from the Markov property of hierarchical systems, where one only needs to know the immediately supraordinate causes to determine the density of causes at any level in question, i.e. p(vᵢ|vᵢ₊₁, . . . , vₙ) = p(vᵢ|vᵢ₊₁). The fact that only error from the current and lower level is required to drive the dynamics of vᵢ is important, because it permits a biologically plausible implementation in which the connections driving the error minimisation have only to run forward from one level to the next (see Section 3.5.1 and Fig. 2).

3.5.1. Empirical Bayes in the brain

The biological plausibility of the scheme depicted in Fig. 2 can be established fairly simply. To do this, a hierarchical predictive scheme is described in some detail. A more thorough account of this scheme, including simulations of various neurobiological and psychophysical phenomena, will appear in future publications. For the moment, we will review neuronal implementation at a purely theoretical level, using the framework developed above.

Consider any level i in a cortical hierarchy containing units (neurons or neuronal populations) whose activity vᵢ is predicted by corresponding units in the level above, vᵢ₊₁. The hierarchical form of the implicit generative model is

u = G₁(v₂, θ₁) + ε₁
v₂ = G₂(v₃, θ₂) + ε₂
v₃ = · · ·    (16)

with v₁ = u. Technically, these models fall into the class of conditionally independent hierarchical models when the error terms are independent at each level (Kass and Steffey, 1989). These models are also called parametric empirical Bayes (PEB) models, because the obvious interpretation of the higher-level densities as priors led to the development of PEB methodology (Efron and Morris, 1973). We require units in all levels to jointly maximise the posterior probabilities of vᵢ₊₁ given vᵢ. We will assume the errors are Gaussian with covariance Σᵢ = Σ(λᵢ). Therefore, θᵢ and λᵢ parameterise the means and covariances of the likelihood at each level.

p(vᵢ|vᵢ₊₁) = N(vᵢ: Gᵢ(vᵢ₊₁, θᵢ), Σᵢ) ∝ |Σᵢ|^(−1/2) exp(−½ εᵢᵀ Σᵢ⁻¹ εᵢ)    (17)

This is also the prior density for the level below. Although θᵢ and λᵢ are both parameters of the forward model, the λᵢ are sometimes referred to as hyperparameters and, in classical statistics, correspond to variance components. We will preserve the distinction between parameters and hyperparameters, because minimising the prediction error with respect to the estimated causes and parameters is sufficient to maximise the likelihood of neuronal states at all levels. This is the essence of predictive coding. For the hyperparameters there is an additional term that depends on the hyperparameters themselves (see below).

In this hierarchical setting, the objective function comprises a series of log likelihoods

ℓ(u) = ln p(v₁|v₂) + ln p(v₂|v₃) + · · ·
     = −½ξ₁ᵀξ₁ − ½ξ₂ᵀξ₂ − · · · − ½ln|Σ₁| − ½ln|Σ₂| − · · ·
ξᵢ = vᵢ − Gᵢ(vᵢ₊₁, θᵢ) − λᵢξᵢ = (1 + λᵢ)⁻¹εᵢ    (18)

where Σᵢ(λᵢ)^(1/2) = 1 + λᵢ. The likelihood at each level corresponds to p(vᵢ|vᵢ₊₁), which also plays the role of a prior on vᵢ that is jointly maximised with the likelihood of the level below, p(vᵢ₋₁|vᵢ). In a neuronal setting, the (whitened) prediction error is encoded by the activities of units denoted by ξᵢ. These error units receive a prediction from units in the level above (the backward connections mediating this prediction are clearly not inhibitory themselves but, after mediation by inhibitory interneurons, their effective influence could be rendered inhibitory) and connections from the principal units vᵢ being predicted. Horizontal interactions among the error units serve to de-correlate them (cf. Foldiak, 1990), where the symmetric lateral connection strengths λᵢ hyper-parameterise the covariances of the errors Σᵢ, which are the prior covariances for level i − 1.

The estimators vᵢ₊₁ and the connection strength parameters perform a gradient ascent on the compound log probability.

v̇ᵢ₊₁ = ∂ℓ/∂vᵢ₊₁ = −(∂ξᵢᵀ/∂vᵢ₊₁)ξᵢ − (∂ξᵢ₊₁ᵀ/∂vᵢ₊₁)ξᵢ₊₁
θ̇ᵢ = ∂ℓ/∂θᵢ = −(∂ξᵢᵀ/∂θᵢ)ξᵢ
λ̇ᵢ = ∂ℓ/∂λᵢ = −(∂ξᵢᵀ/∂λᵢ)ξᵢ − (1 + λᵢ)⁻¹    (19)

When Gᵢ(vᵢ₊₁, θᵢ) models dynamical processes (i.e. is effectively a convolution operator), this gradient ascent is more complicated. In a subsequent paper we will show that, with dynamical models, it is necessary to maximise both ℓ and its temporal derivatives (e.g. ℓ̇). An alternative is to assume a simple hidden Markov model for the dynamics and use Kalman filtering (cf. Rao and Ballard, 1998). For the moment, we will assume the inputs change sufficiently slowly for the gradient ascent not to be confounded.

Despite the complicated nature of the hierarchical model and the abstract theorising, three simple and biologically plausible things emerge:

• Reciprocal connections. The dynamics of representational units vᵢ₊₁ are subject to two, locally available, influences: a likelihood term mediated by forward afferents from the error units in the level below, and an empirical prior term conveyed by error units in the same level. This follows from the conditional independence conferred by the hierarchical structure of the model. Critically, the influences of the error units in both levels are mediated by linear connections with a strength that is exactly the same as the (negative) effective connectivity of the reciprocal connection from vᵢ₊₁ to ξᵢ and ξᵢ₊₁ (see Box 1 for a definition of effective connectivity). In short, the lateral, forward and backward connections are all reciprocal, consistent with anatomical observations. Lateral connections within each level de-correlate the error units, allowing competition between prior expectations with different precisions (precision is the inverse of variance).

• Functionally asymmetric forward and backward connections. The forward connections are the reciprocal (negative transpose) of the backward effective connectivity ∂ξᵢ/∂vᵢ₊₁ from the higher level to the lower level, extant at that time. However, the functional attributes of the forward and backward influences are different. The influences of units on error units in the lower level mediate the forward model ξᵢ = −Gᵢ(vᵢ₊₁, θᵢ) + · · ·. These can be nonlinear, where each unit in the higher level may modulate or interact with the influence of others (according to the nonlinearities in G). In contradistinction, the influences of units in lower levels do not interact when producing changes in the higher level, because their effects are linearly separable: v̇ᵢ₊₁ = −(∂ξᵢᵀ/∂vᵢ₊₁)ξᵢ − · · ·. This is a key observation because the empirical evidence, reviewed in the previous section, suggests that backward connections are in a position to interact (e.g. through NMDA receptors expressed predominantly in the supragranular layers receiving backward connections), whereas forward connections are not. It should be noted that, although the implied forward connections ∂ξᵢ/∂vᵢ₊₁ mediate linearly separable effects of ξᵢ on vᵢ₊₁, these connections might be activity- and time-dependent because of their dependence on vᵢ₊₁.

• Associative plasticity. Changes in the parameters correspond to plasticity, in the sense that the parameters control the strength of backward and lateral connections. The backward connections parameterise the prior expectations of the forward model and the lateral connections hyper-parameterise the prior covariances. Together they parameterise the Gaussian densities that constitute the priors (and likelihoods) of the model. The motivation for these parameters maximising the same objective function ℓ as the neuronal states is discussed in the next subsection. For the moment, we are concerned with the biological plausibility of these changes. The plasticity implied is seen more clearly with an explicit parameterisation of the connections. For example, let Gᵢ(vᵢ₊₁, θᵢ) = θᵢvᵢ₊₁. In this instance

θ̇ᵢ = (1 + λᵢ)⁻¹ ξᵢvᵢ₊₁ᵀ
λ̇ᵢ = (1 + λᵢ)⁻¹ (ξᵢξᵢᵀ − 1)    (20)

This is just Hebbian or associative plasticity, where the connection strengths change in proportion to the product of pre- and post-synaptic activity. An intuition about Eq. (20) obtains by considering the conditions under which the expected change in parameters is zero (i.e. after learning). For the backward connections this implies there is no component of prediction error that can be explained by causal estimates at the higher level, ⟨ξᵢvᵢ₊₁ᵀ⟩ = 0. The lateral connections stop changing when the prediction error has been whitened, ⟨ξᵢξᵢᵀ⟩ = 1.

Non-diagonal forms for λᵢ complicate the biological interpretation, because changes at any one connection depend on changes elsewhere. The problem can be finessed slightly by rewriting the equations as

θ̇ᵢ = ξᵢvᵢ₊₁ᵀ − λᵢθ̇ᵢ
λ̇ᵢ = ξᵢξᵢᵀ − λᵢλ̇ᵢ − 1    (21)

where the decay terms are mediated by integration at the cell body, in a fashion similar to that described in Friston et al. (1993). (A minimal simulation of these state and weight updates is sketched below.)
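To make the hierarchical scheme concrete, here is a minimal two-level simulation in the spirit of Eqs. (19) and (20), assuming linear generative functions Gᵢ(vᵢ₊₁, θᵢ) = θᵢvᵢ₊₁ and fixed unit error covariances (λᵢ = 0, so ξᵢ = εᵢ); all sizes, rates and the Laplace-distributed top-level cause are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(3)
    # real-world hierarchy: u = th1 @ v2 + e1, with v2 = th2 @ v3 + e2
    th1_true = rng.normal(size=(4, 2))
    th2_true = rng.normal(size=(2, 1))
    th1 = 0.1 * rng.normal(size=(4, 2))     # backward connections, level 1
    th2 = 0.1 * rng.normal(size=(2, 1))     # backward connections, level 2
    lr_v, lr_th = 0.1, 0.005

    for _ in range(10_000):
        v3 = rng.laplace(size=(1, 1))                 # supraordinate cause
        v2w = th2_true @ v3 + 0.05 * rng.normal(size=(2, 1))
        u = th1_true @ v2w + 0.05 * rng.normal(size=(4, 1))

        v2, v3e = np.zeros((2, 1)), np.zeros((1, 1))  # representational units
        for _ in range(50):                           # fast dynamics, Eq. (19)
            xi1 = u - th1 @ v2                        # error units, level 1
            xi2 = v2 - th2 @ v3e                      # error units, level 2
            v2 += lr_v * (th1.T @ xi1 - xi2)          # likelihood + prior terms
            v3e += lr_v * (th2.T @ xi2)
        th1 += lr_th * xi1 @ v2.T                     # Hebbian plasticity, Eq. (20)
        th2 += lr_th * xi2 @ v3e.T

Each representation is driven only by error units in its own level and the level below, as required for the biologically plausible wiring discussed above.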

The overall scheme implied by Eq. (19) sits comfortably with the hypothesis (Mumford, 1992) concerning "the role of the reciprocal, topographic pathways between two cortical areas, one often a 'higher' area dealing with more abstract information about the world, the other 'lower', dealing with more concrete data. The higher area attempts to fit its abstractions to the data it receives from lower areas by sending back to them from its deep pyramidal cells a template reconstruction best fitting the lower level view. The lower area attempts to reconcile the reconstruction of its view that it receives from higher areas with what it knows, sending back from its superficial pyramidal cells the features in its data which are not predicted by the higher area. The whole calculation is done with all areas working simultaneously, but with order imposed by synchronous activity in the various top–down, bottom–up loops".

In summary, the predictive coding approach lends itself naturally to a hierarchical treatment, which considers the brain as an empirical Bayesian device. The dynamics of the units or populations are driven to minimise error at all levels of the cortical hierarchy and implicitly render themselves posterior estimates of the causes given the data. In contradistinction to connectionist schemas, hierarchical prediction does not require any desired output; indeed, predictions of intermediate outputs at each hierarchical level emerge spontaneously. Unlike information theoretic approaches, hierarchical prediction does not assume independent causes or invertible generative processes. In contrast to regularised inverse solutions (e.g. in machine vision), it does not depend on a priori constraints; these emerge spontaneously as empirical priors from higher levels. The Bayesian considerations above pertain largely to the estimates of the causes. In the final subsection we consider the estimation of model parameters, using the framework provided by density learning with generative models.

3.6. Generative models and representational learning

In this section we bring together the various schemas considered above, using the framework provided by density estimation as a way of fitting generative models. This section follows Dayan and Abbot (2001), to which the reader is referred for a fuller discussion. Generative models represent a generic formulation of representational learning in a self-supervised context. There are many forms of generative models, ranging from conventional statistical models (e.g. factor and cluster analysis) to those motivated by Bayesian inference and learning (e.g. Dayan et al., 1995; Hinton et al., 1995). Indeed, many of the algorithms discussed under the heading of information theory can be formulated as generative models. The goal of generative models is "to learn representations that are economical to describe but allow the input to be reconstructed accurately" (Hinton et al., 1995). In current treatments, representational learning is framed in terms of estimating probability densities of the inputs and outputs. Although density learning is formulated at a level of abstraction that eschews many issues of neuronal implementation (e.g. the dynamics of real-time learning), it does provide a unifying framework that connects the various schemes considered so far.

The goal of generative models is to make the density of the inputs implied by the generative model, p(u; θ), as close as possible to that observed, p(u). The generative model is specified in terms of the prior distribution over the causes p(v; θ) and the conditional generative distribution of the inputs given the causes p(u|v; θ), which together define the marginal distribution that has to be matched to the input distribution

p(u; θ) = ∫ p(u|v; θ) p(v; θ) dv    (22)

Once the parameters of the generative model have been estimated through this matching, the posterior density of the causes, given the inputs, is given by the recognition model, defined in terms of the recognition distribution

p(v|u; θ) = p(u|v; θ) p(v; θ)/p(u; θ)    (23)

However, as considered in depth above, the generative model may not be invertible, in which case it may not be possible to compute the recognition distribution from Eq. (23). In this instance, an approximate recognition distribution q(v; u, φ) can be used, which we try to match to the true one. This distribution has some parameters φ that need to be learned, for example the strength of forward connections. The question addressed in this review is whether forward connections are sufficient for representational learning. For a moment, consider deterministic models that discount probabilistic or stochastic aspects. We have been asking whether we can find the parameters of a deterministic recognition model that render it the inverse of a generating process

R(u, φ) = G⁻¹(u, θ)    (24)

The problem is that G(v, θ) is a nonlinear convolution and is generally not invertible. The generative model approach posits that it is sufficient to find the parameters of an (approximate) recognition model φ and of the generative model θ that predict the inputs

G(R(u, φ), θ) = u    (25)

under the constraint that the recognition model is (approximately) the inverse of the generative model. Eq. (25) is the same as Eq. (24) after applying G to both sides. The implication is that one needs an explicit parameterisation of the (approximate) recognition (inverse) model and of the generative (forward) model, which induces the need for both forward and backward influences. Separate recognition and generative models resolve the problem caused by the non-invertibility of generating processes. The corresponding motivation, in probabilistic learning, rests on finessing the combinatorial explosion of ways in which stochastic generative models can generate input patterns (Dayan et al., 1995). The combinatorial explosion represents another perspective on the non-invertible 'many to one' relationship between causes and inputs.

In the general density learning framework, representational learning has two components that can be seen in terms of expectation maximisation (EM; Dempster et al., 1977). In the E-step the approximate recognition distribution is modified to match the density implied by the generative model parameters, so that q(v; u, φ) ≈ p(v|u; θ), and in the M-step these parameters are changed to render p(u; θ) ≈ p(u). In other words, the E-step ensures the recognition model approximates the generative model, and the M-step ensures that the generative model can predict the observed inputs. If the model is invertible, the E-step reduces to setting q(v; u, φ) = p(v|u; θ) using Eq. (23). Probabilistic recognition proceeds by using q(v; u, φ) to determine the probability that v caused the observed sensory inputs. This recognition becomes deterministic when q(v; u, φ) is a Dirac δ-function over the MAP estimator of the causes vₘ. The distinction between probabilistic and deterministic recognition is important, because we have restricted ourselves to deterministic models thus far, but these are special cases of density estimation in generative modelling.

3.6.1. Density estimation and EM

EM provides a useful procedure for density estimation that helps relate many different models within a framework that has direct connections with statistical mechanics. Both steps of the EM algorithm involve maximising a function of the densities that corresponds to the negative free energy in physics

F(φ, θ) = ⟨∫ q(v; u, φ) ln [p(v, u; θ)/q(v; u, φ)] dv⟩_u
        = ⟨ln p(u; θ)⟩_u − ⟨KL(q(v; u, φ), p(v|u; θ))⟩_u    (26)

This objective function comprises two terms. The first is the expected log likelihood of the inputs under the generative model, averaged over the observed inputs. Maximising this term implicitly minimises the Kullback–Leibler (KL) divergence (a measure of the discrepancy between two densities) between the actual input density and that implied by the generative model. This is equivalent to maximising the log likelihood of the inputs. The second term is the KL divergence between the approximating and true recognition densities.

In short, maximising F encompasses two components of representational learning: (i) it increases the likelihood that the generative model could have produced the inputs; and (ii) it minimises the discrepancy between the approximate recognition model and that implied by the generative model. The E-step increases F with respect to the recognition parameters φ through minimising the KL term, ensuring a veridical approximation to the recognition distribution implied by θ. The M-step increases F by changing θ, enabling the generative model to reproduce the inputs.

E: φ = max_φ F(φ, θ)
M: θ = max_θ F(φ, θ)    (27)

This formulation of representational learning is critical for the thesis of this review, because it shows that backward connections, parameterising a generative model, are essential when the model is not invertible. If the generative model is invertible then the KL term can be discounted and learning reduces to the M-step (i.e. maximising the likelihood). In principle, this could be done using a feedforward architecture corresponding to the inverse of the generative model. However, when the processes generating inputs are non-invertible (due to nonlinear interactions among, and temporal convolutions of, the causes), a parameterisation of the generative model (backward connections) and approximate recognition model (forward connections) is required that can be updated in M- and E-steps, respectively. In short, non-invertibility enforces an explicit parameterisation of the generative model in representational learning. In the brain this parameterisation may be embodied in backward and lateral connections.

The EM scheme enables exact and approximate maximum likelihood density estimation for a whole variety of generative models that can be specified in terms of priors and generative distributions. Dayan and Abbot (2001) work through a series of didactic examples, from cluster analysis to independent component analysis, within this unifying framework. For example, factor analysis corresponds to the generative model

p(v; θ) = N(v: 0, 1)
p(u|v; θ) = N(u: θv, Σ)    (28)

Namely, the underlying causes of inputs are independent normal variates that are mixed linearly and added to Gaussian noise to form inputs. In the limiting case of Σ → 0, the generative and recognition models become deterministic and the ensuing model conforms to PCA. By simply assuming non-Gaussian priors, one can specify generative models for sparse coding of the sort proposed by Olshausen and Field (1996)

p(v; θ) = ∏ᵢ p(vᵢ; θ)
p(u|v; θ) = N(u: θv, Σ)    (29)

where the p(vᵢ; θ) are chosen to be suitably sparse (i.e. heavy-tailed), with a cumulative density function that corresponds to the squashing function in Section 3.3.1. The deterministic equivalent of sparse coding is ICA, which obtains when Σ → 0. The relationships among different models are rendered apparent under the perspective of generative models. It is useful to revisit the schemes above to examine their implicit generative and recognition models.
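As an illustration of the two steps in Eq. (27), the factor analysis model of Eq. (28) admits closed-form updates. The following is a standard EM recipe for factor analysis, sketched under assumed dimensions, sample size and noise level (none taken from the text):

    import numpy as np

    rng = np.random.default_rng(4)
    d, k, T = 5, 2, 5_000
    th_true = rng.normal(size=(d, k))
    U = th_true @ rng.normal(size=(k, T)) + 0.3 * rng.normal(size=(d, T))

    th = rng.normal(size=(d, k))       # generative parameters, theta
    Psi = np.eye(d)                    # diagonal noise covariance
    for _ in range(100):
        # E-step: exact recognition density q(v|u) = N(beta @ u, Sv), Eq. (23)
        beta = th.T @ np.linalg.inv(th @ th.T + Psi)
        Sv = np.eye(k) - beta @ th
        Ev = beta @ U                  # posterior means for every input
        Evv = T * Sv + Ev @ Ev.T       # summed posterior second moments
        # M-step: re-estimate the generative model to predict the inputs
        th = (U @ Ev.T) @ np.linalg.inv(Evv)
        Psi = np.diag(np.diag(U @ U.T - th @ Ev @ U.T)) / T

Because this linear-Gaussian model admits an exact recognition distribution, the E-step is exact; it is precisely when no such closed form exists that an explicitly parameterised approximate recognition model becomes necessary.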

3.6.2. Supervised representational learning

In supervised schemes the generative model is already known and only the recognition model needs to be estimated. The generative model is known in the sense that the desired output determines the input, either deterministically or stochastically (e.g. the input primitives are completely specified by their cause, which is the desired output). In this case only the E-step is required, in which the parameters φ that specify q(v; u, φ) change to maximise F. The only term in Eq. (26) that depends on φ is the divergence term, such that learning reduces to minimising the expected difference between the approximate recognition density and that required by the generative model. This can proceed probabilistically (e.g. contrastive Hebbian learning in stochastic networks; Dayan and Abbot, 2001, p. 322) or deterministically. In the deterministic mode, q(v; u, φ) corresponds to a δ-function over the point estimator vₘ = R(u, φ). The connection strengths φ are changed, typically using the delta rule, such that the distance between the modes of the approximate and desired recognition distributions is minimised over all inputs. This is equivalent to nonlinear function approximation, a perspective that can be adopted on all supervised learning of deterministic mappings with neural nets.

Note, again, that any scheme based on supervised learning requires the processes generating inputs to be known a priori and, as such, cannot be used by the brain.

3.6.3. Information theory

In the section on information theory we considered whether infomax principles were sufficient to specify deterministic recognition architectures in the absence of backward connections. They were introduced in terms of finding some function of the inputs that produces an output density with maximum entropy. Maximisation of F attains the same thing through minimising the discrepancy between the observed input distribution p(u) and that implied by a generative model with maximum entropy priors. Although the infomax and density learning approaches have the same objective, their heuristics are complementary. Infomax is motivated by maximising the mutual information between u and v under some constraints. The generative model approach takes its heuristics from the assumption that the causes of inputs are independent and possibly non-Gaussian. This results in a prior with maximum entropy, p(v; θ) = ∏ᵢ p(vᵢ; θ). The reason for adopting non-Gaussian priors (e.g. sparse coding and ICA) is that the central limit theorem implies mixtures of causes will have Gaussian distributions, and therefore something that is not Gaussian is unlikely to be a mixture.


For invertible deterministic models, v = R(u, φ) = G⁻¹(u, θ), the KL component of F disappears, leaving only the likelihood term

F = ⟨ln p(u; θ)⟩_u = ⟨ln p(v; θ)⟩_u + ⟨ln p(u|v; θ)⟩_u
  = ⟨ln ∏ᵢ p(vᵢ; θ)⟩_u + ⟨ln |∂R(u, φ)/∂u|⟩_u
  = −∑ᵢ H(vᵢ; θ) + H(v; φ) − H(u)    (30)

This has exactly the same dependence on the parameters as the objective function employed by infomax in Eq. (7). In this context, the free energy and the information differ only by the entropy of the inputs, −F = I + H(u). This equivalence rests on using maximum entropy priors of the sort assumed for sparse coding.

Notice again that, in the context of invertible deterministic generative models, the parameters of the recognition model specify the generative model, and only the recognition model (i.e. forward connections mediating v = R(u, φ)) needs to be instantiated. If the generative model cannot be inverted, the recognition model is not defined and the scheme above is precluded. In this instance one has to parameterise both an approximate recognition model and a generative model, as required by EM. This enables the use of nonlinear generative models, such as nonlinear PCA (e.g. Kramer, 1991; Karhunen and Joutsensalo, 1994; Dong and McAvoy, 1996; Taleb and Jutten, 1997). These schemes typically employ a 'bottleneck' architecture that forces the inputs through a small number of nodes. The output from these nodes then diverges to produce the predicted inputs. The approximate recognition model is implemented, deterministically, in connections to the bottleneck nodes, and the generative model in connections from these nodes to the outputs. Nonlinear transformations from the bottleneck nodes to the output layer recapitulate the nonlinear mixing of the real causes of the inputs. After learning, the activity of the bottleneck nodes can be treated as estimates of the causes. These representations obtain by projection of the input onto a low-dimensional curvilinear manifold (encompassing the activity of the bottleneck nodes) by an approximate recognition model.
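The bottleneck idea can be sketched as a small autoencoder, in which the encoding (recognition) and decoding (generative) weights are trained together by minimising prediction error. All architectural choices below (tanh bottleneck, linear decoder, sizes, rates) are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(5)
    d, k, T = 6, 2, 2_000
    V = rng.normal(size=(k, T))                  # hidden causes
    U = np.tanh(rng.normal(size=(d, k)) @ V)     # nonlinearly generated inputs

    Wf = 0.1 * rng.normal(size=(k, d))           # forward (recognition) weights
    Wb = 0.1 * rng.normal(size=(d, k))           # backward (generative) weights
    lr = 0.01
    for _ in range(2_000):
        v = np.tanh(Wf @ U)                      # bottleneck activities
        err = U - Wb @ v                         # prediction error
        # gradient descent on the squared reconstruction error
        dv = (Wb.T @ err) * (1 - v ** 2)         # backpropagated error (tanh')
        Wb += lr * err @ v.T / T
        Wf += lr * dv @ U.T / T

After training, the bottleneck activities v play the role of causal estimates, and the map from bottleneck to outputs recapitulates the nonlinear mixing.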

3.6.4. Predictive coding

In the foregoing, density learning is based on the expectations of probability distributions over the inputs. Clearly the brain does not have direct access to these expectations but sees only one input at any instant. In this instance, representational learning has to proceed on-line, by sampling inputs over time.

For deterministic recognition models, q(v; u, φ) is parameterised by its input-specific mode v(u), where q(v(u); u) = 1, and

ℓ(u) = ∫ q(v; u, φ) ln [p(v, u; θ)/q(v; u, φ)] dv = ln p(v(u), u; θ)
     = ln p(u|v(u); θ) + ln p(v(u); θ)
F = ⟨ℓ(u)⟩_u    (31)

ℓ(u) is simply the log of the joint probability, under the generative model, of the observed inputs and their cause implied by approximate recognition. This log probability can be decomposed into a log likelihood and a log prior, and is exactly the same objective function used to find the MAP estimator in predictive coding (cf. Eq. (14)).

On-line representational learning can be thought of as comprising two components, corresponding to the E- and M-steps. The expectation (E) component updates the recognition density, whose mode is encoded by the neuronal activity v, by maximising ℓ(u). Maximising ℓ(u) is sufficient to maximise its expectation F over inputs, because it is maximised for each input separately. The maximisation (M) component corresponds to an ascent of the parameters, encoded by the connection strengths, on the same log probability

E: φ̇ = v̇ = ∂ℓ/∂v
M: θ̇ = ∂ℓ/∂θ    (32)

such that the expected change approximates an ascent on F: ⟨θ̇⟩ ≈ ⟨∂ℓ/∂θ⟩_u = ∂F/∂θ (this approximation can be finessed by using traces, to approximate the expectation explicitly, and changing the connections in proportion to the trace). Eq. (32) is formally identical to Eq. (19), the hierarchical prediction scheme, where the hyperparameters have been absorbed into the parameters. In short, predictive coding can be regarded as an on-line or dynamic form of density estimation, using a deterministic recognition model and a stochastic generative model. Conjoint changes in neuronal states and connection strengths map to the expectation maximisation of the approximate recognition and generative models, respectively. Note that there is no explicit parameterisation of the recognition model; the recognition density is simply represented by its mode for the input u at a particular time. This affords a very unconstrained recognition model that can, in principle, approximate the inverse of highly nonlinear generative models.

3.7. Summary

In summary, the formulation of representational learning in terms of generative models embodies a number of key distinctions: (i) invertible versus non-invertible models; (ii) deterministic versus probabilistic representations; and (iii) dynamic versus density learning. Non-invertible generative models require an explicit parameterisation of the generative model and suggest an important role for backward connections in the brain. Invertible models can, in principle, be implemented using only forward connections, because the recognition model completely specifies the generative model and vice versa. However, nonlinear and dynamic aspects of the sensorium render invertibility highly unlikely.



This section has focused on the conditions under which forward connections are sufficient to parameterise a generative model. In short, these conditions rest on invertibility, and they speak to the need for backward connections in the context of nonlinear and non-invertible generative models.

Most of the examples in this section have focused on deterministic recognition models, where neuronal dynamics encode the most likely causes of the current sensory input. This is largely because we have been concerned with how the brain represents things. The distinction between deterministic and probabilistic representation addresses a deeper question about whether neuronal dynamics represent the state of the world or the probability densities of those states. From the point of view of hierarchical models, the state of the neuronal units encodes the mode of the posterior density at any given level. This can be considered a point recognition density. However, the states of units at any level also induce a prior density in the level below, because the prior mode is specified by dynamic top–down influences and the prior covariance by the strength of lateral connections. These covariances render the generative model a probabilistic one.

By encoding densities in terms of their modes, using neuronal activity, the posterior and prior densities can change quickly with sensory inputs. However, this does entail unimodal densities. From the point of view of a statistician this may be an impoverished representation of the world that compromises any proper inference, especially when the posterior distribution is multimodal. However, it is exactly this approximate nature of recognition that pre-occupies psychophysicists and psychologists: the emergence of unitary, deterministic perceptual representations in the brain is commonplace and is of special interest when the causes are ambiguous (e.g. illusions and perceptual transitions induced by binocular rivalry and ambiguous figures).

The brain is a dynamical system that samples inputs dynamically over time. It does not have instantaneous access to the statistics of its inputs that are required for distinct E- and M-steps. Representational learning therefore has to proceed under this constraint. In this review, hierarchical predictive coding has been portrayed as a variant of density learning that conforms to these constraints.

We have seen that supervised, infomax and generative models require prior assumptions about the distribution of causes. This section introduced empirical Bayes to show that these assumptions are not necessary and that priors can be learned in a hierarchical context. Furthermore, we have tried to show that hierarchical prediction can be implemented in brain-like architectures using mechanisms that are biologically plausible.

4. Generative models and the brain

The arguments in the preceding section clearly favour predictive coding, over supervised or information theoretic frameworks, as a more plausible account of functional brain architectures. However, it should be noted that the differences among them have been deliberately emphasised. For example, predictive coding and its implicit error minimisation result in the maximisation of information transfer. In other words, predictive coding is entirely consistent with the infomax principle, but infomax is a principle, whereas predictive coding represents a particular scheme that serves it. There are examples of infomax that do not employ predictive coding (e.g. transformations of stimulus energy in early visual processing; Atick and Redlich, 1990) that may be specified genetically or epigenetically. However, predictive coding is likely to play a much more prominent role at higher levels of processing, for the reasons detailed in the previous section.

In a similar way, predictive coding, especially in its hierarchical formulation, conforms to the same PDP principles that underpin connectionist schemes. The representation of any cause depends upon the internally consistent representations of subordinate and supraordinate causes in lower and higher levels. These representations mutually induce and maintain themselves, across and within all levels of the sensory hierarchy, through dynamic and reentrant interactions (Edelman, 1993). The same PDP phenomena (e.g. lateral interactions leading to competition among representations) can be observed. For example, the lateral connection strengths embody what has been learnt empirically about the prior covariances among causes. A prior that transpires to be very precise (i.e. low variance) will receive correspondingly low-strength inhibitory connections from its competing error units (recall Σᵢ(λᵢ)^(1/2) = 1 + λᵢ). It will therefore supervene over other error units and have a greater corrective impact on the estimate causing the prediction error. Conversely, top–down expectations that are less informative will induce errors that are more easily suppressed and have less effect on the representations. In predictive coding these dynamics are driven explicitly by error minimisation, whereas in connectionist simulations the activity is determined solely by the connection strengths established during training.

In addition to the theoretical bias toward generative models and predictive coding, the clear emphasis on backward and reentrant (Edelman, 1993) dynamics makes this a more natural framework for understanding neuronal infrastructures. Fig. 1 shows the fundamental difference between infomax and generative schemes. In the infomax schemes the connections are universally forward. In the predictive coding scheme the forward connections (broken line) drive the predictions so as to minimise error, whereas backward connections (solid lines) use the representations of causes to emulate the mixing enacted by the real world. The nonlinear aspects of this mixing imply that only backward influences interact in the predictive coding scheme, whereas the nonlinear unmixing in classical infomax schemas is mediated by forward connections. Section 2 assembled some of the anatomical and physiological evidence suggesting that backward connections are prevalent in the real brain and could support nonlinear mixing through their modulatory characteristics. It is pleasing that purely theoretical considerations and neurobiological empiricism converge on the same architecture. Before turning to electrophysiological and functional neuroimaging evidence for backward connections, we consider the implications for classical views of receptive fields and the representational capacity of neuronal units.

4.1. Context, causes and representations

The Bayesian perspective suggests something quite profound for the classical view of receptive fields. If neuronal responses encompass a bottom–up likelihood term and top–down priors, then responses evoked by bottom–up input should change with the context established by prior expectations from higher levels of processing. Consider the example in Fig. 3 again. Here a unit encoding the visual form of 'went' responds when we read the first sentence at the top of this figure. When we read the second sentence, 'The last event was cancelled', it would not. If we recorded from this unit we might infer that our 'went' unit was, in some circumstances, selective for the word 'event'. Without an understanding of hierarchical inference and of the semantic context in which the stimulus was presented, this might be difficult to explain. In short, under a predictive coding scheme the receptive fields of neurons should be context-sensitive. The remainder of this section deals with empirical evidence for these extra-classical receptive field effects.

Generative models suggest that the role of backward connections is to provide contextual guidance to lower levels through a prediction of the lower level's inputs. When this prediction is incomplete or incompatible with the lower area's input, an error is generated that engenders changes in the area above until reconciliation. When, and only when, the bottom–up driving inputs are in harmony with top–down prediction, error is suppressed and a consensus between the prediction and the actual input is established. Given this conceptual model, a stimulus-related response or 'activation' corresponds to some transient error signal that induces the appropriate change in higher areas until a veridical higher-level representation emerges and the error is 'cancelled' by backward connections. Clearly the prediction error will depend on the context and consequently the backward connections confer context-sensitivity on the functional specificity of the lower area. In short, the activation does not just depend on bottom–up input but on the difference between bottom–up input and top–down predictions.
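These settling dynamics can be caricatured in a few lines. The sketch below assumes a deliberately simple nonlinear generative function g(v) = v², standing in for the mixing enacted by the real world; the higher-level cause v is adjusted along the error gradient until the backward prediction cancels the error, at which point the transient 'activation' disappears.

```python
# A minimal settling sketch under an assumed generative function g(v) = v**2.
u = 4.0                          # bottom-up input to the lower level
v = 1.0                          # higher-level representation (initial guess)
for _ in range(100):
    error = u - v ** 2           # transient 'activation': the prediction error
    v += 0.05 * (2 * v) * error  # change the higher area along the gradient

print(round(v, 3), round(u - v ** 2, 3))  # v -> 2.0, error -> 0.0
```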

The prevalence of nonlinear or modulatory top–down effects can be inferred from the fact that context interacts with the content of representations. Here context is established simply through the expression of causes other than the one in question. Backward connections from one higher area can be considered as providing contextual modulation of the prediction from another. Because the effect of context will only be expressed when the thing being predicted is present, these contextual afferents will not elicit a response by themselves. Effects of this sort, which change the responsiveness of units but do not elicit a response, are a hallmark of modulatory projections. In summary, hierarchical models offer a scheme that allows for contextual effects; firstly through biasing responses towards their prior expectation and secondly by conferring a context-sensitivity on these priors through modulatory backward projections. Next we consider the nature of real neuronal responses and whether they are consistent with this perspective.

4.2. Neuronal responses and representations

Classical models (e.g. classical receptive fields) assume that evoked responses will be expressed invariably in the same units or neuronal populations irrespective of the context. However, real neuronal responses are not invariant but depend upon the context in which they are evoked. For example, visual cortical units have dynamic receptive fields that can change from moment to moment (cf. the non-classical receptive field effects modelled in Rao and Ballard, 1998). Another example is attentional modulation of evoked responses that can change the sensitivity of neurons to different perceptual attributes (e.g. Treue and Maunsell, 1996). The evidence for contextual responses comes from neuroanatomical and electrophysiological studies. There are numerous examples of context-sensitive neuronal responses. Perhaps the simplest is short-term plasticity. Short-term plasticity refers to changes in connection strength, either potentiation or depression, following pre-synaptic inputs (e.g. Abbot et al., 1997). In brief, the underlying connection strengths, which define what a unit represents, are a strong function of the immediately preceding neuronal transient (i.e. the preceding representation). A second, and possibly richer, example is that of attentional modulation. It has been shown, both in single-unit recordings in primates (Treue and Maunsell, 1996) and in human fMRI studies (Büchel and Friston, 1997), that attention to specific visual attributes can profoundly alter the receptive fields or event-related responses to the same stimuli.

These sorts of effects are commonplace in the brain and are generally understood in terms of the dynamic modulation of receptive field properties by backward and lateral afferents. There is clear evidence that lateral connections in visual cortex are modulatory in nature (Hirsch and Gilbert, 1991), speaking to an interaction between the functional segregation implicit in the columnar architecture of V1 and the neuronal dynamics in distal populations. These observations suggest that lateral and backward interactions may convey contextual information that shapes the responses of any neuron to its inputs (e.g. Kay and Phillips, 1996; Phillips and Singer, 1997), conferring on the brain the ability to make conditional inferences about sensory input. See also McIntosh (2000), who develops the idea from a cognitive neuroscience perspective "that a particular region in isolation may not act as a reliable index for a particular cognitive function. Instead, the neural context in which an area is active may define the cognitive function." His argument is predicated on careful characterisations of effective connectivity using neuroimaging.

4.2.1. Examples from electrophysiology

In the next section we will illustrate the context-sensitive nature of cortical activations, and implicit specialisation, in the inferior temporal lobe using neuroimaging. Here we consider the evidence for contextual representations in terms of single-cell responses, to visual stimuli, in the temporal cortex of awake behaving monkeys. If the representation of a stimulus depends on establishing representations of subordinate and supraordinate causes at all levels of the visual hierarchy, then information about the high-order attributes of a stimulus must be conferred by top–down influences. Consequently, one might expect to see the emergence of selectivity for high-level attributes after the initial visually evoked response (it typically takes about 10 ms for volleys of spikes to be propagated from one cortical area to another and about 100 ms to reach prefrontal areas). This is because the representations at higher levels must emerge before backward afferents can reshape the response profile of neurons in lower areas. This temporal delay, in the emergence of selectivity, is precisely what one sees empirically. Sugase et al. (1999) recorded neurons in macaque temporal cortex during the presentation of faces and objects. The faces were either human or monkey faces and were categorised in terms of identity (whose face it was) and expression (happy, angry, etc.). "Single neurones conveyed two different scales of facial information in their firing patterns, starting at different latencies. Global information, categorising stimuli as monkey faces, human faces or shapes, was conveyed in the earliest part of the responses. Fine information about identity or expression was conveyed later", starting on average about 50 ms after face-selective responses. These observations demonstrate representations for facial identity or expression that emerge dynamically in a way that might rely on backward connections. These influences imbue neurons with a selectivity that is not intrinsic to the area but depends on interactions across levels of a processing hierarchy.

A similar late emergence of selectivity is seen in motion processing. A critical aspect of visual processing is the integration of local motion signals generated by moving objects. This process is complicated by the fact that local velocity measurements can differ depending on contour orientation and spatial position. Specifically, any local motion detector can measure only the component of motion perpendicular to a contour that extends beyond its field of view (Pack and Born, 2001). This "aperture problem" is particularly relevant to direction-selective neurons early in the visual pathways, where small receptive fields permit only a limited view of a moving object. Pack and Born (2001) have shown "that neurons in the middle temporal visual area (known as MT or V5) of the macaque brain reveal a dynamic solution to the aperture problem. MT neurons initially respond primarily to the component of motion perpendicular to a contour's orientation, but over a period of approximately 60 ms the responses gradually shift to encode the true stimulus direction, regardless of orientation".
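The geometry behind the aperture problem is simple enough to state exactly. The sketch below, with made-up numbers, shows that a local detector reporting only the motion component perpendicular to a contour sees a velocity that differs in both speed and direction from the true one; this is what the early MT responses reflect before they shift toward the true direction.

```python
import numpy as np

# Hypothetical geometry: the object moves rightwards, but the receptive
# field sees only a contour oriented at 45 degrees.
true_velocity = np.array([1.0, 0.0])
theta = np.deg2rad(45.0)                            # contour orientation
normal = np.array([np.sin(theta), -np.cos(theta)])  # unit normal to contour

# A local detector reports only the motion component along the normal.
v_perp = (true_velocity @ normal) * normal
print(v_perp)  # [ 0.5 -0.5]: slower, and 45 degrees off the true direction
```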

The preceding examples were taken from electrophysiology. Similar predictions can be made, albeit at a less refined level, about population responses elicited in functional neuroimaging, where functional specialisation (cf. selectivity in unit recordings) is established by showing regionally-specific responses to some sensorimotor attribute or cognitive component. At the level of cortical responses in neuroimaging, the dynamic and contextual nature of evoked responses means that regionally-specific responses to a particular cognitive component may be expressed in one context but not another. In the next section we look at some empirical evidence from functional neuroimaging that confirms the idea that functional specialisation is conferred in a context-sensitive fashion by backward connections from higher brain areas.

5. Functional architectures assessed with brain imaging

Information theory and predictive coding schemes suggest alternative architectures that are sufficient for representational learning. Forward connections are sufficient for the former, whereas the latter posits that most of the brain's infrastructure is used to predict sensory input through a hierarchy of top–down projections. Clearly, to adjudicate between these alternatives the existence of backward influences must be established. This is a slightly deeper problem for functional neuroimaging than might be envisaged, because making causal inferences about effective connectivity is not straightforward (see Pearl, 2000). It might be thought that showing regional activity was partially predicted by activity in a higher level would be sufficient to confirm the existence of backward influences, at least at a population level. The problem is that this statistical dependency does not permit any causal inference. Statistical dependencies could easily arise in a purely forward architecture because the higher level activity is predicated on activity in the lower level. One resolution of this problem is to perturb the higher level directly using transcranial magnetic stimulation or pathological disruptions (see Section 6). However, discounting these interventions, one is left with the difficult problem of inferring backward influences based on measures that could be correlated because of forward connections. Although there are causal modelling techniques that can address this problem, we will take a simpler approach and note that interactions between bottom–up and top–down influences cannot be explained by a purely feedforward architecture. This is because the top–down influences have no access to the bottom–up inputs. An interaction, in this context, can be construed as an effect of backward connections on the driving efficacy of forward connections. In other words, the response evoked by the same driving bottom–up inputs depends upon the context established by top–down inputs. This interaction is used below simply as evidence for the existence of backward influences. However, there are some instances of predictive coding that emphasise this phenomenon. For example, the "Kalman filter model demonstrates how certain forms of attention can be viewed as an emergent property of the interaction between top–down expectations and bottom–up signals" (Rao, 1999).

The remainder of this article focuses on the evidence for these interactions. From the point of view of functionally specialised responses, these interactions manifest as context-sensitive or contextual specialisation, where modality-, category- or exemplar-specific responses, driven by bottom–up inputs, are modulated by top–down influences induced by perceptual set. The first half of this section adopts this perspective. The second part uses measurements of effective connectivity to establish interactions between bottom–up and top–down influences. All the examples presented below rely on attempts to establish interactions by trying to change sensory-evoked neuronal responses through putative manipulations of top–down influences. These include inducing independent changes in perceptual set, cognitive (attentional) set and, in the last section, the study of patients with brain lesions.

5.1. Context-sensitive specialisation

If functional specialisation is context-dependent then one should be able to find evidence for functionally-specific responses, using neuroimaging, that are expressed in one context and not in another. The first part of this section provides an empirical example. If the contextual nature of specialisation is mediated by backward modulatory afferents then it should be possible to find cortical regions in which functionally-specific responses, elicited by the same stimuli, are modulated by activity in higher areas. The second example shows that this is indeed possible. Both of these examples depend on multifactorial experimental designs that have largely replaced subtraction and categorical designs in human brain mapping.

5.1.1. Categorical designs

Categorical designs, such as cognitive subtraction, have been the mainstay of functional neuroimaging over the past decade. Cognitive subtraction involves elaborating two tasks that differ in a separable component. Ensuing differences in brain activity are then attributed to this component. The tenet of cognitive subtraction is that the difference between two tasks can be formulated as a separable cognitive or sensorimotor component and that the regionally specific differences in hemodynamic responses identify the corresponding functionally specialised area. Early applications of subtraction range from the functional anatomy of word processing (Petersen et al., 1989) to functional specialisation in extrastriate cortex (Lueck et al., 1989). The latter studies involved presenting visual stimuli with and without some sensory attribute (e.g. colour, motion, etc.). The areas highlighted by subtraction were identified with homologous areas in monkeys that showed selective electrophysiological responses to equivalent visual stimuli.

Consider a specific example; namely the difference between simply saying "yes" when a recognisable object is seen, and saying "yes" when an unrecognisable non-object is seen. Regionally specific differences in brain activity that distinguish between these two tasks could be implicated in implicit object recognition. Although its simplicity is appealing, this approach embodies some strong assumptions about the way that the brain implements cognitive processes. A key assumption is 'pure insertion'. Pure insertion asserts that one can insert a new component into a task without affecting the implementation of pre-existing components (for example, how do we know that object recognition is not itself affected by saying "yes"?). The fallibility of this assumption has been acknowledged for decades, perhaps most explicitly by Sternberg's revision of Donders' subtractive method. The problem for subtraction is as follows: if one develops a task by adding a component, then the new task comprises not only the previous components and the new component but also the integration of the new and old components (for example, the integration of phonology and object recognition). This integration or interaction can itself be considered a new component. The difference between two tasks therefore includes the new component and the interactions between the new component and those of the original task. Pure insertion requires that all these interaction terms are negligible. Clearly in many instances they are not. We next consider factorial designs that eschew the assumption of pure insertion.

5.1.2. Multifactorial designs

Factorial designs combine two or more factors within a task or tasks. Factorial designs can be construed as performing subtraction experiments in two or more different contexts. The differences in activations, attributable to the effects of context, are simply the interaction. Consider repeating the above implicit object recognition experiment in another context, for example naming (of the object's name or the non-object's colour). The factors in this example are implicit object recognition with two levels (objects versus non-objects) and phonological retrieval (naming versus saying "yes"). The idea here is to look at the interaction between these factors, or the effect that one factor has on the responses elicited by changes in the other. Generally, interactions can be thought of as a difference in activations brought about by another processing demand. Dual-task interference paradigms are a clear example of this approach (e.g. Fletcher et al., 1995).
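Numerically, the interaction is just a difference of differences. The sketch below uses invented activation values purely to make the arithmetic concrete; none of the numbers come from the experiment described here.

```python
# Invented activations (arbitrary units) for a 2 x 2 factorial design:
# object recognition (objects vs non-objects) crossed with phonological
# retrieval (naming vs saying "yes").
say_object, say_nonobject = 5.0, 4.8      # saying "yes"
name_object, name_nonobject = 7.0, 5.0    # naming

effect_while_saying = say_object - say_nonobject    # simple effect: 0.2
effect_while_naming = name_object - name_nonobject  # simple effect: 2.0

# The interaction is the difference of these simple effects: the object
# recognition effect that is expressed only in the naming context.
interaction = effect_while_naming - effect_while_saying
print(round(interaction, 2))  # 1.8
```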

Consider the above object recognition experiment again. Noting that object-specific responses are elicited (by asking subjects to view objects relative to meaningless shapes), with and without phonological retrieval, reveals the factorial nature of this experiment. This 'two by two' design allows one to look specifically at the interaction between phonological retrieval and object recognition. This analysis identifies not regionally specific activations but regionally specific interactions. When we actually performed this experiment these interactions were evident in the left posterior, inferior temporal region and can be associated with the integration of phonology and object recognition (see Fig. 4 and Friston et al., 1996 for details). Alternatively this region can be thought of as expressing recognition-dependent responses that are realised in, and only in, the context of having to name the object seen. These results can be construed as evidence of contextual specialisation for object recognition that depends upon modulatory afferents (possibly from temporal and parietal regions) that are implicated in naming a visually perceived object. There is no empirical evidence in these results to suggest that the temporal or parietal regions are the source of this top–down influence, but in the next example the source of modulation is addressed explicitly using psychophysiological interactions.

Fig. 4. This example of regionally specific interactions comes from an experiment where subjects were asked to view coloured non-object shapes or coloured objects and say "yes", or to name either the coloured object or the colour of the shape. Left: A regionally specific interaction in the left infero-temporal cortex. The SPM threshold is P < 0.05 (uncorrected) (Friston et al., 1995b). Right: The corresponding activities in the maxima of this region are portrayed in terms of object recognition-dependent responses with and without naming. It is seen that this region shows object recognition responses when, and only when, there is phonological retrieval. The 'extra' activation with naming corresponds to the interaction. These data were acquired from six subjects scanned 12 times using PET.

5.1.3. Psychophysiological interactions

Psychophysiological interactions speak directly to the interactions between bottom–up and top–down influences, where one is modelled as an experimental factor and the other constitutes a measured brain response. In an analysis of psychophysiological interactions one is trying to explain a regionally specific response in terms of an interaction between the presence of a sensorimotor or cognitive process and activity in another part of the brain (Friston et al., 1997). The supposition here is that the remote region is the source of backward modulatory afferents that confer functional specificity on the target region. For example, using information about activity in the posterior parietal cortex, which mediates attentional or perceptual set pertaining to a particular stimulus attribute, can we identify regions that respond to that stimulus when, and only when, activity in the parietal source is high? If such an interaction exists, then one might infer that the parietal area is modulating responses to the stimulus attribute for which the area is selective. This has clear ramifications in terms of the top–down modulation of specialised cortical areas by higher brain regions.

The statistical model employed in testing for psychophysiological interactions is a simple regression model of effective connectivity that embodies nonlinear (second-order or modulatory) effects. As such, this class of model speaks directly to functional specialisation of a nonlinear and contextual sort. Fig. 5 illustrates a specific example (see Dolan et al., 1997 for details). Subjects were asked to view (degraded) faces and non-face (object) controls. The interaction between activity in the parietal region and the presence of faces was expressed most significantly in the right infero-temporal region, not far from the homologous left infero-temporal region implicated in the object naming experiment above. Changes in parietal activity were induced experimentally by pre-exposure to the (un-degraded) stimuli before some scans but not others, to prime them. The data in the right panel of Fig. 5 suggest that the infero-temporal region shows face-specific responses, relative to non-face objects, when, and only when, parietal activity is high. These results can be interpreted as a priming-dependent face-specific response, in infero-temporal regions, that is mediated by interactions with medial parietal cortex. This is a clear example of contextual specialisation that depends on top–down effects.
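In general linear model terms, a psychophysiological interaction is tested by regressing the target region's response on a psychological factor, a physiological signal from the putative source region, and their product. The sketch below uses simulated data rather than the PET data described here; the variable names are ours, and a reliably non-zero weight on the product term is what would license the inference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120
psych = rng.integers(0, 2, n).astype(float)   # faces = 1, non-faces = 0
phys = rng.normal(size=n)                     # parietal source activity

# The interaction (PPI) regressor: the psychological factor scaled by
# mean-centred source activity. The target response is simulated with a
# genuine interaction effect of 0.8.
ppi = psych * (phys - phys.mean())
y = 0.4 * psych + 0.5 * phys + 0.8 * ppi + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), psych, phys, ppi])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # the last coefficient (~0.8) estimates the interaction
```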

5.2. Effective connectivity

The previous examples demonstrating contextual specialisation are consistent with functional architectures implied by predictive coding. However, they do not provide definitive evidence for an interaction between top–down and bottom–up influences. In this subsection we look for direct evidence of these interactions using functional imaging. This rests upon being able to measure effective connectivity in a way that is sensitive to interactions among inputs. This requires a plausible model of coupling among brain regions that accommodates nonlinear and dynamical effects. We have used a model that is based on the Volterra expansion introduced in Section 3. Before turning to empirical evidence for interactions between bottom–up and top–down inputs, the motivation for this particular model of effective connectivity is presented briefly.

Fig. 5. Top: Examples of the stimuli presented to subjects. During the measurement of brain responses only degraded stimuli were shown (e.g. the right-hand picture). In half the scans the subject was given the underlying cause of these stimuli, through presentation of the original picture (e.g. left) before scanning. This priming induced a profound difference in perceptual set for the primed, relative to non-primed, stimuli. Right: Activity observed in a right infero-temporal region, as a function of (mean corrected) PPC activity. This region showed the most significant interaction between the presence of faces in visually presented stimuli and activity in a reference location in the posterior medial parietal cortex (PPC). This analysis can be thought of as finding those areas that are subject to top–down modulation of face-specific responses by medial parietal activity. The crosses correspond to activity whilst viewing non-face stimuli and the circles to faces. The essence of this effect can be seen by noting that this region differentiates between faces and non-faces when, and only when, medial parietal activity is high. The lines correspond to the best second-order polynomial fit. These data were acquired from six subjects using PET. Left: Schematic depicting the underlying conceptual model, in which driving afferents from ventral form areas (here designated as V4) excite infero-temporal (IT) responses, subject to permissive modulation by PPC projections.

5.2.1. Effective connectivity and Volterra kernels

The problem faced, when trying to measure effective connectivity, is that measurements of brain responses are usually very limited, either in terms of their resolution (in space or time) or in terms of the neurophysiological or biophysical variable that is measured. Given the complicated nature of neuronal interactions, involving a huge number of microscopic variables, it may seem an impossible task to make meaningful measurements of coupling among brain systems, especially with measurements afforded by techniques like fMRI. However, the problem is not as intractable as one might think.

Suppose that the variables x represented a complete and self-consistent description of the state variables of a brain region. In other words, everything needed to determine the evolution of that region's state, at a particular place and time, was embodied in these measurements. If such a set of variables existed they would satisfy some immensely complicated nonlinear equations (cf. Eq. (1))

ẋ = f(x, u)
y = g(x)    (33)

u represents a set of inputs conveyed by projections from other regions and x is a large vector of state variables, which range from depolarisation at every point in the dendritic tree to the phosphorylation status of every relevant enzyme; from the biochemical status of every glial cell compartment to every aspect of gene expression. The vast majority of these variables are hidden and not measurable directly. However, there are measurements y that can be made that, as we have seen in Section 3, are simply a nonlinear convolution of the inputs with some Volterra kernels. These measures usually reflect the activity of whole cells or populations and are measured in many ways, for example firing at the initial segment of an axon or local field potentials. The critical thing here is that the output is causally related to the inputs, which are the outputs of other regions. This means that we never need to know the underlying and 'hidden' variables that describe the details of each region's electrochemical status. We only need to know the history of its inputs, which obtain from the measurable outputs of other regions. In principle, a complete description of regional responses could be framed in terms of inputs and the Volterra kernels required to produce the outputs. The nice thing about the kernels is that they can be interpreted directly as effective connectivity (see Box 1).

Fig. 6. Left: Brain regions and connections comprising the model. Right: Characterisation of the effects of V2 inputs on V5 and their modulation by posterior parietal cortex (PPC). The broken lines represent estimates of V5 responses when PPC activity is zero, according to a second-order Volterra model of effective connectivity with inputs to V5 from V2, PPC and the pulvinar (PUL). The solid curves represent the same response when PPC activity is one standard deviation of its variation over conditions. It is evident that V2 has an activating effect on V5 and that PPC increases the responsiveness of V5 to these inputs. The insert shows all the voxels in V5 that evidenced a modulatory effect (P < 0.05 uncorrected). These voxels were identified by thresholding an SPM (Friston et al., 1995b) of the F statistic testing for the contribution of second-order kernels involving V2 and PPC (treating all other terms as nuisance variables). The data were obtained with fMRI under identical stimulus conditions (visual motion subtended by radially moving dots) whilst manipulating the attentional component of the task (detection of velocity changes).

Because the inputs (and outputs) are measurable, one can estimate the kernels empirically. The first-order kernel is simply the change in response induced by a change in input in the recent past. The second-order kernels are the change in the first-order effective connectivity induced by changes in a second (modulatory) input, and so on for higher orders.


Another nice thing about the Volterra formulation is that the response is linear in the unknowns, which can therefore be estimated using standard least squares procedures. In short, Volterra kernels are synonymous with effective connectivity because they characterise the measurable effect that an input has on its target.
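Because the output is linear in the kernel coefficients, the estimation can be sketched in a few lines of simulated, discrete-time code. Everything below is assumed for illustration only: two inputs (u1 driving, u2 modulatory), a short kernel support, and wrap-around lags that a real analysis would handle properly.

```python
import numpy as np

rng = np.random.default_rng(1)
T, lags = 500, 3
u1 = rng.normal(size=T)               # driving input (e.g. from V2)
u2 = rng.normal(size=T)               # modulatory input (e.g. from PPC)

# Regressors: lagged u1 (first-order terms) and lagged u1*u2 (second-order
# terms). np.roll wraps around at the ends; fine for a sketch.
cols = [np.ones(T)]
for k in range(lags):
    cols.append(np.roll(u1, k))
    cols.append(np.roll(u1, k) * np.roll(u2, k))
X = np.column_stack(cols)

h_true = rng.normal(size=X.shape[1])        # 'true' kernel coefficients
y = X @ h_true + rng.normal(scale=0.1, size=T)  # measured output

h_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# First-order terms of h_hat measure the effective connectivity of u1;
# second-order terms measure how u2 modulates that influence.
print(np.round(h_hat - h_true, 2))          # near zero: kernels recovered
```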

5.2.2. Nonlinear coupling among brain areas

Linear models of effective connectivity assume that the multiple inputs to a brain region are linearly separable. This assumption precludes activity-dependent connections that are expressed in one context and not in another. The resolution of this problem lies in adopting nonlinear models, like the Volterra formulation, that include interactions among inputs. These interactions can be construed as a context- or activity-dependent modulation of the influence that one region exerts over another (Büchel and Friston, 1997). In the Volterra model, second-order kernels model modulatory effects. Within these models the influence of one region on another has two components: (i) the direct or driving influence of input from the first (e.g. hierarchically lower) region, irrespective of the activities elsewhere; and (ii) an activity-dependent, modulatory component that represents an interaction with inputs from the remaining (e.g. hierarchically higher) regions. These are mediated by the first- and second-order kernels, respectively. The example provided in Fig. 6 addresses the modulation of visual cortical responses by attentional mechanisms (e.g. Treue and Maunsell, 1996) and the mediating role of activity-dependent changes in effective connectivity.

The right panel in Fig. 6 shows a characterisation of this modulatory effect in terms of the increase in V5 responses, to a simulated V2 input, when posterior parietal activity is zero (broken lines) and when it is high (solid lines). In this study subjects were studied with fMRI under identical stimulus conditions (visual motion subtended by radially moving dots) whilst manipulating the attentional component of the task (detection of velocity changes). The brain regions and connections comprising the model are shown in the left panel. It is evident that V2 has an activating effect on V5 and that posterior parietal cortex (PPC) increases the responsiveness of V5 to these inputs. The insert shows all the voxels in V5 that evidenced a modulatory effect (P < 0.05 uncorrected). These voxels were identified by thresholding statistical parametric maps of the F statistic (Friston et al., 1995b), testing for the contribution of second-order kernels involving V2 and PPC while treating all other components as nuisance variables. The estimation of the Volterra kernels and the statistical inference procedure are described in Friston and Büchel (2000).

This sort of result suggests that backward parietal inputs may be a sufficient explanation for the attentional modulation of visually evoked extrastriate responses. More importantly, they are consistent with the functional architecture implied by predictive coding, because they establish the existence of functionally expressed backward connections. V5 cortical responses evidence an interaction between bottom–up input from early visual cortex and top–down influences from parietal cortex. In the final section the implications of this sort of functional integration are addressed from the point of view of the lesion-deficit model and neuropsychology.

6. Functional integration and neuropsychology

If functional specialisation depends on interactions among cortical areas, then one might predict changes in functional specificity in cortical regions that receive enabling or modulatory afferents from a damaged area. A simple consequence is that aberrant responses will be elicited in regions hierarchically below the lesion if, and only if, these responses depend upon inputs from the lesion site. However, there may be other contexts in which the region's responses are perfectly normal (relying on other, intact, afferents). This leads to the notion of a context-dependent, regionally-specific abnormality, caused by, but remote from, a lesion (i.e. an abnormal response that is elicited by some tasks but not others). We have referred to this phenomenon as 'dynamic diaschisis' (Price et al., 2000).

6.1. Dynamic diaschisis

Classical diaschisis, demonstrated by early anatomical studies and more recently by neuroimaging studies of resting brain activity, refers to regionally specific reductions in metabolic activity at sites that are remote from, but connected to, damaged regions. The clearest example is 'crossed cerebellar diaschisis' (Lenzi et al., 1982), in which abnormalities of cerebellar metabolism are seen characteristically following cerebral lesions involving the motor cortex. Dynamic diaschisis describes the context-sensitive and task-specific effects that a lesion can have on the evoked responses of a distant cortical region. The basic idea behind dynamic diaschisis is that an otherwise viable cortical region expresses aberrant neuronal responses when, and only when, those responses depend upon interactions with a damaged region. This can arise because normal responses in any given region depend upon inputs from, and reciprocal interactions with, other regions. The regions involved will depend on the cognitive and sensorimotor operations engaged at any particular time. If these regions include one that is damaged, then abnormal responses may ensue. However, there may be situations when the same region responds normally, for instance when its dynamics depend only upon integration with undamaged regions. If the region can respond normally in some situations then forward driving components must be intact. This suggests that dynamic diaschisis will only present itself when the lesion involves a hierarchically equivalent or higher area.


Fig. 7. (a) Top: These renderings illustrate the extent of cerebral infarcts in four patients, as identified by voxel-based morphometry. Regions of reduced grey matter (relative to neurologically normal controls) are shown in white on the left hemisphere. The SPMs (Friston et al., 1995b) were thresholded at P < 0.001 uncorrected. All patients had damage to Broca's area. The first (upper left) patient's left middle cerebral artery infarct was most extensive, encompassing temporal and parietal regions as well as frontal and motor cortex. (b) Bottom: SPMs illustrating the functional imaging results, with regions of significant activation shown in black on the left hemisphere. Results are shown for: (i) normal subjects reading words (left); (ii) activations common to normal subjects and patients reading words, using a conjunction analysis (middle top); (iii) areas where normal subjects activate significantly more than patients reading words, using the group-by-condition interaction (middle lower); and (iv) the first patient activating normally for a semantic task. Context-sensitive failures to activate are implied by the abnormal activations in the first patient, for the implicit reading task, despite a normal activation during a semantic task.

6.1.1. An empirical demonstration

We investigated this possibility in a functional imaging study of four aphasic patients, all with damage to the left posterior inferior frontal cortex, classically known as Broca's area (see Fig. 7, upper panels). These patients had speech output deficits but relatively preserved comprehension. Generally, functional imaging studies can only make inferences about abnormal neuronal responses when changes in cognitive strategy can be excluded. We ensured this by engaging the patients in an explicit task that they were able to perform normally. This involved a keypress response when a visually presented letter string contained a letter with an ascending visual feature (e.g. h, k, l or t). While the task remained constant, the stimuli presented were either words or consonant letter strings. Activations detected for words, relative to letters, were attributed to implicit word processing. Each patient showed normal activation of the left posterior middle temporal cortex that has been associated with semantic processing (Price, 1998). However, none of the patients activated the left posterior inferior frontal cortex (damaged by the stroke), or the left posterior inferior temporal region (undamaged by the stroke) (see Fig. 4). These two regions are crucial for word production (Price, 1998). Examination of individual responses in this area revealed that all the normal subjects showed increased activity for words relative to consonant letter strings, while all four patients showed the reverse effect. The abnormal responses in the left posterior inferior temporal lobe occurred even though this undamaged region: (i) lies adjacent and posterior to a region of the left middle temporal cortex that activated normally (see middle column of Fig. 7b); and (ii) is thought to be involved in an earlier stage of word processing than the damaged left inferior frontal cortex (i.e. is hierarchically lower than the lesion). From these results we can conclude that, during the reading task, responses in the left basal temporal language area rely on afferent inputs from the left posterior inferior frontal cortex. When the first patient was scanned again, during an explicit semantic task, the left posterior inferior temporal lobe responded normally. The abnormal implicit reading-related responses were therefore task-specific.

These results serve to illustrate the concept of dynamic diaschisis; namely the anatomically remote and context-specific effects of focal brain lesions. Dynamic diaschisis represents a form of functional disconnection where regional dysfunction can be attributed to the loss of enabling inputs from hierarchically equivalent or higher brain regions. Unlike classical or anatomical disconnection syndromes, its pathophysiological expression depends upon the functional brain state at the time responses are evoked. Dynamic diaschisis may be characteristic of many regionally specific brain insults and may have implications for neuropsychological inference.

7. Conclusion

In conclusion, the representational capacity and inherent function of any neuron, neuronal population or cortical area in the brain is dynamic and context-sensitive. Functional integration, or interactions among brain systems that employ driving (bottom–up) and backward (top–down) connections, mediates this adaptive and contextual specialisation. A critical consequence is that hierarchically organised neuronal responses, in any given cortical area, can represent different things at different times. We have seen that most models of representational learning require prior assumptions about the distribution of causes. However, empirical Bayes suggests that these assumptions can be relaxed and that priors can be learned in a hierarchical context. We have tried to show that this hierarchical prediction can be implemented in brain-like architectures and in a biologically plausible fashion.

The main point made in this review is that backward connections, mediating internal or generative models of how sensory inputs are caused, are essential if the processes generating inputs are non-invertible. Because these generating processes are dynamical in nature, sensory input corresponds to a non-invertible nonlinear convolution of causes. This non-invertibility demands an explicit parameterisation of generative models (backward connections) to enable approximate recognition and suggests that feedforward architectures are not sufficient for representational learning. Moreover, nonlinearities in generative models, that induce dependence on backward connections, require these connections to be modulatory, so that estimated causes in higher cortical levels can interact to predict responses in lower levels. This is important in relation to asymmetries in forward and backward connections that have been characterised empirically.

The arguments in this article were developed under prediction models of brain function, where higher-level systems provide a prediction of the inputs to lower-level regions. Conflict between the two is resolved by changes in the higher-level representations, which are driven by the ensuing error in lower regions, until the mismatch is 'cancelled'. From this perspective the specialisation of any region is determined both by bottom–up driving inputs and by top–down predictions. Specialisation is therefore not an intrinsic property of any region but depends on both forward and backward connections with other areas. Because the latter have access to the context in which the inputs are generated, they are in a position to modulate the selectivity or specialisation of lower areas. The implications for classical models (e.g. classical receptive fields in electrophysiology, classical specialisation in neuroimaging and connectionism in cognitive models) are severe and suggest these models may provide incomplete accounts of real brain architectures. On the other hand, predictive coding in the context of hierarchical generative models not only accounts for many extra-classical phenomena seen empirically but also enforces a view of the brain as an inferential machine through its empirical Bayesian motivation.

Acknowledgements

The Wellcome Trust funded this work. I would like to thank my colleagues for help in writing this review and developing these ideas, especially Cathy Price for the psychological components and Peter Dayan for the theoretical neurobiology.

References

Abbot, L.F., Varela, J.A., Nelson, S.B., 1997. Synaptic depression and cortical gain control. Science 275, 220–223.

Absher, J.R., Benson, D.F., 1993. Disconnection syndromes: an overview of Geschwind's contributions. Neurology 43, 862–867.

Aertsen, A., Preißl, H., 1991. Dynamics of activity and connectivity in physiological neuronal networks. In: Schuster, H.G. (Ed.), Nonlinear Dynamics and Neuronal Networks. VCH Publishers, New York, NY, USA, pp. 281–302.

Atick, J.J., Redlich, A.N., 1990. Towards a theory of early visual processing. Neural Comput. 2, 308–320.

Ballard, D.H., Hinton, G.E., Sejnowski, T.J., 1983. Parallel visual computation. Nature 306, 21–26.

Barlow, H.B., 1961. Possible principles underlying the transformation of sensory messages. In: Rosenblith, W.A. (Ed.), Sensory Communication. MIT Press, Cambridge, MA.

Bell, A.J., Sejnowski, T.J., 1995. An information maximisation approach to blind separation and blind deconvolution. Neural Comput. 7, 1129–1159.

Bendat, J.S., 1990. Nonlinear System Analysis and Identification from Random Data. Wiley, New York, USA.

Büchel, C., Friston, K.J., 1997. Modulation of connectivity in visual pathways by attention: cortical interactions evaluated with structural equation modelling and fMRI. Cereb. Cortex 7, 768–778.

Common, P., 1994. Independent component analysis, a new concept? Signal Processing 36, 287–314.

Crick, F., Koch, C., 1998. Constraints on cortical and thalamic projections: the no-strong-loops hypothesis. Nature 391, 245–250.

Dayan, P., Abbot, L.F., 2001. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, Cambridge, MA.

Dayan, P., Hinton, G.E., Neal, R.M., 1995. The Helmholtz machine. Neural Comput. 7, 889–904.

Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1–38.

Devlin, J.T., Gunnerman, L.M., Andersen, E.S., Seidenberg, M.S., 1998. Category-specific semantic deficits in focal and widespread brain damage: a computational account. J. Cog. Neurosci. 10, 77–84.


Dolan, R.J., Fink, G.R., Rolls, E., Booth, M., Holmes, A., Frackowiak, R.S.J., Friston, K.J., 1997. How the brain learns to see objects and faces in an impoverished context. Nature 389, 596–598.

Dong, D., McAvoy, T.J., 1996. Nonlinear principal component analysis—based on principal curves and neural networks. Comput. Chem. Eng. 20, 65–78.

Edelman, G.M., 1993. Neural Darwinism: selection and reentrant signalling in higher brain function. Neuron 10, 115–125.

Efron, B., Morris, C., 1973. Stein's estimation rule and its competitors—an empirical Bayes approach. J. Am. Stat. Assoc. 68, 117–130.

Farah, M., McClelland, J., 1991. A computational model of semantic memory impairments: modality specificity and emergent category specificity. J. Exp. Psychol. Gen. 120, 339–357.

Felleman, D.J., Van Essen, D.C., 1991. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1, 1–47.

Fletcher, P.C., Frith, C.D., Grasby, P.M., Shallice, T., Frackowiak, R.S.J., Dolan, R.J., 1995. Brain systems for encoding and retrieval of auditory–verbal memory. Brain 118, 401–416.

Fliess, M., Lamnabhi, M., Lamnabhi-Lagarrigue, F., 1983. An algebraic approach to nonlinear functional expansions. IEEE Trans. Circuits Syst. 30, 554–570.

Foldiak, P., 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. 64, 165–170.

Freeman, W., Barrie, J., 1994. Chaotic oscillations and the genesis of meaning in cerebral cortex. In: Buzsaki, G., Llinas, R., Singer, W., Berthoz, A., Christen, T. (Eds.), Temporal Coding in the Brain. Springer, Berlin, pp. 13–38.

Friston, K.J., 2000. The labile brain. III. Transients and spatio-temporal receptive fields. Phil. Trans. R. Soc. Lond. B 355, 253–265.

Friston, K.J., Büchel, C., 2000. Attentional modulation of V5 in human. Proc. Natl. Acad. Sci. U.S.A. 97, 7591–7596.

Friston, K.J., Frith, C., Passingham, R.E., Dolan, R., Liddle, P., Frackowiak, R.S.J., 1992. Entropy and cortical activity: information theory and PET findings. Cereb. Cortex 3, 259–267.

Friston, K.J., Frith, C.D., Frackowiak, R.S.J., 1993. Principal component analysis learning algorithms: a neurobiological analysis. Proc. R. Soc. B 254, 47–54.

Friston, K.J., Tononi, G., Reeke, G.H., Sporns, O., Edelman, G.E., 1994. Value-dependent selection in the brain: simulation in synthetic neural model. Neuroscience 39, 229–243.

Friston, K.J., 1995a. Functional and effective connectivity in neuroimaging: a synthesis. Hum. Brain Mapp. 2, 56–78.

Friston, K.J., Holmes, A.P., Worsley, K.J., Poline, J.B., Frith, C.D., Frackowiak, R.S.J., 1995b. Statistical parametric maps in functional imaging: a general linear approach. Hum. Brain Mapp. 2, 189–210.

Friston, K.J., Price, C.J., Fletcher, P., Moore, C., Frackowiak, R.S.J., Dolan, R.J., 1996. The trouble with cognitive subtraction. NeuroImage 4, 97–104.

Friston, K.J., Büchel, C., Fink, G.R., Morris, J., Rolls, E., Dolan, R.J., 1997. Psychophysiological and modulatory interactions in neuroimaging. NeuroImage 6, 218–229.

Gawne, T.J., Richmond, B.J., 1993. How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci. 13, 2758–2771.

Gerstein, G.L., Perkel, D.H., 1969. Simultaneously recorded trains of action potentials: analysis and functional interpretation. Science 164, 828–830.

Girard, P., Bullier, J., 1989. Visual activity in area V2 during reversible inactivation of area 17 in the macaque monkey. J. Neurophysiol. 62, 1287–1301.

Hinton, G.T., Shallice, T., 1991. Lesioning an attractor network: investigations of acquired dyslexia. Psychol. Rev. 98, 74–95.

Hinton, G.E., Dayan, P., Frey, B.J., Neal, R.M., 1995. The wake–sleep algorithm for unsupervised neural networks. Science 268, 1158–1161.

Hirsch, J.A., Gilbert, C.D., 1991. Synaptic physiology of horizontal connections in the cat's visual cortex. J. Neurosci. 11, 1800–1809.

Karhunen, J., Joutsensalo, J., 1994. Representation and separation of signals using nonlinear PCA type learning. Neural Netw. 7, 113–127.

Kass, R.E., Steffey, D., 1989. Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). J. Am. Stat. Assoc. 407, 717–726.

Kay, J., Phillips, W.A., 1996. Activation functions, computational goals and learning rules for local processors with contextual guidance. Neural Comput. 9, 895–910.

Kramer, M.A., 1991. Nonlinear principal component analysis using auto-associative neural networks. AIChE J. 37, 233–243.

Lenzi, G.L., Frackowiak, R.S.J., Jones, T., 1982. Cerebral oxygen metabolism and blood flow in human cerebral ischaemic infarction. J. Cereb. Blood Flow Metab. 2, 321–335.

Linsker, R., 1990. Perceptual neural organization: some approaches based on network models and information theory. Annu. Rev. Neurosci. 13, 257–281.

Lueck, C.J., Zeki, S., Friston, K.J., Deiber, M.P., Cope, N.O., Cunningham, V.J., Lammertsma, A.A., Kennard, C., Frackowiak, R.S.J., 1989. The colour centre in the cerebral cortex of man. Nature 340, 386–389.

Kawato, M., Hayakawa, H., Inui, T., 1993. A forward-inverse optics model of reciprocal connections between visual cortical areas. Network 4, 415–422.

MacKay, D.M., 1956. The epistemological problem for automata. In: Automata Studies. Princeton University Press, Princeton, NJ, pp. 235–251.

McIntosh, A.R., 2000. Towards a network theory of cognition. Neural Netw. 13, 861–870.

Mumford, D., 1992. On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biol. Cybern. 66, 241–251.

Nebes, R., 1989. Semantic memory in Alzheimer's disease. Psychol. Bull. 106, 377–394.

Neisser, U., 1967. Cognitive Psychology. Appleton-Century-Crofts, New York.

Oja, E., 1989. Neural networks, principal components, and subspaces. Int. J. Neural Syst. 1, 61–68.

Olshausen, B.A., Field, D.J., 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609.

Optican, L., Richmond, B.J., 1987. Temporal encoding of two-dimensional patterns by single units in primate inferior cortex. II. Information theoretic analysis. J. Neurophysiol. 57, 132–146.

Pack, C.C., Born, R.T., 2001. Temporal dynamics of a neural solution to the aperture problem in visual area MT of macaque brain. Nature 409, 1040–1042.

Pearl, J., 2000. Causality, Models, Reasoning and Inference. Cambridge University Press, Cambridge, UK.

Petersen, S.E., Fox, P.T., Posner, M.I., Mintun, M., Raichle, M.E., 1989. Positron emission tomographic studies of the processing of single words. J. Cog. Neurosci. 1, 153–170.

Phillips, W.A., Singer, W., 1997. In search of common foundations for cortical computation. Behav. Brain Sci. 20, 57–83.

Phillips, C.G., Zeki, S., Barlow, H.B., 1984. Localisation of function in the cerebral cortex: past, present and future. Brain 107, 327–361.

Plaut, D., Shallice, T., 1993. Deep dyslexia—a case study of connectionist neuropsychology. Cog. Neuropsychol. 10, 377–500.

Poggio, T., Torre, V., Koch, C., 1985. Computational vision and regularisation theory. Nature 317, 314–319.

Price, C.J., 1998. The functional anatomy of word comprehension and production. Trends Cog. Sci. 2, 281–288.

Price, C.J., Warburton, E.A., Moore, C.J., Frackowiak, R.S.J., Friston, K.J., 2000. Dynamic diaschisis: anatomically remote and context-specific human brain lesions. J. Cog. Neurosci. 00, 00–00.

Rao, R.P., 1999. An optimal estimation approach to visual perception and learning. Vision Res. 39, 1963–1989.

Rao, R.P., Ballard, D.H., 1998. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive field effects. Nat. Neurosci. 2, 79–87.

Rockland, K.S., Pandya, D.N., 1979. Laminar origins and terminations of cortical connections of the occipital lobe in the rhesus monkey. Brain Res. 179, 3–20.


Rumelhart, D., McClelland, J., 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA.

Salin, P.-A., Bullier, J., 1995. Corticocortical connections in the visual system: structure and function. Physiol. Rev. 75, 107–154.

Sandell, J.H., Schiller, P.H., 1982. Effect of cooling area 18 on striate cortex cells in the squirrel monkey. J. Neurophysiol. 48, 38–48.

Sugase, Y., Yamane, S., Ueno, S., Kawano, K., 1999. Global and fine information coded by single neurons in the temporal visual cortex. Nature 400, 869–873.

Sutton, R.S., Barto, A.G., 1990. Time derivative models of Pavlovian reinforcement. In: Gabriel, M., Moore, J. (Eds.), Learning and Computational Neuroscience: Foundations of Adaptive Networks. MIT Press, Cambridge, MA, pp. 497–538.

Taleb, A., Jutten, C., 1997. Nonlinear source separation: the post-nonlinear mixtures. In: Proceedings of the ESANN'97, Bruges, April 1997, pp. 279–284.

Tononi, G., Sporns, O., Edelman, G.M., 1994. A measure for brain complexity: relating functional segregation and integration in the nervous system. Proc. Natl. Acad. Sci. U.S.A. 91, 5033–5037.

Tovee, M.J., Rolls, E.T., Treves, A., Bellis, R.P., 1993. Information encoding and the response of single neurons in the primate temporal visual cortex. J. Neurophysiol. 70, 640–654.

Treue, S., Maunsell, H.R., 1996. Attentional modulation of visual motion processing in cortical areas MT and MST. Nature 382, 539–541.

Warrington, E.K., McCarthy, R., 1983. Category specific access dysphasia. Brain 106, 859–878.

Warrington, E.K., McCarthy, R., 1987. Categories of knowledge: further fractionations and an attempted integration. Brain 110, 1273–1296.

Warrington, E.K., Shallice, T., 1984. Category specific semantic impairments. Brain 107, 829–853.

Wray, J., Green, G.G.R., 1994. Calculation of the Volterra kernels of nonlinear dynamic systems using an artificial neuronal network. Biol. Cybern. 71, 187–195.

Zeki, S., 1990. The motion pathways of the visual cortex. In: Blakemore, C. (Ed.), Vision: Coding and Efficiency. Cambridge University Press, Cambridge, UK, pp. 321–345.

Zeki, S., Shipp, S., 1988. The functional logic of cortical connections. Nature 335, 311–317.