
    Journal of Experimental Psychology: General, 1985, Vol. 114, No. 2, 159-188. Copyright 1985 by the American Psychological Association, Inc. 0096-3445/85/$00.75

    Distributed Memory and the Representation of General and Specific Information

    James L. McClelland and David E. Rumelhart
    University of California, San Diego

    We describe a distributed model of information processing and memory and apply it to the representation of general and specific information. The model consists of a large number of simple processing elements which send excitatory and inhibitory signals to each other via modifiable connections. Information processing is thought of as the process whereby patterns of activation are formed over the units in the model through their excitatory and inhibitory interactions. The memory trace of a processing event is the change or increment to the strengths of the interconnections that results from the processing event. The traces of separate events are superimposed on each other in the values of the connection strengths that result from the entire set of traces stored in the memory. The model is applied to a number of findings related to the question of whether we store abstract representations or an enumeration of specific experiences in memory. The model simulates the results of a number of important experiments which have been taken as evidence for the enumeration of specific experiences. At the same time, it shows how the functional equivalent of abstract representations—prototypes, logogens, and even rules—can emerge from the superposition of traces of specific experiences, when the conditions are right for this to happen. In essence, the model captures the structure present in a set of input patterns; thus, it behaves as though it had learned prototypes or rules, to the extent that the structure of the environment it has learned about can be captured by describing it in terms of these abstractions.

    In the late 1960s and early 1970s a number of experimenters, using a variety of different tasks, demonstrated that subjects could learn through experience with exemplars of a category to respond better—more accurately, or more rapidly—to the prototype than to any of the particular exemplars. The seminal demonstration of this basic point comes from the work of Posner and Keele (1968, 1970). Using a categorization task, they found that there were some conditions in which subjects categorized the prototype of a category more accurately than the particular exemplars of the category that they had previously seen.

    This work, and many other related experiments, supported the development of the view that memory by its basic nature somehow abstracts the central tendency of a set of disparate experiences, and gives relatively little weight to the specific experiences that gave rise to these abstractions.

    Recently, however, some have come to question this "abstractive" point of view, for two reasons. First, specific events and experiences seem to play a prominent role in memory and learning. Experimental demonstrations of the importance of specific stimulus events even in tasks which have been thought to involve abstraction of a concept or rule are now legion.

    Preparation of this article was supported in part by a grant from the Systems Development Foundation and in part by National Science Foundation Grant BNS-79-24062. The first author is a recipient of a Career Development Award from the National Institute of Mental Health (5-K01-MH00385).

    This article was originally presented at a conference organized by Lee Brooks and Larry Jacoby on "The Priority of the Specific." We would like to thank the organizers, as well as several of the participants, particularly Doug Medin and Rich Shiffrin, for stimulating discussion and for empirical input to the development of this article.

    Requests for reprints should be sent to James L. McClelland, Department of Psychology, Carnegie-Mellon University, Pittsburgh, Pennsylvania 15213, or to David E. Rumelhart, Institute for Cognitive Science, University of California—San Diego, La Jolla, California 92093.



    Responses in categorization tasks (Brooks, 1978; Medin & Schaffer, 1978), perceptual identification tasks (Jacoby, 1983a, 1983b; Whittlesea, 1983), and pronunciation tasks (Glushko, 1979) all seem to be quite sensitive to the congruity between particular training stimuli and particular test stimuli, in ways which most abstraction models would not expect.

    At the same time, a number of models have been proposed in which behavior which has often been characterized as rule-based or concept-based is attributed to a process that makes use of stored traces of specific events or specific exemplars of the concepts or rules. According to this class of models, the apparently rule-based or concept-based behavior emerges from what might be called a conspiracy of individual memory traces or from a sampling of one from the set of such traces. Models of this class include the Medin and Schaffer (1978) context model, Hintzman's (1983) multiple trace model, and Whittlesea's (1983) episode model. This trend is also exemplified by our interactive activation model of word perception (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1981, 1982), and an extension of the interactive activation model to generalization from exemplars (McClelland, 1981).

    One feature of some of these exemplar-based models troubles us. Many of them are internally inconsistent with respect to the issue of abstraction. Thus, though our word perception model assumes that linguistic rules emerge from a conspiracy of partial activations of detectors for particular words, thereby eliminating the need for abstraction of rules, the assumption that there is a single detector for each word implicitly assumes that there is an abstraction process that lumps each occurrence of the same word into the same single detector unit. Thus, the model has its abstraction and creates it too, though at slightly different levels.

    One logically coherent response to this inconsistency is to simply say that each word or other representational object is itself a conspiracy of the entire ensemble of memory traces of the different individual experiences we have had with that unit. We will call this view the enumeration of specific experiences view. It is exemplified most clearly by Jacoby (1983a, 1983b), Hintzman (1983), and Whittlesea (1983).

    As the papers just mentioned demonstrate, enumeration of specific experiences can work quite well as an account of quite a number of empirical findings. However, there still seems to be one drawback. Such models seem to require an unlimited amount of storage capacity, as well as mechanisms for searching an almost unlimited mass of data. This is especially true when we consider that the primitives out of which we normally assume one experience is built are themselves abstractions. For example, a word is a sequence of letters, or a sentence is a sequence of words. Are we to believe that all of these abstractions are mere notational conveniences for the theorist, and that every event is stored as an extremely rich (obviously structured) representation of the event, with no abstraction?

    In this article, we consider an alternative conceptualization: a distributed, superpositional approach to memory. This view is similar to the separate enumeration of experiences view in some respects, but not in all. On both views, memory consists of traces resulting from specific experiences; and on both views, generalizations emerge from the superposition of these specific memory traces. Our model differs, though, from the enumeration of specific experiences in assuming that the superposition of traces occurs at the time of storage. We do not keep each trace in a separate place, but rather we superimpose them so that what the memory contains is a composite.

    Our theme will be to show that distributed models provide a way to resolve the abstraction-representation of specifics dilemma. With a distributed model, the superposition of traces automatically results in abstraction, though it can still preserve to some extent the idiosyncrasies of specific events and experiences, or of specific recurring subclasses of events and experiences.

    We will begin by introducing a specific version of a distributed model of memory. We will show how it works and describe some of its basic properties. We will show how our model can account for several recent findings (Salasoo, Shiffrin, & Feustel, 1985; Whittlesea, 1983) on the effects of specific experiences on later performance, and the conditions under which functional equivalents of abstract representations such as prototypes or logogens emerge.


    The discussion considers generalizations of the approach to the semantic-episodic distinction and the acquisition of linguistic rule systems, and considers reasons for preferring a distributed-superpositional memory over other models.

    Previous, related models. Before we get down to work, some important credits are in order. Our distributed model draws heavily from the work of Anderson (e.g., 1977, 1983; Anderson, Silverstein, Ritz, & Jones, 1977; Knapp & Anderson, 1984) and Hinton (1981a). We have adopted and synthesized what we found to be the most useful aspects of their distinct but related models, preserving (we hope) the basic spirit of both. We view our model as an exemplar of a class of existing models whose exploration Hinton, Anderson, Kohonen (e.g., Kohonen, 1977; Kohonen, Oja, & Lehtio, 1981), and others have pioneered. A useful review of prior work in this area can be obtained from Anderson and Hinton (1981) and other articles in the volume edited by Hinton and Anderson (1981). Some points similar to those we will be making have recently been covered in the papers of Murdock (1982) and Eich (1982), though the distributed representations we use are different in important ways from the representations used by these other authors.

    Our distributed model is not a complete theory of human information processing and memory. It is a model of the internal structure of some components of information processing, in particular those concerned with the retrieval and use of prior experience. The model does not specify in and of itself how these acts of retrieval and use are planned, sequenced, and organized into coherent patterns of behavior.

    A Distributed Model of Memory

    General Properties

    Our model adheres to the following general assumptions, some of which are shared with several other distributed models of processing and memory.

    Simple, highly interconnected units. The processing system consists of a collection of simple processing units, each interconnected with many other units. The units take on activation values, and communicate with other units by sending signals modulated by weights associated with the connections between the units. Sometimes, we may think of the units as corresponding to particular representational primitives, but they need not. For example, even what we might consider to be a primitive feature of something, like having a particular color, might be a pattern of activation over a collection of units.

    Modular structure. We assume that the units are organized into modules. Each module receives inputs from other modules, the units within the module are richly interconnected with each other, and they send outputs to other modules. Figure 1 illustrates the internal structure of a very simple module, and Figure 2 illustrates some hypothetical interconnections between a number of modules. Both figures grossly underrepresent our view of the numbers of units per module and the number of modules. We would imagine that there would be thousands to millions of units per module and many hundreds or perhaps many thousands of partially redundant modules in anything close to a complete memory system.

    The state of each module represents a synthesis of the states of all of the modules it receives inputs from. Some of the inputs will be from relatively more sensory modules, closer to the sensory end-organs of one modality or another. Others will come from relatively more abstract modules, which themselves receive inputs from and send outputs to other modules placed at the abstract end of several different modalities. Thus, each module combines a number of different sources of information.

    Mental state as pattern of activation. In a distributed memory system, a mental state is a pattern of activation over the units in some subset of the modules. The patterns in the different modules capture different aspects of the content of the mental states in a partially overlapping fashion. Alternative mental states are simply alternative patterns of activation over the modules. Information processing is the process of evolution in time of mental states.


    Figure 1. A simple information processing module, consisting of a small ensemble of eight processing units. [Each unit receives inputs from other modules (indicated by the single input impinging on the input line of the node from the left; this can stand for a number of converging input signals from several nodes outside the module) and sends outputs to other modules (indicated by the output line proceeding to the right from each unit). Each unit also has a modifiable connection to all the other units in the same module, as indicated by the branches of the output lines that loop back onto the input lines leading into each unit. All connections, which may be positive or negative, are represented by dots.]


    Units play specific roles within patterns. A pattern of activation only counts as the same as another if the same units are involved.

    Figure 2. An illustrative diagram showing several modules and interconnections among them. (Arrows between modules simply indicate that some of the nodes in one module send inputs to some of the nodes in the other. The exact number and organization of modules is of course unknown; the figure is simply intended to be suggestive.)

    The reason for this is that the knowledge built into the system for recreating the patterns is built into the set of interconnections among the units, as we will explain later. For a pattern to access the right knowledge it must arise on the appropriate units. In this sense, the units play specific roles in the patterns. Obviously, a system of this sort is useless without sophisticated perceptual processing mechanisms at the interface between memory and the outside world, so that similar input patterns arising at different locations in the world can be mapped into the same set of units internally. Such mechanisms are outside the scope of this article (but see Hinton, 1981b; McClelland, 1985).

    Memory traces as changes in the weights. Patterns of activation come and go, leaving traces behind when they have passed. What are the traces? They are changes in the strengths or weights of the connections between the units in the modules.

    This view of the nature of the memory trace clearly sets these kinds of models apart from traditional models of memory in which some copy of the "active" pattern is generally thought of as being stored directly.


    Instead of this, what is actually stored in our model is changes in the connection strengths. These changes are derived from the presented pattern, and are arranged in such a way that, when a part of a known pattern is presented for processing, the interconnection strengths cause the rest of the pattern to be reinstated. Thus, although the memory trace is not a copy of the learned pattern, it is something from which a replica of that pattern can be recreated. As we already said, each memory trace is distributed over many different connections, and each connection participates in many different memory traces. The traces of different mental states are therefore superimposed in the same set of weights. Surprisingly enough, as we will see in several examples, the connections between the units in a single module can store the information needed to complete many different familiar patterns.

    Retrieval as reinstatement of prior pattern of activation. Retrieval amounts to partial reinstatement of a mental state, using a cue which is a fragment of the original state. For any given module, we can see the cues as originating from outside of it. Some cues could arise ultimately from sensory input. Others would arise from the results of previous retrieval operations fed back to the memory system under the control of a search or retrieval plan. It would be premature to speculate on how such schemes would be implemented in this kind of a model, but it is clear that they must exist.

    Detailed Assumptions

    In the rest of our presentation, we will be focusing on operations that take place within a single module. This obviously oversimplifies the behavior of a complete memory system because the modules are assumed to be in continuous interaction. The simplification is justified, however, in that it allows us to focus on some of the basic properties of distributed memory that are visible even without these interactions with other modules.

    Let us look, therefore, at the internal structure of one very simple module, as shown in Figure 1. Again, our image is that in a real system there would be much larger numbers of units. We have restricted our analysis to small numbers simply to illustrate basic principles as clearly as possible; this also helps to keep the running time of simulations in bounds.

    Activation values. The units take on activation values which range from -1 to +1. Zero represents in this case a neutral resting value, toward which the activations of the units tend to decay.

    Inputs, outputs, and internal connections. Each unit receives input from other modules and sends output to other modules. For the present, we assume that the inputs from other modules occur at connections whose weights are fixed. In the simulations, we treat the input from outside the module as a fixed pattern, ignoring (for simplicity) the fact that the input pattern evolves in time and might be affected by feedback from the module under study. Although the input to each unit might arise from a combination of sources in other modules, we can lump the external input to each unit into a single real-valued number representing the combined effects of all components of the external input. In addition to extra-modular connections, each unit is connected to all other units in the module via a weighted connection. The weights on these connections are modifiable, as described later. The weights can take on any real values, positive, negative, or 0. There is no connection from a unit onto itself.

    The processing cycle. Processing within a module takes place as follows. Time is divided into discrete ticks. An input pattern is presented at some point in time over some or all of the input lines to the module and is then left on for several ticks, until the pattern of activation it produces settles down and stops changing.

    Each tick is divided into two phases. In the first phase, each unit determines its net input, based on the external input to the unit and the activations of all of the units at the end of the preceding tick, modulated by the weight coefficients which determine the strength and direction of each unit's effect on every other.

    For mathematical precision, consider two units in our module, and call one of them unit i, and the other unit j. The input to unit i from unit j, written $i_{ij}$, is just

    $$i_{ij} = a_j w_{ij},$$

    where $a_j$ is the activation of unit j, and $w_{ij}$ is the weight constant modulating the effect of unit j on unit i. The total input to unit i from all other units internal to the module, $I_i$, is then just the sum of all of these separate inputs:

    $$I_i = \sum_j i_{ij}.$$

    Here, j ranges over all units in the module other than i. This sum is then added to the external input to the unit, arising from outside the module, to obtain the net input to unit i:

    $$n_i = I_i + e_i,$$

    where $e_i$ is just the lumped external input to unit i.

    In the second phase, the activations of the units are updated. If the net input is positive, the activation of the unit is incremented by an amount proportional to the distance left to the ceiling activation level of +1.0. If the net input is negative, the activation is decremented by an amount proportional to the distance left to the floor activation level of -1.0. There is also a decay factor which tends to pull the activation of the unit back toward the resting level of 0.

    Mathematically, we can express these assumptions as follows: For unit i, if $n_i > 0$,

    $$\Delta a_i = E n_i (1 - a_i) - D a_i.$$

    If $n_i \le 0$,

    $$\Delta a_i = E n_i [a_i - (-1)] - D a_i.$$

    In these equations, $E$ and $D$ are global parameters which apply to all units, and set the rates of excitation and decay, respectively. The term $a_i$ is the activation of unit i at the end of the previous cycle, and $\Delta a_i$ is the change in $a_i$; that is, it is the amount added to (or, if negative, subtracted from) the old value $a_i$ to determine its new value for the next cycle.

    Given a fixed set of inputs to a particular unit, its activation level will be driven up or down in response until the activation reaches the point where the incremental effects of the input are balanced by the decay. In practice, of course, the situation is complicated by the fact that as each unit's activation is changing it alters the input to the others. Thus, it is necessary to run the simulation to see how the system will behave for any given set of inputs and any given set of weights. In all the simulations reported here, the model is allowed to run for 50 cycles, which is considerably more than enough for it to achieve a stable pattern of activation over all the units.
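    As a concrete illustration, the following is a minimal Python sketch of this settling computation for one module. The update equations and the 50-cycle duration come from the text above; everything else (the function name, the use of numpy, the particular values of E and D, starting the activations at rest) is our own illustrative choice, not a detail given in the paper.

```python
import numpy as np

def settle(external, weights, E=0.15, D=0.15, n_ticks=50):
    """Two-phase processing cycle for a single module.

    external : lumped external input e_i to each unit (1-D array)
    weights  : internal connection matrix; weights[i, j] is w_ij, the
               strength of the connection to unit i from unit j
               (the diagonal is zero: no unit connects to itself)
    """
    a = np.zeros(len(external))          # activations start at rest (0)
    for _ in range(n_ticks):
        # Phase 1: net input = external input + weighted internal input.
        net = external + weights @ a
        # Phase 2: move toward the ceiling (+1) or floor (-1) in
        # proportion to the net input, with decay back toward rest.
        growth = np.where(net > 0,
                          E * net * (1 - a),   # distance left to +1
                          E * net * (1 + a))   # distance left to -1
        a += growth - D * a
    return a
```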

    Memory traces. The memory trace of a particular pattern of activation is a set of changes in the entire set of weights in the module. We call the whole set of changes an increment to the weights. After a stable pattern of activation is achieved, weight adjustment takes place. This is thought of as occurring simultaneously for all of the connections in the module.

    The delta rule. The rule that determines the size and direction (up or down) of the change at each connection is the crux of the model. The idea is often difficult to grasp on first reading, but once it is understood it seems very simple, and it directly captures the goal of facilitating the completion of the pattern, given some part of the pattern as a retrieval or completion cue.

    To allow each part of a pattern to reconstruct the rest of the pattern, we simply want to set up the internal connections among the units in the module so that when part of the pattern is presented, activating some of the units in the module, the internal connections will lead the active units to tend to reproduce the rest. To do this, we want to make the internal input to each unit have the same effect on the unit that the external input has on the unit. That is, given a particular pattern to be stored, we want to find a set of connections such that the internal input to each unit from all of the other units matches the external input to that unit. The connection change procedure we will describe has the effect of moving the weights of all the connections in the direction of achieving this goal.

    The first step in weight adjustment is to see how well the module is already doing. If the network is already matching the external input to each unit with the internal input from the other units, the weights do not need to be changed. To get an index of how well the network is already doing at matching its excitatory input, we assume that each unit i computes the difference $\Delta_i$ between its external input and the net internal input to the unit from the other units in the module:

    $$\Delta_i = e_i - I_i.$$

    In determining the activation value of the unit, we added the external input together with the internal input. Now, in adjusting the weights, we are taking the difference between these two terms. This implies that the unit must be able to aggregate all inputs for purposes of determining its activation, but it must be able to distinguish between external and internal inputs for purposes of adjusting its weights.

    Let us consider the term $\Delta_i$ for a moment. If it is positive, the internal input is not activating the unit enough to match the external input to the unit. If negative, it is activating the unit too much. If zero, everything is fine and we do not want to change anything. Thus, $\Delta_i$ determines the magnitude and direction of the overall change that needs to be made in the internal input to unit i. To achieve this overall effect, the individual weights are then adjusted according to the following formula:

    $$\Delta w_{ij} = S \Delta_i a_j.$$

    The parameter $S$ is just a global strength parameter which regulates the overall magnitude of the adjustments of the weights; $\Delta w_{ij}$ is the change in the weight to i from j.

    We call this weight modification rule the delta rule. It has all the intended consequences; that is, it tends to drive the weights in the direction of the right values to make the internal inputs to a unit match the external inputs. For example, consider the case in which $\Delta_i$ is positive and $a_j$ is positive. In this case, the value of $\Delta_i$ tells us that unit i is not receiving enough excitatory input, and the value of $a_j$ tells us that unit j has positive activation. In this case, the delta rule will increase the weight from j to i. The result will be that the next time unit j has a positive activation, its excitatory effect on unit i will be increased, thereby reducing $\Delta_i$.

    Similar reasoning applies to cases where $\Delta_i$ is negative, $a_j$ is negative, or both are negative. Of course, when either $\Delta_i$ or $a_j$ is 0, $w_{ij}$ is not changed. In the first case, there is no error to compensate for; in the second case, a change in the weight will have no effect the next time unit j has the same activation value.
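    In code, the whole adjustment reduces to an outer product. As before, this is a sketch under our own assumptions (numpy, an illustrative value of S), not code from the paper:

```python
import numpy as np

def delta_rule(external, a, weights, S=0.05):
    """One application of the delta rule after the module has settled.

    a : the stable activation pattern reached while `external` was on
    Returns the updated weight matrix.
    """
    internal = weights @ a                  # I_i, internal input to each unit
    delta = external - internal            # Delta_i = e_i - I_i
    increment = S * np.outer(delta, a)     # Delta w_ij = S * Delta_i * a_j
    np.fill_diagonal(increment, 0.0)       # keep the no-self-connection rule
    return weights + increment
```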

    What the delta rule can and cannot do. The delta rule is a continuous variant of the perceptron convergence procedure (Rosenblatt, 1962), and has been independently invented many times (see Sutton & Barto, 1981, for a discussion). Its popularity is based on the fact that it is an error-correcting rule, unlike the Hebb rule used until recently by Anderson (1977; Anderson et al., 1977). A number of interesting theorems have been proven about this rule (Kohonen, 1977; Stone, 1985). Basically, the important result is that, for a set of patterns which we present repeatedly to a module, if there is a set of weights which will allow the system to reduce $\Delta_i$ to 0 for each unit in each pattern, this rule will find it through repeated exposure to all of the members of the set of patterns.

    It is important to note that the existence of a set of weights that will allow $\Delta$ to be reduced to 0 is not guaranteed, but depends on the structure inherent in the set of patterns which the model is given to learn. To be perfectly learnable by our model, the patterns must conform to the following linear predictability constraint:

    Over the entire set of patterns, the external input to each unit must be predictable from a linear combination of the activations of every other unit.
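    Stated in symbols (our notation, following the definitions above): for every pattern p in the ensemble there must exist a single set of weights $w_{ij}$ such that

    $$e_i^{(p)} = \sum_{j \neq i} w_{ij}\, a_j^{(p)} \qquad \text{for every unit } i.$$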

    This is an important constraint, for there are many sets of patterns that violate it. However, it is necessary to distinguish between the patterns used inside the model, and the stimulus patterns to which human observers might be exposed in experiments, as described by psychologists. For our model to work, it is important for patterns to be assigned to stimuli in a way that will allow them to be learned.

    A crucial issue, then, is the exact manner in which the stimulus patterns are encoded. As a rule of thumb, an encoding which treats each dimension or aspect of a stimulus separately is unlikely to be sufficient; what is required is a context-sensitive encoding, such that the representation of each aspect is colored by the other aspects. For a full discussion of this issue, see Hinton, McClelland, and Rumelhart (in press).


    Table 1
    Behavior of an 8-Unit Distributed Memory Module

    Case                              Input or response for each unit

    Pattern 1
      The Pattern:                     +    -    +    -    +    +    -    -
      Response to Pattern
        before learning               +.5  -.5  +.5  -.5  +.5  +.5  -.5  -.5
      Response to Pattern after
        10 learning trials            +.7  -.7  +.7  -.7  +.7  +.7  -.7  -.7
      Test Input (Incomplete
        version of Pattern)            +    -    +    -
      Response                        +.6  -.6  +.6  -.6  +.4  +.4  -.4  -.4
      Test Input (Distortion
        of Pattern)                    +    -    +    -    +    +    -    +
      Response                        +.6  -.6  +.6  -.6  +.6  +.6  -.6  +.1

    Pattern 2
      The Pattern:                     +    +    -    -    -    +    -    +
      Response to Pattern with
        weights learned for
        Pattern 1                     +.5  +.5  -.5  -.5  -.5  +.5  -.5  +.5
      Response to Pattern after
        10 learning trials            +.7  +.7  -.7  -.7  -.7  +.7  -.7  +.7
      Retest of response to
        Pattern 1                     +.7  -.7  +.7  -.7  +.7  +.7  -.7  -.7


    Decay in the increments to the weights. We assume that each trace or increment undergoes a decay process, though the rate of decay of the increments is assumed to be much slower than the rate of decay of patterns of activation. Following a number of theorists (e.g., Wickelgren, 1979), we imagine that traces at first decay rapidly, but then the remaining portion becomes more and more resistant to further decay. Whether it ever reaches a point where it is no longer decaying at all we do not know. The basic effect of this assumption is that individual inputs exert large short-term effects on the weights, but after they decay the residual effect is considerably smaller. The fact that each increment has its own temporal history increases the complexity of computer simulations enormously. In all of the particular cases to be examined, we will therefore specify simplified assumptions to keep the simulations tractable.

    Illustrative Examples

    In this section, we describe a few illustrative examples to give the reader a feel for how we use the model, and to illustrate key aspects of its behavior.

    Storage and retrieval of several patterns in a single memory module. First, we consider the storage and retrieval of two patterns in a single module of 8 units. Our basic aim is to show how several distinct patterns of activation can all be stored in the same set of weights, by what Lashley (1950) called a kind of algebraic summation, and not interfere with each other.

    Before the first presentation of either pattern, we start out with all the weights set to 0. The first pattern is given at the top of Table 1. It is an arrangement of +1 and -1 inputs to the eight units in the module. (In Table 1, the 1s are suppressed in the inputs for clarity.) When we present the first pattern to this module, the resulting activation values simply reflect the effects of the inputs themselves because none of the units are yet influencing any of the others.

    Then, we teach the module this pattern by presenting it to the module 10 times. Each time, after the pattern of activation has had plenty of time to settle down, we adjust the weights.


    The next time we present the complete pattern after the 10 learning trials, the module's response is enhanced, compared with the earlier situation. That is, the activation values are increased in magnitude, owing to the combined effects of the external and internal inputs to each of the units. If we present an incomplete part of the pattern, the module can complete it; if we distort the pattern, the module tends to drive the activation back in the direction it thinks it ought to have. Of course, the magnitudes of these effects depend on parameters; but the basic nature of the effects is independent of these details.

    Figure 3 shows the weights our learning procedure has assigned. Actual numerical values have been suppressed to emphasize the basic pattern of excitatory and inhibitory influences. In this example, all the numerical values are identical. The pattern of + and - signs simply gives the pattern of pairwise correlations of the elements. This is as it should be to allow pattern enhancement, completion, and noise elimination. Units which have the same activation in the pattern have positive weights, so that when one is activated it will tend to activate the other, and when one is inhibited it will tend to inhibit the other. Units which have different activations in the pattern have negative weights, so that when one is activated it will inhibit the other and vice versa.

    What happens when we present a new pattern, dissimilar to the first? This is illustrated in the lower portion of Table 1. At first, the network responds to it just as though it knew nothing at all: The activations simply reflect the direct effects of the input, as they would in a module with all 0 weights. The reason is simply that the effects of the weights already in the network cancel each other out. This is a result of the fact that the two patterns are maximally dissimilar from each other. If the patterns had been more similar, there would not have been this complete cancellation of effects.

    Now we learn the new pattern, presenting it 10 times and adjusting the weights each time. The resulting weights (Figure 3) represent the sum of the weights for Patterns 1 and 2. The response to the new pattern is enhanced, as shown in Table 1.

    Figure 3. Weights acquired in learning Pattern 1 and Pattern 2 separately, and the composite weights resulting from learning both. (The weight in a given cell reflects the strength of the connection from the corresponding column unit to the corresponding row unit. Only the sign and relative magnitude of the weights are indicated. A blank indicates a weight of 0; + and - signify positive and negative, with a double symbol, ++ or --, representing a value twice as large as a single symbol, + or -.)

    The response to the old, previously learned pattern is not affected. The module will now show enhancement, completion, and noise elimination for both patterns, though these properties are not illustrated in Table 1.

    Thus, we see that more than one pattern can coexist in the same set of weights. There is an effect of storing multiple patterns, of course. When only one pattern is stored, the whole pattern (or at least, a pale copy of it) can be retrieved by driving the activation of any single unit in the appropriate direction. As more patterns are stored, larger subpatterns are generally needed to specify the pattern to be retrieved uniquely.
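    To see these properties end to end, here is a compact sketch that combines the settling and delta-rule steps sketched earlier and replays the two-pattern experiment on two maximally dissimilar (orthogonal) 8-unit patterns like those in Table 1. The parameter values E, D, and S are our own illustrative choices (the paper does not list them here), so the resulting magnitudes will only roughly match the table.

```python
import numpy as np

def settle(e, w, E=0.15, D=0.15, ticks=50):
    """Run the two-phase processing cycle until activations stabilize."""
    a = np.zeros(len(e))
    for _ in range(ticks):
        net = e + w @ a
        a += np.where(net > 0, E * net * (1 - a), E * net * (1 + a)) - D * a
    return a

def learn(e, w, trials=10, S=0.05):
    """Present a pattern `trials` times, adjusting weights by the delta rule."""
    for _ in range(trials):
        a = settle(e, w)
        inc = S * np.outer(e - w @ a, a)   # Delta w_ij = S * (e_i - I_i) * a_j
        np.fill_diagonal(inc, 0.0)         # no self-connections
        w = w + inc
    return w

p1 = np.array([+1., -1., +1., -1., +1., +1., -1., -1.])
p2 = np.array([+1., +1., -1., -1., -1., +1., -1., +1.])   # orthogonal to p1

w = learn(p1, np.zeros((8, 8)))       # store Pattern 1
print(settle(p2, w))                  # roughly +/-.5: p1's weights cancel on p2
w = learn(p2, w)                      # superimpose Pattern 2 on the same weights
print(settle(p1, w))                  # Pattern 1's response is preserved
print(settle(np.concatenate([p1[:4], np.zeros(4)]), w))  # completion from a part
```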

    Learning a Prototype From Exemplars

    In the preceding section, we considered the learning of particular patterns and showed that the delta rule was capable of learning multiple patterns in the same set of connections. In this section, we consider what happens when distributed models using the delta rule are presented with an ensemble of patterns that have some common structure. The examples described in this section illustrate how the delta rule can be used to extract the structure from an ensemble of inputs, and throw away random variability.

    Let us consider the following hypothetical situation. A little boy sees many different dogs, each only once, and each with a different name.


    All the dogs are a little different from each other, but in general there is a pattern which represents the typical dog: each one is just a different distortion of this prototype. (We are not claiming that the dogs in the world have no more structure than this; we make this assumption for purposes of illustration only.) For now we will assume that the names of the dogs are all completely different. Given this experience, we would expect that the boy would learn the prototype of the category, even without ever seeing any particular dog which matches the prototype directly (Posner & Keele, 1968, 1970; Anderson, 1977, applies an earlier version of a distributed model to this case). That is, the prototype will seem as familiar as any of the exemplars, and he will be able to complete the pattern corresponding to the prototype from any part of it. He will not, however, be very likely to remember the names of each of the individual dogs, though he may remember the most recent ones.

    We model this situation with a module consisting of 24 units. We assume that the presentation of a dog produces a visual pattern of activation over 16 of the units in the hypothetical module (the 9th through 24th, counting from left to right). The name of the dog produces a pattern of activation over the other 8 units (Units 1 to 8, counting from left to right).

    Each visual pattern, by assumption, is a distortion of a single prototype. The prototype used for the simulation simply had a random series of +1 and -1 values. Each distortion of the prototype was made by probabilistically flipping the sign of randomly selected elements of the prototype pattern. For each new distorted pattern, each element has an independent chance of being flipped, with a probability of .2. Each name pattern was simply a random sequence of +1s and -1s for the eight name units. Each encounter with a new dog is modeled as a presentation of a new name pattern with a new distortion of the prototype visual pattern. Fifty different trials were run, each with a new name pattern-visual pattern pair.
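    The construction of this training ensemble is simple to express in code. This sketch is our own; the flip probability of .2, the 16 visual units, the 8 name units, and the 50 trials follow the description above, while the helper names and the random seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only for reproducibility

def random_pattern(n):
    """A random series of +1 and -1 values."""
    return rng.choice([+1.0, -1.0], size=n)

def distort(prototype, p_flip=0.2):
    """Independently flip the sign of each element with probability p_flip."""
    flips = np.where(rng.random(len(prototype)) < p_flip, -1.0, 1.0)
    return prototype * flips

dog_prototype = random_pattern(16)            # visual units (9th through 24th)
trials = [np.concatenate([random_pattern(8),  # a brand-new name every time
                          distort(dog_prototype)])
          for _ in range(50)]                 # 50 name-visual pattern pairs
```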

    For each presentation, the pattern of activation is allowed to stabilize, and then the weights are adjusted as before. The increment to the weights is then allowed to decay considerably before the next input is presented. For simplicity, we assume that before the next pattern is presented, the last increment decays to a fixed small proportion of its initial value, and thereafter undergoes no further decay.
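    Under this simplification the bookkeeping is minimal: by the time the next pattern arrives, the previous increment has shrunk to a fixed fraction of its initial size. A sketch of one trial, reusing the settle function from earlier; the residual fraction here is our own illustrative parameter, as the paper does not give its value.

```python
import numpy as np

def trial(e, w, settle, S=0.05, residual=0.3):
    """One learning trial under the simplified decay assumption.

    The increment produced on this trial is folded into the weights at
    only `residual` times its initial size, modeling its rapid decay to
    a fixed proportion before the next pattern is presented.
    """
    a = settle(e, w)                      # settle to a stable pattern
    inc = S * np.outer(e - w @ a, a)      # delta-rule increment
    np.fill_diagonal(inc, 0.0)
    return w + residual * inc             # weights carried into the next trial
```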

    What does the module learn? The module acquires a set of weights which is continually buffeted about by the latest dog exemplar, but which captures the prototype dog quite well. Waiting for the last increment to decay to the fixed residual yields the weights shown in Figure 4.

    These weights capture the correlations among the values in the prototype dog pattern quite well. The lack of exact uniformity is due to the more recent distortions presented, whose effects have not been corrected by subsequent distortions. This is one way in which the model gives priority to specific exemplars, especially recent ones. The effects of recent exemplars are particularly strong, of course, before they have had a chance to decay. The module can complete the prototype quite well, and it will respond more strongly to the prototype than to any distortion of it. It has, however, learned no particular relation between this prototype and any name pattern, because a totally different random association was presented on each trial. If the pattern of activation on the name units had been the same in every case (say, each dog was just called dog), or even in just a reasonable fraction of the cases, then the module would have been able to retrieve this shared name pattern from the prototype of the visual pattern and the prototype pattern from the name.

    Multiple, nonorthogonal prototypes. In the preceding simulation we have seen how the distributed model acts as a sort of signal averager, finding the central tendency of a set of related patterns. In and of itself this is an important property of the model, but the importance of this property increases when we realize that the model can average several different patterns in the same composite memory trace. Thus, several different prototypes can be stored in the same set of weights. This is important, because it means that the model does not fall into the trap of needing to decide which category to put a pattern in before knowing which prototype to average it with. The acquisition of the different prototypes proceeds without any sort of explicit categorization.


    Figure 4. Weights acquired in learning from distorted exemplars of a prototype. (The prototype pattern is shown above the weight matrix. Blank entries correspond to weights with absolute values less than .01; dots correspond to absolute values less than .06; pluses or minuses are used for weights with larger absolute values.)

    If the patterns are sufficiently dissimilar, there is no interference among them at all. Increasing similarity leads to increased confusability during learning, but eventually the delta rule finds a set of connection strengths that minimizes the confusability of similar patterns.

    To illustrate these points, we created a simulation analog of the following hypothetical situation. Let us say that our little boy sees, in the course of his daily experience, different dogs, different cats, and different bagels. First, let's consider the case in which each experience with a dog, a cat, or a bagel is accompanied by someone saying dog, cat, or bagel, as appropriate.

    The simulation analog of this situation involved forming three visual prototype patterns of 16 elements, two of them (the one for dog and the one for cat) somewhat similar to each other (r = .5), and the third (for the bagel) orthogonal to both of the other two. Paired with each visual pattern was a name pattern of eight elements. Each name pattern was orthogonal to both of the others. Thus, the prototype visual pattern for cat and the prototype visual pattern for dog were similar to each other, but their names were not related.
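    For readers who want to reproduce this setup: one way to get two plus-or-minus-1 patterns of length 16 with r = .5 is to copy one and flip exactly 4 of its elements, since flipping k of n elements yields a correlation of (n - 2k)/n. The helper below is our own illustration, not the authors' stated procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def pattern_with_correlation(base, r):
    """Return a +/-1 pattern correlating about r with `base` (r = (n-2k)/n)."""
    n = len(base)
    k = round(n * (1 - r) / 2)                   # number of elements to flip
    out = base.copy()
    flip = rng.choice(n, size=k, replace=False)  # which elements to flip
    out[flip] = -out[flip]
    return out

dog_visual = rng.choice([+1.0, -1.0], size=16)
cat_visual = pattern_with_correlation(dog_visual, 0.5)  # k = 4, so r = .5
```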

    Stimulus presentations involved presentations of distorted exemplars of the name-visual pattern pairs to a module of 24 elements like the one used in the previous simulation. This time, both the name pattern and the visual pattern were distorted, with each element having its sign flipped with an independent probability of .1 on each presentation. Fifty different distortions of each name-visual pattern pair were presented in groups of three consisting of one distortion of the dog pair, one distortion of the cat pair, and one distortion of the bagel pair. Weight adjustment occurred after each presentation, with decay to a fixed residual before each new presentation.

    At the end of training, the module was tested by presenting each name pattern and observing the resulting pattern of activation over the visual nodes, and by presenting each visual pattern and observing the pattern of activation over the name nodes. The results are shown in Table 2.


    Table 2
    Results of Tests After Learning the Dog, Cat, and Bagel Patterns

    Case                     Name units                 Visual pattern units

    Pattern for dog
      prototype              + - + - + - + -            + - + + - - - - + + + + + - - -
    Response to dog name                                +3 -4 +4 +4 -4 -4 -4 -4 +4 +4 +4 +3 +4 -4 -4 -3
    Response to dog
      visual pattern         +5 -4 +4 -5 +5 -4 +4 -4

    Pattern for cat
      prototype              + + - - + + - -            + - + + - - - - + - + - + + - +
    Response to cat name                                +4 -3 +4 +4 -4 -3 -3 -4 +4 -4 +4 -4 +4 +4 -4 +4
    Response to cat
      visual pattern         +5 +4 -4 -5 +4 +4 -4 -4

    Pattern for bagel
      prototype              + - - + + - - +            + + - + - + + - + - - + + + + -
    Response to bagel name                              +3 +4 -4 +4 -4 +4 +4 -4 +4 -4 -4 +4 +3 +4 +4 -4
    Response to bagel
      visual pattern         +4 -4 -4 +4 +4 -4 -4 +4

    Note. Decimal points have been suppressed for clarity; thus, an entry of +4 represents an activation value of +.4.

    In each case, the model reproduces the correct completion for the probe, and there is no apparent contamination of the cat pattern by the dog pattern, even though the visual patterns are similar to each other.

    In general, pattern completion is a matter of degree. One useful measure of pattern completion is the dot product of the pattern of activation over the units with the pattern of external inputs to the units. Because we treat the external inputs as +1s and -1s, and because the activation of each node can only range from +1 to -1, the largest possible value the dot product can have is 1.0. We will use this measure explicitly later when considering some simulations of experimental results. For getting an impression of the degree of pattern reinstatement in the present cases, it is sufficient to note that when the sign of all of the elements is correct, as it is in all of the completions in Table 2, the average magnitude of the activations of the units corresponds to the dot product.
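    Since the stated maximum is 1.0, the dot product here must be understood as normalized by the number of units (equivalently, the average over units of activation times external input, which is why the average magnitude corresponds to it when all signs are correct). A one-line sketch of the measure, in our notation:

```python
import numpy as np

def completion(activation, external):
    """Normalized dot product of the activation pattern with the pattern
    of +/-1 external inputs; 1.0 means every unit is driven all the way
    to its externally specified value."""
    return float(np.dot(activation, external)) / len(external)
```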

    In a case like the present one, in which some of the patterns known to the model are correlated, the values of the connection strengths that the model produces do not necessarily have a simple interpretation. Though their sign always corresponds to the sign of the correlation between the activations of the two units, their magnitude is not a simple reflection of the magnitude of their correlation, but is influenced by the degree to which the model is relying on this particular correlation to predict the activation of one node from the others. Thus, in a case where two nodes (call them i and j) are perfectly correlated, the strength of the connection from i to j will depend on the number of other nodes whose activations are correlated with j. If i is the only node correlated with j, it will have to do all the work of predicting j, so the weight will be very strong; on the other hand, if many nodes besides i are correlated with j, then the work of predicting j will be spread around, and the weight between i and j will be considerably smaller. The weight matrix acquired as a result of learning the dog, cat, and bagel patterns (Figure 5) reflects these effects. For example, across the set of three prototypes, Units 1 and 5 are perfectly correlated, as are Units 2 and 6. Yet the connection from 2 to 6 is stronger than the connection from 1 to 5 (these connections are starred in Figure 5). The reason for the difference is that 2 is one of only three units which correlate perfectly with 6, whereas Unit 1 is one of seven units which correlate perfectly with 5.


    (In Figure 5, the weights do not reflect these contrasts perfectly in every case, because the noise introduced into the learning happens by chance to alter some of the correlations present in the prototype patterns. Averaged over time, though, the weights will conform to their expected values.)

    Thus far we have seen that several prototypes, not necessarily orthogonal, can be stored in the same module without difficulty. It is true, though we do not illustrate it, that the model has more trouble with the cat and dog visual patterns earlier on in training, before learning has essentially reached asymptotic levels, as it has by the end of 50 cycles through the full set of patterns. And, of course, even at the end of learning, if we present as a probe a part of the visual pattern that does not differentiate between the dog and the cat, the model will produce a blended response. Both these aspects of the model seem generally consistent with what we should expect from human subjects.

    Category learning without labels. An important further fact about the model is that it can learn several different visual patterns, even without the benefit of distinct identifying name patterns during learning. To demonstrate this we repeated the previous simulation, simply replacing the name patterns with 0s. The model still learns about the internal structure of the visual patterns, so that, after 50 cycles through the stimuli, any unique subpart of any one of the patterns is sufficient to reinstate the rest of the corresponding pattern correctly.

    Prototypes (name and visual patterns, as in Table 2):
      Dog:   + - + - + - + -   + - + + - - - - + + + + + - - -
      Cat:   + + - - + + - -   + - + + - - - - + - + - + + - +
      Bagel: + - - + + - - +   + + - + - + + - + - - + + + + -

    Figure 5. Weights acquired in learning the three prototype patterns shown. (Blanks in the matrix of weights correspond to weights with absolute values less than or equal to .05. Otherwise the actual value of the weight is about .05 times the value shown; thus +5 stands for a weight of +.25. The gap in the horizontal and vertical dimensions is used to separate the name field from the visual pattern field.)


    Table 3
    Results of Tests After Learning the Dog, Cat, and Bagel Patterns Without Names

    Case             Input or response for each visual unit

    Dog visual
      pattern         +  -  +  +  -  -  -  -  +  +  +  +  +  -  -  -
    Probe                                     +  +  +  +
    Response         +3 -3 +3 +3 -3 -4 -3 -3 +6 +5 +6 +5 +3 -2 -3 -1

    Cat visual
      pattern         +  -  +  +  -  -  -  -  +  -  +  -  +  +  -  +
    Probe                                     +  -  +  -
    Response         +3 -3 +3 +3 -3 -3 -3 -3 +6 -5 +6 -5 +3 +2 -3 +2

    Bagel visual
      pattern         +  +  -  +  -  +  +  -  +  -  -  +  +  +  +  -
    Probe                                     +  -  -  +
    Response         +2 +3 -4 +3 -3 +3 +3 -3 +6 -6 -6 +6 +3 +3 +3 -3

    This aspect of the model's behavior is illustrated in Table 3. Thus, we have a model that can, in effect, acquire a number of distinct categories, simply through a process of incrementing connection strengths in response to each new stimulus presentation. Noise, in the form of distortions in the patterns, is filtered out. The model does not require a name or other guide to distinguish the patterns belonging to different categories.

    Coexistence of the prototype and repeated exemplars. One aspect of our discussion up to this point may have been slightly misleading. We may have given the impression that the model is simply a prototype extraction device. It is more than this, however; it is a device that captures whatever structure is present in a set of patterns. When the set of patterns has a prototype structure, the model will act as though it is extracting prototypes; but when it has a different structure, the model will do its best to accommodate this as well. For example, the model permits the coexistence of representations of prototypes with representations of particular, repeated exemplars.

    Consider the following situation. Let us say that our little boy knows a dog next door named Rover and a dog at his grandma's house named Fido. And let's say that the little boy goes to the park from time to time and sees dogs, each of which his father tells him is a dog.

    The simulation analog of this involved three different eight-element name patterns, one for Rover, one for Fido, and one for Dog. The visual pattern for Rover was a particular randomly generated distortion of the dog prototype pattern, as was the visual pattern for Fido. For the dogs seen in the park, each one was simply a new random distortion of the prototype. The probability of flipping the sign of each element was again .2. The learning regime was otherwise the same as in the dog-cat-bagel example.

    At the end of 50 learning cycles, the model was able to retrieve the visual pattern corresponding to either repeated exemplar (see Table 4), given the associated name as input. When given the Dog name pattern as input, it retrieves the prototype visual pattern for dog. It can also retrieve the appropriate name from each of the three visual patterns. This is true even though the visual pattern for Rover differs from the visual pattern for dog by only a single element. Because of the special importance of this particular element, the weights from this element to the units that distinguish Rover's name pattern from the prototype name pattern are quite strong. Given part of a visual pattern, the model will complete it; if the part corresponds to the prototype, then that is what is completed, but if it corresponds to one of the repeated exemplars, that exemplar is completed. The model, then, knows both the prototype and the repeated exemplars quite well. Several other sets of prototypes and their repeated exemplars could also be stored in the same module, as long as its capacity is not exceeded; given large numbers of units per module, a lot of different patterns can be stored.



Table 4
Results of Tests with Prototype and Specific Exemplar Patterns

                                     Input or response for each unit
Case                                 Name units                 Visual pattern units

Pattern for dog prototype            + - + - + - + -            + - + + - - - - + + + + + - - -
Response to prototype name                                      +4 -5 +3 +3 -4 -3 -3 -3 +3 +3 +4 +3 +4 -3 -4 -4
Response to prototype visual         +5 -4 +4 -4 +5 -4 +4 -4
  pattern

Pattern for "Fido" exemplar          + - - - + - - -            + - (-) + - - - - + + + + + (+) - -
Response to Fido name                                           +4 -4 -4 +4 -4 -4 -4 -4 +4 +4 +4 +4 +4 +4 -4 -4
Response to Fido visual pattern      +5 -5 -3 -5 +4 -5 -3 -5

Pattern for "Rover" exemplar         + - - + + + - +            + (+) + + - - - - + + + + + - - -
Response to Rover name                                          +4 +5 +4 +4 -4 -4 -4 -4 +4 +4 +4 +4 +4 -4 -4 -4
Response to Rover visual pattern     +4 -4 -2 +4 +4 +4 -2 +4

Note. Parenthesized elements mark where an exemplar's visual pattern differs from the prototype.

Let us summarize the observations we have made in these several illustrative simulations. First, our distributed model is capable of storing not just one but a number of different patterns. It can pull the central tendency of a number of different patterns out of the noisy inputs; it can create the functional equivalent of perceptual categories with or without the benefit of labels; and it can allow representations of repeated exemplars to coexist with the representation of the prototype of the categories they exemplify in the same composite memory trace. The model is not simply a categorizer or a prototyping device; rather, it captures the structure inherent in a set of patterns, whether it be characterizable by description in terms of prototypes or not.

The ability to retrieve accurate completions of similar patterns is a property of the model which depends on the use of the delta learning rule. This allows both the storage of different prototypes that are not completely orthogonal and the coexistence of prototype representations and repeated exemplars.
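To make the role of the delta rule explicit, here is a minimal sketch of the kind of update involved (our reconstruction under simplifying assumptions, not the authors' implementation; the learning rate is a placeholder). Each weight change is proportional to the difference between the external input to a unit and the internal input the existing connections already supply, so increments shrink as a pattern becomes well learned:

```python
import numpy as np

def delta_rule_update(w, pattern, lr=0.05):
    """One delta-rule increment for storing a +1/-1 pattern in weight matrix w."""
    internal = w @ pattern            # what the current connections reproduce
    error = pattern - internal        # the part of the input not yet accounted for
    increment = lr * np.outer(error, pattern)
    np.fill_diagonal(increment, 0.0)  # no unit connects to itself
    return w + increment
```

Because the increment is driven by the unpredicted part of the input, correlated (non-orthogonal) patterns stop interfering with one another once each is reproduced accurately, which is what lets prototypes and repeated exemplars coexist in the same weights.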

    Simulations of Experimental Results

Up to this point, we have discussed our distributed model in general terms and have outlined how it can accommodate both abstraction and representation of specific information in the same network. We will now consider, in the next two sections, how well the model does in accounting for some recent evidence about the details of the influence of specific experiences on performance.

    Repetition and Familiarity Effects

When we perceive an item (say a word, for example), this experience has effects on our later performance. If the word is presented again, within a reasonable interval of time, the prior presentation makes it possible for us to recognize the word more quickly, or from a briefer presentation.

Traditionally, this effect has been interpreted in terms of units that represent the presented items in memory. In the case of word perception, these units are called word detectors or logogens, and a model of repetition effects for words has been constructed around the logogen concept (Morton, 1979). The idea is that the threshold for the logogen is reduced every time it fires (that is, every time the word is recognized), thereby making it easier to fire the logogen at a later time. There is supposed to be a decay of this priming effect, with time, so that eventually the effect of the first presentation wears off.


This traditional interpretation has come under serious question of late, for a number of reasons. Perhaps paramount among the reasons is the fact that the exact relation between the specific context in which the priming event occurs and the context in which the test event occurs makes a huge difference (Jacoby, 1983a, 1983b). Generally speaking, nearly any change in the stimulus (from spoken to printed, from male speaker to female speaker, and so forth) tends to reduce the magnitude of the priming effect.

These facts might easily be taken to support the enumeration of specific experiences view, in which the logogen is replaced by the entire ensemble of experiences with the word, with each experience capturing aspects of the specific context in which it occurred. Such a view has been championed most strongly by Jacoby (1983a, 1983b).

Our distributed model offers an alternative interpretation. We see the traces laid down by the processing of each input as contributing to the composite, superimposed memory representation. Each time a stimulus is processed, it gives rise to a slightly different memory trace: either because the item itself is different or because it occurs in a different context that conditions its representation. The logogen is replaced by the set of specific traces, but the traces are not kept separate. Each trace contributes to the composite, but the characteristics of particular experiences tend nevertheless to be preserved, at least until they are overridden by cancelling characteristics of other traces. And the traces of one stimulus pattern can coexist with the traces of other stimuli, within the same composite memory trace.

It should be noted that we are not faulting either the logogen model or models based on the enumeration of specific experiences for their physiological implausibility here, because these models are generally not stated in physiological terms, and their authors might reasonably argue that nothing in their models precludes distributed storage at a physiological level. What we are suggesting is that a model which proposes explicitly distributed, superpositional storage can account for the kinds of findings that logogen models have been proposed to account for, as well as other findings which strain the utility of the concept of the logogen as a psychological construct. In the discussion section we will consider ways in which our distributed model differs from enumeration models as well.

To illustrate the distributed model's account of repetition priming effects, we carried out the following simulation experiment. We made up a set of eight random vectors, each 24 elements long, each one to be thought of as the prototype of a different recurring stimulus pattern. Through a series of 10 training cycles using the set of eight vectors, we constructed a composite memory trace. During training, the model did not actually see the prototypes, however. On each training presentation it saw a new random distortion of one of the eight prototypes. In each of the distortions, each of the 24 elements had its value flipped with a probability of .1. Weights were adjusted after every presentation, and then allowed to decay to a fixed residual before the presentation of the next pattern.
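Assembled into a training loop, the regime just described might look like the following sketch (ours, not the authors' code; the learning rate, the residual fraction, and the reading of "decay to a fixed residual" as retaining a fixed fraction of each increment are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_protos, n_cycles = 24, 8, 10
lr, residual = 0.05, 0.5   # placeholder values

prototypes = rng.choice([-1.0, 1.0], size=(n_protos, n_units))
w = np.zeros((n_units, n_units))

def distort(p, p_flip=0.1):
    return np.where(rng.random(p.shape) < p_flip, -p, p)

for _ in range(n_cycles):
    for proto in prototypes:
        item = distort(proto)                 # the model never sees the prototype itself
        error = item - w @ item               # delta-rule error signal
        increment = lr * np.outer(error, item)
        np.fill_diagonal(increment, 0.0)      # no self-connections
        w += residual * increment             # only the residual of each trace persists
```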

The composite memory trace formed as a result of the experience just described plays the same role in our model that the set of logogens or detectors play in a model like Morton's or, indeed, the interactive activation model of word perception. That is, the trace contains information which allows the model to enhance perception of familiar patterns, relative to unfamiliar ones. We demonstrate this by comparing the activations resulting from the processing of subsequent presentations of new distortions of our eight familiar patterns with other random patterns with which the model is not familiar. The pattern of activation that is the model's response to the input is stronger, and grows to a particular level more quickly, if the stimulus is a new distortion of an old pattern than if it is a new pattern. We already observed this general enhanced response to exact repetitions of familiar patterns in our first example (see Table 1). Figure 6 illustrates that the effect also applies to new distortions of old patterns, as compared with new patterns, and illustrates how the activation process proceeds over successive time cycles of processing.

Figure 6. Growth of the pattern of activation for new distortions of familiar and unfamiliar patterns. (The measure of the strength of the pattern of activation is the dot product of the response pattern with the input vector. See text for an explanation.)

Pattern activation and response strength. The measure of activation shown in Figure 6 is the dot product of the pattern of activation over the units of the module with the stimulus pattern itself, normalized for the number $n$ of elements in the pattern. For pattern $j$ we call this expression $a_j$:

$$a_j = \frac{1}{n} \sum_{i=1}^{n} a_i s_{ij},$$

where $a_i$ is the activation of unit $i$ and $s_{ij}$ is the value of element $i$ of stimulus pattern $j$. The expression $a_j$ represents the degree to which the actual pattern of activation on the units captures the input pattern. It is an approximate analog to the activation of an individual unit in models which allocate a single unit to each whole pattern.

To relate these pattern activations to response probabilities, we must assume that mechanisms exist for translating patterns of activation into overt responses measurable by an experimenter. We will assume that these mechanisms obey the principles stated by McClelland and Rumelhart (1981) in the interactive activation model of word perception, simply replacing the activations of particular units with the $a$ measure of pattern activation.

In the interactive activation model, the probability of choosing the response appropriate to a particular unit was based on an exponential transform of a time average of the activation of the unit. This quantity, called the strength of the particular response, was divided by the total strength of all alternatives (including itself) to find the response probability (Luce, 1963). One complication arises because of the fact that it is not in general possible to specify exactly what the set of alternative responses might be for the denominator. For this reason, the strengths of other responses are represented by a constant $C$ (which stands for the competition). Thus, the expression for the probability of choosing the response appropriate to pattern $j$ is just $p(r_j) = e^{k\bar{a}_j}/(C + e^{k\bar{a}_j})$, where $\bar{a}_j$ represents the time average of $a_j$, and $k$ is a scaling constant.
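In code, the two quantities just defined reduce to a few lines (a sketch; $k$ and $C$ are the free parameters named in the text, but the particular values below are our own):

```python
import numpy as np

def pattern_activation(activations, stimulus):
    """a_j: dot product of the module's activation pattern with
    stimulus pattern j, normalized by the number of elements n."""
    return activations @ stimulus / stimulus.size

def response_probability(a_bar_j, k=10.0, C=1.0):
    """Luce choice rule with unspecified competitors lumped into C:
    p(r_j) = exp(k * a_bar_j) / (C + exp(k * a_bar_j))."""
    strength = np.exp(k * a_bar_j)
    return strength / (C + strength)
```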

These assumptions finesse an important issue, namely the mechanism by which a pattern of activation gives rise to a particular response. A detailed discussion of this issue will appear in Rumelhart and McClelland (in press). For now, we wish only to capture basic properties any actual response selection mechanism must have: It must be sensitive to the input pattern, and it must approximate other basic aspects of response selection behavior captured by the Luce (1963) choice model.


Figure 7. Simulated growth of response accuracy over the units in a 24-unit module, as a function of processing cycles, for new distortions of previously learned patterns compared with new distortions of patterns not previously learned.

Effects of experimental variables on time-accuracy curves. Applying the assumptions just described, we can calculate the probability of correct response as a function of processing cycles for familiar and unfamiliar patterns. The result, for a particular choice of scaling parameters, is shown in Figure 7. If we assume performance in a perceptual identification task is based on the height of the curve at the point where processing is cut off by masking (McClelland & Rumelhart, 1981), then familiarity would lead to greater accuracy of perceptual identification at a given exposure duration. In a reaction time task, if the response is emitted when its probability reaches a particular threshold value (McClelland, 1979), then familiarity would lead to speeded responses. Thus, the model is consistent with the ubiquitous influence of familiarity both on response accuracy and speed, in spite of the fact that it has no detectors for familiar stimuli.

But what about priming and the role of congruity between the prime event and the test event? To examine this issue, we carried out a second experiment. Following learning of eight patterns as in the previous experiment, new distortions of half of the random vectors previously learned by the model were presented as primes. For each of these primes, the pattern of activation was allowed to stabilize, and changes in the strengths of the connections in the model were then made. We then tested the model's response to (a) the same four distortions; (b) four new distortions of the same patterns; and (c) distortions of the four previously learned patterns that had not been presented as primes. There was no decay in the weights over the course of the priming experiment; if decay had been included, its main effect would have been to reduce the magnitude of the priming effects.
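The three test conditions differ only in how their test items relate to the priming trial. Continuing the training sketch above (this fragment reuses the `w`, `prototypes`, `distort`, and `lr` defined there; per the text, no decay is applied during priming):

```python
# Prime with new distortions of the first four learned prototypes.
primes = [distort(p) for p in prototypes[:4]]
for prime in primes:
    error = prime - w @ prime
    increment = lr * np.outer(error, prime)
    np.fill_diagonal(increment, 0.0)
    w += increment                    # one full priming trace each, no decay

test_identical = primes                                # (a) the same four distortions
test_similar = [distort(p) for p in prototypes[:4]]    # (b) new distortions, same patterns
test_unprimed = [distort(p) for p in prototypes[4:]]   # (c) distortions of unprimed patterns
```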

The results of the experiment are shown in Figure 8. The response of the model is greatest for the patterns preceded by identical primes, intermediate for patterns preceded by similar primes, and weakest for patterns not preceded by any related prime.

Figure 8. Response probability as a function of exposure time for patterns preceded by identical primes, similar primes, or no related prime.

Our model, then, appears to provide an account, not only for the basic existence of priming effects, but also for the graded nature of priming effects as a function of congruity between prime event and test event. It avoids the problem of multiplication of context-specific detectors which logogen theories fall prey to, while at the same time avoiding enumeration of specific experiences. Congruity effects are captured in the composite memory trace.

The model also has another advantage over the logogen view. It accounts for repetition priming effects for unfamiliar as well as familiar stimuli. When a pattern is presented for the first time, a trace is produced just as it would be for stimuli that had previously been presented. The result is that, on a second presentation of the same pattern, or a new distortion of it, processing is facilitated. The functional equivalent of a logogen begins to be established from the very first presentation.

To illustrate the repetition priming of unfamiliar patterns and to compare the results with the repetition priming we have already observed for familiar patterns, we carried out a third experiment. This time, after learning eight patterns as before, a priming session was run in which new distortions of four of the familiar patterns and distortions of four new patterns were presented. Then, in the test phase, 16 stimuli were presented: new distortions of the primed, familiar patterns; new distortions of the unprimed, familiar patterns; new distortions of the primed, previously unfamiliar patterns; and finally, new distortions of four patterns that were neither primed nor familiar. The results are shown in Figure 9. What we find is that long-term familiarity and recent priming have approximately additive effects on the asymptotes of the time-accuracy curves. The time to reach any given activation level shows a mild interaction, with priming having slightly more of an effect for unfamiliar than for familiar stimuli.

These results are consistent with the bulk of the findings concerning the effects of preexperimental familiarity and repetition in the recent series of experiments by Feustel, Shiffrin, and Salasoo (1983) and Salasoo et al. (1985). They found that preexperimental familiarity of an item (word vs. nonword) and prior exposure had this very kind of interactive effect on exposure time required for accurate identification of all the letters of a string, at least when words and nonwords …


Figure 9. Priming of familiar and unfamiliar patterns.


Figure 10. Time to reach a fixed-accuracy criterion (60% correct) for previously familiar and unfamiliar patterns, as a function of repetitions.

…relation would be built up between the item and the learning context during initial training. Such associations would be formed for words, but because these stimuli have been experienced many times before and have already been well learned, smaller increments in connection strength are formed for these stimuli during training, and thus the strength of the association between the item and the learning context would be less. If this view is correct, we would expect to see a disadvantage for pseudowords relative to words if the testing were carried out in a situation which did not reinstate the mental state associated with the original learning experience, because for these stimuli much of what was learned would be tied to the specific learning context. Such a prediction would appear to differentiate our account from any view which postulated the formation of an abstract, context-independent logogen as the basis for the absence of a pseudoword decrement effect.

    Representation of General andSpecific Information

In the previous section, we cast our distributed model as an alternative to the view that familiar patterns are represented in memory either by separate detectors or by an enumeration of specific experiences. In this section, we show that the model provides alternatives to both abstraction and enumeration models of learning from exemplars of prototypes.

Abstraction models were originally motivated by the finding that subjects occasionally appeared to have learned better how to categorize the prototype of a set of distorted exemplars than the specific exemplars they experienced during learning (Posner & Keele, 1968). However, pure abstraction models have never fared very well, because there is nearly always evidence of some superiority of the particular training stimuli over other stimuli equally far removed from the prototype. A favored model, then, is one in which there is both abstraction and memory for particular training stimuli.

Recently, proponents of models involving only enumeration of specific experiences have noted that such models can account for the basic fact that abstraction models are primarily designed to account for (enhanced response to the prototype, relative to particular previously seen exemplars, under some conditions) as well as failures to obtain such effects under other conditions (Hintzman, 1983; Medin & Shaffer, 1978). In evaluating distributed models, it is important to see if they can do as well.


Table 5
Schematic Description of Stimulus Sets Used in Simulations of Whittlesea's Experiments

                                     Stimulus set
Prototype   Ia       Ib       IIa      IIb      IIc      III      V

PPPPP       APPPP    BPPPP    ABPPP    ACPPP    APCPP    ABCPP    CCCCC
            PAPPP    PBPPP    PABPP    PACPP    PAPCP    PABCP    CBCBC
            PPAPP    PPBPP    PPABP    PPACP    PPAPC    PPABC    BCACB
            PPPAP    PPPBP    PPPAB    PPPAC    CPPAP    CPPAB    ABCBA
            PPPPA    PPPPB    BPPPA    CPPPA    PCPPA    BCPPA    CACAC

Note. The actual stimuli used can be filled in by replacing P with H—I—; A with +H; B with H h; and C with ++++. The model is not sensitive to the fact that the same subpattern was used in each of the five slots.

Anderson (1977) has made important steps in this direction, and Knapp and Anderson (1984) have shown how their distributed model can account for many of the details of the Posner-Keele experiments. Recently, however, two sets of findings have been put forward which appear to strongly favor the enumeration of specific experiences view, at least relative to pure abstraction models. It is important, therefore, to see how well our distributed model can do in accounting for these kinds of effects.

The first set of findings comes from a set of studies by Whittlesea (1983). In a large number of studies, Whittlesea demonstrated a role for specific exemplars in guiding performance on a perceptual identification task. We wanted to see whether our model would demonstrate a similar sensitivity to specific exemplars. We also wanted to see whether our model would account for the conditions under which such effects are not obtained.

Whittlesea used letter strings as stimuli. The learning experiences subjects received involved simply looking at the stimuli one at a time on a visual display and writing down the sequence of letters presented. Subjects were subsequently tested for the effect of this training on their ability to identify letter strings bearing various relations to the training stimuli and to the prototypes from which the training stimuli were derived. The test was a perceptual identification task; the subject was simply required to try to identify the letters from a brief flash.

The stimuli Whittlesea used were all distortions of one of two prototype letter strings. Table 5 illustrates the essential properties of the sets of training and test stimuli he used. The stimuli in Set Ia were each one step away from the prototype. The Ib items were also one step from the prototype and one step from one of the Ia distortions. The Set IIa stimuli were each two steps from the prototype, and one step from a particular Ia distortion. The Set IIb items were also two steps from the prototype, and each was one step from one of the IIa distortions. The Set IIc distortions were two steps from the prototype also, and each was two steps from the closest IIa distortion. Over the set of five IIc distortions, the A and C subpatterns each occurred once in each position, as the A and B subpatterns did in the case of the IIa distortions. The distortions in Set III were three steps from the prototype, and one step from the closest member of Set IIa. The distortions in Set V were each five steps from the prototype.

Whittlesea ran seven experiments using different combinations of training and test stimuli. We carried out simulation analogs of all of these experiments, plus one additional experiment that Whittlesea did not run. The main difference between the simulation experiments and Whittlesea's actual experiments was that he used two different prototypes in each experiment, whereas we only used one.

The simulation used a simple 20-unit module. The set of 20 units was divided into five submodules, one for each letter in Whittlesea's letter strings. The prototype pattern and the different distortions used can be derived from the information provided in Table 5.
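Concretely, this encoding can be sketched as one four-element subpattern per letter slot, concatenated into a 20-element vector. The particular codes below are hypothetical; Table 5's note specifies only the code for C unambiguously in this copy:

```python
import numpy as np

# Hypothetical four-element codes; only C (+ + + +) is certain from Table 5's note.
CODES = {
    "P": np.array([+1.0, -1.0, +1.0, -1.0]),
    "A": np.array([+1.0, +1.0, -1.0, -1.0]),
    "B": np.array([+1.0, -1.0, -1.0, +1.0]),
    "C": np.array([+1.0, +1.0, +1.0, +1.0]),
}

def encode(string):
    """Concatenate the code for each of the five letters:
    one 4-unit submodule per letter slot, 20 units in all."""
    return np.concatenate([CODES[ch] for ch in string])

prototype = encode("PPPPP")   # the prototype pattern
ia_item = encode("APPPP")     # one of the Set Ia training distortions
```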

Each simulation experiment began with null connections between the units. The training phase involved presenting the set or sets of training stimuli analogous to those Whittlesea used, for the same number of presentations. To avoid idiosyncratic effects of particular orders of training stimuli, each experiment was run six times, each with a different random order of training stimuli. On each trial, activations were allowed to settle down through 50 processing cycles, and then connection strengths were adjusted. There was no decay of the increments to the weights over the course of an experiment.

In the test phase, the model was tested with the sets of test items analogous to the sets Whittlesea used. As a precaution against effects of prior test items on performance, we simply turned off the adjustment of weights during the test phase.

A summary of the training and test stimuli used in each of the experiments, of Whittlesea's findings, and of the simulation results is shown in Table 6. The numbers represent relative amounts of enhancement in performance as a result of the training experience, relative to a pretest baseline. For Whittlesea's data, this is the per letter increase in letter identification probability between a pretest and a posttest. For the simulation, it is the increase in the size of the dot product between a pretest with null weights and a posttest after training. For comparability to the data, the dot product difference scores have been doubled. This is simply a scaling operation to facilitate qualitative comparison of experimental and simulation results.

A comparison of the experimental and simulation results shows that wherever there is a within-experiment difference in Whittlesea's data, the simulation produced a difference in the same direction. (Between-experiment comparisons are not considered because of subject and material differences which render such differences unreliable.) The next several paragraphs review some of the major findings in detail.

Some of the comparisons bring out the importance of congruity between particular test and training experiences. Experiments 1, 2, and 3 show that when distance of test stimuli from the prototype is controlled, similarity to particular training exemplars makes a difference both for the human subject and in the model. In Experiment 1, the relevant contrast was between Ia and Ib items. In Experiment 2, it was between IIa and IIc items. Experiment 3 shows that the subjects and the model both show a gradient in performance with increasing distance of the test items from the nearest old exemplar.

Experiments 4, 4', and 5 examine the status of the prototype and other test stimuli closer to the prototype than any stimuli actually shown during training. In Experiment 4, the training stimuli were fairly far away from the prototype, and there were only five different training stimuli (the members of the IIa set). In this case, controlling for distance from the nearest training stimuli, test stimuli closer to the prototype showed more enhancement than those farther away (Ia vs. III comparison). However, the actual training stimuli nevertheless had an advantage over both other sets of test stimuli, including those that were closer to the prototype than the training stimuli themselves (IIa vs. Ia comparison).

Table 6
Summary of Perceptual Identification Experiments With Experimental and Simulation Results

Whittlesea's   Training           Test           Experimental   Simulation
experiment     stimulus set(s)    stimulus set   results        results

1              Ia                 Ia             .27            .24
                                  Ib             .16            .15
                                  V              .03            -.05
2              IIa                IIa            .30            .29
                                  IIc            .15            .12
                                  V              .03            -.08
3              IIa                IIa            .21            .29
                                  IIb            .16            .14
                                  IIc            .10            .12
4              IIa                P              –              .24
                                  Ia             .19            .21
                                  IIa            .23            .29
                                  III            .15            .15
4'             Ia                 P              –              .28
                                  Ia             –              .24
                                  IIa            –              .12
5              IIa, b, c          P              –              .25
                                  Ia             .16            .21
                                  IIa            .16            .18
                                  III            .10            .09
6              III                Ia             .16            .14
                                  IIa            .16            .19
                                  III            .19            .30
7              IIa                IIa            .24            .29
                                  IIc            .13            .12
                                  III            .17            .15

Note. Dashes indicate conditions for which no experimental data are available; Experiment 4' was run in simulation only.


In Experiment 4' (not run by Whittlesea) the same number of training stimuli were used as in Experiment 4, but these were closer to the prototype. The result is that the simulation shows an advantage for the prototype over the old exemplars. The specific training stimuli used even in this experiment do influence performance, however, as Whittlesea's first experiment (which used the same training set) shows (Ia-Ib contrast). This effect holds both for the subjects and for the simulation. The pattern of results is similar to the findings of Posner and Keele (1968), in the condition where subjects learned six exemplars which were rather close to the prototype. In this condition, their subjects' categorization performance was most accurate for the prototype, but more accurate for old than for new distortions, just as in this simulation experiment.

In Experiment 5, Whittlesea demonstrated that a slight advantage for stimuli closer to the prototype than the training stimuli would emerge, even with high-level distortions, when a large number of different distortions were used once each in training, instead of a smaller number of distortions presented three times each. The effect was rather small in Whittlesea's case (falling in the third decimal place in the per letter enhancement effect measure), but other experiments have produced similar results, and so does the simulation. In fact, because the prototype was tested in the simulation, we were able to demonstrate a monotonic drop in performance with distance from the prototype in this experiment.

Experiments 6 and 7 examine in different ways the relative influence of similarity to the prototype and similarity to the set of training exemplars, using small numbers of training exemplars rather far from the prototype. Both in the data and in the model, similarity to particular training stimuli is more important than similarity to the prototype, given the sets of training stimuli used in these experiments.

Taken together with other findings, Whittlesea's results show clearly that similarity of test items to particular stored exemplars is of paramount importance in predicting perceptual performance. Other experiments show the relevance of these same factors in other tasks, such as recognition memory, classification learning, and so forth. It is interesting to note that performance does not honor the specific exemplars so strongly when the training items are closer to the prototype. Under such conditions, performance is superior on the prototype or stimuli closer to the prototype than the training stimuli. Even when the training stimuli are rather distant from the prototype, they produce a benefit for stimuli closer to the prototype, if there are a large number of distinct training stimuli each shown only once. Thus, the dominance of specific training experiences is honored only when the training experiences are few and far between. Otherwise, an apparent advantage for the prototype, though with some residual benefit for particular training stimuli, is the result.

The congruity of the results of these simulations with experimental findings underscores the applicability of distributed models to the question of the nature of the representation of general and specific information. In fact, we were somewhat surprised by the ability of the model to account for Whittlesea's results, given the fact that we did not rely on context-sensitive encoding of the letter string stimuli. That is, the distributed representation we assigned to each letter was independent of the other letters in the string. However, a context-sensitive encoding would prove necessary to capture a larger ensemble of stimuli.

Whether a context-sensitive encoding would produce the same or slightly different results depends on the exact encoding. The exact degree of overlap of the patterns of activation produced by different distortions of the same prototype determines the extent to which the model will tend to favor the prototype relative to particular old exemplars. The degree of overlap, in turn, depends on the specific assumptions made about the encoding of the stimuli. However, the general form of the results of the simulation would be unchanged: When all the distortions are close to the prototype, or when there is a very large number of different distortions, the central tendency will produce the strongest response; but when the distortions are fewer, and farther from the prototype, the training exemplars themselves will produce the strongest activations. What the encoding would affect is the similarity metric.

In this regard, it is worth mentioning another finding that appears to challenge our distributed account of what is learned through repeated experiences with exemplars. This is the finding of Medin and Schwanenflugel (1981). Their experiment compared ease of learning of two different sets of stimuli in a categorization task. One set of stimuli could be categorized by a linear combination of weights assigned to particular values on each of four dimensions considered independently. The other set of stimuli could not be categorized in this way; and yet, the experiment clearly demonstrated that linear separability was not necessary for categorization learning. In one experiment, linearly separable stimuli were less easily learned than a set of stimuli that were not linearly separable but had a higher degree of intraexemplar similarity within categories.

At first glance, it may seem that Medin and Schwanenflugel's experiment is devastating to our distributed approach, because our distributed model can only learn linear combinations of weights. However, whether a linear combination of weights can suffice in the Medin and Schwanenflugel experiments depends on how patterns of activation are assigned to stimuli. If each stimulus dimension is encoded separately in the representation of the stimulus, then the Medin and Schwanenflugel stimuli cannot be learned by our model. But if each stimulus dimension is encoded in a context-sensitive way, then the patterns of activation associated with the different stimuli become linearly separable again.
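The logic can be seen in miniature (our toy example, not from the paper): an XOR-like category assignment is not linearly separable over two raw dimensions, but appending a conjunctive feature, which encodes one dimension in the context of the other, makes it separable.

```python
import numpy as np

# Two binary dimensions; category is the product (XOR-like) of the signs.
stimuli = np.array([[+1.0, +1.0], [+1.0, -1.0], [-1.0, +1.0], [-1.0, -1.0]])
labels = np.array([+1.0, -1.0, -1.0, +1.0])

# No weight vector over the two raw dimensions can reproduce `labels`.
# Append the conjunctive feature x1*x2 (a context-sensitive encoding):
recoded = np.column_stack([stimuli, stimuli[:, 0] * stimuli[:, 1]])

w = np.array([0.0, 0.0, 1.0])  # weight the conjunctive feature alone
assert (np.sign(recoded @ w) == labels).all()
```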

One way of achieving context sensitivity is via separate enumeration of traces. But it is well known that there are other ways as well. Several different kinds of context-sensitive encodings which do not require separate enumeration of traces, or the allocation of separate nodes to individual experiences, are considered in Hinton (1981a), Hinton, McClelland, and Rumelhart (in press), and Rumelhart and McClelland (in press).

It should be noted that the motivation for context-sensitive encoding in the use of distributed representations is captured by, but by no means limited to, the kinds of observations reported in the experiment by Medin and Schwanenflugel. The trouble is that the assignment of particular context-sensitive encodings to stimuli is at present rather ad hoc: There are too many different possible ways it can be done to know which way is right.

What is needed is a principled way of assigning distributed representations to patterns of activation. The problem is a severe one, but really it is no different from the problem that all models face, concerning the assignment of representations to stimuli. What we can say for sure at this point is that context-sensitive encoding is necessary, for distributed models or for any other kind.

    Discussion

Until very recently, the exploration of distributed models was restricted to a few workers, mostly coming from fields other than cognitive psychology. Although in some cases, particularly in the work of Anderson (1977; Anderson et al., 1977; Knapp & Anderson, 1984), some implications of these models for our understanding of memory and learning have been pointed out, they have only begun to be applied by researchers primarily concerned with understanding cognitive processes per se. The present article, along with those of Murdock (1982) and Eich (1982), represents what we hope will be the beginning of a more serious examination of these kinds of models by cognitive psychologists. For they provide, we believe, important alternatives to traditional conceptions of representation and memory.

We have tried to illustrate this point here by showing how the distributed approach circumvents the dilemma of specific trace models. Distributed memories abstract even while they preserve the details of recent, or frequently repeated, experiences. Abstraction and preservation of information about specific stimuli are simply different reflections of the operation of the same basic learning mechanism.

The basic points we have been making can of course be generalized in several different directions. Here we …