
A Model of Context-Dependent Association Using Selective Desensitization of Nonmonotonic Neural Elements

Masahiko Morita,1 Kouhei Matsuzawa,2 and Shigemitsu Morokami2

1Institute of Engineering Mechanics and Systems, University of Tsukuba, Tsukuba, 305-8573 Japan

2Doctoral Program in Systems and Information Engineering, University of Tsukuba, Tsukuba, 305-8573 Japan

SUMMARY

In neural networks based on a distributed information representation, when different patterns are to be recalled from the same input by association depending on the context, the usual method is to concatenate the pattern representing the context with the input pattern. However, this approach has a serious problem: strong constraints are imposed on the number of input patterns and the number of context patterns. This paper applies a different method of contextual modification to nonmonotonic neural networks in order to construct a context-dependent associative model that solves this longstanding problem. In the proposed model, the number of associations that can be learned increases almost in proportion to the number of elements, regardless of the numbers of input and context patterns. In addition, the state transitions among attractors can be controlled flexibly by switching the context, which enables the model to simulate the behavior of any finite automaton without an explosive increase in the number of elements or the training time. The model also has high generalization ability based on fully distributed representations and has the potential to overcome the limitations of conventional symbol processing.

© 2005 Wiley Periodicals, Inc. Syst Comp Jpn, 36(7): 73–83, 2005; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.10477

Key words: nonmonotonic associative model; trajectory attractor; product-type contextual modification; context-directed association; finite automaton.

1. Introduction

In general, human information processing depends strongly on context. In order to realize information processing systems with human-like intelligence, the problem of context dependence cannot be avoided. Artificial neural networks are no exception, and thus the study of context-dependent processing is very important.

Since “context” has diversified implications and roles, it is difficult to discuss context in the general sense. In this paper, the basic viewpoint of our discussion is that if the context differs, the desired output for the same input can differ. In other words, by “context” we mean the information that affects the input–output relations of the system. The function of adjusting the output by modifying the input information is common to most of the roles played by context and constitutes the basic problem in modeling context processing.

From this viewpoint, conventional artificial neural networks have very poor ability in context processing.


Systems and Computers in Japan, Vol. 36, No. 7, 2005. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J85-D-II, No. 10, October 2002, pp. 1602–1612.

Contract grant sponsor: Supported by Special Coordination Funds for Promoting Science and Technology from the Ministry of Education, Culture, Sports, Science and Technology of Japan under the heading “Study of Dynamic Memory Systems for Realization of Context-Initiative Recognition, Decision and Behavior Functions.”


Several models that handle context on the basis of distributed information representations have been proposed, but these can operate effectively only for small-scale problems. As will be discussed in detail in Section 2, the major reason for their unsatisfactory operation is the conventional method of contextual modification, that is, the approach of simply concatenating the pattern representing the context with the input pattern (called direct-sum type modification). So long as this approach is used, there is a fundamental difficulty in associating a set of input patterns with various sets of output patterns according to various contexts.

Apart from the above approach, a model called the mixture of experts is often used when the input–output relations change greatly according to the context [1]. It consists of several networks, each specialized for one context. It is true that the problems of nonorthogonality and averaging (described later) do not arise in this approach, but it requires as many networks as there are contexts, which greatly reduces the efficiency with which elements are used. A more serious problem with this model is that the context is essentially handled as a symbol. This not only means that no generalization occurs across different contexts, but also raises the problem of how to symbolize the context adequately. Thus, this approach does not sufficiently utilize the advantages of distributed information processing by neural networks.

This paper presents a distributed contextual modification method which differs from the direct-sum type, and solves the above longstanding problem by applying the method to an associative memory model using nonmonotonic elements. In the following, we first describe the problems of the conventional contextual modification method and the nonmonotonic neural network that forms the basis of the proposed model. Then product-type modification, or the selective desensitization of elements, is described, and it is shown that by applying this method to a nonmonotonic neural network, context-dependent association without constraints of scale is achieved. A context-directed associative model is then constructed, and it is shown that the model can simulate the behavior of finite automata.

2. Problems of Conventional Methods

Here we consider the Elman network [2], which is a typical neural network producing context-dependent output, as an example, and point out the problems involved in the conventional distributed context-dependent representation.

Figure 1 shows the structure of the Elman network, which is a three-layered network with partially recurrent connections. The first layer is the input part plus the context units. The second layer is a hidden layer in which the internal state is maintained. The third layer is the output layer. The activity pattern of the context units at discrete time t is identical to the previous pattern y(t – 1) of the hidden layer [or, in a more general model, a pattern determined by y(t – 1)].

Since the pattern z(t) of the third layer depends exclusively on y(t), in order to output different patterns, the activity patterns of the second layer must differ. Consequently, we ignore the third layer and consider a simple two-layered network which outputs y on the basis of the pattern x of the first layer.

Suppose that p (≥ 2) patterns S1, . . . , Sp are input to the input part of this network, and q (≥ 2) patterns C1, . . . , Cq appear in the context part. Let Sµ and Cν be patterns of n1 and n2 dimensions, respectively. Combining these, there can be a maximum of p × q different (n1 + n2)-dimensional patterns x = (Sµ, Cν). When we wish to construct a network that produces independent patterns T1,1, . . . , Tp,q as the output y for these patterns x, the following problems arise (the argument also applies when not all pq combinations appear, provided the number of patterns is large enough).

First, some of the patterns in the first layer are similar. For example, (S1, C1) and (S1, C2) have at least n1 elements in common, and (S1, C1) and (S2, C1) have at least n2 common elements. Consequently, even if the Sµ and Cν are sufficiently distant from each other, the orthogonality of the x is low. This property is pronounced when n1 and n2 differ greatly. In general, it becomes difficult to set the connection weights so that the desired patterns are output when the orthogonality of x decreases. Since the number of pairs of similar patterns increases rapidly with increasing p and q, the above goal is difficult to achieve unless p and q are relatively small and n1 ≈ n2. This is called the problem of nonorthogonality.
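
A minimal numerical sketch of this nonorthogonality follows; the dimensions n1 and n2 and the use of random ±1 patterns below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 800, 200                      # assumed cue and context dimensions

def rand_pm1(dim):
    """Random pattern whose components are +1 or -1 with probability 1/2."""
    return rng.choice([-1, 1], size=dim)

S1, S2 = rand_pm1(n1), rand_pm1(n1)    # two independent cue patterns
C1, C2 = rand_pm1(n2), rand_pm1(n2)    # two independent context patterns

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Direct-sum (concatenated) inputs x = (S, C)
x11 = np.concatenate([S1, C1])
x12 = np.concatenate([S1, C2])
x21 = np.concatenate([S2, C1])

# Although the cues and contexts themselves are nearly orthogonal,
# concatenated inputs sharing a cue or a context are strongly correlated.
print(cosine(S1, S2))    # close to 0
print(cosine(x11, x12))  # about n1/(n1 + n2) = 0.8 (same cue, different context)
print(cosine(x11, x21))  # about n2/(n1 + n2) = 0.2 (same context, different cue)
```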

Fig. 1. Structure of the Elman network.

The second point is the problem of averaging caused by one-to-many correspondence. For a single input pattern S1, for example, there can be q activity patterns T1,1 to T1,q in the second layer. Thus, the connection weights from the input part to the second layer must be set such that if S1 is fed to the input part but nothing is fed to the context part, a pattern similar to the average of T1,1, . . . , T1,q appears in the second layer. This also applies when the connection weights are determined by training; that is, S1 is associated with the average pattern T̄1 no matter what learning rule is applied. Similarly, input pattern Sµ is associated with the average pattern T̄µ of Tµ,1, . . . , Tµ,q.

As q increases, the averaging effect becomes more marked, making T̄1, . . . , T̄p similar to one another. This implies that no matter which Sµ is input, the signals received by the second layer do not differ much; in other words, the connections from the input part become almost meaningless. The same reasoning applies to the context part; that is, if a large number of Tµ,ν correspond to a single context Cν, the connections from the context part to the second layer are insignificant. It is therefore impossible to realize the desired network unless p and q are very small.
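
The averaging effect can be seen directly with a simple correlation (Hebbian) rule. The sketch below is purely illustrative: the sizes p, q, n1, n2 and the use of a correlation rule (rather than any particular training procedure discussed in the paper) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p, q = 400, 400, 4, 8
S = rng.choice([-1, 1], size=(p, n1)).astype(float)          # cue patterns
C = rng.choice([-1, 1], size=(q, n2)).astype(float)          # context patterns
T = rng.choice([-1, 1], size=(p, q, n1 + n2)).astype(float)  # desired outputs

# Correlation weights from the concatenated input (S^mu, C^nu) to T^{mu,nu}
W = sum(np.outer(T[m, v], np.concatenate([S[m], C[v]]))
        for m in range(p) for v in range(q))

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Feeding S^1 alone (nothing in the context part) yields a response that
# points toward the average of T^{1,1}, ..., T^{1,q}, not toward any
# particular target: this is the averaging problem.
resp = W @ np.concatenate([S[0], np.zeros(n2)])
print(cosine(resp, T[0].mean(axis=0)))   # close to 1
print(cosine(resp, T[0, 0]))             # much smaller, roughly 1/sqrt(q)
```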

The latter problem of averaging is fundamental and cannot be easily solved. Even if another hidden layer is provided between the first and second layers, the problem is simply rewritten as a problem between the first layer and the hidden layer or between the hidden layer and the second layer, which does not bring about any essential solution (in the case of Elman networks, it rather makes training more difficult and degrades performance).

Also, the introduction of recurrent connections or attractor dynamics does not produce any qualitative improvement, though there can be a slight quantitative improvement. The one-to-many associative model HASP [4] is such an example: when the number of patterns to be memorized is increased, association is not achieved unless a pattern that is highly correlated only with the pattern to be recalled is given [5].

Theoretically, the Elman network can produce behavior equivalent to that of any finite automaton [3]. However, this holds only when an unlimited number of elements is available, and it does not guarantee that such behavior can be acquired by training. In a real network with a finite number of elements, a large-scale automaton cannot be realized, for the reason given above. Considering also the computational complexity and the convergence properties of back-propagation learning, the Elman network can simulate only a limited-scale automaton. Since a finite automaton can be considered the basic model of context processing using symbols, this situation indicates the limitations of contextual processing by the Elman network, which are common to all neural network models using a similar context representation or contextual modification method.

3. Nonmonotonic Associative Model

Before presenting a solution to the above problem, a dynamic associative model based on the nonmonotonic neural network (called the nonmonotonic associative model) is briefly described. For details, see Refs. 6 and 7.

3.1. Network dynamics

The nonmonotonic neural network [8] is a network in which elements with a nonmonotonic output function are fully connected. In this study, the following function is used as the nonmonotonic output function:

    f(u) = [(1 − e^(−cu)) / (1 + e^(−cu))] · [(1 − e^(c′(|u|−h))) / (1 + e^(c′(|u|−h)))]        (1)

where c, c′, and h are positive constants, and it is assumed that the system behaves in accordance with continuous-time dynamics. In other words, the output yi of the i-th element is given by

    τ dui/dt = −ui + Σj wij yj + zi        (2)

    yi = f(ui)        (3)

Here, n is the number of elements, ui is the internal potential of the element, wij is the coupling weight from the j-th element, zi is the input signal from outside, and τ is the time constant.

Since the sign of the internal potential ui is important in this model, we consider xi = sgn(ui) [sgn(u) is the sign function, which takes the value 1 for u > 0 and –1 for u ≤ 0]. The vector x = (x1, . . . , xn) is called the state of the network. The state x of the network at a given time is represented as a point in the state space composed of the 2^n possible states. Since the xi change asynchronously, the transition of x to another state is almost always to an adjacent state (i.e., a state in which only one element differs). Consequently, a continuous trace of x is drawn in the state space with the passage of time. It is called the trajectory of x.
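
A minimal sketch of these dynamics is given below. The output function follows the form of Eq. (1), but the constants c, c′, h, the time step, and the random initial weights are illustrative assumptions rather than the paper's values; integration is by a simple Euler step.

```python
import numpy as np

def f_nonmonotone(u, c=50.0, cp=10.0, h=0.5):
    """Nonmonotonic output function of the form in Eq. (1).
    Algebraically equal to (1-e^{-cu})/(1+e^{-cu}) * (1-e^{c'(|u|-h)})/(1+e^{c'(|u|-h)})."""
    return np.tanh(c * u / 2.0) * (-np.tanh(cp * (np.abs(u) - h) / 2.0))

def euler_step(u, W, z, tau=1.0, dt=0.01):
    """One Euler step of tau du_i/dt = -u_i + sum_j w_ij y_j + z_i, with y_i = f(u_i)."""
    y = f_nonmonotone(u)
    u_next = u + dt * (-u + W @ y + z) / tau
    return u_next, y

n = 200
rng = np.random.default_rng(2)
W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))   # assumed initial weights
u = rng.normal(0.0, 0.1, size=n)
z = np.zeros(n)                                      # no external input
for _ in range(500):
    u, y = euler_step(u, W, z)
x = np.where(u > 0, 1, -1)                           # network state x_i = sgn(u_i)
```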

The nonmonotonic neural network has a great capacity for associative memory, as well as many features which are not found in other neural network models [8]. One of its advantages is that memorization is possible with a simple Hebbian learning rule even if the orthogonality of the patterns is relatively low [9], which is one of the reasons for using this model in this study.

Another important feature related to this study is that a transition can be made stably from any state to any other state along a continuous trajectory in the state space. This trajectory is called a trajectory attractor. Using this, various kinds of spatiotemporal pattern processing have been realized [7, 10]. When an ordinary monotonic output function is used, it is very difficult to form such a trajectory attractor unless a nonmonotonic element is realized equivalently by combining multiple elements [6].

3.2. Learning and association

Consider a set of spatial patterns S1, . . . , Sm which is associated with another set of patterns T1, . . . , Tm. Here, Sµ and Tµ (µ = 1, . . . , m) are n-dimensional vectors with ±1 components; the former is called the cue pattern and the latter the target pattern. We wish to realize recall of the target from the cue by m trajectory attractors, each with Sµ as the start and Tµ as the end.

The trajectory attractor is formed as follows. First, a spatiotemporal pattern which changes along a continuous trajectory from Sµ to Tµ is prepared (the trajectory may be arbitrary so long as it does not cross itself, but usually the shortest path is used). Using this pattern as the learning signal r, the network is trained.

Specifically, x = Sµ is used as the initial state and the network is operated in accordance with Eqs. (1) to (3). At the same time, r is input to the elements in the form zi = λri(t) (ri is a component of r and λ is the input intensity). In parallel with this, all connection weights wij are updated by

    τ′ dwij/dt = −wij + α yi yj        (4)

Here τ′ is the time constant of learning (τ′ >> τ), and α is the learning coefficient; α may be a constant, but in this study we set α = α′xiyi (α′ is a positive constant), since the learning performance improves if α decreases with increasing |ui|.

When the learning signal r changes, the network state x makes transitions almost exactly along the trajectory of r, following slightly behind it. When r reaches the goal, that is, Tµ, training is continued for a while (until x effectively approaches r) while retaining r = Tµ. Performing the above procedure for every µ completes one training session. The procedure is repeated for several sessions, and it is better to reduce the input intensity λ of r as the sessions progress.
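
The following sketch shows how one trajectory attractor might be formed under these rules. The shortest-path schedule for r, the Euler integration, all constants, and the Hebbian-style reading of Eq. (4) are assumptions made for illustration, not a confirmed reproduction of the paper's procedure.

```python
import numpy as np

def f_nonmonotone(u, c=50.0, cp=10.0, h=0.5):
    # same assumed nonmonotonic output function as in the earlier sketch
    return np.tanh(c * u / 2.0) * (-np.tanh(cp * (np.abs(u) - h) / 2.0))

def train_trajectory(W, S, T, tau=1.0, tau_p=200.0, dt=0.01,
                     move=5.0, hold=1.0, lam=0.8, alpha_p=2.0):
    """Form one trajectory attractor from cue S to target T (a sketch).

    The learning signal r moves from S to T along a shortest path and is fed
    as external input z_i = lam * r_i; the weights follow
    tau' dw_ij/dt = -w_ij + alpha * y_i * y_j with alpha = alpha' * x_i * y_i,
    which is one plausible reading of Eq. (4)."""
    rng = np.random.default_rng(3)
    u = 0.1 * S.astype(float)                  # start the network near x = S
    diff = rng.permutation(np.where(S != T)[0])
    steps_move, steps_hold = int(move / dt), int(hold / dt)
    r = S.astype(float).copy()
    for t in range(steps_move + steps_hold):
        k = len(diff) if t >= steps_move else int(len(diff) * t / steps_move)
        r[:] = S
        r[diff[:k]] = T[diff[:k]]              # r has moved k components toward T
        y = f_nonmonotone(u)
        u += dt * (-u + W @ y + lam * r) / tau
        x = np.where(u > 0, 1, -1)
        alpha = alpha_p * x * y                # alpha_i = alpha' * x_i * y_i
        W += dt * (-W + np.outer(alpha * y, y)) / tau_p
    return W

# usage: n = 500; S, T random +-1 patterns; W = train_trajectory(np.zeros((n, n)), S, T)
```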

Figure 2 is a schematic representation of the energy landscape in the state space of the network after training. Each point on the plane represents a state of the network, and the depth direction represents the potential energy, or the stability, of the state. Intuitively, learning rule (4) decreases the energy of the network in the neighborhood of r. Consequently, when r changes continuously from Sµ to Tµ, a groove is produced, as shown in the figure. In addition, since r leads x slightly, a gentle flow from x toward r, that is, from Sµ to Tµ, is formed at the bottom of the groove. Thus, simply by providing Sµ or a pattern in its neighborhood as the initial state of the network, x reaches Tµ almost exactly along the trajectory of r, and the desired target is recalled.

According to the properties of nonmonotonic networks, multiple trajectory attractors can exist in a limited region unless the trajectories cross or are very close together. Since the dimension n of the state space is sufficiently large and there exist many trajectories connecting two specified points, even if two cue patterns S1 and S2 are fairly close together, it is possible to associate them with arbitrary target patterns T1 and T2.

3.3. Separation of trajectories by contextual modification

As described above, the nonmonotonic associative model allows considerable similarity between cue patterns, and thus keeps the problem of nonorthogonality under control. However, it is essentially impossible to associate an identical cue with different targets. To make this possible, it is necessary to modify Sµ according to the context and thereby separate the starting points of the trajectory attractors.

The simplest approach to this is direct-sum type modification as used in the Elman network, that is, to use the direct-sum vector (Sµ, Cν) of the cue pattern Sµ and the context pattern Cν as the input pattern. This separates, in the state space of the whole network, trajectories which overlap in the cue pattern space, making it possible to handle even spatiotemporal patterns with complex overlapping trajectories [10].

So long as this method is used, however, the above problem of averaging cannot be avoided. In fact, as will be shown by the numerical experiment in Section 4.3, there is a limit on the number of Sµ and Cν, no matter how many elements are used. Consequently, we use product-type contextual modification instead of direct-sum type modification to solve the problem.

Fig. 2. Schematic energy landscape of the network with trajectory attractors.

4. Principle of Context-Dependent Associative Model

4.1. Product-type contextual modification and selective desensitization

Consider a neural element whose sensitivity to the input is variable. Specifically, the output of the i-th element of the network is

    yi = gi f(ui) + (1 − gi) ȳi        (5)

and the gain gi is variable. Here, ui is the internal potential of the element (the weighted sum of inputs in the case of a discrete-time model) and ȳi is the average output level of the element. The basic idea of product-type contextual modification is that the sensitivity vector G = (g1, . . . , gn) depends on the pattern C representing the context; that is, G is a function of C.

When we wish to solve the problem of averaging using this modification method, the essential manipulation is to set gi = 0 for some elements, that is, to make yi ≡ ȳi regardless of the input. This is called selective desensitization. Here, the set of elements to be desensitized should differ from context to context. Also, in order to maximize the size of the space of patterns C, or the capacity to represent context, each element should be desensitized independently of the others in various contexts.

Thus, the proposed scheme is essentially different from the gating mechanism used in the mixture of experts, in that each element is used or desensitized in a very large number of contexts. It should be noted that even in a desensitized element, the internal potential ui changes according to the input signal in the same way as when the element is not desensitized (there is no essential difference if ui is instead kept constant).

By this modification, even if the same input pattern is fed in the same state, the output pattern changes according to C, making it possible to associate the same cue with different targets. In this process, the effect of C is indirect, and the set of elements representing C and the set of elements representing the target are not directly connected. This makes it possible to avoid the problem of averaging. It should be noted that if the output of desensitized elements is other than ȳi, the modified pattern y(C) and the context pattern C are correlated, which potentially allows the problem of averaging to remain.

Elements the same as or similar to Eq. (5) have been used in several models, and product-type modification itself, in which G is modified by the context, is not particularly novel. One reason why the problem of averaging has nevertheless remained unsolved is described below, but a more essential reason seems to be that the essence of the problem, and the significance of desensitization for it, has not been well recognized (see Section 5.2).

Below, the simplest case is considered in order to keep the discussion plain: let ȳi = 0 for all elements. This assumption is valid if the output function f(u) is an odd function, as in Eq. (1) (there is little problem even if ȳi ≠ 0). The sensitivity vector G is regarded in the same light as the context pattern, assuming a one-to-one correspondence between them. Specifically, C is assumed to be an n-dimensional pattern with binary (±1) components, and we let gi = (1 + ci)/2 (ci is a component of C). The numbers of 1s and –1s in the pattern are set nearly equal (this condition is satisfied if the ci are set at random when n is sufficiently large). Then approximately half the elements are desensitized and output yi = 0, and the others output yi = f(ui) like ordinary elements.

The state of the network under such modification is denoted by x(C) ≡ (g1x1, . . . , gnxn), which is a ternary vector whose components are 1, 0, or –1. The state x without modification is represented as the state modified by the vector I ≡ (1, . . . , 1), that is, x(I).

Similarly, modification of the cue pattern S and the target pattern T can be considered. Specifically, S modified by context pattern C is defined as S(C) ≡ (g1s1, . . . , gnsn) (likewise for T). Assuming that S and T have binary components taking values of ±1, S(C) and T(C) are ternary patterns, as in the case of x(C).
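
A small sketch of this modification follows; the function names are illustrative, and ȳi = 0 is assumed as in the simplest case discussed above.

```python
import numpy as np

def gains(C):
    """Gain vector g with g_i = (1 + c_i)/2: 1 for sensitive, 0 for desensitized."""
    return (1 + C) // 2

def desensitized_output(u, C, f, y_bar=0.0):
    """Eq. (5) with ȳ_i = y_bar: desensitized elements output y_bar (here 0),
    while the remaining elements output f(u_i) as usual."""
    g = gains(C)
    return g * f(u) + (1 - g) * y_bar

def modified_pattern(P, C):
    """P(C): ternary pattern whose desensitized components are set to 0."""
    return gains(C) * P

rng = np.random.default_rng(4)
n = 1000
S = rng.choice([-1, 1], size=n)
C1, C2 = rng.choice([-1, 1], size=n), rng.choice([-1, 1], size=n)

S_C1, S_C2 = modified_pattern(S, C1), modified_pattern(S, C2)
# Roughly half the components survive in each context, and the two modified
# patterns share only about a quarter of them, so the same cue S is placed
# in different subspaces depending on the context.
print(np.count_nonzero(S_C1), np.count_nonzero(S_C1 * S_C2))
```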

4.2. Application to nonmonotonic associative model

Product-type contextual modification and selective desensitization can be applied to most neural network models. In order to maximize their effectiveness, however, it is desirable that the model to which the method is applied satisfy certain conditions.

The first condition is that the pattern to which contextual modification is applied should be an attractor of the network. For example, let x = S be an attractor. Then there is little possibility that ui changes sign immediately after modification by a context pattern C1. Consequently, xi = si holds for almost all elements, including the desensitized ones. Thus, x(C1) = S(C1) and x ≈ S. If the modification is removed in this state, the state returns to x = S almost immediately. Also, if the context pattern is switched from C1 to another pattern C2, the state remains x ≈ S, and we have x(C2) ≈ S(C2). Thus, when S is an attractor, there is no danger that information is lost by contextual modification or context switching.


It is not always true that S(C) is an attractor if S is an attractor. Conversely, S is not always an attractor if S(C) is an attractor. However, if S(Cν) is an attractor for a sufficiently large number of Cν, S is certain to be an attractor. Consequently, if training is performed so that S(Cν) is an attractor for several contexts, training in the state S without contextual modification is not necessary for satisfying the condition.

The second condition is that the problem of nonorthogonality should not arise. Even if C1 and C2 are completely independent patterns, S(C1) and S(C2) are quite similar. Moreover, since we use distributed representations for the context, the system must be able to handle the case in which C1 and C2 are similar and the orthogonality between S(C1) and S(C2) is even lower. Thus, the problem of nonorthogonality must be solved, just as in the case of direct-sum type modification.

It is the nonmonotonic associative model described in Section 3 that satisfies the above conditions while having a simple structure. By introducing the selective desensitization method into this model, all of the problems described in Section 2 are solved, and a large-scale system for context-dependent association can be realized. Specifically, the model is modified as follows.

As regards the equations for the network operation, only Eq. (3) is replaced by Eq. (5), provided that a constant context pattern Cν is given continuously while the target is being recalled from the cue. The training procedure is basically the same as in Section 3.2, with the following difference: the spatiotemporal pattern changing from Sµ(Cν) to Tµ,ν(Cν) is used as the learning signal r, and the training is carried out while the modification by Cν is applied (no training is applied to the desensitized elements). In addition, if necessary, in order to make Sµ and Tµ,ν attractors, training is performed for a short period without contextual modification (C = I), using these patterns as the learning signal.
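
As a sketch of this last point, the weight update of the earlier training sketch can be masked by the gain vector so that the desensitized elements receive no training. Whether the masking acts on the incoming weights of the desensitized elements, as assumed here, is an illustrative reading rather than a detail confirmed by the paper.

```python
import numpy as np

def masked_weight_step(W, y, x, g, alpha_p=2.0, tau_p=200.0, dt=0.01):
    """One weight-update step under contextual modification (a sketch).

    g is the 0/1 gain vector obtained from the context pattern C.  Rows of W
    belonging to desensitized elements (g_i = 0) are left unchanged, i.e. no
    training is applied to those elements; the Hebbian-style form itself is
    the same assumed reading of Eq. (4) as in the earlier sketch."""
    alpha = alpha_p * x * y
    dW = (-W + np.outer(alpha * y, y)) / tau_p
    return W + dt * (g[:, None] * dW)
```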

In the network after training, trajectory attractors are formed in different subspaces depending on the context, by which context-dependent association is realized. Figure 3 shows the process schematically.

The figure represents the n-dimensional state space of the network in three dimensions; in other words, we write x = (xa, xb, xc). For convenience of description, we assume that the sets of elements corresponding to xc, xb, and xa are desensitized by context patterns C1, C2, and C3, respectively, and output 0 (in reality, about half of the elements desensitized in one context are also desensitized in another context). When the modification by C1, for example, is applied to the network, x is projected onto the subspace (xa, xb, O) (O is the zero vector) and the network operates according to the dynamics in that subspace. Thus, when the cue pattern S is given as the initial state, x(C1) makes a transition from S(C1) to T1(C1) along the trajectory attractor formed in that subspace. As described previously, if T1 is an attractor, x ≈ T1 is established at this stage, and T1 is exactly recalled by setting C = I. Similarly, under modification by C2 or C3, the state makes a transition along the trajectory attractor formed in the corresponding subspace, and T2 or T3 is recalled.

What is important here is to avoid interference between trajectory attractors formed in the same or different subspaces. For example, if x were projected onto (xa, xb, Xc) (Xc being some pattern with ±1 components), strong interference, essentially the same as the problem of averaging, would be produced. This interference can be minimized by setting the output of the desensitized elements to the average value for each element.

4.3. Numerical experiment

Computer simulations were performed on the model obtained by applying product-type contextual modification to the nonmonotonic associative model (the product-type model), and its properties were examined experimentally. For comparison, the model with direct-sum type contextual modification (the direct-sum type model) was also examined.

In the experiments, p cue patterns S1, . . . , Sp and q context patterns C1, . . . , Cq were used (p, q ≥ 2). All combinations of these were associated with given target patterns Tµ,ν (µ = 1, . . . , p; ν = 1, . . . , q). These patterns were constructed at random so that their components took values of 1 or –1 with probability 1/2. In the product-type model, the dimensions of Sµ and Cν were equal to the number n of elements in the network. In the direct-sum type model, n1 elements represented cues and targets and n2 elements represented contexts; thus, Sµ and Tµ,ν are n1-dimensional patterns, and Cν is an n2 (= n – n1)-dimensional pattern.
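
The pattern construction for the two models, together with the direction-cosine similarity and the 0.75 criterion described in the following paragraph, can be sketched as below; all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def random_patterns(count, dim):
    """Patterns whose components are 1 or -1 with probability 1/2."""
    return rng.choice([-1, 1], size=(count, dim))

def similarity(a, b):
    """Direction cosine between two pattern vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# product-type model: cues, contexts, and targets are all n-dimensional
n, p, q = 1000, 8, 8
S = random_patterns(p, n)                       # S^1 ... S^p
C = random_patterns(q, n)                       # C^1 ... C^q
T = random_patterns(p * q, n).reshape(p, q, n)  # T^{mu,nu}

# direct-sum type model: cues/targets use n1 elements, contexts use n2 = n - n1
n1 = 800
S_ds = random_patterns(p, n1)
C_ds = random_patterns(q, n - n1)

recalled = T[0, 0]                # stand-in for the pattern the network recalls
print(similarity(recalled, T[0, 0]) > 0.75)   # association counted as achieved
```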

Fig. 3. Schematic representation of the process of context-dependent association.


We first consider the case in which p = q and all of the Tµ,ν are different, and investigate the range of p for which correct association can be realized. The association is judged to be achieved when the similarity (defined by the direction cosine between pattern vectors) between the target and the recalled pattern is more than 0.75 for every combination of cue and context patterns (in practice, the similarity in most cases either reaches 1 or stays at a small value). In the direct-sum type model, the ratio of n1 to n2 was varied in the experiment and the values that seemed optimal were used. The parameters set in common to the two models were as follows:

In the training, the duration of the learning signal r (the time required for the change from the cue to the target) was set to 5τ, followed by additional training for 1τ with the target pattern as the learning signal. The training was repeated as many times as necessary, but 10 to 30 repetitions were sufficient for each association, and little change occurred if training was continued further.

Figure 4 shows the association capacity, namely, the maximum value of m for which association can be achieved, for three values of n. We see that the capacity is considerably larger in the product-type model (solid line) than in the direct-sum type model (dashed line). What is more important is that the capacity increases almost in proportion to n in the former, while the ratio of the capacity to n decreases with increasing n in the latter. Even in the direct-sum type, the capacity increases in proportion to n provided that each Sµ is modified by only a small number of the q context patterns Cν. Consequently, the above decrease in efficiency is ascribed to the problem of averaging caused by the one-to-many correspondence between a cue or context pattern and the target patterns.

The above tendency is expected to become more marked when there are overlaps among the Tµ,ν, so that the targets have a many-to-many correspondence with the Sµ and Cν. We therefore performed the same experiment with the targets restricted to p patterns, as in Table 1: the µ-th row and ν-th column of the table give Tµ,ν, which is one of the patterns T1, . . . , Tp depending on the value of µ + ν. Thus, the average of the target patterns in each row or column is always the same, and a strong averaging effect is expected to appear.
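
The contents of Table 1 are not reproduced in this transcript; the modular rule sketched below is one assignment with the stated property (each row and each column uses every one of T1, . . . , Tp exactly once, so all row and column averages coincide) and is an assumed layout.

```python
import numpy as np

p = 6
rng = np.random.default_rng(6)
targets = rng.choice([-1, 1], size=(p, 1000))   # T^1 ... T^p

# T^{mu,nu} depends only on (mu + nu) mod p, so every row and every column of
# the p x p table contains each of T^1..T^p exactly once (assumed layout).
index = np.array([[(mu + nu) % p for nu in range(p)] for mu in range(p)])
T_table = targets[index]                         # shape (p, p, 1000)
assert all(sorted(row) == list(range(p)) for row in index.tolist())
```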

Figure 5 shows the results. We see that the capacity of the direct-sum type model (dashed line) is extremely small compared to the case in which there is no overlap among the targets (Fig. 4). In addition, the capacity does not increase but instead decreases with increasing n. One reason for this is the strong averaging effect. Another reason is as follows: in the experiment shown in Fig. 4, association becomes easier when the target is a stronger attractor, since each target pattern corresponds to a single cue and a single context pattern; in the present case, however, a target corresponds to multiple Sµ and Cν, and this effect is not produced.

In the product-type model, on the other hand, the capacity not only increases in proportion to n but is even larger than in Fig. 4. This indicates that even if the targets overlap, the product-type model is free from the problem of averaging; the capacity actually increases owing to the decrease in the number of target patterns.

The above experiments were performed under the condition p = q, but the situation remains the same when p ≠ q. According to the results of experiments for various combinations of p and q, the performance of the direct-sum type model is degraded when the scale is enlarged or the targets overlap, whereas the product-type model maintained a capacity of at least 0.16n in every case.

Table 1. Example of associations causing a strong averaging effect

Fig. 4. Capacity of association when the targets have no duplication.

5. Context-Directed Association

In the numerical experiments described in the previous section, the target patterns Tµ,ν and the cue patterns Sµ were selected completely independently. In practice, however, there can be patterns common to the two sets. In particular, when all target patterns form part of the set of cue patterns, patterns are recalled successively, with each recalled target serving as the next cue. In this process, it is possible to control the recalled sequence by actively switching the context pattern. Such sequential association, in which the cue pattern S is given only in the initial state and the subsequent behavior is controlled by inputting the context pattern C, is called context-directed association.

In the context-directed association model, we can regard C as a control input to the network that causes the state transition from Sµ1 to Sµ2 (note that the relation between S and C is the reverse of that in the Elman network). Consider in particular the case in which a direct transition is possible from an arbitrary Sµ1 to an arbitrary Sµ2. When p = 4, for example, all state transitions can be realized using q = 4 context patterns C1 to C4, as shown in Fig. 6. Similarly, when p = q = 10, transitions between any states can be realized by the associations shown in Table 2.
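
The correspondence with a finite automaton can be sketched as follows. The transition table delta below is illustrative; the actual contents of Fig. 6 and Table 2 are not reproduced in this transcript.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, q = 1000, 4, 4
states = rng.choice([-1, 1], size=(p, n))   # S^1 ... S^p: automaton states
inputs = rng.choice([-1, 1], size=(q, n))   # C^1 ... C^q: automaton inputs

# delta[mu][nu]: index of the successor state reached from state mu on input nu
delta = [[1, 2, 3, 0],
         [2, 3, 0, 1],
         [3, 0, 1, 2],
         [0, 1, 2, 3]]

# Associations to be learned: cue S^mu under context C^nu recalls the target
# S^{delta[mu][nu]}, which then serves as the cue for the next transition.
associations = [(states[mu], inputs[nu], states[delta[mu][nu]])
                for mu in range(p) for nu in range(q)]

def reference_run(start, input_sequence):
    """Symbolic run of the automaton that the trained network is meant to
    reproduce by switching the context pattern at each step."""
    s = start
    for nu in input_sequence:
        s = delta[s][nu]
    return s

print(reference_run(0, [0, 1, 3, 2]))
```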

5.1. Example

A model with n = 1000 was trained on the associations given in Table 2. The parameters and training process are almost the same as in Section 4.3, except that training with r = Sµ(I) is not performed, since it is unnecessary. The number of training repetitions was 22 for each association.

Figure 7 shows the behavior of the model after training. The curves show the time course of the similarity between the network state x and the patterns S1 to S10.

(a) is the case in which x = S6 is used as the initial state and the context pattern C1 continues to be input. x starts from S6 and finally reaches S1 through states S5, S7, and so on; it then stays in that state. This is exactly the state transition shown in the first column of Table 2.

Fig. 5. Capacity of association when the same pattern is repeatedly used for many targets.

Fig. 6. Example of the state transition diagram of the model.

Table 2. State transition rules used for the experiment


(b) shows the case in which the conditions are the same as in (a) until t = 10τ, after which C is switched to C8 for 10τ < t ≤ 25τ, to C7 for 25τ < t ≤ 38τ, and to C5 for 38τ < t ≤ 45τ. As this example shows, the Sµ can be recalled successively in an arbitrary order, or a state can be retained for a while. Note also that C is switched from C7 to C5 at t = 38τ, in the course of a state transition, indicating that even if the context pattern changes during a transition, the association can be continued with little problem.

We verified that the correct patterns are recalled for all combinations of (Sµ, Cν). It was also verified that, provided the state is x ≈ Sµ, the intended state transitions are made no matter how C is switched.

5.2. Discussion

Figure 6 is the state-transition diagram of a 4-state, 4-input finite automaton. Table 2 is an example of the state-transition rules of a 10-state, 10-input finite automaton. Thus, the above results imply that the finite automaton represented by Table 2 has been realized on the network.

Since this automaton has a direct transition path between every pair of states, any finite automaton of the same size can be realized. In addition, considering that p need not equal q and that p and q can be increased without limit in proportion to the number of elements, we see that this model can simulate the behavior of any p-state, q-input automaton. The number of elements required in this approach is only about n = 10m (m is the number of state transition paths, m ≤ pq). Moreover, the number of training repetitions required to form a trajectory attractor is almost constant regardless of n. Since all elements operate completely in parallel, the time required for training increases, in principle, only in proportion to m.

The above does not imply that this model is completely equivalent to a p-state, q-input finite automaton. The first point to note is that the number of states the network can take is much larger than p; these states include not only the Sµ and the states connecting them, but also other states for unknown inputs.

Another noteworthy point is that generalization is possible, since both Sµ and Cν are based on distributed representations. There can be various levels of generalization, but it is at least guaranteed by the properties of the attractors that when a pattern sufficiently close to Cν is input, the same state transition occurs as for the input of Cν itself.

It has also been verified in a preliminary experiment that when the same association is learned in several different contexts, the same target pattern is recalled in an entirely new context (a different pattern can be recalled by training for another association in the new context). Such generalization over multiple contexts is not observed in the mixture of experts. We have obtained other results suggesting that the proposed model has generalization abilities of a higher level, but the details are left for future study.

Incidentally, a neural network model named PATON [11, 12] has been proposed which can perform context-dependent association and realize the same state transitions as an automaton; it is similar to the proposed model in that the output gain of some of the elements is set to 0 depending on the context. However, PATON uses local rather than distributed representations for symbols, which differs from the approach of this study, whose objective is to improve the capabilities of neural computation based on distributed representations. Moreover, PATON differs essentially from the proposed model in that it does not solve the problem of averaging, and thus it does not work well unless the model is small. The similarity between the two models is therefore only superficial.

So far, no neural network model other than the proposed one has completely solved the problem of averaging in context-dependent association and simulated large-scale finite automata on the basis of distributed representations alone. Of course, this by no means implies that no other such model can exist, but the authors believe that there cannot be a simpler model than the one proposed in this paper.

Fig. 7. Behavior of the context-directed association model.


6. Conclusions

This paper has discussed the problems involved in the conventional contextual modification method for neural networks based on distributed information representations, and has shown that these problems can be solved by applying contextual modification with selective desensitization to a nonmonotonic neural network. It has also been shown that the proposed model can recall a large number of different patterns according to a large number of contexts, and can simulate any finite automaton.

The obtained results not only greatly enlarge the capabilities of neural computation, but also suggest that, using distributed representations alone, a context-dependent information processing system can be built whose ability equals or exceeds that of systems based on symbols. This may show a way to overcome the limitations of conventional symbol processing arising from the symbol grounding problem and the frame problem. To verify this possibility, it will be necessary to examine the analogical inference power of the model by introducing some structure into the representation of, or the similarity between, patterns; this remains a topic for future study.

This paper has not discussed the biological relevance of the model, but the authors believe that the selective desensitization method is actually used in the brain. This idea has a sound basis in computational theory and the physiology of the brain, but will be discussed in another paper.

Acknowledgment. This study was supported by Special Coordination Funds for Promoting Science and Technology from the Ministry of Education, Culture, Sports, Science and Technology of Japan under the heading “Study of Dynamic Memory Systems for Realization of Context-Initiative Recognition, Decision and Behavior Functions.”

REFERENCES

1. Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE. Adaptive mixtures of local experts. Neural Computation 1991;3:79–87.

2. Elman JL. Finding structure in time. Cogn Sci 1990;14:179–211.

3. Cleeremans A, Servan-Schreiber D, McClelland JL. Finite state automata and simple recurrent networks. Neural Computation 1989;1:372–381.

4. Hirai Y. A model of human associative processor (HASP). IEEE Trans Syst Man Cybern 1983;13:851–857.

5. Kawamura M, Okada M, Hirai Y. Analysis of a correlation-type associative memory with one-to-many associations. Trans IEICE 1998;J81-D-II:1336–1344.

6. Morita M. Associative memory of sequential patterns using nonmonotone dynamics. Trans IEICE 1995;J78-D-II:679–688.

7. Morita M. Memory and learning of sequential patterns by nonmonotone neural networks. Neural Networks 1996;9:1477–1489.

8. Morita M. Associative memory with nonmonotone dynamics. Neural Networks 1993;6:115–126.

9. Morita M, Yoshizawa S, Nakano K. Associative memory for correlated patterns using nonmonotone dynamics. Trans IEICE 1992;J75-D-II:1884–1891.

10. Morita M, Murakami S. Recognition of sequential patterns using nonmonotone neural network. Trans IEICE 1998;J81-D-II:1679–1688.

11. Mochizuki A, Ohmori R. PATON: A dynamic neural network model of context dependent memory access. J Jpn Soc Neural Networks 1996;3:81–89.

12. Omori T, Mochizuki A, Mizutani K, Nishizaki M. Emergence of symbolic behavior from brain like memory with dynamic attention. Neural Networks 1999;12:1157–1172.


AUTHORS (from left to right)

Masahiko Morita (member) received his B.S. degree from the Department of Mathematical Engineering and Instrumentation Physics, University of Tokyo, in 1986 and completed the doctoral program in 1991. After serving as a special researcher at JSPS and as a research associate at the University of Tokyo, he became a lecturer at the Institute of Electronic Information Science in 1992, and is now an associate professor at the Institute of Engineering Systems, University of Tsukuba. He has been engaged in research on biological information processing mechanisms and neural networks. He received a Research Award (1993) and a Paper Award (1994) from the Japan Neural Network Society and an Encouragement Award (1999) from the Japan Psychological Society.

Kouhei Matsuzawa received his B.S. degree from the Institute of Engineering Systems, University of Tsukuba, in 2000 and enrolled in the Graduate School of Systems and Information Engineering. He is engaged in research on neural network models.

Shigemitsu Morokami (student member) received his B.S. degree from the Institute of Engineering Systems, University of Tsukuba, in 2000 and enrolled in the Graduate School of Systems and Information Engineering. He is engaged in research on neural network models.
