
Neurocomputing 15 (1997) 273-307

Habituation based neural networks for spatio-temporal classification

Bryan W. Stiles, Joydeep Ghosh*

Department of Electrical and Computer Engineering, The University of Texas at Austin, TX 78712-1084, USA

Received 3 April 1995; revised 28 January 1997

Abstract

A new class of neural networks is proposed for the dynamic classification of spatio-temporal signals. These networks are designed to classify signals of different durations, taking into account correlations among different signal segments. Such networks are applicable to SONAR and speech signal classification problems, among others. Network parameters are adapted based on the biologically observed habituation mechanism. This allows the storage of contextual information, without a substantial increase in network complexity. We introduce the concept of a complete memory. We then prove mathematically that a network with a complete memory temporal encoding stage followed by a sufficiently powerful feedforward network is capable of approximating arbitrarily well any continuous, causal, time-invariant discrete-time system with a uniformly bounded input domain. The memory mechanisms of the habituation based networks are complete memories, and involve nonlinear transformations of the input signal. In networks such as the time delay neural network (TDNN) [35] and focused gamma networks [8], nonlinearities are present in the feedforward stage only. This distinction is made important by recent theoretical results concerning the limitations of structures with linear temporal encoding stages. Results are reported on classification of high dimensional feature vectors obtained from Banzhaf sonograms.

Keywords: Dynamic neural networks; Habituation; Classification; Spatio-temporal signals; Recurrent networks

This work was supported in part by NSF grant ECS 9307632 and ONR contract N00014-92C-0232. Bryan Stiles was also supported by the Du Pont Graduate Fellowship in Electrical Engineering. We thank Prof. I. Sandberg for several fruitful discussions. We would also like to thank the reviewers for many useful suggestions for improving the paper.

*Corresponding author. E-mail: [email protected].

0925-2312/97/$17.00 Copyright © 1997 Elsevier Science B.V. All rights reserved

PII S0925-2312(97)00010-6


1. Introduction

Many tasks performed by humans and animals involve decision-making and behavioral responses to spatio-temporally patterned stimuli. Thus the recognition and processing of time-varying signals is fundamental to a wide range of cognitive processes. Classification of such signals is also basic to many engineering applications such as speech recognition, seismic event detection, sonar classification and real-time control [21,22].

A central issue in the processing of time-varying signals is how past inputs or “history” is represented or stored, and how this history affects the response to the current inputs. Past information can be used explicitly by creating a spatial, or static, representation of a temporal pattern. This is achieved by storing inputs in the recent past and presenting them for processing along with the current input. Alternatively, the past events can be indirectly represented by a suitable memory device such as a series of possibly dispersive time-delays, feedback or recurrent connections, or changes in the internal states of the processing cells or “neurons” [22,14].

The past few years have witnessed an explosion of research on neural networks for temporal processing, and surveys can be found in [13,22,23] among others. Most of this research has centered on artificial neural network models such as time delayed neural networks, the recurrent structures of Elman [10] and Jordan [18], infinite impulse response (IIR) networks, etc., that can utilize the extension of the back-propagation algorithm to recurrent architectures [37,38]. Another large class of networks, inspired by physics, is based on transitions between attractors in asymmetric variants of the Hopfield type networks [3,17]. Some researchers have also studied spatio-temporal sequence recognition mechanisms based on neurophysiological evidence, especially from the olfactory, auditory and visual cortex. Representative of this body of work is the use of non-Hebbian learning proposed by Granger and Lynch, which, when used in networks with inhibitory as well as excitatory connections, can be used to learn very large temporal sequences [15]. Similar networks have been used to act as adaptive filters for speech recognition [20], or provide competitive-cooperative mechanisms for sequence selection [5]. At the neuronal level, irregularly firing cells have been proposed as a basic computational mechanism for transmitting temporal information through variable pulse rates [7]. All these efforts are concentrated on neurophysiological plausibility rather than being geared toward algorithmic classification of man-made signals. Issues such as computational efficiency and ease of implementation on computers are secondary. In contrast, we introduce new mechanisms in this paper which have their roots in biological systems, but are adapted for efficient and practical applications involving temporal processing.

Among biological mechanisms that can encode temporal information is a particularly simple and well understood phenomenon known as habituation [1,4,25,36]. Primarily, habituation is a means by which biological neural systems vary their synaptic strengths in order to ignore repetitive, irrelevant stimuli. Habituation serves as a novelty filter. If the presynaptic neuron is active for a short period of time, habituation tends to decrease the synaptic strength, which then recovers only after the period of activity is over. The longer the presynaptic neuron is active, the slower it recovers. It is important to note that habituation does not act in a vacuum. Other learning mechanisms, such as sensitization and Hebbian learning, may also be operating concurrently to alter synaptic strengths based on the utility of the information from presynaptic neurons.

Several researchers in neurophysiology have developed mathematical models of habituation [1,4,36]. A discrete time version of the Wang and Arbib [36] habituation model for varying the strength, W(t), of a single synapse is summarized by

W(t + 1) = W(t) + τ(αz(t)(W(0) − W(t)) − W(t)I(t)), (1)

z(t + 1) = z(t) + γz(t)(z(t) − 1)I(t). (2)

In this model, I(t) is the activation of the presynaptic neuron at time t, τ is a constant used to vary the habituation rate, and α is a constant used to vary the ratio between the rate of habituation and the rate of recovery from habituation. The function z(t) monotonically decreases with each activation of the presynaptic neuron. This function is used to model long term habituation. Due to the effect of z(t), after a large number of activations of the presynaptic neuron, the synapse recovers from habituation more slowly.
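For concreteness, Eqs. (1) and (2) can be iterated directly. The sketch below is an illustrative Python/NumPy implementation, not code from the original work; the parameter values, and the choice to start z slightly below 1 (z = 1 is a fixed point of Eq. (2)), are assumptions made here.

```python
import numpy as np

def wang_arbib_habituation(I, tau=0.2, alpha=0.5, gamma=0.01, W0=1.0, z0=0.99):
    """Iterate the discrete-time Wang-Arbib habituation model, Eqs. (1)-(2).

    I     : sequence of presynaptic activations I(t), assumed to lie in [0, 1]
    tau   : habituation rate constant
    alpha : ratio between the habituation and recovery rates
    gamma : rate constant of the long term habituation variable z(t)
    Returns arrays of W(t) and z(t) over time.
    """
    W, z = W0, z0
    W_hist, z_hist = [W], [z]
    for It in I:
        W_new = W + tau * (alpha * z * (W0 - W) - W * It)   # Eq. (1)
        z_new = z + gamma * z * (z - 1.0) * It              # Eq. (2)
        W, z = W_new, z_new
        W_hist.append(W)
        z_hist.append(z)
    return np.array(W_hist), np.array(z_hist)

# A burst of presynaptic activity habituates the synapse; the strength recovers
# once the activity stops, and z decreases with every activation.
I = np.concatenate([np.ones(50), np.zeros(100)])
W, z = wang_arbib_habituation(I)
print(W[50], W[-1], z[-1])
```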

Aside from its primary function, habituation has also been suggested to be a means of encoding short term temporal information [25]. In this paper, we introduce mechanisms for using habituation to encode temporal information in an artificial neural network. In Section 2 we describe our mechanisms, and provide mathematical proofs of their capabilities. We demonstrate that these mechanisms are special cases of a general neural network structure which is capable of approximating arbitrarily well any continuous, causal, time-invariant, discrete time system. The habituation based network structures differ significantly from other approaches such as the focused gamma network [8] in that they make use of nonlinear temporal encoding mechanisms. Networks with linear temporal encoding mechanisms have certain inherent limitations which are indicated in Section 2, and are explained in further detail in [32]. A full version of this paper [33] which includes proofs may be obtained from the world wide web at www.lans.ece.utexas.edu. In Section 3, we discuss the general issues involved in spatio-temporal processing with artificial neural networks, motivate the data sets used, and give experimental results for our networks on the classification of artificial Banzhaf sonograms. We demonstrate that our network outperforms short time window TDNNs for a number of classification problems involving long term temporal information. On these data sets, the network performs similarly to the focused gamma network, arguably the current state of the art in neural networks for modeling spatio-temporal problems. Finally, in Section 4, we draw conclusions based upon our theoretical and experimental results.


2. Habituation mechanisms for encoding temporal information

2.1. General structure

We have designed short term habituation units based upon the Wang and Arbib model of habituation and used them in a spatio-temporal classification network. A set of habituated values is first obtained from the input I(t). If the input is multidimensional, one set is extracted for each component. These weights are affected by the past values of the input, and implicitly encode temporal information. Spatio-temporal classification can thus be achieved by using such habituated weights as inputs to a static classifier. The model equation is as follows:

W_k(t + 1) = W_k(t) + τ_k(α_k(1 − W_k(t)) − W_k(t)I(t)). (3)

This equation is derived from Eq. (1) by setting z(t) = 1 to eliminate long term habituation effects, and letting W_k(0) = 1. Long term habituation is eliminated so that the ability of W_k(t) to recover from habituation does not vary over time. Otherwise the W_k(t) values would eventually decrease to zero for all but the most infrequent of inputs. The k index is used to indicate that multiple values W_k(t + 1) are determined for an input signal I(t). It was found mathematically that multiple habituation values are better able to encode temporal information. This fact may also have biological context, because it is known that pairs of neurons often have multiple synapses between them. Notice that Eq. (3) is a nonlinear difference equation. It is not linear because of the W_k(t)I(t) term, in which the output of the system at time t is multiplied by the input of the system. Such a system is commonly referred to as bilinear. An equivalent equation in which the recursion has been eliminated is

W_k(t) = α_kτ_k + α_kτ_k Σ_{j=1}^{t−1} Π_{h=j}^{t−1} (1 − α_kτ_k − τ_k I(h)) + Π_{i=0}^{t−1} (1 − α_kτ_k − τ_k I(i)), (4)

which makes the nonlinear nature of Eq. (3) more explicit. The parameters τ_k and α_k affect the rate at which habituation occurs, thereby determining the temporal resolution and range of the information obtained. The issues and tradeoffs involved are akin to memory depth versus resolution in dispersive delay line based models [8,23]. We set W_k(0) = 1 for all k, employ positive values of α_k and τ_k such that α_kτ_k + τ_k < 1, and normalize the input such that I(t) ∈ [0, 1]. With these specifications, we can guarantee that the habituation process is stable. In fact we can guarantee that W_k(t) ∈ [0, 1] for all values of k and t.
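As an illustrative sketch (Python/NumPy is assumed; the parameter values are arbitrary), a bank of habituation units implementing Eq. (3) can be written as follows. The assertion checks the stability condition α_kτ_k + τ_k < 1, and the final print confirms empirically that the outputs remain in [0, 1].

```python
import numpy as np

def habituation_bank(I, alphas, taus):
    """Run m habituation units (Eq. (3)) over a scalar input sequence I(t) in [0, 1].

    alphas, taus : length-m arrays of positive parameters with alpha*tau + tau < 1.
    Returns an array of shape (len(I) + 1, m) holding W_k(t), with W_k(0) = 1.
    """
    alphas, taus = np.asarray(alphas, float), np.asarray(taus, float)
    assert np.all(alphas * taus + taus < 1.0), "stability condition violated"
    W = np.ones_like(alphas)                 # W_k(0) = 1
    history = [W.copy()]
    for It in I:
        # Eq. (3): W_k(t+1) = W_k(t) + tau_k*(alpha_k*(1 - W_k(t)) - W_k(t)*I(t))
        W = W + taus * (alphas * (1.0 - W) - W * It)
        history.append(W.copy())
    return np.array(history)

rng = np.random.default_rng(0)
I = rng.uniform(0.0, 1.0, size=200)          # normalized input in [0, 1]
W = habituation_bank(I, alphas=[0.3, 0.5, 0.7], taus=[0.1, 0.2, 0.3])
print(W.min(), W.max())                      # stays within [0, 1]
```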

Fig. 1. Structure of habituated neural networks: a bank of habituation units feeding a nonlinear feed-forward network (MLP, RBF, etc.).

In this paper, dynamic classification is achieved by training a suitable nonlinear feedforward network using habituated signals as inputs. In [36] W_k(t) represents a synaptic strength, and I(t) the activity of the presynaptic neuron, but because our designs use habituated values as network inputs rather than weights, the variables are redefined accordingly. We do not mean to imply that this network construction is either the most biologically feasible or the only method in which habituation might be used. A more biologically inspired approach would be to use W_k(t) as modulating weights of the inputs. We found by experiment, however, that this approach, although more biologically feasible, does not encode temporal information as well for the classification problems which we studied. Moreover, the structure of Fig. 1 can be shown mathematically to be very powerful.

Two different memory mechanisms are presented below. Each extracts a set of habituation values which are then presented to the feedforward stage. Both of these mechanisms make use of Eq. (3). As mentioned previously, if the input is multidimensional one set is extracted for each component. The first memory mechanism is a set of m habituation units with outputs W_k, 1 ≤ k ≤ m, that are extracted from the raw input I. Each habituation unit makes use of different α_k and τ_k parameters. At each time instant t, W_k(t + 1) is computed using Eq. (3). These values are then presented to the feedforward stage as inputs. During training and testing, the habituation values are never reinitialized, but continue to vary as functions of the past history of the input. Fig. 1 shows the generic structure of a two-stage network using this memory mechanism. Such a network is referred to as a Habituated Neural Network. More specifically, if a multilayered perceptron (MLP) (alt. radial basis function network) is used, the overall network is a habituated MLP (alt. habituated RBF).
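A minimal two-stage sketch of a habituated MLP is given below. It assumes Python with NumPy and scikit-learn, and uses MLPClassifier as a stand-in for the paper's two-layer MLP; the toy data, parameter values, and class construction are invented here purely for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def habituate(I, alphas, taus):
    """Encode a (T, d) multichannel stream with m habituation units per channel (Eq. (3)).
    Returns a (T, d*m) array of state vectors."""
    T, d = I.shape
    W = np.ones((d, len(alphas)))                           # W_k(0) = 1 for every channel
    states = np.empty((T, d * len(alphas)))
    for t in range(T):
        W = W + taus * (alphas * (1.0 - W) - W * I[t][:, None])
        states[t] = W.ravel()
    return states

rng = np.random.default_rng(1)
alphas, taus = np.array([0.3, 0.5, 0.7]), np.array([0.1, 0.2, 0.3])

def make_sequence(label, T=40, d=4):
    """Toy signals: class 0 is mostly quiet, class 1 mostly active."""
    base = 0.2 if label == 0 else 0.8
    return np.clip(rng.normal(base, 0.1, size=(T, d)), 0.0, 1.0)

# Concatenate the signals into one continuous stream (the habituation values are
# never reinitialized), keeping a per-sample class label.
segments, labels = [], []
for label in (0, 1):
    for _ in range(50):
        seg = make_sequence(label)
        segments.append(seg)
        labels.append(np.full(len(seg), label))
stream, y = np.vstack(segments), np.concatenate(labels)

X = habituate(stream, alphas, taus)                          # temporal encoding stage
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
                    random_state=0).fit(X, y)                # static feedforward stage
print("training accuracy:", clf.score(X, y))
```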

The second memory mechanism considered is comprised of a set of cascaded habituation units with a single set of parameters α and τ. The first habituation unit (W_1) operates on the inputs using Eq. (3). Each successive habituation unit operates on the output of the one before it as follows:

W_k(t + 1) = W_k(t) + τ(α(1 − W_k(t)) − W_k(t)(1 − W_{k−1}(t))). (5)

Fig. 2. Structure of cascaded habituated neural networks: a chain of habituation units, each driven through a unit delay by the one before it, feeding a nonlinear feed-forward network (MLP, RBF, etc.).

As was the case with the first network, the memory stage is followed by a feedforward neural network, and the habituation values are never reinitialized. Such a structure is referred to as a Cascade Habituated Neural Network (i.e. Cascade Habituated MLP, Cascade Habituated RBF, etc.) and is illustrated in Fig. 2. This system has the same universal approximation capabilities as the first network without the necessity of varying the habituation parameters. Varying those parameters, however, can be useful for selecting a more efficient model.
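The cascade of Eq. (5) can be sketched in a few lines (illustrative Python/NumPy; the {α, τ} = {0.5, 0.2} values match a setting used later in the experiments, but the test input is arbitrary):

```python
import numpy as np

def cascade_habituation(I, m=5, alpha=0.5, tau=0.2):
    """Cascade of m habituation units sharing a single (alpha, tau) pair.

    The first unit follows Eq. (3) driven by the input I(t); each later unit is
    driven by 1 - W_{k-1}(t), as in Eq. (5).  Returns a (len(I) + 1, m) array.
    """
    assert alpha * tau + tau < 1.0, "stability condition"
    W = np.ones(m)                        # W_k(0) = 1
    history = [W.copy()]
    for It in I:
        drive = np.empty(m)
        drive[0] = It                     # first unit sees the raw input
        drive[1:] = 1.0 - W[:-1]          # later units see 1 - W_{k-1}(t)
        W = W + tau * (alpha * (1.0 - W) - W * drive)
        history.append(W.copy())
    return np.array(history)

I = np.r_[np.ones(30), np.zeros(60)]      # a burst of activity followed by silence
states = cascade_habituation(I)
print(states[30])                         # units deeper in the cascade react more gradually
print(states[-1])                         # all units recover toward 1 after the burst
```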

2.2. Mathematical properties

The following is a mathematical proof of the ability of a general category of neural networks, including habituation based networks, to approximate arbitrarily well any continuous, causal, time-invariant mapping f from one uniformly bounded discrete time sequence to another. Since all functions realized by arbitrarily complex TDNNs or focused gamma networks are continuous, causal, and time-invariant, the proof given also implies that habituation based networks are at least as powerful in terms of their approximation capabilities. (For proofs of the TDNN and focused gamma network capabilities see [26-28].) The key to the proof is to show that the memory structure realized by the habituated weights is a complete memory. Then so long as the feedforward stage is capable of uniformly approximating continuous functions, the overall network will have the desired universal approximation capability.

First, we introduce a few definitions. Henceforth the notation (fg)(t) shall be taken to mean the operator f operating on the sequence g at time t. This should not be confused with multiplication of two sequences, which is denoted by f(t)g(t). Let X be the set of mappings from the set of integers, Z, to the set of real numbers, ℝ, with the following constraints: for all I ∈ X, I(t) = 0 if t ≤ 0, and for all t ∈ Z, I(t) ∈ [0, 1]. Let R be the set of all mappings from the nonnegative integers, Z+, to ℝ. Let the operator T_β be a mapping from X to X defined as follows: for all x ∈ X, and t and β ∈ Z+,

(T_β x)(t) = x(t − β).

We say a mapping M from X to R is time-invariant if for each β ∈ Z+ and each x ∈ X,

(M T_β x)(t) = (Mx)(t − β) for all t ≥ β.

A function is time-invariant if the absolute time at which an input pattern occurs does not affect the output of the function. Time-invariance does not imply context independence.

Let the mapping P_a from X to X be defined as follows: for all x ∈ X,

(P_a x)(t) = x(t) for t ≤ a, and (P_a x)(t) = 0 for t > a.

Let C_a denote the intersection of the two sets [0, a] and Z+. We say a mapping M from X to R is causal if for all a ∈ Z+, the statement P_a x = P_a y implies (Mx)(t) = (My)(t) for all t ∈ C_a. For a causal M, the value of the sequence Mx at any point t is independent of the future values of x.

First we shall show that a two layer neural network with an exponential activation function and a general structure for processing the inputs can universally approximate f. Then we will show that habituated and cascade habituated networks are specific cases of the generalized structure. This generalized structure has a memory mechanism comprised of operators drawn from a complete memory. A complete memory is a set of operators on discrete time sequences which has four definitive properties. First, the elements are bounded input bounded output stable. Secondly, for any two input signals x and y which differ prior to time t, there is an element b which can discriminate between x and y at time t. Third, the elements satisfy a mild form of time-invariance. The final property is that the elements of a complete memory are causal. The precise definition is:

Definition 1. A set B of functions from X to R is a complete memory on X if it has the following four properties. First, there exist real numbers a and c such that (bx)(t) ∈ (a, c) for all t ∈ Z+, x ∈ X, and b ∈ B. Second, for any t ∈ Z+ and any t_0 such that 0 < t_0 ≤ t, the following is true: if x and y are elements of X and x(t_0) ≠ y(t_0), then there exists some b ∈ B such that (bx)(t) ≠ (by)(t). Third, if b ∈ B then (b T_β x)(t) = (bx)(t − β) for all t ∈ Z+, all x ∈ X and any β such that 0 ≤ β ≤ t. Fourth, every b ∈ B is causal.

Theorem 1. Let f be a continuous, causal, time-invariant function from X to R, and let B be a set of mappings from X to R. If B is a complete memory on X then the following is true. Given any ε > 0 and any arbitrarily large positive integer, t_0, there exist real numbers a_j and c_jk, elements b_jk of B, and natural numbers p and m, such that the following inequality holds for all x ∈ X and all t < t_0:

|(fx)(t) − Σ_{j=1}^{p} a_j exp(Σ_{k=1}^{m} c_jk (b_jk x)(t))| < ε. (6)

A proof of this theorem can be found in Appendix A. Theorem 1 describes the approximation capabilities of a two layer static neural network with an exponential activation function and inputs operated on by elements of a complete memory, B. This result can be extended to include networks with MLP and RBF feedforward stages and to include multiple (d > 1) spatial input dimensions, I_h(t), 1 ≤ h ≤ d. Rigorous proofs of these extensions are not given here for the sake of conciseness, but can be found in [34]. The full version of [34], which is under review, may be obtained from the world wide web at www.lans.ece.utexas.edu. In order to show that MLPs and RBFs can be used to perform the same approximations it is sufficient to show that the exponential function can be approximated arbitrarily well by a summation of sigmoids or gaussian functions. This is a special case of theorems which have already been proven for sigmoids by Cybenko [6] among others and for gaussian functions by Park and Sandberg [24]. The expansion of the result to multiple spatial dimensions follows directly from the proof of Theorem 1. Theorem 1 is concerned with approximations on a finite but arbitrarily large time interval. A similar result is obtained for an infinite interval in [34] with the additional assumption that f has approximately finite memory.

Henceforth we shall refer to any structure with the approximation capability described in Theorem 1 as a universal approximator. Although this term is also used in conjunction with static networks, the context in which it is used should eliminate any confusion.
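For illustration only, the network form appearing in Theorem 1 can be evaluated directly, using habituation operators (Theorem 2 below) as the complete-memory elements b_jk. In the Python/NumPy sketch that follows, the coefficients a_j, c_jk and the operator parameters are random placeholders rather than trained or optimized values.

```python
import numpy as np

def habituation_operator(x, alpha, tau):
    """(b x)(t), t = 0..len(x)-1, for the habituation operator of Theorem 2 (W(0) = 1)."""
    W, out = 1.0, []
    for xt in x:
        W = W + alpha * tau * (1.0 - W) - tau * W * xt
        out.append(W)
    return np.array(out)

def exp_network(x, a, c, params):
    """Evaluate y(t) = sum_j a_j * exp(sum_k c_jk * (b_jk x)(t)), the form in Eq. (6).

    a      : (p,) output weights
    c      : (p, m) hidden-layer weights
    params : (p, m, 2) array of (alpha, tau) pairs defining the operators b_jk
    """
    p, m = c.shape
    memory = np.stack([[habituation_operator(x, *params[j, k]) for k in range(m)]
                       for j in range(p)])              # shape (p, m, T)
    hidden = np.exp(np.einsum('jk,jkt->jt', c, memory)) # exponential hidden units
    return a @ hidden                                   # output sequence, shape (T,)

rng = np.random.default_rng(2)
p, m = 4, 3
a = rng.normal(size=p)
c = rng.normal(scale=0.5, size=(p, m))
params = np.stack([rng.uniform(0.1, 0.6, size=(p, m)),  # alpha_jk
                   rng.uniform(0.05, 0.3, size=(p, m))], axis=-1)
x = rng.uniform(0.0, 1.0, size=100)
print(exp_network(x, a, c, params)[:5])
```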

It is important to notice that the input processing functions b_jk used in Theorem 1 depend on j, and thus the habituation parameters used also depend on j. This means that different hidden units in the feedforward network may have different input values. This dependency is not present in the structures illustrated in Fig. 1 or Fig. 2. However, we can show that for any approximation g of the form discussed in Theorem 1, there is an equivalent network without this dependency. Let g be an approximation function of the form Σ_{j=1}^{p} a_j exp(Σ_{k=1}^{m} c_jk (b_jk x)(t)). It is easy to see that given any such g one can find an h of the following form such that g(x) = h(x) for all x ∈ X:

h(x)(t) = Σ_{j=1}^{p} a_j exp(Σ_{i=1}^{M} w_ji (s_i x)(t)). (7)

Here w_ji are real numbers which serve as weights to the hidden units and s_i are elements of a complete memory B.

Simply choose M to be the number of distinguishable functions b_jk used in g and let the sequence {s_i} be the list of these distinguishable functions. For a particular s_i and a particular hidden node j, set w_ji to zero if the original b_jk corresponding to s_i was not present at hidden node j; otherwise set w_ji to the appropriate c_jk.

In order to show that habituation based networks are special cases of this type of generalized structure we state the following theorems, which are also proven in the appendix. Theorem 2 states that the memory mechanism of a habituated neural network is a complete memory. Theorem 3 states the same thing for cascade habituated neural networks. The result in Theorem 3 is somewhat more powerful in that the habituation parameters may be fixed.

Theorem 2. Let b_0 = W(0) = 1. A habituation function is defined recursively by

(bx)(0) = W(1) = b_0 + ατ(1 − b_0) − τb_0 x(0) (8)

and

(bx)(t) = W(t + 1) = (bx)(t − 1) + ατ(1 − (bx)(t − 1)) − τ(bx)(t − 1)x(t). (9)

Let B be the set of all such functions for all α and τ ∈ ℝ such that α > 0, τ > 0, τ < 1, and ατ + τ < 1. B is a complete memory.

In Theorem 2, b is the operator which produces habituated values, that is, (bI)(t) = W(t + 1) where I is the input and W is the habituated value. For any specific example of a habituated neural network, m such functions are chosen from B to be used to produce the m habituated values {W_k}.
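Two of the complete-memory properties can be checked numerically for the operator of Theorem 2. The sketch below (illustrative Python; α = 0.5, τ = 0.2 are arbitrary admissible values) perturbs an input at a single time to show that the habituated value still differs later (property 2), and delays an input by β to show the shift behaviour of property 3, which holds exactly because W stays at 1 while the input is zero.

```python
import numpy as np

def habituation_operator(x, alpha=0.5, tau=0.2):
    """(b x)(t) for the Theorem 2 operator, with W(0) = 1, so (bx)(t) = W(t + 1)."""
    W, out = 1.0, []
    for xt in x:
        W = W + alpha * tau * (1.0 - W) - tau * W * xt     # Eqs. (8)-(9)
        out.append(W)
    return np.array(out)

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=60)

# Property 2 (discrimination): change x at a single time instant; the habituated
# values still differ at a later time, although the difference shrinks geometrically.
y = x.copy()
y[10] = 1.0 - y[10]
print(habituation_operator(x)[40] - habituation_operator(y)[40])   # small but nonzero

# Property 3 (time shift): delaying the input by beta delays the output by beta.
beta = 7
x_shifted = np.concatenate([np.zeros(beta), x])
print(np.allclose(habituation_operator(x_shifted)[beta:], habituation_operator(x)))  # True
```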

Theorem 3. Let b_0 = 1. Let α and τ be positive real numbers such that τ < 1 and ατ + τ < 1. A habituation function b is defined recursively by

(bx)(0) = b_0 + ατ(1 − b_0) − τb_0 x(0) (10)

and for t > 0

(bx)(t) = (bx)(t − 1) + ατ(1 − (bx)(t − 1)) − τ(bx)(t − 1)x(t). (11)

Let h_1 = b and let h_i for i > 1 be defined by

(h_i x)(0) = 1 (12)

and for t > 0

(h_i x)(t) = (h_i x)(t − 1) + ατ(1 − (h_i x)(t − 1)) − τ(h_i x)(t − 1)(1 − (h_{i−1} x)(t − 1)). (13)

Let B be the set of all functions h_i for each positive integer i. B is a complete memory.

In Theorem 3, h_i is the operator which generates a cascade habituated value ((h_i I)(t) = W_i(t)). For a specific example of a cascade habituated neural network, m elements of B (usually the first m) are chosen to produce the m cascade habituated values, which are then presented to the feedforward stage.

2.3. Comparing different two stage structures

This paper focuses upon two stage networks in which a memory stage for temporal encoding is followed by a memoryless stage without feedback. The behavior of two stage networks is simpler to analyze than that of fully recursive networks. Networks with more complex feedback connections can also have stability problems which are sometimes difficult to diagnose. Additionally, the training mechanisms required for such networks tend to be complicated. Therefore it is often simpler to make use of a two-stage network.

Like the habituation based networks, TDNNs and gamma networks consist of a complete memory temporal encoding stage followed by a feedforward neural network [34]. We are excluding the case of TDNNs with hidden unit delays [35] in order to focus attention on simpler two-stage models. Henceforth the term TDNN is used to mean a network without hidden unit delays. All of these networks are universal approximators, but such capability assumes no restrictions on the size of the memory stage or the number of hidden units, as well as the ability to find the optimum set of weights. In practice, given a finite structure and training data, and possible limitations of a learning rule, the question that remains is which are more efficient. The answer depends on the nature of the function that is being realized. The focused gamma network and TDNNs have limitations which may make them inappropriate for some problems [32]. The reason for these limitations is that the memory stage of a TDNN or focused gamma network is a linear system. As discussed in [32], there is a wide class of problems which cannot be efficiently modeled by two-stage structures with a linear memory stage. Informally stated, the reason for this problem is that in order to model a particular function using a linear memory structure of limited complexity, it is necessary to trade off memory depth against memory resolution. There are certain simple classification problems in which such a tradeoff is not advantageous. These problems require high memory resolution and depth in order to classify certain simple input patterns. Linear memory structures cannot efficiently solve these problems. On the other hand, nonlinear memory structures can efficiently solve some of these problems because the effective memory depth and resolution can be made to depend on the particular patterns of input values which have been seen in the recent past. Effectively, certain patterns can be remembered longer than others with greater resolution. A particularly simple example of a problem which cannot be efficiently represented by a linear memory structure is the following. Suppose we wish to distinguish between two types of signals:

Case 1: In the past one thousand time instants no input values in the range (0.3,0.7) have occurred.

Case 2: In the past five hundred time instants at least one input value in the range (0.4, 0.6) has occurred.

For a linear memory stage to encode enough information to solve this problem, the output of the memory stage must be at least 500-dimensional. This is a special case of Theorem 3 in [32]. In order to produce a simple two-stage network solution for such a problem it is necessary to use a nonlinear memory stage, such as the memory stages of the two habituation based network models. This limitation of linear memory structures motivates the design and development of simple nonlinear memory structures.

Aside from the limitations of general linear memory structures, habituation based networks and the focused gamma network have advantages over TDNNs. The complexity of a TDNN depends on n, the input window size. The number of weighted inputs to each hidden unit in a TDNN is nd. For functions which only depend on recent values of the inputs, TDNNs can be quite efficient; but for functions which depend on long term temporal information or variable amounts of temporal information, TDNNs are not efficient solutions unless additional preprocessing of the inputs is used (i.e., multiple sampling rates, etc.). For habituation based networks, the required memory depth and resolution affects the choice of α and τ in Eq. (3), and the number of habituation units. Gamma memories also provide a trade-off between memory depth and resolution [8].

Since the output of the temporal encoding stage is different for TDNNs, gamma networks, and habituated networks, the complexity (number of hidden units) of the feedforward network needed at the output stage may also differ. For certain problems, habituated networks require a smaller feedforward output stage as compared to TDNNs for a given level of approximation. We have previously performed experiments using habituated MLPs to classify real SONAR data and have found that small habituated networks outperformed larger TDNNs. In fact we found that even m = 1 networks dramatically outperformed TDNNs with time window length of 5 or more [29,30]. Unfortunately, due to the proprietary nature of the real SONAR data sets, they cannot be made public. Therefore, in the next section, we discuss experimental results on artificial Banzhaf sonograms, which can be easily generated and verified by other researchers. Using these sonograms we compare the performance of habituation based networks with that of focused gamma networks and TDNNs.

3. Experimental results

Before describing the data sets used and providing comparative experimental results, some fundamental issues and terminology pertaining to classification of spatio-temporal signals are summarized in Section 3.1.


3.1. Basic issues in spatio-temporal classification

The signals of interest in speech or sonar processing exhibit time-varying spectral characteristics, and can also be of varying lengths or time durations. Before classification can be attempted, a signal has to be separated from background noise and clutter, and represented in a suitable form. In order to use a static neural network such as the MLP or RBF, the entire signal must first be described by a single feature vector, i.e., a vector of numerical attributes, such as signal energy in different frequency ranges or bands [12]. The selection of an appropriate set of attributes is indeed critical to the overall system performance, since it fundamentally limits the performance of any classification technique that uses those attributes [17].

The use of static classifiers is not ideal because representing each signal by a single feature vector results in a blurring of the spatio-temporal pattern characteristics and may lead to a loss of information that would be useful for discrimination. Another, more detailed way to view a time-varying nonstationary signal is to treat it as a concatenation of quasi-stationary segments [16]. Each segment can be represented by a feature vector obtained from the corresponding time interval. Thus the entire signal is represented by a sequence of feature vectors that constitute a spatio-temporal pattern. A popular way of representing such signals is by a two-dimensional feature-time plot which shows the sequence of feature vectors obtained by extracting attributes from consecutive segments. The number of segments representing a signal is not necessarily pre-determined. Segmentation of nonstationary signals is a difficult problem and has been extensively studied [9].

Any neural network applied to the signal detection task must also address the following issues.

(i) Cueing: Generally, any network that performs a recognition task must be informed of the start and end of the pattern to be recognized within the input stream. If this is so, the classifier is called a cued classifier. An uncued classifier, on the other hand, is fed a continuous stream of data from which it must itself figure out when a pattern of interest begins and when it ends.

(ii) Spatio-temporal Warping: Often the original pattern gets distorted in temporal and spatial domains as it travels through the environment. A robust classifier must be insensitive to these distortions. If the classifier is not sophisticated in this aspect, then the distorted signal received needs to be preprocessed to annul the effects of the environment. This is called equalization and is widely used in communication systems. The characteristics of the communication or signal transmission medium are “equalized” by the preprocessor, so that the output of this processor is similar to the original signal. Alternatively, mechanisms such as the dynamic time warping algorithm [19] can be built into the classifier to make it less susceptible to the distortions.

Since a general purpose cued classifier must detect the presence of a signal as well as classify it, measures of quality include not only classification accuracy, but false alarm rate, the number of times the system indicates presence of a signal when there is none, and missed detection rate, the number of real signals that are present but not detected.


Also important are the computational power required and confidence in classification decisions when they are made [12].

Another problem inherent in spatio-temporal classification, that is not present in static classifiers, is when to decide to classify a signal. With a static classifier, after each pattern is presented some decision must be made, but in spatio-temporal classification the network is presented sequentially with some number of subpatterns. The network must determine at what point it has enough information to classify a signal. One method of making this decision is to classify a signal only when a particular class membership is indicated over several consecutive subpatterns.

In this paper, a signal or pattern refers to a sequence of feature vectors. Each component feature vector is referred to as a sample or subpattern.

Banzhaf signals, consisting of 30-dimensional feature vectors, with sequence length typically between 30 and 50, and including the effects of spatio-temporal warping, are used to evaluate the habituation networks. These signals are described in the next section. Also note that the proposed classifier is uncued and has to tackle both detection and classification. Therefore, it involves selection of thresholds to yield a range of missed detection rate vs. confidence in classification tradeoffs, as discussed in Section 3.3. Background noise is itself considered to be a signal class, so that false alarms are a subset of incorrectly classified signals.

3.2. Banzhaf sonograms

This section describes Banzhaf sonograms [2], which were selected to obtain comparative experimental results on high dimensional spatio-temporal classification. Banzhaf sonograms were used for several reasons. First, they are easily reproduced by other researchers. Secondly, they can be used to vary the difficulty of the data sets, the dependence of performance on temporal information, and the amount of warping in both time and space. Finally, it is known that superpositions of gaussian functions can be used to model any continuous spectrogram arbitrarily well, and several interesting spectrograms (sonar, radar) are well modeled using small numbers of gaussian components.

A Banzhaf mother signal is generated by superposing two dimensional gaussian functions. The pth component gaussian, G_p(x, t), is completely described by the constant parameters x_0[p], t_0[p], θ[p], h_x[p], h_t[p], λ_x[p], λ_t[p]. The pair (x_0[p], t_0[p]) is the coordinates in time and space of the center of the gaussian, θ[p] is the angle of axial tilt, h_x[p] and h_t[p] determine the height of the peak, and λ_x[p] and λ_t[p] determine the width of the gaussian in space and time, respectively. Values for each component gaussian, G_p(x, t), are determined for integer values of x and t such that 0 ≤ x ≤ 29 and 0 ≤ t < 40. For each value of x and t, the computation is performed in the following manner. First the following transformation is applied to rotate the gaussian by an angle of θ[p] about its center point, (x_0[p], t_0[p]):

x* = x_0[p] + (x − x_0[p]) cos θ[p] − (t − t_0[p]) sin θ[p],
t* = t_0[p] + (x − x_0[p]) sin θ[p] + (t − t_0[p]) cos θ[p]. (14)

Next the value G_p(x, t) is calculated as follows:

G_p(x, t) = h_x[p] exp[−((x* − x_0[p])/λ_x[p])²] h_t[p] exp[−((t* − t_0[p])/λ_t[p])²]. (15)

The Banzhaf mother signal, G(x, t), is then computed by summing the component gaussians. In our experiment we generated six different mother signals using six different sets of parameter values. Each mother signal serves as the prototypical member of a class. Fig. 3 shows the set of mother signals for each class. Table 1 lists the parameter values for the mother signals. The values for θ[p] are given in radians. Classes B and E have two component gaussians. All other classes have three. Training and test examples of each class were generated by adding uniformly distributed noise in the range [−0.1, 0.1] to G(x, t) and also perturbing the parameters in order to rotate, scale, and warp the mother signals to some extent. When the signals were time warped, the number of time samples calculated was varied accordingly, in order to create signals with a variable length in time. The number of spatial samples, however, was held constant at 30. A seventh "noise only" class was also constructed. The training and test sets were made up of 100 examples of each signal class and 100 examples from the "noise only" class. An additional validation data set was also used, with 20 examples of each class. Once the data sets were compiled, they were normalized so that all data values were in the range [0, 1]. Fig. 4 gives typical examples of each signal class after rotating, scaling, and warping. The data set illustrated in Figs. 3 and 4 is designated as data set 1 (DS1).
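The generation procedure can be sketched as follows in Python/NumPy. This is only illustrative: the rotation convention assumed for Eq. (14), the component parameters, and the clipping used in place of the paper's normalization are all assumptions made here, not values taken from Table 1.

```python
import numpy as np

def banzhaf_mother(components, nx=30, nt=40):
    """Superpose rotated two-dimensional gaussian components (Eqs. (14)-(15)).

    components : list of dicts with keys x0, t0, theta, hx, ht, lx, lt.
    Returns a (nt, nx) array G(x, t) sampled on the integer grid 0..nx-1, 0..nt-1.
    """
    x, t = np.meshgrid(np.arange(nx), np.arange(nt))        # x: space, t: time
    G = np.zeros((nt, nx))
    for p in components:
        dx, dt = x - p["x0"], t - p["t0"]
        c, s = np.cos(p["theta"]), np.sin(p["theta"])
        xr, tr = c * dx - s * dt, s * dx + c * dt           # assumed rotation (Eq. (14))
        G += (p["hx"] * np.exp(-(xr / p["lx"]) ** 2)
              * p["ht"] * np.exp(-(tr / p["lt"]) ** 2))     # Eq. (15)
    return G

def make_example(mother, rng, noise=0.1):
    """Add uniform noise in [-noise, noise]; warping and rescaling are omitted here,
    and clipping to [0, 1] stands in for the normalization described in the text."""
    return np.clip(mother + rng.uniform(-noise, noise, mother.shape), 0.0, 1.0)

rng = np.random.default_rng(4)
components = [dict(x0=7.5, t0=7.5, theta=-0.785, hx=1.0, ht=1.0, lx=2.0, lt=7.6),
              dict(x0=15.0, t0=27.5, theta=0.0, hx=1.0, ht=1.0, lx=2.0, lt=9.2)]
example = make_example(banzhaf_mother(components), rng)
print(example.shape)
```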

Fig. 3. Banzhaf mother signals for classes A-F (clockwise from the top left) of data set 1. The vertical and horizontal axes depict time (increasing top to bottom) and frequency (increasing left to right), respectively.

Table 1
Parameters of mother signals of data set 1

Parameter   Class A   Class B   Class C   Class D   Class E   Class F
x_0[1]        1.5       1.5      15.0       7.5      15.0      15.0
t_0[1]        1.5       1.5       1.5       1.5       1.5       1.5
θ[1]        −0.785    −0.785      0.0     −0.785      0.0       0.0
h_x[1]        1.0       1.0       1.0       1.0       1.0       1.0
h_t[1]        1.0       1.0       1.0       1.0       1.0       1.0
λ_x[1]        2.0       2.0       2.0       2.0       2.0       2.0
λ_t[1]        7.6       7.6       6.6       7.6       6.6       6.6
x_0[2]       15.0      15.0      15.0      15.0      15.0      15.0
t_0[2]       27.5      21.5      27.5      21.5      27.5      27.5
θ[2]          0.0       0.0       0.0       0.0       0.0       0.0
h_x[2]        1.0       1.0       1.0       1.0       1.0       1.0
h_t[2]        1.0       1.0       1.0       1.0       1.0       1.0
λ_x[2]        2.0       2.0       2.0       2.0       2.0       2.0
λ_t[2]        9.2       9.2       9.2       9.2       9.2       9.2
x_0[3]        1.5        -        7.5      22.5        -       22.5
t_0[3]       27.5        -       21.5      21.5        -       21.5
θ[3]          0.0        -        0.0       0.0        -        0.0
h_x[3]        1.0        -        1.0       1.0        -        1.0
h_t[3]        1.0        -        1.0       1.0        -        1.0
λ_x[3]        5.94       -        5.94      5.94       -        5.94
λ_t[3]        2.0        -        2.0       2.0        -        2.0

Fig. 4. Sample signals from data set 1.

Clearly, classification of DS1 is a problem which requires relatively long term temporal information. It is impossible to uniquely classify any signal based on only a short temporal window of inputs. For example, consider the mother signals of classes A, B, and C as illustrated in Fig. 3. The signals in classes A and B are identical for the first 20 time samples, while classes A and C are identical for the last 20 time samples. Additionally, there is no time window of less than ten samples in any of the three signals that is not identical to a time window in one of the other two signals. This classification problem is obviously difficult for short time window TDNNs (no hidden unit delays) unless additional preprocessing of the inputs is performed. In order to demonstrate the effectiveness of habituation for problems in which short term temporal information is more important, we have constructed another data set which does not depend as severely on long term temporal information. Data set 2 (DS2) was generated using the same parameters as DS1 except that the centers of the component gaussians were shifted to reduce the overlap among the classes. Individual component gaussians were shifted so that a particular gaussian component might act as a tag for identifying the class membership of the signals. One would expect TDNNs to perform relatively better on DS2 than on DS1.

3.3. Training methodology and classification heuristic

In this section we discuss the methods which were used to train each of the networks used in the experiments. The specific details of each network design are also presented, as well as the heuristic used to detect and classify signals from the network outputs. Four different network designs were investigated: habituated MLPs, cascade habituated MLPs, TDNNs, and focused gamma networks. Most of the details of the habituation based designs have already been presented in Section 2.1, with one exception. From Eq. (3) it can be easily shown that the minimum possible output value of a habituation unit, W_k(t), is α_k/(1 + α_k). Similarly, the maximum achievable value is 1. It seems reasonable that the feedforward network performance might be improved if the dynamic ranges of all its inputs are the same. In order to achieve this symmetry, the habituation units shown in Figs. 1 and 2 are modified so that their output range is [0, 1]. The units are modified by a scale and an offset so that their outputs at time t are (α_k + 1)W_k(t + 1) − α_k instead of W_k(t + 1). This slight modification does not affect any of the mathematical results in Section 2.2.
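A quick numerical check of this rescaling (illustrative Python; α = 0.5, τ = 0.2 are arbitrary admissible values): driving a unit with the maximal input I(t) = 1 sends W toward its minimum α/(1 + α), and the scaled output (α + 1)W − α correspondingly approaches 0.

```python
alpha, tau = 0.5, 0.2
W = 1.0
for _ in range(500):
    W = W + tau * (alpha * (1.0 - W) - W * 1.0)   # Eq. (3) with constant input I(t) = 1

print(W, alpha / (1.0 + alpha))                   # W settles at its minimum alpha/(1+alpha)
print((alpha + 1.0) * W - alpha)                  # the rescaled output approaches 0
```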

The TDNN architecture investigated is a simple tapped delay line followed by an MLP. This is a particularly simple, but limited, architecture. It is limited in that the number of inputs to the feedforward stage is the product of the memory depth and the number of input channels. For problems in which long term memory is required, such a network may need to be extremely complex in order to solve the problem. For such problems structurally different TDNNs are commonly used, such as TDNNs with hidden unit delays or additional preprocessing before the tapped delay line stage. In this paper, our primary focus is on simple two-stage networks, in which a simple temporal encoding stage is followed by a feedforward neural network. Although TDNNs with hidden unit delays as well as other more complex recurrent network schemes have been proposed [10,18,35,37], they are difficult to analyze mathematically, relatively difficult to train, and often have complex stability requirements. Additional preprocessing schemes may improve TDNN performance, but then they can also be useful when used with the other network schemes. Since we are interested in comparing the various networks rather than applying them to a specific problem, we have generally avoided such issues for the sake of a concise experimental format. However, due to the extreme simplicity of the TDNN architecture examined, we also present results for TDNNs in which the input sample rate is varied. For a particular subsampling rate, s, each group of s consecutive feature vectors in the original data set is averaged together to produce a single subsampled feature vector. Another network used for comparison purposes is the focused gamma network. Like the previously mentioned TDNN architecture, the focused gamma network consists of a simple temporal encoding stage followed by an MLP. Unlike the TDNN, the gamma network has infinite memory. This property makes it more useful for comparison purposes, since the habituation based networks are also infinite memory networks. Time delay neural networks were used as well because their performance more clearly illustrates the required memory depth of each data set. The set of functions in the temporal encoding stage of a focused gamma network, B_γ, is defined as follows for a particular real number μ. For all x ∈ X and t ∈ Z+,

(b_0 x)(t) = x(t). (16)

For j > 0 and t = 0, (b_j x)(t) = 0. For j > 0 and t > 0,

(b_j x)(t) = (1 − μ)(b_j x)(t − 1) + μ(b_{j−1} x)(t − 1). (17)

B_γ = {b_j : j ∈ Z+}. (18)

Like the cascade habituated network, the focused gamma network is a universal approximator even for a single fixed parameter value (μ). In all structures examined in this paper, the feedforward network was a two-layer MLP with 10 hidden units. Increasing the number of hidden units seemed to have little effect on performance.
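For comparison, the focused gamma memory of Eqs. (16)-(18) is equally simple to state in code (illustrative Python/NumPy; μ = 0.2 matches the DS1 setting reported below, the input is arbitrary):

```python
import numpy as np

def gamma_memory(x, m, mu):
    """Focused gamma memory taps (Eqs. (16)-(18)) for a scalar input sequence x.

    Tap 0 is the input itself; tap j > 0 follows
    (b_j x)(t) = (1 - mu) (b_j x)(t-1) + mu (b_{j-1} x)(t-1), with (b_j x)(0) = 0.
    Returns an array of shape (len(x), m + 1) holding taps b_0, ..., b_m.
    """
    T = len(x)
    B = np.zeros((T, m + 1))
    B[:, 0] = x                                                    # Eq. (16)
    for t in range(1, T):
        B[t, 1:] = (1.0 - mu) * B[t - 1, 1:] + mu * B[t - 1, :-1]  # Eq. (17)
    return B

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=100)
taps = gamma_memory(x, m=4, mu=0.2)
print(taps.shape)        # these taps, computed per input channel, feed the MLP stage
```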

Classification and detection of signals is accomplished using two thresholds, H and L. Detection occurs whenever the largest network output value, O_max, is greater than H, and all other output values are less than 1 − H, for L consecutive input presentations. Classification is considered to be correct for a given signal if that signal is detected at least once and the only class detected within the length of the signal is the desired class. The best values of H and L may vary from network to network. In order to ensure a fair comparison, for each network we select the values of H and L which yield the best overall classification rate (CR) on the validation set. The CR is the percentage of signals which are detected at least once and classified correctly every time they are detected. Because background noise is identified as class 7, false alarms occur whenever signals of class 7 are misclassified. Therefore the false alarm rate is accounted for in the overall classification rate. After training, the CR on the test set is determined using the selected L and H values and reported.
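The detection heuristic can be sketched as follows (illustrative Python/NumPy). The paper does not state what happens to the run counter after a detection; resetting it here is an assumption, as are the toy outputs.

```python
import numpy as np

def detect_and_classify(outputs, H, L):
    """Scan network outputs over time and report detections using thresholds H and L.

    outputs : (T, n_classes) array of outputs for consecutive input presentations.
    A class c is detected when outputs[t, c] > H and every other output is below
    1 - H for L consecutive presentations.  Returns a list of (time, class) pairs.
    """
    run_class, run_len, detections = None, 0, []
    for t in range(outputs.shape[0]):
        c = int(np.argmax(outputs[t]))
        others_low = np.all(np.delete(outputs[t], c) < 1.0 - H)
        if outputs[t, c] > H and others_low:
            run_len = run_len + 1 if c == run_class else 1
            run_class = c
            if run_len == L:
                detections.append((t, c))
                run_len = 0            # assumption: restart the count after a detection
        else:
            run_class, run_len = None, 0
    return detections

rng = np.random.default_rng(6)
outputs = rng.uniform(0.0, 0.15, size=(60, 7))   # background: all outputs low
outputs[20:32, 2] = 0.95                         # a stretch where class 2 dominates
print(detect_and_classify(outputs, H=0.8, L=5))
```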

The networks are trained in the following manner. First the sequence of feature vectors in the training data set is presented in order to the temporal encoding stage, and the resultant sequence of state vectors is computed. Secondly, the state vectors are randomly shuffled and used to train the MLP for a sufficiently long time (50 epochs). The cost function used during training on the gamma and habituation based networks is not the standard MSE approach, but rather a method we refer to as Distinctive Subset Mean Square Error (DSMSE), which is more appropriate for classifying sequences. When the MSE approach is used, a large percentage of the overall error value comes from classification mistakes made near the beginning of each signal. At this part of the signal, the classifier simply does not have enough information to make an intelligent decision, especially in the case of DS1, in which the first few feature vectors of signals from different classes are often identical. Since the mistakes at the beginning of the signal are unavoidable, it makes little sense to include them in the cost function. Their inclusion can lead to overtraining based on coincidental information in the training set. Interference from such error values may also diminish the relative importance assigned by the network to potentially useful information in the latter portion of the signal. One method previously considered for eliminating this problem is the reduced training set (RTS) approach, in which state vectors corresponding to the early portions of each signal are not used to train the feedforward network [31]. Unfortunately, when training an uncued classifier, this method can lead to an increased rate of false classification near the beginning of each signal. In the DSMSE approach, we train the network on all the state vectors, but for vectors near the beginning of a signal we do not train the output corresponding to the correct classification. We effectively assign the desired value of that output to be "don't care". This makes the problem the network is asked to solve easier, in that it does not have to make difficult distinctions at the beginning of a signal, but still seeks to minimize the number of incorrect classifications throughout the signal. In order to quantify what is meant by "beginning" and "end" of the signal, we make use of the term distinctive subset size (DSS). The DSS is the number of consecutive samples at the end of a signal for which the error corresponding to the correct classification output is used in training. The DSMSE and RTS approaches both make the assumption that the network remembers all necessary information about a signal at the end of that signal. For the TDNN architectures considered this assumption is clearly not true, so a simple MSE cost function is used for TDNN networks. Fig. 5 illustrates the effect obtained on the test set performance of a cascade habituated network on DS1 by varying DSS. The particular network parameters used in the experiment are α = 0.5, τ = 0.2, and m = 5. The performance is mildly affected by variation in DSS for a relatively large set of values, but when DSS is set to 50 (effectively the same as a simple MSE cost function) the decrease in performance is substantial. For the rest of the experiments DSS = 15 is used for all networks except TDNNs. During training the MLP weights are saved after each epoch. After the final epoch, each set of stored MLP weights is then applied to the validation set. The set of weights which yields the lowest DSMSE (MSE for TDNN) on the validation set is then selected. Its classification performance is determined on the validation set using each set of L and H values (H = 0.0, 0.1, 0.2, ..., 0.8; L = 1, 2, 3, ..., 15). The L and H values which yield the highest overall classification rate are selected for use on the test set. The overall classification rate (CR) on the test set is then reported.
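A sketch of the DSMSE cost is given below (illustrative Python/NumPy; how the paper normalizes the masked error is not specified, so dividing by the number of unmasked terms is an assumption):

```python
import numpy as np

def dsmse(outputs, targets, signal_ids, dss):
    """Distinctive Subset Mean Square Error.

    outputs, targets : (N, n_classes) arrays, ordered in time within each signal.
    signal_ids       : (N,) array saying which signal each state vector belongs to.
    dss              : only the last `dss` samples of a signal contribute error on the
                       correct-class output; the other outputs always contribute, so
                       false indications early in a signal are still penalized.
    """
    mask = np.ones_like(targets, dtype=float)
    for sid in np.unique(signal_ids):
        idx = np.where(signal_ids == sid)[0]
        early = idx[: max(len(idx) - dss, 0)]
        correct = int(np.argmax(targets[idx[0]]))
        mask[early, correct] = 0.0                 # "don't care" early in the signal
    err = mask * (outputs - targets) ** 2
    return err.sum() / mask.sum()

# Toy usage: two signals of length 40 (classes 0 and 1) with seven output classes.
rng = np.random.default_rng(7)
targets = np.zeros((80, 7))
targets[:40, 0], targets[40:, 1] = 1.0, 1.0
signal_ids = np.repeat([0, 1], 40)
outputs = rng.uniform(0.0, 1.0, size=(80, 7))
print(dsmse(outputs, targets, signal_ids, dss=15))
```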

3.4. Comparisons among networks

Fig. 5. Effect of distinctive subset size on classification performance (test set results only).

In this section we discuss the performance results obtained for each architecture on data sets 1 and 2 (test set results only). In Figs. 6 and 7 results are listed for all four networks on data sets 1 and 2, respectively. (The abbreviations FGN, CHMLP, and HMLP stand for focused gamma network, cascade habituated MLP, and habituated MLP, respectively.) The results listed include the best results obtained for each architecture (original sampling rate). The x-axis is m, the number of state inputs to the feedforward network for each input channel. Results for m = 1 for the TDNN and the focused gamma network are not included, since this is the degenerate case of a static MLP and therefore is not expected to do particularly well on either data set. The figures illustrate performance versus network complexity, since the number of weights and biases in the feedforward network is 310m + 77. For the focused gamma networks used on DS1 and DS2, values of μ = 0.2 and μ = 0.3 were used, respectively. The cascade habituated network parameters are {α, τ} = {0.5, 0.2} and {α, τ} = {0.5, 0.4}, respectively, for DS1 and DS2. The habituated network parameters α_k and τ_k were assigned randomly on the intervals (0.05, 0.7) and (0.05, 0.3), respectively. These ranges were chosen to include most of the permissible α and τ values, with the exception of extremely small values which correspond to extremely slow variations over time.

Fig. 6. Performance of various networks on data set 1 (test set results only).

Fig. 7. Performance of various networks on data set 2 (test set results only).

From the results in Fig. 6 it is apparent that for large values of m the focused gamma network outperforms all other networks on DS1. The best CR obtained for the gamma network was slightly higher (4%) than that of the next best network, the cascade habituated network. However, for simpler networks (m ≤ 4) the habituation based networks (particularly the cascade habituated design) perform better. This seems to indicate that a small number of habituation units can be used to encode relatively complex temporal information. Although it does perform best for the simplest network structure (m = 1), the habituated network does not do as well on average as either the focused gamma or the cascade habituated network. This result is probably explained by the fact that its parameters are assigned randomly. (Each data point is actually the average of four trials, in order to eliminate some of the inherent randomness in the results.) The small number of parameters required by the focused gamma network and the cascade habituated network were optimized by retraining the feedforward network for a set of possible parameter values and choosing the peak performance. Even this amount of optimization is not theoretically required for these two networks, because they are both universal approximators even when their parameter values are fixed. However, in order to ensure the same mathematical capabilities, the habituated network requires multiple habituation units for each input channel with different values for α_k and τ_k. This large number of parameters precludes the use of the simple optimization technique used for the other two networks. In Section 3.5 we discuss an optimization technique based upon the real time recurrent learning algorithm which can be used for some problems. However, for reasons detailed in Section 3.5 this method is poorly suited for the particular classification problems discussed in this paper. Since the TDNN structure considered has exactly finite short term memory, it does not perform well at all on DS1. This result indicates that long term temporal information is necessary for classifying DS1.

Because of the added local time information, all of the networks performed better on DS2. In particular, because DS2 depends on short term temporal information, the TDNNs with relatively large values of m (depth) did quite well. Once again the habituation based designs did better than the other networks for small values of m ($m \le 2$). However, simpler gamma networks performed somewhat better on DS2 than on DS1. The focused gamma network once again yielded the best overall result, by 4% over the cascade habituated network.

Since the TDNN architecture considered is a particularly simple one, for the sake of fairness we briefly discuss the increase in TDNN performance which may be obtained by subsampling. Figs. 8 and 9 illustrate how varying the sampling rate (SR) of the input data can affect TDNN performance on data sets 1 and 2, respectively. The performance of the CHMLP with SR = 1 is also included in the figures for comparison. Each consecutive series of SR feature vectors in the original data is averaged together to produce a single subsampled feature vector. The fact that subsampling greatly improved the performance of the TDNN architecture illustrates its relatively wide applicability. However, simple subsampling is only useful when information on a single time scale is sufficient to solve the problem. For more complex problems, the inputs might be further preprocessed by subsampling at a variety of rates and then combining the subsampled feature vectors. Such


Fig. 8. Effect of sample rate on TDNN performance on data set 1 (test set results only).

Fig. 9. Effect of sample rate on TDNN performance on data set 2 (test set results only).

a procedure may prove complicated, however, since it involves a large number of issues to be considered (e.g., the number of sampling rates, the value of each individual sampling rate, the depth of the tapped delay line applied to each subsampled input sequence, etc.). Additionally, issues involving redundancy in the information must be considered. For


example, if the state vector includes the original feature vector at time instants t and t - 1, it is not necessary to also include the average of the two. For problems which cannot be solved by simpler TDNNs, it is often easier to make use of a focused gamma or cascade habituated network, which can also encode information about a variety of time scales and requires only one or two parameters to be optimized. Additionally, subsampling is in no respect limited to TDNNs, and could be used to good effect by any of the other networks discussed in this paper.
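As a concrete illustration of the averaging-based subsampling just described, the following minimal Python sketch (our own illustration; the array and function names are not from the original experiments) replaces each consecutive block of SR feature vectors with their mean.

```python
import numpy as np

def subsample_by_averaging(features, sr):
    """Average each consecutive block of `sr` feature vectors.

    features : array of shape (T, d) -- T feature vectors of dimension d
    sr       : subsampling rate (block length)
    Returns an array of shape (T // sr, d); any trailing partial block is dropped.
    """
    T, d = features.shape
    n_blocks = T // sr
    trimmed = features[: n_blocks * sr]
    return trimmed.reshape(n_blocks, sr, d).mean(axis=1)

# Example: 100 feature vectors of dimension 30, subsampled at SR = 4.
x = np.random.rand(100, 30)
y = subsample_by_averaging(x, 4)   # shape (25, 30)
```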

3.5. Network parameter optimization

For the cascade habituated network, further study was done to examine the effect of varying $\alpha$ and $\tau$ on test set performance on DS1 and DS2. Figs. 10 and 11 show the effects of varying $\alpha$ on the performance on data sets 1 and 2, respectively. Figs. 12 and 13 show the effects of varying $\tau$ on the performance on data sets 1 and 2, respectively. Similar optimization is performed on the $\mu$ parameter for the focused gamma network. The results on DS1 and DS2 are illustrated in Figs. 14 and 15, respectively. The simple method of retraining the feedforward network for a variety of parameter values is sufficient for solving a wide variety of problems using a focused gamma network or a cascade habituated network, because both of these networks are universal approximators even for fixed values of their parameters. By varying the two parameters $\alpha$ and $\tau$ in the cascade habituated network or the single $\mu$ parameter in the focused gamma network, one can optimize the efficiency of the encoding in order to ensure

Fig. 10. Effect of $\alpha$ on cascade habituated MLP performance on data set 1 (test set results only, $\tau = 0.2$, $m = 5$).


Fig. 11. Effect of $\alpha$ on cascade habituated MLP performance on data set 2 (test set results only, $\tau = 0.4$, $m = 5$).

Fig. 12. Effect of $\tau$ on cascade habituated MLP performance on data set 1 (test set results only, $\alpha = 0.5$, $m = 5$).


Fig. 13. Effect of $\tau$ on cascade habituated MLP performance on data set 2 (test set results only, $\alpha = 0.5$, $m = 5$).

Fig. 14. Effect of $\mu$ on focused gamma network performance on data set 1 (test set results only, $m = 6$).


Fig. 15. Effect of $\mu$ on focused gamma network performance on data set 2 (test set results only, $m = 6$).

a simple approximation structure for a wide class of problems. For DS1, the performance of the cascade habituated network is robust with regard to changes in the $\alpha$ parameter: there is only a 4% variation in classification rate over the range $\alpha = 0.5$ to $0.8$. On DS1 the value of $\tau$ is somewhat more important. Since $\tau$ controls the rate at which habituation occurs, it seems that for DS1 information from a particular time scale is important. Performance on DS2, however, is robust with respect to both habituation parameters. This suggests that DS2 contains information useful to the networks in question on a variety of time scales. This conclusion is borne out somewhat by the fact that the gamma network is also more robust with respect to $\mu$ on DS2 than on DS1. Both the focused gamma network and the cascade habituated network seem to be capable of encoding information from a range of time scales.
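The parameter optimization used above for the cascade habituated and focused gamma networks amounts to a one- or two-dimensional grid search in which the feedforward stage is retrained for each candidate setting and the best classification rate is kept. The sketch below outlines that loop; `encode_features`, `train_mlp`, and `classification_rate` are hypothetical stand-ins for the temporal encoding, MLP training, and evaluation routines, which the paper does not specify in code form.

```python
import itertools

def grid_search_encoder_params(train_data, val_data, alpha_grid, tau_grid,
                               encode_features, train_mlp, classification_rate):
    """Retrain the feedforward stage for each (alpha, tau) pair and keep the best.

    For a focused gamma network the same loop would run over a single mu grid.
    """
    best = (None, -1.0)  # ((alpha, tau), classification rate)
    for alpha, tau in itertools.product(alpha_grid, tau_grid):
        # Re-encode the raw signals into state vectors with the candidate parameters.
        train_states = encode_features(train_data, alpha, tau)
        val_states = encode_features(val_data, alpha, tau)
        # Retrain the feedforward network from scratch for this setting.
        mlp = train_mlp(train_states)
        cr = classification_rate(mlp, val_states)
        if cr > best[1]:
            best = ((alpha, tau), cr)
    return best

# e.g. grid_search_encoder_params(train, val,
#          alpha_grid=[0.3, 0.5, 0.7], tau_grid=[0.1, 0.2, 0.4], ...)
```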

Unlike the cascade habituated network, the habituated network memory parameters were not optimized; there are simply too many memory parameters to optimize in the straightforward manner used for the gamma and cascade habituated networks. However, for some problems a version of the real time recurrent learning (RTRL) procedure used for training recurrent networks may be applicable. RTRL is an extension of backpropagation to dynamic networks [38]. In backpropagation, the approximate derivative of the error with respect to each parameter is calculated using the chain rule; RTRL extends this calculation to parameters whose influence propagates through time. In the case of a habituated network, the partial derivative of the error at the output of the feedforward network with respect to its state inputs is calculated as in backpropagation. The partial derivatives


of each state value with respect to $\alpha_k$ and $\tau_k$ can then be computed using the following equations:

$$
\frac{\partial w_k(t+1)}{\partial \alpha_k} = \frac{\partial w_k(t)}{\partial \alpha_k}\,\bigl(1 - \alpha_k\tau_k - \tau_k I(t+1)\bigr) - \tau_k\bigl(w_k(t) - w_k(0)\bigr), \qquad (19)
$$

$$
\frac{\partial w_k(t+1)}{\partial \tau_k} = \frac{\partial w_k(t)}{\partial \tau_k}\,\bigl(1 - \alpha_k\tau_k - \tau_k I(t+1)\bigr) + \frac{w_k(t+1) - w_k(t)}{\tau_k}. \qquad (20)
$$

By making use of the same update rule used in standard backpropagation one can then update $\alpha_k$ and $\tau_k$. There is an important difference, however. Since Eqs. (19) and (20) are iterative, the state vectors must be presented to the feedforward network in temporal order during training. The training method used when $\alpha$ and $\tau$ are not varied is to preprocess the entire training data set to determine the feature vectors to be presented to the feedforward network. These feature vectors are then shuffled and used to train the network. If the $\alpha$ and $\tau$ parameters are modified using RTRL, this shuffling of feature vectors cannot take place. Without shuffling, the short time variability of the desired outputs is too low for the data sets investigated in this paper: the feedforward network trains on the same desired output patterns for 40 or more iterations. This leads to oscillatory behavior in the training of the feedforward stage, which in turn causes a reduction in performance greater than any gain made by adapting $\alpha_k$ and $\tau_k$. Therefore, $\alpha_k$ and $\tau_k$ were assigned randomly and not optimized in any of the experiments with DS1 and DS2. However, habituated networks with RTRL may be useful for other problems where the local time variability is greater (e.g., time series prediction, system identification, adaptive control).
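For completeness, a minimal sketch of how Eqs. (19) and (20) could be applied is given below. It assumes the habituation update $w_k(t+1) = w_k(t) + \alpha_k\tau_k(1 - w_k(t)) - \tau_k I(t+1) w_k(t)$ with $w_k(0) = 1$, and it treats the error gradient with respect to the state input (`dE_dw`) as given, since that term comes from ordinary backpropagation through the feedforward stage; the function and variable names are our own.

```python
def rtrl_habituation_step(w, dw_dalpha, dw_dtau, x_next, alpha, tau,
                          dE_dw, lr=0.01):
    """One RTRL step for a single habituation unit.

    w          : current state w_k(t)
    dw_dalpha  : running partial dw_k(t)/d alpha_k
    dw_dtau    : running partial dw_k(t)/d tau_k
    x_next     : next input sample I(t+1)
    dE_dw      : dE/dw_k(t+1), obtained by backpropagating through the MLP
    Returns the updated state, partial derivatives, and parameters.
    """
    w0 = 1.0                                     # initial (rest) value of the unit
    # State update (habituation recursion).
    w_next = w + alpha * tau * (1.0 - w) - tau * x_next * w
    # Eq. (19): sensitivity of the state to alpha_k.
    common = 1.0 - alpha * tau - tau * x_next
    dw_dalpha_next = dw_dalpha * common - tau * (w - w0)
    # Eq. (20): sensitivity of the state to tau_k.
    dw_dtau_next = dw_dtau * common + (w_next - w) / tau
    # Gradient-descent parameter update, as in standard backpropagation.
    alpha_new = alpha - lr * dE_dw * dw_dalpha_next
    tau_new = tau - lr * dE_dw * dw_dtau_next
    return w_next, dw_dalpha_next, dw_dtau_next, alpha_new, tau_new
```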

4. Conclusions

We have introduced two novel biologically inspired structures for modeling discrete time systems. Either structure is capable of approximating arbitrarily well any continuous, causal, time-invariant mapping on a uniformly bounded discrete-time input domain. These structures are examples of a particularly simple general architecture in which a temporal encoding stage is followed by a feedforward neural network. This general architecture includes TDNNs and the focused gamma network, which employ linear temporal encoding stages. In contrast, the novel habituation based networks discussed in this paper make use of nonlinear temporal encoding stages. This distinction is theoretically interesting because of recent results which suggest inherent limitations in linear memory structures. The usefulness of the models in question


is also demonstrated by their performance on the problem of classifying artificial Banzhaf sonograms. On a data set which is dependent on long term temporal information (DS1), the cascade habituated network performed similarly to the commonly used focused gamma network. Both habituation based architectures showed marked improvement over the other networks for the case in which the network complexity is small ($m \le 4$). On DS2, which was richer in short term temporal information, the cascade habituated networks once again performed similarly to the gamma network, and the simplest habituation based networks once again outperformed other networks of the same complexity. The results on both data sets suggest that habituation based designs are capable of utilizing a variety of relatively complex short term and long term temporal information, and that very simple versions of these networks are particularly efficient. A topic for further study is to investigate other input transformation mechanisms, since Theorem 1 is valid for any set of functions $B$ which is a complete memory.

Appendix A

A.1. Proof of Theorem 1

In order to prove the theorem we need the following lemma.

Lemma 1. Let $t_0$ be an arbitrarily large natural number. Then there exist natural numbers $p$ and $m$, real numbers $a_j$ and $c_{jk}$, and elements $b_{jk}$ of $B$, such that the following inequality holds:

$$
\left| (fx)(t_0) - \sum_{j=1}^{p} a_j \exp\!\left( \sum_{k=1}^{m} c_{jk}\,(b_{jk}x)(t_0) \right) \right| < \varepsilon. \qquad (21)
$$

Proof. Let $g$ be a mapping from $[0,1]^{t_0}$ to $\mathbb{R}$ defined as follows:

$$
g(x) = (fI_x)(t_0), \qquad (22)
$$

where $I_x$ is an element of $X$ such that $I_x(i) = x_i$, the $i$th component of $x$, for all $i \in \{1, 2, \ldots, t_0\}$. Observe that $g$ is a continuous function on a compact metric space. It is continuous because $f$ is continuous, and it is a function because $f$ is causal and $I_x(t)$ is zero for all $t \le 0$. Similarly, we define $S$ as the set of all functions from $[0,1]^{t_0}$ to $\mathbb{R}$ of the form $b^{*}(x) = (bI_x)(t_0)$ for each $b \in B$. Now we define $K$ to be the set of all functions from $[0,1]^{t_0}$ to $\mathbb{R}$ of the following form:

$$
s(x) = \sum_{j=1}^{p} a_j \exp\!\left( \sum_{k=1}^{m} c_{jk}\, b^{*}_{jk}(x) \right), \quad b^{*}_{jk} \in S. \qquad (23)
$$

Since g is a continuous real-valued function on a compact metric space, by the Stone-Weierstrass theorem we can state the following. If K is an algebra that


separates the points of $[0,1]^{t_0}$, and does not vanish on $[0,1]^{t_0}$, then there exists an $s \in K$ such that $|g(x) - s(x)| < \varepsilon$ for all $x \in [0,1]^{t_0}$. We now show that $K$ has the three required properties. First, clearly, $K$ does not vanish because $\exp(y)$ is nonzero for any real value $y$. Secondly, it can be readily shown that if $a$ and $b$ are elements of $K$ then the pointwise product $ab \in K$, $(a + b) \in K$, and $\gamma a \in K$ for any $\gamma \in \mathbb{R}$. Therefore $K$ is an algebra. All that remains to complete the requirements of the Stone-Weierstrass Theorem is that $K$ separates the points of $[0,1]^{t_0}$. Let $x$ and $y$ be elements of $[0,1]^{t_0}$ such that $x \ne y$. If for any such $x$ and $y$ there exists some $s \in K$ such that $s(x) \ne s(y)$, then $K$ separates the points of $[0,1]^{t_0}$. The fact that $x$ is not equal to $y$ implies that there exists some integer $k$ such that $1 \le k \le t_0$ and $I_x(k) \ne I_y(k)$. Therefore, by the second property of $B$ given in Theorem 1, there exists some $b \in B$ such that $bI_x(t_0) \ne bI_y(t_0)$. By the definition of $S$, therefore, there exists a $b^{*} \in S$ such that $b^{*}(x) \ne b^{*}(y)$. Since the exponential function is strictly monotonic, $\exp(b^{*}(x)) \ne \exp(b^{*}(y))$. Since $\exp(b^{*}) \in K$, $K$ separates the points of $[0,1]^{t_0}$. By the Stone-Weierstrass Theorem, there exist real numbers $a_j$ and $c_{jk}$, natural numbers $p$ and $m$, and elements $b^{*}_{jk}$ of $S$, such that the following inequality is true for all $x \in [0,1]^{t_0}$:

$$
\left| g(x) - \sum_{j=1}^{p} a_j \exp\!\left( \sum_{k=1}^{m} c_{jk}\, b^{*}_{jk}(x) \right) \right| < \varepsilon. \qquad (24)
$$

We now make a couple of final observations to complete the proof. First recall that $g(x) = (fI_x)(t_0)$ and $b^{*}_{jk}(x) = (b_{jk}I_x)(t_0)$. Finally, observe that because of the causality of $f$ and $b_{jk}$, for each $I \in X$ there is an $x \in [0,1]^{t_0}$ such that $(fI_x)(t_0) = (fI)(t_0)$ and $(b_{jk}I_x)(t_0) = (b_{jk}I)(t_0)$. This completes the proof of the lemma.

From Lemma 1 and the fact that $T_\beta I \in X$ for all positive values of $\beta$, the following inequality is true for all $\beta > 0$ and for all $I \in X$:

$$
\left| (fT_\beta I)(t_0) - \sum_{j=1}^{p} a_j \exp\!\left( \sum_{k=1}^{m} c_{jk}\,(b_{jk}T_\beta I)(t_0) \right) \right| < \varepsilon. \qquad (25)
$$

Due to the time-invariance of $f$, $(fT_\beta I)(t_0) = (fI)(t_0 - \beta)$ for $\beta < t_0$. By the third property of $B$ in Theorem 1, $(b_{jk}T_\beta I)(t_0) = (b_{jk}I)(t_0 - \beta)$ for $\beta < t_0$. From these two observations it is apparent that the following inequality is true for all $\beta$ such that $0 \le \beta < t_0$ and for all $I \in X$:

$$
\left| (fI)(t_0 - \beta) - \sum_{j=1}^{p} a_j \exp\!\left( \sum_{k=1}^{m} c_{jk}\,(b_{jk}I)(t_0 - \beta) \right) \right| < \varepsilon. \qquad (26)
$$

Let $t = t_0 - \beta$ and the proof of the theorem is complete.
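To make the form of the approximant in Eqs. (21)-(26) concrete, the following sketch (our own illustration, not code from the paper) evaluates $\sum_{j=1}^{p} a_j \exp(\sum_{k=1}^{m} c_{jk}\,(b_{jk}x)(t))$ given the outputs of the memory operators $b_{jk}$ at time $t$; any complete memory could supply those outputs.

```python
import numpy as np

def exp_readout(memory_outputs, a, c):
    """Evaluate sum_j a_j * exp( sum_k c_jk * (b_jk x)(t) ).

    memory_outputs : array of shape (p, m), entry (j, k) = (b_jk x)(t)
    a              : array of shape (p,)
    c              : array of shape (p, m)
    """
    return float(np.sum(a * np.exp(np.sum(c * memory_outputs, axis=1))))

# Example with p = 2 approximation terms and m = 3 memory operators per term.
rng = np.random.default_rng(0)
b_out = rng.random((2, 3))        # stands in for the values (b_jk x)(t)
a = rng.normal(size=2)
c = rng.normal(size=(2, 3))
print(exp_readout(b_out, a, c))
```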

A.2. Proof of Theorem 2


In order to show that $B$, the set of all habituation functions, is a complete memory, it is necessary to show that it meets the four properties detailed in Definition 1. First we show the first property: that there exist real numbers $a$ and $c$ such that


$bx(t) \in (a, c)$ for all $b \in B$, $x \in X$, and $t \in \mathbb{Z}^+$. It is sufficient to show that $bx(t) \in [0, 1]$. This can be readily proven by using mathematical induction and recalling the range of values $\alpha$ and $\tau$ can take. Since $b0 = 1$, $bx(0) = 1 - \tau x(0)$. Because $\tau$ and $x(0)$ are elements of $[0, 1]$, $bx(0) \in [0, 1]$. All that remains to be shown is that $bx(k) \in [0, 1]$ implies $bx(k + 1) \in [0, 1]$. Because of the range of $\tau$ and $\alpha$ values, the following inequalities hold:

$$
0 \le (1 - \tau x(k+1))\,bx(k) \le bx(k), \qquad (27)
$$

$$
0 \le \alpha\tau(1 - bx(k)) \le 1 - bx(k). \qquad (28)
$$

Since $bx(k+1) = (1 - \tau x(k+1))\,bx(k) + \alpha\tau(1 - bx(k))$, the following inequality also holds:

$$
0 \le bx(k+1) \le bx(k) + (1 - bx(k)) = 1. \qquad (29)
$$

So $B$ satisfies the first property of a complete memory. Next we show that $B$ satisfies the second property: for any $t \in \mathbb{Z}^+$ and any $t_0$ such that $0 < t_0 \le t$ the following is true. If $x, y \in X$ and $x(t_0) \ne y(t_0)$, then there exists $b \in B$ such that $bx(t) \ne by(t)$.

First it is necessary to show the following lemma.

Lemma 2. If $b \in B$ is a habituation function with habituation parameters $\alpha$ and $\tau$ as defined in Theorem 2, an equivalent definition for $b$ is the following:

$$
bx(t) = \alpha\tau + \alpha\tau \sum_{j=1}^{t} \prod_{h=j}^{t} (1 - \alpha\tau - \tau x(h)) + b0 \prod_{i=0}^{t} (1 - \alpha\tau - \tau x(i)). \qquad (30)
$$

The lemma is readily proven using mathematical induction. Let $x$ and $y$ be elements of $X$ such that $x(t_0) \ne y(t_0)$ for some $t_0 \le t$. This implies that there exists a natural number $i$ with the following properties: $0 \le i \le t$, there exists a $\delta > 0$ such that $|x(i) - y(i)| > \delta$, and $i < j \le t$ implies $x(j) = y(j)$. The number $i$ represents the latest time prior to $t$ at which $x$ and $y$ differ. Now we use $i$ to define a value $\beta$. If $i < t$, $\beta$ is defined as follows:

$$
\beta = \alpha\tau \prod_{h=i+1}^{t} (1 - \alpha\tau - \tau x(h)) + \alpha\tau \prod_{h=i+1}^{t} (1 - \alpha\tau - \tau y(h)). \qquad (31)
$$

If $i = t$, then $\beta = \alpha\tau$. Observe $\beta > 0$ because $\alpha\tau + \tau < 1$. Using Lemma 2 and some algebraic manipulation we can derive the following equation:

$$
bx(t) - by(t) = \beta \sum_{j=0}^{i} \left[ \prod_{h=j}^{i} (1 - \alpha\tau - \tau x(h)) - \prod_{h=j}^{i} (1 - \alpha\tau - \tau y(h)) \right]. \qquad (32)
$$


Since we have restricted $\tau$ and $\alpha$ to have positive values such that $\alpha\tau + \tau < 1$, we can make an important observation. The following inequalities hold for all $x \in X$ and $t \in \mathbb{Z}^+$:

$$
0 < (1 - \alpha\tau - \tau) \le (1 - \alpha\tau - \tau x(t)) \le (1 - \alpha\tau) < 1. \qquad (33)
$$

Now we can exhibit a lower bound on $|bx(t) - by(t)|$ in terms of $i$, $\delta$, and $\beta$.

$$
|bx(t) - by(t)| \ge \beta\tau\delta - \beta \sum_{j=2}^{i+1} \left[ (1-\alpha\tau)^j - (1-\alpha\tau-\tau)^j \right] - \beta\,\frac{1-\alpha\tau}{\alpha\tau}\left( (1-\alpha\tau)^{i+1} - (1-\alpha\tau-\tau)^{i+1} \right). \qquad (34)
$$

Because $(1-\alpha\tau) > (1-\alpha\tau-\tau)$ and $i > 0$, the following set of inequalities holds. Let $\phi = \dfrac{1}{\alpha\tau(\alpha\tau+\tau)} - 1$ and $\gamma = \dfrac{(1-\alpha\tau)^3}{\alpha\tau^2}$.

$$
\sum_{j=2}^{i+1} \left[ (1-\alpha\tau)^j - (1-\alpha\tau-\tau)^j \right] \le \sum_{j=2}^{\infty} \left[ (1-\alpha\tau)^j - (1-\alpha\tau-\tau)^j \right] = \tau\phi, \qquad (35)
$$

$$
(1-\alpha\tau)^{i+1} - (1-\alpha\tau-\tau)^{i+1} \le (1-\alpha\tau)^{i+1} \le (1-\alpha\tau)^2, \qquad (36)
$$

$$
\frac{1-\alpha\tau}{\alpha\tau}\left( (1-\alpha\tau)^{i+1} - (1-\alpha\tau-\tau)^{i+1} \right) \le \tau\gamma, \qquad (37)
$$

$$
|bx(t) - by(t)| \ge \beta\tau(\delta - \phi - \gamma). \qquad (38)
$$

Since $\beta$, $\tau$, and $\delta$ are positive values, it is sufficient to show that the quantity $\phi + \gamma$ can be made arbitrarily close to zero by selecting appropriate values for $\alpha$ and $\tau$. The upper bound on the range of $\alpha$ is given by the inequality $\alpha\tau + \tau < 1$. For any $c$ such that $(1 - \tau) > c > 0$ the following is an acceptable value for $\alpha$:

$$
\alpha = \frac{1 - (\tau + c)}{\tau}. \qquad (39)
$$

Since $\tau$ can take values arbitrarily close to zero, we can complete the proof of the second property by demonstrating that we can choose appropriate values of $\tau$ and $c$ so that $\phi + \gamma < \delta$. Let $\varepsilon > 0$. For any $\varepsilon < 0.5$ we can choose $\tau = \varepsilon$ and $c = \varepsilon$. Then

$$
\phi + \gamma = \frac{1 + (1-\alpha\tau)^3(1+\alpha)}{\alpha\tau(\alpha\tau+\tau)} - 1 = \frac{1 + (\tau+c)^3\bigl(1 + \tfrac{1-(\tau+c)}{\tau}\bigr)}{(1-(\tau+c))(1-c)} - 1. \qquad (40)
$$

If we plug in our assigned values for $\tau$ and $c$ we get

$$
\phi + \gamma = \frac{1 + 8\varepsilon^3 + 8\varepsilon^2 - 16\varepsilon^3}{(1-2\varepsilon)(1-\varepsilon)} - 1. \qquad (41)
$$


Taking the limit as $\varepsilon$ approaches zero we get

$$
\lim_{\varepsilon \to 0}\,(\phi + \gamma) = (1)(1 + 0 + 0 - 0) - 1 = 0. \qquad (42)
$$

Since choosing any arbitrarily small $\varepsilon$ yields values of $\alpha$ and $\tau$ in the proper range, there must be acceptable values of $\alpha$ and $\tau$ for which $\phi + \gamma < \delta$. Thus $B$ satisfies the second property of a complete memory.

Next we show the third property: if $b \in B$ then $(bT_\beta x)(t) = (bx)(t - \beta)$ for all $t \in \mathbb{Z}^+$, all $x \in X$ and any $\beta$ such that $0 \le \beta \le t$. By using mathematical induction it can be readily shown that $(bT_\beta x)(t) = 1$ for all $t < \beta$. Using this fact, it is then easy to show that the recursive definition of habituation given in Theorem 2 satisfies the third property. Once again we use mathematical induction. Because $b0 = (bT_\beta x)(\beta - 1) = 1$ and $x(0) = T_\beta x(\beta)$, $(bT_\beta x)(\beta) = (bx)(0)$; and since $T_\beta x(t) = x(t - \beta)$ for all $t > \beta$, we can readily show that the assumption $bT_\beta x(\beta + k) = bx(k)$ implies $bT_\beta x(\beta + k + 1) = bx(k + 1)$ for any integer $k > 0$.

Therefore $B$ satisfies the third property of a complete memory. The fourth requirement for $B$ to be a complete memory is that the elements of $B$ are causal. Causality is readily apparent from the definition of $B$ given in Theorem 2, so $B$ is a complete memory and the proof of Theorem 2 is complete.
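The recursion used throughout this proof is easy to check numerically. The sketch below (our own verification, with arbitrarily chosen $\alpha$ and $\tau$ satisfying $\alpha\tau + \tau < 1$ and random inputs in $[0,1]$) iterates $bx(t) = (1 - \tau x(t))\,bx(t-1) + \alpha\tau(1 - bx(t-1))$ with $b0 = 1$, confirms the bound of Eq. (29), and compares the result with the closed form of Lemma 2, Eq. (30).

```python
import numpy as np

def habituation_recursive(x, alpha, tau):
    """Iterate bx(t) = (1 - tau*x(t)) * bx(t-1) + alpha*tau*(1 - bx(t-1)), b0 = 1."""
    b = 1.0
    out = []
    for xt in x:
        b = (1.0 - tau * xt) * b + alpha * tau * (1.0 - b)
        out.append(b)
    return np.array(out)

def habituation_closed_form(x, alpha, tau, t):
    """Lemma 2: bx(t) = alpha*tau + alpha*tau * sum_{j=1}^t prod_{h=j}^t q(h)
    + b0 * prod_{i=0}^t q(i), with q(h) = 1 - alpha*tau - tau*x(h) and b0 = 1."""
    q = 1.0 - alpha * tau - tau * np.asarray(x[: t + 1])
    total = alpha * tau
    for j in range(1, t + 1):
        total += alpha * tau * np.prod(q[j : t + 1])
    return total + 1.0 * np.prod(q)

alpha, tau = 0.5, 0.2                         # alpha*tau + tau = 0.3 < 1
x = np.random.rand(20)                        # inputs in [0, 1]
rec = habituation_recursive(x, alpha, tau)
assert np.all((rec >= 0.0) & (rec <= 1.0))    # the bound of Eq. (29)
assert np.isclose(rec[5], habituation_closed_form(x, alpha, tau, 5))
```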

A.3. Proof of Theorem 3

Once again, in order to show that $B$ is a complete memory, we show that it satisfies the four necessary properties. First we show the first property: that there exist real numbers $a$ and $c$ such that $h_i x(t) \in (a, c)$ for all $h_i \in B$, $x \in X$, and $t \in \mathbb{Z}^+$. From the proof of Theorem 2 we already know that $bx(t) \in [0, 1]$ for all $t$. Since $h_1 = b$ we know that $h_1 x(t) \in [0, 1]$ for all $t$. We now use mathematical induction to demonstrate that $h_i x(t) \in [0, 1]$ for all $i$ and $t$. It is sufficient to show that the assumption that $h_k x(t) \in [0, 1]$ for all $t$ implies $h_{k+1} x(t) \in [0, 1]$ for all $t$. Because of the range of $\alpha$ and $\tau$ values the following inequalities hold:

$$
0 \le (1 - \tau(1 - h_k x(t-1)))\,h_{k+1}x(t-1) \le h_{k+1}x(t-1), \qquad (43)
$$

$$
0 \le \alpha\tau(1 - h_{k+1}x(t-1)) \le 1 - h_{k+1}x(t-1). \qquad (44)
$$

Since $h_{k+1}x(t) = (1 - \tau(1 - h_k x(t-1)))\,h_{k+1}x(t-1) + \alpha\tau(1 - h_{k+1}x(t-1))$, the following inequality also holds:

$$
0 \le h_{k+1}x(t) \le h_{k+1}x(t-1) + 1 - h_{k+1}x(t-1) = 1. \qquad (45)
$$

Therefore $B$ satisfies the first property of a complete memory. Next we show that $B$ satisfies the second property of a complete memory: for any $t \in \mathbb{Z}^+$ and any $t_0$ such that $0 < t_0 \le t$ the following is true. If $x, y \in X$ and $x(t_0) \ne y(t_0)$, then there exists $b \in B$ such that $bx(t) \ne by(t)$. The conditions on $t_0$ and $t$ imply that there is a $j$ such that $0 < j \le t$, $x(j) \ne y(j)$ and, for all positive integers $k < j$, $x(k) = y(k)$. From the definition of $h_1$ it is easy to see that $h_1 x(k) = h_1 y(k)$ for all


$k < j$ and $h_1 x(j) \ne h_1 y(j)$. Using mathematical induction we prove the following lemma.

Lemma 3. For all positive integers $i$, $h_i x(k + i - 1) = h_i y(k + i - 1)$ for all $k < j$ and $h_i x(j + i - 1) \ne h_i y(j + i - 1)$.

Since we already know the lemma is true for i = 1, assume it is true for i = m. Recall that

$$
h_{m+1}x(t) = h_{m+1}x(t-1) + \alpha\tau(1 - h_{m+1}x(t-1)) - \tau\,h_{m+1}x(t-1)(1 - h_m x(t-1)). \qquad (46)
$$

Since $h_{m+1}x(0) = 1 = h_{m+1}y(0)$ and $h_m x(k + m - 1) = h_m y(k + m - 1)$ for all $k < j$, $h_{m+1}x(k + m) = h_{m+1}y(k + m)$ for all $k < j$. Since $h_m x(j + m - 1) \ne h_m y(j + m - 1)$, $h_{m+1}x(j + m) \ne h_{m+1}y(j + m)$. This completes the proof of the lemma. Let $i = t - j + 1$. By the lemma, $h_i x(t) \ne h_i y(t)$. Since $h_i \in B$, $B$ satisfies the second property of a complete memory.

Now we examine the third property: $b \in B$ implies $(bT_\beta x)(t) = bx(t - \beta)$ for all $t \in \mathbb{Z}^+$, all $x \in X$ and any $\beta$ such that $0 \le \beta \le t$. From the proof of Theorem 2, we know that this statement is true for $h_1$. Now we show by induction that it is true for $h_i$ for each positive integer $i$. Assume that $h_m$ meets the requirement. We now show that this also implies $h_{m+1}$ meets the requirement. It can be readily shown that $(h_{m+1}T_\beta x)(t) = 1$ for all $t < \beta$. Because $(h_{m+1}T_\beta x)(\beta - 1) = 1$ and $h_m x(0) = h_m T_\beta x(\beta)$, $(h_{m+1}T_\beta x)(\beta) = (h_{m+1}x)(0)$; and since $h_m T_\beta x(t) = h_m x(t - \beta)$ for all $t > \beta$, we can readily show that the assumption $h_{m+1}T_\beta x(\beta + k) = h_{m+1}x(k)$ implies $h_{m+1}T_\beta x(\beta + k + 1) = h_{m+1}x(k + 1)$ for any integer $k > 0$. Therefore $B$ satisfies the third property of a complete memory.

The fourth property of a complete memory is that $b \in B$ implies $b$ is causal. Each element of $B$ is causal by inspection. So $B$ is a complete memory, and the proof is complete.
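For comparison with the single habituation unit above, the following sketch (again our own illustration) implements the cascade of units used in Theorem 3: $h_1$ is an ordinary habituation unit, and each subsequent unit follows the recursion of Eq. (46), so its effective habituation rate is gated by how habituated the preceding unit already is.

```python
import numpy as np

def cascade_habituation(x, alpha, tau, n_units):
    """Return an array of shape (len(x), n_units) whose column i holds h_{i+1} x(t).

    h_1 follows the basic habituation recursion; for i >= 1,
    h_{i+1} x(t) = h_{i+1} x(t-1) + alpha*tau*(1 - h_{i+1} x(t-1))
                   - tau * h_{i+1} x(t-1) * (1 - h_i x(t-1))        (Eq. (46))
    All units start at 1.
    """
    h = np.ones(n_units)
    out = np.empty((len(x), n_units))
    for t, xt in enumerate(x):
        h_prev = h.copy()
        # First unit: driven directly by the input sample.
        h[0] = (1.0 - tau * xt) * h_prev[0] + alpha * tau * (1.0 - h_prev[0])
        # Later units: driven by how much the preceding unit has habituated.
        for i in range(1, n_units):
            h[i] = (h_prev[i] + alpha * tau * (1.0 - h_prev[i])
                    - tau * h_prev[i] * (1.0 - h_prev[i - 1]))
        out[t] = h
    return out

states = cascade_habituation(np.random.rand(50), alpha=0.5, tau=0.2, n_units=4)
```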

References

[1] C. Bailey, E. Kandel, Molecular approaches to the study of short-term and long-term memory, in: Functions of the Brain, Clarendon Press, Oxford, 1985, pp. 98-129.
[2] W. Banzhaf, K. Kyuma, The time-into-intensity-mapping network, Biol. Cybernet. 66 (1991) 115-121.
[3] T. Bell, Sequential processing using attractor transitions, in: Proc. 1988 Connectionist Models Summer School, 1988, pp. 93-102.
[4] J.H. Byrne, K.J. Gingrich, Mathematical model of cellular and molecular processes contributing to associative and nonassociative learning in aplysia, in: J.H. Byrne, W.O. Berry (Eds.), Neural Models of Plasticity, Academic Press, San Diego, 1989, pp. 58-70.
[5] J.-P. Changeux, S. Dehaene, J.-P. Nadal, Neural networks that learn temporal sequences by selection, Proc. National Acad. Sci. USA 84 (1987) 2727-2731.
[6] G. Cybenko, Approximations by superposition of a sigmoidal function, Math. Control Signals Systems 2 (1989) 303-314.


[7] J. Dayhoff, Regularity properties in pulse transmission networks, in: Proc. 3rd Internat. Joint Conf. on Neural Networks, 1990, pp. III: 621-626.
[8] B. de Vries, J.C. Principe, The gamma model - a new neural net model for temporal processing, Neural Networks 5 (1992) 565-576.
[9] P.M. Djuric, S.M. Kay, G.F. Boudreaux-Bartels, Segmentation of nonstationary signals, in: Proc. ICASSP, 1992, pp. V: 161-164.
[10] J.L. Elman, Finding structure in time, Cognitive Sci. 14 (1990) 179-211.
[11] J. Ghosh, L. Deuser, S. Beck, Impact of feature vector selection on static classification of acoustic transient signals, in: Government Neural Network Applications Workshop, 1990.
[12] J. Ghosh, L. Deuser, S. Beck, A neural network based hybrid system for detection, characterization and classification of short-duration oceanic signals, IEEE J. Ocean Eng. 17 (4) (1992) 351-363.
[13] J. Ghosh, L. Deuser, Classification of spatio-temporal patterns with applications to recognition of sonar sequences, in: T. McMullen, E. Covey, H. Hawkins, R. Port (Eds.), Neural Representation of Temporal Patterns, 1995.
[14] J. Ghosh, S. Wang, A temporal memory network with state-dependent thresholds, in: Proc. IEEE Internat. Conf. on Neural Networks, San Francisco, 1993, pp. I: 359-364.
[15] R. Granger, J. Ambros-Ingerson, G. Lynch, Derivation of encoding characteristics of layer II cerebral cortex, J. Cognitive Neurosci. (1991) 61-78.
[16] J.-P. Hermand, P. Nicolas, Adaptive classification of underwater transients, in: Proc. ICASSP, 1989, pp. 2712-2715.
[17] J. Hertz, A. Krogh, R.G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Reading, MA, 1991.
[18] M.I. Jordan, Serial order: a parallel, distributed processing approach, in: J.L. Elman, D.E. Rumelhart (Eds.), Advances in Connectionist Theory: Speech, Lawrence Erlbaum, Hillsdale, 1989.
[19] S.Y. Kung, Digital Neural Networks, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[20] S. Kurogi, A model of neural network for spatiotemporal pattern recognition, Biol. Cybernet. 57 (1987) 103-114.
[21] R.P. Lippmann, Review of neural networks for speech recognition, Neural Comput. 1 (1) (1989) 1-38.
[22] A. Maren, Neural networks for spatio-temporal recognition, in: A. Maren, C. Harston, R. Pap (Eds.), Handbook of Neural Computing Applications, Academic Press, London, 1990, pp. 295-308.
[23] M.C. Mozer, Neural network architectures for temporal sequence processing, in: A.S. Weigend, N.A. Gershenfeld (Eds.), Time Series Prediction, Addison-Wesley, Reading, MA, 1993, pp. 243-264.
[24] J. Park, I.W. Sandberg, Universal approximation using radial basis function networks, Neural Comput. 3 (2) (1991) 246-257.
[25] D. Robin, P. Abbas, L. Hug, Neural response to auditory patterns, J. Acoust. Soc. Amer. 87 (4) (1990) 1673-1682.
[26] I.W. Sandberg, Structure theorems for nonlinear systems, Multidimensional Systems and Signal Processing 2 (1991) 267-286.
[27] I.W. Sandberg, Multidimensional nonlinear systems and structure theorems, J. Circuits Systems Comput. 2 (4) (1992) 383-388.
[28] I.W. Sandberg, L. Xu, Network approximation of input-output maps and functionals, in: Proc. 34th IEEE Conf. on Decision and Control, 1995.
[29] B.W. Stiles, Dynamic neural networks for classification of oceanographic data, Master's thesis, University of Texas, Austin, Texas, 1994.
[30] B. Stiles, J. Ghosh, Habituation based neural classifiers for spatio-temporal signals, in: Proc. ICASSP-95, 1995.
[31] B. Stiles, J. Ghosh, A habituation based mechanism for encoding temporal information in artificial neural networks, in: SPIE Conf. on Applications and Science of Artificial Neural Networks, vol. 2492, Orlando, FL, April 1995, pp. 404-415.
[32] B. Stiles, J. Ghosh, Some limitations of linear memory architectures for signal processing, in: Proc. 1996 Internat. Workshop on Neural Networks for Identification, Control, Robotics, and Signal/Image Processing, Venice, Italy, 1996.
[33] B. Stiles, J. Ghosh, Some limitations of linear memory architectures for signal processing, Technical Report 96-04-104, The University of Texas at Austin, Computer and Vision Research Center, 1996.


[34] B. Stiles, I. Sandberg, J. Ghosh, Complete memory structures for approximating nonlinear discrete time mappings, IEEE Trans. Neural Networks, submitted.
[35] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K. Lang, Phoneme recognition using time-delay neural networks, IEEE Trans. Acoust. Speech Signal Process. 37 (1989) 328-339.
[36] D. Wang, M.A. Arbib, Complex temporal sequence learning based on short-term memory, Proc. IEEE 78 (9) (1990) 1536-1544.
[37] R.J. Williams, J. Peng, An efficient gradient-based algorithm for on-line training of recurrent network trajectories, Neural Comput. 2 (4) (1990) 490-501.
[38] R.J. Williams, D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Comput. 1 (2) (1989) 270-280.

Bryan W. Stiles was born in Portsmouth, Virginia on September 8, 1970, the son of Jerry and Carolyn Stiles. He graduated from Volunteer High School in Church Hill, Tennessee. He began his undergraduate studies in August 1988, and received the degree of Bachelor of Science in Electrical Engineering from the University of Tennessee at Knoxville. He entered the graduate school of the University of Texas at Austin in August 1992, where he received a Master of Science in August 1994. He is currently pursuing a Ph.D. while employed as a research assistant at the Laboratory for Artificial Neural Systems under the supervision of Dr. Ghosh. His primary field of interest is in the theoretical analysis of dynamic neural network structures.

Joydeep Ghosh graduated from IIT Kanpur (B. Tech '83) and The University of Southern California (MS, Ph.D. '88). He is currently an Associate Professor with the Department of Electrical and Computer Engineering at UT Austin, where he holds the Endowed Engineering Foundation Fellowship. He directs the Laboratory for Artificial Neural Systems (LANS), where his research group is studying adaptive and learning systems. Dr. Ghosh served as the general chairman for the SPIE/SPSE Conference on Image Processing Architectures, Santa Clara, February 1990, as Conference Co-Chair of Artificial Neural Networks in Engineering (ANNIE) '93, ANNIE '94 and ANNIE '95, and on the program committees of several conferences on neural networks and parallel processing. He has published six book chapters and more than 70 refereed papers. He received the 1992 Darlington Award given by the IEEE Circuits and Systems Society for the Best Paper in the areas of CAS/CAD, and also "best conference paper" citations for four papers on neural networks. He is an associate editor of IEEE Trans. Neural Networks and Pattern Recognition.