
JOURNAL OF COMPUTER AND SYSTEM SCIENCES 50, 132-150 (1995)

On the Computational Power of Neural Nets*

Hava T. Siegelmann†

Department of Information Systems Engineering, Technion, Haifa 32000, Israel

AND

Eduardo D. Sontag‡

Department of Mathematics, Rutgers University, New Brunswick, New Jersey 08903

Received February 4, 1992; revised May 24, 1993

This paper deals with finite size networks which consist of interconnections of synchronously evolving processors. Each processor updates its state by applying a "sigmoidal" function to a linear combination of the previous states of all units. We prove that one may simulate all Turing machines by such nets. In particular, one can simulate any multistack Turing machine in real time, and there is a net made up of 886 processors which computes a universal partial recursive function. Products (high order nets) are not required, contrary to what had been stated in the literature. Nondeterministic Turing machines can be simulated by nondeterministic rational nets, also in real time. The simulation result has many consequences regarding the decidability, or more generally the complexity, of questions about recursive nets. © 1995 Academic Press, Inc.

1. INTRODUCTION

 We st udy the com put ati on al ca pabili tie s of re curr ent

first-order neural networks , or as we shall also say from now

on, processor nets.  O ur model consists of a synchronous

netw ork of processors. Its architecture is specified by a

general directed graph. The input and output are presented

as streams. Input letters are transferred one at a time via M  

input channels. A similar convention is applied to the

output, which is produced as a stream of letters, where each

letter is represented by p values. The nodes in the graph arecalled “neurons.” Each neuron updates its activation value

by applying a composition o f a certain one- variable func

tion with an affine combination (i.e., linear combination

and a bias) of the activations of all neurons X j , j =   1,..., N,

* This research was supported in part by U.S. Air Force Grant AFOSR-91-0343.
† E-mail: iehava@ie.technion.ac.il. Work done while at Dept. of Computer Science, Rutgers University.
‡ E-mail: sontag@control.rutgers.edu

0022-0000/95 $6.00
Copyright © 1995 by Academic Press, Inc. All rights of reproduction in any form reserved.

and the external inputs u_k, k = 1, ..., M, with rational valued coefficients—also called weights—(a_{ij}, b_{ij}, c_i). That is, each processor's state is updated by an equation of the type

    x_i(t+1) = σ( Σ_{j=1}^{N} a_{ij} x_j(t) + Σ_{j=1}^{M} b_{ij} u_j(t) + c_i ),   i = 1, ..., N.        (1)

In our results, the function σ is the simplest possible "sigmoid," namely the saturated-linear function

    σ(x) := 0 if x < 0,   x if 0 ≤ x ≤ 1,   1 if x > 1.        (2)
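To make the model concrete, here is a small illustrative sketch (ours, not part of the paper; the two-processor net and its weights are made up) that iterates the update rule (1) with the saturated-linear activation (2), using exact rational arithmetic:

from fractions import Fraction as F

def sigma(x):
    # Saturated-linear activation of Eq. (2): 0 below 0, identity on [0, 1], 1 above 1.
    return F(0) if x < 0 else (F(1) if x > 1 else x)

def step(x, u, A, B, c):
    # One synchronous update, Eq. (1): x_i(t+1) = sigma(sum_j A[i][j] x_j + sum_k B[i][k] u_k + c_i).
    return [sigma(sum(a * xj for a, xj in zip(A[i], x))
                  + sum(b * uk for b, uk in zip(B[i], u))
                  + c[i])
            for i in range(len(x))]

# Toy two-processor net with one input channel; the weights are arbitrary rationals.
A = [[F(1, 2), F(0)], [F(1), F(-1)]]
B = [[F(1)], [F(0)]]
c = [F(0), F(1, 4)]
x = [F(0), F(0)]
for t in range(4):
    x = step(x, [F(1)], A, B, c)
print(x)  # activations remain rational and stay in [0, 1]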

This function has appeared in many applications of neural nets (e.g., [Bat91, BV88, Lip87, ZZZ92]). We focus on this specific activation function because it makes theorems easier to prove, and also because, as it was proved in [SS94], a large family of sigmoidal type activation functions result in models which are not stronger computationally than this one.

The use of sigmoidal functions—as opposed to hard thresholds—is what distinguishes this area from older work that dealt only with finite automata. Indeed, it has long been known, at least since the classical papers by McCulloch and Pitts [WM43] and Kleene [Kle56], how to simulate finite automata by such nets.

As part of the description, we assume that there is singled out a subset of the N processors, say x_{i_1}, ..., x_{i_p}; these are the p output processors, and they are used to communicate the outputs of the network to the environment. (Generally, the output values are arbitrary reals in the range [0, 1], but later we constrain them to binary values only.)


We state that one can simulate all (multi-stack) Turing machines by nets having rational weights. The rational numbers we consider (as weights, not those numbers arising as intermediate values of computation) are simple, small, and do not require much precision. Furthermore, this simulation can be done with no slow down in the computation. In particular, it is possible to give a net made up of 886 processors which computes a universal partial-recursive function. Non-deterministic Turing machines can be simulated by non-deterministic nets, also in real time.

We restrict to rational (rather than real) states and weights in order to preserve computability. It turns out that using real valued weights results in processor nets that can "calculate" super-Turing functions; see [SS94].

1.1. Related Work 

Most of the previous work on recurrent neural networks has focused on networks of infinite size. As each neuron is itself a processor, an infinite net is effectively equivalent to an infinite automaton, which by itself has unbounded processing power. Therefore, infinite models are less challenging for the investigation of computational power, compared to our model which has only a finite number of neurons and is bounded computationally.

There has been previous work concerned with computability by finite networks, however, starting with the classical work of McCulloch and Pitts, cited above, on finite automata. Another related result was due to Pollack [Pol87]. Pollack argued that a certain recurrent net model, which he called a "neuring machine," is universal. The model in [Pol87] consisted of a finite number of neurons of two different kinds, having identity and threshold responses, respectively. The machine was high-order; that is, the activations were combined using multiplications, as opposed to just linear combinations (as in Eq. (1)). That is, the value of x_i is updated by means of a formula of the type

    x_i(t+1) = σ( Σ_{|ω|, |β| ≤ k} a_{iωβ} x^ω(t) u^β(t) ),   i = 1, ..., N,        (3)

where ω, β are multi-indices, "| |" denotes their magnitudes (total weights), and k < ∞.

Pollack left as an open question whether high-order connections are really necessary in order to achieve universality, although he conjectured that they are. High order networks (cf. [CSSM89, Elm90, GMC+92, Pol87, SCLG91, WZ89]) have been used in applications. One motivation often cited for the use of high-order nets was Pollack's conjecture regarding their superior computational power. We see that no such superiority of computational power exists, at least when formalized in terms of polynomial-time computation.

Work that deals with infinite structure is reported by Hartley and Szu [HS87] and by Franklin and Garzon [FG90, GF89], some of which deals with cellular automata. There one assumes an unbounded number of neurons, as opposed to a finite number fixed in advance. In the paper [Wol91], Wolpert studies a class of machines with just linear activation functions and shows that this class is at least as powerful as any Turing machine (and clearly has super-Turing capabilities as well). It is essential in that model, again, that the number of "neurons" be allowed to be infinite—as a matter of fact, in [Wol91] the number of such units is even uncountable—as the construction relies on using different neurons to encode different possible stack configurations in multi-stack Turing machines.

The idea of using continuous-valued neurons in order to attain gains in computational capabilities as compared with threshold gates had been investigated before. However, prior work considered only the special case of feedforward nets—see, for instance, [Son92] for questions of approximation and function interpolation and [MSS91] for questions of Boolean circuit complexity.

The computability of an optical beam-tracing system consisting of a finite number of elements was discussed in [RTY90]. One of the models described in that paper is somewhat similar to ours, since it involves operations which are linear combinations of the parameters, with rational coefficients only, passing through the optical elements and having recursive computational power. In that model, however, three types of basic elements are involved, and the simulated Turing machine has a unary tape. Furthermore, the authors of that paper assume that the system can instantaneously differentiate between two numbers, no matter how close, which is not a logical assumption for our model.

1.2. Consequences and Future Work 

The simulation result has many consequences regarding the decidability, or more generally the complexity, of questions about recursive nets of the type we consider. For instance, determining if a given neuron ever assumes the value "0" is effectively undecidable (as the halting problem can be reduced to it); on the other hand, the problem appears to become decidable if a linear activation is used (halting in that case is equivalent to a fact that is widely conjectured to follow from classical results due to Skolem and others on rational functions; see [BR88, p. 75]), and is also decidable in the pure threshold case (there are only finitely many states). As our function σ is in a sense a combination of threshold and linear functions, this gap in decidability is perhaps remarkable. Given the linear-time simulation bound, it is of course also possible to transfer


NP-completeness results into the same questions for nets (with rational coefficients). Another consequence of our results (when using the halting problem) is that the problem of determining whether a dynamical system of the particular form

    x(t+1) = σ( A x(t) + c )

ever reaches an equilibrium point, from a given initial state, is effectively undecidable. Such models have been proposed in the neural and analog computation literature for dealing with content-addressable retrieval and optimization, notably in the work of Hopfield (see, e.g., [HT85]). In associative memories, for instance, the initial state is taken as the "input pattern" and the final state, as a class representative. All convergence results currently known assume special structure of the A matrix—for instance, that it is symmetric—but our undecidability conclusion shows that such results will not be possible for general systems of the above type.

Another corollary is that higher order networks are computationally equivalent, up to a polynomial time, to first-order networks. (A finite network with rational weights is finitely describable and can be efficiently simulated by a Turing machine, which in turn can be simulated by first-order networks.)

Many other types of "machines" may be used for universality (see [Son90, especially Chap. 2] for general definitions of continuous machines). For instance, we can show that systems evolving according to equations

    x(t+1) = x(t) + τ( A x(t) + b u(t) + c ),

where τ takes the sign in each coordinate, again are universal in a precise sense. It is interesting to note that this equation represents an Euler approximation of a differential equation; this suggests the existence of continuous-time simulations of Turing machines, quite different technically from the work on analog computation by Pour-El [PE74] and others. A different approach to continuous-valued models of computation is given in [BSS89] and other papers by the same authors; in that context, our processor nets can be viewed as programs with loops in which linear operations and linear comparisons are allowed, but with an added restriction on branching that reflects the nature of the saturated response we use.

The rest of this paper is organized as follows. Section 2 provides the exact definitions of the model and states the results. It is followed by Section 3, which highlights the main proof. Details of the proof are provided in Sections 4 to 7. In Section 8, we describe the universal network, and we end in Section 9, where we briefly describe the non-deterministic version of networks.

2. THE MODEL AND MAIN RESULTS

We model a net, as described in the introduction, as a dynamical system. At each instant, the state of this system is a vector x(t) ∈ Q^N of rational numbers, where the ith coordinate keeps track of the activation value of the ith processor. Given a system of equations such as (1), an initial state x(1), and an infinite input sequence

    u = u(1), u(2), ...,

we can define iteratively the state x(t) at time t, for each integer t ≥ 1, as the value obtained by recursively solving the equations. This gives rise, in turn, to a sequence of output values, by restricting attention to the output processors; we refer to this sequence as the "output produced by the input u" starting from the given initial state.

We want to define what we mean by a net computing a function

    φ: {0,1}^+ → {0,1}^+,

where {0,1}^+ denotes the free semigroup of binary strings (excluding the empty string). To do so, we must first define what we mean by a formal network, a network which adheres to a rigid encoding of its input and output. We define formal nets with two binary input lines. The first of these is a data line, and it is used to carry a binary input signal; when no signal is present, it defaults to zero. The second is the validation line, and it indicates when the data line is active; it takes the value "1" while the input is present there and "0" thereafter. We use "D" and "V" to denote the contents of these two lines, respectively, so that

    u(t) = (D(t), V(t)) ∈ {0,1}^2

for each t. We always take the initial state x(1) to be zero and to be an equilibrium state. We assume that there are two output processors, which also take the role of data and validation lines and are denoted by "H" and "G", respectively. The output of the network is then given by

    y(t) = (H(t), G(t)) ∈ {0,1}^2.

(The convention of using two input lines allows us to have all external signals be binary; of course many other conventions are possible and would give rise to the same results; for instance, one could use a three-valued input, say with values {−1, 0, 1}, where "0" indicates that no signal is present, and ±1 are the two possible binary input values.)

In general, our discrete-time dynamical system (with two binary inputs) is specified by a dynamics map

    F: Q^N × {0,1}^2 → Q^N.

One defines the state at time t, for each integer t > 1, as the value obtained by recursively solving the equations:

    x(1) := x_init,   x(t+1) := F(x(t), u(t)),   t = 1, 2, ... .


For each N ∈ ℕ, we denote by

    σ_N: Q^N → Q^N

the mapping

    (q_1, ..., q_N) ↦ (σ(q_1), ..., σ(q_N)),

and we drop the subscript N when clear from the context. (We think of elements of Q^N as column vectors, but for display purposes we sometimes show them as rows, as above. As usual, Q denotes the rationals, and ℕ denotes the natural numbers, not including zero.) Here is a formal definition.

DEFINITION 2.1. A σ-processor net 𝒩 with two binary inputs is a dynamical system having a dynamics map of the form

    F(x, u) = σ( A x + b_1 u_1 + b_2 u_2 + c ),

for some matrix A ∈ Q^{N×N} and three vectors b_1, b_2, c ∈ Q^N.

The "bias" vector c is not needed, as one may always add a processor with constant value 1, but using c simplifies the notation and allows the use of zero initial states, which seems more natural. When b_1 = b_2 = 0, one has a net without inputs. Processor nets appear frequently in neural network studies, and their dynamic properties are of interest (see, for instance, [MW89]).

It is obvious that—with zero, or in general a rational, initial state—one can simulate a processor net with a Turing machine. We wish to prove, conversely, that any function computable by a Turing machine can be computed by such a processor net. We look at partially defined maps

    φ: {0,1}^+ → {0,1}^+

that are recursively computed. In other words, maps for which there is a multi-stack Turing machine M so that, when a word ω ∈ {0,1}^+ is written initially on the input/output stack, M halts on ω if and only if φ(ω) is defined, and in that case, φ(ω) is the content of that stack when the machine halts.

For each word

    ω = ω_1 ··· ω_k ∈ {0,1}^+

with

    φ(ω) = p_1 ··· p_l ∈ {0,1}^+ or undefined

and each r ∈ ℕ (l is called the length of the response/output and r is called the response time, the time it takes to provide the first output bit), we encode the input and output as follows. Input is encoded as

    u_ω(t) = (D_ω(t), V_ω(t)),   t = 1, ...,

where

    V_ω(t) = 1 if t = 1, ..., k, and 0 otherwise,

and

    D_ω(t) = ω_t if t = 1, ..., k, and 0 otherwise.

Output is encoded as

    y_{ω,r}(t) = (G_{ω,r}(t), H_{ω,r}(t)),   t = 1, ...,

where

    G_{ω,r}(t) = 1 if t = r, ..., (r + l − 1), and 0 otherwise,

if the output φ(ω) is defined, and is 0 when φ(ω) is not defined, and finally

    H_{ω,r}(t) = p_{t−r+1} if t = r, ..., (r + l − 1), and 0 otherwise,

if the output φ(ω) is defined, and is 0 when φ(ω) is not defined.
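As an illustration (ours, not from the paper; the function name and the 0-based time index are our own), the data/validation convention for the input lines can be sketched as follows:

def input_streams(omega, horizon):
    # Data line D carries the bits of omega and defaults to 0 afterwards;
    # validation line V is 1 exactly while the input word is being presented.
    k = len(omega)
    D = [int(omega[t]) if t < k else 0 for t in range(horizon)]
    V = [1 if t < k else 0 for t in range(horizon)]
    return D, V

D, V = input_streams("1011", horizon=8)
print(D)  # [1, 0, 1, 1, 0, 0, 0, 0]
print(V)  # [1, 1, 1, 1, 0, 0, 0, 0]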

THEOREM 1. Let φ: {0,1}^+ → {0,1}^+ be any recursively computable partial function. Then, there exists a processor net 𝒩 with the following property:

If 𝒩 is started from the zero (inactive) initial state, and the input sequence u_ω is applied, 𝒩 produces the output y_{ω,r} (where H_{ω,r} appears in the "output node," and G_{ω,r} in the "validation node") for some r.

Furthermore, if a p-stack Turing machine M (of one input stack and several working stacks) computes φ(ω) in time T(ω), then one may take r(ω) = T(ω) + O(|ω|).

Note that in particular it follows that one can obtain the behavior of a universal Turing machine via some net. An upper bound from the construction shows that N = 886 processors are sufficient for computing such a universal partial recursive function.

2.1. Restatement: No I/O

It will be convenient to have a version of Theorem 1 that does not involve inputs but rather an encoding of the initial data into the initial state. (This would be analogous, for Turing machines, to an encoding of the input into an input stack rather than having it arrive as an external input stream.)

For a processor net without inputs, we may think of the dynamics map F as a map Q^N → Q^N. In that case, we denote by F^k the kth iterate of F; for a state ξ ∈ Q^N, we let ξ^j := F^j(ξ). We now state that if φ: {0,1}^+ → {0,1}^+ is a recursively computable partial function, then there exists a processor net 𝒩 without inputs, and an encoding of data into the initial state of 𝒩, such that: φ(ω) is undefined if and only if the second processor has activation value always equal to zero, and it is defined if this value ever becomes equal to one, in which case the first processor has an encoding of the result.

Given ω = ω_1 ··· ω_k ∈ {0,1}^+, we define the encoding function

    δ[ω] := Σ_{i=1}^{k} (2ω_i + 1) / 4^i.        (4)

(Note that the empty sequence gets mapped into 0.)
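A minimal sketch (ours) of the encoding map (4): it maps bit 0 to the base-4 digit 1 and bit 1 to the digit 3, so every encoded word lands in the Cantor 4-set discussed in Section 4.

from fractions import Fraction as F

def delta(omega):
    # Eq. (4): bit 0 -> digit 1, bit 1 -> digit 3, read as a base-4 fraction.
    return sum(F(2 * int(b) + 1, 4 ** (i + 1)) for i, b in enumerate(omega))

print(delta(""))    # 0, the empty word
print(delta("1"))   # 3/4
print(delta("01"))  # 1/4 + 3/16 = 7/16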

THEOREM 2. Let M be a p-stack Turing machine computing φ: {0,1}^+ → {0,1}^+ in time T. Then there exists a processor net 𝒩 without inputs so that properties (a) and (b) below hold. In both cases, for each ω ∈ {0,1}^+, we consider the corresponding initial state

    ξ(ω) := (δ[ω], 0, ..., 0) ∈ Q^N.

(a) If φ(ω) is undefined, the second coordinate ξ(ω)_2^j of the state after j steps is identically equal to zero, for all j. If instead φ(ω) is defined and computed in time T, then there exists T' = O(T) so that

    ξ(ω)_2^j = 0,   j = 0, ..., T'−1,        ξ(ω)_2^{T'} = 1,

and ξ(ω)_1^{T'} = δ[φ(ω)]. (This is a linear time simulation.)

(b) Furthermore, if one substitutes δ in Eq. (4) by a new map δ_p given by

    δ_p[ω] := Σ_{i=1}^{k} ε_{ω_i} / (10p²)^i,   with ε_0 = 10p² − 4p − 1 and ε_1 = 10p² − 1,        (5)

then the construction can be done in such a way that T' = T. (This is a real time simulation.)

The next few sections include the proofs of both theorems. We start by proving Theorem 2, and then obtain Theorem 1 as a corollary. As the details of the proof of Theorem 2 are very technical, we start by sketching the main steps of the proof in Section 3. The proof itself is organized into several steps. We first show how to construct a network that simulates a given multi-stack Turing machine M in time T' = 4T (T is the computation time of the Turing machine). This is done in Sections 4 and 5. In Section 6, we modify the construction into a network that simulates a given multi-stack Turing machine with no slow down in the computation. This is done in three steps and is based on allowing for large negative numbers to act as inhibitors.

After Theorem 2 is proved, we show in Section 7 how to add inputs and outputs to a processor net without input/output, thus obtaining Theorem 1 as its corollary. This ends the proofs.

3. HIGHLIGHTS OF THE PROOF

This section is aimed at highlighting the main part of the proof of Theorem 2. We start with a 2-stack machine; this model is obviously equivalent to a 1-tape (standard) Turing machine [HU79]. Formally, a 2-stack machine consists of a finite control and two binary stacks, unbounded in length. Each stack is accessed by a read-write head, which is located at the top element of the stack. At the beginning of the computation, the binary input sequence is written on stack_1. The machine at every step reads the top element of each stack as a symbol in {0, 1, #} (where # means that the stack is empty), checks the state of the control (s ∈ {1, 2, ..., |S|}), and executes the following operations:

1. For each stack, one of the following manipulations is made:

(a) Popping the top element.

(b) Pushing an extra element on the top of the stack (either 0 or 1).

(c) No change in stack.

2. Changing the state of the control.

When the control reaches a special state, called the "halting state," the machine stops. Its output is defined as the binary sequence on stack_1. Thus, the I/O map or function, computed by a 2-stack machine, is defined by the binary sequences on stack 1 before and after the computation.

3.1. Encoding the Stacks

Assume that we were to encode a binary stream ω = ω_1 ω_2 ··· ω_n into the number

    Σ_{i=1}^{n} ω_i / 2^i.

Such a value could be held in a neuron, since it ranges in [0, 1]. However, there are a few problems with this simple encoding. First, one cannot differentiate between the strings "β" and "β · 0," where "·" denotes the concatenation operator. Second, even when assuming that each binary string ends with 1, the continuity of the activation function σ makes it impossible to retrieve the most significant bit (in


3.5. Real Time Simulation

The above-suggested encoding and the use of the lemma result in a Turing machine simulation that requires four steps of the network for each step of the Turing machine. To achieve a "step per step" simulation, a more sophisticated encoding is required, which relies upon a Cantor set with a specific size of gaps. Then, we use large negative numbers as inhibitors, and attain in this manner the desired simulation in real time. We turn now to the formal proof.

4. GENERAL CONSTRUCTION OF THE SIMULATION

As a departure point, we pick p'-tape Turing machines with binary alphabets. We may equivalently study pushdown automata with p = 2p' binary stacks. We choose to represent the values in the stacks as fractions with denominators which are powers of four. An algebraic formalization is as follows.

4.1. p-Stack Machines

Denote by 𝒞 the "Cantor 4-set" consisting of all those rational numbers q which can be written in the form

    q = Σ_{i=1}^{k} a_i / 4^i

with 0 ≤ k < ∞ and each a_i = 1 or 3. (When k = 0, we interpret this sum as q = 0.) Elements of 𝒞 are precisely those of the form δ[ω], where δ is as in Eq. (4).

The instantaneous description of a p-stack machine, with a control unit of n states, can be represented by a (p + 1)-tuple

    (s, δ[ω_1], δ[ω_2], ..., δ[ω_p]),

where s is the state of the control unit, and the stacks store the words ω_i (i = 1, ..., p), respectively. (Later, in the simulation by a net, the state s will be represented in unary as a vector of the form (0, 0, ..., 0, 1, 0, ..., 0).)

For any q ∈ 𝒞, we write

    ξ[q] := 1 if a_1 = 3, and 0 otherwise;        τ[q] := 1 if q > 0, and 0 if q = 0.

We think of ξ[·] as the "top of stack": in terms of the base-4 expansion, ξ[q] = 0 when a_1 = 1 (or q = 0), and ξ[q] = 1 when a_1 = 3. We interpret τ[·] as the "nonempty stack" operator. It can never happen that ξ[q] = 1 while τ[q] = 0; hence the pair (ξ[q], τ[q]) can have only three possible values in {0, 1}^2.

DEFINITION 4.1. A p-stack machine M is specified by a (p + 4)-tuple

    (S, s_I, s_H, θ_0, θ_1, θ_2, ..., θ_p),

where S is a finite set, s_I and s_H are elements of S called the initial and halting states, respectively, and the θ_i's are maps as follows:

    θ_0: S × {0,1}^{2p} → S
    θ_i: S × {0,1}^{2p} → {(1, 0, 0), (1/4, 0, 1/4), (1/4, 0, 3/4), (4, −2, −1)}   for i = 1, ..., p.

(The function θ_0 computes the next state, while the functions θ_i compute the next stack operations of stack_i, respectively. The actions depend only on the state of the control unit and the symbol being read from each stack. The elements in the range

    (1, 0, 0),  (1/4, 0, 1/4),  (1/4, 0, 3/4),  (4, −2, −1)

of the θ_i should be interpreted as "no operation," "push0," "push1," and "pop," respectively.)

The set 𝒳 := S × 𝒞^p is called the instantaneous description set of M, and the map

    P: 𝒳 → 𝒳

defined by

    P(s, q_1, ..., q_p) := [ θ_0(s, ξ[q_1], ..., ξ[q_p], τ[q_1], ..., τ[q_p]),
                             θ_1(s, ξ[q_1], ..., ξ[q_p], τ[q_1], ..., τ[q_p]) · (q_1, ξ[q_1], 1), ...,
                             θ_p(s, ξ[q_1], ..., ξ[q_p], τ[q_1], ..., τ[q_p]) · (q_p, ξ[q_p], 1) ],

where the dot indicates inner product, is the complete dynamics map of M. As part of the definition, it is assumed that the maps θ_i (i = 1, ..., p) never take the value (4, −2, −1) when τ[q_i] = 0, for all s, q_1, ..., q_p (that is, one does not attempt to pop an empty stack).

Let ω ∈ {0,1}^+ be arbitrary. If there exists a positive integer k, so that starting from the initial state s_I with


δ[ω] on the first stack and empty other stacks, the machine reaches after k steps the halting state s_H; that is,

    P^k(s_I, δ[ω], 0, ..., 0) = (s_H, δ[ω_1], δ[ω_2], ..., δ[ω_p])

for some k, then the machine M is said to halt on the input ω. If ω is like this, let k be the least possible number such that

    P^k(s_I, δ[ω], 0, ..., 0)

has the above form. Then the machine M is said to output the string ω_1, and we let φ_M(ω) := ω_1. This defines a partial map

    φ_M: {0,1}^+ → {0,1}^+,

the i/o map of M.

Save for the algebraic notation, the partial recursive functions φ: {0,1}^+ → {0,1}^+ are exactly the same as the i/o maps φ_M of p-stack machines as defined here; it is only necessary to identify words in {0,1}^+ and elements of 𝒞 via the above encoding map δ. Our proof will then be based on simulating p-stack machines by processor nets.
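The following sketch (ours, not from the paper) spells out the stack operations of Definition 4.1 as affine maps on the base-4 encoding, together with the readings ξ[q] = σ(4q − 2) and τ[q] = σ(4q) that are used in Section 5:

from fractions import Fraction as F

def sigma(x):
    return F(0) if x < 0 else (F(1) if x > 1 else x)

def delta(omega):
    return sum(F(2 * int(b) + 1, 4 ** (i + 1)) for i, b in enumerate(omega))

def top(q):       # xi[q]: 1 iff the leading base-4 digit is 3
    return sigma(4 * q - 2)

def nonempty(q):  # tau[q]: 1 iff q > 0 (the smallest nonzero element of the Cantor 4-set is 1/4)
    return sigma(4 * q)

def push0(q):     # coefficients (1/4, 0, 1/4)
    return q / 4 + F(1, 4)

def push1(q):     # coefficients (1/4, 0, 3/4)
    return q / 4 + F(3, 4)

def pop(q):       # coefficients (4, -2, -1)
    return 4 * q - 2 * top(q) - 1

q = delta("10")   # a stack holding the word 1 0, top first
assert top(q) == 1 and nonempty(q) == 1
assert pop(push0(q)) == q and pop(push1(q)) == q
assert push1(delta("0")) == delta("10")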

5. NETWORK WITH FOUR LEVELS: CONSTRUCTION

Assume that a p-stack machine M is given. Without loss of generality, we assume that the initial state s_I differs from the halting state s_H (otherwise the function computed is the identity, which can be easily implemented by a net), and we assume that S := {0, ..., s}, with s_I = 0 and s_H = 1. We build the net in two stages.

• Stage 1. As an intermediate step in the construction, we shall show how to simulate M with a certain dynamical system over Q^{s+p}. Writing a vector in Q^{s+p} as

    (x_1, ..., x_s, q_1, ..., q_p),

the first s components will be used to encode the state of the control unit, with 0 ∈ S corresponding to the zero vector x_1 = ··· = x_s = 0 and i ∈ S, i ≠ 0, corresponding to the ith canonical vector

    e_i = (0, ..., 0, 1, 0, ..., 0)

(the "1" is in the ith position). For convenience, we also use the notation e_0 := 0 ∈ Q^s. The q_i's will encode the contents of the stacks. For notational ease, we substitute ξ[q_i] and τ[q_i] by a_i and b_i, respectively. Formally, define

    β_{ij}: {0,1}^{2p} → {0,1}        (6)

for i ∈ {1, ..., s}, j ∈ {0, ..., s}, and

    γ^k_{ij}: {0,1}^{2p} → {0,1}        (7)

for i ∈ {1, ..., p}, j ∈ {0, ..., s}, k = 1, 2, 3, 4, as

    β_{ij}(a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = 1  ⇔  θ_0(j, a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = i

(intuitively, there is a transition from state j of the control part to state i iff the readings from the stacks are: top of stack_k is a_k, and the nonemptyness test on stack_k gives b_k). γ^k_{ij} is zero except in the following cases:

    γ^1_{ij}(a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = 1  ⇔  θ_i(j, a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = (1, 0, 0)

(if the control is in state j and the stack readings are a_1, a_2, ..., a_p, b_1, b_2, ..., b_p, then stack i will not be changed);

    γ^2_{ij}(a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = 1  ⇔  θ_i(j, a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = (1/4, 0, 1/4)

(if the control is in state j and the stack readings are a_1, a_2, ..., a_p, b_1, b_2, ..., b_p, then operation push0 will occur on stack i);

    γ^3_{ij}(a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = 1  ⇔  θ_i(j, a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = (1/4, 0, 3/4)

(if the control is in state j and the stack readings are a_1, a_2, ..., a_p, b_1, b_2, ..., b_p, then operation push1 will occur on stack i);

    γ^4_{ij}(a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = 1  ⇔  θ_i(j, a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = (4, −2, −1)

(if the control is in state j and the stack readings are a_1, a_2, ..., a_p, b_1, b_2, ..., b_p, then operation pop will occur on stack i).

Let Φ be the map Q^{s+p} → Q^{s+p},

    (x_1, ..., x_s, q_1, ..., q_p) ↦ (x_1^+, ..., x_s^+, q_1^+, ..., q_p^+),

where, using the notation x_0 := 1 − Σ_{i=1}^{s} x_i,

    x_i^+ := Σ_{j=0}^{s} β_{ij}(a_1, ..., a_p, b_1, ..., b_p) x_j        (8)


for i = 1, ..., s, and

    q_i^+ := ( Σ_{j=0}^{s} γ^1_{ij}(a_1, ..., a_p, b_1, ..., b_p) x_j ) q_i                     (9.1)
           + ( Σ_{j=0}^{s} γ^2_{ij}(a_1, ..., a_p, b_1, ..., b_p) x_j ) (q_i/4 + 1/4)           (9.2)
           + ( Σ_{j=0}^{s} γ^3_{ij}(a_1, ..., a_p, b_1, ..., b_p) x_j ) (q_i/4 + 3/4)           (9.3)
           + ( Σ_{j=0}^{s} γ^4_{ij}(a_1, ..., a_p, b_1, ..., b_p) x_j ) (4q_i − 2ξ[q_i] − 1)     (9.4)

for i = 1, ..., p.

Let π: 𝒳 = S × 𝒞^p → Q^{s+p} be defined by

    π(i, q_1, ..., q_p) := (e_i, q_1, ..., q_p).

It follows immediately from the construction that

    Φ(π(i, q_1, ..., q_p)) = π(P(i, q_1, ..., q_p))

for all (i, q_1, ..., q_p) ∈ 𝒳.

Applied inductively, the above implies that

    Φ^k(e_0, δ[ω], 0, ..., 0) = π(P^k(0, δ[ω], 0, ..., 0))

for all k, so φ(ω) is defined if and only if for some k it holds that Φ^k(e_0, δ[ω], 0, ..., 0) has the form

    (e_1, q_1, ..., q_p)

(recall that for the original machine, s_I = 0 and s_H = 1, which map respectively to e_0 = 0 and e_1 in the first s coordinates of the corresponding vector in Q^{s+p}). If such a state is reached, then q_1 is in 𝒞 and its value is δ[φ(ω)].

• Stage 2. The second stage of the construction simulates the dynamics Φ by a net. We first need an easy technical fact.

LEMMA 5.1. Let t ∈ ℕ. For each function β: {0,1}^t → {0,1} there exist vectors

    v_1, v_2, ..., v_{2^t} ∈ Z^{t+2}

and scalars

    c_1, c_2, ..., c_{2^t}

such that, for each d_1, d_2, ..., d_t, x ∈ {0,1} and each q ∈ [0, 1],

    β(d_1, d_2, ..., d_t) x = Σ_{r=1}^{2^t} c_r σ(v_r · μ)        (10)

and

    β(d_1, d_2, ..., d_t) x q = σ( q + Σ_{r=1}^{2^t} c_r σ(v_r · μ) − 1 ),        (11)

where we denote μ = (1, d_1, d_2, ..., d_t, x) and "·" denotes the inner product in Z^{t+2}.

Proof. Write β as a polynomial

    β(d_1, d_2, ..., d_t) = c_1 + c_2 d_1 + ··· + c_{t+1} d_t + c_{t+2} d_1 d_2 + ··· + c_{2^t} d_1 d_2 ··· d_t,        (12)

expand the product β(d_1, d_2, ..., d_t) x, and use that for any sequence l_1, ..., l_k of elements in {0,1}, one has

    l_1 ··· l_k = σ(l_1 + ··· + l_k − k + 1).

Using that x = σ(x), this gives that

    β(d_1, d_2, ..., d_t) x = c_1 σ(x) + c_2 σ(d_1 + x − 1) + ··· + c_{2^t} σ(d_1 + d_2 + ··· + d_t + x − t)
                            = Σ_{r=1}^{2^t} c_r σ(v_r · μ)

for suitable c_r's and v_r's. On the other hand, for each z ∈ {0,1} and each q ∈ [0, 1] it holds that zq = σ(q + z − 1) (just check separately for z = 0, 1), so substituting the above formula with z = β(d_1, d_2, ..., d_t) x gives the desired result. ∎
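As a concrete illustration of Lemma 5.1 (ours; the choice β = AND is just an example, not the paper's), for β(d_1, d_2) = d_1 d_2 a single term suffices: β(d_1, d_2) x = σ(d_1 + d_2 + x − 2), i.e., v = (−2, 1, 1, 1) against μ = (1, d_1, d_2, x); Eq. (11) then folds in the multiplication by q:

from fractions import Fraction as F

def sigma(z):
    return F(0) if z < 0 else (F(1) if z > 1 else z)

def beta_times_x(d1, d2, x):
    # beta(d1, d2) = d1 AND d2; on {0,1} arguments, (d1 and d2)*x = sigma(d1 + d2 + x - 2).
    return sigma(d1 + d2 + x - 2)

def beta_times_xq(d1, d2, x, q):
    # Eq. (11): beta(d1, d2) * x * q = sigma(q + beta_times_x(d1, d2, x) - 1) for q in [0, 1].
    return sigma(q + beta_times_x(d1, d2, x) - 1)

q = F(3, 10)
for d1 in (0, 1):
    for d2 in (0, 1):
        for x in (0, 1):
            assert beta_times_x(d1, d2, x) == (d1 & d2) * x
            assert beta_times_xq(d1, d2, x, q) == (d1 & d2) * x * q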

Remark 5.2. The above construction overestimates the amount of neurons necessary. In the case where t = 2p and the arguments are the top and nonempty functions of the stacks, the arguments are dependent, and there is a need for just 3^p terms in the summation, rather than 2^{2p}. We illustrate this phenomenon with the simple case of p = 2, that is, the case of two-stack machines. Here, when

    (d_1, d_2, d_3, d_4) = (a_1, a_2, b_1, b_2)   ( = (ξ[q_1], ξ[q_2], τ[q_1], τ[q_2]) ),


one can easily verify that from the following nine elements

    (1, a_1, a_2, b_1, b_2, a_1 a_2, a_1 b_2, a_2 b_1, b_1 b_2),

one can obtain—using affine combinations only—the value of the remaining seven elements

    (a_1 b_1, a_2 b_2, a_1 a_2 b_1, a_1 a_2 b_2, a_1 b_1 b_2, a_2 b_1 b_2, a_1 a_2 b_1 b_2).

Hence, a sum of the type in Eq. (10) requires in this case only nine (3^p) elements rather than sixteen (2^{2p}).

We now return to the proof of the theorem. Apply Lemma 5.1 repeatedly, with the "β" of the lemma corresponding to each of the β_{ij}'s and γ^k_{ij}'s, and using variously q = q_i, q = (q_i/4 + 1/4), q = (q_i/4 + 3/4), or q = (4q_i − 2ξ[q_i] − 1). Write also σ(4q_i − 2) whenever ξ[q_i] appears (using that ξ[q] = σ(4q − 2) for each q ∈ 𝒞), and σ(4q_i) whenever τ[q_i] appears. The result is that Φ can be written as a composition

    Φ = F_1 ∘ F_2 ∘ F_3 ∘ F_4

of four "saturated-affine" maps, i.e., maps of the form σ(Ax + c): F_4: Q^{s+p} → Q^μ, F_3: Q^μ → Q^ν, F_2: Q^ν → Q^η, F_1: Q^η → Q^{s+p}, for some positive integers μ, ν, η. (The argument to the function F_4, called below z_1, of dimension (s + p), represents the x_i's of Eq. (8) and the q_i's of Eqs. (9). The functions F_1, F_2, F_3 compute the transition function of the x_i's and q_i's in three stages.)

Consider the following set of equations, where from now on we omit time arguments and use the superscript + to indicate a one-step time increment:

    z_1^+ = F_1(z_2)
    z_2^+ = F_2(z_3)
    z_3^+ = F_3(z_4)
    z_4^+ = F_4(z_1),

where the z's are vectors of sizes s + p, η, ν, and μ, respectively. This set of equations models the dynamics of a σ-processor net, with

    N = s + p + μ + ν + η

processors. For an initial state of type z_1 = (e_0, δ[ω], 0, ..., 0) and z_i = 0, i = 2, 3, 4, it follows that at each time of the form t = 4k the first block of coordinates, z_1, equals Φ^k(e_0, δ[ω], 0, ..., 0).

All that is left is to add a mod-4 counter to impose the constraint that state values at times that are not divisible by 4 should be disregarded. The counter is implemented by adding a set of equations

    y_1^+ = σ(y_2),
    y_2^+ = σ(y_3),
    y_3^+ = σ(y_4),
    y_4^+ = σ(1 − y_2 − y_3 − y_4).

When starting with all y_i(0) = 0, it holds that y_1(t) = 1 if t = 4k for some positive integer k, and is zero otherwise.
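A quick check (ours) that the counter behaves as claimed:

def sigma(z):
    return 0 if z < 0 else (1 if z > 1 else z)

y1 = y2 = y3 = y4 = 0
ones = []
for t in range(1, 13):
    y1, y2, y3, y4 = sigma(y2), sigma(y3), sigma(y4), sigma(1 - y2 - y3 - y4)
    if y1 == 1:
        ones.append(t)
print(ones)  # [4, 8, 12]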

In terms of the complete system, φ(ω) is defined if and only if there exists a time t such that, starting at the state z_1 = (e_0, δ[ω], 0, ..., 0), z_i = 0, i = 2, 3, 4, y_j = 0, j = 1, 2, 3, 4, the first coordinate z_{11}(t) of z_1(t) equals 1 and also that y_1(t) = y_2(t − 1) = 1. To force z_{11}(t) not to output arbitrary values at times that are not divisible by 4, we modify it to

    z_{11}^+ = σ( ··· + l(y_2 − 1) ),

where "···" is as in the original update equation for z_{11}, and l is a positive constant bigger than the largest possible value of z_{11}. The net effect of this modification is that now z_{11}(t) = 0 for all t that are not multiples of 4, and for t = 4k it equals 1 if the machine should halt, and 0 otherwise. Reordering the coordinates so that the first stack (the (s + 1)th coordinate of z_1) becomes the first coordinate, and the previous z_{11} (that represented the halting state s_H of the machine M) becomes the second coordinate, Theorem 2(a) is proved. ∎

5.1. A Layout of the Construction

The above construction can be represented pictorially as in Fig. 1.

For now, ignore the rightmost element at each level, which is the counter. The remainder corresponds to the functions F_4, F_3, F_2, F_1, ordered from bottom to top. The processors are divided in levels, where the output of the ith level feeds into the (i − 1)th level (and the output of the top level feeds back into the bottom). The processors are grouped in the picture according to their function.

The bottom layer, F_4, contains 3 + p groups of processors. The leftmost group of processors stores the values of the s states to pass to level F_3. The "zero state" processor outputs 1 or 0, outputting 1 if and only if all of the s processors in the first group are outputting 0. The "read stack i" group computes the top element ξ[q_i] of stack i, and τ[q_i] ∈ {0, 1}, which equals 0 if and only if stack i is empty. Each of the p processors in the last group stores an encoding of one of the p stacks.

Layer F_3 computes the 2^{2p} terms σ(v_r · μ) that appear in Eq. (10) for each of the possible s + 1 values of the vector x.


[Figure 1 depicts the universal network as four layers of processors. From top to bottom: s state neurons and p stacks, with counter 1; sub-stacks (1), ..., sub-stacks (p), with counter 2; s state neurons and p stacks, with counter 3; s states + state 0, read stack (1), ..., read stack (p), and p stacks, with counter 4.]

FIG. 1. The universal network.

Only 3^p(s + 1) processors are needed, however, since there are only three possibilities for the ordered pair (ξ[q_i], τ[q_i]) for each of the stacks; this was explained in Remark 5.2. (Note that each such μ contains ξ[q_1], ..., ξ[q_p], τ[q_1], ..., τ[q_p], as well as x_0, x_1, ..., x_s, that were computed at the level F_4.) In this level we also pass to level F_2 the values ξ[q_i] and the encoding of the p stacks, from level F_4. As a matter of fact, 3^p s + 1 neurons may substitute the 3^p(s + 1), as the 0 state requires one neuron only.

At level F_2 we compute all the new states as described in Eq. (8) (to be passed along without change to the top layer). Each of the four processors in the "new stack i" group computes one of the four main terms (rows) of Eqs. (9) for one stack. For instance, for the fourth main term we compute an expression of the form

    σ( 4q_i − 2ξ[q_i] − 1 + Σ_{j=0}^{s} Σ_{r} c_{rj} σ(v_r · μ) − 1 )

(obtained by applying Eq. (11)). Note that each of the terms q_i, ξ[q_i], σ(v_r · μ) has been evaluated in the previous level.

Finally, at the top level, we copy the states from level F_2, except that the halting state x_1 is modified by counter 2. We also add the four main terms in each stack and apply σ to the result.

After reordering coordinates at the top level, the data processor and the halting processor are first and second, respectively, as required to prove Theorem 2. Note that this construction results in values μ = s + 3p + 1, ν = 3^p(s + 1) + 2p, and η = s + 4p.

If desired, one could modify this construction so that all activation values are reset to zero when the halting processor takes the value "1." This is discussed more thoroughly in Remark 7.1.

6. REAL TIME SIMULATION

Here, we refine the simulation of Section 5 to obtain the claimed real-time simulation. Thus, this section provides a proof of Theorem 2(b). This part is both less intuitive and less crucial for the main result.


We start in Subsection 6.1 by modifying the construction given in Section 5 in order to obtain a "two-level" neural network. At this point, we obtain a slow-down of a factor of two in the simulation. In Subsection 6.2, we modify the construction so that in one of the levels the neurons differ from the standard neurons; they compute linear combinations of their input with no sigma function applied to the combinations. Finally, in Subsection 6.3, we show how to modify the last network into a standard one with one level only, thus achieving the desired real-time simulation of Turing machines.

In Subsection 6.2, we substitute the 4-Cantor set representation with a somewhat more complex Cantor set representation (to be explained there), which has larger gaps between consecutive admissible ranges of empty stacks, stacks with 0 top, and stacks with 1 top. These gaps, along with the use of large negative numbers as inhibitors, will enable us to speed up the simulation.

6.1. Computing in Two Layers

We can rewrite the dynamics of the stack q_i from Eqs. (9) as the sum of four components,

    q_i = Σ_{j=1}^{4} q_{ij},        (13)

and

    t_i = Σ_{j=1}^{4} t_{ij},        (14)
    e_i = Σ_{j=1}^{4} e_{ij}.        (15)

As three out of the four sub-stack elements {q_{i1}, q_{i2}, q_{i3}, q_{i4}} of each stack i = 1, ..., p are 0, and the fourth has the value of the stack q_i, it is also the case that three out of four sub-top elements t_{ij} (and sub-nonempty elements e_{ij}) are 0, and the fourth one stores the value of the top (nonempty predicate) of the relevant stack.

Using Eq. (8), we can express the dynamics of the state control as

    x_i^+ = Σ_{j=0}^{s} β_{ij}(t_{11}, ..., t_{p4}, e_{11}, ..., e_{p4}) x_j        (16)

for i = 1, ..., s, and the dynamics of the sub-stack elements as

    q_{i1}^+ = σ( q_i + Σ_{k=0}^{s} γ^1_{ik}(t_{11}, ..., e_{p4}) x_k − 1 )
    q_{i2}^+ = σ( q_i/4 + 1/4 + Σ_{k=0}^{s} γ^2_{ik}(t_{11}, ..., e_{p4}) x_k − 1 )
    q_{i3}^+ = σ( q_i/4 + 3/4 + Σ_{k=0}^{s} γ^3_{ik}(t_{11}, ..., e_{p4}) x_k − 1 )
    q_{i4}^+ = σ( 4q_i − 2ξ[q_i] − 1 + Σ_{k=0}^{s} γ^4_{ik}(t_{11}, ..., e_{p4}) x_k − 1 )

for all stacks q_i, i = 1, ..., p.

We introduce the notation

    next-q_{ij} = q_i,                       if j = 1,
                  q_i/4 + 1/4,               if j = 2,
                  q_i/4 + 3/4,               if j = 3,
                  4q_i − 2ξ[q_i] − 1,        if j = 4,

where q_{ij} represents the row (9.j) in Eqs. (9). We name these q_{ij} the sub-stack elements of stack i. That is, q_{i1} may differ from 0 only if the last update of stack i was "no-operation." Similarly, the components q_{i2}, q_{i3}, q_{i4} may differ from 0 only if the last update of the ith stack was "push0," "push1," or "pop," respectively. We also define t_{ij} to be the sub-top of the sub-stack element q_{ij}, and e_{ij} to be the sub-nonempty test of the same sub-stack element. The top of stack i can be computed by combining the sub-top elements (cf. Eq. (14)),

and we summarize the above sub-stack dynamic equations by

    q_{ij}^+ = σ( next-q_{ij} + Σ_{k=0}^{s} γ^j_{ik}(·) x_k − 1 ).        (17)

Similarly, the sub-top and sub-nonempty are updated by

    t_{ij}^+ = σ( 4 [ next-q_{ij} + Σ_{k=0}^{s} γ^j_{ik}(·) x_k − 1 ] − 2 ),
    e_{ij}^+ = σ( 4 [ next-q_{ij} + Σ_{k=0}^{s} γ^j_{ik}(·) x_k − 1 ] )        (18)

for all i = 1, ..., p, j = 1, ..., 4.

We construct a network in which the stacks and their readings are not kept explicitly in values q_i, t_i, e_i, but implicitly only via the sub-elements q_{ij}, t_{ij}, e_{ij}, j = 1, ..., 4, i = 1, ..., p. This will enable a simulation of one step of a Turing machine by a "two level" rather than a "four level" network, as was suggested in the previous section.

network, as was suggested in the previous section.

By L em ma 5.1 a nd Eqs. (14), (15 ), the funct ions /J,y and

 y jik   of Eqs. (16)—(18 ) can be wr itten as the com bina tion

Z Zk = 0 r —1

( 1 9 )

571, 50 I-11

Page 13: On the Computational Power of Neural Nets, Hava Siegelmann & Eduardo Sontag

7/24/2019 On the Computational Power of Neural Nets, Hava Siegelmann & Eduardo Sontag

http://slidepdf.com/reader/full/on-the-computational-power-of-neural-nets-hava-siegelmann-eduardo-sontag 13/19

144 S IEG EL MA NN A ND SONT A G

 wher e

*=(i. i   ... i   i .... 1  ‘v-0' /= 1  j=  1 7=1  j = 1 7

c" are scalar constants, u" are vector constants, and a 

represents the multi- indices ij   for f i  and ijk   for y.  Thus, all

the updated equations of

    x_k,      k = 1, ..., s   (states),
    q_{ij},   i = 1, ..., p,   j = 1, 2, 3, 4,
    t_{ij},   i = 1, ..., p,   j = 1, 2, 3, 4,
    e_{ij},   i = 1, ..., p,   j = 1, 2, 3, 4,

can be written as

    σ( lin. comb. of σ( lin. comb. of tops (t_i) and nonempty (e_i) ) ),

that is, as what is usually called a "feedforward neural net with one hidden layer."

The main layer consists of the sub-elements q_{ij}, t_{ij}, e_{ij} and the states x_r. In the hidden layer, we compute all elements σ(···) required by Lemma 5.1 to compute the functions β and γ. We showed that 3^p s + 2 terms of this kind are required. We refer to these terms as "configuration detectors" as they provide the needed combinations of states and stack readings. These terms are all that are required to compute x_k. We also keep in the hidden layer the values of q_i and t_i to compute next-q_{ij}.

The result is that Φ can be written as a composition

    Φ = F_1 ∘ F_2

of two "saturated-affine" maps, i.e., maps of the form σ(Ax + c): F_1: Q^ν → Q^η, F_2: Q^η → Q^ν, for ν = 3^p s + 2p + 2 and η = s + 12p.

In summary:

• The main layer consists of:

1. s neurons x_k, k = 1, ..., s, that represent the state of the system unarily.

2. For each stack i, i = 1, ..., p, we have

(a) four neurons holding q_{ij}, j = 1, 2, 3, 4,

(b) four neurons holding t_{ij}, j = 1, 2, 3, 4,

(c) four neurons holding e_{ij}, j = 1, 2, 3, 4.

• The hidden layer consists of:

1. 3^p s + 2 neurons for configuration detecting. (The additional two are for the case of s_0.)

2. For each stack i, i = 1, ..., p, we have

(a) a neuron holding q_i,

(b) a neuron holding t_i.

6.2. Removing the Sigmoid from the Main Level

Here, we proceed by shrinking the network and showing how to construct a network equivalent to the one above, in which neurons in the main level compute linear combinations only (and apply no σ function to them). In the following construction, we introduce a set of "noisy sub-stack" elements {q_{i1}, q_{i2}, q_{i3}, q_{i4}} for each stack i = 1, ..., p. These may assume not only values in [0, 1], but also negative values. Negative values of the stacks are interpreted as the value 0, while positive values are the true values of the stack. As in the last section, only one of these four elements may assume a non-negative value at each time. The "noisy sub-top" and "noisy sub-nonempty" functions applied to the noisy sub-stack elements may also produce values outside the range [0, 1].

the range [0, 1].To manage with only one level of a   functions, we need to

choose a number representation that enforces large enough

gaps between valid values of the stacks. We now abandon

the 4- Cantor set repres entation, using instea d a slightly

different Cantor set representation: If the network is to

simulate a /?-stack machine, then our encoding is the 10/?2-

Cantor set.

To motivate our new Cantor representation, we illustrate

it first w ith the special case p —2. Here, the encodingis a 40-

Ca ntor set representation. We encode the bit0by£0 = 31

a nd 1 by e, =   39. We interpret the resulting sequence as a

number in base 40. Fo r ex ample, the stack tv =1 10 1 isencoded by 

q = .(39)(31)(39)(39 )40.

In addition, we allow the empty stack to be represented by

any non- positive value of q, rather tha n 0 only. T his is where

the sigmoids of the main level are being saved. The possible

ranges of a value q representing the stacks are thus:

[ —g c , 0] empty stack

[ 40 ' l i ] top of stack is 0

[ H, 1] top of stack is 1.

Stack operations include pushO, which is implemented by

the operat ion (//40 + ^; pus hl, w hich is implemente d as

qf 40 + ; and pop which is implemented by 40q —t,  where

t  is the top element of the stack. We can compute the top

and non- empty predicates by 

top(<?) : = 5(40</ —38),

nonempty { q)  : = (40q — 30).
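A small sketch (ours) of the p = 2 encoding and its linear readings; the helper names are our own:

from fractions import Fraction as F

def encode40(omega):
    # Top-of-stack bit first: bit 0 -> digit 31, bit 1 -> digit 39, read in base 40.
    return sum(F(31 + 8 * int(b), 40 ** (i + 1)) for i, b in enumerate(omega))

def top(q):
    return 5 * (40 * q - 38)

def nonempty(q):
    return 40 * q - 30

for word in ("1", "0", "1101", "0011"):
    q = encode40(word)
    print(word, float(top(q)), float(nonempty(q)))
# top(q) falls in [5, 10] when the top bit is 1 and in [-35, -30] when it is 0;
# nonempty(q) falls in [9, 10] and [1, 2], respectively.
print(float(top(F(0))), float(nonempty(F(0))))  # -190.0 -30.0 for the empty stack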


The ranges of the top values are $[5, 10]$, $[-35, -30]$, and $[-\infty, -190]$ for top $= 1$, top $= 0$, and empty stack, respectively; and the ranges of the non-empty predicate are $[9, 10]$, $[1, 2]$, and $[-\infty, -30]$, stated in the same order. In particular, top is interpreted as being 1 when the function $\mathrm{top}(q)$ is in the range $[5, 10]$ and 0 for $[-\infty, -30]$, while the non-empty predicate is taken as 1 in the range $[1, 10]$ and 0 in the range $[-\infty, -30]$. The gaps between the positive and negative ranges are large; in particular, the domain values which represent the value 0 in both predicates are at least 3 times larger than the domain values which represent the value 1. The special encoding was planned to have this property. In the case of $p$-stack machines, we need an encoding for which the values of the negative domain are at least $(2p - 1)$ times larger than the values of the positive domain.
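The arithmetic of this base-40 encoding is easy to exercise numerically. The following sketch is illustrative only and is not part of the construction; it assumes the top of the stack is stored as the most significant base-40 digit, and it checks the ranges stated above.

```python
# Illustrative sketch: the base-40 Cantor encoding for p = 2 stacks, with the
# saturated-linear sigma and the predicates top(q) = 5(40q - 38) and
# nonempty(q) = 40q - 30 quoted above.

def sigma(x):
    """Saturated-linear activation: 0 for x <= 0, x on [0, 1], 1 for x >= 1."""
    return 0.0 if x <= 0 else (1.0 if x >= 1 else x)

EPS0, EPS1 = 31, 39              # encodings of the bits 0 and 1 in base 40

def encode(bits):
    """Encode a binary stack (listed top first) as .(e_1)(e_2)..._{40}."""
    q = 0.0
    for b in reversed(bits):
        q = q / 40 + (EPS1 if b else EPS0) / 40
    return q

def push(q, b):       return q / 40 + (EPS1 if b else EPS0) / 40
def pop(q, top_bit):  return 40 * q - 8 * top_bit - 31      # 40q - eps_top
def top(q):           return 5 * (40 * q - 38)
def nonempty(q):      return 40 * q - 30

if __name__ == "__main__":
    q = encode([1, 0, 1, 1])
    assert 5 <= top(q) <= 10 and 9 <= nonempty(q) <= 10       # top bit is 1
    q0 = encode([0, 1])
    assert -35 <= top(q0) <= -30 and 1 <= nonempty(q0) <= 2   # top bit is 0
    assert top(0.0) <= -190 and nonempty(0.0) <= -30          # empty stack
    # The sigmoid turns these readings into exact binary predicates:
    assert sigma(top(q)) == 1 and sigma(top(q0)) == 0
```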

For the general $p$-stack machine, we choose the base $b = 10p^2$. Denote $c = 2p + 1$, $b = 10p^2$, $\varepsilon_1 = 10p^2 - 1$, $\varepsilon_0 = 10p^2 - 4p - 1$. That is, 0 is encoded by $\varepsilon_0$, 1 is encoded by $\varepsilon_1$, and the resulting sequence is interpreted in base $b$. The role of $c$ will be explained below. The reading functions "noisy sub-top" and "noisy sub-nonempty" corresponding to the noisy sub-stack element $v \in \{q_{ij} \mid i = 1, \dots, p,\ j = 1, \dots, 4\}$ are defined as

$\text{N-top}(v) := c(bv - (\varepsilon_1 - 1))$    (20)

$\text{N-nonempty}(v) := bv - (\varepsilon_0 - 1)$.    (21)

We denote $t_{ij} = \text{N-top}(q_{ij})$ and $e_{ij} = \text{N-nonempty}(q_{ij})$ for $i = 1, \dots, p$, $j = 1, \dots, 4$.

We summarize the dynamic equations of the noisy elements by

$q_{ij} = \sigma\Bigl(\text{next-}q_{ij} + \sum_{k} \gamma^j_{ik}(\cdot)\, x_k - 1\Bigr)$,
$t_{ij} = \sigma\Bigl(c\Bigl[b\Bigl(\text{next-}q_{ij} + \sum_{k} \gamma^j_{ik}(\cdot)\, x_k - 1\Bigr) - (\varepsilon_1 - 1)\Bigr]\Bigr)$,
$e_{ij} = \sigma\Bigl(b\Bigl(\text{next-}q_{ij} + \sum_{k} \gamma^j_{ik}(\cdot)\, x_k - 1\Bigr) - (\varepsilon_0 - 1)\Bigr)$,    (22)

where

$\text{next-}q_{ij} = \begin{cases} q_i & \text{if } j = 1 \\ \frac{1}{b}\, q_i + \frac{\varepsilon_0}{b} & \text{if } j = 2 \\ \frac{1}{b}\, q_i + \frac{\varepsilon_1}{b} & \text{if } j = 3 \\ b\, q_i - (\varepsilon_1 - \varepsilon_0)\, E[q_i] - \varepsilon_0 & \text{if } j = 4, \end{cases}$

and $E[q_i]$ is the true binary value of the top of stack $i$.

The ranges of values of the noisy sub-top and sub-nonempty functions are

$\text{N-top}(v) \in \begin{cases} [2p + 1,\ 4p + 2] & \text{when the top is } 1 \\ [-8p^2 - 2p + 1,\ -8p^2 + 2] & \text{when the top is } 0 \\ [-\infty,\ -20p^3 - 10p^2 + 4p + 2] & \text{for an empty stack;} \end{cases}$

$\text{N-nonempty}(v) \in \begin{cases} [4p + 1,\ 4p + 2] & \text{when the top is } 1 \\ [1,\ 2] & \text{when the top is } 0 \\ [-\infty,\ -10p^2 + 4p + 2] & \text{for an empty stack.} \end{cases}$

(Note that the values of $\sigma(\text{N-nonempty}(v))$ and $\sigma(\text{N-top}(v))$ provide the exact binary top and nonempty predicates.) The union of these ranges is

$U = [-\infty,\ -8p^2 + 2] \cup [1,\ 4p + 2]$.

The parameter $c$ is chosen so as to assure large gaps in the range $U$.

Property.  For all $p \ge 2$, any negative value of the functions N-top and N-nonempty has an absolute value at least $(2p - 1)$ times any positive value of them.
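The ranges above and the property just stated can be checked numerically. The following sketch is illustrative only; it evaluates N-top and N-nonempty on sample stack values using exact rational arithmetic and verifies the $(2p - 1)$ separation for several values of $p$.

```python
# Sketch: the base b = 10p^2 encoding with eps1 = 10p^2 - 1,
# eps0 = 10p^2 - 4p - 1, c = 2p + 1, and the noisy readings
# N-top(v) = c(bv - (eps1 - 1)), N-nonempty(v) = bv - (eps0 - 1).
from fractions import Fraction as F

def readings(p):
    b, c = 10 * p * p, 2 * p + 1
    eps1, eps0 = b - 1, b - 4 * p - 1
    n_top = lambda v: c * (b * v - (eps1 - 1))
    n_nonempty = lambda v: b * v - (eps0 - 1)
    return b, eps0, eps1, n_top, n_nonempty

for p in range(2, 8):
    b, eps0, eps1, n_top, n_ne = readings(p)
    vals = []
    # sample stack values: top = 1, top = 0 (arbitrary tails), and empty
    for v in (F(eps1, b), F(eps1 + 1, b), F(eps0, b), F(eps0 + 1, b), F(0), F(-3, 10)):
        vals += [n_top(v), n_ne(v)]
    pos = [x for x in vals if x > 0]
    neg = [x for x in vals if x < 0]
    assert 1 <= min(pos) and max(pos) <= 4 * p + 2
    # every negative reading is at least (2p - 1) times any positive reading
    assert all(abs(x) >= (2 * p - 1) * max(pos) for x in neg)
```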

The large negative numbers operate as inhibitors. We will see later how this property assists in constructing the network. As for the possibility of maintaining negative values in stack elements rather than 0, Eq. (13) is not valid any more. That is, the elements $q_{ij}$, $j = 1, \dots, 4$, cannot be combined linearly to provide the real value of the stack $q_i$. This is also the case with the top and nonempty predicates (see Eqs. (14), (15)).

In Lemma 5.1 we proved that for any Boolean function of the type $\beta\colon \{0, 1\}^r \to \{0, 1\}$ and $x \in \{0, 1\}$, one may express

$\beta(d_1, \dots, d_r)\, x = \sum_{l=1}^{2^r} c_l\, \sigma(v_l \cdot \mu)$

for $\mu = (1, d_1, \dots, d_r, x)$, constants $c_l$, and constant vectors $v_l$. This was applicable to the functions $\beta_{rj}$ and $\gamma^j_{ik}$ in Eqs. (16)-(18). Next we prove that, using the noisy sub-top $t_{ij}$ and noisy sub-nonempty $e_{ij}$ elements, rather than the binary sub-top ($\sigma(t_{ij})$) and sub-nonempty ($\sigma(e_{ij})$) ones, one may still compute the functions $\beta_{rj}$ and $\gamma^j_{ik}$ using one hidden layer only.

It is not true for any function $\beta\colon \mathbb{R}^r \to \{0, 1\}$ and $x \in \{0, 1\}$ that $\beta(\cdot)\, x$ is computable in a one-hidden-layer network; this is true in particular cases only, including ours.


Definition 6.1.  A function $\beta(v_1, \dots, v_t)$ is said to be sign-invariant if

$v_l' = \mathrm{sign}(v_l)\ \ \forall l = 1, \dots, t \implies \beta(v_1, \dots, v_t) = \beta(v_1', \dots, v_t')$.

Lemma 6.2.  For each $t, r \in \mathbb{N}$, let $U_t$ be the range

$[-\infty,\ -2t^2 + 2] \cup [1,\ 2t + 2]$

and let

$S_{r,t} = \{\, d \mid d = (d_1^{(1)}, \dots, d_1^{(r)}, d_2^{(1)}, \dots, d_2^{(r)}, \dots, d_t^{(1)}, \dots, d_t^{(r)}) \in U_t^{rt}$, and $\forall i = 1, \dots, t$ at most one of $d_i^{(1)}, \dots, d_i^{(r)}$ is positive$\,\}$.

We denote by $I$ the set of multi-indices $(i_1, \dots, i_t)$ with each $i_j \in \{0, 1, \dots, r\}$. For each function $\beta\colon S_{r,t} \to \{0, 1\}$ that is sign-invariant, there exist vectors

$\{v_i \in \mathbb{Z}^{t+2},\ i \in I\}$

and scalars

$\{c_i \in \mathbb{Z},\ i \in I\}$

such that for each $(d_1^{(1)}, \dots, d_t^{(r)}) \in S_{r,t}$ and any $x \in \{0, 1\}$, we can write

$\beta(d_1^{(1)}, \dots, d_t^{(r)})\, x = \sum_{i \in I} c_i\, \sigma(v_i \cdot \mu_i)$,

where

$\mu_{(i_1, \dots, i_t)} = (1, d_1^{(i_1)}, d_2^{(i_2)}, \dots, d_t^{(i_t)}, x)$,   and we define $d_j^{(0)} = 0$.    (23)

Here the size of $I$ is $|I| = (r + 1)^t$, and $\cdot$ is the dot product in $\mathbb{Z}^{t+2}$.

Proof.  As $\beta$ is sign-invariant, we can write $\beta$ when acting on $S_{r,t}$ as

$\beta(d_1^{(1)}, d_1^{(2)}, \dots, d_t^{(r)}) = \beta(\sigma(d_1^{(1)}), \sigma(d_1^{(2)}), \dots, \sigma(d_t^{(r)}))$.

Thus $\beta$ can be viewed as a Boolean function $\{0, 1\}^{rt} \to \{0, 1\}$, and we can express it as a polynomial (see Eq. (12)):

$\beta(d_1^{(1)}, d_1^{(2)}, \dots, d_t^{(r)}) = c_1 + c_2\, \sigma(d_1^{(1)}) + c_3\, \sigma(d_1^{(2)}) + \dots + c_{rt+1}\, \sigma(d_t^{(r)}) + c_{rt+2}\, \sigma(d_1^{(1)})\, \sigma(d_2^{(1)}) + \dots + c_{(r+1)^t}\, \sigma(d_1^{(r)})\, \sigma(d_2^{(r)}) \cdots \sigma(d_t^{(r)})$.

(Note that no term with more than $t$ elements of the type $\sigma(d_i^{(j)})$ appears, as for each $i$ at most one of $\sigma(d_i^{(1)}), \dots, \sigma(d_i^{(r)})$ is nonzero, by the definition of $S_{r,t}$.)

Observe that for any sequence $l_1, \dots, l_k$ of $k \le t$ elements in $U_t$ and $x \in \{0, 1\}$, one has

$\sigma(l_1) \cdots \sigma(l_k)\, x = \sigma(l_1 + \dots + l_k + k(2t + 2)(x - 1))$.

This is due to two facts:

1. The sum of $k \le t$ elements of $U_t$ is non-positive when at least one of the elements is negative. This stems from the property that any negative value in this range is at least $(t - 1)$ times larger in absolute value than any positive value there.

2. Each $l_i$ is bounded by $(2t + 2)$.

Expand the product $\beta(d_1^{(1)}, \dots, d_t^{(r)})\, x$, using the above observation and the fact $x = \sigma(x)$. This gives that

$\beta(d_1^{(1)}, \dots, d_t^{(r)})\, x = c_1\, \sigma(x) + c_2\, \sigma(d_1^{(1)} + (2t + 2)(x - 1)) + \dots + c_{(r+1)^t}\, \sigma(d_1^{(r)} + \dots + d_t^{(r)} + t(2t + 2)(x - 1)) = \sum_{i \in I} c_i\, \sigma(v_i \cdot \mu_i)$

for suitable $c_i$'s and $v_i$'s, where $\mu_i$ is defined as in (23).  ∎
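The key observation in this proof, $\sigma(l_1)\cdots\sigma(l_k)\,x = \sigma(l_1 + \dots + l_k + k(2t+2)(x-1))$, can be spot-checked mechanically. The following sketch (illustrative only) verifies it over sample values drawn from both pieces of $U_t$.

```python
# Sketch checking the identity from the proof of Lemma 6.2: for l_1,...,l_k in
# U_t = [-inf, -2t^2 + 2] u [1, 2t + 2] with k <= t and x in {0, 1},
#     sigma(l_1) * ... * sigma(l_k) * x = sigma(l_1 + ... + l_k + k(2t+2)(x-1)).
import itertools

def sigma(v):
    return 0.0 if v <= 0 else (1.0 if v >= 1 else v)

for t in (2, 3, 4):
    # sample values from both pieces of U_t, including the endpoints
    samples = [1.0, 1.5, 2 * t + 2, -2 * t * t + 2, -5 * t * t]
    for k in range(1, t + 1):
        for ls in itertools.product(samples, repeat=k):
            for x in (0, 1):
                lhs = x
                for l in ls:
                    lhs *= sigma(l)
                rhs = sigma(sum(ls) + k * (2 * t + 2) * (x - 1))
                assert abs(lhs - rhs) < 1e-12, (t, k, ls, x)
```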

Remark 6.3.  Note that in the case where the arguments are the functions N-top and N-nonempty, the arguments are dependent and not all $(r + 1)^t$ terms are needed.

We conclude that the noisy sub-stack elements $q_{ij}$, as well as the next-state control, are computable in the one-hidden-layer network from $t_{ij}$ and $e_{ij}$. We next provide the exact network architecture.

Network Description.  The network consists of two levels. The main level consists of both $s$ state neurons and noisy sub-stack neurons accompanied by sub-stack noisy-readings neurons: $q_{ij}$, $t_{ij}$, $e_{ij}$, $i = 1, \dots, p$, $j = 1, \dots, 4$, representing respectively noisy sub-stack elements, noisy sub-top elements, and noisy sub-nonempty elements.

For simplicity, we first provide the update equations of the network which simulates a 2-stack machine, and later we generalize them to $p$-stack machines. Recall that for a 2-stack machine the encoding was a 40-Cantor set representation with $\varepsilon_0 = 31$ and $\varepsilon_1 = 39$. We use the same notations for the functions $\gamma^j_{ik}$ as in Eq. (9); let $q_i$ represent the $i$th stack, and let $t_i$ represent the true (i.e., clean) value of the top of the stack.

The noisy sub-stack $q_{i1}$ updates for no-operation as in Eq. (9.1),

$q_{i1}^{+} = q_i + \sum_{k} \gamma^1_{ik}\, x_k - 1$;


the noisy sub-stack $q_{i2}$ updates for push0 as in Eq. (9.2),

$q_{i2}^{+} = \frac{q_i}{40} + \frac{31}{40} + \sum_{k} \gamma^2_{ik}\, x_k - 1$;

the noisy sub-stack $q_{i3}$ updates for push1 as in Eq. (9.3),

$q_{i3}^{+} = \frac{q_i}{40} + \frac{39}{40} + \sum_{k} \gamma^3_{ik}\, x_k - 1$;

and the noisy sub-stack $q_{i4}$ updates for pop as in Eq. (9.4),

$q_{i4}^{+} = 40 q_i - 8 t_i - 31 + \sum_{k} \gamma^4_{ik}\, x_k - 1$.

For general $p$-stack machines, the base is $10p^2$ and the update equations are given by

$q_{ij}^{+} = \text{next-}q_{ij} + \sum_{k} \gamma^j_{ik}\, x_k - 1$,    (24)

$t_{ij}^{+} = (2p + 1)\Bigl[10p^2\Bigl(\text{next-}q_{ij} + \sum_{k} \gamma^j_{ik}\, x_k - 1\Bigr) - (10p^2 - 2)\Bigr]$,    (25)

$e_{ij}^{+} = 10p^2\Bigl(\text{next-}q_{ij} + \sum_{k} \gamma^j_{ik}\, x_k - 1\Bigr) - (10p^2 - 4p - 2)$,    (26)

where

$\text{next-}q_{ij} = \begin{cases} q_i & \text{if } j = 1 \text{ (not updated)} \\ \dfrac{1}{10p^2}\, q_i + \dfrac{10p^2 - 4p - 1}{10p^2} & \text{if } j = 2 \text{ (push0)} \\ \dfrac{1}{10p^2}\, q_i + \dfrac{10p^2 - 1}{10p^2} & \text{if } j = 3 \text{ (push1)} \\ 10p^2\, q_i - 4p\, t_i - (10p^2 - 4p - 1) & \text{if } j = 4 \text{ (pop)}, \end{cases}$

and $q_i$ and $t_i$ are the exact values of the stacks and top elements. Using Lemma 6.2, all the expressions of the type $\beta(\cdot)\, x$ and $\gamma(\cdot)\, x$ can be written as linear combinations of terms like $\sigma$(linear combinations of $t_{ij}$, $e_{ij}$). These $q_{ij}$, $t_{ij}$, and $e_{ij}$ constitute the main level.

The hidden layer consists of both up to $(3^p s + 2)$ configuration-detector neurons (as proved in Lemma 6.2) and the stack and top neurons

$q^2_{ij},\ t^2_{ij},\qquad i = 1, \dots, p,\ j = 1, \dots, 4$,

which are updated by the equations

$q^{2\,+}_{ij} := \sigma(q_{ij})$,
$t^{2\,+}_{ij} := \sigma(t_{ij})$,   $i = 1, \dots, p$, $j = 1, \dots, 4$.
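To make the update equations (24)-(26) concrete, the following sketch performs one noisy main-level step for a single stack with $p = 2$. It is illustrative only: the control term $\sum_k \gamma^j_{ik} x_k$ is abstracted as a 0/1 selection flag (an assumption of the sketch, not part of the construction), and the recovery of the exact stack by summing $\sigma$ of the noisy elements is meant only to illustrate what the hidden level achieves.

```python
# Sketch (illustrative, p = 2, base 40): one step of the "noisy" main level.

def sigma(x):
    return 0.0 if x <= 0 else (1.0 if x >= 1 else x)

p = 2
b, c = 10 * p * p, 2 * p + 1                  # b = 40, c = 5
eps1, eps0 = b - 1, b - 4 * p - 1             # 39, 31

def next_q(j, q, t):
    """next-q for operation j, with t the true binary top of the stack."""
    if j == 1: return q                                   # not updated
    if j == 2: return q / b + eps0 / b                    # push0
    if j == 3: return q / b + eps1 / b                    # push1
    return b * q - 4 * p * t - eps0                       # pop

def noisy_step(q, t, selected):
    """Noisy elements (q_ij, t_ij, e_ij), j = 1..4, after one step."""
    out = []
    for j in (1, 2, 3, 4):
        base = next_q(j, q, t) + selected[j] - 1          # Eq. (24)
        t_noisy = c * (b * base - (eps1 - 1))             # Eq. (25)
        e_noisy = b * base - (eps0 - 1)                   # Eq. (26)
        out.append((base, t_noisy, e_noisy))
    return out

# stack 1011 (top first), top bit t = 1; apply push0 (j = 2)
q = 0.0
for bit in reversed([1, 0, 1, 1]):
    q = q / b + (eps1 if bit else eps0) / b
sel = {1: 0, 2: 1, 3: 0, 4: 0}
elems = noisy_step(q, 1, sel)
# only the selected element is non-negative, and sigma recovers the new stack
assert sum(1 for qn, _, _ in elems if qn > 0) == 1
new_q = sum(sigma(qn) for qn, _, _ in elems)
assert abs(new_q - (q / b + eps0 / b)) < 1e-12
# sigma of the noisy readings of the selected element is the exact predicate
_, tn, en = elems[1]
assert sigma(tn) == 0.0 and sigma(en) == 1.0   # new top is 0, stack non-empty
```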

6.3. One-Level Network Simulates TM

Consider the above network. Remove the main level and leave the hidden level only, while letting each neuron there compute the information that it received beforehand from a neuron at the main level. This can be written as a standard network. Note that, as the net consists of one level only, no "counters" are required. This ends the proof of Theorem 2.  ∎

7. INPUTS AND OUTPUTS

We now explain how to deduce Theorem 1 from Theorem 2. (We present details in the linear-time simulation case only. The real-time case is entirely analogous, but the notations become more complicated.) We first show how to modify a net with no inputs into one which, given the input $u_\omega(\cdot)$, produces the encoding $\delta[\omega]$ as a state coordinate and after that emulates the original net. Later we show how the output is decoded. As explained above, there are two input lines: $D = u_1$ carries the data, and $V = u_2$ validates it. In this section we concentrate on the function $\delta$, thus proving that Theorem 2(a) implies a linear-time simulation of a Turing machine via a network with input and output. A similar proof, when applied to Theorem 2(b) (while substituting $\delta$ by $\delta_p$), implies Theorem 1.

So assume that we are given a net with no inputs,

$x^+ = \sigma(Ax + c)$,    (27)

as in the conclusion of Theorem 2. Suppose that we have already found a net

$y^+ = \sigma(Fy + g u_1 + h u_2)$    (28)

(consisting of five processors) so that, if $u_1(\cdot) = D_\omega(\cdot)$ and $u_2(\cdot) = V_\omega(\cdot)$, then with $y(0) = 0$ we have

$y_4(\cdot) = 0 \cdots 0\ \delta[\omega]\ 0\ 0 \cdots$,   $y_5(\cdot) = 0 \cdots 0\ 1\ 1\ 1 \cdots$;

more precisely,

$y_4(t) = \begin{cases} \delta[\omega], & \text{if } t = |\omega| + 2, \\ 0, & \text{otherwise}; \end{cases}$

$y_5(t) = \begin{cases} 0, & \text{if } t \le |\omega| + 2, \\ 1, & \text{otherwise}. \end{cases}$


Once this is done, modify the original net (27) as follows. The new state consists of the pair $(x, y)$, with $y$ evolving according to (28) and the equations for $x$ modified in this manner (using $A_i$ to denote the $i$th row of $A$ and $c_i$ for the $i$th entry of $c$):

$x_1^+ = \sigma(A_1 x + c_1 y_5 + y_4)$,
$x_i^+ = \sigma(A_i x + c_i y_5)$,   $i = 2, \dots, N$.

Then, starting at the initial state $y = x = 0$, clearly $x_1(t) = 0$ for $t = 0, \dots, |\omega| + 2$ and $x_1(|\omega| + 3) = \delta[\omega]$, while, for $i > 1$, $x_i(t) = 0$ for $t = 0, \dots, |\omega| + 3$. After time $|\omega| + 3$, as $y_5 = 1$ and $u_1 = u_2 = 0$, the equations for $x$ evolve as in the original net, so $x(t)$ in the new net equals $x(t - |\omega| - 3)$ in the original one for $t \ge |\omega| + 3$.

The system (28) can be constructed as

$y_2^+ = \sigma(u_2)$
$y_3^+ = \sigma(y_2 - u_2)$
$y_4^+ = \sigma(y_1 + y_2 - u_2 - 1)$
$y_5^+ = \sigma(y_3 + y_5)$.
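The claimed behaviour of $y_4$ and $y_5$ can be simulated directly. In the sketch below (illustrative only), the update for $y_1$, which is not shown above, is an assumed base-4 accumulator for the encoded input word; the exact form of $\delta$ used in the paper may order the digits differently, and only the timing of $y_4$ and $y_5$ is being checked here.

```python
# Sketch of the five-processor input subnet (28); the y1 update is an
# assumption made for illustration, not taken from the text.

def sigma(x):
    return 0.0 if x <= 0 else (1.0 if x >= 1 else x)

def run_input_net(word, steps=12):
    y = [0.0] * 6                                   # y[1..5]; y[0] unused
    trace = []
    for t in range(1, steps + 1):
        u1 = word[t - 1] if t <= len(word) else 0   # data line D
        u2 = 1 if t <= len(word) else 0             # validation line V
        y = [0.0,
             sigma(y[1] / 4 + u1 / 2 + u2 / 4),     # assumed y1 accumulator
             sigma(u2),                             # y2
             sigma(y[2] - u2),                      # y3
             sigma(y[1] + y[2] - u2 - 1),           # y4
             sigma(y[3] + y[5])]                    # y5
        trace.append((t + 1, y[4], y[5]))           # values at time t + 1
    return trace

word = [1, 0, 1]
for t, y4, y5 in run_input_net(word):
    # y4 is nonzero exactly once, at time |w| + 2, carrying the encoded word;
    # y5 switches on one step later and stays on.
    if t == len(word) + 2:
        assert y4 > 0
    else:
        assert y4 == 0
    assert y5 == (1 if t >= len(word) + 3 else 0)
```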

This completes the proof of the encoding part. For the decoding process of producing the output signal, it will be sufficient to show how to build a net (of dimension 10 and with two inputs) such that, starting at the zero state, if the input sequences are $x_1$ and $x_2$, where $x_1(k) = \delta[\omega]$ for some $k$ and $x_2(t) = 0$ for $t < k$, $x_2(k) = 1$ (and $x_1(t) \in [0, 1]$ for $t \ne k$, $x_2(t) \in [0, 1]$ for $t > k$), then for the processors $z_9$, $z_{10}$ it holds that

$z_9(t) = \begin{cases} 1, & \text{if } k + 4 \le t \le k + 3 + |\omega|, \\ 0, & \text{otherwise}; \end{cases}$

$z_{10}(t) = \begin{cases} \omega_{t-k-3}, & \text{if } k + 4 \le t \le k + 3 + |\omega|, \\ 0, & \text{otherwise}. \end{cases}$

One can verify that this can be done by

$z_1^+ = \sigma(x_2 + z_1)$
$z_2^+ = \sigma(z_1)$
$z_3^+ = \sigma(z_2)$
$z_4^+ = \sigma(x_1)$
$z_5^+ = \sigma(z_4 + z_1 - z_2 - 1)$
$z_6^+ = \sigma(4 z_4 + z_1 - z_2 - 3)$
$z_7^+ = \sigma(16 z_5 - 8 z_7 - 6 z_3 + z_6)$
$z_8^+ = \sigma(4 z_8 - 2 z_7 - z_3 + z_5)$
$z_9^+ = \sigma(z_8)$
$z_{10}^+ = \sigma(z_7)$.

In this case the output is $y = (z_{10}, z_9)$.

Remark 7.1.  If one would also like to achieve a resetting of the whole network after completing the operation, it is possible to add the processor

$z_{11}^+ = \sigma(z_{10})$

and to add, to each processor that is not identically zero at this point,

$v_i^+ = \sigma(\cdots - z_{11} - z_{10})$,   $v \in \{x, y, z\}$,

where "$\cdots$" is the formerly defined operation of the processor.

8. UNIVERSAL NETWORK 

The number of neurons required to simulate a Turing machine consisting of $s$ states and $p$ stacks, with a slowdown of a factor of two in the computation, is

$\underbrace{s + 12p}_{\text{main layer}} \;+\; \underbrace{3^p s + 2 + 2p}_{\text{hidden layer}}$.

To estimate the number of processors required for a "universal" processor net, we should calculate the number $s$ discussed above, which is the number of states in the control unit of a two-stack universal Turing machine. Minsky proved the existence of a universal Turing machine having one tape with four letters and seven control states [Min67]. Shannon showed in [Sha56] how to control the number of letters and states in a Turing machine. Following his construction, we obtain a 2-letter 63-state 1-tape Turing machine. However, we are interested in a two-stack machine rather than a one-tape machine. Arguments similar to the ones made by Shannon, but for two stacks, lead us to $s = 84$. Applying the formula $3^p s + s + 14p + 2$, we conclude that there is a universal net with 870 processors. To allow for input and output to the network, we need an extra 16 neurons, thus having 886 in a universal machine.
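As a quick consistency check of these totals (a sketch assuming the formula is read as $3^p s + s + 14p + 2$ with $s = 84$, $p = 2$):

```python
# Arithmetic check of the processor counts quoted above.
s, p = 84, 2
core = 3**p * s + s + 14 * p + 2
assert core == 870          # processors of the core simulation
assert core + 16 == 886     # extra neurons for input and output handling
```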

(This estimate is very conservative. It would certainly be interesting to have a better bound. The use of multi-tape Turing machines may reduce the bound. Furthermore, it is quite possible that with some care in the construction one may be able to drastically reduce this estimate. One useful tool here may be the result in [ADO91] applied to the control unit; here we used a very inefficient simulation.)


9. NON-DETERMINISTIC COMPUTATION

A non-deterministic processor net is a modification of a deterministic one, obtained by incorporating a guess input line ($G$) in addition to the validation and data lines. Hence, the dynamics map of the network is now

$\mathcal{F}\colon \mathbb{Q}^N \times \{0, 1\}^3 \to \mathbb{Q}^N$.

Similarly to Definition 2.1, we define a non-deterministic processor network as follows.

Definition 9.1.  A non-deterministic $\sigma$-processor net $\mathcal{N}$ with two binary inputs and a guess line is a dynamical system having a dynamics map of the form

$\mathcal{F}(x, u, g) = \sigma(Ax + b_1 u_1 + b_2 u_2 + b_3 g + c)$,

for some matrix $A \in \mathbb{Q}^{N \times N}$ and four vectors $b_1, b_2, b_3, c \in \mathbb{Q}^N$.
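For concreteness, one update step of such a net can be written as follows; this is an illustrative sketch with an arbitrary toy weight matrix, showing only that the guess bit enters like a third input line.

```python
# Minimal sketch of one update step of a non-deterministic processor net as in
# Definition 9.1.
import numpy as np

def sigma(v):
    return np.clip(v, 0.0, 1.0)          # saturated-linear activation

def step(A, b1, b2, b3, c, x, u1, u2, g):
    """x+ = sigma(A x + b1*u1 + b2*u2 + b3*g + c), with u1, u2, g in {0, 1}."""
    return sigma(A @ x + b1 * u1 + b2 * u2 + b3 * g + c)

# toy example with N = 2 processors
A = np.array([[0.5, 0.0], [0.25, 0.25]])
b1, b2, b3 = np.zeros(2), np.zeros(2), np.array([1.0, 0.0])
c = np.zeros(2)
x = np.zeros(2)
x = step(A, b1, b2, b3, c, x, u1=0, u2=1, g=1)   # guess bit drives processor 1
print(x)                                          # [1. 0.]
```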

A formal non-deterministic network is one that adheres to an input-output encoding convention similar to the one for deterministic systems, as in Section 2. For each pair of words $\omega, \gamma \in \{0, 1\}^+$, the input to the network is encoded as

$u_{\omega,\gamma}(t) = (V_\omega(t), D_\omega(t), G_\gamma(t))$,   $t = 1, \dots$,

where the input lines $V_\omega$ and $D_\omega$ are the same as described in Section 2 and $G_\gamma$ is defined as

$G_\gamma(t) = \begin{cases} \gamma_t, & \text{if } t \le |\gamma|, \\ 0, & \text{otherwise}. \end{cases}$

The output encoding is the same as the one used for the deterministic networks.

We restrict our attention to formal non-deterministic networks that compute binary output values only, that is, $\phi_{\mathcal{N}}(\omega, \gamma) \in \{0, 1\}$, where $\phi_{\mathcal{N}}(\omega, \gamma)$ is the function computed by a formal non-deterministic network when the input is $\omega \in \{0, 1\}^+$ and the guess is $\gamma \in \{0, 1\}^+$. The language $L$ computed by a non-deterministic formal network $\mathcal{N}$ in time $B$ is

$L = \{\omega \in \{0, 1\}^+ \mid \exists\ \text{a guess } \gamma_\omega,\ \phi_{\mathcal{N}}(\omega, \gamma_\omega) = 1,\ |\gamma_\omega| \le T_{\mathcal{N}}(\omega) \le B(|\omega|)\}$.

Note that the input $\omega$ and the guess $\gamma_\omega$ do not have to have the same length; the only requirement regarding their synchronization is that $\omega(1)$ and $\gamma(1)$ appear as an input to the network simultaneously. The function $T_{\mathcal{N}}$ is the amount of time required to compute the response to a given input $\omega$, and its bound, the function $B$, is called the computation time. The length of the guess, $|\gamma_\omega|$, is bounded by the function $T_{\mathcal{N}}$.

When restricted to the case of language recognition rather than general function computation, Theorems 1 and 2 can be restated for the non-deterministic model, in which $\mathcal{N}$ is a non-deterministic processor net and $\mathcal{M}$ is a non-deterministic Turing machine. The proofs are similar and are omitted.

ACKNOWLEDGMENT

We thank Robert Solovay for his useful comments during the early stages of this research.

REFERENCES

[ADO91]  N. Alon, A. K. Dewdney, and T. J. Ott, Efficient simulation of finite automata by neural nets, J. Assoc. Comput. Mach. 38, No. 2 (1991), 495-514.

[Bat91]  R. Batruni, A multilayer neural network with piecewise-linear structure and back-propagation learning, IEEE Trans. Neural Networks 2 (1991), 395-403.

[BR88]  J. Berstel and C. Reutenauer, "Rational Series and Their Languages," Springer-Verlag, Berlin, 1988.

[BSS89]  L. Blum, M. Shub, and S. Smale, On a theory of computation and complexity over the real numbers: NP completeness, recursive functions, and universal machines, Bull. Amer. Math. Soc. 21 (1989), 1-46.

[BV88]  J. R. Brown, M. M. Garber, and S. F. Vanable, Artificial neural network on a SIMD architecture, in "Proceedings, 2nd Symposium on the Frontier of Massively Parallel Computation, Fairfax, 1988," pp. 43-47.

[CSSM89]  A. Cleeremans, D. Servan-Schreiber, and J. McClelland, Finite state automata and simple recurrent networks, Neural Comput. 1, No. 3 (1989), 372.

[Elm90]  J. L. Elman, Finding structure in time, Cog. Sci. 14 (1990), 179-211.

[FG90]  S. Franklin and M. Garzon, Neural computability, in "Progress in Neural Networks" (O. M. Omidvar, Ed.), pp. 128-144, Ablex, Norwood, NJ, 1990.

[GF89]  M. Garzon and S. Franklin, Neural computability, in "Proceedings, 3rd Int. Joint Conf. Neural Networks, 1989," Vol. 2, pp. 631-637.

[GMC+92]  C. L. Giles, C. B. Miller, D. Chen, H. H. Chen, G. Z. Sun, and Y. C. Lee, Learning and extracting finite state automata with second-order recurrent neural networks, Neural Comput. 4, No. 3 (1992), 393-405.

[HS87]  R. Hartley and H. Szu, A comparison of the computational power of neural network models, in "Proceedings, IEEE Conf. Neural Networks, 1987," pp. 17-22.

[HT85]  J. J. Hopfield and D. W. Tank, Neural computation of decisions in optimization problems, Biol. Cybern. 52 (1985), 141-152.

[HU79]  J. E. Hopcroft and J. D. Ullman, "Introduction to Automata Theory, Languages, and Computation," Addison-Wesley, Reading, MA, 1979.

[Kle56]  S. C. Kleene, Representation of events in nerve nets and finite automata, in "Automata Studies" (C. E. Shannon and J. McCarthy, Eds.), pp. 3-41, Princeton Univ. Press, Princeton, NJ, 1956.


[Lip87]  R. P. Lippmann, An introduction to computing with neural nets, IEEE Acoust. Speech Signal Process. Mag., April (1987), 4-22.

[Min67]  M. L. Minsky, "Computation: Finite and Infinite Machines," Prentice-Hall, Englewood Cliffs, NJ, 1967.

[MSS91]  W. Maass, G. Schnitger, and E. D. Sontag, On the computational power of sigmoid versus boolean threshold circuits, in "Proceedings, 32nd Annu. Sympos. on Foundations of Computer Science, 1991," pp. 767-776.

[MW89]  C. M. Marcus and R. M. Westervelt, Dynamics of iterated-map neural networks, Phys. Rev. Ser. A 40 (1989), 3355-3364.

[PE74]  M. Pour-El, Abstract computability and its relation to the general purpose analog computer, Trans. Amer. Math. Soc. 290 (1974), 1-29.

[Pol87]  J. B. Pollack, "On Connectionist Models of Natural Language Processing," Ph.D. thesis, Computer Science Dept., Univ. of Illinois, Urbana, 1987.

[RTY90]  J. H. Reif, J. D. Tygar, and A. Yoshida, The computability and complexity of optical beam tracing, in "Proceedings, 31st Annu. Sympos. on Foundations of Computer Science, 1990," pp. 106-144.

[SCLG91]  G. Z. Sun, H. H. Chen, Y. C. Lee, and C. L. Giles, Turing equivalence of neural networks with second-order connection weights, in "Proceedings, International Joint Conference on Neural Networks, IEEE, 1991."

[Sha56]  C. E. Shannon, A universal Turing machine with two internal states, in "Automata Studies" (C. E. Shannon and J. McCarthy, Eds.), pp. 156-165, Princeton Univ. Press, Princeton, NJ, 1956.

[Son90]  E. D. Sontag, "Mathematical Control Theory: Deterministic Finite Dimensional Systems," Springer-Verlag, New York, 1990.

[Son92]  E. D. Sontag, Feedforward nets for interpolation and classification, J. Comput. System Sci. 45 (1992), 20-48.

[SS94]  H. T. Siegelmann and E. D. Sontag, Analog computation via neural networks, Theoret. Comput. Sci. 131 (1994), 331-360.

[WM43]  W. Pitts and W. S. McCulloch, A logical calculus of ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943), 115-133.

[Wol91]  D. Wolpert, "A Computationally Universal Field Computer Which Is Purely Linear," Technical Report LA-UR-91-2937, Los Alamos National Laboratory, 1991.

[WZ89]  R. J. Williams and D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Comput. 1, No. 2 (1989).

[ZZZ92]  B. Zhang, L. Zhang, and H. Zhang, A quantitative analysis of the behavior of the pin network, Neural Networks 5 (1992), 639-661.