JOURNAL OF COMPUTER AND SYSTEM SCIENCES 50, 132-150 (1995)
On the Computational Power of Neural Nets*
Hava T. Siegelmann†

Department of Information Systems Engineering, Technion, Haifa 32000, Israel

AND

Eduardo D. Sontag‡

Department of Mathematics, Rutgers University, New Brunswick, New Jersey 08903
Received February 4, 1992; revised May 24, 1993
This paper deals with finite size networks which consist of interconnections of synchronously evolving processors. Each processor updates its state by applying a "sigmoidal" function to a linear combination of the previous states of all units. We prove that one may simulate all Turing machines by such nets. In particular, one can simulate any multi-stack Turing machine in real time, and there is a net made up of 886 processors which computes a universal partial-recursive function. Products (high order nets) are not required, contrary to what had been stated in the literature. Non-deterministic Turing machines can be simulated by non-deterministic rational nets, also in real time. The simulation result has many consequences regarding the decidability, or more generally the complexity, of questions about recursive nets. © 1995 Academic Press, Inc.
1. INTRODUCTION
We study the computational capabilities of recurrent first-order neural networks, or as we shall also say from now on, processor nets. Our model consists of a synchronous network of processors. Its architecture is specified by a general directed graph. The input and output are presented as streams. Input letters are transferred one at a time via M input channels. A similar convention is applied to the output, which is produced as a stream of letters, where each letter is represented by p values. The nodes in the graph are called "neurons." Each neuron updates its activation value by applying a composition of a certain one-variable function with an affine combination (i.e., linear combination and a bias) of the activations of all neurons x_j, j = 1, ..., N,
* This research was supported in part by U.S. Air Force Grant AFOSR-91-0343.
† E-mail: iehava@ie.technion.ac.il. Work done while at Dept. of Computer Science, Rutgers University.
‡ E-mail: sontag@control.rutgers.edu.
and the external inputs u_k, k = 1, ..., M, with rational-valued coefficients a_{ij}, b_{ij}, c_i, also called weights. That is, each processor's state is updated by an equation of the type

x_i(t+1) = σ( Σ_{j=1}^{N} a_{ij} x_j(t) + Σ_{j=1}^{M} b_{ij} u_j(t) + c_i ),   i = 1, ..., N.   (1)
In our results, the function σ is the simplest possible "sigmoid," namely the saturated-linear function

σ(x) := { 0,  if x < 0
          x,  if 0 ≤ x ≤ 1
          1,  if x > 1.   (2)
This function has appeared in many applications of neural nets (e.g., [Bat91, BV88, Lip87, ZZZ92]). We focus on this specific activation function because it makes theorems easier to prove, and also because, as was proved in [SS94], a large family of sigmoidal-type activation functions results in models which are not stronger computationally than this one.
The use of sigmoidal functions, as opposed to hard thresholds, is what distinguishes this area from older work that dealt only with finite automata. Indeed, it has long been known, at least since the classical papers by McCulloch and Pitts [MP43] and Kleene [Kle56], how to simulate finite automata by such nets.
As part of the description, we assume that there is singled out a subset of the N processors, say x_{i_1}, ..., x_{i_p}; these are the p output processors, and they are used to communicate the outputs of the network to the environment. (Generally, the output values are arbitrary reals in the range [0, 1], but later we constrain them to binary values only.)
We state that one can simulate all (multi-stack) Turing machines by nets having rational weights. The rational numbers we consider (as weights, not those numbers arising as intermediate values of computation) are simple, small, and do not require much precision. Furthermore, this simulation can be done with no slowdown in the computation. In particular, it is possible to give a net made up of 886 processors which computes a universal partial-recursive function. Non-deterministic Turing machines can be simulated by non-deterministic nets, also in real time.

We restrict to rational (rather than real) states and weights in order to preserve computability. It turns out that using real-valued weights results in processor nets that can "calculate" super-Turing functions; see [SS94].
1.1. Related Work
Most of the previous work on recurrent neural networks
has focused on networks of infinite size. As each neuron is
itself a processor, an infinite net is effectively equivalent to
an infinite automaton, which by itself has unbounded
processing power. Therefore, infinite models are less
challenging for the investigation of computational power,
compared to our model which has only a finite number of
neurons and is bounded computationally.
There has been previous work concerned with computability by finite networks, however, starting with the classical work of McCulloch and Pitts, cited above, on finite automata. Another related result was due to Pollack [Pol87]. Pollack argued that a certain recurrent net model, which he called a "neuring machine," is universal. The model in [Pol87] consisted of a finite number of neurons of two different kinds, having identity and threshold responses, respectively. The machine was high-order; that is, the activations were combined using multiplications, as opposed to just linear combinations (as in Eq. (1)). That is, the value of x_i is updated by means of a formula of the type
x_i(t+1) = σ( Σ_{|ω|, |β| ≤ k} a_{i,ω,β} x(t)^ω u(t)^β ),   i = 1, ..., N,   (3)

where ω, β are multi-indices, "|·|" denotes their magnitudes (total weights), k < ∞, and x^ω denotes the monomial x_1^{ω_1} ··· x_N^{ω_N} (and similarly for u^β).
Pollack left as an open question whether high-order connections are really necessary in order to achieve universality, although he conjectured that they are. High order networks (cf. [CSSM89, Elm90, GMC+92, Pol87, SCLG91, WZ89]) have been used in applications. One motivation often cited for the use of high-order nets was Pollack's conjecture regarding their superior computational power. We see that no such superiority of computational power exists, at least when formalized in terms of polynomial-time computation.
Work that deals with infinite structure is reported by Hartley and Szu [HS87] and by Franklin and Garzon [FG90, GF89], some of which deals with cellular automata. There one assumes an unbounded number of neurons, as opposed to a finite number fixed in advance. In the paper [Wol91], Wolpert studies a class of machines with just linear activation functions and shows that this class is at least as powerful as any Turing machine (and clearly has super-Turing capabilities as well). It is essential in that model, again, that the number of "neurons" be allowed to be infinite; as a matter of fact, in [Wol91] the number of such units is even uncountable, as the construction relies on using different neurons to encode different possible stack configurations in multi-stack Turing machines.
The idea of using continuous-valued neurons in order to attain gains in computational capabilities as compared with threshold gates had been investigated before. However, prior work considered only the special case of feedforward nets; see, for instance, [Son92] for questions of approximation and function interpolation and [MSS91] for questions of Boolean circuit complexity.
The computability of an optical beam-tracing system consisting of a finite number of elements was discussed in [RTY90]. One of the models described in that paper is somewhat similar to ours, since it involves operations which are linear combinations of the parameters, with rational coefficients only, passing through the optical elements, and it has recursive computational power. In that model, however, three types of basic elements are involved, and the simulated Turing machine has a unary tape. Furthermore, the authors of that paper assume that the system can instantaneously differentiate between two numbers, no matter how close, which is not a logical assumption for our model.
1.2. Consequences and Future Work
The simulation result has many consequences regarding the decidability, or more generally the complexity, of questions about recursive nets of the type we consider. For instance, determining if a given neuron ever assumes the value "0" is effectively undecidable (as the halting problem can be reduced to it); on the other hand, the problem appears to become decidable if a linear activation is used (halting in that case is equivalent to a fact that is widely conjectured to follow from classical results due to Skolem and others on rational functions; see [BR88, p. 75]), and is also decidable in the pure threshold case (there are only finitely many states). As our function σ is in a sense a combination of threshold and linear functions, this gap in decidability is perhaps remarkable. Given the linear-time simulation bound, it is of course also possible to transfer
NP-completeness results into the same questions for nets (with rational coefficients). Another consequence of our results (when using the halting problem) is that the problem of determining whether a dynamical system of the particular form

x(t+1) = σ(Ax(t) + c)

ever reaches an equilibrium point, from a given initial state, is effectively undecidable. Such models have been proposed in the neural and analog computation literature for dealing with content-addressable retrieval and optimization, notably in the work of Hopfield (see, e.g., [HT85]). In associative memories, for instance, the initial state is taken as the "input pattern" and the final state as a class representative. All convergence results currently known assume special structure of the A matrix (for instance, that it is symmetric), but our undecidability conclusion shows that such results will not be possible for general systems of the above type.
Another corollary is that higher order networks are computationally equivalent, up to polynomial time, to first-order networks. (A finite network with rational weights is finitely describable and can be efficiently simulated by a Turing machine, which in turn can be simulated by first-order networks.)
Many other types of "machines" may be used for universality (see [Son90], especially Chap. 2, for general definitions of continuous machines). For instance, we can show that systems evolving according to equations x(t+1) = x(t) + τ(Ax(t) + bu(t) + c), where τ takes the sign in each coordinate, again are universal in a precise sense. It is interesting to note that this equation represents an Euler approximation of a differential equation; this suggests the existence of continuous-time simulations of Turing machines, quite different technically from the work on analog computation by Pour-El [PE74] and others. A different approach to continuous-valued models of computation is given in [BSS89] and other papers by the same authors; in that context, our processor nets can be viewed as programs with loops in which linear operations and linear comparisons are allowed, but with an added restriction on branching that reflects the nature of the saturated response we use.
The rest of this paper is organized as follows. Section 2 provides the exact definitions of the model and states the results. It is followed by Section 3, which highlights the main proof. Details of the proof are provided in Sections 4 to 7. In Section 8, we describe the universal network, and we end in Section 9, where we briefly describe the non-deterministic version of networks.
2. THE MODEL AND MAIN RESULTS
We model a net, as described in the introduction, as a dynamical system. At each instant, the state of this system is a vector x(t) ∈ Q^N of rational numbers, where the ith coordinate keeps track of the activation value of the ith processor. Given a system of equations such as (1), an initial state x(1), and an infinite input sequence

u = u(1), u(2), ...,

we can define iteratively the state x(t) at time t, for each integer t ≥ 1, as the value obtained by recursively solving the equations. This gives rise, in turn, to a sequence of output values, by restricting attention to the output processors; we refer to this sequence as the "output produced by the input u" starting from the given initial state.
We want to define what we mean by a net computing a function

φ: {0,1}^+ → {0,1}^+,

where {0,1}^+ denotes the free semigroup of binary strings (excluding the empty string). To do so, we must first define what we mean by a formal network, a network which adheres to a rigid encoding of its input and output. We define formal nets with two binary input lines. The first of these is a data line, and it is used to carry a binary input signal; when no signal is present, it defaults to zero. The second is the validation line, and it indicates when the data line is active; it takes the value "1" while the input is present there and "0" thereafter. We use "D" and "V" to denote the contents of these two lines, respectively, so that

u(t) = (D(t), V(t)) ∈ {0,1}^2
for each t. We always take the initial state x(1) to be zero and to be an equilibrium state. We assume that there are two output processors, which also take the roles of validation and data lines and are denoted by "G" and "H," respectively. The output of the network is then given by

y(t) = (G(t), H(t)) ∈ {0,1}^2.

(The convention of using two input lines allows us to have all external signals be binary; of course many other conventions are possible and would give rise to the same results; for instance, one could use a three-valued input, say with values {-1, 0, 1}, where "0" indicates that no signal is present, and ±1 are the two possible binary input values.)
In general, our discrete-time dynamical system (with two binary inputs) is specified by a dynamics map

ℱ: Q^N × {0,1}^2 → Q^N.

One defines the state at time t, for each integer t ≥ 1, as the value obtained by recursively solving the equations

x(1) := x_init,   x(t+1) := ℱ(x(t), u(t)),   t = 1, 2, ... .
For each N ∈ ℕ, we denote by

σ_N: Q^N → Q^N

the mapping

(q_1, ..., q_N) ↦ (σ(q_1), ..., σ(q_N)),

and we drop the subscript N when clear from the context. (We think of elements of Q^N as column vectors, but for display purposes we sometimes show them as rows, as above. As usual, Q denotes the rationals, and ℕ denotes the natural numbers, not including zero.) Here is a formal definition.

DEFINITION 2.1. A σ-processor net 𝒩 with two binary inputs is a dynamical system having a dynamics map of the form

ℱ(x, u) = σ(Ax + b_1 u_1 + b_2 u_2 + c),

for some matrix A ∈ Q^{N×N} and three vectors b_1, b_2, c ∈ Q^N.

The "bias" vector c is not needed, as one may always add a processor with constant value 1, but using c simplifies the notation and allows the use of zero initial states, which seems more natural. When b_1 = b_2 = 0, one has a net without inputs. Processor nets appear frequently in neural network studies, and their dynamic properties are of interest (see, for instance, [MW89]).

It is obvious that, with zero, or in general a rational, initial state, one can simulate a processor net with a Turing machine. We wish to prove, conversely, that any function computable by a Turing machine can be computed by such a processor net. We look at partially defined maps

φ: {0,1}^+ → {0,1}^+

that are recursively computed. In other words, maps for which there is a multi-stack Turing machine M so that, when a word ω ∈ {0,1}^+ is written initially on the input/output stack, M halts on ω if and only if φ(ω) is defined, and in that case, φ(ω) is the content of that stack when the machine halts.

For each word

ω = ω_1 ··· ω_k ∈ {0,1}^+

with

φ(ω) = p_1 ··· p_l ∈ {0,1}^+ or undefined

and each r ∈ ℕ (l is called the length of the response/output and r is called the response time, the time it takes to provide the first output bit), we encode the input and output as follows. Input is encoded as

u_ω(t) = (D_ω(t), V_ω(t)),

where

V_ω(t) = 1 if t = 1, ..., k, and 0 otherwise,

and

D_ω(t) = ω_t if t = 1, ..., k, and 0 otherwise.

Output is encoded as

y_{ω,r}(t) = (G_{ω,r}(t), H_{ω,r}(t)),   t = 1, ...,

where

G_{ω,r}(t) = 1 if t = r, ..., (r + l - 1), and 0 otherwise,

if the output φ(ω) is defined, and is identically 0 when φ(ω) is not defined, and finally

H_{ω,r}(t) = p_{t-r+1} if t = r, ..., (r + l - 1), and 0 otherwise,

if the output φ(ω) is defined, and is identically 0 when φ(ω) is not defined.
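As an illustration of this input convention, the following Python sketch (ours, not from the paper) builds the pair of streams (D_ω, V_ω) for a word ω; the function name encode_input is invented for this example.

```python
def encode_input(word, total_time):
    # Input convention of Section 2: the data line D carries the bits of `word`
    # for t = 1, ..., k and defaults to 0 afterwards; the validation line V is 1
    # exactly while the input is present.
    k = len(word)
    D = [int(word[t]) if t < k else 0 for t in range(total_time)]
    V = [1 if t < k else 0 for t in range(total_time)]
    return list(zip(D, V))

print(encode_input("101", 6))  # [(1, 1), (0, 1), (1, 1), (0, 0), (0, 0), (0, 0)]
```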
THEOREM 1. Let φ: {0,1}^+ → {0,1}^+ be any recursively computable partial function. Then, there exists a processor net 𝒩 with the following property:

If 𝒩 is started from the zero (inactive) initial state and the input sequence u_ω is applied, 𝒩 produces the output y_{ω,r} (where H_{ω,r} appears in the "output node," and G_{ω,r} in the "validation node") for some r.

Furthermore, if a p-stack Turing machine ℳ (of one input stack and several working stacks) computes φ(ω) in time T(ω), then one may take r(ω) = T(ω) + O(|ω|).

Note that in particular it follows that one can obtain the behavior of a universal Turing machine via some net. An upper bound from the construction shows that N = 886 processors are sufficient for computing such a universal partial recursive function.
2.1. Restatement: No I/O
It will be convenient to have a version of Theorem 1 that does not involve inputs but rather an encoding of the initial data into the initial state. (This would be analogous, for
Turing machines, to an encoding of the input into an input stack rather than having it arrive as an external input
stream.)
For a processor net without inputs, we may think of the dynamics map ℱ as a map Q^N → Q^N. In that case, we denote by ℱ^k the kth iterate of ℱ. For a state ξ ∈ Q^N, we let ξ^k := ℱ^k(ξ). We now state that if φ: {0,1}^+ → {0,1}^+ is a recursively computable partial function, then there exists a processor net 𝒩 without inputs, and an encoding of data into the initial state of 𝒩, such that: φ(ω) is undefined if and only if the second processor has activation value always equal to zero, and it is defined if this value ever becomes equal to one, in which case the first processor has an encoding of the result.
Given ω = ω_1 ··· ω_k ∈ {0,1}^+, we define the encoding function

δ[ω] := Σ_{i=1}^{k} (2ω_i + 1) / 4^i.   (4)

(Note that the empty sequence gets mapped into 0.)
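The encoding (4) is easy to compute exactly with rational arithmetic; the following sketch (ours, using Python's fractions module) mirrors the definition.

```python
from fractions import Fraction

def delta(word):
    # Eq. (4): delta[w] = sum_i (2*w_i + 1) / 4**i; the empty word maps to 0.
    return sum(Fraction(2 * int(b) + 1, 4 ** (i + 1)) for i, b in enumerate(word))

print(delta("1"))   # 3/4  (single digit 3 in base 4)
print(delta("01"))  # 1/4 + 3/16 = 7/16
```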
THEOREM 2. Let ℳ be a p-stack Turing machine computing φ: {0,1}^+ → {0,1}^+ in time T. Then there exists a processor net 𝒩 without inputs so that properties (a) and (b) below hold. In both cases, for each ω ∈ {0,1}^+, we consider the corresponding initial state

ξ(ω) := (δ[ω], 0, ..., 0) ∈ Q^N.

(a) If φ(ω) is undefined, the second coordinate ξ(ω)^j_2 of the state after j steps is identically equal to zero, for all j. If instead φ(ω) is defined and computed in time T, then there exists T̃ = O(T) so that

ξ(ω)^j_2 = 0,   j = 0, ..., T̃ - 1,
ξ(ω)^T̃_2 = 1,

and ξ(ω)^T̃_1 = δ[φ(ω)]. (This is a linear time simulation.)

(b) Furthermore, if one substitutes δ in Eq. (4) by a new map δ_p as

δ_p[ω] := Σ_{i=1}^{k} ε_{ω_i} / (10p²)^i,   (5)

where ε_0 = 10p² - 4p - 1 and ε_1 = 10p² - 1 (the encoding used in Section 6.2), then the construction can be done in such a way that T̃ = T. (This is a real time simulation.)
The next few sections include the proofs of both theorems. We start by proving Theorem 2, and then obtain Theorem 1 as a corollary. As the details of the proof of Theorem 2 are very technical, we start by sketching the main steps of the proof in Section 3. The proof itself is organized into several steps. We first show how to construct a network that simulates a given multi-stack Turing machine ℳ in time T̃ = 4T (T is the computation time of the Turing machine). This is done in Sections 4 and 5. In Section 6, we modify the construction into a network that simulates a given multi-stack Turing machine with no slowdown in the computation. This is done in three steps and is based on allowing large negative numbers to act as inhibitors.

After Theorem 2 is proved, we show in Section 7 how to add inputs and outputs to a processor net without input/output, thus obtaining Theorem 1 as its corollary. This ends the proofs.
3. HIGHLIGHTS OF THE PROOF
This section is aimed at highlighting the main part of the proof of Theorem 2. We start with a 2-stack machine; this model is obviously equivalent to a 1-tape (standard) Turing machine [HU79]. Formally, a 2-stack machine consists of a finite control and two binary stacks, unbounded in length. Each stack is accessed by a read-write head, which is located at the top element of the stack. At the beginning of the computation, the binary input sequence is written on stack_1. The machine at every step reads the top element of each stack as a symbol in {0, 1, #} (where # means that the stack is empty), checks the state of the control (s ∈ {1, 2, ..., |S|}), and executes the following operations:

1. For each stack, one of the following manipulations is made:
(a) Popping the top element.
(b) Pushing an extra element on the top of the stack (either 0 or 1).
(c) No change in stack.

2. Changing the state of the control.

When the control reaches a special state, called the "halting state," the machine stops. Its output is defined as the binary sequence on stack_1. Thus, the I/O map, or function, computed by a 2-stack machine is defined by the binary sequences on stack 1 before and after the computation.
3.1. Encoding the Stacks
Assume that we were to encode a binary stream ω = ω_1 ω_2 ··· ω_n into the number

q = Σ_{i=1}^{n} ω_i / 2^i.

Such a value could be held in a neuron, since it ranges in [0, 1]. However, there are a few problems with this simple encoding. First, one cannot differentiate between the strings "β" and "β·0," where "·" denotes the concatenation operator. Second, even when assuming that each binary string ends with 1, the continuity of the activation function σ makes it impossible to retrieve the most significant bit.
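A quick sketch (ours) illustrates the first problem: with the naive encoding, a string and the same string with a trailing zero receive identical codes.

```python
from fractions import Fraction

def naive(word):
    # The simple base-2 encoding sum_i w_i / 2**i discussed above.
    return sum(Fraction(int(b), 2 ** (i + 1)) for i, b in enumerate(word))

# "101" and "1010" receive the same code: trailing zeros are invisible,
# so the two strings cannot be distinguished.
print(naive("101") == naive("1010"))  # True (both are 5/8)
```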
3.5. Real Time Simulation
The above-suggested encoding and the use of the lemma result in a Turing machine simulation that requires four steps of the network for each step of the Turing machine. To achieve a "step per step" simulation, a more sophisticated encoding is required, which relies upon a Cantor set with a specific size of gaps. Then, we use large negative numbers as inhibitors, and attain in this manner the desired simulation in real time. We turn now to the formal proof.
4. GENERAL CONSTRUCTION OF THE SIMULATION
As a departur e point , we pic k //- tape T ur ing mac hines
with binar y alphabe ts . We ma y equiv al ently st udy pus h
down automata with p = 2p' binary stacks. We choose to
represent the values in the stacks as fractions with
denominators which are powers of four. An algebraic
formalization is as follows.
4.1. p-Stack Machines
Denote by 𝒞 the "Cantor 4-set" consisting of all those rational numbers q which can be written in the form

q = Σ_{i=1}^{k} a_i / 4^i,

with 0 ≤ k < ∞ and each a_i = 1 or 3. (When k = 0, we interpret this sum as q = 0.) Elements of 𝒞 are precisely those of the form δ[ω], where δ is as in Eq. (4).

The instantaneous description of a p-stack machine, with a control unit of n states, can be represented by a (p+1)-tuple

(s, δ[ω_1], δ[ω_2], ..., δ[ω_p]),

where s is the state of the control unit, and the stacks store the words ω_i (i = 1, ..., p), respectively. (Later, in the simulation by a net, the state s will be represented in unary as a vector of the form (0, 0, ..., 0, 1, 0, ..., 0).)

For any q ∈ 𝒞, we write

ζ[q] := { 1, if a_1 = 3; 0, if a_1 = 1 or q = 0 },   τ[q] := { 1, if q > 0; 0, if q = 0 }.

We think of ζ[·] as the "top of stack": in terms of the base-4 expansion, ζ[q] = 0 when a_1 = 1 (or q = 0), and ζ[q] = 1 when a_1 = 3. We interpret τ[·] as the "nonempty stack" operator. It can never happen that ζ[q] = 1 while τ[q] = 0; hence the pair (ζ[q], τ[q]) can have only three possible values in {0,1}².
DEFINITION 4.1. A p-stack machine ℳ is specified by a (p+4)-tuple

(S, s_I, s_H, θ_0, θ_1, θ_2, ..., θ_p),

where S is a finite set, s_I and s_H are elements of S called the initial and halting states, respectively, and the θ_i's are maps as follows:

θ_0: S × {0,1}^{2p} → S
θ_i: S × {0,1}^{2p} → {(1, 0, 0), (1/4, 0, 1/4), (1/4, 0, 3/4), (4, -2, -1)}   for i = 1, ..., p.

(The function θ_0 computes the next state, while the functions θ_i compute the next stack operations of stack_i, respectively. The actions depend only on the state of the control unit and the symbols being read from each stack. The elements in the range

(1, 0, 0), (1/4, 0, 1/4), (1/4, 0, 3/4), (4, -2, -1)

of the θ_i should be interpreted as "no operation," "push0," "push1," and "pop," respectively.)
The set 𝒳 := S × 𝒞^p is called the instantaneous description set of ℳ, and the map

𝒫: 𝒳 → 𝒳

defined by

𝒫(s, q_1, ..., q_p) := [ θ_0(s, ζ[q_1], ..., ζ[q_p], τ[q_1], ..., τ[q_p]),
θ_1(s, ζ[q_1], ..., ζ[q_p], τ[q_1], ..., τ[q_p]) · (q_1, ζ[q_1], 1), ...,
θ_p(s, ζ[q_1], ..., ζ[q_p], τ[q_1], ..., τ[q_p]) · (q_p, ζ[q_p], 1) ],

where the dot indicates inner product, is the complete dynamics map of ℳ. As part of the definition, it is assumed that the maps θ_i (i = 1, ..., p) are such that θ_i(s, ζ[q_1], ..., ζ[q_p], τ[q_1], ..., τ[q_p]) ≠ (4, -2, -1) whenever τ[q_i] = 0, for all i, q_1, ..., q_p (that is, one does not attempt to pop an empty stack).
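Under this encoding, the three stack operations and the two readings become the affine expressions given by the inner products above; the following sketch (ours, with invented helper names) exercises them on exact rationals.

```python
from fractions import Fraction

def top(q):       # zeta[q]: 1 iff the leading base-4 digit is 3
    return 1 if q >= Fraction(3, 4) else 0

def nonempty(q):  # tau[q]: 1 iff the stack is nonempty
    return 1 if q > 0 else 0

def push(q, bit): # push0: q/4 + 1/4,  push1: q/4 + 3/4
    return q / 4 + Fraction(2 * bit + 1, 4)

def pop(q):       # pop: 4q - 2*zeta[q] - 1
    return 4 * q - 2 * top(q) - 1

q = Fraction(0)
q = push(q, 1)                     # stack "1", code 3/4
q = push(q, 0)                     # stack "01" (0 on top), code 7/16
assert (top(q), nonempty(q)) == (0, 1)
assert pop(q) == Fraction(3, 4)    # popping restores "1"
```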
Let ω ∈ {0,1}^+ be arbitrary. If there exists a positive integer k so that, starting from the initial state s_I with δ[ω] on the first stack and empty other stacks, the machine reaches after k steps the halting state s_H, that is,

𝒫^k(s_I, δ[ω], 0, ..., 0) = (s_H, δ[ω_1], δ[ω_2], ..., δ[ω_p])

for some k, then the machine ℳ is said to halt on the input ω. If ω is like this, let k be the least possible number such that

𝒫^k(s_I, δ[ω], 0, ..., 0)

has the above form. Then the machine ℳ is said to output the string ω_1, and we let φ_ℳ(ω) := ω_1. This defines a partial map

φ_ℳ: {0,1}^+ → {0,1}^+,

the i/o map of ℳ.

Save for the algebraic notation, the partial recursive functions φ: {0,1}^+ → {0,1}^+ are exactly the same as the maps φ_ℳ of p-stack machines as defined here; it is only necessary to identify words in {0,1}^+ and elements of 𝒞 via the above encoding map δ. Our proof will then be based on simulating p-stack machines by processor nets.
5. NETWORK WITH FOUR LEVELS: CONSTRUCTION

Assume that a p-stack machine ℳ is given. Without loss of generality, we assume that the initial state s_I differs from the halting state s_H (otherwise the function computed is the identity, which can be easily implemented by a net), and we assume that S := {0, ..., s}, with s_I = 0 and s_H = 1. We build the net in two stages.

• Stage 1. As an intermediate step in the construction, we shall show how to simulate ℳ with a certain dynamical system over Q^{s+p}. Writing a vector in Q^{s+p} as

(x_1, ..., x_s, q_1, ..., q_p),

the first s components will be used to encode the state of the control unit, with 0 ∈ S corresponding to the zero vector x_1 = ··· = x_s = 0 and i ∈ S, i ≠ 0, corresponding to the ith canonical vector

e_i = (0, ..., 0, 1, 0, ..., 0)

(the "1" is in the ith position). For convenience, we also use the notation e_0 := 0 ∈ Q^s. The q_i's will encode the contents of the stacks. For notational ease, we substitute ζ[q_i] and τ[q_i] by a_i and b_i, respectively. Formally, define

β_ij: {0,1}^{2p} → {0,1}   (6)

for i ∈ {1, ..., s}, j ∈ {0, ..., s}, and

γ^k_ij: {0,1}^{2p} → {0,1}   (7)

for i ∈ {1, ..., p}, j ∈ {0, ..., s}, k = 1, 2, 3, 4, as

β_ij(a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = 1
⇔ θ_0(j, a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = i

(intuitively, there is a transition from state j of the control part to state i iff the readings from the stacks are: top of stack_k is a_k, and the nonemptyness test on stack_k gives b_k). γ^k_ij is zero except in the following cases:

γ^1_ij(a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = 1
⇔ θ_i(j, a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = (1, 0, 0)

(if the control is in state j and the stack readings are a_1, a_2, ..., a_p, b_1, b_2, ..., b_p, then stack i will not be changed);

γ^2_ij(a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = 1
⇔ θ_i(j, a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = (1/4, 0, 1/4)

(if the control is in state j and the stack readings are a_1, a_2, ..., a_p, b_1, b_2, ..., b_p, then operation push0 will occur on stack i);

γ^3_ij(a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = 1
⇔ θ_i(j, a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = (1/4, 0, 3/4)

(if the control is in state j and the stack readings are a_1, a_2, ..., a_p, b_1, b_2, ..., b_p, then operation push1 will occur on stack i);

γ^4_ij(a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = 1
⇔ θ_i(j, a_1, a_2, ..., a_p, b_1, b_2, ..., b_p) = (4, -2, -1)

(if the control is in state j and the stack readings are a_1, a_2, ..., a_p, b_1, b_2, ..., b_p, then operation pop will occur on stack i).

Let 𝒢 be the map Q^{s+p} → Q^{s+p}:

(x_1, ..., x_s, q_1, ..., q_p) ↦ (x_1^+, ..., x_s^+, q_1^+, ..., q_p^+),

where, using the notation x_0 := 1 - Σ_{i=1}^{s} x_i,

x_i^+ := Σ_{j=0}^{s} β_ij(a_1, ..., a_p, b_1, ..., b_p) x_j   (8)
for i = 1, ..., s, and

q_i^+ := ( Σ_{j=0}^{s} γ^1_ij(a_1, ..., a_p, b_1, ..., b_p) x_j ) · q_i   (9.1)
       + ( Σ_{j=0}^{s} γ^2_ij(a_1, ..., a_p, b_1, ..., b_p) x_j ) · (q_i/4 + 1/4)   (9.2)
       + ( Σ_{j=0}^{s} γ^3_ij(a_1, ..., a_p, b_1, ..., b_p) x_j ) · (q_i/4 + 3/4)   (9.3)
       + ( Σ_{j=0}^{s} γ^4_ij(a_1, ..., a_p, b_1, ..., b_p) x_j ) · (4q_i - 2ζ[q_i] - 1)   (9.4)

for i = 1, ..., p.
Let π: 𝒳 = S × 𝒞^p → Q^{s+p} be defined by

π(i, q_1, ..., q_p) := (e_i, q_1, ..., q_p).

It follows immediately from the construction that

𝒢(π(i, q_1, ..., q_p)) = π(𝒫(i, q_1, ..., q_p))

for all (i, q_1, ..., q_p) ∈ 𝒳.

Applied inductively, the above implies that

𝒢^k(e_0, δ[ω], 0, ..., 0) = π(𝒫^k(0, δ[ω], 0, ..., 0))

for all k, so φ(ω) is defined if and only if for some k it holds that 𝒢^k(e_0, δ[ω], 0, ..., 0) has the form

(e_1, q_1, ..., q_p)

(recall that for the original machine, s_I = 0 and s_H = 1, which map respectively to e_0 = 0 and e_1 in the first s coordinates of the corresponding vector in Q^{s+p}). If such a state is reached, then q_1 is in 𝒞 and its value is δ[φ(ω)].

• Stage 2. The second stage of the construction simulates the dynamics 𝒢 by a net. We first need an easy technical fact.
LEMMA 5.1. Let t ∈ ℕ. For each function β: {0,1}^t → {0,1} there exist vectors

v_1, v_2, ..., v_{2^t} ∈ Z^{t+2}

and scalars

c_1, c_2, ..., c_{2^t} ∈ Z

such that, for each d_1, d_2, ..., d_t, x ∈ {0,1} and each q ∈ [0,1],

β(d_1, d_2, ..., d_t) x = Σ_{r=1}^{2^t} c_r σ(v_r · μ)   (10)

and

β(d_1, d_2, ..., d_t) x q = σ( q + Σ_{r=1}^{2^t} c_r σ(v_r · μ) - 1 ),   (11)

where we denote μ = (1, d_1, d_2, ..., d_t, x) and "·" is the dot product in Z^{t+2}.
Proof. Write β as a polynomial

β(d_1, d_2, ..., d_t) = c̃_1 + c̃_2 d_1 + ··· + c̃_{t+1} d_t + c̃_{t+2} d_1 d_2 + ··· + c̃_{2^t} d_1 d_2 ··· d_t,   (12)

expand the product β(d_1, d_2, ..., d_t)x, and use that for any sequence l_1, ..., l_k of elements in {0,1}, one has

l_1 ··· l_k = σ(l_1 + ··· + l_k - k + 1).

Using that x = σ(x), this gives that

β(d_1, d_2, ..., d_t)x = c_1 σ(x) + c_2 σ(d_1 + x - 1) + ··· + c_{2^t} σ(d_1 + d_2 + ··· + d_t + x - t)
                      = Σ_{r=1}^{2^t} c_r σ(v_r · μ)

for suitable c_r's and v_r's. On the other hand, for each τ ∈ {0,1} and each q ∈ [0,1] it holds that τq = σ(q + τ - 1) (just check separately for τ = 0, 1), so substituting the above formula with τ = β(d_1, d_2, ..., d_t)x gives the desired result. ∎
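Both identities used in this proof are finite checks; the sketch below (ours, not from the paper) verifies them exhaustively for small k.

```python
from itertools import product

def sigma(x):
    return min(max(x, 0), 1)

# Product of binary values as a single saturation:
# l_1 * ... * l_k = sigma(l_1 + ... + l_k - k + 1).
for k in range(1, 5):
    for ls in product((0, 1), repeat=k):
        prod = 1
        for l in ls:
            prod *= l
        assert prod == sigma(sum(ls) - k + 1)

# Gating a value q in [0, 1] by a bit tau: tau * q = sigma(q + tau - 1).
for tau in (0, 1):
    for q in (0, 0.25, 0.5, 1):
        assert tau * q == sigma(q + tau - 1)
```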
Remark 5.2. The above construction overestimates the number of neurons necessary. In the case where t = 2p and the arguments are the top and nonempty functions of the stacks, the arguments are dependent, and there is a need for just 3^p terms in the summation, rather than 2^{2p}. We illustrate this phenomenon with the simple case of p = 2, that is, the case of two-stack machines. Here, when

(d_1, d_2, d_3, d_4) = (a_1, a_2, b_1, b_2) ( = (ζ[q_1], ζ[q_2], τ[q_1], τ[q_2]) ),

one can easily verify that from the following nine elements

(1, a_1, a_2, b_1, b_2, a_1 a_2, a_1 b_2, a_2 b_1, b_1 b_2),

one can obtain, using affine combinations only, the values of the remaining seven elements

(a_1 b_1, a_2 b_2, a_1 a_2 b_1, a_1 a_2 b_2, a_1 b_1 b_2, a_2 b_1 b_2, a_1 a_2 b_1 b_2).

Hence, a sum of the type in Eq. (10) requires in this case only nine (3^p) elements rather than sixteen (2^{2p}).
We now return to the proof of the theorem. Apply Lemma 5.1 repeatedly, with the "β" of the lemma corresponding to each of the β_ij's and γ^k_ij's, and using variously q = q_i, q = (q_i/4 + 1/4), q = (q_i/4 + 3/4), or q = (4q_i - 2ζ[q_i] - 1). Write also σ(4q_i - 2) whenever ζ[q_i] appears (using that ζ[q] = σ(4q - 2) for each q ∈ 𝒞), and σ(4q_i) whenever τ[q_i] appears. The result is that 𝒢 can be written as a composition

𝒢 = F_1 ∘ F_2 ∘ F_3 ∘ F_4

of four "saturated-affine" maps, i.e., maps of the form σ(Ax + c):

F_4: Q^{s+p} → Q^μ,   F_3: Q^μ → Q^ν,   F_2: Q^ν → Q^η,   F_1: Q^η → Q^{s+p},

for some positive integers μ, ν, η. (The argument to the function F_4, called below z_1, of dimension (s+p), represents the s x_i's of Eq. (8) and the p q_i's of Eqs. (9). The functions F_1, F_2, F_3 compute the transition function of the x_i's and q_i's in three stages.)
Consider the following set of equations, where from now on we omit time arguments and use the superscript + to indicate a one-step time increment:

z_1^+ = F_1(z_2)
z_2^+ = F_2(z_3)
z_3^+ = F_3(z_4)
z_4^+ = F_4(z_1),

where the z's are vectors of sizes s+p, η, ν, and μ, respectively. This set of equations models the dynamics of a σ-processor net, with

N = s + p + μ + ν + η

processors. For an initial state of type z_1 = (e_0, δ[ω], 0) and z_i = 0, i = 2, 3, 4, it follows that at each time of the form t = 4k the first block of coordinates, z_1, equals 𝒢^k(e_0, δ[ω], 0).
All that is left is to add a mod-4 counter to impose the constraint that state values at times that are not divisible by 4 should be disregarded. The counter is implemented by adding a set of equations

y_1^+ = σ(y_2),
y_2^+ = σ(y_3),
y_3^+ = σ(y_4),
y_4^+ = σ(1 - y_2 - y_3 - y_4).

When starting with all y_i(0) = 0, it holds that y_1(t) = 1 if t = 4k for some positive integer k, and is zero otherwise.
In terms of the complete system, φ(ω) is defined if and only if there exists a time t such that, starting at the state z_1 = (e_0, δ[ω], 0), z_i = 0, i = 2, 3, 4, y_j = 0, j = 1, 2, 3, 4, the first coordinate z_{11}(t) of z_1(t) equals 1 and also y_1(t) = y_2(t - 1) = 1. To force z_{11}(t) not to output arbitrary values at times that are not divisible by 4, we modify it to

z_{11}^+ = σ( ··· + l(y_2 - 1) ),

where "···" is as in the original update equation for z_{11}, and l is a positive constant bigger than the largest possible value of z_{11}. The net effect of this modification is that now z_{11}(t) = 0 for all t that are not multiples of 4, and for t = 4k it equals 1 if the machine should halt, and 0 otherwise. Reordering the coordinates so that the first stack (the (s+1)th coordinate of z_1) becomes the first coordinate, and the previous z_{11} (that represented the halting state s_H of the machine ℳ) becomes the second coordinate, Theorem 2(a) is proved. ∎
5.1. A Layout of the Construction
The above construction can be represented pictorially as
in Fig. 1.
For now, ignore the rightmost element at each level, which is the counter. The remainder corresponds to the functions F_4, F_3, F_2, F_1, ordered from bottom to top. The processors are divided in levels, where the output of the ith level feeds into the (i-1)th level (and the output of the top level feeds back into the bottom). The processors are grouped in the picture according to their function.
The bottom layer, F_4, contains 3 + p groups of processors. The leftmost group of processors stores the values of the s states to pass to level F_3. The "zero state" processor outputs 1 or 0, outputting 1 if and only if all of the s processors in the first group are outputting 0. The "read stack i" group computes the top element ζ[q_i] of stack i, and τ[q_i] ∈ {0,1}, which equals 0 if and only if stack i is empty. Each of the p processors in the last group stores an encoding of one of the p stacks.

Layer F_3 computes the 2^{2p} terms σ(v_r · μ) that appear in Eq. (10) for each of the possible s + 1 values of the vector x.
[Figure layout, top to bottom: s state neurons, p stacks, counter 1; sub-stacks (1) ... sub-stacks (p), counter 2; s state neurons, p stacks, counter 3; s states + state 0, read stack (1) ... read stack (p), p stacks, counter 4.]

FIG. 1. The universal network.
Only 3^p(s + 1) processors are needed, though, since there are only three possibilities for the ordered pair (ζ[q_i], τ[q_i]) for each of the stacks; this was explained in Remark 5.2. (Note that each such μ contains ζ[q_1], ..., ζ[q_p], τ[q_1], ..., τ[q_p], as well as x_0, x_1, ..., x_s, that were computed at the level F_4.) In this level we also pass to level F_2 the values ζ[q_1], ..., ζ[q_p] and the encoding of the p stacks, from level F_4. As a matter of fact, 3^p s + 1 neurons may substitute the 3^p(s + 1), as the 0 state requires one neuron only.
At level F_2 we compute all the new states as described in Eq. (8) (to be passed along without change to the top layer). Each of the four processors in the "new stack i" group computes one of the four main terms (rows) of Eqs. (9) for one stack. For instance, for the fourth main term we compute an expression of the form

σ( 4q_i - 2ζ[q_i] - 1 + Σ_{j=0}^{s} Σ_r c_{rj} σ(v_r · μ) - 1 )

(obtained by applying Eq. (11)). Note that each of the terms q_i, ζ[q_i], σ(v_r · μ) has been evaluated in the previous level.

Finally, at the top level, we copy the states from level F_2, except that the halting state x_1 is modified by counter 2. We also add the four main terms in each stack and apply σ to the result.

At the top level the coordinates are reordered so that the data processor and the halting processor are first and second, respectively, as required to prove Theorem 2. Note that this construction results in values μ = s + 3p + 1, ν = 3^p(s + 1) + 2p, and η = s + 4p.

If desired, one could modify this construction so that all activation values are reset to zero when the halting processor takes the value "1." This is discussed more thoroughly in Remark 7.1.
6. REAL TIME SIMULATION
Here, we refine the simulation of Section 5 to obtain the
claimed real-time simulation. Thus, this section provides a proof of Theorem 2(b). This part is both less intuitive and
less crucial for the main result.
We start in Subsection 6.1 by modifying the construction given in Section 5 in order to obtain a "two-level" neural network. At this point, we obtain a slowdown of a factor of two in the simulation. In Subsection 6.2, we modify the construction so that in one of the levels the neurons differ from the standard neurons; they compute linear combinations of their inputs with no sigmoid function applied to the combinations. Finally, in Subsection 6.3, we show how to modify the last network into a standard one with one level only, thus achieving the desired real-time simulation of Turing machines.

In Subsection 6.2, we substitute the 4-Cantor set representation with a somewhat more complex Cantor set representation (to be explained there), which has larger gaps between consecutive admissible ranges of empty stacks, stacks with 0 top, and stacks with 1 top. These gaps, along with the use of large negative numbers as inhibitors, will enable us to speed up the simulation.
6.1. Computing in Two Layers
We can rewrite the dynamics of the stack q_i from Eqs. (9) as the sum of four components,

q_i = Σ_{j=1}^{4} q_{ij},   (13)

and similarly for its readings,

t_i = Σ_{j=1}^{4} t_{ij}   (14)

and

e_i = Σ_{j=1}^{4} e_{ij}.   (15)

We introduce the notation

next-q_{ij} = { q_i,                  if j = 1
                q_i/4 + 1/4,          if j = 2
                q_i/4 + 3/4,          if j = 3
                4q_i - 2ζ[q_i] - 1,   if j = 4,

where next-q_{ij} represents the row (9.j) in Eqs. (9). We name these q_{ij} the sub-stack elements of stack i. That is, q_{i1} may differ from 0 only if the last update of stack i was "no-operation." Similarly, the components q_{i2}, q_{i3}, q_{i4} may differ from 0 only if the last update of the ith stack was "push0," "push1," or "pop," respectively. We also define t_{ij} to be the sub-top of the sub-stack element q_{ij} and e_{ij} to be the sub-nonempty test of the same sub-stack element.

As three out of the four sub-stack elements {q_{i1}, q_{i2}, q_{i3}, q_{i4}} of each stack i = 1, ..., p are 0, and the fourth has the value of the stack q_i, it is also the case that three out of the four sub-top elements of t_i (and sub-nonempty elements of e_i) are 0, and the fourth one stores the value of the top (nonempty predicate) of the relevant stack.

Using Eq. (8), we can express the dynamics of the state control as

x_i^+ = Σ_{j=0}^{s} β_ij(t_{11}, ..., t_{p4}, e_{11}, ..., e_{p4}) x_j   (16)

for i = 1, ..., s, and the dynamics of the sub-stack elements as

q_{i1}^+ = σ( q_i + Σ_{k=0}^{s} γ^1_{ik}(t_{11}, ..., e_{p4}) x_k - 1 )
q_{i2}^+ = σ( q_i/4 + 1/4 + Σ_{k=0}^{s} γ^2_{ik}(t_{11}, ..., e_{p4}) x_k - 1 )
q_{i3}^+ = σ( q_i/4 + 3/4 + Σ_{k=0}^{s} γ^3_{ik}(t_{11}, ..., e_{p4}) x_k - 1 )
q_{i4}^+ = σ( 4q_i - 2ζ[q_i] - 1 + Σ_{k=0}^{s} γ^4_{ik}(t_{11}, ..., e_{p4}) x_k - 1 )

for all stacks q_i, i = 1, ..., p. We summarize the above sub-stack dynamic equations by

q_{ij}^+ = σ( next-q_{ij} + Σ_{k=0}^{s} γ^j_{ik}(·) x_k - 1 ).   (17)

Similarly, the sub-top and sub-nonempty are updated by

t_{ij}^+ = σ( 4[ next-q_{ij} + Σ_{k=0}^{s} γ^j_{ik}(·) x_k - 1 ] - 2 ),
e_{ij}^+ = σ( 4[ next-q_{ij} + Σ_{k=0}^{s} γ^j_{ik}(·) x_k - 1 ] )   (18)

for all i = 1, ..., p, j = 1, ..., 4.

We construct a network in which the stacks and their readings are not kept explicitly in values q_i, t_i, e_i but implicitly only via the sub-elements q_{ij}, t_{ij}, e_{ij}, j = 1, ..., 4, i = 1, ..., p. This will enable us a simulation of one step of a Turing machine by a "two level" rather than a "four level" network, as was suggested in the previous section.

By Lemma 5.1 and Eqs. (14), (15), the functions β_ij and γ^j_{ik} of Eqs. (16)-(18) can be written as combinations of the form

Σ_{k=0}^{s} Σ_r c^α_r σ(v^α_r · μ_k),   (19)
where

μ_k = ( 1, Σ_{j=1}^{4} t_{1j}, ..., Σ_{j=1}^{4} t_{pj}, Σ_{j=1}^{4} e_{1j}, ..., Σ_{j=1}^{4} e_{pj}, x_k ),

the c^α_r are scalar constants, the v^α_r are vector constants, and α represents the multi-indices ij for β and ijk for γ. Thus, all the update equations of

x_k, k = 1, ..., s (states),
q_{ij}, i = 1, ..., p, j = 1, 2, 3, 4,
t_{ij}, i = 1, ..., p, j = 1, 2, 3, 4,
e_{ij}, i = 1, ..., p, j = 1, 2, 3, 4,

can be written as

σ( lin. comb. of σ( lin. comb. of tops (t_i) and nonempty (e_i) ) ),

that is, as what is usually called a "feedforward neural net with one hidden layer."

The main layer consists of the sub-elements q_{ij}, t_{ij}, e_{ij} and the states x_k. In the hidden layer, we compute all elements σ(···) required by Lemma 5.1 to compute the functions β and γ. We showed that 3^p s + 2 terms of this kind are required. We refer to these terms as "configuration detectors" as they provide the needed combinations of states and stack readings. These terms are all that are required to compute x_k. We also keep in the hidden layer the values of q_i and t_i to compute next-q_{ij}.

The result is that 𝒢 can be written as a composition

𝒢 = F_1 ∘ F_2

of two "saturated-affine" maps, i.e., maps of the form σ(Ax + c): F_1: Q^ν → Q^η, F_2: Q^η → Q^ν, for ν = 3^p s + 2p + 2 and η = s + 12p.
In summary:
• The main layer consists of:

1. s neurons x_k, k = 1, ..., s, that represent the state of the system unarily.
2. For each stack i, i = 1, ..., p, we have
(a) four neurons q^1_{ij} = q_{ij}, j = 1, 2, 3, 4,
(b) four neurons t^1_{ij} = t_{ij}, j = 1, 2, 3, 4,
(c) four neurons e^1_{ij} = e_{ij}, j = 1, 2, 3, 4.

• The hidden layer consists of:

1. 3^p s + 2 neurons for configuration detecting. (The additional two are for the case of s_0.)
2. For each stack i, i = 1, ..., p, we have
(a) a neuron q^2_i = q_i,
(b) a neuron t^2_i = t_i.
6.2. Removing the Sigmoid from the Main Level
Here, we proceed by shrinking the network and showing how to construct a network equivalent to the one above, in which neurons in the main level compute linear combinations only (and apply no σ function to them). In the following construction, we introduce a set of "noisy sub-stack" elements {q_{i1}, q_{i2}, q_{i3}, q_{i4}} for each stack i = 1, ..., p. These may assume not only values in [0,1], but also negative values. Negative values of the stacks are interpreted as the value 0, while positive values are the true values of the stack. As in the last section, only one of these four elements may assume a non-negative value at each time. The "noisy sub-top" and "noisy sub-nonempty" functions applied to the noisy sub-stack elements may also produce values outside the range [0,1].

To manage with only one level of σ functions, we need to choose a number representation that enforces large enough gaps between valid values of the stacks. We now abandon the 4-Cantor set representation, using instead a slightly different Cantor set representation: If the network is to simulate a p-stack machine, then our encoding is the 10p²-Cantor set.

To motivate our new Cantor representation, we illustrate it first with the special case p = 2. Here, the encoding is a 40-Cantor set representation. We encode the bit 0 by ε_0 = 31 and 1 by ε_1 = 39. We interpret the resulting sequence as a number in base 40. For example, the stack ω = 1011 is encoded by

q = .(39)(31)(39)(39)_40.
In addition, we allow the empty stack to be represented by any non-positive value of q, rather than 0 only. This is where the sigmoids of the main level are being saved. The possible ranges of a value q representing the stacks are thus:

[-∞, 0]          empty stack
[31/40, 32/40]   top of stack is 0
[39/40, 1]       top of stack is 1.
Stack operations include push0, which is implemented by the operation q/40 + 31/40; push1, which is implemented as q/40 + 39/40; and pop, which is implemented by 40q - 8t - 31, where t is the top element of the stack. We can compute the top and non-empty predicates by

top(q) := 5(40q - 38),
nonempty(q) := (40q - 30).
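These predicates and their ranges can be verified numerically; the sketch below (ours, with invented names encode40, top40, nonempty40) checks them on exact rationals.

```python
from fractions import Fraction

E0, E1 = 31, 39                          # digits for bits 0 and 1 in base 40 (p = 2)

def encode40(word):
    return sum(Fraction(E1 if b == "1" else E0, 40 ** (i + 1))
               for i, b in enumerate(word))

def top40(q):      return 5 * (40 * q - 38)
def nonempty40(q): return 40 * q - 30

q1 = encode40("10")                      # top bit 1
q0 = encode40("01")                      # top bit 0
assert 5 <= top40(q1) <= 10 and -35 <= top40(q0) <= -30
assert 9 <= nonempty40(q1) <= 10 and 1 <= nonempty40(q0) <= 2
assert nonempty40(Fraction(0)) <= -30    # any q <= 0 reads as "empty"
```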
The ranges of the top values are [5, 10], [-35, -30], and [-∞, -190] for top = 1, top = 0, and empty stack, respectively; and the ranges of the non-empty predicate are [9, 10], [1, 2], and [-∞, -30], stated in the same order. In particular, top is interpreted as being 1 when the function top(q) is in the range [5, 10] and 0 for [-∞, -30], while the non-empty predicate is taken as 1 in the range [1, 10] and 0 in the range [-∞, -30]. The gaps between the positive and negative ranges are large; in particular, the domain values which represent the value 0 in both predicates are at least 3 times larger in absolute value than the domain values which represent the value 1. The special encoding was planned to have this property. In the case of p-stack machines, we need an encoding for which the values of the negative domain are at least (2p - 1) times larger than the values of the positive domain.
For the general p-stack machine, we choose the base b = 10p². Denote c = 2p + 1, ε_1 = 10p² - 1, ε_0 = 10p² - 4p - 1. That is, 0 is encoded by ε_0, 1 is encoded by ε_1, and the resulting sequence is interpreted in base b. The role of c will be explained below. The reading functions "noisy sub-top" and "noisy sub-nonempty" corresponding to the noisy sub-stack element v ∈ {q_{ij} | i = 1, ..., p, j = 1, ..., 4} are defined as

N-top(v) := c(bv - (ε_1 - 1))   (20)
N-nonempty(v) := bv - (ε_0 - 1).   (21)

We denote t_{ij} = N-top(q_{ij}) and e_{ij} = N-nonempty(q_{ij}) for i = 1, ..., p, j = 1, ..., 4.
We summarize the dynamic equations of the noisy elements by

q_{ij}^+ = σ( next-q_{ij} + Σ_{k=0}^{s} γ^j_{ik}(·) x_k - 1 ),
t_{ij}^+ = σ( c[ b( next-q_{ij} + Σ_{k=0}^{s} γ^j_{ik}(·) x_k - 1 ) - (ε_1 - 1) ] ),
e_{ij}^+ = σ( b( next-q_{ij} + Σ_{k=0}^{s} γ^j_{ik}(·) x_k - 1 ) - (ε_0 - 1) ),   (22)

where

next-q_{ij} = { q_i,                            if j = 1
                q_i/b + ε_0/b,                  if j = 2
                q_i/b + ε_1/b,                  if j = 3
                bq_i - 4pζ[q_i] - ε_0,          if j = 4,

and ζ[q_i] is the true binary value of the top of stack i.
The ranges of values of the noisy sub-top and sub-nonempty functions are

N-top(v) ∈ { [2p+1, 4p+2],                     when the top is "1"
             [-8p² - 2p + 1, -8p² + 2],        when the top is "0"
             [-∞, -20p³ - 10p² + 4p + 2],      for an empty stack;

N-nonempty(v) ∈ { [4p+1, 4p+2],                when the top is "1"
                  [1, 2],                      when the top is "0"
                  [-∞, -10p² + 4p + 2],        for an empty stack.

(Note that the values of σ(N-nonempty(v)) and σ(N-top(v)) provide the exact binary top and nonempty predicates.) The union of these ranges is

U = [-∞, -8p² + 2] ∪ [1, 4p + 2].

The parameter c is chosen so as to assure large gaps in the range U.
Property. For all p ≥ 2, any negative value of the functions N-top and N-nonempty has an absolute value of at least (2p - 1) times any positive value of them.
The large negative numbers operate as inhibitors. We will see later how this property assists in constructing the network. As for the possibility of maintaining negative values in stack elements rather than 0, Eq. (13) is not valid any more. That is, the elements q_{ij}, j = 1, ..., 4, cannot be combined linearly to provide the real value of the stack q_i. This is also the case with the top and nonempty predicates (see Eqs. (14), (15)).

In Lemma 5.1, we proved that for any Boolean function of the type β: {0,1}^t → {0,1} and x ∈ {0,1}, one may express

β(d_1, ..., d_t) x = Σ_{r=1}^{2^t} c_r σ(v_r · μ)

for μ = (1, d_1, ..., d_t, x), constants c_r, and constant vectors v_r. This was applicable for the functions β_ij and γ^j_{ik} in Eqs. (16)-(18). Next, we prove that using the noisy sub-top t_{ij} and noisy sub-nonempty e_{ij} elements, rather than the binary sub-top (σ(t_{ij})) and sub-nonempty (σ(e_{ij})) ones, one may still compute the functions β_ij and γ^j_{ik} using one hidden layer only.

It is not true for every function β: ℝ^t → {0,1} and x ∈ {0,1} that β(·)x is computable in a one-hidden-layer network; this is true in particular cases only, including ours.
DEFINITION 6.1. A function β(v_1, ..., v_t) is said to be sign-invariant if

v'_i = sign(v_i)  ∀i = 1, ..., t  ⟹  β(v_1, ..., v_t) = β(v'_1, ..., v'_t).

LEMMA 6.2. For each t, r ∈ ℕ, let U_t be the range

[-∞, -2t² + 2] ∪ [1, 2t + 2]

and let

S_{r,t} = { d | d = (d_1^(1), ..., d_1^(r), d_2^(1), ..., d_2^(r), ..., d_t^(1), ..., d_t^(r)) ∈ U_t^{rt}, and ∀i = 1, ..., t at most one of d_i^(1), ..., d_i^(r) is positive }.

We denote by I the set of multi-indices (i_1, ..., i_t), with each i_j ∈ {0, 1, ..., r}. For each function β: S_{r,t} → {0,1} that is sign-invariant, there exist vectors

{ v_i ∈ Z^{t+2}, i ∈ I }

and scalars

{ c_i ∈ Z, i ∈ I }

such that for each (d_1^(1), ..., d_t^(r)) ∈ S_{r,t} and any x ∈ {0,1}, we can write

β(d_1^(1), ..., d_t^(r)) x = Σ_{i ∈ I} c_i σ(v_i · μ_i),

where

μ_{(i_1, ..., i_t)} = (1, d_1^(i_1), d_2^(i_2), ..., d_t^(i_t), x)   (23)

and we are defining d_j^(0) = 0. Here the size of I is |I| = (r + 1)^t, and "·" is the dot product in Z^{t+2}.
Proof. As β is sign-invariant, we can write β when acting on S_{r,t} as

β(d_1^(1), d_1^(2), ..., d_t^(r)) = β̃(σ(d_1^(1)), σ(d_1^(2)), ..., σ(d_t^(r))).

Thus, β can be viewed as a Boolean function {0,1}^{rt} → {0,1}, and we can express it as a polynomial (see Eq. (12)):

β(d_1^(1), d_1^(2), ..., d_t^(r)) = c_1 + c_2 σ(d_1^(1)) + c_3 σ(d_1^(2)) + ··· + c_{rt+1} σ(d_t^(r))
   + c_{rt+2} σ(d_1^(1)) σ(d_1^(2)) + ··· + c_{(r+1)^t} σ(d_1^(r)) σ(d_2^(r)) ··· σ(d_t^(r)).

(Note that no term with more than t elements of the type σ(d_j^(i)) appears, as most σ(d_j^(i)) = 0, by definition of S_{r,t}.) Observe that for any sequence l_1, ..., l_k of (k ≤ t) elements in U_t and x ∈ {0,1}, one has

σ(l_1) ··· σ(l_k) x = σ( l_1 + ··· + l_k + k(2t + 2)(x - 1) ).

This is due to two facts:

1. The sum of k, k ≤ t, elements of U_t is non-positive when at least one of the elements is negative. This stems from the property that any negative value in this range is at least (t - 1) times larger in absolute value than any positive value there.

2. Each l_i is bounded by (2t + 2).

Expand the product β(d_1^(1), ..., d_t^(r))x, using the above observation and the fact x = σ(x). This gives that

β(d_1^(1), ..., d_t^(r)) x = c_1 σ(x) + c_2 σ(d_1^(1) + (2t + 2)(x - 1)) + ···
   + c_{(r+1)^t} σ(d_1^(r) + ··· + d_t^(r) + t(2t + 2)(x - 1))
   = Σ_{i ∈ I} c_i σ(v_i · μ_i)

for suitable c_i's and v_i's, where μ_i is defined as in (23). ∎
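The key observation of this proof, that a single negative "inhibitor" summand forces the saturated sum to zero, can be checked exhaustively on sample values of U_t; the sketch below (ours) does so for t = 4.

```python
from itertools import product

def sigma(x):
    return min(max(x, 0), 1)

t = 4                              # t = 2p readings, with p = 2
pos = (1, 2 * t + 2)               # extremes of the positive range [1, 2t+2]
neg = (-2 * t * t + 2, -1000)      # samples from the negative range [-inf, -2t^2+2]

for k in (1, 2, t):
    for ls in product(pos + neg, repeat=k):
        for x in (0, 1):
            lhs = x
            for l in ls:
                lhs *= sigma(l)
            # One negative summand drags the whole argument below 0.
            assert lhs == sigma(sum(ls) + k * (2 * t + 2) * (x - 1))
```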
Remark 6.3. Note that in the case where the arguments are the functions N-top and N-nonempty, the arguments are dependent and not all (r + 1)^t terms are needed.

We conclude that the noisy sub-stack elements q_{ij}, as well as the next state control, are computable in the one-hidden-layer network from t_{ij} and e_{ij}. We next provide the exact network architecture.
Network Description. The network consists of two levels. The main level consists of both s state neurons and noisy sub-stack neurons accompanied by sub-stack noisy-readings neurons: q^1_{ij}, t^1_{ij}, e^1_{ij}, i = 1, ..., p, j = 1, ..., 4, representing respectively noisy sub-stack elements, noisy sub-top elements, and noisy sub-nonempty elements.

For simplicity, we first provide the update equations of the network which simulates a 2-stack machine, and later we generalize them to p-stack machines. Recall that for a 2-stack machine the encoding was a 40-Cantor set representation with ε_0 = 31 and ε_1 = 39. We use the same notations for the functions γ_{ij} as in Eq. (9); let q_i represent the ith stack, and t_i represent the true (i.e., clean) value of the top of the stack. The noisy sub-stack q_{i1} updates for no-operation as in Eq. (9.1),

q^1_{i1}+ = q_i + Σ_{k=0}^{s} γ^1_{ik} x_k - 1;
the noisy sub-stack $q_{i2}$ updates for push0 as in Eq. (9.2),

$$q_{i2}^+ = \frac{q_i}{40} + \frac{31}{40} + \sum_{k} \gamma_{ik} x_k - 1;$$

the noisy sub-stack $q_{i3}$ updates for push1 as in Eq. (9.3),

$$q_{i3}^+ = \frac{q_i}{40} + \frac{39}{40} + \sum_{k} \gamma_{ik} x_k - 1;$$

and the noisy sub-stack $q_{i4}$ updates for pop as in Eq. (9.4),

$$q_{i4}^+ = 40 q_i - 8 t_i - 31 + \sum_{k} \gamma_{ik} x_k - 1.$$

For general $p$-stack machines, the bases are $10p^2$ and the update equations are given by

$$q_{ij}^+ = \text{next}_j(q_i) + \sum_{k} \gamma^{j}_{ik} x_k - 1, \qquad (24)$$

$$t_{ij}^+ = (2p + 1)\, 10p^2 \Big( \text{next}_j(q_i) + \sum_{k} \gamma^{j}_{ik} x_k - 1 \Big) - (10p^2 - 2), \qquad (25)$$

$$e_{ij}^+ = 10p^2 \Big( \text{next}_j(q_i) + \sum_{k} \gamma^{j}_{ik} x_k - 1 \Big) - (10p^2 - 4p - 2), \qquad (26)$$

where

$$\text{next}_j(q_i) = \begin{cases} q_i, & \text{if } j = 1 \text{ (not updated)} \\ \dfrac{1}{10p^2}\, q_i + \dfrac{10p^2 - 4p - 1}{10p^2}, & \text{if } j = 2 \text{ (push0)} \\ \dfrac{1}{10p^2}\, q_i + \dfrac{10p^2 - 1}{10p^2}, & \text{if } j = 3 \text{ (push1)} \\ 10p^2 q_i - 4p\, t_i - (10p^2 - 4p - 1), & \text{if } j = 4 \text{ (pop)}, \end{cases}$$

and $q_i$ and $t_i$ are the exact values of the stacks and top elements. Using Lemma 6.2, all the expressions of the type $\beta(\cdot)x$ and $\gamma(\cdot)x$ can be written as linear combinations of terms of the form $\sigma$(linear combination of the $t_{ij}$ and $e_{ij}$). These $q_{ij}$, $t_{ij}$, and $e_{ij}$ constitute the main level.

The hidden layer consists of both up to $(3^p s + 2)$ configuration-detector neurons (as proved in Lemma 6.2) and the stack and top neurons

$$q_i^2, \; t_i^2, \qquad i = 1, \ldots, p,$$

which are updated by the equations

$$q_i^{2+} := \sigma\Big( \sum_{j=1}^{4} q_{ij} \Big), \qquad t_i^{2+} := \sigma\Big( \sum_{j=1}^{4} t_{ij} \Big), \qquad i = 1, \ldots, p.$$

6.3. One-Level Network Simulates TM

Consider the above network. Remove the main level and leave the hidden level only, while letting each neuron there compute the information that it received beforehand from a neuron at the main level. This can be written as a standard network. Note that as the net consists of one level only, no "counters" are required. This ends the proof of Theorem 2. ∎
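To make the base-40 encoding concrete, here is a small sketch of the stack arithmetic that the equations above implement for $p = 2$. The push and pop formulas are the ones displayed above; the particular $\sigma$-expression we use for reading the top is our own illustrative choice, not a transcription of Eq. (9).

```python
# Base-40 "Cantor set" stack arithmetic for p = 2 stacks:
# digit 31 encodes bit 0, digit 39 encodes bit 1.
def sigma(y):
    return min(max(y, 0.0), 1.0)

def encode(bits):
    """Encode a stack (bits[0] is the top) as sum_i (31 + 8*b_i) / 40^i."""
    q = 0.0
    for b in reversed(bits):
        q = q / 40 + (31 + 8 * b) / 40       # same arithmetic as a push
    return q

def push(q, b):
    return sigma(q / 40 + (31 + 8 * b) / 40)   # push0 adds 31/40, push1 adds 39/40

def top(q):
    return sigma(40 * q - 38)    # 40q in [31,32) -> 0, 40q in [39,40) -> 1

def pop(q):
    return sigma(40 * q - 8 * top(q) - 31)     # strip the leading digit

q = encode([1, 0, 1])
assert top(q) == 1
q = pop(q)                 # stack is now [0, 1]
assert top(q) == 0
q = push(q, 1)             # stack is now [1, 0, 1] again
assert top(q) == 1
print("stack round-trip OK")
```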
7. INPUTS AND OUTPUTS
We now explain how to deduce Theorem 1 from Theorem 2. (We present details in the linear-time simulation case only. The real-time case is entirely analogous, but the notations become more complicated.) We first show how to modify a net with no inputs into one which, given the input $u_\omega(\cdot)$, produces the encoding $\delta[\omega]$ as a state coordinate and after that emulates the original net. Later we show how the output is decoded. As explained above, there are two input lines: $D = u_1$ carries the data, and $V = u_2$ validates it. In this section we concentrate on the function $\delta$, thus proving that Theorem 2(a) implies a linear-time simulation of a Turing machine via a network with input and output. A similar proof, when applied to Theorem 2(b) (while substituting $\delta$ by $\delta_p$), implies Theorem 1.
So assume that we are given a net with no inputs,

$$x^+ = \sigma(Ax + c), \qquad (27)$$

as in the conclusion of Theorem 2. Suppose that we have already found a net

$$y^+ = \sigma(Fy + g u_1 + h u_2) \qquad (28)$$

(consisting of five processors) so that, if $u_1(\cdot) = D_\omega(\cdot)$ and $u_2(\cdot) = V_\omega(\cdot)$, then with $y(0) = 0$ we have
$$y_4(\cdot) = 0 \cdots 0\, \delta[\omega]\, 0\, 0 \cdots, \qquad y_5(\cdot) = 0 \cdots 0\, 1\, 1 \cdots;$$

that is,

$$y_4(t) = \begin{cases} \delta[\omega], & \text{if } t = |\omega| + 2, \\ 0, & \text{otherwise}; \end{cases} \qquad y_5(t) = \begin{cases} 0, & \text{if } t \le |\omega| + 2, \\ 1, & \text{otherwise}. \end{cases}$$
Once this is done, modify the original net (27) as follows. The new state consists of the pair $(x, y)$, with $y$ evolving according to (28) and the equations for $x$ modified in this manner (using $A_i$ to denote the $i$th row of $A$ and $c_i$ for the $i$th entry of $c$):

$$x_1^+ = \sigma(A_1 x + c_1 y_5 + y_4),$$
$$x_i^+ = \sigma(A_i x + c_i y_5), \qquad i = 2, \ldots, n.$$

Then, starting at the initial state $y = x = 0$, clearly $x_1(t) = 0$ for $t = 0, \ldots, |\omega| + 2$ and $x_1(|\omega| + 3) = \delta[\omega]$, while, for $i > 1$, $x_i(t) = 0$ for $t = 0, \ldots, |\omega| + 3$. After time $|\omega| + 3$, as $y_5 = 1$ and $u_1 = u_2 = 0$, the equations for $x$ evolve as in the original net, so $x(t)$ in the new net equals $x(t - |\omega| - 3)$ in the original one for $t \ge |\omega| + 3$.
The system (28) can be constructed as

$$y_2^+ = \sigma(u_2)$$
$$y_3^+ = \sigma(y_2 - u_2)$$
$$y_4^+ = \sigma(y_1 + y_2 - u_2 - 1)$$
$$y_5^+ = \sigma(y_3 + y_5).$$
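The following sketch simulates this subnet. Only the updates for $y_2, \ldots, y_5$ appear above; the update for $y_1$ below, a base-4 accumulator, is our assumption (it builds the encoding of the reversed input word, which is enough to illustrate how $y_4$ emits a single pulse carrying the encoding and $y_5$ then latches to 1).

```python
# Simulation of the input-encoding subnet (28).  The y1 update is assumed,
# not taken from the paper; timing indices here start at t = 0.
def sigma(y):
    return min(max(y, 0.0), 1.0)

def run(word):
    n = len(word)
    y = [0.0] * 6                        # y[1..5] are the processors; y[0] unused
    history = []
    for t in range(n + 4):
        u1 = word[t] if t < n else 0     # data line D
        u2 = 1 if t < n else 0           # validation line V
        y = [0.0,
             sigma(y[1] / 4 + u1 / 2 + u2 / 4),   # assumed base-4 accumulator
             sigma(u2),
             sigma(y[2] - u2),
             sigma(y[1] + y[2] - u2 - 1),         # one pulse carrying the encoding
             sigma(y[3] + y[5])]                  # latches to 1 after the pulse
        history.append(list(y))
    return history

hist = run([1, 0, 1])
pulses = [(t, y[4]) for t, y in enumerate(hist) if y[4] > 0]
print(pulses)   # -> [(3, 0.859375)]: 3/4 + 1/16 + 3/64, the base-4 encoding
```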
This completes the proof of the encoding part. For the decoding process of producing the output signal, it will be sufficient to show how to build a net (of dimension 10 and with two inputs) such that, starting at the zero state and if the input sequences are $x_1$ and $x_2$, where $x_1(k) = \delta[\omega]$ for some $k$ and $x_2(t) = 0$ for $t < k$, $x_2(k) = 1$ ($x_1(t) \in [0, 1]$ for $t \ne k$, $x_2(t) \in [0, 1]$ for $t > k$), then for the processors $z_9$, $z_{10}$ it holds that

$$z_9(t) = \begin{cases} 1, & \text{if } k + 4 \le t \le k + 3 + |\omega|, \\ 0, & \text{otherwise}; \end{cases}$$

$$z_{10}(t) = \begin{cases} \omega_{t - k - 3}, & \text{if } k + 4 \le t \le k + 3 + |\omega|, \\ 0, & \text{otherwise}. \end{cases}$$
One can verify that this can be done by

$$z_1^+ = \sigma(x_2 + z_1)$$
$$z_2^+ = \sigma(z_1)$$
$$z_3^+ = \sigma(z_2)$$
$$z_4^+ = \sigma(x_1)$$
$$z_5^+ = \sigma(z_4 + z_1 - z_2 - 1)$$
$$z_6^+ = \sigma(4z_4 + z_1 - z_2 - 3)$$
$$z_7^+ = \sigma(16z_5 - 8z_7 - 6z_1 + z_6)$$
$$z_8^+ = \sigma(4z_8 - 2z_7 - z_3 + z_5)$$
$$z_9^+ = \sigma(z_8)$$
$$z_{10}^+ = \sigma(z_7).$$

In this case the output is $y = (z_{10}, z_9)$.
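The core of this decoder is repeated base-4 digit extraction by saturated-linear steps: the leading bit of the encoding $q$ is $\sigma(4q - 2)$, and the remainder is $4q - (2\cdot\text{bit} + 1)$. The loop below is our distilled illustration of what the ten processors compute together, not a line-by-line transcription of the net above.

```python
# Base-4 digit extraction from q = delta[omega] using only saturated-linear steps.
def sigma(y):
    return min(max(y, 0.0), 1.0)

def delta(word):
    q = 0.0
    for b in reversed(word):
        q = (q + 2 * b + 1) / 4
    return q

def decode(q, n):
    bits = []
    for _ in range(n):
        b = sigma(4 * q - 2)         # 4q in [1,2) -> 0, 4q in [3,4) -> 1
        bits.append(int(b))
        q = 4 * q - (2 * b + 1)      # strip the leading base-4 digit
    return bits

w = [1, 0, 0, 1, 1]
assert decode(delta(w), len(w)) == w
print("decoded", decode(delta(w), len(w)))
```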
Remark 7.1. If one would also like to achieve a resetting of the whole network after completing the operation, it is possible to add the processor

$$z_{11}^+ = \sigma(z_{10})$$

and to add to each processor that is not identically zero at this point

$$v^+ = \sigma(\cdots - z_{11} - z_{10}), \qquad v \in \{x, y, z\},$$

where "$\cdots$" is the formerly defined operation of the processor.
8. UNIVERSAL NETWORK
The number of neurons required to simulate a Turing machine consisting of $s$ states and $p$ stacks, with a slowdown of a factor of two in the computation, is

$$\underbrace{s + 12p}_{\text{main layer}} \;+\; \underbrace{3^p s + 2 + 2p}_{\text{hidden layer}}.$$
To estimate the number of processors required for a "universal" processor net, we should calculate the number $s$ discussed above, which is the number of states in the control unit of a two-stack universal Turing machine. Minsky proved the existence of a universal Turing machine having one tape with four letters and seven control states [Min67]. Shannon showed in [Sha56] how to control the number of letters and states in a Turing machine. Following his construction, we obtain a 2-letter 63-state 1-tape Turing machine. However, we are interested in a two-stack machine rather than a one-tape machine. Similar arguments to the ones made by Shannon, but for two stacks, lead us to $s = 84$. Applying the formula $3^p s + s + 14p + 2$, we conclude that there is a universal net with 870 processors. To allow for input and output to the network, we need an extra 16 neurons, thus having 886 in a universal machine.
(This estimate is very conservative. It would certainly be interesting to have a better bound. The use of multi-tape Turing machines may reduce the bound. Furthermore, it is quite possible that with some care in the construction one may be able to drastically reduce this estimate. One useful tool here may be the result in [ADO91] applied to the control unit; here we used a very inefficient simulation.)
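For concreteness, the arithmetic behind the counts quoted above can be spelled out as follows; the split into layers is the one displayed at the start of this section:

```python
# Sanity-checking the processor counts: s = 84 control states, p = 2 stacks.
s, p = 84, 2
main_layer = s + 12 * p                 # state + noisy sub-stack/top/nonempty neurons
hidden_layer = 3**p * s + 2 + 2 * p     # configuration detectors + stack/top neurons
total = main_layer + hidden_layer
assert total == 3**p * s + s + 14 * p + 2 == 870
print(total, total + 16)                # -> 870 886 (with the 16 I/O neurons)
```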
9. NON-DETERMINISTIC COMPUTATION

A non-deterministic processor net is a modification of a deterministic one, obtained by incorporating a guess input line ($\mathcal{G}$) in addition to the validation and data lines. Hence, the dynamics map of the network is now

$$\mathcal{F}: \mathbb{Q}^N \times \{0, 1\}^3 \to \mathbb{Q}^N.$$

Similarly to Definition 2.1, we define a non-deterministic processor network as follows.

Definition 9.1. A non-deterministic $\sigma$-processor net $\mathcal{N}$ with two binary inputs and a guess line is a dynamical system having a dynamics map of the form

$$\mathcal{F}(x, u, g) = \sigma(Ax + b_1 u_1 + b_2 u_2 + b_3 g + c),$$

for some matrix $A \in \mathbb{Q}^{N \times N}$ and four vectors $b_1, b_2, b_3, c \in \mathbb{Q}^N$.
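As a sketch, one synchronous update of such a net can be written directly from Definition 9.1; the weights below are illustrative values, not taken from the paper.

```python
# One update x+ = sigma(A x + b1*u1 + b2*u2 + b3*g + c), with rational weights
# and a guess bit g entering through its own input vector b3.
from fractions import Fraction as F

def sigma(y):
    return min(max(y, F(0)), F(1))

def step(A, b1, b2, b3, c, x, u1, u2, g):
    """Apply the dynamics map coordinatewise."""
    return [sigma(sum(Aij * xj for Aij, xj in zip(row, x))
                  + bi1 * u1 + bi2 * u2 + bi3 * g + ci)
            for row, bi1, bi2, bi3, ci in zip(A, b1, b2, b3, c)]

# A 2-processor example: the second processor simply latches the guess bit.
A  = [[F(1, 2), F(0)], [F(0), F(1)]]
b1 = [F(1), F(0)]; b2 = [F(0), F(0)]; b3 = [F(0), F(1)]
c  = [F(0), F(0)]
x = [F(0), F(0)]
for u1, u2, g in [(1, 1, 0), (0, 1, 1), (1, 1, 0)]:
    x = step(A, b1, b2, b3, c, x, u1, u2, g)
print(x)
```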
A formal non-deterministic network is one that adheres to an input-output encoding convention similar to the one for deterministic systems, as in Section 2. For each pair of words $\omega, \gamma \in \{0, 1\}^+$, the input to the network is encoded as

$$u_{\omega, \gamma}(t) = (V_\omega(t), D_\omega(t), \mathcal{G}_\gamma(t)), \qquad t = 1, \ldots,$$

where the input lines $V_\omega$ and $D_\omega$ are the same as described in Section 2 and $\mathcal{G}_\gamma$ is defined as

$$\mathcal{G}_\gamma(t) = \begin{cases} \gamma(t), & \text{if } t \le |\gamma|, \\ 0, & \text{otherwise}. \end{cases}$$

The output encoding is the same as the one used for the deterministic networks.
We restrict our attention to formal non-deterministic networks that compute binary output values only, that is, $\phi_{\mathcal{N}}(\omega, \gamma) \in \{0, 1\}$, where $\phi_{\mathcal{N}}(\omega, \gamma)$ is the function computed by a formal non-deterministic network when the input is $\omega \in \{0, 1\}^+$ and the guess is $\gamma \in \{0, 1\}^+$. The language $L$ computed by a non-deterministic formal network $\mathcal{N}$ in time $B$ is

$$L = \{ \omega \in \{0, 1\}^+ \mid \exists \text{ a guess } \gamma_\omega, \; \phi_{\mathcal{N}}(\omega, \gamma_\omega) = 1, \; |\gamma_\omega| \le T_{\mathcal{N}}(\omega) \le B(|\omega|) \}.$$

Note that the input $\omega$ and the guess $\gamma_\omega$ do not have to have the same length; the only requirement regarding their synchronization is that $\omega(1)$ and $\gamma(1)$ appear as an input to the network simultaneously. The function $T_{\mathcal{N}}$ is the amount of time required to compute the response to a given input $\omega$, and its bound, the function $B$, is called the computation time. The length of the guess, $|\gamma_\omega|$, is bounded by the function $T_{\mathcal{N}}$.
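The acceptance condition above can be phrased operationally: $\omega \in L$ iff some guess string of length at most $T_{\mathcal{N}}(\omega)$ drives the network to output 1. A brute-force membership check in this style, with the network abstracted as a black-box function $\phi$ and a purely hypothetical $\phi$ used for the demonstration, looks as follows:

```python
# Exhaustive search over guesses, illustrating the definition of L.
from itertools import product

def accepts(phi, omega, time_bound):
    return any(phi(omega, list(gamma)) == 1
               for n in range(1, time_bound + 1)
               for gamma in product((0, 1), repeat=n))

phi = lambda omega, gamma: 1 if gamma == omega else 0   # hypothetical verifier
print(accepts(phi, [1, 0, 1], time_bound=4))            # -> True
```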
When restricted to the case of language recognition rather than general function computing, Theorems 1 and 2 can be restated for the non-deterministic model, in which $\mathcal{N}$ is a non-deterministic processor net and $\mathcal{M}$ is a non-deterministic Turing machine. The proofs are similar and are omitted.
ACKNOWLEDGMENT

We thank Robert Solovay for his useful comments during the early stages of this research.
REFERENCES
[ADO91] N. Alon, A. K. Dewdney, and T. J. Ott, Efficient simulation of finite automata by neural nets, J. Assoc. Comput. Mach. 38, No. 2 (1991), 495-514.
[Bat91] R. Batruni, A multilayer neural network with piecewise-linear structure and back-propagation learning, IEEE Trans. Neural Networks 2 (1991), 395-403.
[BR88] J. Berstel and C. Reutenauer, "Rational Series and Their Languages," Springer-Verlag, Berlin, 1988.
[BSS89] L. Blum, M. Shub, and S. Smale, On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions, and universal machines, Bull. Amer. Math. Soc. 21 (1989), 1-46.
[BV88] J. R. Brown, M. M. Garber, and S. F. Vanable, Artificial neural network on a SIMD architecture, in "Proceedings, 2nd Symposium on the Frontier of Massively Parallel Computation, Fairfax, 1988," pp. 43-47.
[CSSM89] A. Cleeremans, D. Servan-Schreiber, and J. McClelland, Finite state automata and simple recurrent networks, Neural Comput. 1, No. 3 (1989), 372.
[Elm90] J. L. Elman, Finding structure in time, Cog. Sci. 14 (1990), 179-211.
[FG90] S. Franklin and M. Garzon, Neural computability, in "Progress in Neural Networks" (O. M. Omidvar, Ed.), pp. 128-144, Ablex, Norwood, NJ, 1990.
[GF89] M. Garzon and S. Franklin, Neural computability, in "Proceedings, 3rd Int. Joint Conf. Neural Networks, 1989," Vol. 2, pp. 631-637.
[GMC+92] C. L. Giles, C. B. Miller, D. Chen, H. H. Chen, G. Z. Sun, and Y. C. Lee, Learning and extracting finite state automata with second-order recurrent neural networks, Neural Comput. 4, No. 3 (1992), 393-405.
[HS87] R. Hartley and H. Szu, A comparison of the computational power of neural network models, in "Proceedings, IEEE Conf. Neural Networks, 1987," pp. 17-22.
[HT85] J. J. Hopfield and D. W. Tank, Neural computation of decisions in optimization problems, Biol. Cybern. 52 (1985), 141-152.
[HU79] J. E. Hopcroft and J. D. Ullman, "Introduction to Automata Theory, Languages, and Computation," Addison-Wesley, Reading, MA, 1979.
[Kle56] S. C. Kleene, Representation of events in nerve nets and finite automata, in "Automata Studies" (C. E. Shannon and J. McCarthy, Eds.), pp. 3-41, Princeton Univ. Press, Princeton, NJ, 1956.
[Lip87] R. P. Lippmann, An introduction to computing with neural nets, IEEE Acoust. Speech Signal Process. Mag. April (1987), 4-22.
[Min67] M. L. Minsky, "Computation: Finite and Infinite Machines," Prentice-Hall, Englewood Cliffs, NJ, 1967.
[MSS91] W. Maass, G. Schnitger, and E. D. Sontag, On the computational power of sigmoid versus Boolean threshold circuits, in "Proceedings, 32nd Annu. Sympos. on Foundations of Computer Science, 1991," pp. 767-776.
[MW89] C. M. Marcus and R. M. Westervelt, Dynamics of iterated-map neural networks, Phys. Rev. Ser. A 40 (1989), 3355-3364.
[PE74] M. Pour-El, Abstract computability and its relation to the general purpose analog computer, Trans. Amer. Math. Soc. 290 (1974), 1-29.
[Pol87] J. B. Pollack, "On Connectionist Models of Natural Language Processing," Ph.D. thesis, Computer Science Dept., Univ. of Illinois, Urbana, 1987.
[RTY90] J. H. Reif, J. D. Tygar, and A. Yoshida, The computability and complexity of optical beam tracing, in "Proceedings, 31st Annu. Sympos. on Foundations of Computer Science, 1990," pp. 106-114.
[SCLG91] G. Z. Sun, H. H. Chen, Y. C. Lee, and C. L. Giles, Turing equivalence of neural networks with second-order connection weights, in "Proceedings, International Joint Conference on Neural Networks, IEEE, 1991."
[Sha56] C. E. Shannon, A universal Turing machine with two internal states, in "Automata Studies" (C. E. Shannon and J. McCarthy, Eds.), pp. 156-165, Princeton Univ. Press, Princeton, NJ, 1956.
[Son90] E. D. Sontag, "Mathematical Control Theory: Deterministic Finite Dimensional Systems," Springer-Verlag, New York, 1990.
[Son92] E. D. Sontag, Feedforward nets for interpolation and classification, J. Comput. System Sci. 45 (1992), 20-48.
[SS94] H. T. Siegelmann and E. D. Sontag, Analog computation via neural networks, Theoret. Comput. Sci. 131 (1994), 331-360.
[WM43] W. Pitts and W. S. McCulloch, A logical calculus of ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943), 115-133.
[Wol91] D. Wolpert, "A Computationally Universal Field Computer Which Is Purely Linear," Technical Report LA-UR-91-2937, Los Alamos National Laboratory, 1991.
[WZ89] R. J. Williams and D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Comput. 1, No. 2 (1989).
[ZZZ92] B. Zhang, L. Zhang, and H. Zhang, A quantitative analysis of the behavior of the PLN network, Neural Networks 5 (1992), 639-661.