Learning, invariance, and generalization in high-order neural networks

C. Lee Giles and Tom Maxwell

High-order neural networks have been shown to have impressive computational, storage, and learning capabilities. This performance derives from the fact that the order or structure of a high-order neural network can be tailored to the order or structure of a problem. Thus, a neural network designed for a particular class of problems becomes specialized but also very efficient in solving those problems. Furthermore, a priori knowledge, such as geometric invariances, can be encoded in high-order networks. Because this knowledge does not have to be learned, these networks are very efficient in solving problems that utilize this knowledge.

I. Introduction

A key facility of neural networks is their ability to recognize invariances, or to extract essential parameters from complex high-dimensional data. The process of recognizing invariants requires the construction of complex mappings, which utilize internal representations, from a higher dimensional input space to a lower dimensional output space. The internal representations of these mappings capture high-order correlations which embody the invariant relationships between the input data stream and the output.

Typically, an internal representation is constructed from a group of first-order units via a learning rule such as backpropagation. First-order units are units which are linear in the sense that they can capture only first-order correlations. They can be represented by

y(i,x) = S[ Σ_{j=1..N} W(i,j) x(j) ],

where x = {x(j)} is an input vector (bold type indicates vectors), W(i,j) is a set of adaptive parameters (weights), N is the number of elements in the input vector x, and S is an unspecified sigmoid function.

We suggest that the use of building blocks which are able to capture high-order correlations is a very efficient method for building high-order representations. In many cases, greatly enhanced performance in learning, generalization, and knowledge (symbol) representation can be achieved by crafting networks to reflect the high-order correlational structure of the environment in which they are designed to operate. Our research in this area has focused on high-order units which can be represented by the equations

C. L. Giles is with U.S. Air Force Office of Scientific Research, Bolling Air Force Base, Washington, DC 20332, and T. Maxwell is with Sachs/Freeman Associates, 1401 McCormick Drive, Landover, Maryland 20785.

Received 14 August 1987.

y(i,x) = S[ W0(i) + Σ_j W1(i,j) x(j) + Σ_j Σ_k W2(i,j,k) x(j) x(k) + ... ],

where the higher-order weights capture higher-order correlations. A unit which includes terms up to and including degree k will be called a kth order unit.
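As a minimal illustrative sketch (not part of the original work), a single second-order unit of this form can be written in Python/NumPy as follows; the function and variable names are assumptions chosen for clarity, and tanh stands in for the unspecified sigmoid S.

```python
import numpy as np

def second_order_unit(x, W0, W1, W2, S=np.tanh):
    """y = S[ W0 + sum_j W1(j) x(j) + sum_{j,k} W2(j,k) x(j) x(k) ]."""
    net = W0 + W1 @ x + x @ W2 @ x
    return S(net)

# Example: a random 4-element (+1,-1) input and random weights.
rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=4)
W0, W1, W2 = 0.0, rng.normal(size=4), rng.normal(size=(4, 4))
print(second_order_unit(x, W0, W1, W2))
```

Higher orders follow the same pattern, with one additional weight index and one additional factor of x per order.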

We have investigated various topics with high-order networks: learning and generalization, implementation of invariances, associative memory, temporal sequences, and computational feasibility of high-order approaches.1-5 This paper is an introduction to the aforementioned work, and that work should be consulted for additional details and explanations.

II. Background

Early in the history of neural network research it was known that nonlinearly separable subsets of pattern space can be dichotomized by nonlinear discriminant functions.6 Attempts to adaptively generate useful discriminant functions led to the study of threshold logic units (TLUs). The most famous TLU is the perceptron,7 which in its original form was constructed from randomly generated functions of arbitrarily high order. Minsky and Papert8 studied TLUs of all orders, and came to the conclusions that high-order TLUs were impractical due to the combinatorial explosion of high-order terms, and that first-order TLUs were too limited to be of much interest. Minsky and Papert also showed that single feed-forward slabs of first-order TLUs can implement only linearly separable mappings. Since most problems of interest are not linearly separable, this is a very serious limitation. One alternative is to cascade slabs of first-order TLUs. The units embedded in the cascade (hidden units) can then combine the outputs of previous units and generate nonlinear maps. However, training in cascades is very difficult7 because there is no simple way to provide the hidden units with a training signal. Multislab learning rules require thousands of iterations to converge, and sometimes do not converge at all, due to the local minimum problem.

These problems can be overcome by using single slabs of high-order TLUs. The high-order terms are equivalent to previously specified hidden units, so that a single high-order slab can now take the place of many slabs of first-order units. Since there are no hidden units to be trained, the extremely fast and reliable single-slab learning rules can be used.

More recent research involving high-order correlations includes optical implementations,9,10 high-order conjunctive connections,11 sigma-pi units,12 associative memories,4,5,9,18 and a high-order extension of the Boltzmann machine.13

III. High-Order Neural Networks

A high-order neuron can be defined as a high-order threshold logic unit (HOTLU) which includes terms contributed by various high-order weights. Usually, but not necessarily, the output of a HOTLU is (1,0) or (+1,-1). A high-order neural network slab is defined as a collection of high-order threshold logic units (HOTLUs). A simple HOTLU slab can be described by

y(i) = S[net(i)] = S[T0(i) + T1(i) + ... + Tk(i)],   (1)

where y(i) is the output of the ith high-order neuron unit, and S is a sigmoid function. Tn(i) is the nth order term for the ith unit, and k is the order of the unit. The zeroth-order term is an adjustable threshold, denoted by W0(i). The nth order term is a linear weighted sum over nth order products of inputs, examples of which are

T1(i) = Σ_j W1(i,j) x(j),    T2(i) = Σ_j Σ_k W2(i,j,k) x(j) x(k),   (2)

where x(j) is the jth input to the ith high-order neuron, and Wn(i,j, ...) is an adjustable weight which captures the nth order correlation between an nth order product of inputs and the output of the unit. Several slabs can be cascaded to produce multislab networks by feeding the output of one slab to another slab as input. (The sigma-pi12 neural networks are multilevel networks which can have high-order terms at each level. As such, most of the neural networks described here can be considered as a special case of the sigma-pi units. A learning algorithm for these networks is generalized back-propagation. However, the sigma-pi units as originally formulated did not have invariant weight terms, though it is quite simple to incorporate such invariances in these units.)

The learning process involves implementing a specified mapping in a neural network by means of an iterative adaptation of the weights based on a particular learning rule and the network's response to a training set. The mapping to be learned is represented by a set of examples consisting of a possible input vector paired with a desired output. The training set is a subset of the set of all possible examples of the mapping. The implementation of the learning process involves sequentially presenting to the network examples of the mapping, taken from the training set, as input-output pairs. Following each presentation, the weights of the network are adjusted so that they capture the correlational structure of the mapping. A typical single-slab learning rule is the perceptron rule, which for the second-order update rule can be expressed as

W2'(i,j,k) = W2(i,j,k) + [t(i) - y(i)] x(j) x(k).   (3)

Here t(i) is the target output and y(i) is the actual output of the ith unit for input vector x. Similar learning rules exist for the other Wn terms. If the network yields the correct output for each example input in the training set, we say that the network has converged, or learned the training set. If, after learning the training set, the network gives the correct output on a set of examples of the mapping that it has not yet seen, we say that the network has generalized properly.
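A hedged sketch of the perceptron-style updates, with Eq. (3) applied to the second-order weights and the analogous rules applied to the lower orders (the function and variable names are illustrative assumptions):

```python
import numpy as np

def perceptron_update(W0, W1, W2, x, t, y):
    """One presentation: W2(j,k) += [t - y] x(j) x(k) as in Eq. (3),
    with the analogous corrections for W0 and W1."""
    err = t - y
    return W0 + err, W1 + err * x, W2 + err * np.outer(x, x)
```

The update is applied after each presentation of a training example, and the pass over the training set is repeated until the unit reproduces every target.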

Another learning rule which we have investigated is the outer product rule, sometimes referred to as the Hebbian learning rule. In this form of learning, the correlation matrix is formed in one shot via the equations:

W2(i,j,k) = Σ_{s=1..Np} [ys(i) - ȳ(i)] [xs(j) - x̄(j)] [xs(k) - x̄(k)],

ȳ(i) = [Σ_s ys(i)]/Np,    x̄(j) = [Σ_s xs(j)]/Np,

where Np denotes the number of patterns in the training set. The training set is denoted by {(xs,ys) | s ∈ (1,Np)}, and ȳ and x̄ are the averages of ys and xs over the training set. In cases in which x̄ is nearly zero [which often is the case when (+1,-1) input encoding is used], simplicity may be gained in the above equations at the expense of a small reduction in performance by setting the x̄ and ȳ terms to zero. This illustrates the reduction in computation often achieved with the proper choice of data representation. (Unless otherwise specified, all training patterns discussed in this paper are represented by +1's in a -1 background.) Analogous equations hold for other orders.
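A minimal sketch of the one-shot outer product construction, assuming the training patterns are stacked as rows of arrays X (inputs) and Y (outputs); with center=False the mean-subtraction terms are dropped, as the text suggests for roughly zero-mean (+1,-1) encodings. The function name is illustrative.

```python
import numpy as np

def hebbian_W2(X, Y, center=True):
    """W2(i,j,k) = sum_s [ys(i) - ybar(i)] [xs(j) - xbar(j)] [xs(k) - xbar(k)].
    X has shape (Np, N); Y has shape (Np, M)."""
    if center:
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
    return np.einsum('si,sj,sk->ijk', Y, X, X)
```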

IV. Learning the EXCLUSIVE-OR

Learning the EXCLUSIVE-OR (see below) has been the classic difficult problem for first-order units because it is the simplest nonlinearly separable problem. Training a hidden unit to perform this function requires thousands of iterations of the fastest learning rules.

EXCLUSIVE-OR

x(1)   x(2)   t(x)
 1      1      1
 1     -1     -1
-1      1     -1
-1     -1      1

An alternative method is to use a fixed hidden unit. An equivalent approach is to train a second-order unit to solve this problem. The training set consists of the EXCLUSIVE-OR values given above. Consider a second-order unit of the form:

y(x) = sgn[W1(1) x(1) + W1(2) x(2) + W2 x(1) x(2)],   (4)

with the learning rules:

W1'(i) = W1(i) + [t(x) - y(x)] x(i),
W2' = W2 + [t(x) - y(x)] x(1) x(2).

Here, t(x) is the correct output for input x = (x(1),x(2)). The signum function sgn[x] is defined as +1 when x > 0 and -1 when x < 0. The second-order term is equivalent to a handcrafted hidden unit, so that the single second-order unit takes the place of a two-layer cascade. Since there is no correlation between x(1) or x(2) and the desired output t(x), the W1(i) terms average to zero in the learning process. However, since x(1)x(2) is perfectly correlated with t(x), W2 is incremented positively at each step of the learning procedure, so that the training process converges in one iteration to the solution W1(i) = 0 and W2 > 0. One iteration is defined as training of the network over all members of the training set. This high-order unit will learn an arbitrary binary map in one iteration of the above learning rule.
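The following toy sketch (illustrative, not from the original paper) runs the second-order unit of Eq. (4) with the above learning rules over the four EXCLUSIVE-OR patterns; ties at sgn(0) are broken toward -1, an assumption not specified in the text. One pass leaves W1 at zero and W2 positive, reproducing the one-iteration convergence described above.

```python
import numpy as np

# Truth table from the text: t(x) = x(1) x(2) in the (+1,-1) encoding.
patterns = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

W1 = np.zeros(2)          # first-order weights W1(1), W1(2)
W2 = 0.0                  # the single second-order weight
sgn = lambda v: 1.0 if v > 0 else -1.0

for x1, x2, t in patterns:            # one iteration = one pass over the set
    y = sgn(W1[0] * x1 + W1[1] * x2 + W2 * x1 * x2)
    err = t - y
    W1 += err * np.array([x1, x2])    # first-order rule
    W2 += err * x1 * x2               # second-order rule
print(W1, W2)                         # W1 ends at zero, W2 ends positive
```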

V. Implementation of Invariances

In high-order networks it is possible to handcraft the units such that their output is invariant under the action of an arbitrary finite group of transformations on the input space.2 A unit is invariant under the action of a transformation group G = {g} iff:

y(gx) = y(x)   for all g ∈ G.   (5)

This invariance is imposed by averaging the weight matrices over the group, thus eliminating the unit's ability to detect correlations which are incompatible with the imposed group invariance. In particular, we can handcraft a G-invariant unit from the HOTLU described by Eq. (1). This handcrafted unit is described by the equation

y(x) = S[ Σ_{g∈G} net(gx) ],   (6)

where the sum is over all members of the group G and the net operator is defined as in Eq. (1). To see that this unit is invariant under the group G, note that for h ∈ G, we have

y(hx) = S[ Σ_{g∈G} net(ghx) ] = S[ Σ_{g'∈G} net(g'x) ] = y(x),   (7)

where we have made the substitution g' = gh. The second step follows from the fact that a sum over g = g'h⁻¹ is equivalent to a sum over g', since multiplying all terms in a sum over a group by a member of the same group simply results in a permutation of the terms in the sum. In some important cases, this averaging process allows a collapse of the weight matrix when redundant terms are eliminated, analogous to the reduction in dimension which occurs in the transformation from absolute to relative coordinates.
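A minimal sketch of the group averaging of Eq. (6), here using the two-element sign-change group that appears in the example below; the function names are assumptions, and tanh again stands in for S. Applying any group member to the input leaves the output unchanged.

```python
import numpy as np

def net(x, W0, W1, W2):
    """net(i) of Eq. (1), truncated at second order."""
    return W0 + W1 @ x + x @ W2 @ x

def invariant_unit(x, W0, W1, W2, group, S=np.tanh):
    """Eq. (6): y(x) = S[ sum_{g in G} net(gx) ]."""
    return S(sum(net(g(x), W0, W1, W2) for g in group))

sign_change_group = [lambda v: v, lambda v: -v]   # identity and global sign flip

rng = np.random.default_rng(1)
x = rng.choice([-1.0, 1.0], size=2)
W0, W1, W2 = 0.0, rng.normal(size=2), rng.normal(size=(2, 2))
print(invariant_unit(x, W0, W1, W2, sign_change_group),
      invariant_unit(-x, W0, W1, W2, sign_change_group))   # identical outputs
```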

As an example of the power of using the group invariance operators, we will use the sign-change group as an alternative solution to the EXCLUSIVE-OR problem. The EXCLUSIVE-OR is invariant under the action of the sign-change group, members of which are denoted by {Si, i ∈ {0,1}}:

S0{x(j)} = x(j),    S1{x(j)} = -x(j).

Averaging the sign-change group over the second-order unit of Eq. (4) yields the simplification

y(x) = S{ Σ_j W1(j)[x(j) - x(j)] + W2[x(1)x(2) + x(1)x(2)] }
     = S[2 W2 x(1)x(2)].

The problem is essentially solved; all that remains is to determine the sign of W2.

Let us now illustrate the implementation of translation invariance in a more intuitive fashion. Suppose we are given a second-order unit described by the equation

y(x) = S[ Σ_j Σ_k W2(j,k) x(j) x(k) + ... ].

We wish to determine what constraints must be placed on the W matrix in order to ensure that the unit's output y is invariant under the transformation group T. Define T = {g(m)} such that g(m)x(k) = x(m + k). If we apply this group operator to the previous equation, the result is

y[x] = y[g(m)x]   for all g(m) ∈ T.

Expanding the right-hand side of this equation yields

y[g(m)x] = S[ Σ_j Σ_k W2(j,k) x(j + m) x(k + m) + ... ]
         = S[ Σ_j Σ_k W2(j - m, k - m) x(j) x(k) + ... ],

where in the last step we have assumed either that periodic boundary conditions are appropriate or that edge effects are negligible. A comparison of the above equations reveals that the constraint that must be placed on W2 to ensure translation invariance is

W2(j,k) = W2(j - m, k - m)   (translation invariance constraint).

To understand this condition, let us note that W2 is a function of a pattern composed of two points located at j and k. This constraint stipulates that W2 have the same value when evaluated at any pattern (j - m, k - m) which is a translated version of the original pattern (j,k). In other words, to ensure translation invariance, W2 must depend only on the equivalence class of patterns under translation; then the weight function W2(j,k) will not distinguish between any shifted pattern in the input space. Such patterns are described as translation-equivalent. This constraint can be generalized to implement any invariance under an arbitrary transformation group.

Another way of expressing the invariance constraint derived in the previous paragraph becomes clear if we note that the set of translation equivalence classes of two point patterns is given by the relative coordinate dj = j - k. In other words, every possible value of dj corresponds to a single equivalence class and vice versa. We now note that this condition is equivalent to constraining W2 such that it depends only on the relative coordinate dj and not on the absolute coordinates j and k (which would distinguish between different patterns belonging to the same equivalence class). This leads to the condition on W2 that

W2(j,k) = W2(dj)   (translation invariance solution).

Thus, the implementation of translation invariance in Eq. (1) is equivalent to redefining the correlation matrices such that they depend only on relative position dj and not absolute position j. The first- and second-order invariant terms are

T1(i) = W1(i) Σ_j x(j),    T2(i) = Σ_dj W2(i,dj) Σ_j x(j) x(j + dj).   (8)

The expression for T1 is not very interesting; a sum over the input is, of course, translation-invariant. The expression for T2 should not be surprising. The translation-invariant properties of the autocorrelation have long been known and used in optical processing. What is different here is that arbitrary scale factors W1 and W2 are being learned by the neural network, and W2 weights the autocorrelation. Optical implementations of this approach are discussed in Ref. 10.
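A brief sketch of the invariant terms of Eq. (8) for a 1-D input string, assuming periodic boundary conditions (np.roll); W1 is the single scale factor and W2_dj weights the autocorrelation at each shift dj. The names and the choice of shift range are illustrative.

```python
import numpy as np

def invariant_terms(x, W1, W2_dj):
    """T1 = W1 sum_j x(j);  T2 = sum_dj W2(dj) sum_j x(j) x(j + dj)."""
    T1 = W1 * x.sum()
    autocorr = np.array([x @ np.roll(x, -dj) for dj in range(len(W2_dj))])
    T2 = W2_dj @ autocorr
    return T1, T2

x = np.array([1, -1, -1, 1, 1, -1], dtype=float)
W2_dj = np.array([0.2, -0.1, 0.3])
print(invariant_terms(x, 0.5, W2_dj))
print(invariant_terms(np.roll(x, 2), 0.5, W2_dj))   # same values for a shifted input
```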

An error-correcting perceptron learning rule for the second-order term can be written as

W2'(i,dj) = W2(i,dj) + [t(i) - y(i)] Σ_j x(j) x(j + dj).   (9)

An invariant outer product or Hebbian rule can also be easily formulated. A single unit may have both invariant and noninvariant terms. Many other invariances (including rotation and scale) can be implemented in this manner. Simulations indicate that prior implementation of known invariances can be a very powerful step in the adaptation of a network to a problem environment.

We have implemented translation-invariant networks for both pattern recognition2,3 and associative memory4,5 applications. The associative network's attractors are localized patterns of activity which are independent of their absolute position on the neural slab. We conjecture that the implementation of invariances may be an important step in the construction of neural network symbol processors.

VI. Competitive Learning

Competitive learning algorithms, discussed in Ref. 14, have the advantage that they are capable of classifying input patterns without requiring a training signal t(i) as in the supervised learning rules discussed above. The drawback of most of these algorithms is that there is no way to tell a priori if the classification categories generated by the network will be useful or interesting. In this section we discuss a method, based on group theory, for handcrafting classification categories for high-order competitive learning networks.

Since a group G-invariant HOTLU yields a response that is dependent only on the equivalence class of the input pattern, such a unit will naturally function as a pattern classifier, with its classification categories determined by the group G. Geometrical feature detectors can be constructed in this way, provided we can define the features as invariants of transformation groups. Constructing an architecture in which each unit of the network responds to a distinct feature involves creating competition between the feature detectors, together with training rules which prevent one unit from capturing many features. Rules of this sort are discussed in Ref. 14. Here we discuss an application in which different units naturally capture distinct categories.

Consider the problem of constructing a HOTLU that will discriminate between horizontal and vertical lines. Since these two pattern classes belong to different equivalence classes under the translation group, a translation-invariant HOTLU will yield one output for all horizontal lines and another output for all vertical lines. The probability that the outputs will be the same is effectively zero given enough precision in the training. However, ambiguous patterns such as diagonal lines will cause either of the outputs to respond. The classifier network consists of two competing second-order units with linear threshold functions, described by the equations

y(i,x) = Σ_dj W2(i,dj) Σ_j x(j) x(j + dj),   (10)

where the subscript j = {jx,jy} represents position in a 2-D pattern. The tensor W2 is initialized randomly and normalized, so that

Σ_dj W2(i,dj) = 1   for i ∈ {1,2}.   (11)

Both units are fed the same binary input pattern x, where x(j) ∈ {+1,-1}, displayed on an N × N grid. The input pattern set consists of a set of horizontal and vertical lines presented in random order. The output of the system, z, is arbitrarily defined to be 1 if y(1) > y(2) and 2 if y(2) > y(1). The case y(1) = y(2), which never occurred in our simulations, would result in an error condition and renormalization of the weights.

Simulations performed over randomly generated presentations of the training set indicate that this system (with few iterations over the training set) always yields a 1 for any horizontal line and a 2 for any vertical line (or vice versa). The initial weights were randomly chosen, and the training set consisted of six horizontal and vertical lines presented over a 6 × 6 grid. The two units can now be sensitized to their respective pattern classes by updating the weights at each input pattern presentation via the learning rule. For each input pattern presentation, the unit whose output y(i) is a maximum updates its weights with the learning rule:

W2'[z(x),dj] = W2[z(x),dj] + AL Σ_j x(j) x(j + dj),   (12)

where AL (set arbitrarily at 0.5) controls the rate of learning. The unit whose output y(i) is a minimum does not change its weights. The weights are also renormalized at each step, so that Eq. (11) remains valid for W2. The problem of one unit capturing both categories, which did not arise in this example, can be dealt with in general by allowing the losing HOTLU to learn also with a reduced AL value.
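A compact sketch of the two-unit competitive scheme of Eqs. (10)-(12) on a 6 × 6 grid, with periodic boundaries and AL = 0.5 as in the text; the line generator and all names are illustrative assumptions. The simple winner-take-all version below omits the reduced-rate learning for the losing unit, so with unlucky initial weights one unit can still capture both classes.

```python
import numpy as np

N, AL = 6, 0.5
rng = np.random.default_rng(2)

def autocorr2d(x):
    """sum_j x(j) x(j + dj) for every 2-D shift dj, periodic boundaries."""
    return np.array([[np.sum(x * np.roll(np.roll(x, -a, 0), -b, 1))
                      for b in range(N)] for a in range(N)]).ravel()

W = rng.random((2, N * N))
W /= W.sum(axis=1, keepdims=True)        # normalization of Eq. (11)

def random_line(horizontal):
    x = -np.ones((N, N))
    if horizontal: x[rng.integers(N), :] = 1
    else:          x[:, rng.integers(N)] = 1
    return x

for _ in range(50):                      # random presentations of lines
    a = autocorr2d(random_line(horizontal=rng.random() < 0.5))
    winner = int(np.argmax(W @ a))       # z(x): the unit with the larger output
    W[winner] += AL * a                  # Eq. (12), winner-take-all
    W[winner] /= W[winner].sum()         # renormalize

# Inspect the two unit responses to a fresh horizontal and a fresh vertical line.
print(W @ autocorr2d(random_line(True)), W @ autocorr2d(random_line(False)))
```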

The learning process endows the system with the capacity to generalize and sort arbitrary patterns according to their degree of horizontalness or verticalness. The translation invariance gives the net the capability to discriminate only between different pattern classes, independent of the pattern position. The capacity of generalization, involving the clustering of equivalent pattern classes, arises in this case from the learning process. We speculate that this generalization capacity could also have been implemented, however, by the imposition of a higher level invariance constraint.11 Although in this example we considered only two-category classification, the architectures discussed can be readily expanded to classify N categories with N output units.

It is useful to compare the capabilities of this high-order translation-invariant neural net with the layered first-order network in Ref. 12. The high-order competitive network learned the problem with orders of magnitude reduction in iterations and did not require any additional special teaching patterns. Furthermore, Rumelhart, Hinton, and Williams required a factor of 2 increase in the number of input and hidden units.

VII. TC Problem

The TC problem involves distinguishing between a shifted and rotated T and C. If the T and C are represented as patterns embedded in a 3 × 3 square array (as shown below), it is not difficult to see that one must simultaneously inspect 3 squares in the array to discriminate between these letters:

T:          C:
XXX         XXX
 X          X
 X          XXX

Thus, the TC problem as defined here is a third-order problem.8 The training patterns consist of single letters inscribed in 3 × 3 squares, each of which occurs in an arbitrary position and orientation in a larger (10 × 10) spatial array. These training patterns are randomly presented to the HOTLU. The correct classification can be obtained with training over only a few iterations of the training set using a pair of translation- and rotation-invariant competitive HOTLUs as discussed above. Here we will examine another approach which only learns the rotated letters by supervised perceptron-like learning. The shifted patterns are not learned in individual shifted positions, but from one arbitrary position due to the encoded translation-invariant network. In this case the solution requires a single third-order translation-invariant HOTLU with short-range interconnects (a 3 × 3 window). Dynamics are governed by the equation

y(i,x) = sgn[ Σ_dj1 Σ_dj2 W3(i,dj1,dj2) Σ_j x(j) x(j + dj1) x(j + dj2) ],   (13a)

with the weight update rule

W3'(i,dj1,dj2) = W3(i,dj1,dj2) + [t(i) - y(i)] Σ_j xs(j) xs(j + dj1) xs(j + dj2).   (13b)

The dj1 and dj2 sums are limited to the 3 × 3 window. In general, for an m × m window, this architecture will require less than m^(2k-2) adaptive weights, where k is the order of the HOTLU. For initial random weight values, the training procedure converges to a correct set of weights in one to ten iterations. Note that all known methods of solving this problem by training hidden units in cascades require more units and adaptive weights, and utilize learning rules which usually take at least thousands of iterations to converge. (An alternate approach to the TC problem is to put translation-invariant second-order weights in the input layer of a single-hidden-layer feed-forward network. The output layer weights are first order. Generalized back-propagation solves this problem with an order of magnitude reduction in learning time compared to the learning time taken for the same network with only first-order weights.19)
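A minimal sketch of the forward pass of Eq. (13a): a third-order translation-invariant unit on a 10 × 10 array whose dj1 and dj2 shifts are limited to a 3 × 3 window, with periodic boundaries assumed; the test pattern and all names are illustrative. Shifting the input leaves the output unchanged.

```python
import numpy as np

GRID, WIN = 10, 3        # 10 x 10 input array, 3 x 3 window of shifts

def shifted(x, da, db):
    """x(j + dj) with periodic boundaries on the 2-D grid."""
    return np.roll(np.roll(x, -da, axis=0), -db, axis=1)

def third_order_invariant(x, W3):
    """Eq. (13a): sgn of a sum of translation-invariant third-order terms."""
    net = 0.0
    for da1 in range(WIN):
        for db1 in range(WIN):
            for da2 in range(WIN):
                for db2 in range(WIN):
                    net += W3[da1, db1, da2, db2] * np.sum(
                        x * shifted(x, da1, db1) * shifted(x, da2, db2))
    return np.sign(net)

rng = np.random.default_rng(3)
x = -np.ones((GRID, GRID))
x[2, 2:5] = 1; x[3, 3] = 1; x[4, 3] = 1           # a T inscribed in a 3 x 3 square
W3 = rng.normal(size=(WIN, WIN, WIN, WIN))
print(third_order_invariant(x, W3),
      third_order_invariant(np.roll(x, 4, axis=1), W3))   # same output when shifted
```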

VIII. Generalization in Neural Networks

An inductive inference problem, as characterized by Angluin and Smith,15 consists of five components:

(1) a rule to be inferred, generally characterized by a set of examples;

(2) a hypothesis space, that is, a space of descriptions such that any admissible rule can be described by a point in the space;

(3) a training set, or a sequence of examples that constitute admissible presentations of the rule;

(4) an inference method; and
(5) the criterion for a successful inference.

Roughly speaking, the generalization problem can be stated as: given a sequence of examples of the rule, use the inference method to search through the hypothesis space until a point is reached which satisfies the criterion for successful induction.

The neural network generalization problem (as defined here) is a special case of the problem outlined above. An admissible rule in this case is a mapping from a set of possible network input patterns to a set of possible network output patterns. The hypothesis space of a neural network is defined by the architecture of the network; a point in hypothesis space is characterized by a specific set of weights (or a point in weight space). The training set consists of some subset of the set of all input patterns for which the rule is defined (together with the associated outputs). The inference method is a learning rule which modifies the weights of the network, generally in response to feedback information on how well the network performed on a given example of the rule.


Fig. 1. Generalization capability vs percentage size of training set for the two-three clump discrimination problem using a second-order neural network. The string length is Nx = 9. The upper curve (circles) is for a second-order net with a second-order translation-invariant term. The lower curve (squares) is for a second-order neural network, no invariance.

The test of generalization consists in training the network until it performs perfectly on the training set, and then measuring its performance on a set of examples that were not included in the training set.

IX. Generalization and the Contiguity Problem

We have examined several types of contiguity problems; here we discuss one which we call the two-three clump problem. Further details and simulations are presented in Ref. 1. The performance of backpropagation on this problem has been analyzed extensively by other researchers.16 In the two-three clump problem, the training set consists only of patterns which have either two or three clumps, contiguous strings of ones separated by non-ones (however, for simulations, +1's and -1's were used). The target output of the unit is defined to be +1 for two-clump patterns and -1 for three-clump patterns:

two clumps: 011001100      three clumps: 110011011

All versions of the contiguity problem are at least second-order problems (see Ref. 8). Thus, learning these problems with a neural net will require either a cascade of first-order units or at least a second-order unit. A first-order perceptron is capable of learning the two-three clump problem with slightly more than 50% accuracy.16 Backpropagation seems to generalize very poorly on this problem, and requires thousands of presentations in order to learn the training set.16

We studied the two-three clump problem using a second-order unit with terms T0, T1, and T2, and the training procedure described earlier [Eqs. (1) and (2)]. After initializing the weights with random values, the unit was trained on a fraction of the data set with the range of the second-order interactions R2 = 4 (i.e., the range of the sum of dj over W2 never exceeded 4) and the input pattern length Nx = 9. For Nx = 9, the number of all possible inputs is 512. After training, we made use of an iterative weight thresholding procedure to reduce the noise in the final weight values. Following the thresholding, the generalization capacity of the unit was tested by counting the number of errors that the unit made on the set of examples that were not included in the training set. In Fig. 1 we have plotted generalization capacity vs training set size for this problem. The upper curve is the result with translation invariance imposed on the second-order weight W2, and the lower curve is the result without translation invariance imposed. Without invariance, the unit generalizes nearly perfectly for training sets as small as 1/8 of all possible inputs, but drops sharply to 85% accuracy for 1/10 of all possible inputs. Average convergence times ranged from three presentations of the training set for 1/2 of all possible inputs to ten presentations for 1/10 of all possible inputs. With invariance, the unit generalizes nearly perfectly for training sets as small as 1/10 of all possible patterns, but drops sharply to 85% accuracy for training sets with 1/20 of the possible patterns. The average convergence time was about three presentations for all cases with invariance. Because the net must effectively simulate a parallel edge counter to solve this problem, it appears that translation invariance does not play an important role when the training set is large.

An additional feature of this solution was that the high-order neural network unit explicitly (not implicitly as in layered networks12) generated an algorithmic solution to the contiguity problem. By closely inspecting the magnitudes of the weights, an explicit formula emerges. If one pays attention to the large magnitude weights and ignores the small magnitude weights, the resulting network actually generates the equation for a parallel edge-counter detector. This should not be surprising, since this net has been handcrafted to solve this problem. See Ref. 1 for further details.
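As a small illustration of the idea behind that formula (an assumption-laden sketch, not the learned weights themselves): in the (+1,-1) encoding, the product x(j)x(j + 1) is -1 exactly at an edge, so summing these second-order terms counts edges, and each clump contributes two edges once the string is padded with -1 at both ends.

```python
import numpy as np

def clump_count(x):
    """Count clumps of +1's in a (+1,-1) string by counting edges."""
    padded = np.concatenate(([-1], x, [-1]))
    edge_terms = padded[:-1] * padded[1:]       # the x(j) x(j+1) products
    return int((len(edge_terms) - edge_terms.sum()) // 4)

to_pm1 = lambda s: np.array([1 if c == '1' else -1 for c in s])
print(clump_count(to_pm1('011001100')),   # 2
      clump_count(to_pm1('110011011')))   # 3
```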

X. Generating High-Order Representations

The applicability of the high-order approach depends on the system designer's ability to choose a set of high-order terms which adequately represents the problem task domain which is likely to be encountered by the network. Building a network with all possible 2^N combinations of N inputs is clearly unfeasible. We have investigated several methods of choosing a representative set of terms:

(1) Matching the order of the network to the order of the problem. A single layer of kth order units will solve a kth order problem (as defined by Minsky and Papert8). Thus, if the order of the problem is known, we can specify the order of the unit required to solve the problem. Although the order of many hard problems such as speech recognition and visual pattern recognition is unknown, it can be estimated.

(2) Implementation of invariances. If it is known a priori that the problem to be implemented possesses a given set of invariances, those invariances can be preprogrammed into the network by the methods discussed earlier, thus eliminating all terms which are incompatible with the invariances. For example, speech recognition is invariant under temporal scale and translation operations.


(3) Correlation calculations. One way to determine which terms will be useful is to calculate the correlation matrices [as in the outer product rule above] for a representative sampling of the mapping to be generated. The entries in the correlation matrix which are largest in absolute value correspond to terms which are highly correlated with the output of the map, and thus are most likely to make an important contribution to the network when the map is implemented. Other, smaller terms can be ignored.
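A short sketch of this selection procedure for second-order terms, with an illustrative threshold (keep the top 10% of candidate pairs) and names that are assumptions rather than anything from the original work:

```python
import numpy as np

def select_second_order_terms(X, t, keep_fraction=0.1):
    """Rank candidate terms x(j) x(k) by |correlation with the target t|
    over a sample of the mapping (X: (Np, N) inputs, t: (Np,) outputs)."""
    Np, N = X.shape
    C = np.einsum('s,sj,sk->jk', t - t.mean(), X, X) / Np   # correlation matrix
    j_idx, k_idx = np.triu_indices(N, k=1)                  # distinct pairs only
    scores = np.abs(C[j_idx, k_idx])
    n_keep = max(1, int(keep_fraction * len(scores)))
    best = np.argsort(scores)[::-1][:n_keep]
    return list(zip(j_idx[best], k_idx[best]))

# Toy mapping whose output depends only on the product x(0) x(1).
rng = np.random.default_rng(4)
X = rng.choice([-1.0, 1.0], size=(200, 6))
t = X[:, 0] * X[:, 1]
print(select_second_order_terms(X, t))   # the pair (0, 1) ranks first
```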

(4) Generating representations adaptively. It is possible to generate high-order representations by progressively adapting the network to its problem environment. There are many adaptation methods which may be applicable to this problem; generalized backpropagation12 is an example of such a method which generates high-order representations in a different (multilayer) context. Applying gradient descent methods to the problem of generating high-order representations for single-layer architectures is very similar to the correlation calculation methods discussed under (3) above. A suggested approach is to use the capabilities of genetic search algorithms17 for generating high-order representations.

XI. Conclusions

By choosing a set of high-order terms which embodies prior knowledge about the problem domain, we are able to construct a network which can learn and generalize very efficiently within its designated environment. By providing the network with the tools it needs to solve the problems it expects to encounter, we liberate the network from the difficult task of deciding which tools it will need and then creating those tools. This process constitutes a major part of the learning procedure for networks utilizing backpropagation.

For example, in the contiguity problem, the second-order terms of the form x(j)x(j + 1) explicitly encode edge detection information; thus, the network only needs to learn how to use these edge terms, not generate the edge detection terms itself. Since a human uses edges to solve the contiguity problem (at least for large strings), the network is solving the problem in a similar fashion.

The drawback of the high-order approach is the combinatorial explosion of high-order terms inherent in a single-slab HOTLU. Various methods exist for dealing with this problem: cascades of high-order slabs, limitation of order, and reduced interconnections. In this paper we have discussed another method, which involves using prior knowledge of the problem domain to weed out the terms which have a small likelihood of being useful. This method produces specialized networks which are very powerful in a limited domain. We suggest that a general-purpose learning network might be constructed from a number of specialized modules designed for specific types of tasks, combined with more general networks to handle new and unusual situations.

References

1. T. Maxwell, C. L. Giles, and Y. C. Lee, "Generalization in Neural Networks: the Contiguity Problem," in Proceedings, IEEE International Conference on Neural Networks, San Diego (June 1987); U. Maryland Physics Department Technical Report UMLPF 87-050 (June 1987).

port UMLPF 87-050 (June 1987).2. T. Maxwell, C. L. Giles, Y. C. Lee, and H. H. Chen, "Transforma-

tion Invariance Using High Order Correlations in Neural NetArchitectures," IEEE Trans. Syst. Man Cybern. SMC-86CH2364-8, 627 (1986).

3. T. Maxwell, C. L. Giles, Y. C. Lee, and H. H. Chen, "NonlinearDynamics of Artificial Neural Systems," AIP Conf. Proc. 151,299 (1986).

4. H. H. Chen, Y. C. Lee, T. Maxwell, G. Z. Sun, H. Y. Lee, and C. L.Giles, "High Order Correlation Model for Associative Memory,"AIP Conf. Proc. 151, 86 (1986).

5. Y. C. Lee, G. Doolen, H. H. Chen, G. Z. Sun, T. Maxwell, H. Y.Lee, and C. L. Giles, "Machine Learning Using a Higher OrderCorrelation Network," Physica D 22, 276 (1986).

6. N. J. Nilsson, Learning Machines (McGraw-Hill, New York,1965).

7. F. Rosenblatt, Principles of Neurodynamics (Spartan, NewYork, 1962).

8. M. L. Minsky and S. Papert, Perceptrons (MIT Press, Cam-bridge, MA, 1969).

9. D. Psaltis, J. Hong, and S. Venkatesh, "Shift Invariance inOptical Associative Memories," Proc. Soc. Photo-Opt. Instrum.Eng. 625, 189 (1986); D. Psaltis and C. H. Park, "NonlinearDiscriminant Functions and Associative Memories," AIP Conf.Proc. 151, 370 (1986). R. A. Athale, H. H. Szu, C. B. Fried-lander, "Optical Implementation of Associative Memory withControlled Nonlinearity in the Correlation Domain," OpticsLetters, vol.11 #7, p482 (1986). Y. Owechko, G. J. Dunning, E.Maron, and B. H. Soffer, "Holographic Associative Memorywith Nonlinearities in the Correlation Domain," Applied Optics,vol. 26 # 10, p 1900 (1987).

10. R. D. Griffin, C. L. Giles, J. N. Lee, T. Maxwell, and F. P. Pursel,"Optical Higher Order Neural Networks for Invariant PatternRecognition," presented at Optical Society of America AnnualMeeting (Oct. 1987).

11. G. E. Hinton, "A Parallel Computation that Assigns CanonicalObject-based Frames of Reference," Proceedings of 7th Interna-tional Joint Conference on Artificial Intelligence, ed. A. Drina, p683 (1981). J. A. Feldman, "Dynamic Connections in NeuralNetworks," Biol. Cybern. 46, 27 (1982). Representative papersare: D. H. Ballard, "Cortical Connections and Parallel Process-ing: Structure and Function," Behav. Brain Sci. 9, 67 (1986).

12. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "LearningInternal Representations by Error Propagation," Ch 8 in Par-allel Distributed Processing, vol 1., D. E. Rumelhart and J. L.McClelland, (MIT Press, Cambridge, MA, 1986).

13. T. J. Sejnowski, "Higher-Order Boltzmann Machines," AIPConf. Proc. 151, 398 (1986).

14. S. Grossberg, "Adaptive Pattern Classification and UniversalRecoding, I: Parallel Development and Coding of Neural Fea-

ture Detectors," Biol. Cybern. 23, 121 (1976); D. E. Rumelhartand D. Zipser, "Feature Discovery by Competitive Learning,"Cognitive Sci. 9, 75 (1985).

15. D. Angluin and C. H. Smith, "Inductive Inference: Theory andMethods," ACM Comput. Surv. 15, 237 (1983).

16. J. S. Denker and S. A. Solla, "Generalization and the DeltaRule," presented at the Neural Networks for Computing Con-ference, Snowbird, UT (Apr. 1987).

17. J. H. Holland, K. J. Holyoak, R. E. Nisbett, and P. R. Thagard,Induction (MIT Press, Cambridge, MA, 1986).

18. P. Peretto and J. J. Niez, "Long Term Memory Storage Capacityof Multiconnected Neural Networks," Biological Cybernetics54, p 53 (1986).

19. C. L. Giles, R. D. Griffin, T. P. Maxwell, "Encoding Invariances

in Higher Order Neural Networks," presented at IEEE conf. onNeural Information Processing Systems-Natural and Synthe-tic," Denver, Co. Nov. 1987.
