LEARNING REPRESENTATIONS BY RECIRCULATION

Geoffrey E. Hinton
Computer Science and Psychology Departments, University of Toronto, Toronto M5S 1A4, Canada

James L. McClelland
Psychology and Computer Science Departments, Carnegie-Mellon University, Pittsburgh, PA 15213

ABSTRACT

We describe a new learning procedure for networks that contain groups of non-linear units arranged in a closed loop. The aim of the learning is to discover codes that allow the activity vectors in a "visible" group to be represented by activity vectors in a "hidden" group. One way to test whether a code is an accurate representation is to try to reconstruct the visible vector from the hidden vector. The difference between the original and the reconstructed visible vectors is called the reconstruction error, and the learning procedure aims to minimize this error. The learning procedure has two passes. On the first pass, the original visible vector is passed around the loop, and on the second pass an average of the original vector and the reconstructed vector is passed around the loop. The learning procedure changes each weight by an amount proportional to the product of the "presynaptic" activity and the difference in the post-synaptic activity on the two passes. This procedure is much simpler to implement than methods like back-propagation. Simulations in simple networks show that it usually converges rapidly on a good set of codes, and analysis shows that in certain restricted cases it performs gradient descent in the squared reconstruction error.

INTRODUCTION

Supervised gradient-descent learning procedures such as back-propagation have been shown to construct interesting internal representations in "hidden" units that are not part of the input or output of a connectionist network. One criticism of back-propagation is that it requires a teacher to specify the desired output vectors. It is possible to dispense with the teacher in the case of "encoder" networks2 in which the desired output vector is identical with the input vector (see Fig. 1). The purpose of an encoder network is to learn good "codes" in the intermediate, hidden units. If, for example, there are fewer hidden units than input units, an encoder network will perform data-compression. It is also possible to introduce other kinds of constraints on the hidden units, so we can view an encoder network as a way of ensuring that the input can be reconstructed from the activity in the hidden units whilst also making the hidden units satisfy some other constraint.

A second criticism of back-propagation is that it is neurally implausible (and hard to implement in hardware) because it requires all the connections to be used backwards and it requires the units to use different input-output functions for the forward and backward passes. Recirculation is designed to overcome this second criticism in the special case of encoder networks.

Fig. 1. A diagram of a three layer encoder network that learns good codes using back-propagation. On the forward pass, activity flows from the input units in the bottom layer to the output units in the top layer. On the backward pass, error-derivatives flow from the top layer to the bottom layer.

Instead of using a separate group of units for the input and output we use the very same group of "visible" units, so the input vector is the initial state of this group and the output vector is the state after information has passed around the loop. The difference between the activity of a visible unit before and after sending activity around the loop is the derivative of the squared reconstruction error. So, if the visible units are linear, we can perform gradient descent in the squared error by changing each of a visible unit's incoming weights by an amount proportional to the product of this difference and the activity of the hidden unit from which the connection emanates. So learning the weights from the hidden units to the output units is simple. The harder problem is to learn the weights on connections coming into hidden units because there is no direct specification of the desired states of these units. Back-propagation solves this problem by back-propagating error-derivatives from the output units to generate error-derivatives for the hidden units. Recirculation solves the problem in a quite different way that is easier to implement but much harder to analyse.

This research was supported by contract N00014-86-00167 from the Office of Naval Research and a grant from the Canadian National Science and Engineering Research Council. Geoffrey Hinton is a fellow of the Canadian Institute for Advanced Research. We thank Mike Franzini, Conrad Galland and Geoffrey Goodhill for helpful discussions and help with the simulations. © American Institute of Physics 1988



THE RECIRCULATION PROCEDURE

We introduce the recirculation procedure by considering a very simple architecture in which there is just one group of hidden units. Each visible unit has a directed connection to every hidden unit, and each hidden unit has a directed connection to every visible unit. The total input received by the j-th unit is

    x_j = \sum_i y_i w_{ij} - \theta_j    (1)

where y_i is the state of the i-th unit, w_ij is the weight on the connection from the i-th to the j-th unit, and θ_j is the threshold of the j-th unit. The threshold term can be eliminated by giving every unit an extra input connection whose activity level is fixed at 1. The weight on this special connection is the negative of the threshold, and it can be learned in just the same way as the other weights. This method of implementing thresholds will be assumed throughout the paper.

The functions relating inputs to outputs of visible and hidden units are smooth monotonic functions with bounded derivatives. For hidden units we use the logistic function:

    y_j = \sigma(x_j) = \frac{1}{1 + e^{-x_j}}    (2)

Other smooth monotonic functions would serve as well. For visible units, our mathematical analysis focuses on the linear case in which the output equals the total input, though in simulations we use the logistic function.

We have already given a verbal description of the learning rule for the hidden-to-visible connections. The weight w_ji from the j-th hidden unit to the i-th visible unit is changed as follows:

    \Delta w_{ji} = \epsilon \, y_j(1) \, (y_i(0) - y_i(2))    (3)

where y_i(0) is the state of the i-th visible unit at time 0 and y_i(2) is its state at time 2, after activity has passed around the loop once. The rule for the visible-to-hidden connections is identical:

    \Delta w_{ij} = \epsilon \, y_i(2) \, (y_j(1) - y_j(3))    (4)

where y_j(1) is the state of the j-th hidden unit at time 1 (on the first pass around the loop) and y_j(3) is its state at time 3 (on the second pass around the loop). Fig. 2 shows the network exploded in time.

In general, this rule for changing the visible-to-hidden connections does not perform steepest descent in the squared reconstruction error, so it behaves differently from back-propagation. This raises two issues: under what conditions does it work, and under what conditions does it approximate steepest descent?
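To make the two passes concrete, here is a minimal sketch of one recirculation update for this architecture in Python/NumPy. The sketch is ours, not the authors' code: the array names and shapes are assumptions, the logistic function is applied to the visible units as well as the hidden units, and the regression introduced in the next section is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recirculation_step(y0, W_vh, W_hv, b_h, b_v, eps=1.0):
    """One trip around the loop followed by the weight updates of Eqs 3 and 4.

    y0   : visible activity vector at time 0
    W_vh : visible-to-hidden weights, shape (n_visible, n_hidden)
    W_hv : hidden-to-visible weights, shape (n_hidden, n_visible)
    b_h, b_v : bias weights (thresholds implemented as weights from a unit fixed at 1)
    """
    y1 = sigmoid(y0 @ W_vh + b_h)   # hidden state at time 1 (Eqs 1 and 2)
    y2 = sigmoid(y1 @ W_hv + b_v)   # reconstructed visible state at time 2
    y3 = sigmoid(y2 @ W_vh + b_h)   # hidden state at time 3 (second pass)

    # Eq 3: presynaptic activity y1, difference of post-synaptic activity y0 - y2
    W_hv += eps * np.outer(y1, y0 - y2)
    b_v  += eps * (y0 - y2)         # the bias unit's activity is 1
    # Eq 4: presynaptic activity y2, difference of post-synaptic activity y1 - y3
    W_vh += eps * np.outer(y2, y1 - y3)
    b_h  += eps * (y1 - y3)
    return y1, y2                   # hidden code and reconstruction
```

Every weight change uses only the activities available at the two ends of the connection on the two passes; no derivatives are passed backwards through the connections, which is what makes the procedure simpler to implement than back-propagation.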

Fig. 2. A diagram showing the states of the visible and hidden units exploded in time. The visible units are at the bottom and the hidden units are at the top. Time goes from left to right.

CONDITIONS UNDER WHICH RECIRCULATION APPROXIMATES GRADIENT DESCENT

For the simple architecture shown in Fig. 2, the recirculation learning procedure changes the visible-to-hidden weights in the direction of steepest descent in the squared reconstruction error provided the following conditions hold:

1. The visible units are linear.
2. The weights are symmetrical (i.e. w_ij = w_ji for all i and j).
3. The visible units have high regression.

"Regression" means that, after one pass around the loop, instead of setting the activity of a visible unit, y_i(2), to be equal to its current total input, x_i(2), as determined by Eq 1, we set its activity to be

    y_i(2) = \lambda \, y_i(0) + (1 - \lambda) \, x_i(2)    (5)

where the regression, λ, is close to 1. Using high regression ensures that the visible units only change state slightly, so that when the new visible vector is sent around the loop again on the second pass it has very similar effects to the first pass. In order to make the learning rule for the hidden units as similar as possible to the rule for the visible units, we also use regression in computing the activity of the hidden units on the second pass:

    y_j(3) = \lambda \, y_j(1) + (1 - \lambda) \, \sigma(x_j(3))    (6)
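In code, the regression of Eqs 5 and 6 is a small change to the pass shown earlier (the helper below reuses `sigmoid` and the NumPy import from that sketch, writes λ as `lam`, and applies the logistic to the visible units as in the authors' simulations; the analysis itself assumes linear visible units):

```python
def loop_states_with_regression(y0, W_vh, W_hv, b_h, b_v, lam=0.75):
    """Visible and hidden states at times 1-3 with regression (Eqs 5 and 6)."""
    y1 = sigmoid(y0 @ W_vh + b_h)                          # time 1
    y2 = lam * y0 + (1 - lam) * sigmoid(y1 @ W_hv + b_v)   # time 2 (Eq 5)
    y3 = lam * y1 + (1 - lam) * sigmoid(y2 @ W_vh + b_h)   # time 3 (Eq 6)
    return y1, y2, y3
```

With `lam` close to 1, y2 stays close to y0 and y3 stays close to y1, which is exactly the "states only change slightly" property that the analysis below relies on.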

For a given input vector, the squared reconstruction error is

    E = \frac{1}{2} \sum_k (y_k(2) - y_k(0))^2

For a hidden unit, j,

    \frac{\partial E}{\partial y_j(1)} = \sum_k \frac{\partial E}{\partial y_k(2)} \, \frac{d y_k(2)}{d x_k(2)} \, \frac{\partial x_k(2)}{\partial y_j(1)} = \sum_k (y_k(2) - y_k(0)) \, y'_k(2) \, w_{jk}    (7)

where y'_k(2) = dy_k(2)/dx_k(2). For a visible-to-hidden weight w_ij,

    \frac{\partial E}{\partial w_{ij}} = y'_j(1) \, y_i(0) \, \frac{\partial E}{\partial y_j(1)}

so, using Eq 7 and the assumption that w_jk = w_kj for all k and j,

    \frac{\partial E}{\partial w_{ij}} = y'_j(1) \, y_i(0) \left( \sum_k y_k(2) \, y'_k(2) \, w_{jk} - \sum_k y_k(0) \, y'_k(2) \, w_{jk} \right)    (8)

The assumption that the visible units are linear (with a gradient of 1) means that y'_k(2) = 1 for all k. So, using Eq 1, we have

    \frac{\partial E}{\partial w_{ij}} = y'_j(1) \, y_i(0) \, (x_j(3) - x_j(1))

Now, with sufficiently high regression, we can assume that the states of units only change slightly with time, so that

    y'_j(1) \, (x_j(3) - x_j(1)) \approx \sigma(x_j(3)) - \sigma(x_j(1)) = \frac{1}{1-\lambda} (y_j(3) - y_j(1))

and

    y_i(0) \approx y_i(2)

So, by substituting in Eq 8, we get

    \frac{\partial E}{\partial w_{ij}} \approx \frac{1}{1-\lambda} \, y_i(2) \, (y_j(3) - y_j(1))    (9)

An interesting property of Eq 9 is that it does not contain a term for the gradient of the input-output function of unit j, so recirculation learning can be applied even when unit j uses an unknown non-linearity. To do back-propagation it is necessary to know the gradient of the non-linearity, but recirculation measures the gradient by measuring the effect of a small difference in input, so the term y_j(3) - y_j(1) implicitly contains the gradient.
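The claim can be checked numerically. The sketch below is ours, not from the paper: it sets up the three conditions (linear visible units, symmetric weights, high regression) and compares the visible-to-hidden weight change prescribed by Eq 4 with a finite-difference estimate of the direction of steepest descent in E. It compares directions only, via a cosine, so the 1/(1-λ) scale factor in Eq 9 plays no role.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alignment_with_steepest_descent(n_v=4, n_h=2, lam=0.99, delta=1e-5, seed=0):
    """Cosine between the Eq 4 update and the steepest-descent direction for W_vh."""
    rng = np.random.default_rng(seed)
    W_vh = rng.uniform(-0.5, 0.5, (n_v, n_h))
    W_hv = W_vh.T.copy()                              # condition 2: symmetric weights
    y0 = rng.uniform(0.1, 0.9, n_v)

    def states(W):
        y1 = sigmoid(y0 @ W)                          # hidden units use the logistic
        y2 = lam * y0 + (1 - lam) * (y1 @ W_hv)       # condition 1: linear visible units (Eq 5)
        y3 = lam * y1 + (1 - lam) * sigmoid(y2 @ W)   # Eq 6
        return y1, y2, y3

    def error(W):
        return 0.5 * np.sum((states(W)[1] - y0) ** 2)

    y1, y2, y3 = states(W_vh)
    recirc = np.outer(y2, y1 - y3)                    # Eq 4, without the learning rate

    grad = np.zeros_like(W_vh)                        # finite-difference dE/dw_ij
    for i in range(n_v):
        for j in range(n_h):
            Wp = W_vh.copy(); Wp[i, j] += delta
            Wm = W_vh.copy(); Wm[i, j] -= delta
            grad[i, j] = (error(Wp) - error(Wm)) / (2 * delta)

    r, descent = recirc.ravel(), -grad.ravel()
    cos = float(r @ descent / (np.linalg.norm(r) * np.linalg.norm(descent)))
    return cos    # close to 1.0 when the three conditions hold (condition 3: lam near 1)
```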

A SIMULATION OF RECIRCULATION

From a biological standpoint, the symmetry requirement that w_ij = w_ji is unrealistic unless it can be shown that this symmetry of the weights can be learned. To investigate what would happen if symmetry was not enforced (and if the visible units used the same non-linearity as the hidden units), we applied the recirculation learning procedure to a network with 4 visible units and 2 hidden units. The visible vectors were 1000, 0100, 0010 and 0001, so the 2 hidden units had to learn 4 different codes to represent these four visible vectors. All the weights and biases in the network were started at small random values uniformly distributed in the range -0.5 to +0.5. We used regression in the hidden units, even though this is not strictly necessary, but we ignored the term 1/(1-λ) in Eq 9.

Using an ε of 20 and a λ of 0.75 for both the visible and the hidden units, the network learned to produce a reconstruction error of less than 0.1 on every unit in an average of 48 weight updates (with a maximum of 202 in 100 simulations). Each weight update was performed after trying all four training cases, and the change was the sum of the four changes prescribed by Eq 3 or 4 as appropriate. The final reconstruction error was measured using a regression of 0, even though high regression was used during the learning. The learning speed is comparable with back-propagation, though a precise comparison is hard because the optimal values of ε are different in the two cases. Also, the fact that we ignored the term 1/(1-λ) when modifying the visible-to-hidden weights means that recirculation tends to change the visible-to-hidden weights more slowly than the hidden-to-visible weights, and this would also help back-propagation.
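The experiment just described is small enough to restate as code. The sketch below is our reconstruction from the description above, not the authors' program, so whether it reproduces the reported average of 48 weight updates depends on details the paper does not spell out. It uses ε = 20 and λ = 0.75, sums the changes over the four training cases before each update, applies Eqs 3 and 4 directly (ignoring the 1/(1-λ) factor of Eq 9), and measures the stopping criterion with a regression of 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_4_2_4(eps=20.0, lam=0.75, max_updates=1000, seed=0):
    """Recirculation on the 4-visible / 2-hidden encoder task described above."""
    rng = np.random.default_rng(seed)
    n_v, n_h = 4, 2
    patterns = np.eye(n_v)                           # 1000, 0100, 0010, 0001
    W_vh = rng.uniform(-0.5, 0.5, (n_v, n_h))
    W_hv = rng.uniform(-0.5, 0.5, (n_h, n_v))
    b_h = rng.uniform(-0.5, 0.5, n_h)
    b_v = rng.uniform(-0.5, 0.5, n_v)

    def pass_around(y0, lam_):
        y1 = sigmoid(y0 @ W_vh + b_h)
        y2 = lam_ * y0 + (1 - lam_) * sigmoid(y1 @ W_hv + b_v)   # Eq 5
        y3 = lam_ * y1 + (1 - lam_) * sigmoid(y2 @ W_vh + b_h)   # Eq 6
        return y1, y2, y3

    for update in range(1, max_updates + 1):
        dW_vh = np.zeros_like(W_vh); dW_hv = np.zeros_like(W_hv)
        db_h = np.zeros_like(b_h); db_v = np.zeros_like(b_v)
        for y0 in patterns:                          # accumulate over all four cases
            y1, y2, y3 = pass_around(y0, lam)
            dW_hv += eps * np.outer(y1, y0 - y2)     # Eq 3
            db_v += eps * (y0 - y2)
            dW_vh += eps * np.outer(y2, y1 - y3)     # Eq 4
            db_h += eps * (y1 - y3)
        W_hv += dW_hv; b_v += db_v; W_vh += dW_vh; b_h += db_h

        # stop when the error is below 0.1 on every unit, measured with regression 0
        worst = max(np.max(np.abs(pass_around(p, 0.0)[1] - p)) for p in patterns)
        if worst < 0.1:
            return update                            # number of weight updates used
    return None
```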

It is not immediately obvious why the recirculation learning procedure works when the weights are not constrained to be symmetrical, so we compared the weight changes prescribed by the recirculation procedure with the weight changes that would cause steepest descent in the sum squared reconstruction error (i.e. the weight changes prescribed by back-propagation). As expected, recirculation and back-propagation agree on the weight changes for the hidden-to-visible connections, even though the gradient of the logistic function is not taken into account in weight adjustments under recirculation. (Conrad Galland has observed that this agreement is only slightly affected by using visible units that have the non-linear input-output function shown in Eq 2, because at any stage of the learning all the visible units tend to have similar slopes for their input-output functions, so the non-linearity scales all the weight changes by approximately the same amount.)

For the visible-to-hidden connections, recirculation initially prescribes weight changes that are only randomly related to the direction of steepest descent, so these changes do not help to improve the performance of the system. As the learning proceeds, however, these changes come to agree with the direction of steepest descent. The crucial observation is that this agreement occurs after the hidden-to-visible weights have changed in such a way that they are approximately aligned (symmetrical up to a constant factor) with the visible-to-hidden weights. So it appears that changing the hidden-to-visible weights in the direction of steepest descent creates the conditions that are necessary for the recirculation procedure to cause changes in the visible-to-hidden weights that follow the direction of steepest descent.

It is not hard to see why this happens if we start with random, zero-mean visible-to-hidden weights. If the visible-to-hidden weight w_ij is positive, hidden unit j will tend to have a higher than average activity level when the i-th visible unit has a higher than average activity. So y_j will tend to be higher than average when the reconstructed value of y_i should be higher than average, i.e. when the term (y_i(0) - y_i(2)) in Eq 3 is positive. It will also be lower than average when this term is negative. These relationships will be reversed if w_ij is negative, so w_ji will grow faster when w_ij is positive than it will when w_ij is negative. Smolensky4 presents a mathematical analysis that shows why a similar learning procedure creates symmetrical weights in a purely linear system. Williams5 also analyses a related learning rule for linear systems which he calls the "symmetric error correction" procedure, and he shows that it performs principal components analysis. In our simulations of recirculation, the visible-to-hidden weights become aligned with the corresponding hidden-to-visible weights, though the hidden-to-visible weights are generally of larger magnitude.
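A direct way to see this alignment in a simulation is to track the cosine between the visible-to-hidden weight matrix and the transpose of the hidden-to-visible weight matrix as learning proceeds. The helper below is our own instrumentation, not something from the paper; the observation reported above corresponds to this value rising towards 1 while the hidden-to-visible weights keep the larger norm.

```python
import numpy as np

def weight_alignment(W_vh, W_hv):
    """Cosine between w_ij and w_ji over all pairs: 1.0 means the two weight
    matrices are symmetrical up to a positive constant factor."""
    a, b = W_vh.ravel(), W_hv.T.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```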

A PICTURE OF RECIRCULATION

To gain more insight into the conditions under which recirculation learning produces the appropriate changes in the visible-to-hidden weights, we introduce the pictorial representation shown in Fig. 3. The initial visible vector, A, is mapped into the reconstructed vector, C, so the error vector is AC. Using high regression, the visible vector that is sent around the loop on the second pass is P, where the difference vector AP is a small fraction of the error vector AC. If the regression is sufficiently high and all the non-linearities in the system have bounded derivatives and the weights have bounded magnitudes, the difference vectors AP, BQ and CR will be very small and we can assume that, to first order, the system behaves linearly in these difference vectors. If, for example, we moved P so as to double the length of AP, we would also double the length of BQ and CR.

Fig. 3. A diagram showing some vectors (A, P) over the visible units, their "hidden" images (B, Q) over the hidden units, and their "visible" images (C, R) over the visible units. The vectors B' and C' are the hidden and visible images of A after the visible-to-hidden weights have been changed by the learning procedure.

Suppose we change the visible-to-hidden weights in the manner prescribed by Eq 4, using a very small value of ε. Let Q' be the hidden image of P (i.e. the image of P in the hidden units) after the weight changes. To first order, Q' will lie between B and Q on the line BQ. This follows from the observation that Eq 4 has the effect of moving each y_j(3) towards y_j(1) by an amount proportional to their difference. Since B is close to Q, a weight change that moves the hidden image of P from Q to Q' will also move the hidden image of A from B to B', where B' lies on the extension of the line BQ as shown in Fig. 3. If the hidden-to-visible weights are not changed, the visible image of A will move from C to C', where C' lies on the extension of the line CR as shown in Fig. 3. So the visible-to-hidden weight changes will reduce the squared reconstruction error provided the vector CR is approximately parallel to the vector AP.

But why should we expect the vector CR to be aligned with the vector AP? In general we should not, except when the visible-to-hidden and hidden-to-visible weights are approximately aligned. The learning in the hidden-to-visible connections has a tendency to cause this alignment. In addition, it is easy to modify the recirculation learning procedure so as to increase the tendency for the learning in the hidden-to-visible connections to cause alignment. Eq 3 has the effect of moving the visible image of A closer to A by an amount proportional to the magnitude of the error vector AC. If we apply the same rule on the next pass around the loop, we move the visible image of P closer to P by an amount proportional to the magnitude of PR. If the vector CR is anti-aligned with the vector AP, the magnitude of AC will exceed the magnitude of PR, so the result of these two movements will be to improve the alignment between AP and CR. We have not yet tested this modified procedure through simulations, however.

This is only an informal argument and much work remains to be done in establishing the precise conditions under which the recirculation learning procedure approximates steepest descent. The informal argument applies equally well to systems that contain longer loops which have several groups of hidden units arranged in series. At each stage in the loop, the same learning procedure can be applied, and the weight changes will approximate gradient descent provided the difference of the two visible vectors that are sent around the loop aligns with the difference of their images. We have not yet done enough simulations to develop a clear picture of the conditions under which the changes in the hidden-to-visible weights produce the required alignment.

USING A HIERARCHY OF CLOSED LOOPS

Instead of using a single loop that contains many hidden layers in series, it is possible to use a more modular system. Each module consists of one "visible" group and one "hidden" group connected in a closed loop, but the visible group for one module is actually composed of the hidden groups of several lower level modules, as shown in Fig. 4. Since the same learning rule is used for both visible and hidden units, there is no problem in applying it to systems in which some units are the visible units of one module and the hidden units of another. Ballard6 has experimented with back-propagation in this kind of system, and we have run some simulations of recirculation using the architecture shown in Fig. 4. The network learned to encode a set of vectors specified over the bottom layer. After learning, each of the vectors became an attractor and the network was capable of completing a partial vector, even though this involved passing information through several layers.

Fig. 4. A network in which the hidden units of the bottom two modules are the visible units of the top module.
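One way to organise the computation for the architecture of Fig. 4 is sketched below. The paper does not specify a schedule, so the bottom-up ordering used here (each bottom module runs its own loop and learns, then the top module treats the concatenated bottom hidden codes as its visible vector) is our assumption, as are the dictionary layout and names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_recirculation_step(y0_a, y0_b, mod_a, mod_b, mod_top, eps=1.0):
    """One bottom-up pass through the Fig. 4 architecture.

    Each module is a dict with weight arrays 'W_vh', 'W_hv' and biases 'b_h', 'b_v'.
    The same single-loop rule (Eqs 3 and 4) is applied inside every module.
    """
    def step(y0, m):
        y1 = sigmoid(y0 @ m['W_vh'] + m['b_h'])
        y2 = sigmoid(y1 @ m['W_hv'] + m['b_v'])
        y3 = sigmoid(y2 @ m['W_vh'] + m['b_h'])
        m['W_hv'] += eps * np.outer(y1, y0 - y2); m['b_v'] += eps * (y0 - y2)   # Eq 3
        m['W_vh'] += eps * np.outer(y2, y1 - y3); m['b_h'] += eps * (y1 - y3)   # Eq 4
        return y1                                    # the module's hidden code

    code_a = step(y0_a, mod_a)                       # bottom module A
    code_b = step(y0_b, mod_b)                       # bottom module B
    # the top module's "visible" group is the two bottom hidden groups concatenated
    return step(np.concatenate([code_a, code_b]), mod_top)
```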

CONCLUSION

We have described a simple learning procedure that is capable of forming representations in non-linear hidden units whose input-output functions have bounded derivatives. The procedure is easy to implement in hardware, even if the non-linearity is unknown. Given some strong assumptions, the procedure performs gradient descent in the reconstruction error. If the symmetry assumption is violated, the learning procedure still works because the changes in the hidden-to-visible weights produce symmetry. If the assumption about the linearity of the visible units is violated, the procedure still works in the cases we have simulated. For the general case of a loop with many non-linear stages, we have an informal picture of a condition that must hold for the procedure to approximate gradient descent, but we do not have a formal analysis, and we do not have sufficient experience with simulations to give an empirical description of the general conditions under which the learning procedure works.

REFERENCES

1. D. E. Rumelhart, G. E. Hinton and R. J. Williams, Nature 323, 533-536 (1986).
2. D. H. Ackley, G. E. Hinton and T. J. Sejnowski, Cognitive Science, 147-169 (1985).
3. G. Cottrell, J. L. Elman and D. Zipser, Proc. Cognitive Science Society, Seattle, WA (1987).
4. P. Smolensky, Technical Report CU-CS-355-87, University of Colorado at Boulder (1986).
5. R. J. Williams, Technical Report 8501, Institute of Cognitive Science, University of California, San Diego (1985).
6. D. H. Ballard, Proc. American Association for Artificial Intelligence, Seattle, WA (1987).

SCHEMA FOR MOTOR CONTROL UTILIZING A NETWORK MODEL OF THE CEREBELLUM

James C. Houk, Ph.D.
Northwestern University Medical School, Chicago, Illinois 60201

ABSTRACT

This paper outlines a schema for movement control based on two stages of signal processing. The higher stage is a neural network model that treats the cerebellum as an array of adjustable motor pattern generators. This network uses sensory input to preset and to trigger elemental pattern generators and to evaluate their performance. The actual patterned outputs, however, are produced by intrinsic circuitry that includes recurrent loops and is thus capable of self-sustained activity. These patterned outputs are sent as motor commands to local feedback systems called motor servos. The latter control the forces and lengths of individual muscles. Overall control is thus achieved in two stages: (1) an adaptive cerebellar network generates an array of feedforward motor commands and (2) a set of local feedback systems translates these commands into actual movements.

INTRODUCTION

There is considerable evidence that the cerebellum is involved in the adaptive control of movement1, although the manner in which this control is achieved is not well understood. As a means of probing these cerebellar mechanisms, my colleagues and I have been conducting microelectrode studies of the neural messages that flow through the intermediate division of the cerebellum and onward to limb muscles via the rubrospinal tract. We regard this cerebellorubrospinal pathway as a useful model system for studying general problems of sensorimotor integration and adaptive brain function. A summary of our findings has been published as a book chapter.

On the basis of these and other neurophysiological results, I recently hypothesized that the cerebellum functions as an array of adjustable motor pattern generators. The outputs from these pattern generators are assumed to function as motor commands, i.e., as neural control signals that are sent to lower-level motor systems where they produce movements. According to this hypothesis, the cerebellum uses its extensive sensory input to preset the

© American Institute of Physics 1988
