
Accepted for publication, IEEE Transactions on Knowledge and Data Engineering.

Natural Language Grammatical Inference with Recurrent Neural Networks

Steve Lawrence, C. Lee Giles, Sandiway Fong

NEC Research Institute
4 Independence Way
Princeton, NJ 08540

Abstract

This paper examines the inductive inference of a complex grammar with neural networks – specifically, the task considered is that of training a network to classify natural language sentences as grammatical or ungrammatical, thereby exhibiting the same kind of discriminatory power provided by the Principles and Parameters linguistic framework, or Government-and-Binding theory. Neural networks are trained, without the division into learned vs. innate components assumed by Chomsky, in an attempt to produce the same judgments as native speakers on sharply grammatical/ungrammatical data. How a recurrent neural network could possess linguistic capability, and the properties of various common recurrent neural network architectures, are discussed. The problem exhibits training behavior which is often not present with smaller grammars, and training was initially difficult. However, after implementing several techniques aimed at improving the convergence of the gradient descent backpropagation-through-time training algorithm, significant learning was possible. It was found that certain architectures are better able to learn an appropriate grammar. The operation of the networks and their training is analyzed. Finally, the extraction of rules in the form of deterministic finite state automata is investigated.

Keywords: recurrent neural networks, natural language processing, grammatical inference, government-and-binding theory, gradient descent, simulated annealing, principles-and-parameters framework, automata extraction.

Also with the Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742.


1 Introduction

This paper considers the task of classifying natural language sentences as grammatical or ungrammatical. We attempt to train neural networks, without the bifurcation into learned vs. innate components assumed by Chomsky, to produce the same judgments as native speakers on sharply grammatical/ungrammatical data. Only recurrent neural networks are investigated, for computational reasons. Computationally, recurrent neural networks are more powerful than feedforward networks and some recurrent architectures have been shown to be at least Turing equivalent [53, 54]. We investigate the properties of various popular recurrent neural network architectures, in particular Elman, Narendra & Parthasarathy (N&P) and Williams & Zipser (W&Z) recurrent networks, and also Frasconi-Gori-Soda (FGS) locally recurrent networks. We find that both Elman and W&Z recurrent neural networks are able to learn an appropriate grammar after implementing techniques for improving the convergence of the gradient descent based backpropagation-through-time training algorithm. We analyze the operation of the networks and investigate a rule approximation of what the recurrent network has learned – specifically, the extraction of rules in the form of deterministic finite state automata.

Previous work [38] has compared neural networks with other machine learning paradigms on this problem – this work focuses on recurrent neural networks, investigates additional networks, analyzes the operation of the networks and the training algorithm, and investigates rule extraction.

This paper is organized as follows: section 2 provides the motivation for the task attempted. Section 3 provides a brief introduction to formal grammars and grammatical inference and describes the data. Section 4 lists the recurrent neural network models investigated and provides details of the data encoding for the networks. Section 5 presents the results of investigation into various training heuristics, and investigation of training with simulated annealing. Section 6 presents the main results and simulation details, and investigates the operation of the networks. The extraction of rules in the form of deterministic finite state automata is investigated in section 7, and section 8 presents a discussion of the results and conclusions.

2 Motivation

2.1 Representational Power

Natural language has traditionally been handled using symbolic computation and recursive processes. The most successful stochastic language models have been based on finite-state descriptions such as n-grams or hidden Markov models. However, finite-state models cannot represent hierarchical structures as found in natural language¹ [48]. In the past few years several recurrent neural network architectures have emerged

¹The inside-outside re-estimation algorithm is an extension of hidden Markov models intended to be useful for learning hierarchical systems. The algorithm is currently only practical for relatively small grammars [48].


which have been used for grammatical inference [9, 21, 19, 20, 68]. Recurrent neural networks have been used for several smaller natural language problems, e.g. papers using the Elman network for natural language tasks include: [1, 12, 24, 58, 59]. Neural network models have been shown to be able to account for a variety of phenomena in phonology [23, 61, 62, 18, 22], morphology [51, 41, 40] and role assignment [42, 58]. Induction of simpler grammars has been addressed often – e.g. [64, 65, 19] on learning Tomita languages [60]. The task considered here differs from these in that the grammar is more complex. The recurrent neural networks investigated in this paper constitute complex, dynamical systems – it has been shown that recurrent networks have the representational power required for hierarchical solutions [13], and that they are Turing equivalent.

2.2 Language and Its Acquisition

Certainly one of the most important questions for the study of human language is: How do people unfailingly manage to acquire such a complex rule system? A system so complex that it has resisted the efforts of linguists to date to adequately describe in a formal system [8]. A couple of examples of the kind of knowledge native speakers often take for granted are provided in this section.

For instance, any native speaker of English knows that the adjective eager obligatorily takes a complementizer for with a sentential complement that contains an overt subject, but that the verb believe cannot. Moreover, eager may take a sentential complement with a non-overt, i.e. an implied or understood, subject, but believe cannot²:

*I am eager John to be here          I believe John to be here
I am eager for John to be here       *I believe for John to be here
I am eager to be here                *I believe to be here

Such grammaticality judgments are sometimes subtle but unarguably form part of the native speaker's language competence. In other cases, judgment falls not on acceptability but on other aspects of language competence such as interpretation. Consider the reference of the embedded subject of the predicate to talk to in the following examples:

John is too stubborn for Mary to talk to
John is too stubborn to talk to
John is too stubborn to talk to Bill

In the first sentence, it is clear that Mary is the subject of the embedded predicate. As every native speaker knows, there is a strong contrast in the co-reference options for the understood subject in the second and

²As is conventional, an asterisk is used to indicate ungrammaticality.


third sentences despite their surface similarity. In the third sentence, John must be the implied subject of the predicate to talk to. By contrast, John is understood as the object of the predicate in the second sentence, the subject here having arbitrary reference; in other words, the sentence can be read as John is too stubborn for some arbitrary person to talk to John. The point to emphasize here is that the language faculty has impressive discriminatory power, in the sense that a single word, as seen in the examples above, can result in sharp differences in acceptability or alter the interpretation of a sentence considerably. Furthermore, the judgments shown above are robust in the sense that virtually all native speakers agree with the data.

In the light of such examples and the fact that such contrasts crop up not just in English but in other languages (for example, the stubborn contrast also holds in Dutch), some linguists (chiefly Chomsky [7]) have hypothesized that it is only reasonable that such knowledge is only partially acquired: the lack of variation found across speakers, and indeed, languages for certain classes of data suggests that there exists a fixed component of the language system. In other words, there is an innate component of the language faculty of the human mind that governs language processing. All languages obey these so-called universal principles. Since languages do differ with regard to things like subject-object-verb order, these principles are subject to parameters encoding systematic variations found in particular languages. Under the innateness hypothesis, only the language parameters plus the language-specific lexicon are acquired by the speaker; in particular, the principles are not learned. Based on these assumptions, the study of these language-independent principles has become known as the Principles-and-Parameters framework, or Government-and-Binding (GB) theory.

This paper investigates whether a neural network can be made to exhibit the same kind of discriminatory power on the sort of data GB-linguists have examined. More precisely, the goal is to train a neural network from scratch, i.e. without the division into learned vs. innate components assumed by Chomsky, to produce the same judgments as native speakers on the grammatical/ungrammatical pairs of the sort discussed above. Instead of using innate knowledge, positive and negative examples are used (a second argument for innateness is that it is not possible to learn the grammar without negative examples).

3 Data

We first provide a brief introduction to formal grammars, grammatical inference, and natural language; for a thorough introduction see Harrison [25] and Fu [17]. We then detail the data set which we have used in our experiments.

3.1 Formal Grammars and Grammatical Inference

Briefly, a grammar G is a four-tuple (T, N, P, S), where T and N are sets of terminals and nonterminals comprising the alphabet of the grammar, P is a set of production rules, and S is the start symbol. For every


grammar there exists a language L, a set of strings of the terminal symbols, that the grammar generates or recognizes. There also exist automata that recognize and generate the grammar. Grammatical inference is concerned mainly with the procedures that can be used to infer the syntactic or production rules of an unknown grammar G based on a finite set of strings I from L(G), the language generated by G, and possibly also on a finite set of strings from the complement of L(G) [17]. This paper considers replacing the inference algorithm with a neural network, and the grammar is that of the English language. The simple grammar used by Elman [13] shown in table 1 contains some of the structures in the complete English grammar: e.g. agreement, verb argument structure, interactions with relative clauses, and recursion.

S     → NP VP "."
NP    → PropN | N | N RC
VP    → V (NP)
RC    → who NP VP | who VP (NP)
N     → boy | girl | cat ...
PropN → John | Mary
V     → chase | feed | see ...

Table 1. A simple grammar encompassing a subset of the English language (from [13]). NP = noun phrase, VP = verb phrase, PropN = proper noun, RC = relative clause, V = verb, N = noun, and S = the full sentence.
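As an illustrative sketch (not part of the original paper), the following Python fragment generates sentences from the grammar of table 1 by repeatedly expanding nonterminals; the recursion through RC is what gives the grammar its hierarchical character.

```python
import random

# Productions of the simple grammar in table 1; optional elements are expanded
# into explicit alternatives. Upper-case symbols and "who" follow the table.
GRAMMAR = {
    "S":     [["NP", "VP", "."]],
    "NP":    [["PropN"], ["N"], ["N", "RC"]],
    "VP":    [["V"], ["V", "NP"]],
    "RC":    [["who", "NP", "VP"], ["who", "VP"], ["who", "VP", "NP"]],
    "N":     [["boy"], ["girl"], ["cat"]],
    "PropN": [["John"], ["Mary"]],
    "V":     [["chase"], ["feed"], ["see"]],
}

def generate(symbol="S"):
    """Recursively expand a symbol into a list of terminal words."""
    if symbol not in GRAMMAR:              # terminal word
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    return [word for sym in production for word in generate(sym)]

if __name__ == "__main__":
    for _ in range(3):
        print(" ".join(generate()))        # e.g. "boy who chase cat see Mary ."
```

Because an RC can itself contain an NP, arbitrarily deep embeddings can be produced, which is exactly the kind of structure a finite-state model cannot capture in general.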

In the Chomsky hierarchy of phrase structured grammars, the simplest grammars and their associated automata are regular grammars and finite-state automata (FSA). However, it has been firmly established [6] that the syntactic structures of natural language cannot be parsimoniously described by regular languages. Certain phenomena (e.g. center embedding) are more compactly described by context-free grammars, which are recognized by push-down automata, while others (e.g. crossed-serial dependencies and agreement) are better described by context-sensitive grammars, which are recognized by linear bounded automata [50].

3.2 Data

The data used in this work consists of 552 English positive and negative examples taken from an introductory GB-linguistics textbook by Lasnik and Uriagereka [37]. Most of these examples are organized into minimal pairs like the example I am eager for John to win / *I am eager John to win above. The minimal nature of the changes involved suggests that the data set may represent an especially difficult task for the models. Due to the small sample size, the raw data, namely words, were first converted (using an existing parser) into the major syntactic categories assumed under GB-theory. Table 2 summarizes the parts of speech that were used.

The part-of-speech tagging represents the sole grammatical information supplied to the models about particular sentences in addition to the grammaticality status. An important refinement that was implemented


was to include sub-categorization information for the major predicates, namely nouns, verbs, adjectives and prepositions. Experiments showed that adding sub-categorization to the bare category information improved the performance of the models. For example, an intransitive verb such as sleep would be placed into a different class from the obligatorily transitive verb hit. Similarly, verbs that take sentential complements or double objects such as seem, give or persuade would be representative of other classes³. Fleshing out the sub-categorization requirements along these lines for lexical items in the training set resulted in 9 classes for verbs, 4 for nouns and adjectives, and 2 for prepositions. Examples of the input data are shown in table 3.

Category             Examples
Nouns (N)            John, book and destruction
Verbs (V)            hit, be and sleep
Adjectives (A)       eager, old and happy
Prepositions (P)     without and from
Complementizer (C)   that or for as in "I thought that ..." or "I am eager for ..."
Determiner (D)       the or each as in "the man" or "each man"
Adverb (Adv)         sincerely or why as in "I sincerely believe ..." or "Why did John want ..."
Marker (Mrkr)        possessive 's, of, or to as in "John's mother", "the destruction of ...", or "I want to help ..."

Table 2. Parts of speech.

Sentence                          Encoding                    Grammatical Status
I am eager for John to be here    n4 v2 a2 c n4 v2 adv        1
                                  n4 v2 a2 c n4 p1 v2 adv     1
I am eager John to be here        n4 v2 a2 n4 v2 adv          0
                                  n4 v2 a2 n4 p1 v2 adv       0
I am eager to be here             n4 v2 a2 v2 adv             1
                                  n4 v2 a2 p1 v2 adv          1

Table 3. Examples of the part-of-speech tagging.

Tagging was done in a completely context-free manner. Obviously, a word, e.g. to, may be part of more than one part-of-speech. The tagging resulted in several contradictory and duplicated sentences. Various methods

³Following classical GB theory, these classes are synthesized from the theta-grids of individual predicates via the Canonical Structural Realization (CSR) mechanism of Pesetsky [49].


were tested to deal with these cases, however they were removed altogether for the results reported here. In addition, the number of positive and negative examples was equalized (by randomly removing examples from the higher frequency class) in all training and test sets in order to reduce any effects due to differing a priori class probabilities (when the number of samples per class varies between classes there may be a bias towards predicting the more common class [3, 2]).
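A minimal sketch of this class-equalization step, assuming the data is available as a list of (encoded sentence, label) pairs; the function and variable names here are illustrative, not from the paper.

```python
import random

def equalize_classes(examples, seed=0):
    """Randomly drop examples from the larger class so that positive (label 1)
    and negative (label 0) examples occur equally often."""
    rng = random.Random(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)
```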

4 Neural Network Models and Data Encoding

The following architectures were investigated. Architectures 1 to 3 are topological restrictions of 4 when the number of hidden nodes is equal, and in this sense may not have the representational capability of model 4. It is expected that the Frasconi-Gori-Soda (FGS) architecture will be unable to perform the task and it has been included primarily as a control case.

1. Frasconi-Gori-Soda (FGS) locally recurrent networks [16]. A multilayer perceptron augmented with local feedback around each hidden node. The local-output version has been used. The FGS network has also been studied by [43] – the network is called FGS in this paper in line with [63].

2. Narendra and Parthasarathy (N&P) [44]. A recurrent network with feedback connections from each output node to all hidden nodes. The N&P network architecture has also been studied by Jordan [33, 34] – the network is called N&P in this paper in line with [30].

3. Elman [13]. A recurrent network with feedback from each hidden node to all hidden nodes. When training the Elman network, backpropagation-through-time is used rather than the truncated version used by Elman, i.e. in this paper "Elman network" refers to the architecture used by Elman but not the training algorithm.

4. Williams and Zipser (W&Z) [67]. A recurrent network where all nodes are connected to all other nodes.

Diagrams of these architectures are shown in figures 1 to 4.
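The essential difference between the four architectures is where the feedback enters. The sketch below is our own illustrative summary (not code from the paper); it writes one time step of each network with numpy, where f is the hidden activation, g the output activation, and the weight matrices are assumed to be created elsewhere.

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # hidden activation (logistic)
g = f                                    # output activation

# FGS (local-output): each hidden unit feeds back only onto itself.
def fgs_step(x, h_prev, W_xh, s_diag, W_hy):
    h = f(W_xh @ x + s_diag * h_prev)    # s_diag: one self-weight per hidden unit
    return h, g(W_hy @ h)

# N&P (Jordan-style): the previous *outputs* feed back to the hidden layer.
def np_step(x, y_prev, W_xh, W_yh, W_hy):
    h = f(W_xh @ x + W_yh @ y_prev)
    return h, g(W_hy @ h)

# Elman: the previous *hidden state* feeds back to all hidden units.
def elman_step(x, h_prev, W_xh, W_hh, W_hy):
    h = f(W_xh @ x + W_hh @ h_prev)
    return h, g(W_hy @ h)

# W&Z: fully recurrent -- every unit (hidden and output) feeds every unit.
def wz_step(x, z_prev, W_xz, W_zz, out_idx):
    z = f(W_xz @ x + W_zz @ z_prev)      # z holds hidden and output units together
    return z, z[out_idx]                 # outputs are a designated subset of z
```

Restricting W_hh to a diagonal matrix recovers the FGS pattern from the Elman pattern, and the Elman pattern is itself a restriction of the W&Z pattern in which the outputs do not feed back, which is the sense in which architectures 1 to 3 are topological restrictions of architecture 4.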

For input to the neural networks, the data was encoded into a fixed length window made up of segments containing eight separate inputs, corresponding to the classifications noun, verb, adjective, etc. Sub-categories of the classes were linearly encoded into each input in a manner demonstrated by the specific values for the noun input: Not a noun = 0, noun class 1 = 0.5, noun class 2 = 0.667, noun class 3 = 0.833, noun class 4 = 1. The linear order was defined according to the similarity between the various sub-categories⁴. Two outputs were used in the neural networks, corresponding to grammatical and ungrammatical classifications. The data was input to the neural networks with a window which is passed over the sentence in temporal

⁴A fixed length window made up of segments containing 23 separate inputs, corresponding to the classifications noun class 1, noun class 2, verb class 1, etc. was also tested but proved inferior.


Figure 1. A Frasconi-Gori-Soda locally recurrent network. Not all connections are shown fully.

Figure 2. A Narendra & Parthasarathy recurrent network. Not all connections are shown fully.

Figure 3. An Elman recurrent network. Not all connections are shown fully.


Figure 4. A Williams & Zipser fully recurrent network. Not all connections are shown fully.

order from the beginning to the end of the sentence (see figure 5). The size of the window was variable from one word to the length of the longest sentence. We note that the case where the input window is small is of greater interest – the larger the input window, the greater the capability of a network to correctly classify the training data without forming a grammar. For example, if the input window is equal to the longest sentence, then the network does not have to store any information – it can simply map the inputs directly to the classification. However, if the input window is relatively small, then the network must learn to store information. As will be shown later, these networks implement a grammar, and a deterministic finite state automaton which recognizes this grammar can be extracted from the network. Thus, we are most interested in the small input window case, where the networks are required to form a grammar in order to perform well.
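To make the encoding concrete, the following sketch converts a tagged sentence such as "n4 v2 a2 c n4 v2 adv" into the sequence of fixed-length input vectors produced by a two-word sliding window. The noun sub-class values follow the description above; the sub-class formula for other categories, the zero-padding at the start of the sentence, and the helper names are our own assumptions.

```python
import numpy as np

# One input per major category; sub-classes are linearly encoded within [0, 1].
CATEGORIES = ["n", "v", "a", "p", "c", "d", "adv", "mrkr"]

def encode_tag(tag):
    """Map a tag like 'n4' or 'adv' to an 8-dimensional vector."""
    vec = np.zeros(len(CATEGORIES))
    cat = tag.rstrip("0123456789")
    sub = tag[len(cat):]
    # Linear encoding, e.g. for nouns: n1=0.5, n2=0.667, n3=0.833, n4=1.0.
    # Categories without sub-classes (e.g. 'c', 'adv') are encoded as 1.0.
    if sub:
        n_classes = {"n": 4, "v": 9, "a": 4, "p": 2}.get(cat, 1)
        value = 0.5 + 0.5 * (int(sub) - 1) / max(n_classes - 1, 1)
    else:
        value = 1.0
    vec[CATEGORIES.index(cat)] = value
    return vec

def window_inputs(tags, window=2):
    """Slide a window of `window` words over the sentence, padding the start with zeros."""
    vecs = [encode_tag(t) for t in tags]
    padded = [np.zeros(len(CATEGORIES))] * (window - 1) + vecs
    return [np.concatenate(padded[i:i + window]) for i in range(len(vecs))]

inputs = window_inputs("n4 v2 a2 c n4 v2 adv".split())   # 7 vectors of length 16
```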

5 Gradient Descent and Simulated Annealing Learning

Backpropagation-through-time [66]⁵ has been used to train the globally recurrent networks⁶, and the gradient descent algorithm described by the authors [16] was used for the FGS network. The standard gradient descent algorithms were found to be impractical for this problem⁷. The techniques described below for improving convergence were investigated. Due to the dependence on the initial parameters, a number of simulations were performed with different initial weights and training set/test set combinations. However, due to the computational complexity of the task⁸, it was not possible to perform as many simulations as

⁵Backpropagation-through-time extends backpropagation to include temporal aspects and arbitrary connection topologies by considering an equivalent feedforward network created by unfolding the recurrent network in time.
⁶Real-time recurrent learning (RTRL) [67] was also tested but did not show any significant convergence for the present problem.
⁷Without modifying the standard gradient descent algorithms it was only possible to train networks which operated on a large temporal input window. These networks were not forced to model the grammar; they only memorized and interpolated between the training data.
⁸Each individual simulation in this section took an average of two hours to complete on a Sun Sparc 10 server.


Figure 5. Depiction of how the neural network inputs come from an input window on the sentence. The window moves from the beginning to the end of the sentence.

desired. The standard deviation of the NMSE values is included to help assess the significance of the results. Table 4 shows some results for using and not using the techniques listed below. Except where noted, the results in this section are for Elman networks using: two word inputs, 10 hidden nodes, the quadratic cost function, the logistic sigmoid function, sigmoid output activations, one hidden layer, the learning rate schedule shown below, an initial learning rate of 0.2, the weight initialization strategy discussed below, and 1 million stochastic updates (target values are only provided at the end of each sentence).

Standard                    NMSE    Std. Dev.    Variation                    NMSE    Std. Dev.
Update: batch               0.931   0.0036       Update: stochastic           0.366   0.035
Learning rate: constant     0.742   0.154        Learning rate: schedule      0.394   0.035
Activation: logistic        0.387   0.023        Activation: tanh             0.405   0.14
Sectioning: no              0.367   0.011        Sectioning: yes              0.573   0.051
Cost function: quadratic    0.470   0.078        Cost function: entropy       0.651   0.0046

Table 4. Comparisons of using and not using various convergence techniques. All other parameters are constant in each case: Elman networks using: two word inputs (i.e. a sliding window of the current and previous word), 10 hidden nodes, the quadratic cost function, the logistic activation function, sigmoid output activations, one hidden layer, a learning rate schedule with an initial learning rate of 0.2, the weight initialization strategy discussed below, and 1 million stochastic updates. Each NMSE result represents the average of four simulations. The standard deviation value given is the standard deviation of the four individual results.


1. Detection of Significant Error Increases. If the NMSE increases significantly during training then network weights are restored from a previous epoch and are perturbed to prevent updating to the same point. This technique was found to increase robustness of the algorithm when using learning rates large enough to help avoid problems due to local minima and "flat spots" on the error surface, particularly in the case of the Williams & Zipser network.
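A sketch of this rollback heuristic, under the assumption that training proceeds in epochs and that the network's weights are held as a list of numpy arrays; the function and attribute names are illustrative.

```python
import copy
import numpy as np

def train_with_rollback(net, train_epoch, evaluate, epochs,
                        tolerance=1.2, noise=1e-3, rng=np.random.default_rng(0)):
    """Restore and perturb the weights whenever the NMSE rises significantly."""
    best_weights, best_nmse = copy.deepcopy(net.weights), evaluate(net)
    for _ in range(epochs):
        train_epoch(net)
        nmse = evaluate(net)
        if nmse > tolerance * best_nmse:
            # Significant increase: go back to the previous good weights and
            # perturb them slightly so the next update does not repeat itself.
            net.weights = [w + noise * rng.standard_normal(w.shape)
                           for w in copy.deepcopy(best_weights)]
        elif nmse < best_nmse:
            best_weights, best_nmse = copy.deepcopy(net.weights), nmse
    return net
```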

2. Target Outputs. Target outputs were 0.1 and 0.9 using the logistic activation function and -0.8 and 0.8 using the tanh activation function. This helps avoid saturating the sigmoid function. If targets were set to the asymptotes of the sigmoid this would tend to: a) drive the weights to infinity, b) cause outlier data to produce very large gradients due to the large weights, and c) produce binary outputs even when incorrect – leading to decreased reliability of the confidence measure.

3. Stochastic Versus Batch Update. In stochastic update, parameters are updated after each pattern presentation, whereas in true gradient descent (often called "batch" updating) gradients are accumulated over the complete training set. Batch update attempts to follow the true gradient, whereas a stochastic path is followed using stochastic update.

Stochastic update is often much quicker than batch update, especially with large, redundant data sets [39]. Additionally, the stochastic path may help the network to escape from local minima. However, the error can jump around without converging unless the learning rate is reduced, most second order methods do not work well with stochastic update, and stochastic update is harder to parallelize than batch [39]. Batch update provides guaranteed convergence (to local minima) and works better with second order techniques. However it can be very slow, and may converge to very poor local minima.

In the results reported, the training times were equalized by reducing the number of updates for the batch case (for an equal number of weight updates batch update would otherwise be much slower). Batch update often converges quicker using a higher learning rate than the optimal rate used for stochastic update⁹, hence altering the learning rate for the batch case was investigated. However, significant convergence was not obtained as shown in table 4.
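The difference between the two update modes amounts to where the gradient is accumulated and when the weights change. A minimal sketch, assuming a per-pattern gradient function grad(w, x, d) is available (the names here are illustrative):

```python
def stochastic_epoch(w, data, grad, lr):
    # Update after every pattern: follows a noisy path around the true gradient.
    for x, d in data:
        w = [wi - lr * gi for wi, gi in zip(w, grad(w, x, d))]
    return w

def batch_epoch(w, data, grad, lr):
    # Accumulate the gradient over the whole training set, then update once.
    total = None
    for x, d in data:
        g = grad(w, x, d)
        total = g if total is None else [ti + gi for ti, gi in zip(total, g)]
    return [wi - lr * ti / len(data) for wi, ti in zip(w, total)]
```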

4. Weight Initialization. Random weights are initialized with the goal of ensuring that the sigmoids do not start out in saturation but are not very small (corresponding to a flat part of the error surface) [26]. In addition, several (20) sets of random weights are tested and the set which provides the best performance on the training data is chosen. In our experiments on the current problem, it was found that these techniques do not make a significant difference.
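A sketch of this initialization strategy (fan-in-scaled uniform weights with the best of 20 candidate sets kept), assuming the ±2.4/F fan-in range recommended by Haykin [26]; the helper names and the exact constant used here are assumptions of this sketch.

```python
import numpy as np

def init_weights(layer_shapes, rng):
    """One weight matrix per layer, drawn uniformly from (-2.4/F, 2.4/F),
    where F is the fan-in of the receiving nodes (Haykin-style initialization)."""
    weights = []
    for fan_out, fan_in in layer_shapes:
        limit = 2.4 / fan_in
        weights.append(rng.uniform(-limit, limit, size=(fan_out, fan_in)))
    return weights

def best_of_n_inits(layer_shapes, train_error, n=20, seed=0):
    """Try n random weight sets and keep the one with the lowest training error."""
    rng = np.random.default_rng(seed)
    candidates = [init_weights(layer_shapes, rng) for _ in range(n)]
    return min(candidates, key=train_error)
```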

5. Learning Rate Schedules. Relatively high learning rates are typically used in order to help avoid slow convergence and local minima. However, a constant learning rate results in significant parameter and performance fluctuation during the entire training cycle such that the performance of the network can

⁹Stochastic update does not generally tolerate as high a learning rate as batch update due to the stochastic nature of the updates.


alter significantly from the beginning to the end of the final epoch. Moody and Darken have proposed "search then converge" learning rate schedules of the form [10, 11]:

$\eta(t) = \frac{\eta_0}{1 + t/c}$     (1)

where $\eta(t)$ is the learning rate at time $t$, $\eta_0$ is the initial learning rate, and $c$ is a constant.

We have found that the learning rate during the final epoch still results in considerable parameter fluctuation¹⁰, and hence we have added an additional term to further reduce the learning rate over the final epochs (our specific learning rate schedule can be found in a later section). We have found the use of learning rate schedules to improve performance considerably as shown in table 4.
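A sketch of a search-then-converge schedule of the form of equation (1), with an extra multiplicative factor that ramps the rate down over the last epochs. The ramp used here is our own illustrative choice; the exact final-epoch term used in the paper is not reproduced.

```python
def learning_rate(t, eta0=0.2, c=50.0, n_total=None, final_fraction=0.2):
    """Search-then-converge schedule eta(t) = eta0 / (1 + t/c), optionally
    multiplied by a linear ramp towards zero over the final epochs (an
    illustrative choice; the paper's exact final-epoch term differs)."""
    eta = eta0 / (1.0 + t / c)
    if n_total is not None:
        ramp_start = (1.0 - final_fraction) * n_total
        if t > ramp_start:
            eta *= max(0.0, (n_total - t) / (n_total - ramp_start))
    return eta
```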

6. Activation Function. Symmetric sigmoid functions (e.g. tanh) often improve convergence over the standard logistic function. For our particular problem we found that the difference was minor and that the logistic function resulted in better performance as shown in table 4.

7. Cost Function. The relative entropy cost function [4, 29, 57, 26, 27] has received particular attention and has a natural interpretation in terms of learning probabilities [36]. We investigated using both quadratic and relative entropy cost functions:

Definition 1. The quadratic cost function is defined as

$E = \frac{1}{2} \sum_k (d_k - y_k)^2$     (2)

Definition 2. The relative entropy cost function is defined as

$E = \sum_k \left[ \tfrac{1}{2}(1 + d_k)\log\frac{1 + d_k}{1 + y_k} + \tfrac{1}{2}(1 - d_k)\log\frac{1 - d_k}{1 - y_k} \right]$     (3)

where $y$ and $d$ correspond to the actual and desired output values, and $k$ ranges over the outputs (and also the patterns for batch update). We found the quadratic cost function to provide better performance as shown in table 4. A possible reason for this is that the use of the entropy cost function leads to an increased variance of weight updates and therefore decreased robustness in parameter updating.
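The two cost functions, written for a vector of actual outputs y and desired outputs d, as a direct transcription of equations (2) and (3) as reconstructed above; a small epsilon guarding the logarithms is our own addition.

```python
import numpy as np

def quadratic_cost(y, d):
    """Equation (2): half the summed squared error."""
    return 0.5 * np.sum((d - y) ** 2)

def relative_entropy_cost(y, d, eps=1e-12):
    """Equation (3): relative entropy between desired and actual outputs,
    written for outputs and targets in (-1, 1)."""
    yp, dp = (1.0 + y) / 2.0, (1.0 + d) / 2.0          # map to (0, 1)
    return np.sum(dp * np.log((dp + eps) / (yp + eps))
                  + (1.0 - dp) * np.log((1.0 - dp + eps) / (1.0 - yp + eps)))
```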

8. Sectioning of the Training Data. We investigated dividing the training data into subsets. Initially, only one of these subsets was used for training. After 100% correct classification was obtained or a pre-specified time limit expired, an additional subset was added to the "working" set. This continued until the working set contained the entire training set. The data was ordered in terms of sentence length with

¹⁰NMSE results which are obtained over an epoch involving stochastic update can be misleading. We have been surprised to find quite significant difference in these on-line NMSE calculations compared to a static calculation even if the algorithm appears to have converged.


the shortest sentences first. This enabled the networks to focus on the simpler data first. Elman suggests that the initial training constrains later training in a useful way [13]. However, for our problem, the use of sectioning has consistently decreased performance as shown in table 4.
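A sketch of the sectioning procedure, assuming each example is a (sequence, label) pair and that a helper trains on the current working set until 100% classification or a time limit; the names are illustrative.

```python
def train_with_sectioning(net, examples, train_until, n_subsets=4):
    """Grow the working set subset by subset, shortest sentences first."""
    examples = sorted(examples, key=lambda e: len(e[0]))   # order by sentence length
    subset_size = (len(examples) + n_subsets - 1) // n_subsets
    working_set = []
    for i in range(n_subsets):
        working_set += examples[i * subset_size:(i + 1) * subset_size]
        # Train until 100% correct on the working set or a time limit expires.
        train_until(net, working_set)
    return net
```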

We have also investigated the use of simulated annealing. Simulated annealing is a global optimization method [32, 35]. When minimizing a function, any downhill step is accepted and the process repeats from this new point. An uphill step may also be accepted. It is therefore possible to escape from local minima. As the optimization process proceeds, the length of the steps declines and the algorithm converges on the global optimum. Simulated annealing makes very few assumptions regarding the function to be optimized, and is therefore quite robust with respect to non-quadratic error surfaces.

Previous work has shown the use of simulated annealing for finding the parameters of a recurrent network model to improve performance [56]. For comparison with the gradient descent based algorithms, the use of simulated annealing has been investigated in order to train exactly the same Elman network as has been successfully trained to 100% correct training set classification using backpropagation-through-time (details are in section 6). No significant results were obtained from these trials¹¹. The use of simulated annealing has not been found to improve performance as in Simard et al. [56]. However, their problem was the parity problem using networks with only four hidden units, whereas the networks considered in this paper have many more parameters.

This result provides an interesting comparison to the gradient descent backpropagation-through-time (BPTT) method. BPTT makes the implicit assumption that the error surface is amenable to gradient descent optimization, and this assumption can be a major problem in practice. However, although difficulty is encountered with BPTT, the method is significantly more successful than simulated annealing (which makes few assumptions) for this problem.

6 Experimental Results

Results for the four neural network architectures are given in this section. The results are based on multiple training/test set partitions and multiple random seeds. In addition, a set of Japanese control data was used as a test set (we do not consider training models with the Japanese data because we do not have a large enough data set). Japanese and English are at the opposite ends of the spectrum with regard to word order. That is, Japanese sentence patterns are very different from English. In particular, Japanese sentences are typically SOV (subject-object-verb), with the verb more or less fixed and the other arguments more or less available to freely permute. English data is of course SVO and argument permutation is generally not available. For example, the canonical Japanese word order is simply ungrammatical in English. Hence, it would be extremely surprising if an English-trained model accepts Japanese, i.e. it is expected that a network trained

¹¹The adaptive simulated annealing code by Lester Ingber [31, 32] was used.


on English will not generalize to Japanese data. This is what we find – all models resulted in no significant generalization on the Japanese data (50% error on average).

Five simulations were performed for each architecture. Each simulation took approximately four hours on a Sun Sparc 10 server. Table 5 summarizes the results obtained with the various networks. In order to make the number of weights in each architecture approximately equal, we have used only single word inputs for the W&Z model but two word inputs for the others. This reduction in dimensionality for the W&Z network improved performance. The networks contained 20 hidden units. Full simulation details are given in section 6.

The goal was to train a network using only a small temporal input window. Initially, this could not be done. With the addition of the techniques described earlier it was possible to train Elman networks with sequences of the last two words as input to give 100% correct (99.6% averaged over 5 trials) classification on the training data. Generalization on the test data resulted in 74.2% correct classification on average. This is better than the performance obtained using any of the other networks, however it is still quite low. The data is quite sparse and it is expected that increased generalization performance will be obtained as the amount of data is increased, as well as increased difficulty in training. Additionally, the data set has been hand-designed by GB linguists to cover a range of grammatical structures and it is likely that the separation into the training and test sets creates a test set which contains many grammatical structures that are not covered in the training set. The Williams & Zipser network also performed reasonably well with 71.3% correct classification of the test set. Note that the test set performance was not observed to drop significantly after extended training, indicating that the use of a validation set to control possible overfitting would not alter performance significantly.

TRAIN           Classification   Std. dev.
Elman           99.6%            0.84
FGS             67.1%            1.22
N&P             75.2%            1.41
W&Z             91.7%            2.26

ENGLISH TEST    Classification   Std. dev.
Elman           74.2%            3.82
FGS             59.0%            1.52
N&P             60.6%            0.97
W&Z             71.3%            0.75

Table 5. Results of the network architecture comparison. The classification values reported are an average of five individual simulations, and the standard deviation value is the standard deviation of the five individual results.

Complete details on a sample Elman network are as follows (other networks differ only in topology except


for the W&Z for which better results were obtained using an input window of only one word): the network contained three layers including the input layer. The hidden layer contained 20 nodes. Each hidden layer node had a recurrent connection to all other hidden layer nodes. The network was trained for a total of 1 million stochastic updates. All inputs were within the range zero to one. All target outputs were either 0.1 or 0.9. Bias inputs were used. The best of 20 random weight sets was chosen based on training set performance. Weights were initialized as shown in Haykin [26], where weights are initialized on a node by node basis as uniformly distributed random numbers in the range $(-2.4/F_i, 2.4/F_i)$ where $F_i$ is the fan-in of neuron $i$. The logistic output activation function was used. The quadratic cost function was used. The search then converge learning rate schedule of equation (1), augmented with the additional term described in section 5 to further reduce the learning rate over the final epochs, was used, where $\eta$ = learning rate, $\eta_0$ = initial learning rate, $N$ = total training epochs, $n$ = current training epoch, and $c_1$, $c_2$ are constants of the schedule. The training set consisted of 373 non-contradictory examples as described earlier. The English test set consisted of 100 non-contradictory samples and the Japanese test set consisted of 119 non-contradictory samples.
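For concreteness, a minimal forward pass for an Elman network with the configuration described above (16 inputs from the two-word window, 20 fully recurrent hidden nodes, 2 logistic outputs, bias inputs). This is a sketch of the architecture only, not the authors' code; training it would still require backpropagation-through-time, and the initialization range here is arbitrary.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

class ElmanNet:
    def __init__(self, n_in=16, n_hidden=20, n_out=2, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.uniform(-0.1, 0.1, size=shape)
        self.W_xh, self.W_hh = init(n_hidden, n_in), init(n_hidden, n_hidden)
        self.W_hy = init(n_out, n_hidden)
        self.b_h, self.b_y = init(n_hidden), init(n_out)
        self.n_hidden = n_hidden

    def forward(self, inputs):
        """Run a sentence (sequence of window vectors) through the network and
        return the two outputs (grammatical / ungrammatical) at the final word."""
        h = np.zeros(self.n_hidden)
        for x in inputs:
            h = logistic(self.W_xh @ x + self.W_hh @ h + self.b_h)
        return logistic(self.W_hy @ h + self.b_y)
```

The target supplied at the end of each sentence would then be, for example, [0.9, 0.1] for a grammatical example and [0.1, 0.9] for an ungrammatical one, matching the 0.1/0.9 targets described above.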

We now take a closer look at the operation of the networks. The error during training for a sample of each network architecture is shown in figure 6. The error at each point in the graphs is the NMSE over the complete training set. Note the nature of the Williams & Zipser learning curve and the utility of detecting and correcting for significant error increases¹².

Figure 7 shows an approximation of the "complexity" of the error surface based on the first derivatives of the error criterion with respect to each weight, $\zeta = \frac{1}{N_w} \sum_i \left| \frac{\partial E}{\partial w_i} \right|$, where $i$ sums over all weights in the network and $N_w$ is the total number of weights. This value has been plotted after each epoch during training. Note the more complex nature of the plot for the Williams & Zipser network.
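The "complexity" value plotted in figure 7 is simply the mean absolute first derivative of the cost with respect to the weights; as a sketch, assuming the per-weight gradients are available as a list of arrays:

```python
import numpy as np

def error_surface_complexity(gradients):
    """Mean absolute gradient over all N_w weights: sum_i |dE/dw_i| / N_w."""
    g = np.concatenate([np.ravel(gi) for gi in gradients])
    return np.sum(np.abs(g)) / g.size
```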

Figures 8 to 11 show sample plots of the error surface for the various networks. The error surface has many dimensions, making visualization difficult. We plot sample views showing the variation of the error for two dimensions, and note that these plots are indicative only – quantitative conclusions should be drawn from the test error and not from these plots. Each plot shown in the figures is with respect to only two randomly chosen dimensions. In each case, the center of the plot corresponds to the values of the parameters after training. Taken together, the plots provide an approximate indication of the nature of the error surface for the different network types. The FGS network error surface appears to be the smoothest, however the results indicate that the solutions found do not perform very well, indicating that the minima found are poor compared to the global optimum and/or that the network is not capable of implementing a mapping with low error. The Williams & Zipser fully connected network has greater representational capability than the Elman architecture (in the sense that it can perform a greater variety of computations with the same number of hidden units). However, comparing the Elman and W&Z network error surface plots, it can be observed that the W&Z network has a greater percentage of flat spots. These graphs are not conclusive (because they only show two dimensions and are only plotted around one point in the weight space), however they back up

¹²The learning curve for the Williams & Zipser network can be made smoother by reducing the learning rate but this tends to promote convergence to poorer local minima.


Figure 6. Average NMSE (log scale) over the training set during training. Top to bottom: Frasconi-Gori-Soda (final NMSE 0.835), Elman (final NMSE 0.0569), Narendra & Parthasarathy (final NMSE 0.693) and Williams & Zipser (final NMSE 0.258).

the hypothesis that the W&Z network performs worse because the error surface presents greater difficulty to the training method.

7 Automata Extraction

The extraction of symbolic knowledge from trained neural networks allows the exchange of information between connectionist and symbolic knowledge representations and has been of great interest for understanding what the neural network is actually doing [52]. In addition, symbolic knowledge can be inserted into recurrent neural networks and even refined after training [15, 47, 45].

The ordered triple of a discrete Markov process (state; input → next-state) can be extracted from a RNN


Figure 7. Approximate "complexity" of the error surface during training. Top to bottom: Frasconi-Gori-Soda, Elman, Narendra & Parthasarathy and Williams & Zipser.

and used to form an equivalent deterministic finite state automaton (DFA). This can be done by clustering the activation values of the recurrent state neurons [46]. The automata extracted with this process can only recognize regular grammars¹³.

However, natural language [6] cannot be parsimoniously described by regular languages – certain phenomena (e.g. center embedding) are more compactly described by context-free grammars, while others (e.g. crossed-serial dependencies and agreement) are better described by context-sensitive grammars. Hence, the networks may be implementing more parsimonious versions of the grammar which we are unable to extract with this technique.

¹³A regular grammar G is a 4-tuple G = (S, N, T, P) where S is the start symbol, N and T are non-terminal and terminal symbols, respectively, and P represents productions of the form A → a or A → aB where A, B ∈ N and a ∈ T.


Figure 8. Error surface plots for the FGS network. Each plot is with respect to two randomly chosen dimensions. In each case, the center of the plot corresponds to the values of the parameters after training.

Figure 9. Error surface plots for the N&P network. Each plot is with respect to two randomly chosen dimensions. In each case, the center of the plot corresponds to the values of the parameters after training.

The algorithm we use for automata extraction (from [19]) works as follows: after the network is trained (or even during training), we apply a procedure for extracting what the network has learned — i.e., the network's current conception of what DFA it has learned. The DFA extraction process includes the following steps: 1) clustering of the recurrent network activation space, S, to form DFA states, 2) constructing a transition


Figure 10. Error surface plots for the Elman network. Each plot is with respect to two randomly chosen dimensions. In each case, the center of the plot corresponds to the values of the parameters after training.

Figure 11. Error surface plots for the W&Z network. Each plot is with respect to two randomly chosen dimensions. In each case, the center of the plot corresponds to the values of the parameters after training.

diagram by connecting these states together with the alphabet labelled arcs, 3) putting these transitions together to make the full digraph – forming loops, and 4) reducing the digraph to a minimal representation. The hypothesis is that during training, the network begins to partition (or quantize) its state space into fairly well-separated, distinct regions or clusters, which represent corresponding states in some finite state


automaton (recently, it has been proved that arbitrary DFAs can be stably encoded into recurrent neural networks [45]). One simple way of finding these clusters is to divide each neuron's range into $q$ partitions of equal width. Thus for $N$ hidden neurons, there exist $q^N$ possible partition states. The DFA is constructed by generating a state transition diagram, i.e. associating an input symbol with the partition state it just left and the partition state it activates. The initial partition state, or start state of the DFA, is determined from the initial value of $S(t=0)$. If the next input symbol maps to the same partition state value, we assume that a loop is formed. Otherwise, a new state in the DFA is formed. The DFA thus constructed may contain a maximum of $q^N$ states; in practice it is usually much less, since not all partition states are reached by $S(t)$. Eventually this process must terminate since there are only a finite number of partitions available; and, in practice, many of the partitions are never reached. The derived DFA can then be reduced to its minimal DFA using standard minimization algorithms [28]. It should be noted that this DFA extraction method may be applied to any discrete-time recurrent net, regardless of order or hidden layers. Recently, the extraction process has been proven to converge and extract any DFA learned or encoded by the neural network [5].
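A sketch of the quantization-based extraction procedure (steps 1 to 3; minimization in step 4 would use a standard DFA minimization algorithm), assuming the trained network can report its hidden state sequence for a given input sequence. The function names and the convention that activations lie in [0, 1] are assumptions of this sketch.

```python
def extract_dfa(run_network, sequences, q=3):
    """Quantize hidden states into q equal-width bins per neuron and record the
    transitions (partition state, input symbol) -> partition state."""
    def partition(state):
        # Each neuron's range [0, 1] is split into q equal-width bins.
        return tuple(min(int(s * q), q - 1) for s in state)

    transitions, start_state = {}, None
    for seq in sequences:
        # run_network returns the hidden state before any input plus one state
        # per input symbol, i.e. len(seq) + 1 states in total.
        states = [partition(s) for s in run_network(seq)]
        if start_state is None:
            start_state = states[0]          # start state from the initial state S(t=0)
        for symbol, prev, nxt in zip(seq, states, states[1:]):
            transitions[(prev, symbol)] = nxt    # identical states form a loop
    return start_state, transitions
```

The resulting transition table can then be reduced with a standard minimization algorithm [28] and compared across values of q.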

The extracted DFAs depend on the quantization level, q. We extracted DFAs using values of q starting from 3 and used standard minimization techniques to compare the resulting automata [28]. We passed the training and test data sets through the extracted DFAs. We found that the extracted automata correctly classified 95% of the training data and 60% of the test data for the best-performing value of q. Smaller values of q produced DFAs with lower performance and larger values of q did not produce significantly better performance. A sample extracted automaton can be seen in figure 12. It is difficult to interpret the extracted automata, and a topic for future research is analysis of the extracted automata with a view to aiding interpretation. Additionally, an important open question is how well the extracted automata approximate the grammar implemented by the recurrent network, which may not be a regular grammar.

Automata extraction may also be useful for improving the performance of the system via an iterative combination of rule extraction and rule insertion. Significant learning time improvements can be achieved by training networks with prior knowledge [46]. This may lead to the ability to train larger networks which encompass more of the target grammar.

8 Discussion

This paper has investigated the use of various recurrent neural network architectures (FGS, N&P, Elman and W&Z) for classifying natural language sentences as grammatical or ungrammatical, thereby exhibiting the same kind of discriminatory power provided by the Principles and Parameters linguistic framework, or Government-and-Binding theory. From best to worst performance, the architectures were: Elman, W&Z, N&P and FGS. It is not surprising that the Elman network outperforms the FGS and N&P networks. The computational power of Elman networks has been shown to be at least Turing equivalent [55], where the N&P networks have been shown to be Turing equivalent [54] but to within a linear slowdown. FGS networks


Figure 12. An automaton extracted from an Elman network trained to perform the natural language task. The start state is state 1 at the bottom left and the accepting state is state 17 at the top right. All strings which do not reach the accepting state are rejected.

have recently been shown to be the most computationally limited [14]. Elman networks are just a special case of W&Z networks – the fact that the Elman and W&Z networks are the top performers is not surprising. However, theoretically why the Elman network outperformed the W&Z network is an open question. Our experimental results suggest that this is a training issue and not a representational issue. Backpropagation-through-time (BPTT) is an iterative algorithm that is not guaranteed to find the global minima of the cost function error surface. The error surface is different for the Elman and W&Z networks, and our results suggest that the error surface of the W&Z network is less suitable for the BPTT training algorithm used. However, all architectures do learn some representation of the grammar.

Are the networks learning the grammar? The hierarchy of architectures with increasing computational power (for a given number of hidden nodes) gives an insight into whether the increased power is used to model the

21

Page 22: Natural Language Grammatical Inference with Recurrent ...clgiles.ist.psu.edu/papers/IEEE.TKDE.NL.learning.recurrent.nets.pdfWe first provide a brief introduction to formal grammars,

morecomplex structuresfoundin thegrammar. Thefactthatthemorepowerful ElmanandW&Z networks

providedincreasedperformancesuggeststhatthey wereableto find structurein thedatawhichit maynotbe

possibleto modelwith theFGSnetwork. Additionally, investigationof thedatasuggeststhat100%correct

classificationonthetrainingdatawith only two wordinputswouldnotbepossibleunlessthenetworkswere

ableto learnsignificantaspectsof thegrammar.

Another comparison of recurrent neural network architectures, that of Giles and Horne [30], evaluated various networks on randomly generated 6- and 64-state finite memory machines. The locally recurrent and Narendra & Parthasarathy networks proved as good as or superior to more powerful networks such as the Elman network, indicating either that the task did not require the increased power, or that the vanilla backpropagation-through-time learning algorithm used was unable to exploit it.

This paper has shown that both Elman and W&Z recurrent neural networks are able to learn an appropriate grammar for discriminating between the sharply grammatical/ungrammatical pairs used by GB linguists. However, generalization is limited by the amount and nature of the data available, and it is expected that increased difficulty will be encountered in training the models as more data is used. It is clear that there is considerable difficulty in scaling the models considered here up to larger problems. We need to continue to address the convergence of the training algorithms, and believe that further improvement is possible by addressing the nature of parameter updating during gradient descent. However, a point must be reached after which improvement with gradient-descent-based algorithms requires consideration of the nature of the error surface. This is related to the input and output encodings (rarely are they chosen with the specific aim of controlling the error surface), the ability of parameter updates to modify network behavior without destroying previously learned information, and the method by which the network implements structures such as hierarchical and recursive relations.

Acknowledgments

This work has been partially supported by the Australian Telecommunications and Electronics Research Board (SL).

References

[1] Robert B. Allen. Sequential connectionist networks for answering simple questions about a microworld. In 5th Annual Proceedings of the Cognitive Science Society, pages 489–495, 1983.
[2] Etienne Barnard and Elizabeth C. Botha. Back-propagation uses prior information efficiently. IEEE Transactions on Neural Networks, 4(5):794–802, September 1993.
[3] Etienne Barnard and David Casasent. A comparison between criterion functions for linear classifiers, with an application to neural nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(5):1030–1041, 1989.


[4] E.B. Baum and F. Wilczek. Supervised learning of probability distributions by neural networks. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 52–61, New York, 1988. American Institute of Physics.
[5] M.P. Casey. The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6):1135–1178, 1996.
[6] N.A. Chomsky. Three models for the description of language. IRE Transactions on Information Theory, IT-2:113–124, 1956.
[7] N.A. Chomsky. Lectures on Government and Binding. Foris Publications, 1981.
[8] N.A. Chomsky. Knowledge of Language: Its Nature, Origin, and Use. Prager, 1986.
[9] A. Cleeremans, D. Servan-Schreiber, and J.L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372–381, 1989.
[10] C. Darken and J.E. Moody. Note on learning rate schedules for stochastic optimization. In R.P. Lippmann, J.E. Moody, and D.S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 832–838. Morgan Kaufmann, San Mateo, CA, 1991.
[11] C. Darken and J.E. Moody. Towards faster stochastic gradient search. In Neural Information Processing Systems 4, pages 1009–1016. Morgan Kaufmann, 1992.
[12] J.L. Elman. Structured representations and connectionist models. In 6th Annual Proceedings of the Cognitive Science Society, pages 17–25, 1984.
[13] J.L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2/3):195–226, 1991.
[14] P. Frasconi and M. Gori. Computational capabilities of local-feedback recurrent networks acting as finite-state machines. IEEE Transactions on Neural Networks, 7(6):1521–1524, 1996.
[15] P. Frasconi, M. Gori, M. Maggini, and G. Soda. Unified integration of explicit rules and learning by example in recurrent networks. IEEE Transactions on Knowledge and Data Engineering, 7(2):340–346, 1995.
[16] P. Frasconi, M. Gori, and G. Soda. Local feedback multilayered networks. Neural Computation, 4(1):120–130, 1992.
[17] K.S. Fu. Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ, 1982.
[18] M. Gasser and C. Lee. Networks that learn phonology. Technical report, Computer Science Department, Indiana University, 1990.
[19] C. Lee Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393–405, 1992.
[20] C. Lee Giles, C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen, and Y.C. Lee. Extracting and learning an unknown grammar with recurrent neural networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 317–324, San Mateo, CA, 1992. Morgan Kaufmann Publishers.
[21] C. Lee Giles, G.Z. Sun, H.H. Chen, Y.C. Lee, and D. Chen. Higher order recurrent networks & grammatical inference. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 380–387, San Mateo, CA, 1990. Morgan Kaufmann Publishers.


[22] M. Hare. The role of similarity in Hungarian vowel harmony: A connectionist account. Technical Report CRL 9004, Centre for Research in Language, University of California, San Diego, 1990.
[23] M. Hare, D. Corina, and G.W. Cottrell. Connectionist perspective on prosodic structure. Technical Report CRL Newsletter Volume 3 Number 2, Centre for Research in Language, University of California, San Diego, 1989.
[24] Catherine L. Harris and J.L. Elman. Representing variable information with simple recurrent networks. In 6th Annual Proceedings of the Cognitive Science Society, pages 635–642, 1984.
[25] M.H. Harrison. Introduction to Formal Language Theory. Addison-Wesley Publishing Company, Reading, MA, 1978.
[26] S. Haykin. Neural Networks, A Comprehensive Foundation. Macmillan, New York, NY, 1994.
[27] J.A. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company, Redwood City, CA, 1991.
[28] J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, MA, 1979.
[29] J. Hopfield. Learning algorithms and probability distributions in feed-forward and feed-back networks. Proceedings of the National Academy of Science, 84:8429–8433, 1987.
[30] B.G. Horne and C. Lee Giles. An experimental comparison of recurrent neural networks. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pages 697–704. MIT Press, 1995.
[31] L. Ingber. Very fast simulated re-annealing. Mathematical Computer Modelling, 12:967–973, 1989.
[32] L. Ingber. Adaptive simulated annealing (ASA). Technical report, Lester Ingber Research, McLean, VA, 1993.
[33] M.I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 531–546. Lawrence Erlbaum, 1986.
[34] M.I. Jordan. Serial order: A parallel distributed processing approach. Technical Report ICS Report 8604, Institute for Cognitive Science, University of California at San Diego, La Jolla, CA, May 1986.
[35] S. Kirkpatrick and G.B. Sorkin. Simulated annealing. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 876–878. MIT Press, Cambridge, Massachusetts, 1995.
[36] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.
[37] H. Lasnik and J. Uriagereka. A Course in GB Syntax: Lectures on Binding and Empty Categories. MIT Press, Cambridge, MA, 1988.
[38] Steve Lawrence, Sandiway Fong, and C. Lee Giles. Natural language grammatical inference: A comparison of recurrent neural networks and machine learning methods. In Stefan Wermter, Ellen Riloff, and Gabriele Scheler, editors, Symbolic, Connectionist, and Statistical Approaches to Learning for Natural Language Processing, Lecture Notes in AI, pages 33–47. Springer Verlag, New York, 1996.
[39] Y. Le Cun. Efficient learning and second order methods. Tutorial presented at Neural Information Processing Systems 5, 1993.
[40] L.R. Leerink and M. Jabri. Learning the past tense of English verbs using recurrent neural networks. In Peter Bartlett, Anthony Burkitt, and Robert Williamson, editors, Australian Conference on Neural Networks, pages 222–226. Australian National University, 1996.


[41] B. MacWhinney, J. Leinbach, R. Taraban, and J. McDonald. Language learning: cues or rules? Journal of Memory and Language, 28:255–277, 1989.
[42] R. Miikkulainen and M. Dyer. Encoding input/output representations in connectionist cognitive systems. In D.S. Touretzky, G.E. Hinton, and T.J. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 188–195, Los Altos, CA, 1989. Morgan Kaufmann.
[43] M.C. Mozer. A focused backpropagation algorithm for temporal pattern recognition. Complex Systems, 3(4):349–381, August 1989.
[44] K.S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4–27, 1990.
[45] C.W. Omlin and C.L. Giles. Constructing deterministic finite-state automata in recurrent neural networks. Journal of the ACM, 45(6):937, 1996.
[46] C.W. Omlin and C.L. Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52, 1996.
[47] C.W. Omlin and C.L. Giles. Rule revision with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering, 8(1):183–188, 1996.
[48] F. Pereira and Y. Schabes. Inside-outside re-estimation from partially bracketed corpora. In Proceedings of the 30th Annual Meeting of the ACL, pages 128–135, Newark, 1992.
[49] D.M. Pesetsky. Paths and Categories. PhD thesis, MIT, 1982.
[50] J.B. Pollack. The induction of dynamical recognizers. Machine Learning, 7:227–252, 1991.
[51] D.E. Rumelhart and J.L. McClelland. On learning the past tenses of English verbs. In D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing, volume 2, chapter 18, pages 216–271. MIT Press, Cambridge, 1986.
[52] J.W. Shavlik. Combining symbolic and neural learning. Machine Learning, 14(3):321–331, 1994.
[53] H.T. Siegelmann. Computation beyond the Turing limit. Science, 268:545–548, 1995.
[54] H.T. Siegelmann, B.G. Horne, and C.L. Giles. Computational capabilities of recurrent NARX neural networks. IEEE Transactions on Systems, Man, and Cybernetics – Part B, 27(2):208, 1997.
[55] H.T. Siegelmann and E.D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995.
[56] P. Simard, M.B. Ottaway, and D.H. Ballard. Analysis of recurrent backpropagation. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School (Pittsburgh, 1988), pages 103–112, San Mateo, 1989. Morgan Kaufmann.
[57] S.A. Solla, E. Levin, and M. Fleisher. Accelerated learning in layered neural networks. Complex Systems, 2:625–639, 1988.
[58] M.F. St. John and J.L. McClelland. Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46:5–46, 1990.
[59] Andreas Stolcke. Learning feature-based semantics with simple recurrent networks. Technical Report TR-90-015, International Computer Science Institute, Berkeley, California, April 1990.


[60] M. Tomita. Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, pages 105–108, Ann Arbor, MI, 1982.
[61] D.S. Touretzky. Rules and maps in connectionist symbol processing. Technical Report CMU-CS-89-158, Carnegie Mellon University, Department of Computer Science, Pittsburgh, PA, 1989.
[62] D.S. Touretzky. Towards a connectionist phonology: The "many maps" approach to sequence manipulation. In Proceedings of the 11th Annual Conference of the Cognitive Science Society, pages 188–195, 1989.
[63] A.C. Tsoi and A.D. Back. Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks, 5(2):229–239, 1994.
[64] R.L. Watrous and G.M. Kuhn. Induction of finite state languages using second-order recurrent networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 309–316, San Mateo, CA, 1992. Morgan Kaufmann Publishers.
[65] R.L. Watrous and G.M. Kuhn. Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4(3):406, 1992.
[66] R.J. Williams and J. Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490–501, 1990.
[67] R.J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
[68] Z. Zeng, R.M. Goodman, and P. Smyth. Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6):976–990, 1993.
