Neural Networks and Deep Learning

CHAPTER 5

Why are deep neural networks hard to train?

By Michael Nielsen / Jul 2015


Imagine you're an engineer who has been asked to design a computer from scratch. One day you're working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep:

You're dumbfounded, and tell your boss: "The customer is crazy!"

Your boss replies: "I think they're crazy, too. But what the customer wants, they get."

In fact, there's a limited sense in which the customer isn't crazy. Suppose you're allowed to use a special logical gate which lets you AND together as many inputs as you want. And you're also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that's just two layers deep.

But just because something is possible doesn't make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.

For instance, suppose we're designing a logical circuit to multiply two numbers. Chances are we want to build it up out of sub-circuits doing operations like adding two numbers. The sub-circuits for adding two numbers will, in turn, be built up out of sub-sub-circuits for adding two bits. Very roughly speaking our circuit will look like:

That is, our final circuit contains at least three layers of circuit elements. In fact, it'll probably contain more than three layers, as we break the sub-tasks down into smaller units than I've described. But you get the general idea.

So deep circuits make the process of design easier. But they're not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous 1984 paper by Furst, Saxe and Sipser* showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.


*See Parity, Circuits, and the Polynomial-Time Hierarchy, by Merrick Furst, James B. Saxe, and Michael Sipser (1984).
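To make the pairing idea concrete, here's a small Python sketch (an illustration of my own, not part of the original text) that computes the parity of a list of bits by repeatedly XOR-ing adjacent pairs. Each pass over the list plays the role of one layer of the circuit, so n bits need only about log2(n) layers:

def parity_deep(bits):
    """Parity of a list of 0/1 bits, computed layer by layer."""
    layer = list(bits)
    depth = 0
    while len(layer) > 1:
        # XOR adjacent pairs; an unpaired last bit is carried forward unchanged.
        paired = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:
            paired.append(layer[-1])
        layer = paired
        depth += 1
    return layer[0], depth

bits = [1, 0, 1, 1, 0, 1, 0, 1]
print(parity_deep(bits))    # (1, 3): parity 1, reached in 3 "layers" for 8 bits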


Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we've worked with have just a single hidden layer of neurons (plus the input and output layers):

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we'd expect networks with many more hidden layers to be more powerful:

Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks*.

How can we train such deep networks? In this chapter, we'll try training deep networks using our workhorse learning algorithm: stochastic gradient descent by backpropagation. But we'll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.

That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we'll dig down and try to understand what's making our deep networks hard to train. When we look closely, we'll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn't simply due to bad luck. Rather, we'll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we'll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we'll find that there's an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.

This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what's required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we'll use deep learning to attack image recognition problems.

The vanishing gradient problem

So, what goes wrong when we try to train a deep network?

*For certain problems and network architectures this is proved in On the number of response regions of deep feed forward networks with piece-wise linear activations, by Razvan Pascanu, Guido Montufar, and Yoshua Bengio (2014). See also the more informal discussion in section 2 of Learning deep architectures for AI, by Yoshua Bengio (2009).


To answer that question, let's first revisit the case of a network with just a single hidden layer. As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation*.

If you wish, you can follow along by training networks on your computer. It is also, of course, fine to just read along. If you do wish to follow live, then you'll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here. You'll need to change into the src subdirectory.

Then, from a Python shell we load the MNIST data:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

We set up our network:

>>> import network2
>>> net = network2.Network([784, 30, 10])

This network has 784 neurons in the input layer, corresponding to the 28 × 28 = 784 pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0', '1', '2', ..., '9').

Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate η = 0.1, and regularization parameter λ = 5.0. As we train we'll monitor the classification accuracy on the validation_data*:

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

We get a classification accuracy of 96.48 percent (or thereabouts; it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.

*I introduced the MNIST problem and data here and here.


*Note that the networks take quite some time to train, up to a few minutes per training epoch, depending on the speed of your machine. So if you're running the code it's best to continue reading and return later, not to wait for the code to finish executing.


Now, let's add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:

>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

This gives an improved classification accuracy, 96.90 percent. That's encouraging: a little more depth is helping. Let's add another 30-neuron hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

That doesn't help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

The classification accuracy drops again, to 96.53 percent. That's probably not a statistically significant drop, but it's not encouraging, either.

This behaviour seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn't get worse, since the extra layers can, in the worst case, simply do nothing*. But that's not what's going on.

So what is going on? Let's assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn't finding the right weights and biases. We'd like to figure out what's going wrong in our learning algorithm, and how to do better.

To get some insight into what's going wrong, let's visualize how the network learns. Below, I've plotted part of a [784, 30, 30, 10] network, i.e., a network with two hidden layers, each containing 30 hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns.

*See this later problem to understand how to build a hidden layer that does nothing.


A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient ∂C/∂b for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias. Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.

To keep the diagram simple, I've shown just the top six neurons in the two hidden layers. I've omitted the input neurons, since they've got no weights or biases to learn. I've also omitted the output neurons, since we're doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are*:

*The data plotted is generated using the program generate_gradient.py. The same program is also used to generate the results quoted later in this section.
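The bar chart itself isn't reproduced here, but the quantity being plotted is straightforward to compute. The following is a minimal numpy sketch (my own illustration, not the book's generate_gradient.py): it runs a single backward pass on a freshly initialized [784, 30, 30, 10]-style sigmoid network, assuming a quadratic cost, a random stand-in input, and weights scaled by 1/sqrt(fan-in) so the neurons don't saturate, and prints the size of ∂C/∂b for each hidden layer:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

sizes = [784, 30, 30, 10]                 # same shape as the network above
rng = np.random.default_rng(0)
weights = [rng.standard_normal((n, m)) / np.sqrt(m)
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]

x = rng.random((784, 1))                  # stand-in for an MNIST image
y = np.zeros((10, 1)); y[3] = 1.0         # stand-in one-hot label

# Forward pass, remembering the weighted inputs z and activations a.
a, activations, zs = x, [x], []
for w, b in zip(weights, biases):
    z = w @ a + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

# Backward pass (quadratic cost): delta^L = (a^L - y) * sigma'(z^L),
# then delta^l = ((w^{l+1})^T delta^{l+1}) * sigma'(z^l).
# delta^l is exactly the vector of dC/db values for layer l.
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
deltas = [delta]
for l in range(2, len(sizes)):
    delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
    deltas.insert(0, delta)

# The bar for each hidden neuron is |dC/db|; the length of the whole vector is
# the layer-wide "speed of learning" used shortly below.
for l, d in enumerate(deltas[:-1], start=1):
    print("hidden layer", l, "||dC/db|| =", float(np.linalg.norm(d)))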


The network was initialized randomly, and so it's not surprising that there's a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?

To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let's denote the gradient as δ^l_j = ∂C/∂b^l_j, i.e., the gradient for the jth neuron in the lth layer*. We can think of the gradient δ^1 as a vector whose entries determine how quickly the first hidden layer learns, and δ^2 as a vector whose entries determine how quickly the second hidden layer learns. We'll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length ||δ^1|| measures the speed at which the first hidden layer is learning, while the length ||δ^2|| measures the speed at which the second hidden layer is learning.

*Back in Chapter 2 we referred to this as the error, but here we'll adopt the informal term "gradient". I say "informal" because of course this doesn't explicitly include the partial derivatives of the cost with respect to the weights, ∂C/∂w.

With these definitions, and in the same configuration as was plotted above, we find ||δ^1|| = 0.07 and ||δ^2|| = 0.31. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.

What happens if we add more hidden layers? If we have three hidden layers, in a [784, 30, 30, 30, 10] network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283. Again, earlier hidden layers are learning much slower than later hidden layers. Suppose we add yet another layer with 30 hidden neurons. In that case, the respective speeds of learning are 0.003, 0.017, 0.070, and 0.285. The pattern holds: early layers learn slower than later layers.
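To spell out what those "speeds" are computationally: each one is just the Euclidean length of a layer's vector of ∂C/∂b values. A tiny sketch with made-up gradient values, purely to show the operation:

import numpy as np

grad_b_layer1 = np.array([0.001, -0.003, 0.002, 0.0005])   # hypothetical dC/db values
grad_b_layer2 = np.array([0.02, -0.01, 0.05, 0.03])        # hypothetical dC/db values
print(np.linalg.norm(grad_b_layer1))   # speed of layer 1 (small)
print(np.linalg.norm(grad_b_layer2))   # speed of layer 2 (larger)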

We've been looking at the speed of learning at the start of training, that is, just after the networks are initialized. How does the speed of learning change as we train our networks? Let's return to look at the network with just two hidden layers. The speed of learning changes as follows:

To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different than the way we usually train: I've used no mini-batches, and just 1,000 training images, rather than the full 50,000-image training set. I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results. Using the parameters I've chosen is an easy way of smoothing the results out, so we can see what's going on.
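If you want to reproduce that training setup yourself (though not the gradient tracking, which the generate_gradient.py program mentioned above handles), one way to do it with the interface shown earlier is the following sketch; it assumes training_data is a plain list, as in the book's Python 2.7 code, and uses a single mini-batch of all 1,000 images per epoch, i.e. batch gradient descent:

>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data[:1000], 500, 1000, 0.1,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)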

In any case, as you can see the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than the second hidden layer.

What about more complex networks? Here are the results of a similar experiment, but this time with three hidden layers (a [784, 30, 30, 30, 10] network):

Again, early hidden layers learn much more slowly than later hidden layers. Finally, let's add a fourth hidden layer (a [784, 30, 30, 30, 30, 10] network), and see what happens when we train:


Again, early hidden layers learn much more slowly than later hidden layers. In this case, the first hidden layer is learning roughly 100 times slower than the final hidden layer. No wonder we were having trouble training these networks earlier!

We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. And while we've seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the vanishing gradient problem*.

Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we'll learn shortly that it's not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem, and it's not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It's something we need to understand, and, if possible, take steps to address.

*See Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber (2001). This paper studied recurrent neural nets, but the essential phenomenon is the same as in the feedforward networks we are studying. See also Sepp Hochreiter's earlier Diploma Thesis, Untersuchungen zu dynamischen neuronalen Netzen (1991, in German).


One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn't it be good news if the derivative f'(x) was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?

Of course, this isn't the case. Recall that we randomly initialized the weights and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784, 30, 30, 30, 30, 10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks, we need to figure out how to address the vanishing gradient problem.

What's causing the vanishing gradient problem? Unstable gradients in deep neural nets

To get insight into why the vanishing gradient problem occurs, let's consider the simplest deep neural network: one with just a single neuron in each layer. Here's a network with three hidden layers:

Here, w_1, w_2, ... are the weights, b_1, b_2, ... are the biases, and C is some cost function. Just to remind you how this works, the output a_j from the jth neuron is σ(z_j), where σ is the usual sigmoid activation function, and z_j = w_j a_{j-1} + b_j is the weighted input to the neuron. I've drawn the cost C at the end to emphasize that the cost is a function of the network's output, a_4: if the actual output from the network is close to the desired output, then the cost will be low, while if it's far away, the cost will be high.

We're going to study the gradient ∂C/∂b_1 associated to the first hidden neuron. We'll figure out an expression for ∂C/∂b_1, and by studying that expression we'll understand why the vanishing gradient problem occurs.

I'll start by simply showing you the expression for ∂C/∂b_1. It looks forbidding, but it's actually got a simple structure, which I'll describe in a moment. Here's the expression (ignore the network, for now, and note that σ' is just the derivative of the σ function):

∂C/∂b_1 = σ'(z_1) w_2 σ'(z_2) w_3 σ'(z_3) w_4 σ'(z_4) ∂C/∂a_4

The structure in the expression is as follows: there is a σ'(z_j) term in the product for each neuron in the network; a weight w_j term for each weight in the network; and a final ∂C/∂a_4 term, corresponding to the cost function at the end. Notice that I've placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.

You're welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There's no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there's also a simple explanation of why the expression is true, and so it's fun (and perhaps enlightening) to take a look at that explanation.

Imagine we make a small change Δb_1 in the bias b_1. That will set off a cascading series of changes in the rest of the network. First, it causes a change Δa_1 in the output from the first hidden neuron. That, in turn, will cause a change Δz_2 in the weighted input to the second hidden neuron. Then a change Δa_2 in the output from the second hidden neuron. And so on, all the way through to a change ΔC in the cost at the output. We have

ΔC ≈ (∂C/∂b_1) Δb_1.

This suggests that we can figure out an expression for the gradient ∂C/∂b_1 by carefully tracking the effect of each step in this cascade.

To do this, let's think about how Δb_1 causes the output a_1 from the first hidden neuron to change. We have a_1 = σ(z_1) = σ(w_1 a_0 + b_1), so

Δa_1 ≈ (∂σ(w_1 a_0 + b_1)/∂b_1) Δb_1 = σ'(z_1) Δb_1.

That σ'(z_1) term should look familiar: it's the first term in our claimed expression for the gradient ∂C/∂b_1. Intuitively, this term converts a change Δb_1 in the bias into a change Δa_1 in the output activation. That change Δa_1 in turn causes a change in the weighted input z_2 = w_2 a_1 + b_2 to the second hidden neuron:

Δz_2 ≈ (∂z_2/∂a_1) Δa_1 = w_2 Δa_1.

Combining our expressions for Δz_2 and Δa_1, we see how the change in the bias b_1 propagates along the network to affect z_2:

Δz_2 ≈ σ'(z_1) w_2 Δb_1.

Again, that should look familiar: we've now got the first two terms in our claimed expression for the gradient ∂C/∂b_1.

We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a σ'(z_j) term, and through each weight we pick up a w_j term. The end result is an expression relating the final change ΔC in cost to the initial change Δb_1 in the bias:

ΔC ≈ σ'(z_1) w_2 σ'(z_2) w_3 σ'(z_3) w_4 σ'(z_4) (∂C/∂a_4) Δb_1.

Dividing by Δb_1 we do indeed get the desired expression for the gradient:

∂C/∂b_1 = σ'(z_1) w_2 σ'(z_2) w_3 σ'(z_3) w_4 σ'(z_4) ∂C/∂a_4.
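As a sanity check, the product formula can be verified numerically. The sketch below (my own, not from the original text) builds the four-neuron chain with random weights, assumes a quadratic cost C = (a_4 - y)^2 / 2 so that ∂C/∂a_4 = a_4 - y, and compares the formula against a central finite-difference estimate of ∂C/∂b_1:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(3)
w = rng.standard_normal(5)      # index 0 unused, so numbering matches w_1,...,w_4
b = rng.standard_normal(5)      # index 0 unused, so numbering matches b_1,...,b_4
x, y = 0.7, 0.3                 # input activation a_0 and target output

def forward(b1):
    """Run the chain with bias b1 in the first neuron; return z's, a's, cost."""
    a, zs, activations = x, [], [x]
    bs = [None, b1, b[2], b[3], b[4]]
    for j in range(1, 5):
        z = w[j] * a + bs[j]
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    cost = 0.5 * (a - y) ** 2
    return zs, activations, cost

zs, acts, _ = forward(b[1])
# Product formula: dC/db_1 = sigma'(z_1) w_2 sigma'(z_2) w_3 sigma'(z_3)
#                            w_4 sigma'(z_4) * dC/da_4, with dC/da_4 = a_4 - y.
formula = sigmoid_prime(zs[0])
for j in range(2, 5):
    formula *= w[j] * sigmoid_prime(zs[j - 1])
formula *= acts[-1] - y

eps = 1e-6                      # central finite difference in b_1
numeric = (forward(b[1] + eps)[2] - forward(b[1] - eps)[2]) / (2 * eps)
print(formula, numeric)         # the two values should agree closely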

Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient:

∂C/∂b_1 = σ'(z_1) w_2 σ'(z_2) w_3 σ'(z_3) w_4 σ'(z_4) ∂C/∂a_4.

Excepting the very last term, this expression is a product of terms of the form w_j σ'(z_j). To understand how each of those terms behaves, let's look at a plot of the function σ':

[Plot: the derivative of the sigmoid function, σ'(z), for z from -4 to 4. The curve peaks at 0.25 at z = 0 and falls toward 0 in both directions.]

The derivative reaches a maximum at σ'(0) = 1/4. Now, if we use our standard approach to initializing the weights in the network, then we'll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy |w_j| < 1. Putting these observations together, we see that the terms w_j σ'(z_j) will usually satisfy |w_j σ'(z_j)| < 1/4. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.
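Here's a quick numerical illustration of that exponential decrease (a sketch of my own, with the weights drawn from a standard Gaussian and stand-in weighted inputs also drawn at random):

import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(1)
trials = 10000
for depth in (2, 5, 10, 20):
    w = rng.standard_normal((trials, depth))   # one weight per layer
    z = rng.standard_normal((trials, depth))   # stand-in weighted inputs
    terms = np.abs(w) * sigmoid_prime(z)       # each |w_j sigma'(z_j)| is typically below 1/4
    print(depth, "terms: median product =", float(np.median(np.prod(terms, axis=1))))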

To make this all a bit more explicit, let's compare the expression for ∂C/∂b_1 to an expression for the gradient with respect to a later bias, say ∂C/∂b_3. Of course, we haven't explicitly worked out an expression for ∂C/∂b_3, but it follows the same pattern described above for ∂C/∂b_1. Here's the comparison of the two expressions:

∂C/∂b_1 = σ'(z_1) w_2 σ'(z_2) w_3 σ'(z_3) w_4 σ'(z_4) ∂C/∂a_4
∂C/∂b_3 =                     σ'(z_3) w_4 σ'(z_4) ∂C/∂a_4

The two expressions share many terms. But the gradient ∂C/∂b_1 includes two extra terms, each of the form w_j σ'(z_j). As we've seen, such terms are typically less than 1/4 in magnitude. And so the gradient ∂C/∂b_1 will usually be a factor of 16 (or more) smaller than ∂C/∂b_3. This is the essential origin of the vanishing gradient problem.

Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights w_j could grow during training. If they do, it's possible the terms w_j σ'(z_j) in the product will no longer satisfy |w_j σ'(z_j)| < 1/4. Indeed, if the terms get large enough (greater than 1) then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we'll have an exploding gradient problem.

The exploding gradient problem: Let's look at an explicit example where exploding gradients occur. The example is somewhat contrived: I'm going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren't merely a hypothetical possibility, they really can happen.

There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say w_1 = w_2 = w_3 = w_4 = 100. Second, we'll choose the biases so that the σ'(z_j) terms are not too small. That's actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is z_j = 0 (and so σ'(z_j) = 1/4). So, for instance, we want z_2 = w_2 a_1 + b_2 = 0. We can achieve this by setting b_2 = -100 a_1. We can use the same idea to select the other biases. When we do this, we see that all the terms w_j σ'(z_j) are equal to 100 × 1/4 = 25. With these choices we get an exploding gradient.
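A few lines of Python make the arithmetic of this contrived example explicit (again an illustrative sketch, not code from the book):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

w = 100.0
a = 0.5                          # whatever the previous neuron's output happens to be
factor = 1.0
for j in (2, 3, 4):              # the w_j * sigma'(z_j) factors in dC/db_1
    b = -w * a                   # choose b_j so z_j = w*a + b = 0, hence sigma'(z_j) = 1/4
    z = w * a + b
    factor *= w * sigmoid_prime(z)
    a = sigmoid(z)               # = 0.5, the activation passed to the next neuron
print(factor)                    # 25**3 = 15625.0; the product grows exponentially with depth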

The unstable gradient problem: The fundamental problem here isn't so much the vanishing gradient problem or the exploding gradient problem. It's that the gradient in early layers is the product of terms from all the later layers. When there are many layers, that's an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it's highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem. As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds.

    Exercise

In our discussion of the vanishing gradient problem, we made use of the fact that |σ'(z)| < 1/4. Suppose we used a different activation function, one whose derivative could be much larger. Would that help us avoid the unstable gradient problem?

The prevalence of the vanishing gradient problem: We've seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression |w σ'(z)|. To avoid the vanishing gradient problem we need |w σ'(z)| ≥ 1. You might think this could happen easily if w is very large. However, it's more difficult than it looks. The reason is that the σ'(z) term also depends on w: z = wa + b, where a is the input activation. So when we make w large, we need to be careful that we're not simultaneously making σ'(wa + b) small. That turns out to be a considerable constraint. The reason is that when we make w large we tend to make wa + b very large. Looking at the graph of σ' you can see that this puts us off in the "wings" of the σ' function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.

    Problems

Consider the product |w σ'(wa + b)|. Suppose |w σ'(wa + b)| ≥ 1. (1) Argue that this can only ever occur if |w| ≥ 4. (2) Supposing that |w| ≥ 4, consider the set of input activations a for which |w σ'(wa + b)| ≥ 1. Show that the set of a satisfying that constraint can range over an interval no greater in width than

(2/|w|) ln( |w| (1 + sqrt(1 - 4/|w|)) / 2 - 1 ).

(3) Show numerically that the above expression bounding the width of the range is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45. And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.

Identity neuron: Consider a neuron with a single input, x, a corresponding weight, w_1, a bias b, and a weight w_2 on the output. Show that by choosing the weights and bias appropriately, we can ensure w_2 σ(w_1 x + b) ≈ x for x ∈ [0, 1]. Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input. Hint: It helps to rewrite x = 1/2 + Δ, to assume w_1 is small, and to use a Taylor series expansion in w_1 Δ.

Unstable gradients in more complex networks

We've been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?

In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the lth layer of an L layer network is given by:

δ^l = Σ'(z^l) (w^{l+1})^T Σ'(z^{l+1}) (w^{l+2})^T ... Σ'(z^L) ∇_a C.

Here, Σ'(z^l) is a diagonal matrix whose entries are the σ'(z) values for the weighted inputs to the lth layer. The w^l are the weight matrices for the different layers. And ∇_a C is the vector of partial derivatives of C with respect to the output activations.

This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form (w^j)^T Σ'(z^j). What's more, the matrices Σ'(z^j) have small entries on the diagonal, none larger than 1/4. Provided the weight matrices w^j aren't too large, each additional term (w^j)^T Σ'(z^j) tends to make the gradient vector smaller, leading to a vanishing gradient. More generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. In practice, empirically it is typically found in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn't merely an accident or an inconvenience: it's a fundamental consequence of the approach we're taking to learning.
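The same effect can be seen by evaluating the displayed expression directly. The sketch below (my own illustration) uses the same 1/sqrt(fan-in) Gaussian initialization as the earlier sketch, a random stand-in input, and a random stand-in for ∇_a C; it builds the diagonal Σ'(z^l) matrices explicitly and prints ||δ^l|| for each layer of a [784, 30, 30, 30, 30, 10] network. The norms typically shrink as we move toward earlier layers:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(2)
sizes = [784, 30, 30, 30, 30, 10]
weights = [rng.standard_normal((n, m)) / np.sqrt(m)
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]

# Forward pass on a stand-in input, keeping the weighted inputs z^l.
a, zs = rng.random((784, 1)), []
for w, b in zip(weights, biases):
    z = w @ a + b
    zs.append(z)
    a = sigmoid(z)

Sigma_prime = [np.diag(sigmoid_prime(z).flatten()) for z in zs]
grad_a_C = rng.standard_normal((10, 1))      # stand-in for grad_a C

# delta^L = Sigma'(z^L) grad_a C; each earlier layer prepends a
# Sigma'(z^l) (w^{l+1})^T pair, exactly as in the expression above.
delta = Sigma_prime[-1] @ grad_a_C
norms = [float(np.linalg.norm(delta))]
for l in range(len(weights) - 2, -1, -1):
    delta = Sigma_prime[l] @ weights[l + 1].T @ delta
    norms.insert(0, float(np.linalg.norm(delta)))
print(norms)                                 # ||delta^l|| for l = 1, ..., L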

Other obstacles to deep learning

In this chapter we've focused on vanishing gradients and, more generally, unstable gradients as an obstacle to deep learning. In fact, unstable gradients are just one obstacle to deep learning, albeit an important fundamental obstacle. Much ongoing research aims to better understand the challenges that can occur when training deep networks. I won't comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking.

As a first example, in 2010 Glorot and Bengio* found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near 0 early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.

As a second example, in 2013 Sutskever, Martens, Dahl and Hinton* studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.

These examples suggest that "What makes deep networks hard to train?" is a complex question. In this chapter, we've focused on the instabilities associated to gradient-based learning in deep networks. The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research. This all seems rather downbeat and pessimism-inducing. But the good news is that in the next chapter we'll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.

*Understanding the difficulty of training deep feedforward neural networks, by Xavier Glorot and Yoshua Bengio (2010). See also the earlier discussion of the use of sigmoids in Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

*On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013).

In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.

Last update: Fri Jul 10 08:53:05 2015