AlgorithmicIntelligenceLab
EE807: Recent Advances in Deep Learning, Lecture 2
Slide made by Insu Han and Jongheon Jeong, KAIST EE
Stochastic Gradient Descent
Table of Contents
1. Introduction
  • Empirical risk minimization (ERM)
2. Gradient Descent Methods
  • Gradient descent (GD)
  • Stochastic gradient descent (SGD)
3. Momentum and Adaptive Learning Rate Methods
  • Momentum methods
  • Learning rate scheduling
  • Adaptive learning rate methods (AdaGrad, RMSProp, Adam)
4. Changing Batch Size
  • Increasing the batch size without learning rate decay
5. Summary
Empirical Risk Minimization (ERM)
• Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$
• Prediction function $f(x; \theta)$ parameterized by $\theta$
• Empirical risk minimization: find a parameter $\theta$ that minimizes the loss function
$$\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i; \theta), y_i)$$
where $\ell$ is a loss function, e.g., MSE, cross-entropy.
• For example, a deep neural network has $f(x; \theta) = W_L\, \sigma(W_{L-1} \cdots \sigma(W_1 x))$ with parameters $\theta = (W_1, \dots, W_L)$.
Next, how to solve ERM?
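The ERM objective can be written out directly. Below is a minimal illustration in which the squared-error loss, the one-parameter predictor `f`, and the toy data set are all made up for the example.

```python
import numpy as np

def empirical_risk(loss, f, theta, data):
    """ERM objective: the average loss of the predictor over the training set."""
    return np.mean([loss(f(x, theta), y) for x, y in data])

mse = lambda pred, y: (pred - y) ** 2     # squared-error loss
f = lambda x, theta: theta * x            # toy one-parameter linear predictor
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

risk = empirical_risk(mse, f, 2.0, data)  # theta = 2 fits this data exactly
```

Minimizing `risk` over `theta` is exactly the ERM problem the following slides solve with gradient methods.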
Gradient Descent (GD)
• Gradient descent (GD) updates parameters iteratively by taking a gradient step:
$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t)$$
where $\theta$ are the parameters, $\eta$ is the learning rate, and $L$ is the loss function.
• (+) Converges to the global (local) minimum for convex (non-convex) problems.
• (−) Not efficient in computation time and memory space for huge $n$.
  • For example, the ImageNet dataset has $n = 1{,}281{,}167$ training images: 1.2M 256×256 RGB images ≈ 236 GB of memory.
Next, efficient GD
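The full-batch GD update reduces to a few lines of code. A minimal sketch, assuming a toy quadratic loss $L(\theta) = \frac{1}{2}\|\theta - 3\|^2$ chosen so the gradient is trivial:

```python
import numpy as np

def gradient_descent(grad_L, theta0, lr=0.1, n_steps=100):
    """Plain (full-batch) gradient descent: theta <- theta - lr * grad_L(theta)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - lr * grad_L(theta)
    return theta

# Toy example: L(theta) = 0.5 * ||theta - 3||^2, so grad_L(theta) = theta - 3.
theta_hat = gradient_descent(lambda t: t - 3.0, theta0=np.zeros(2), lr=0.5, n_steps=50)
```

On a real dataset, `grad_L` would average per-example gradients over all $n$ samples, which is exactly the cost the next slide attacks.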
Stochastic Gradient Descent (SGD)
• Stochastic gradient descent (SGD) uses $m \ll n$ random samples (a minibatch $B_t$) to approximate the GD update:
$$\theta_{t+1} = \theta_t - \eta\, \frac{1}{m} \sum_{i \in B_t} \nabla_\theta\, \ell(f(x_i; \theta_t), y_i)$$
• In practice, minibatch sizes can be 32/64/128.
• Main practical challenges and current solutions:
  1. SGD can be too noisy and might be unstable → momentum
  2. Hard to find a good learning rate → adaptive learning rate
*source: https://lovesnowbest.site/2018/02/16/Improving-Deep-Neural-Networks-Assignment-2/
Next, momentum
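A minimal SGD sketch on a made-up least-squares problem; the planted solution, batch size, and learning rate are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_batch, theta0, n, lr=0.1, batch_size=32, n_steps=500):
    """SGD: each step estimates the full gradient from a random minibatch."""
    theta = theta0
    for _ in range(n_steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        theta = theta - lr * grad_batch(theta, idx)
    return theta

# Toy least-squares problem with planted solution theta* = [1, -2].
X = rng.normal(size=(1000, 2))
y = X @ np.array([1.0, -2.0])
grad_batch = lambda theta, idx: X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)

theta_hat = sgd(grad_batch, np.zeros(2), n=1000)
```

Each step touches only 32 of the 1000 examples, so the per-step cost is independent of $n$ while the minibatch gradient stays an unbiased estimate of the full one.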
Momentum Methods
1. Momentum gradient descent
• Add decaying previous gradients (momentum):
$$v_{t+1} = \mu v_t - \eta\, \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
where $\mu$ is the momentum preservation ratio.
• Equivalent to a weighted sum that keeps the fraction $\mu$ of the previous update.
• (+) Momentum reduces the oscillation and accelerates the convergence.
[Figure: SGD vs. SGD+momentum — momentum acts as friction to the vertical fluctuation and as acceleration along the consistent direction]
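The momentum update can be sketched as below, reusing an illustrative quadratic with minimum at 3 (not the lecture's example):

```python
import numpy as np

def sgd_momentum(grad_L, theta0, lr=0.1, mu=0.9, n_steps=300):
    """Heavy-ball momentum: keep a decaying sum of past gradients as a velocity."""
    theta, v = theta0, np.zeros_like(theta0)
    for _ in range(n_steps):
        v = mu * v - lr * grad_L(theta)  # fraction mu of the previous update survives
        theta = theta + v
    return theta

# Toy quadratic: grad_L(theta) = theta - 3, minimum at theta = 3.
theta_hat = sgd_momentum(lambda t: t - 3.0, np.zeros(2))
```

Because `v` averages gradients over recent steps, components that flip sign (oscillation) cancel while consistent components accumulate.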
Momentum Methods: Nesterov's Momentum
1. Momentum gradient descent
• Add decaying previous gradients (momentum).
• (−) Momentum can fail to converge even for simple convex optimization problems.
• Nesterov's accelerated gradient (NAG) [Nesterov'1983] uses the gradient at the approximate future position ("look-ahead" gradient), i.e.,
$$v_{t+1} = \mu v_t - \eta\, \nabla_\theta L(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
where $\mu$ is the momentum preservation ratio.
Momentum Methods: Nesterov's Momentum
1. Momentum gradient descent
• Add decaying previous gradients (momentum).
• Nesterov's accelerated gradient (NAG) [Nesterov'1983] uses the gradient at the approximate future position, i.e., the "look-ahead" gradient $\nabla_\theta L(\theta_t + \mu v_t)$.
Quiz: fill in the pseudocode of Nesterov's accelerated gradient.
[Figure: trajectories of SGD, SGD+momentum, and NAG]
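One possible way to fill in the quiz's pseudocode, as a sketch of the look-ahead form on the same illustrative quadratic:

```python
import numpy as np

def nag(grad_L, theta0, lr=0.1, mu=0.9, n_steps=300):
    """Nesterov's accelerated gradient: evaluate the gradient at the
    look-ahead point theta + mu * v instead of at theta itself."""
    theta, v = theta0, np.zeros_like(theta0)
    for _ in range(n_steps):
        v = mu * v - lr * grad_L(theta + mu * v)  # "look-ahead" gradient
        theta = theta + v
    return theta

theta_hat = nag(lambda t: t - 3.0, np.zeros(2))  # toy quadratic, minimum at 3
```

The only change from plain momentum is the argument of `grad_L`: the gradient is taken where the momentum is about to carry the iterate, which corrects overshoot earlier.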
Adaptive Learning Rate Methods
2. Learning rate scheduling
• The learning rate is critical for minimizing the loss!
  • Too high → may jump over the narrow valley, can diverge.
  • Too low → may fall into a local minimum, slow convergence.
*source: http://cs231n.github.io/neural-networks-3/
Next, learning rate scheduling
Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: decay methods
• A naive choice is a constant learning rate.
• Common learning rate schedules include time-based, exponential, and step decay (step is the most popular in practice):
  • Time-based: $\eta_t = \eta_0 / (1 + kt)$
  • Exponential: $\eta_t = \eta_0 e^{-kt}$
  • Step: drop $\eta$ by a factor every few epochs
• "Step decay" decreases the learning rate by a factor every few epochs.
  • Typically, it is set $\eta_0 = 0.01$ and drops by half every 10 epochs.
*source: https://towardsdatascience.com/
[Figure: step decay and exponential decay learning-rate curves, with the resulting accuracy]
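The three decay schedules follow directly from their formulas; the initial rate and decay constants below are illustrative defaults, not prescribed values:

```python
import math

def time_based_decay(t, lr0=0.01, k=0.1):
    """Time-based decay: lr_t = lr0 / (1 + k*t)."""
    return lr0 / (1.0 + k * t)

def exponential_decay(t, lr0=0.01, k=0.1):
    """Exponential decay: lr_t = lr0 * exp(-k*t)."""
    return lr0 * math.exp(-k * t)

def step_decay(t, lr0=0.01, drop=0.5, every=10):
    """Step decay: drop the lr by a factor every `every` epochs (most popular)."""
    return lr0 * drop ** (t // every)
```

For example, with the defaults above, step decay holds 0.01 for the first 10 epochs, then 0.005, then 0.0025, and so on.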
Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: cycling method
• [Smith'2015] proposed the cycling (triangular) learning rate.
• Why a "cycling" learning rate?
  • Sometimes, increasing the learning rate is helpful to escape saddle points.
• It can be combined with exponential decay or periodic decay.
*source: https://github.com/bckenstler/CLR
[Figure: cycling (triangular) learning rate with decay]
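A sketch of the triangular policy, following the formula popularized in the linked CLR repository; the bounds and step size are illustrative:

```python
import math

def triangular_lr(t, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Cyclical (triangular) lr: rise linearly from base_lr to max_lr over
    step_size iterations, fall back over the next step_size, then repeat."""
    cycle = math.floor(1 + t / (2 * step_size))
    x = abs(t / step_size - 2 * cycle + 1)   # position within the current cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```

The periodic rise back to `max_lr` is what gives the optimizer a chance to jump out of saddle regions.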
Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: cycling method
• [Loshchilov'2017] uses cosine cycling and restarts to the maximum at each cycle.
• Why "cosine"?
  • It decays slowly during the first half of a cycle and drops quickly in the remainder.
• (+) Can climb down and up the loss surface, thus can traverse several local minima.
• (+) Same as restarting at good points with the initial learning rate.
*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017
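A minimal sketch of the cosine schedule with restarts, assuming a fixed cycle length (SGDR also allows the cycle length to grow between restarts, which is omitted here):

```python
import math

def cosine_annealing_restarts(t, lr_min=0.0, lr_max=0.05, cycle_len=10):
    """Cosine schedule: decay slowly early in the cycle, quickly near its end,
    then restart at lr_max at the start of the next cycle."""
    t_cur = t % cycle_len  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / cycle_len))
```

At the cycle midpoint the rate is exactly halfway between `lr_max` and `lr_min`, and at each multiple of `cycle_len` it snaps back to `lr_max`.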
Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: cycling method
• [Loshchilov'2017] also proposed warm restarts* for the cycling learning rate.
• (+) It helps to escape saddle points, since the optimizer is more likely to get stuck in early iterations.
*Warm restart: restart frequently in early iterations.
[Figure: step decay vs. cycling with no restart vs. cycling with restart]
But there is no perfect learning rate schedule! It depends on the specific task.
*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017
Next, adaptive learning rate
Adaptive Learning Rate Methods: AdaGrad, RMSProp
3. Adaptively changing the learning rate (AdaGrad, RMSProp)
• AdaGrad [Duchi'11] downscales the learning rate by the magnitude of previous gradients:
$$G_{t+1} = G_t + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_{t+1}} + \epsilon}\, g_t$$
where $G_t$ is the sum of all previous squared gradients (elementwise).
• (−) The learning rate strictly decreases and becomes too small for large iterations.
• RMSProp [Tieleman'12] uses a moving average of the squared gradients instead:
$$G_{t+1} = \rho G_t + (1 - \rho)\, g_t^2$$
where $\rho$ is the preservation ratio.
• Other variants also exist, e.g., Adadelta [Zeiler'2012].
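Both updates can be sketched as single elementwise steps; the learning rates and $\rho$ below are illustrative defaults:

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate ALL squared gradients; the step can only shrink."""
    G = G + g * g
    return theta - lr * g / (np.sqrt(G) + eps), G

def rmsprop_step(theta, g, G, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients instead."""
    G = rho * G + (1 - rho) * g * g
    return theta - lr * g / (np.sqrt(G) + eps), G

# One step of each from theta = 0 with gradient g = 1 (elementwise).
theta_a, G_a = adagrad_step(np.zeros(2), np.ones(2), np.zeros(2))
theta_r, G_r = rmsprop_step(np.zeros(2), np.ones(2), np.zeros(2))
```

Over many calls, AdaGrad's accumulator `G` grows without bound and the effective step tends to zero (the (−) point above), while RMSProp's moving average stays on the scale of recent gradients.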
Adaptive Learning Rate Methods
• Visualization of algorithms
[Animations: optimization near a saddle point; optimization near a local optimum]
• Adaptive learning-rate methods, i.e., Adadelta and RMSProp, are most suitable and provide the best convergence for these scenarios.
*source: animations from Alec Radford's blog
Next, momentum + adaptive learning rate
Adaptive Learning Rate Methods: Adam
3. Combination of momentum and adaptive learning rate
• Adam (ADAptive Moment estimation) [Kingma'2015]:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
where $m_t$ is the momentum and $v_t$ is the average of squared gradients.
• Can be seen as a momentum + RMSProp update.
• Other variants exist, e.g., Adamax [Kingma'14], Nadam [Dozat'16].
*source: Kingma and Ba. Adam: A method for stochastic optimization. ICLR 2015
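Putting momentum and RMSProp together with bias correction gives a compact Adam sketch; the hyperparameters are the paper's defaults except the illustrative learning rate, and the quadratic test problem is made up:

```python
import numpy as np

def adam(grad_L, theta0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, n_steps=2000):
    """Adam: momentum m on gradients + RMSProp-style v on squared gradients,
    both bias-corrected because m and v start at zero."""
    theta = theta0.astype(float)
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    for t in range(1, n_steps + 1):
        g = grad_L(theta)
        m = b1 * m + (1 - b1) * g          # momentum (first moment)
        v = b2 * v + (1 - b2) * g * g      # average of squared gradients
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta_hat = adam(lambda t: t - 3.0, np.zeros(2))  # toy quadratic, minimum at 3
```

Dropping the `v`/`v_hat` lines recovers (bias-corrected) momentum; dropping the `m`/`m_hat` lines recovers RMSProp, which is the decomposition the slide describes.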
Decaying the Learning Rate = Increasing the Batch Size
• In practice, SGD+momentum and Adam work well in many applications.
• But scheduling learning rates is still critical! (they should decay appropriately)
• [Smith'2017] shows that decaying the learning rate = increasing the batch size.
• (+) A large batch size allows fewer parameter updates, leading to parallelism!
*source: Smith et al., "Don't Decay the Learning Rate, Increase the Batch Size.", ICLR 2018
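The equivalence can be expressed as two schedules that change the scale of SGD's gradient noise by the same factor; every constant below is illustrative:

```python
def equivalent_schedules(epoch, lr0=0.1, batch0=128, factor=5, every=30):
    """Every `every` epochs, either divide the lr by `factor` (conventional)
    or, equivalently per Smith et al., multiply the batch size by `factor`:
    both shrink the SGD noise scale by the same amount."""
    k = epoch // every
    decay_lr = (lr0 / factor ** k, batch0)     # conventional schedule
    grow_batch = (lr0, batch0 * factor ** k)   # fewer updates, more parallelism
    return decay_lr, grow_batch
```

For example, at epoch 30 the conventional schedule runs at (lr 0.02, batch 128) while the equivalent one runs at (lr 0.1, batch 640), but in 5× fewer parameter updates per epoch.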
Summary
• SGD has been an essential algorithm for deep learning, together with back-propagation.
• Momentum methods improve the performance of gradient descent algorithms.
  • Nesterov's momentum
• Annealing learning rates is critical for minimizing training loss.
  • Exponential, harmonic, and cyclic decay methods
  • Adaptive learning rate methods (RMSProp, AdaGrad, AdaDelta, Adam, etc.)
• In practice, SGD+momentum shows successful results, outperforming Adam!
  • For example, in NLP (Huang et al., 2017) or machine translation (Wu et al., 2016)
References
• [Nesterov'1983] Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). 1983. link: http://mpawankumar.info/teaching/cdt-big-data/nesterov83.pdf
• [Duchi'2011] Duchi et al. Adaptive subgradient methods for online learning and stochastic optimization. JMLR 2011. link: http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
• [Tieleman'2012] Geoff Hinton's Lecture 6e of the Coursera class. link: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
• [Zeiler'2012] Zeiler, M. D. ADADELTA: An adaptive learning rate method. 2012. link: https://arxiv.org/pdf/1212.5701.pdf
• [Smith'2015] Smith, Leslie N. Cyclical learning rates for training neural networks. link: https://arxiv.org/pdf/1506.01186.pdf
• [Kingma'2015] Kingma and Ba. Adam: A method for stochastic optimization. ICLR 2015. link: https://arxiv.org/pdf/1412.6980.pdf
• [Dozat'2016] Dozat, T. Incorporating Nesterov momentum into Adam. ICLR Workshop 2016. link: http://cs229.stanford.edu/proj2015/054_report.pdf
• [Smith'2017] Smith, Samuel L., Pieter-Jan Kindermans, and Quoc V. Le. Don't decay the learning rate, increase the batch size. ICLR 2018. link: https://openreview.net/pdf?id=B1Yy1BxCZ
• [Loshchilov'2017] Loshchilov, I., & Hutter, F. SGDR: Stochastic gradient descent with warm restarts. ICLR 2017. link: https://arxiv.org/pdf/1608.03983.pdf