AlgorithmicIntelligenceLab
EE807: Recent Advances in Deep Learning, Lecture 2
Slide made by Insu Han and Jongheon Jeong, KAIST EE
Stochastic Gradient Descent
Table of Contents
1. Introduction
  • Empirical risk minimization (ERM)
2. Gradient Descent Methods
  • Gradient descent (GD)
  • Stochastic gradient descent (SGD)
3. Momentum and Adaptive Learning Rate Methods
  • Momentum methods
  • Learning rate scheduling
  • Adaptive learning rate methods (AdaGrad, RMSProp, Adam)
4. Changing Batch Size
  • Increasing the batch size without learning rate decay
5. Summary
Empirical Risk Minimization (ERM)
• Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$
• Prediction function $f(x; \theta)$ parameterized by $\theta$
• Empirical risk minimization: find a parameter $\theta$ that minimizes the loss function
$$\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i; \theta), y_i)$$
where $\ell$ is a loss function, e.g., MSE, cross-entropy.
• For example, a deep neural network has $f(x; \theta) = W_L\, \sigma(W_{L-1} \cdots \sigma(W_1 x))$ with parameters $\theta = (W_1, \dots, W_L)$.
Next, how to solve ERM?
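The ERM objective can be written out directly. Below is a minimal illustration in which the squared-error loss, the one-parameter predictor `f`, and the toy data set are all made up for the example.

```python
import numpy as np

def empirical_risk(loss, f, theta, data):
    """ERM objective: the average loss of the predictor over the training set."""
    return np.mean([loss(f(x, theta), y) for x, y in data])

mse = lambda pred, y: (pred - y) ** 2     # squared-error loss
f = lambda x, theta: theta * x            # toy one-parameter linear predictor
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

risk = empirical_risk(mse, f, 2.0, data)  # theta = 2 fits this data exactly
```

Minimizing `risk` over `theta` is exactly the ERM problem the following slides solve with gradient methods.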
Gradient Descent (GD)
• Gradient descent (GD) updates parameters iteratively by taking a gradient step:
$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t)$$
where $\theta$ are the parameters, $\eta$ is the learning rate, and $L$ is the loss function.
• (+) Converges to the global (local) minimum for convex (non-convex) problems.
• (−) Not efficient in computation time and memory space for huge $n$.
  • For example, the ImageNet dataset has $n = 1{,}281{,}167$ training images: 1.2M 256×256 RGB images ≈ 236 GB of memory.
Next, efficient GD
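The full-batch GD update reduces to a few lines of code. A minimal sketch, assuming a toy quadratic loss $L(\theta) = \frac{1}{2}\|\theta - 3\|^2$ chosen so the gradient is trivial:

```python
import numpy as np

def gradient_descent(grad_L, theta0, lr=0.1, n_steps=100):
    """Plain (full-batch) gradient descent: theta <- theta - lr * grad_L(theta)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - lr * grad_L(theta)
    return theta

# Toy example: L(theta) = 0.5 * ||theta - 3||^2, so grad_L(theta) = theta - 3.
theta_hat = gradient_descent(lambda t: t - 3.0, theta0=np.zeros(2), lr=0.5, n_steps=50)
```

On a real dataset, `grad_L` would average per-example gradients over all $n$ samples, which is exactly the cost the next slide attacks.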
Stochastic Gradient Descent (SGD)
• Stochastic gradient descent (SGD) uses $m \ll n$ random samples (a minibatch $B_t$) to approximate the GD update:
$$\theta_{t+1} = \theta_t - \eta\, \frac{1}{m} \sum_{i \in B_t} \nabla_\theta\, \ell(f(x_i; \theta_t), y_i)$$
• In practice, minibatch sizes can be 32/64/128.
• Main practical challenges and current solutions:
  1. SGD can be too noisy and might be unstable → momentum
  2. Hard to find a good learning rate → adaptive learning rate
*source: https://lovesnowbest.site/2018/02/16/Improving-Deep-Neural-Networks-Assignment-2/
Next, momentum
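A minimal SGD sketch on a made-up least-squares problem; the planted solution, batch size, and learning rate are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_batch, theta0, n, lr=0.1, batch_size=32, n_steps=500):
    """SGD: each step estimates the full gradient from a random minibatch."""
    theta = theta0
    for _ in range(n_steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        theta = theta - lr * grad_batch(theta, idx)
    return theta

# Toy least-squares problem with planted solution theta* = [1, -2].
X = rng.normal(size=(1000, 2))
y = X @ np.array([1.0, -2.0])
grad_batch = lambda theta, idx: X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)

theta_hat = sgd(grad_batch, np.zeros(2), n=1000)
```

Each step touches only 32 of the 1000 examples, so the per-step cost is independent of $n$ while the minibatch gradient stays an unbiased estimate of the full one.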
Momentum Methods
1. Momentum gradient descent
• Add decaying previous gradients (momentum):
$$v_{t+1} = \mu v_t - \eta\, \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
where $\mu$ is the momentum preservation ratio.
• Equivalent to a weighted sum that keeps the fraction $\mu$ of the previous update.
• (+) Momentum reduces the oscillation and accelerates the convergence.
[Figure: SGD vs. SGD+momentum — momentum acts as friction to the vertical fluctuation and as acceleration along the consistent direction]
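The momentum update can be sketched as below, reusing an illustrative quadratic with minimum at 3 (not the lecture's example):

```python
import numpy as np

def sgd_momentum(grad_L, theta0, lr=0.1, mu=0.9, n_steps=300):
    """Heavy-ball momentum: keep a decaying sum of past gradients as a velocity."""
    theta, v = theta0, np.zeros_like(theta0)
    for _ in range(n_steps):
        v = mu * v - lr * grad_L(theta)  # fraction mu of the previous update survives
        theta = theta + v
    return theta

# Toy quadratic: grad_L(theta) = theta - 3, minimum at theta = 3.
theta_hat = sgd_momentum(lambda t: t - 3.0, np.zeros(2))
```

Because `v` averages gradients over recent steps, components that flip sign (oscillation) cancel while consistent components accumulate.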
Momentum Methods: Nesterov's Momentum
1. Momentum gradient descent
• Add decaying previous gradients (momentum).
• (−) Momentum can fail to converge even for simple convex optimization problems.
• Nesterov's accelerated gradient (NAG) [Nesterov'1983] uses the gradient at the approximate future position ("look-ahead" gradient), i.e.,
$$v_{t+1} = \mu v_t - \eta\, \nabla_\theta L(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
where $\mu$ is the momentum preservation ratio.
Momentum Methods: Nesterov's Momentum
1. Momentum gradient descent
• Add decaying previous gradients (momentum).
• Nesterov's accelerated gradient (NAG) [Nesterov'1983] uses the gradient at the approximate future position, i.e., the "look-ahead" gradient $\nabla_\theta L(\theta_t + \mu v_t)$.
Quiz: fill in the pseudocode of Nesterov's accelerated gradient.
[Figure: trajectories of SGD, SGD+momentum, and NAG]
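One possible way to fill in the quiz's pseudocode, as a sketch of the look-ahead form on the same illustrative quadratic:

```python
import numpy as np

def nag(grad_L, theta0, lr=0.1, mu=0.9, n_steps=300):
    """Nesterov's accelerated gradient: evaluate the gradient at the
    look-ahead point theta + mu * v instead of at theta itself."""
    theta, v = theta0, np.zeros_like(theta0)
    for _ in range(n_steps):
        v = mu * v - lr * grad_L(theta + mu * v)  # "look-ahead" gradient
        theta = theta + v
    return theta

theta_hat = nag(lambda t: t - 3.0, np.zeros(2))  # toy quadratic, minimum at 3
```

The only change from plain momentum is the argument of `grad_L`: the gradient is taken where the momentum is about to carry the iterate, which corrects overshoot earlier.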
Adaptive Learning Rate Methods
2. Learning rate scheduling
• The learning rate is critical for minimizing the loss!
  • Too high → may jump over the narrow valley, can diverge.
  • Too low → may fall into a local minimum, slow convergence.
*source: http://cs231n.github.io/neural-networks-3/
Next, learning rate scheduling
Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: decay methods
• A naive choice is a constant learning rate.
• Common learning rate schedules include time-based, exponential, and step decay (step is the most popular in practice):
  • Time-based: $\eta_t = \eta_0 / (1 + kt)$
  • Exponential: $\eta_t = \eta_0 e^{-kt}$
  • Step: drop $\eta$ by a factor every few epochs
• "Step decay" decreases the learning rate by a factor every few epochs.
  • Typically, it is set $\eta_0 = 0.01$ and drops by half every 10 epochs.
*source: https://towardsdatascience.com/
[Figure: step decay and exponential decay learning-rate curves, with the resulting accuracy]
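The three decay schedules follow directly from their formulas; the initial rate and decay constants below are illustrative defaults, not prescribed values:

```python
import math

def time_based_decay(t, lr0=0.01, k=0.1):
    """Time-based decay: lr_t = lr0 / (1 + k*t)."""
    return lr0 / (1.0 + k * t)

def exponential_decay(t, lr0=0.01, k=0.1):
    """Exponential decay: lr_t = lr0 * exp(-k*t)."""
    return lr0 * math.exp(-k * t)

def step_decay(t, lr0=0.01, drop=0.5, every=10):
    """Step decay: drop the lr by a factor every `every` epochs (most popular)."""
    return lr0 * drop ** (t // every)
```

For example, with the defaults above, step decay holds 0.01 for the first 10 epochs, then 0.005, then 0.0025, and so on.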
Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: cycling method
• [Smith'2015] proposed the cycling (triangular) learning rate.
• Why a "cycling" learning rate?
  • Sometimes, increasing the learning rate is helpful to escape saddle points.
• It can be combined with exponential decay or periodic decay.
*source: https://github.com/bckenstler/CLR
[Figure: cycling (triangular) learning rate with decay]
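A sketch of the triangular policy, following the formula popularized in the linked CLR repository; the bounds and step size are illustrative:

```python
import math

def triangular_lr(t, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Cyclical (triangular) lr: rise linearly from base_lr to max_lr over
    step_size iterations, fall back over the next step_size, then repeat."""
    cycle = math.floor(1 + t / (2 * step_size))
    x = abs(t / step_size - 2 * cycle + 1)   # position within the current cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```

The periodic rise back to `max_lr` is what gives the optimizer a chance to jump out of saddle regions.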
Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: cycling method
• [Loshchilov'2017] uses cosine cycling and restarts to the maximum at each cycle.
• Why "cosine"?
  • It decays slowly during the first half of a cycle and drops quickly in the remainder.
• (+) Can climb down and up the loss surface, thus can traverse several local minima.
• (+) Same as restarting at good points with the initial learning rate.
*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017
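A minimal sketch of the cosine schedule with restarts, assuming a fixed cycle length (SGDR also allows the cycle length to grow between restarts, which is omitted here):

```python
import math

def cosine_annealing_restarts(t, lr_min=0.0, lr_max=0.05, cycle_len=10):
    """Cosine schedule: decay slowly early in the cycle, quickly near its end,
    then restart at lr_max at the start of the next cycle."""
    t_cur = t % cycle_len  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / cycle_len))
```

At the cycle midpoint the rate is exactly halfway between `lr_max` and `lr_min`, and at each multiple of `cycle_len` it snaps back to `lr_max`.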
Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: cycling method
• [Loshchilov'2017] also proposed warm restarts* for the cycling learning rate.
• (+) It helps to escape saddle points, since the optimizer is more likely to get stuck in early iterations.
*Warm restart: restart frequently in early iterations.
[Figure: step decay vs. cycling with no restart vs. cycling with restart]
But there is no perfect learning rate schedule! It depends on the specific task.
*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017
Next, adaptive learning rate
Adaptive Learning Rate Methods: AdaGrad, RMSProp
3. Adaptively changing the learning rate (AdaGrad, RMSProp)
• AdaGrad [Duchi'11] downscales the learning rate by the magnitude of previous gradients:
$$G_{t+1} = G_t + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_{t+1}} + \epsilon}\, g_t$$
where $G_t$ is the sum of all previous squared gradients (elementwise).
• (−) The learning rate strictly decreases and becomes too small for large iterations.
• RMSProp [Tieleman'12] uses a moving average of the squared gradients instead:
$$G_{t+1} = \rho G_t + (1 - \rho)\, g_t^2$$
where $\rho$ is the preservation ratio.
• Other variants also exist, e.g., Adadelta [Zeiler'2012].
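Both updates can be sketched as single elementwise steps; the learning rates and $\rho$ below are illustrative defaults:

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate ALL squared gradients; the step can only shrink."""
    G = G + g * g
    return theta - lr * g / (np.sqrt(G) + eps), G

def rmsprop_step(theta, g, G, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients instead."""
    G = rho * G + (1 - rho) * g * g
    return theta - lr * g / (np.sqrt(G) + eps), G

# One step of each from theta = 0 with gradient g = 1 (elementwise).
theta_a, G_a = adagrad_step(np.zeros(2), np.ones(2), np.zeros(2))
theta_r, G_r = rmsprop_step(np.zeros(2), np.ones(2), np.zeros(2))
```

Over many calls, AdaGrad's accumulator `G` grows without bound and the effective step tends to zero (the (−) point above), while RMSProp's moving average stays on the scale of recent gradients.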
Adaptive Learning Rate Methods
• Visualization of algorithms
[Animations: optimization near a saddle point; optimization near a local optimum]
• Adaptive learning-rate methods, i.e., Adadelta and RMSProp, are most suitable and provide the best convergence for these scenarios.
*source: animations from Alec Radford's blog
Next, momentum + adaptive learning rate
Adaptive Learning Rate Methods: Adam
3. Combination of momentum and adaptive learning rate
• Adam (ADAptive Moment estimation) [Kingma'2015]:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
where $m_t$ is the momentum and $v_t$ is the average of squared gradients.
• Can be seen as a momentum + RMSProp update.
• Other variants exist, e.g., Adamax [Kingma'14], Nadam [Dozat'16].
*source: Kingma and Ba. Adam: A method for stochastic optimization. ICLR 2015
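Putting momentum and RMSProp together with bias correction gives a compact Adam sketch; the hyperparameters are the paper's defaults except the illustrative learning rate, and the quadratic test problem is made up:

```python
import numpy as np

def adam(grad_L, theta0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, n_steps=2000):
    """Adam: momentum m on gradients + RMSProp-style v on squared gradients,
    both bias-corrected because m and v start at zero."""
    theta = theta0.astype(float)
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    for t in range(1, n_steps + 1):
        g = grad_L(theta)
        m = b1 * m + (1 - b1) * g          # momentum (first moment)
        v = b2 * v + (1 - b2) * g * g      # average of squared gradients
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta_hat = adam(lambda t: t - 3.0, np.zeros(2))  # toy quadratic, minimum at 3
```

Dropping the `v`/`v_hat` lines recovers (bias-corrected) momentum; dropping the `m`/`m_hat` lines recovers RMSProp, which is the decomposition the slide describes.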
Decaying the Learning Rate = Increasing the Batch Size
• In practice, SGD+momentum and Adam work well in many applications.
• But scheduling learning rates is still critical! (they should decay appropriately)
• [Smith'2017] shows that decaying the learning rate = increasing the batch size.
• (+) A large batch size allows fewer parameter updates, leading to parallelism!
*source: Smith et al., "Don't Decay the Learning Rate, Increase the Batch Size.", ICLR 2018
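The equivalence can be expressed as two schedules that change the scale of SGD's gradient noise by the same factor; every constant below is illustrative:

```python
def equivalent_schedules(epoch, lr0=0.1, batch0=128, factor=5, every=30):
    """Every `every` epochs, either divide the lr by `factor` (conventional)
    or, equivalently per Smith et al., multiply the batch size by `factor`:
    both shrink the SGD noise scale by the same amount."""
    k = epoch // every
    decay_lr = (lr0 / factor ** k, batch0)     # conventional schedule
    grow_batch = (lr0, batch0 * factor ** k)   # fewer updates, more parallelism
    return decay_lr, grow_batch
```

For example, at epoch 30 the conventional schedule runs at (lr 0.02, batch 128) while the equivalent one runs at (lr 0.1, batch 640), but in 5× fewer parameter updates per epoch.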
Summary
• SGD has been an essential algorithm for deep learning, together with back-propagation.
• Momentum methods improve the performance of gradient descent algorithms.
  • Nesterov's momentum
• Annealing learning rates is critical for minimizing training loss.
  • Exponential, harmonic, and cyclic decay methods
  • Adaptive learning rate methods (RMSProp, AdaGrad, AdaDelta, Adam, etc.)
• In practice, SGD+momentum shows successful results, outperforming Adam!
  • For example, in NLP (Huang et al., 2017) or machine translation (Wu et al., 2016)
References
• [Nesterov'1983] Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). 1983. link: http://mpawankumar.info/teaching/cdt-big-data/nesterov83.pdf
• [Duchi'2011] Duchi et al. Adaptive subgradient methods for online learning and stochastic optimization. JMLR 2011. link: http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
• [Tieleman'2012] Geoff Hinton's Lecture 6e of the Coursera class. link: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
• [Zeiler'2012] Zeiler, M. D. ADADELTA: An adaptive learning rate method. 2012. link: https://arxiv.org/pdf/1212.5701.pdf
• [Smith'2015] Smith, Leslie N. Cyclical learning rates for training neural networks. link: https://arxiv.org/pdf/1506.01186.pdf
• [Kingma'2015] Kingma and Ba. Adam: A method for stochastic optimization. ICLR 2015. link: https://arxiv.org/pdf/1412.6980.pdf
• [Dozat'2016] Dozat, T. Incorporating Nesterov momentum into Adam. ICLR Workshop 2016. link: http://cs229.stanford.edu/proj2015/054_report.pdf
• [Smith'2017] Smith, Samuel L., Pieter-Jan Kindermans, and Quoc V. Le. Don't decay the learning rate, increase the batch size. ICLR 2018. link: https://openreview.net/pdf?id=B1Yy1BxCZ
• [Loshchilov'2017] Loshchilov, I., & Hutter, F. SGDR: Stochastic gradient descent with warm restarts. ICLR 2017. link: https://arxiv.org/pdf/1608.03983.pdf