In-Place Activated BatchNorm for Memory-Optimized Training of DNNs


Transcript of In-Place Activated BatchNorm for Memory-Optimized Training of DNNs

Page 1:

In-Place Activated BatchNorm for Memory-Optimized Training of DNNs

Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder

Mapillary Research
Paper: https://arxiv.org/abs/1712.02616
Code: https://github.com/mapillary/inplace_abn

CSC2548, 2018 Winter. Harris Chan. Jan 31, 2018

Page 2:

Overview

• Motivation for Efficient Memory Management
• Related Works
  • Reducing Precision
  • Checkpointing
  • Reversible Networks [9] (Gomez et al., 2017)
• In-Place Activated Batch Normalization
  • Review: Batch Normalization
  • In-Place Activated Batch Normalization
• Experiments
• Future Directions

Page 3:

Overview

• Motivation for Efficient Memory Management
• Related Works
  • Reducing Precision
  • Checkpointing
  • Reversible Networks [9] (Gomez et al., 2017)
• In-Place Activated Batch Normalization
  • Review: Batch Normalization
  • In-Place Activated Batch Normalization
• Experiments
• Future Directions

Page 4:

Why Reduce Memory Usage?

• Modern computer vision recognition models use deep neural networks to extract features
• The depth/width of a network drives its GPU memory requirements
• Semantic segmentation: training may fit only a single crop per GPU due to suboptimal memory management
• More efficient memory usage during training lets you:
  • Train larger models
  • Use bigger batch sizes / image resolutions
• This paper focuses on increasing the memory efficiency of training deep network architectures, at the expense of a small amount of additional computation time

Page 5:

Approaches to Reducing Memory

Reduce memory by…
• Reducing precision (& accuracy)
• Increasing computation time

Page 6:

Overview

• Motivation for Efficient Memory Management
• Related Works
  • Reducing Precision
  • Checkpointing
  • Reversible Networks [9] (Gomez et al., 2017)
• In-Place Activated Batch Normalization
  • Review: Batch Normalization
  • In-Place Activated Batch Normalization
• Experiments
• Future Directions

Page 7:

Related Works: Reducing Precision

| Work | Weights | Activations | Gradients |
|---|---|---|---|
| BinaryConnect (M. Courbariaux et al., 2015) | Binary | Full Precision | Full Precision |
| Binarized Neural Networks (I. Hubara et al., 2016) | Binary | Binary | Full Precision |
| Quantized Neural Networks (I. Hubara et al.) | Quantized (2, 4, 6 bits) | Quantized (2, 4, 6 bits) | Full Precision |
| Mixed Precision Training (P. Micikevicius et al., 2017) | Half Precision (fwd/bwd) & Full Precision (master weights) | Half Precision | Half Precision |

Page 8:

Related Works: Reducing Precision

• Idea: During training, lower the precision (down to binary) of the weights/activations/gradients

| Strengths | Weaknesses |
|---|---|
| Reduces memory requirements and the size of the model | Often a decrease in accuracy (newer work attempts to address this) |
| Less power: efficient forward pass | |
| Faster: 1-bit XNOR-count vs. 32-bit floating-point multiply | |

Page 9:

Related Works: Computation Time

• Checkpointing: trade off memory against computation time
• Idea: During backpropagation, store only a subset of activations ("checkpoints") and recompute the remaining activations as needed (see the sketch below)
• Depending on the architecture, we can use different strategies to decide which subsets of activations to store
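As an aside (not on the slide): PyTorch ships this exact trade-off as a utility, torch.utils.checkpoint. A minimal sketch of checkpointing one block, assuming a toy two-layer MLP:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A block whose intermediate activations we would rather not keep in memory.
block = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
)

x = torch.randn(32, 1024, requires_grad=True)

# Forward pass that stores only the block's input and output;
# the intermediate activations are recomputed during the backward pass.
y = checkpoint(block, x)
y.sum().backward()
```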

Page 10:

Related Works: Computation Time

Let L be the number of identical feed-forward layers:

| Work | Spatial Complexity | Computational Complexity |
|---|---|---|
| Naive | O(L) | O(L) |
| Checkpointing (Martens and Sutskever, 2012) | O(√L) | O(L) |
| Recursive Checkpointing (T. Chen et al., 2016) | O(log L) | O(L log L) |
| Reversible Networks (Gomez et al., 2017) | O(1) | O(L) |

Table adapted from Gomez et al., 2017. "The Reversible Residual Network: Backpropagation Without Storing Activations". ArXiv Link

Page 11:

Related Works: Computation Time
Reversible ResNet (Gomez et al., 2017)

[Figure: basic residual function and residual block vs. the RevNet forward and backward computations]

Gomez et al., 2017. "The Reversible Residual Network: Backpropagation Without Storing Activations". ArXiv Link

Idea: The reversible residual module allows the current layer's activations to be reconstructed exactly from the next layer's, so there is no need to store any activations for backpropagation!
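For reference, the RevNet coupling equations from Gomez et al. (the input is split channel-wise into (x_1, x_2)):

```latex
% Forward:
y_1 = x_1 + \mathcal{F}(x_2), \qquad y_2 = x_2 + \mathcal{G}(y_1)
% Backward reconstruction:
x_2 = y_2 - \mathcal{G}(y_1), \qquad x_1 = y_1 - \mathcal{F}(x_2)
```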

Page 12:

Related Works: Computation Time
Reversible ResNet (Gomez et al., 2017)

Advantages:
• No noticeable loss in performance
• Gains in network depth: ~600 vs. ~100 layers
• 4x increase in batch size (128 vs. 32)

Disadvantages:
• Runtime cost: 1.5x of normal training (sometimes less in practice)
• Reversible blocks are restricted to a stride of 1 so that no information is discarded (i.e., no bottleneck layer)

Gomez et al., 2017. "The Reversible Residual Network: Backpropagation Without Storing Activations". ArXiv Link

Page 13:

Overview

• Motivation for Efficient Memory Management
• Related Works
  • Reducing Precision
  • Checkpointing
  • Reversible Networks [9] (Gomez et al., 2017)
• In-Place Activated Batch Normalization
  • Review: Batch Normalization
  • In-Place Activated Batch Normalization
• Experiments
• Future Directions

Page 14:

Review: Batch Normalization (BN)

• Apply BN on the current features (x_i) across the mini-batch
• Helps reduce internal covariate shift & accelerates the training process
• Less sensitive to initialization

Credit: Ioffe & Szegedy, 2015. ArXiv link
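The batch-normalizing transform from Ioffe & Szegedy, restated here since the slide's formulas did not survive transcription (m is the mini-batch size; γ, β are learned parameters):

```latex
\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2, \qquad
\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)
```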

Page 15:

Memory Optimization Strategies

• Let's compare the various strategies for BN + Act:
  1. Standard
  2. Checkpointing (baseline)
  3. Checkpointing (proposed)
  4. In-Place Activated Batch Normalization I
  5. In-Place Activated Batch Normalization II

Page 16:

1: Standard BN Implementation

Page 17:

Gradients for Batch Normalization

Credit: Ioffe & Szegedy, 2015. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ArXiv link
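Restating the backward-pass equations from the BN paper, since the slide's figure did not survive transcription (ℓ is the loss):

```latex
\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i}\,\gamma, \qquad
\frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\,(x_i - \mu_{\mathcal{B}})\cdot\Bigl(-\tfrac{1}{2}\Bigr)\bigl(\sigma_{\mathcal{B}}^2 + \epsilon\bigr)^{-3/2}
\frac{\partial \ell}{\partial \mu_{\mathcal{B}}} = \Bigl(\sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}\Bigr) + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2}\cdot\frac{\sum_{i=1}^{m} -2(x_i - \mu_{\mathcal{B}})}{m}
\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2}\cdot\frac{2(x_i - \mu_{\mathcal{B}})}{m} + \frac{\partial \ell}{\partial \mu_{\mathcal{B}}}\cdot\frac{1}{m}
\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}\,\hat{x}_i, \qquad
\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}
```

Note the x_i − μ_B terms: they are why the standard implementation keeps the input x in memory. The paper's key observation is that the backward pass can be rewritten in terms of x̂ alone, which is recoverable from the stored output z.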

Page 18:

2: Checkpointing (baseline)

Page 19:

3: Checkpointing (Proposed)

Page 20:

In-Place ABN

• Fuse the batch norm and activation layers to enable in-place computation, using only a single memory buffer to store results
• Encapsulation makes it easy to implement and deploy
• Implemented the INPLACE-ABN I layer in PyTorch as a new module
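As a usage illustration of the plug-and-play claim: the fused layer replaces a BN + activation pair. A hypothetical sketch; the class name InPlaceABN follows the authors' repository, but the exact import path and constructor signature there may differ:

```python
import torch.nn as nn
# Hypothetical import; see https://github.com/mapillary/inplace_abn
# for the actual module location and constructor arguments.
from inplace_abn import InPlaceABN

# Standard, memory-hungry pattern: BN and activation as separate layers.
standard = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.01),
)

# Fused pattern: BN + leaky ReLU in a single module sharing one buffer.
fused = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1),
    InPlaceABN(256),
)
```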

Page 21:

4: In-Place ABN I (Proposed)

Requires an invertible activation function and γ ≠ 0 (so that the affine part of BN can be inverted).
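Spelled out with the paper's notation, where π_{γ,β}(x̂) = γx̂ + β is the scale-and-shift part of BN and φ is the activation: the backward pass recovers what it needs from the stored output z rather than from the input x,

```latex
y = \phi^{-1}(z), \qquad
\hat{x} = \pi_{\gamma,\beta}^{-1}(y) = \frac{y - \beta}{\gamma} \qquad (\gamma \neq 0)
```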

Page 22:

Leaky ReLU is Invertible
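The slide's figure did not survive transcription; for the record, leaky ReLU with slope a and its inverse are:

```latex
\phi(y) = \begin{cases} y & y \ge 0 \\ a\,y & y < 0 \end{cases}
\qquad
\phi^{-1}(z) = \begin{cases} z & z \ge 0 \\ z / a & z < 0 \end{cases}
```

Putting the pieces together, a minimal, illustrative PyTorch sketch of the In-Place ABN I strategy for a 2D input of shape (batch, features). This is a toy under stated assumptions, not the authors' CUDA implementation: it does not actually overwrite buffers in place, and it ignores running statistics.

```python
import torch

class InPlaceABNSketch(torch.autograd.Function):
    """Toy In-Place ABN I: save only the output z plus batch statistics,
    and recover y and x_hat in backward by inverting the leaky ReLU and
    the affine part of BN (requires gamma != 0)."""

    @staticmethod
    def forward(ctx, x, gamma, beta, eps=1e-5, slope=0.01):
        mu = x.mean(dim=0)
        var = x.var(dim=0, unbiased=False)
        x_hat = (x - mu) / torch.sqrt(var + eps)
        y = gamma * x_hat + beta               # affine part: pi_{gamma,beta}
        z = torch.where(y >= 0, y, slope * y)  # leaky ReLU: phi
        ctx.save_for_backward(z, gamma, beta, var)  # note: neither x nor x_hat
        ctx.eps, ctx.slope = eps, slope
        return z

    @staticmethod
    def backward(ctx, grad_z):
        z, gamma, beta, var = ctx.saved_tensors
        # Invert the activation: y = phi^{-1}(z)
        y = torch.where(z >= 0, z, z / ctx.slope)
        grad_y = torch.where(z >= 0, grad_z, grad_z * ctx.slope)
        # Invert the affine part of BN: x_hat = (y - beta) / gamma
        x_hat = (y - beta) / gamma
        grad_gamma = (grad_y * x_hat).sum(dim=0)
        grad_beta = grad_y.sum(dim=0)
        # Standard BN input gradient, expressed purely in terms of x_hat
        grad_x_hat = grad_y * gamma
        grad_x = (grad_x_hat - grad_x_hat.mean(dim=0)
                  - x_hat * (grad_x_hat * x_hat).mean(dim=0)) / torch.sqrt(var + ctx.eps)
        return grad_x, grad_gamma, grad_beta, None, None

# Usage: z = InPlaceABNSketch.apply(x, gamma, beta)
```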

Page 23:

5: In-Place ABN II (Proposed)

Page 24:

Strategies Comparison

| Strategy | Store | Computation Overhead |
|---|---|---|
| Standard | x, z, σ_B, μ_B | – |
| Checkpointing | x, σ_B, μ_B | BN_{γ,β}, φ |
| Checkpointing (proposed) | x, σ_B | π_{γ,β}, φ |
| In-Place ABN I (proposed) | z, σ_B | φ⁻¹, π⁻¹_{γ,β} |
| In-Place ABN II (proposed) | z, σ_B | φ⁻¹ |

(φ is the activation function; π_{γ,β}(x̂) = γx̂ + β is the scale-and-shift part of BN.)

Page 25:

In-Place ABN (Proposed)

Page 26:

In-Place ABN (Proposed)

| Strengths | Weaknesses |
|---|---|
| Reduces memory requirements by half compared to standard; same savings as checkpointing | Requires an invertible activation function |
| Empirically faster than naive checkpointing | …but still slower than the standard (memory-hungry) implementation |
| Encapsulating BN & activation together makes it easy to implement and deploy (plug & play) | |

Page 27:

Overview

• Motivation for Efficient Memory Management
• Related Works
  • Reducing Precision
  • Checkpointing
  • Reversible Networks [9] (Gomez et al., 2017)
• In-Place Activated Batch Normalization
  • Review: Batch Normalization
  • In-Place Activated Batch Normalization
• Experiments
• Future Directions

Page 28:

Experiments: Overview

• 3 major types:
  • Performance on: (1) Image Classification, (2) Semantic Segmentation
  • (3) Timing Analysis compared to standard/checkpointing

• Experiment Setup:
  • NVIDIA Titan Xp (12 GB RAM/GPU)
  • PyTorch
  • Leaky ReLU activation

Page 29:

Experiments: Image Classification

| | ResNeXt-101 / ResNeXt-152 | WideResNet-38 |
|---|---|---|
| Dataset | ImageNet-1k | ImageNet-1k |
| Description | Bottleneck residual units are replaced with a multi-branch version ("cardinality" of 64) | More feature channels but shallower |
| Data Augmentation | Scale smallest side to 256 pixels, then randomly crop 224×224; per-channel mean and variance normalization | (Same as ResNeXt-101/152) |
| Optimizer | SGD with Nesterov updates; initial learning rate 0.1; weight decay 10⁻⁴; momentum 0.9; 90 epochs, reducing by a factor of 10 every 30 epochs | (Same as ResNeXt); 90 epochs, linearly decreasing from 0.1 to 10⁻⁶ |

Page 30:

Experiments: Leaky ReLU Impact

• Using Leaky ReLU performs slightly worse than ReLU
• Within ~1%, except for the 320² center crop; the authors argued this was due to non-deterministic training behaviour
• Weaknesses:
  • Showing an average and standard deviation would be more convincing of the improvements

Page 31:

Experiments: Exploiting Memory Savings

Settings compared: baseline; 1) larger batch size; 2) deeper network; 3) larger network; 4) synchronized BN

• Performance increases for 1-3
• Similar performance with a larger batch size vs. a deeper model (1 vs. 2)
• Synchronized INPLACE-ABN did not increase the performance that much
• Notes on synchronized BN: http://hangzh.com/PyTorch-Encoding/notes/syncbn.html

Page 32:

Experiments: Semantic Segmentation

• Semantic segmentation: assign a categorical label to each pixel in an image
• Datasets:
  • CityScapes
  • COCO-Stuff
  • Mapillary Vistas

Figure credit: https://www.cityscapes-dataset.com/examples/

Page 33:

Experiments: Semantic Segmentation

• The architecture contains 2 parts that are jointly fine-tuned on segmentation data:
  • Body: classification model pre-trained on ImageNet
  • Head: segmentation-specific architecture

• The authors used DeepLabV3* as the head
  • Cascaded atrous (dilated) convolutions for capturing contextual info
  • Crop-level features encoding global context

• Maximize GPU usage by either:
  • (FIXED CROP) fixing the training crop size and therefore pushing the number of crops per minibatch to the limit, or
  • (FIXED BATCH) fixing the number of crops per minibatch and maximizing the training crop resolution

*L. Chen, G. Papandreou, F. Schroff, and H. Adam. "Rethinking atrous convolution for semantic image segmentation." ArXiv Link

Page 34:

Experiments: Semantic Segmentation

• More training data (FIXED CROP) helps a little bit
• Higher input resolution (FIXED BATCH) helps even more than adding more crops
• No qualitative results shown: probably visually similar to DeepLabV3

Page 35:

Experiments: Semantic Segmentation, Fine-Tuned on CityScapes and Mapillary Vistas

• The combination of INPLACE-ABN sync with larger crop sizes improves by ≈0.9% over the best-performing setting in Table 3
• Class-uniform sampling: training crops are sampled class-uniformly from eligible image candidates, making sure to take crops from areas containing the class of interest

Page 36:

Experiments: Semantic Segmentation

• Currently state of the art on CityScapes for both IoU (class) and iIoU (instance) metrics
• iIoU: weights the contribution of each pixel by the ratio of the class' average instance size to the size of the respective ground-truth instance
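For reference (from the Cityscapes benchmark definition, not stated on the slide): the instance-level score replaces the true-positive and false-negative pixel counts with the instance-size-weighted versions iTP and iFN, while false positives FP stay unweighted:

```latex
\mathrm{iIoU} = \frac{\mathrm{iTP}}{\mathrm{iTP} + \mathrm{FP} + \mathrm{iFN}}
```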

Page 37:

Experiments: Timing Analyses

• They isolated a single BN + ACT + CONV block & evaluated the computation time required for a forward and backward pass
• Result: narrowed the gap between standard and checkpointing by half
• Ensured a fair comparison by re-implementing checkpointing in PyTorch

Page 38:

Overview

• Motivation for Efficient Memory Management
• Related Works
  • Reducing Precision
  • Checkpointing
  • Reversible Networks [9] (Gomez et al., 2017)
• In-Place Activated Batch Normalization
  • Review: Batch Normalization
  • In-Place Activated Batch Normalization
• Experiments
• Future Directions

Page 39:

Future Directions:

• Apply INPLACE-ABN in other…
  • Architectures: DenseNet, Squeeze-and-Excitation Networks, Deformable Convolutional Networks
  • Problem domains: object detection, instance-specific segmentation, learning on 3D data

• Combine INPLACE-ABN with other memory reduction techniques, e.g. mixed precision training
• Apply the same in-place idea to 'newer' batch norms, e.g. Batch Renormalization*

*S. Ioffe. "Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models." ArXiv Link

Page 40:

Links and References

• INPLACE-ABN Paper: https://arxiv.org/pdf/1712.02616.pdf
• Official GitHub code (PyTorch): https://github.com/mapillary/inplace_abn
• CityScapes Dataset: https://www.cityscapes-dataset.com/benchmarks/#scene-labeling-task
• Reduced Precision:
  • BinaryConnect: https://arxiv.org/abs/1511.00363
  • Binarized Networks: https://arxiv.org/abs/1602.02830
  • Mixed Precision Training: https://arxiv.org/abs/1710.03740
• Trade-off with Computation Time:
  • Checkpointing: https://www.cs.utoronto.ca/~jmartens/docs/HF_book_chapter.pdf
  • Recursive Checkpointing: https://arxiv.org/abs/1604.06174
  • Reversible Networks: https://arxiv.org/abs/1707.04585