Transcript of: In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder
Mapillary Research
Paper: https://arxiv.org/abs/1712.02616
Code: https://github.com/mapillary/inplace_abn
CSC2548, 2018 Winter. Harris Chan, Jan 31, 2018
Overview
• Motivation for Efficient Memory Management
• Related Works
  • Reducing Precision
  • Checkpointing
  • Reversible Networks [9] (Gomez et al., 2017)
• In-Place Activated Batch Normalization
  • Review: Batch Normalization
  • In-Place Activated Batch Normalization
• Experiments
• Future Directions
Why Reduce Memory Usage?
• Modern computer vision recognition models use deep neural networks to extract features
• Depth/width of networks ~ GPU memory requirements
• Semantic segmentation: may even fit only a single crop per GPU during training due to suboptimal memory management
• More efficient memory usage during training lets you:
  • Train larger models
  • Use bigger batch sizes / image resolutions
• This paper focuses on increasing the memory efficiency of the training process of deep network architectures, at the expense of a small amount of additional computation time
Approaches to Reducing Memory
Reduce memory by…
• Reducing precision (& accuracy)
• Increasing computation time
Related Works: Reducing Precision

Work                                                    | Weights                                                    | Activations              | Gradients
BinaryConnect (M. Courbariaux et al., 2015)             | Binary                                                     | Full Precision           | Full Precision
Binarized neural networks (I. Hubara et al., 2016)      | Binary                                                     | Binary                   | Full Precision
Quantized neural networks (I. Hubara et al.)            | Quantized (2, 4, 6 bits)                                   | Quantized (2, 4, 6 bits) | Full Precision
Mixed precision training (P. Micikevicius et al., 2017) | Half Precision (fwd/bwd) & Full Precision (master weights) | Half Precision           | Half Precision
Related Works: Reducing Precision
• Idea: during training, lower the precision (down to binary) of the weights/activations/gradients

Strengths:
• Reduces the memory requirement and the size of the model
• Less power: efficient forward pass
• Faster: 1-bit XNOR-count vs. 32-bit floating-point multiply
Weaknesses:
• Often a decrease in accuracy (newer work attempts to address this)
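The BinaryConnect row in the table above can be illustrated with a minimal sketch: binary weights are used in the forward (and backward) pass, while a full-precision "master" copy is kept for the parameter update. The names `binarize` and `master_w` are illustrative, not from the paper's code.

```python
import numpy as np

def binarize(w):
    """Deterministic binarization as in BinaryConnect: sign(w), with 0 mapped to +1."""
    return np.where(w >= 0, 1.0, -1.0)

# Full-precision master weights are retained for the gradient update step;
# the binarized copy is what the forward pass actually multiplies by.
master_w = np.array([0.3, -0.7, 0.05, -0.01])
w_b = binarize(master_w)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = x @ w_b  # forward pass uses the binary weights
```

The update step (not shown) would apply the gradient to `master_w`, then re-binarize for the next iteration.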
Related Works: Computation Time
• Checkpointing: trade off memory with computation time
• Idea: during backpropagation, store a subset of activations ("checkpoints") and recompute the remaining activations as needed
• Depending on the architecture, we can use different strategies to decide which subset of activations to store
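The checkpointing idea can be sketched in plain Python, assuming a chain of L layers and the O(√L) strategy of storing every ⌈√L⌉-th activation; `layer` here is a stand-in for an arbitrary per-layer computation.

```python
import math

def layer(i, x):
    # stand-in for layer i's forward computation
    return x + i

def forward_with_checkpoints(x0, L):
    """Run L layers, storing only every ceil(sqrt(L))-th activation ("checkpoints")."""
    stride = math.ceil(math.sqrt(L))
    checkpoints = {0: x0}
    x = x0
    for i in range(1, L + 1):
        x = layer(i, x)
        if i % stride == 0:
            checkpoints[i] = x
    return x, checkpoints, stride

def recompute_activation(checkpoints, stride, j):
    """During backprop, rebuild activation j from the nearest earlier checkpoint."""
    start = (j // stride) * stride
    x = checkpoints[start]
    for i in range(start + 1, j + 1):
        x = layer(i, x)
    return x
```

Each activation is recomputed from a checkpoint at most ⌈√L⌉ layers away, so the extra compute stays O(L) overall while only O(√L) activations are ever held in memory.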
Related Works: Computation Time
• Let L be the number of identical feed-forward layers:

Work                                           | Spatial Complexity | Computational Complexity
Naive                                          | O(L)               | O(L)
Checkpointing (Martens and Sutskever, 2012)    | O(√L)              | O(L)
Recursive Checkpointing (T. Chen et al., 2016) | O(log L)           | O(L log L)
Reversible Networks (Gomez et al., 2017)       | O(1)               | O(L)

Table adapted from Gomez et al., 2017, "The Reversible Residual Network: Backpropagation Without Storing Activations". ArXiv link
Related Works: Computation Time
Reversible ResNet (Gomez et al., 2017)
• Basic residual function: y = x + F(x)
• RevNet (forward): split the input into (x1, x2), then
  y1 = x1 + F(x2)
  y2 = x2 + G(y1)
• RevNet (backward): reconstruct the inputs from the outputs,
  x2 = y2 − G(y1)
  x1 = y1 − F(x2)
• Idea: the reversible residual module allows the current layer's activations to be reconstructed exactly from the next layer's. No need to store any activations for backpropagation!
Gomez et al., 2017, "The Reversible Residual Network: Backpropagation Without Storing Activations". ArXiv link
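A minimal numeric sketch of the reversible coupling, with scalar inputs and hypothetical residual functions F and G; reversibility comes from the coupling structure, not from any property of F or G.

```python
# Any functions work here; the additive coupling makes the block invertible.
def F(x):
    return 2 * x

def G(x):
    return x + 1

def rev_forward(x1, x2):
    """RevNet forward pass on a split input (x1, x2)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_backward(y1, y2):
    """Reconstruct the inputs exactly from the outputs: no stored activations."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2
```

Because the inverse subtracts the same F and G terms the forward pass added, the reconstruction is exact (up to floating-point error in real networks).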
Related Works: Computation Time
Reversible ResNet (Gomez et al., 2017)
Advantages:
• No noticeable loss in performance
• Gains in network depth: ~600 vs. ~100 layers
• 4x increase in batch size (128 vs. 32)
Disadvantages:
• Runtime cost: 1.5x of normal training (sometimes less in practice)
• Reversible blocks are restricted to a stride of 1 so as not to discard information (i.e., no bottleneck layer)
Gomez et al., 2017, "The Reversible Residual Network: Backpropagation Without Storing Activations". ArXiv link
Review: Batch Normalization (BN)
• Apply BN to the current features (x_i) across the mini-batch
• Helps reduce internal covariate shift & accelerates the training process
• Less sensitive to initialization
Credit: Ioffe & Szegedy, 2015. ArXiv link
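The BN forward pass can be sketched with NumPy: normalize each feature over the mini-batch, then apply the learnable scale and shift. `batch_norm_forward` is an illustrative name, and `eps` is the usual numerical-stability constant.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """BN over the mini-batch dimension (axis 0): normalize, then scale and shift."""
    mu = x.mean(axis=0)                    # batch mean mu_B
    var = x.var(axis=0)                    # batch variance sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized features
    z = gamma * x_hat + beta               # learnable scale (gamma) and shift (beta)
    return z, x_hat, mu, var
```

With gamma = 1 and beta = 0, the output has approximately zero mean and unit variance per feature, regardless of the input statistics.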
Memory Optimization Strategies
• Let's compare the various strategies for BN + Act:
  1. Standard
  2. Checkpointing (baseline)
  3. Checkpointing (proposed)
  4. In-Place Activated Batch Normalization I
  5. In-Place Activated Batch Normalization II
1: Standard BN Implementation
Gradients for Batch Normalization
Credit: Ioffe & Szegedy, 2015, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ArXiv link
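A sketch of the BN backward pass following the Ioffe & Szegedy formulas. Note that the gradients require x (equivalently x̂), which is why the standard implementation must keep x in memory — the very cost In-Place ABN targets.

```python
import numpy as np

def batch_norm_backward(dz, x, gamma, mu, var, eps=1e-5):
    """Gradients of BN w.r.t. input, gamma, and beta (Ioffe & Szegedy, 2015).
    Needs x (or x_hat): this is the buffer the standard implementation stores."""
    m = x.shape[0]
    std = np.sqrt(var + eps)
    x_hat = (x - mu) / std

    dgamma = (dz * x_hat).sum(axis=0)
    dbeta = dz.sum(axis=0)

    dx_hat = dz * gamma
    dvar = (dx_hat * (x - mu)).sum(axis=0) * -0.5 * std**-3
    dmu = -dx_hat.sum(axis=0) / std + dvar * (-2.0 / m) * (x - mu).sum(axis=0)
    dx = dx_hat / std + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```

A quick sanity check: if the upstream gradient is constant, the input gradient vanishes, because the sum of BN outputs depends only on beta, not on x.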
2: Checkpointing (baseline)
3: Checkpointing (Proposed)
In-Place ABN
• Fuse the batch norm and activation layers to enable in-place computation, using only a single memory buffer to store results
• Encapsulation makes it easy to implement and deploy
• Implemented the INPLACE-ABN-I layer in PyTorch as a new module
4: In-Place ABN I (Proposed)
• Requires an invertible activation function, and the BN scale must satisfy γ ≠ 0
• Leaky ReLU is invertible
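Leaky ReLU's invertibility is easy to see in code: for negative outputs the inverse simply divides by the (nonzero) slope, and for non-negative outputs it is the identity.

```python
def leaky_relu(y, slope=0.01):
    """phi(y): identity for y >= 0, scaled by slope for y < 0."""
    return y if y >= 0 else slope * y

def leaky_relu_inv(z, slope=0.01):
    """Exact inverse phi^{-1}(z), valid because slope != 0 makes phi bijective."""
    return z if z >= 0 else z / slope
```

With slope = 0 (plain ReLU), all negative inputs collapse to 0 and the inverse no longer exists — hence the Leaky ReLU requirement.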
5: In-Place ABN II (Proposed)
Strategies Comparison

Strategy                   | Store          | Computation Overhead
Standard                   | x, z, σ_B, μ_B | –
Checkpointing              | x, σ_B, μ_B    | BN_{γ,β}, φ
Checkpointing (proposed)   | x, σ_B         | π_{γ,β}, φ
In-Place ABN I (proposed)  | z, σ_B         | φ^{-1}, π_{γ,β}^{-1}
In-Place ABN II (proposed) | z, σ_B         | φ^{-1}
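The key step behind In-Place ABN I in the table above — recovering the normalized activation x̂ from the stored output z via x̂ = π_{γ,β}^{-1}(φ^{-1}(z)) — can be sketched for a scalar. `inplace_abn_recover` is an illustrative name, not the paper's API; γ ≠ 0 is required.

```python
def leaky_relu(y, slope=0.01):
    return y if y >= 0 else slope * y

def leaky_relu_inv(z, slope=0.01):
    return z if z >= 0 else z / slope

def inplace_abn_recover(z, gamma, beta, slope=0.01):
    """Recover x_hat from the stored output z:
    x_hat = pi^{-1}(phi^{-1}(z)) = (phi^{-1}(z) - beta) / gamma, with gamma != 0."""
    return (leaky_relu_inv(z, slope) - beta) / gamma
```

Because x̂ can be rebuilt on the fly during the backward pass, the buffer holding x can be freed (or overwritten in place) right after the forward pass.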
In-Place ABN (Proposed)
Strengths:
• Reduces the memory requirement by half compared to standard; same savings as checkpointing
• Empirically faster than naive checkpointing
• Encapsulating BN & activation together makes it easy to implement and deploy (plug & play)
Weaknesses:
• Requires an invertible activation function
• …but still slower than the standard (memory-hungry) implementation
Experiments: Overview
• 3 major types:
  • Performance on: (1) Image Classification, (2) Semantic Segmentation
  • (3) Timing analysis compared to standard/checkpointing
• Experiment setup:
  • NVIDIA Titan Xp (12 GB RAM/GPU)
  • PyTorch
  • Leaky ReLU activation
Experiments: Image Classification
ResNeXt-101 / ResNeXt-152:
• Dataset: ImageNet-1k
• Description: bottleneck residual units are replaced with a multi-branch version ("cardinality" of 64)
• Data augmentation: scale smallest side to 256 pixels, then randomly crop 224×224; per-channel mean and variance normalization
• Optimizer: SGD with Nesterov updates; initial learning rate = 0.1, weight decay = 10^-4, momentum = 0.9; 90 epochs, reduce the learning rate by a factor of 10 every 30 epochs
WideResNet-38:
• Dataset: ImageNet-1k
• Description: more feature channels but shallower
• Data augmentation: same as ResNeXt-101/152
• Optimizer: same as ResNeXt, except over the 90 epochs the learning rate linearly decreases from 0.1 to 10^-6
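The two learning-rate schedules described above can be sketched as simple per-epoch functions; the function names are illustrative, not from the paper's training code.

```python
def step_lr(epoch, base_lr=0.1):
    """ResNeXt schedule: divide the learning rate by 10 every 30 epochs."""
    return base_lr * 0.1 ** (epoch // 30)

def linear_lr(epoch, base_lr=0.1, final_lr=1e-6, total=90):
    """WideResNet-38 schedule: linear decay from base_lr to final_lr over 90 epochs."""
    return base_lr + (final_lr - base_lr) * epoch / total
```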
Experiments: Leaky ReLU Impact
• Using Leaky ReLU performs slightly worse than ReLU
• Within ~1%, except for the 320² center crop; the authors argued this was due to non-deterministic training behaviour
• Weakness: showing an average ± standard deviation would be more convincing of the improvements
Experiments: Exploiting Memory Savings
• Baseline vs.: 1) larger batch size, 2) deeper network, 3) larger network, 4) synchronized BN
• Performance increase for 1-3
• Similar performance with a larger batch size vs. a deeper model (1 vs. 2)
• Synchronized INPLACE-ABN did not increase the performance that much
• Notes on synchronized BN: http://hangzh.com/PyTorch-Encoding/notes/syncbn.html
Experiments: Semantic Segmentation
• Semantic segmentation: assign categorical labels to each pixel in an image
• Datasets:
  • CityScapes
  • COCO-Stuff
  • Mapillary Vistas
Figure credit: https://www.cityscapes-dataset.com/examples/
Experiments: Semantic Segmentation
• Architecture contains 2 parts that are jointly fine-tuned on segmentation data:
  • Body: classification model pre-trained on ImageNet
  • Head: segmentation-specific architecture
• Authors used DeepLabV3* as the head
  • Cascaded atrous (dilated) convolutions for capturing contextual info
  • Crop-level features encoding global context
• Maximize GPU usage by:
  • (FIXED CROP) fixing the training crop size and therefore pushing the number of crops per minibatch to the limit
  • (FIXED BATCH) fixing the number of crops per minibatch and maximizing the training crop resolution
*L. Chen, G. Papandreou, F. Schroff, and H. Adam. "Rethinking atrous convolution for semantic image segmentation." ArXiv link
Experiments: Semantic Segmentation
• More training data (FIXED CROP) helps a little
• Higher input resolution (FIXED BATCH) helps even more than adding more crops
• No qualitative results shown: probably visually similar to DeepLabV3
Experiments: Semantic Segmentation (Fine-Tuned on CityScapes and Mapillary Vistas)
• The combination of INPLACE-ABN sync with larger crop sizes improves by ≈0.9% over the best-performing setting in Table 3
• Class-uniform sampling: sample class-uniformly from eligible image candidates, making sure to take training crops from areas containing the class of interest
Experiments: Semantic Segmentation
• Currently state of the art on CityScapes for both IoU (class) and iIoU (instance) metrics
• iIoU: weights the contribution of each pixel by the ratio of the class's average instance size to the size of the respective ground-truth instance
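A simplified sketch of the iIoU weighting described above; this illustrates the idea (TP and FN pixel counts weighted by class-average instance size over the pixel's own instance size, FP unweighted) and is not the Cityscapes benchmark's reference implementation.

```python
def instance_weights(instance_sizes):
    """Per-instance weight: class-average instance size / this instance's size.
    Small instances get weight > 1, large instances weight < 1."""
    avg = sum(instance_sizes) / len(instance_sizes)
    return [avg / s for s in instance_sizes]

def iiou(weighted_tp, fp, weighted_fn):
    """Instance-weighted IoU: TP and FN counts are pre-weighted per instance,
    while FP pixels (which belong to no ground-truth instance) stay unweighted."""
    return weighted_tp / (weighted_tp + fp + weighted_fn)
```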
Experiments: Timing Analyses
• They isolated a single BN+ACT+CONV block & evaluated the computational time required for a forward and backward pass
• Result: narrowed the gap between standard vs. checkpointing by half
• Ensured a fair comparison by re-implementing checkpointing in PyTorch
Future Directions
• Apply INPLACE-ABN in other…
  • Architectures: DenseNet, Squeeze-and-Excitation Networks, Deformable Convolutional Networks
  • Problem domains: object detection, instance-specific segmentation, 3D data learning
• Combine INPLACE-ABN with other memory reduction techniques, e.g., mixed precision training
• Apply the same in-place idea to 'newer' batch norms, e.g., Batch Renormalization*
*S. Ioffe. "Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models." ArXiv link
Links and References
• INPLACE-ABN paper: https://arxiv.org/pdf/1712.02616.pdf
• Official GitHub code (PyTorch): https://github.com/mapillary/inplace_abn
• CityScapes dataset: https://www.cityscapes-dataset.com/benchmarks/#scene-labeling-task
• Reduced precision:
  • BinaryConnect: https://arxiv.org/abs/1511.00363
  • Binarized Networks: https://arxiv.org/abs/1602.02830
  • Mixed Precision Training: https://arxiv.org/abs/1710.03740
• Tradeoff with computation time:
  • Checkpointing: https://www.cs.utoronto.ca/~jmartens/docs/HF_book_chapter.pdf
  • Recursive Checkpointing: https://arxiv.org/abs/1604.06174
  • Reversible Networks: https://arxiv.org/abs/1707.04585