Transcript of: In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder
Mapillary Research
Paper: https://arxiv.org/abs/1712.02616
Code: https://github.com/mapillary/inplace_abn
CSC2548, 2018 Winter. Harris Chan, Jan 31, 2018
Overview
• Motivation for Efficient Memory Management
• Related Works
  • Reducing Precision
  • Checkpointing
  • Reversible Networks [9] (Gomez et al., 2017)
• In-Place Activated Batch Normalization
  • Review: Batch Normalization
  • In-Place Activated Batch Normalization
• Experiments
• Future Directions
Why Reduce Memory Usage?
• Modern computer vision recognition models use deep neural networks to extract features
• Depth/width of networks ~ GPU memory requirements
• Semantic segmentation: may even fit only a single crop per GPU during training due to suboptimal memory management
• More efficient memory usage during training lets you:
  • Train larger models
  • Use bigger batch sizes / image resolutions
• This paper focuses on increasing the memory efficiency of the training process of deep network architectures, at the expense of a small amount of additional computation time
Approaches to Reducing Memory
Reduce memory by…
• Reducing precision (& accuracy)
• Increasing computation time
Related Works: Reducing Precision

Work                                                    | Weights                                                    | Activations              | Gradients
BinaryConnect (M. Courbariaux et al., 2015)             | Binary                                                     | Full Precision           | Full Precision
Binarized neural networks (I. Hubara et al., 2016)      | Binary                                                     | Binary                   | Full Precision
Quantized neural networks (I. Hubara et al.)            | Quantized (2, 4, 6 bits)                                   | Quantized (2, 4, 6 bits) | Full Precision
Mixed precision training (P. Micikevicius et al., 2017) | Half Precision (fwd/bwd) & Full Precision (master weights) | Half Precision           | Half Precision
Related Works: Reducing Precision
• Idea: during training, lower the precision (down to binary) of the weights/activations/gradients

Strengths:
• Reduces the memory requirement and the size of the model
• Less power: efficient forward pass
• Faster: 1-bit XNOR-count vs. 32-bit floating-point multiply
Weaknesses:
• Often a decrease in accuracy (newer work attempts to address this)
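The BinaryConnect row in the table above can be illustrated with a minimal sketch: binary weights are used in the forward (and backward) pass, while a full-precision "master" copy is kept for the parameter update. The names `binarize` and `master_w` are illustrative, not from the paper's code.

```python
import numpy as np

def binarize(w):
    """Deterministic binarization as in BinaryConnect: sign(w), with 0 mapped to +1."""
    return np.where(w >= 0, 1.0, -1.0)

# Full-precision master weights are retained for the gradient update step;
# the binarized copy is what the forward pass actually multiplies by.
master_w = np.array([0.3, -0.7, 0.05, -0.01])
w_b = binarize(master_w)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = x @ w_b  # forward pass uses the binary weights
```

The update step (not shown) would apply the gradient to `master_w`, then re-binarize for the next iteration.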
Related Works: Computation Time
• Checkpointing: trade off memory with computation time
• Idea: during backpropagation, store a subset of activations ("checkpoints") and recompute the remaining activations as needed
• Depending on the architecture, we can use different strategies to decide which subset of activations to store
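The checkpointing idea can be sketched in plain Python, assuming a chain of L layers and the O(√L) strategy of storing every ⌈√L⌉-th activation; `layer` here is a stand-in for an arbitrary per-layer computation.

```python
import math

def layer(i, x):
    # stand-in for layer i's forward computation
    return x + i

def forward_with_checkpoints(x0, L):
    """Run L layers, storing only every ceil(sqrt(L))-th activation ("checkpoints")."""
    stride = math.ceil(math.sqrt(L))
    checkpoints = {0: x0}
    x = x0
    for i in range(1, L + 1):
        x = layer(i, x)
        if i % stride == 0:
            checkpoints[i] = x
    return x, checkpoints, stride

def recompute_activation(checkpoints, stride, j):
    """During backprop, rebuild activation j from the nearest earlier checkpoint."""
    start = (j // stride) * stride
    x = checkpoints[start]
    for i in range(start + 1, j + 1):
        x = layer(i, x)
    return x
```

Each activation is recomputed from a checkpoint at most ⌈√L⌉ layers away, so the extra compute stays O(L) overall while only O(√L) activations are ever held in memory.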
Related Works: Computation Time
• Let L be the number of identical feed-forward layers:

Work                                           | Spatial Complexity | Computational Complexity
Naive                                          | O(L)               | O(L)
Checkpointing (Martens and Sutskever, 2012)    | O(√L)              | O(L)
Recursive Checkpointing (T. Chen et al., 2016) | O(log L)           | O(L log L)
Reversible Networks (Gomez et al., 2017)       | O(1)               | O(L)

Table adapted from Gomez et al., 2017, "The Reversible Residual Network: Backpropagation Without Storing Activations". ArXiv link
Related Works: Computation Time
Reversible ResNet (Gomez et al., 2017)
• Basic residual function: y = x + F(x)
• RevNet (forward): split the input into (x1, x2), then
  y1 = x1 + F(x2)
  y2 = x2 + G(y1)
• RevNet (backward): reconstruct the inputs from the outputs,
  x2 = y2 − G(y1)
  x1 = y1 − F(x2)
• Idea: the reversible residual module allows the current layer's activations to be reconstructed exactly from the next layer's. No need to store any activations for backpropagation!
Gomez et al., 2017, "The Reversible Residual Network: Backpropagation Without Storing Activations". ArXiv link
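A minimal numeric sketch of the reversible coupling, with scalar inputs and hypothetical residual functions F and G; reversibility comes from the coupling structure, not from any property of F or G.

```python
# Any functions work here; the additive coupling makes the block invertible.
def F(x):
    return 2 * x

def G(x):
    return x + 1

def rev_forward(x1, x2):
    """RevNet forward pass on a split input (x1, x2)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_backward(y1, y2):
    """Reconstruct the inputs exactly from the outputs: no stored activations."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2
```

Because the inverse subtracts the same F and G terms the forward pass added, the reconstruction is exact (up to floating-point error in real networks).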
Related Works: Computation Time
Reversible ResNet (Gomez et al., 2017)
Advantages:
• No noticeable loss in performance
• Gains in network depth: ~600 vs. ~100 layers
• 4x increase in batch size (128 vs. 32)
Disadvantages:
• Runtime cost: 1.5x of normal training (sometimes less in practice)
• Reversible blocks are restricted to a stride of 1 so as not to discard information (i.e., no bottleneck layer)
Gomez et al., 2017, "The Reversible Residual Network: Backpropagation Without Storing Activations". ArXiv link
Review: Batch Normalization (BN)
• Apply BN to the current features (x_i) across the mini-batch
• Helps reduce internal covariate shift & accelerates the training process
• Less sensitive to initialization
Credit: Ioffe & Szegedy, 2015. ArXiv link
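The BN forward pass can be sketched with NumPy: normalize each feature over the mini-batch, then apply the learnable scale and shift. `batch_norm_forward` is an illustrative name, and `eps` is the usual numerical-stability constant.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """BN over the mini-batch dimension (axis 0): normalize, then scale and shift."""
    mu = x.mean(axis=0)                    # batch mean mu_B
    var = x.var(axis=0)                    # batch variance sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized features
    z = gamma * x_hat + beta               # learnable scale (gamma) and shift (beta)
    return z, x_hat, mu, var
```

With gamma = 1 and beta = 0, the output has approximately zero mean and unit variance per feature, regardless of the input statistics.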
Memory Optimization Strategies
• Let's compare the various strategies for BN + Act:
  1. Standard
  2. Checkpointing (baseline)
  3. Checkpointing (proposed)
  4. In-Place Activated Batch Normalization I
  5. In-Place Activated Batch Normalization II
1: Standard BN Implementation
Gradients for Batch Normalization
Credit: Ioffe & Szegedy, 2015, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ArXiv link
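A sketch of the BN backward pass following the Ioffe & Szegedy formulas. Note that the gradients require x (equivalently x̂), which is why the standard implementation must keep x in memory — the very cost In-Place ABN targets.

```python
import numpy as np

def batch_norm_backward(dz, x, gamma, mu, var, eps=1e-5):
    """Gradients of BN w.r.t. input, gamma, and beta (Ioffe & Szegedy, 2015).
    Needs x (or x_hat): this is the buffer the standard implementation stores."""
    m = x.shape[0]
    std = np.sqrt(var + eps)
    x_hat = (x - mu) / std

    dgamma = (dz * x_hat).sum(axis=0)
    dbeta = dz.sum(axis=0)

    dx_hat = dz * gamma
    dvar = (dx_hat * (x - mu)).sum(axis=0) * -0.5 * std**-3
    dmu = -dx_hat.sum(axis=0) / std + dvar * (-2.0 / m) * (x - mu).sum(axis=0)
    dx = dx_hat / std + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```

A quick sanity check: if the upstream gradient is constant, the input gradient vanishes, because the sum of BN outputs depends only on beta, not on x.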
2: Checkpointing (baseline)
3: Checkpointing (Proposed)
In-Place ABN
• Fuse the batch norm and activation layers to enable in-place computation, using only a single memory buffer to store results
• Encapsulation makes it easy to implement and deploy
• Implemented the INPLACE-ABN-I layer in PyTorch as a new module
4: In-Place ABN I (Proposed)
• Requires an invertible activation function, and the BN scale must satisfy γ ≠ 0
• Leaky ReLU is invertible
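Leaky ReLU's invertibility is easy to see in code: for negative outputs the inverse simply divides by the (nonzero) slope, and for non-negative outputs it is the identity.

```python
def leaky_relu(y, slope=0.01):
    """phi(y): identity for y >= 0, scaled by slope for y < 0."""
    return y if y >= 0 else slope * y

def leaky_relu_inv(z, slope=0.01):
    """Exact inverse phi^{-1}(z), valid because slope != 0 makes phi bijective."""
    return z if z >= 0 else z / slope
```

With slope = 0 (plain ReLU), all negative inputs collapse to 0 and the inverse no longer exists — hence the Leaky ReLU requirement.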
5: In-Place ABN II (Proposed)
Strategies Comparison

Strategy                   | Store          | Computation Overhead
Standard                   | x, z, σ_B, μ_B | –
Checkpointing              | x, σ_B, μ_B    | BN_{γ,β}, φ
Checkpointing (proposed)   | x, σ_B         | π_{γ,β}, φ
In-Place ABN I (proposed)  | z, σ_B         | φ^{-1}, π_{γ,β}^{-1}
In-Place ABN II (proposed) | z, σ_B         | φ^{-1}
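The key step behind In-Place ABN I in the table above — recovering the normalized activation x̂ from the stored output z via x̂ = π_{γ,β}^{-1}(φ^{-1}(z)) — can be sketched for a scalar. `inplace_abn_recover` is an illustrative name, not the paper's API; γ ≠ 0 is required.

```python
def leaky_relu(y, slope=0.01):
    return y if y >= 0 else slope * y

def leaky_relu_inv(z, slope=0.01):
    return z if z >= 0 else z / slope

def inplace_abn_recover(z, gamma, beta, slope=0.01):
    """Recover x_hat from the stored output z:
    x_hat = pi^{-1}(phi^{-1}(z)) = (phi^{-1}(z) - beta) / gamma, with gamma != 0."""
    return (leaky_relu_inv(z, slope) - beta) / gamma
```

Because x̂ can be rebuilt on the fly during the backward pass, the buffer holding x can be freed (or overwritten in place) right after the forward pass.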
In-Place ABN (Proposed)
Strengths:
• Reduces the memory requirement by half compared to standard; same savings as checkpointing
• Empirically faster than naive checkpointing
• Encapsulating BN & activation together makes it easy to implement and deploy (plug & play)
Weaknesses:
• Requires an invertible activation function
• …but still slower than the standard (memory-hungry) implementation
Experiments: Overview
• 3 major types:
  • Performance on: (1) Image Classification, (2) Semantic Segmentation
  • (3) Timing analysis compared to standard/checkpointing
• Experiment setup:
  • NVIDIA Titan Xp (12 GB RAM/GPU)
  • PyTorch
  • Leaky ReLU activation
Experiments: Image Classification
ResNeXt-101 / ResNeXt-152:
• Dataset: ImageNet-1k
• Description: bottleneck residual units are replaced with a multi-branch version ("cardinality" of 64)
• Data augmentation: scale smallest side to 256 pixels, then randomly crop 224×224; per-channel mean and variance normalization
• Optimizer: SGD with Nesterov updates; initial learning rate = 0.1, weight decay = 10^-4, momentum = 0.9; 90 epochs, reduce the learning rate by a factor of 10 every 30 epochs
WideResNet-38:
• Dataset: ImageNet-1k
• Description: more feature channels but shallower
• Data augmentation: same as ResNeXt-101/152
• Optimizer: same as ResNeXt, except over the 90 epochs the learning rate linearly decreases from 0.1 to 10^-6
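The two learning-rate schedules described above can be sketched as simple per-epoch functions; the function names are illustrative, not from the paper's training code.

```python
def step_lr(epoch, base_lr=0.1):
    """ResNeXt schedule: divide the learning rate by 10 every 30 epochs."""
    return base_lr * 0.1 ** (epoch // 30)

def linear_lr(epoch, base_lr=0.1, final_lr=1e-6, total=90):
    """WideResNet-38 schedule: linear decay from base_lr to final_lr over 90 epochs."""
    return base_lr + (final_lr - base_lr) * epoch / total
```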
Experiments: Leaky ReLU Impact
• Using Leaky ReLU performs slightly worse than ReLU
• Within ~1%, except for the 320² center crop; the authors argued this was due to non-deterministic training behaviour
• Weakness: showing an average ± standard deviation would be more convincing of the improvements
Experiments: Exploiting Memory Savings
• Baseline vs.: 1) larger batch size, 2) deeper network, 3) larger network, 4) synchronized BN
• Performance increase for 1-3
• Similar performance with a larger batch size vs. a deeper model (1 vs. 2)
• Synchronized INPLACE-ABN did not increase the performance that much
• Notes on synchronized BN: http://hangzh.com/PyTorch-Encoding/notes/syncbn.html
Experiments: Semantic Segmentation
• Semantic segmentation: assign categorical labels to each pixel in an image
• Datasets:
  • CityScapes
  • COCO-Stuff
  • Mapillary Vistas
Figure credit: https://www.cityscapes-dataset.com/examples/
Experiments: Semantic Segmentation
• Architecture contains 2 parts that are jointly fine-tuned on segmentation data:
  • Body: classification model pre-trained on ImageNet
  • Head: segmentation-specific architecture
• Authors used DeepLabV3* as the head
  • Cascaded atrous (dilated) convolutions for capturing contextual info
  • Crop-level features encoding global context
• Maximize GPU usage by:
  • (FIXED CROP) fixing the training crop size and therefore pushing the number of crops per minibatch to the limit
  • (FIXED BATCH) fixing the number of crops per minibatch and maximizing the training crop resolution
*L. Chen, G. Papandreou, F. Schroff, and H. Adam. "Rethinking atrous convolution for semantic image segmentation." ArXiv link
Experiments: Semantic Segmentation
• More training data (FIXED CROP) helps a little
• Higher input resolution (FIXED BATCH) helps even more than adding more crops
• No qualitative results shown: probably visually similar to DeepLabV3
Experiments: Semantic Segmentation (Fine-Tuned on CityScapes and Mapillary Vistas)
• The combination of INPLACE-ABN sync with larger crop sizes improves by ≈0.9% over the best-performing setting in Table 3
• Class-uniform sampling: sample class-uniformly from eligible image candidates, making sure to take training crops from areas containing the class of interest
Experiments: Semantic Segmentation
• Currently state of the art on CityScapes for both IoU (class) and iIoU (instance) metrics
• iIoU: weights the contribution of each pixel by the ratio of the class's average instance size to the size of the respective ground-truth instance
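A simplified sketch of the iIoU weighting described above; this illustrates the idea (TP and FN pixel counts weighted by class-average instance size over the pixel's own instance size, FP unweighted) and is not the Cityscapes benchmark's reference implementation.

```python
def instance_weights(instance_sizes):
    """Per-instance weight: class-average instance size / this instance's size.
    Small instances get weight > 1, large instances weight < 1."""
    avg = sum(instance_sizes) / len(instance_sizes)
    return [avg / s for s in instance_sizes]

def iiou(weighted_tp, fp, weighted_fn):
    """Instance-weighted IoU: TP and FN counts are pre-weighted per instance,
    while FP pixels (which belong to no ground-truth instance) stay unweighted."""
    return weighted_tp / (weighted_tp + fp + weighted_fn)
```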
Experiments: Timing Analyses
• They isolated a single BN+ACT+CONV block & evaluated the computational time required for a forward and backward pass
• Result: narrowed the gap between standard vs. checkpointing by half
• Ensured a fair comparison by re-implementing checkpointing in PyTorch
Future Directions
• Apply INPLACE-ABN in other…
  • Architectures: DenseNet, Squeeze-and-Excitation Networks, Deformable Convolutional Networks
  • Problem domains: object detection, instance-specific segmentation, 3D data learning
• Combine INPLACE-ABN with other memory reduction techniques, e.g., mixed precision training
• Apply the same in-place idea to 'newer' batch norms, e.g., Batch Renormalization*
*S. Ioffe. "Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models." ArXiv link
Links and References
• INPLACE-ABN paper: https://arxiv.org/pdf/1712.02616.pdf
• Official GitHub code (PyTorch): https://github.com/mapillary/inplace_abn
• CityScapes dataset: https://www.cityscapes-dataset.com/benchmarks/#scene-labeling-task
• Reduced precision:
  • BinaryConnect: https://arxiv.org/abs/1511.00363
  • Binarized Networks: https://arxiv.org/abs/1602.02830
  • Mixed Precision Training: https://arxiv.org/abs/1710.03740
• Tradeoff with computation time:
  • Checkpointing: https://www.cs.utoronto.ca/~jmartens/docs/HF_book_chapter.pdf
  • Recursive Checkpointing: https://arxiv.org/abs/1604.06174
  • Reversible Networks: https://arxiv.org/abs/1707.04585