Transcript of "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures"

Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. Panda
Network Based Computing Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
[email protected], {subramon, panda}@cse.ohio-state.edu
Presentation at MLHPC '17
MLHPC '17 | Network Based Computing Laboratory | High-Performance Deep Learning
• Introduction
  – CPU-based Deep Learning
  – Deep Learning Frameworks
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion

CPU-based Deep Learning is not as bad as you think!
GPUs are great for Deep Learning
• NVIDIA GPUs have been the main driving force for faster training of Deep Neural Networks (DNNs)
• The ImageNet Challenge (ILSVRC): 90% of the ImageNet teams used GPUs in 2014*
  – DL models like AlexNet, GoogLeNet, and VGG
  – GPUs: a natural fit for DL due to their throughput-oriented nature
  – GPUs are also growing in the HPC arena!

* https://blogs.nvidia.com/blog/2014/09/07/imagenet/
https://www.top500.org/
But what about CPUs?
• Intel CPUs are everywhere, and many-core CPUs are emerging according to Top500.org
• Host CPUs exist even on the GPU nodes, and many-core Xeon Phis are increasing
  – Xeon Phi 1st generation: a many-core co-processor
  – Xeon Phi 2nd generation (KNL): a self-hosted many-core processor!
• Usually, we hear CPUs are 10x-100x slower than GPUs [1-3]. But can we do better?

[Figure: system count for Xeon Phi, https://www.top500.org/statistics/list/]

1. https://dl.acm.org/citation.cfm?id=1993516
2. http://ieeexplore.ieee.org/abstract/document/5762730/
3. https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1
Deep Learning Frameworks – CPUs or GPUs?
• There are several Deep Learning (DL) or DNN training frameworks – Caffe, Cognitive Toolkit, TensorFlow, MXNet, and counting...
• Every (almost every) framework has been optimized for NVIDIA GPUs – cuBLAS and cuDNN have led to significant performance gains!
• But every framework is able to execute on a CPU as well – so why are we not using them?
  – Performance has been "terrible", and several studies have reported significant degradation when using CPUs (see nvidia.qwiklab.com)
• But there is hope :-) – MKL-DNN, just like cuDNN, has definitely rekindled this!!
  – Coupled with Intel Xeon Phi (Knights Landing or KNL) and MCDRAM, the landscape for CPU-based DL looks promising...
The DL Framework(s) in discussion: Caffe and friends
• Caffe is a popular and widely used framework; it has many forks (friends)
• NVIDIA-Caffe and BVLC-Caffe (the official Caffe) are almost similar – NVIDIA-Caffe is cutting edge though! (Tensor Cores, Volta, Drive PX, etc.)
• Intel-Caffe is optimized for CPU-based Deep Learning
• OSU-Caffe is a multi-node multi-GPU variant that we have worked on at OSU

Caffe Variant | Multi-GPU Support | Multi-node Support | Multi-node Communication
BVLC-Caffe    | Yes               | No                 | N/A
NVIDIA-Caffe  | Yes               | No                 | N/A
Intel-Caffe   | N/A               | Yes                | Intel MLSL 2017.1.016 (with Intel MPI 2017)
OSU-Caffe     | Yes               | Yes                | MVAPICH2-GDR 2.2
Agenda
• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion
The Key Question!
Can we provide a holistic yet comprehensive view of DNN training performance for a diverse set of hardware architectures, including Intel Xeon Phi (KNL) processors and NVIDIA Pascal GPUs?
Research Challenges
Let us bring HPC and DL "together"!
• Computation and communication characteristics of DL workloads
• Various datasets and networks handled differently in DL frameworks
• Possible strategies to evaluate the performance of DL frameworks
• Performance trends that can be observed for a single node
• Scale-out of DNN training for CPU-based and GPU-based DNN training
• Performance behavior for hardware features like MCDRAM
Agenda
• Introduction
• Research Challenges
• Design Discussion
  – Caffe Architecture
  – Understanding the Impact of Execution Environments
  – Multi-node Training: Intel-Caffe, OSU-Caffe, and MPI
• Performance Characterization
• Conclusion
Caffe Architecture
[Diagram: Caffe's multi-GPU training loop, Loop { 1. Data Propagation: the packed_comm_buff of Params is broadcast (Bcast) from GPU 0 to GPUs 1-3; 2. Forward/Backward Pass: each GPU runs the forward (F) and backward (B) passes over layers L1..Ln to produce Gradients; 3. Gradient Aggregation: each GPU's packed_reduce_buff is reduced (Reduce) to GPU 0, which applies the updates }]
http://hidl.cse.ohio-state.edu
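The loop in the diagram can be summarized in a few lines of code. Below is a minimal single-process sketch of the three phases (broadcast the packed parameters, forward/backward per device, reduce gradients and apply the update), with a toy linear least-squares model standing in for the DNN; the name train_step and the shard layout are illustrative, not Caffe's actual API:

```python
import numpy as np

def train_step(params, worker_batches, lr=0.01):
    """One iteration of Caffe-style data-parallel training (simplified sketch).
    1. Data propagation: broadcast a copy of the packed parameters to each worker.
    2. Forward/backward: each worker computes gradients on its own batch.
    3. Gradient aggregation: reduce the per-worker gradients, apply one update.
    The "model" here is linear least squares, so gradients are analytic."""
    # 1. Bcast: every worker receives an identical copy of the packed buffer
    local_params = [params.copy() for _ in worker_batches]
    # 2. Forward/backward pass on each worker's shard
    grads = []
    for p, (X, y) in zip(local_params, worker_batches):
        pred = X @ p                              # forward pass
        grads.append(X.T @ (pred - y) / len(y))   # backward pass (MSE gradient)
    # 3. Reduce: aggregate the per-worker buffers, then apply updates once
    total_grad = np.sum(grads, axis=0) / len(grads)
    return params - lr * total_grad

# Toy run: 4 "GPUs", each holding a quarter of a synthetic regression dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
shards = [(X[i::4], y[i::4]) for i in range(4)]
w = np.zeros(3)
for _ in range(500):
    w = train_step(w, shards, lr=0.1)
```

With noise-free targets, repeated calls to train_step drive w to the true weights; that is the essence of the loop above: each iteration applies exactly one update computed from the aggregated gradients, so the data-parallel run matches single-device SGD on the combined batch.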
Understanding the Impact of Execution Environments
Performance is dependent on:
1. Hardware architectures
   – GPUs
   – Multi-/many-core CPUs
2. Software libraries
   – cuDNN (for GPUs)
   – MKL-DNN / MKL 2017 (for CPUs)
3. Hardware/software co-design
   – Software libraries optimized for one platform will not help the other!
   – cuDNN vs. MKL-DNN

[Diagram: the execution stack. DL applications (image recognition, speech processing, etc.) run on DL frameworks (Caffe, TensorFlow, etc.), which call a generic, MKL-optimized, or cuDNN-optimized convolution layer; these sit on BLAS libraries (MKL 2017, cuDNN/cuBLAS, OpenBLAS, ATLAS, other BLAS libraries), which in turn run on the hardware (multi-/many-core Xeon and Xeon Phi, many-core GPUs such as the Pascal P100, and other processors).]
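The stack above boils down to a per-layer dispatch: the framework selects whichever convolution implementation matches the hardware, because a library tuned for one platform does not help the other. A hypothetical sketch of that selection logic (the function name, labels, and preference order are illustrative, not taken from any real framework):

```python
def pick_conv_engine(hardware, available):
    """Choose a convolution engine for a layer, mirroring the stack above:
    GPUs prefer cuDNN, Intel Xeon / Xeon Phi prefer MKL-DNN with a BLAS
    fallback, and anything else falls through to the generic implementation."""
    prefs = {
        "gpu": ["cudnn", "generic"],
        "xeon": ["mkldnn", "openblas", "generic"],
        "xeon_phi": ["mkldnn", "openblas", "generic"],
    }
    for engine in prefs.get(hardware, ["generic"]):
        # "generic" is always buildable; optimized engines need their library
        if engine == "generic" or engine in available:
            return engine
    return "generic"
```

The point of the sketch is the asymmetry: installing cuDNN changes nothing for the "xeon" path, and installing MKL-DNN changes nothing for the "gpu" path, which is exactly why the paper characterizes the two stacks separately.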
Intel-Caffe and Intel MKL
• MKL-DNN: the key performance difference for CPU-based DNN training!
• Does that really work in practice?
• Intel MKL claims to offer much better performance
• Intel MLSL promises multi-node training

[Figure: multi-node scaling using Intel Omni-Path on AlexNet. Courtesy: http://www.techenablement.com/accelerating-python-deep-learning/]
So what to use for Scale-out with Intel-Caffe?
• We need a communication library for scale-out
  – Message Passing Interface (MPI) libraries like MVAPICH, Intel MPI, etc.
  – NVIDIA NCCL, Facebook Gloo, Baidu-allreduce, etc.
  – Intel Machine Learning Scaling Library (MLSL, a higher-level library built on top of MPI)
• How to choose?
  – For GPU-based frameworks: CUDA-Aware MPI, NCCL, and Gloo
  – For CPU-based frameworks: any MPI library will do
    • MLSL offers something more
• MLSL is sort of a DL framework API that can be used inside the framework
  – But it can be used in a stand-alone format too!
OSU-Caffe: Co-design to Tackle New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers
• State-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance, for small and medium message sizes only!
• Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: scale-up vs. scale-out performance landscape. cuDNN, cuBLAS, and NCCL rank high on scale-up but low on scale-out; MPI, gRPC, and Hadoop rank high on scale-out but low on scale-up; the proposed co-designs target both.]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).
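The large-message reductions mentioned above are commonly implemented as a chunked ring allreduce (the pattern popularized by Baidu-allreduce and NCCL). The following is a single-process simulation of the ring over P equal-length gradient buffers; it sketches the algorithm itself, not MVAPICH2's implementation:

```python
import numpy as np

def ring_allreduce(buffers):
    """Ring allreduce over P gradient buffers, simulated in one process.
    Each buffer is split into P chunks. P-1 reduce-scatter steps leave each
    rank owning the full sum of one chunk; P-1 allgather steps then circulate
    those reduced chunks until every rank holds the complete sum."""
    P = len(buffers)
    chunks = [np.array_split(np.asarray(b, dtype=float), P) for b in buffers]
    # Reduce-scatter: at step s, rank r "sends" chunk (r - s) % P to rank r+1,
    # which accumulates it. Snapshot the sent chunks to model simultaneity.
    for s in range(P - 1):
        sent = [chunks[r][(r - s) % P].copy() for r in range(P)]
        for r in range(P):
            chunks[r][(r - 1 - s) % P] += sent[(r - 1) % P]
    # Allgather: at step s, rank r forwards chunk (r + 1 - s) % P; the receiver
    # overwrites its own stale copy with the fully reduced one.
    for s in range(P - 1):
        sent = [chunks[r][(r + 1 - s) % P].copy() for r in range(P)]
        for r in range(P):
            chunks[r][(r - s) % P] = sent[(r - 1) % P]
    return [np.concatenate(chunks[r]) for r in range(P)]
```

Splitting the message into P chunks is what makes the pattern attractive for megabyte-scale gradients: each of the 2(P-1) steps moves only 1/P of the buffer, so every link in the ring stays busy and bandwidth use is balanced, rather than funneling the whole message through one root as a naive reduce would.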
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking):
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 15th: 241,108-core Pleiades at NASA
    • 20th: 462,462-core Stampede at TACC
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
  – To Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)
Scale-out for GPU-based Training
[Figure: OSU micro-benchmarks comparing MV2-(NO-GDR) and MV2-GDR-2.3a across message sizes. GPU-GPU inter-node latency: 1.88 us, up to 11X better; GPU-GPU inter-node bandwidth: up to 9X; GPU-GPU inter-node bi-bandwidth: up to 10X.]
Platform: MVAPICH2-GDR-2.3a; Intel Haswell (E5-2687W) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct-RDMA.
MVAPICH2-GDR: performance that meets Deep Learning requirements!
Agenda
• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
  – Single-node Performance
  – Multi-node Performance
• Conclusion
Performance Characterization
• Several GPU generations and CPU architectures
• Single-node results for AlexNet and ResNet-50
  – Impact of the MKL engine
  – Impact of MCDRAM
  – Layer-wise breakdown
  – P100 vs. KNL
• Multi-node results using Intel-Caffe and OSU-Caffe
  – Weak scaling
  – ResNet-50 and AlexNet
Performance Characterization: Various Architectures

Name (Label) | Processor Architecture (Description)     | No. of Cores    | No. of Sockets
Haswell1     | Intel [email protected]                   | 20 (2*10)       | 2
Haswell2     | Intel [email protected]                   | 20 (2*10)       | 2
Broadwell    | Intel Xeon CPU [email protected]          | 28 (2*14)       | 2
KNL          | Intel Xeon [email protected]              | 68 (1*68)       | 1
K40          | NVIDIA Tesla K40 11.8 GB @ 0.75 GHz      | 2880 CUDA cores | N/A
K80          | NVIDIA Tesla K80 11.8 GB @ 0.82 GHz      | 2496 CUDA cores | N/A
P100         | NVIDIA Tesla P100-PCIE 16 GB @ 1.33 GHz  | 3584 CUDA cores | N/A
Single-node: Impact of the MKL engine in Intel-Caffe
• Comparison of the optimized MKL engine and the default Caffe engine
• The MKL engine is up to 3X better than the default Caffe engine
• Biggest gains on the Intel Xeon Phi (many-core) architecture
• Both the Haswell and Broadwell architectures get significant speedups (up to 1.5X)

[Figure: training time (ms) across CPU architectures]
Single-node: Impact of Utilizing MCDRAM
• "MCDRAM as Cache" and "MCDRAM-All" offer very similar performance
• We chose to use MCDRAM as Cache for all the subsequent results
• On average, DDR-All is up to 1.5X slower than MCDRAM

[Figure: forward and backward training time (ms) for the memory configurations DDR-All, MCDRAM-All, and MCDRAM as Cache]
Diving Deeper: Layer-wise Breakdown
• The full landscape for AlexNet: forward and backward pass
• Faster convolutions -> faster training
• Most performance gains come from conv2 and conv3 for AlexNet

[Figure: two charts of per-layer time (ms) for conv1-conv5, one for the forward pass and one for the backward pass]
Diving Deeper: P100 vs. KNL (AlexNet)
• Fully connected layers are much slower on KNL compared to P100
• conv1 and conv3 also contribute to the degradation on KNL
• conv2 is faster on KNL compared to P100
• ResNet-50 has some surprises (not shown on this slide)
  – KNL performs significantly better than P100
  – Difficult to visualize, as there are several layers in ResNet-50

[Figure: per-layer time (ms) for conv1-conv5, fc6, and fc7 on P100 vs. KNL-Opt]
Multi-node Results: ResNet-50
• All results are weak scaling
  – The batch size remains constant per solver
  – But the overall batch size increases by: batch-size * (#nodes or #gpus)
• Images/second is a derived metric, but it is more meaningful for understanding scalability
• Efficiency is another story [1]
  – Larger DNN architectures -> less scalability due to communication overhead

[Figure: ResNet-50 with Intel-Caffe. Training time (seconds) and images/second for 2, 4, 8, 16, 20, and 32 nodes.]

1. Experiences of Scaling TensorFlow On Up to 512 Nodes On CORI Supercomputer, Intel HPC Dev. Con., https://www.intel.com/content/www/us/en/events/hpcdevcon/overview.html
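The weak-scaling setup in the bullets above can be made concrete with a tiny helper: the per-solver batch is fixed, the global batch grows with the node count, and images/second is derived from the global batch and the time per iteration. The name weak_scaling_metrics and the numbers in the example are illustrative, not taken from the slide:

```python
def weak_scaling_metrics(per_solver_batch, num_solvers, iter_time_s):
    """Weak scaling: each solver keeps its own batch size constant, so the
    effective (global) batch grows as batch-size * (#nodes or #gpus).
    Images/second is then a derived metric: global batch / iteration time."""
    global_batch = per_solver_batch * num_solvers
    images_per_sec = global_batch / iter_time_s
    return global_batch, images_per_sec

# e.g. 32 images per solver on 16 nodes, 2 s per iteration
gb, ips = weak_scaling_metrics(32, 16, 2.0)
```

This is why images/second is the more meaningful scalability metric here: under weak scaling, a flat time-per-iteration curve already implies linearly growing throughput, and any drop below linear exposes the communication overhead the slide attributes to larger DNN architectures.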
Multi-node Results: AlexNet Comparison
• OSU-Caffe vs. Intel-Caffe
  – Different frameworks, so not directly comparable
  – A rough comparison can still help in understanding scalability trends
  – The design of the framework can affect performance for distributed training
    • MPI (or the communication runtime) can cause a marked difference

[Figure: images per second and training time (seconds) for 1, 2, 4, 8, 16, 20, and 32 nodes, OSU-Caffe (GPU) vs. Intel-Caffe (CPU)]
Agenda
• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion
Conclusion
• CPU is very comparable to GPU for DNN training workloads if appropriate optimizations are exploited
• GPUs are still faster than CPUs in general
• KNL beats P100 for one case, but P100 beats KNL for most cases
• Evaluating the performance of a DL framework:
  – The hardware architecture matters
  – But the software stack has a higher and more significant impact than the hardware
  – The full execution environment and communication runtime need to be evaluated to ensure fairness in comparisons
Future Work
• Evaluate with upcoming architectures
  – Volta GPUs
  – DGX-1V system
  – Intel Nervana Neural Network Processor
• Verify the hypothesis using other DL frameworks
  – TensorFlow
  – Intel Neon
  – Nervana Graph
• Investigate new designs with MVAPICH2 and other MPI stacks to support faster DNN training
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
http://web.cse.ohio-state.edu/~awan.10
Please join us for other events at SC '17
• Workshops
  – ESPM2 2017: Third International Workshop on Extreme Scale Programming Models and Middleware
• Tutorials
  – InfiniBand, Omni-Path, and High-Speed Ethernet for Dummies
  – InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage
• BoFs
  – MPICH BoF: MVAPICH2 Project: Latest Status and Future Plans
• ACM SRC Posters
  – Co-designing MPI Runtimes and Deep Learning Frameworks for Scalable Distributed Training on GPU Clusters
  – High-Performance and Scalable Broadcast Schemes for Deep Learning on GPU Clusters
• Booth Talks
  – The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
  – Exploiting Latest Networking and Accelerator Technologies for MPI, Streaming, and Deep Learning: An MVAPICH2-Based Approach
  – Accelerating Deep Learning with MVAPICH
  – MVAPICH2-GDR Library: Pushing the Frontier of HPC and Deep Learning

Please refer to http://mvapich.cse.ohio-state.edu/talks/ for more details