Transcript of "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures"

Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. Panda
Network Based Computing Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
[email protected], {subramon, panda}@cse.ohio-state.edu
Presentation at MLHPC '17
MLHPC '17 | Network Based Computing Laboratory | High-Performance Deep Learning
• Introduction
  – CPU-based Deep Learning
  – Deep Learning Frameworks
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion

CPU-based Deep Learning is not as bad as you think!
GPUs are great for Deep Learning
• NVIDIA GPUs have been the main driving force for faster training of Deep Neural Networks (DNNs)
• The ImageNet Challenge (ILSVRC): 90% of the ImageNet teams used GPUs in 2014*
  – DL models like AlexNet, GoogLeNet, and VGG
  – GPUs: a natural fit for DL due to their throughput-oriented nature
  – GPUs are also growing in the HPC arena!

* https://blogs.nvidia.com/blog/2014/09/07/imagenet/
https://www.top500.org/
But what about CPUs?
• Intel CPUs are everywhere, and many-core CPUs are emerging according to Top500.org
• Host CPUs exist even on the GPU nodes, and many-core Xeon Phis are increasing
  – Xeon Phi 1st generation: a many-core co-processor
  – Xeon Phi 2nd generation (KNL): a self-hosted many-core processor!
• Usually, we hear CPUs are 10x-100x slower than GPUs [1-3]. But can we do better?

[Figure: system count for Xeon Phi, https://www.top500.org/statistics/list/]

1. https://dl.acm.org/citation.cfm?id=1993516
2. http://ieeexplore.ieee.org/abstract/document/5762730/
3. https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1
Deep Learning Frameworks – CPUs or GPUs?
• There are several Deep Learning (DL) or DNN training frameworks – Caffe, Cognitive Toolkit, TensorFlow, MXNet, and counting...
• Every (almost every) framework has been optimized for NVIDIA GPUs – cuBLAS and cuDNN have led to significant performance gains!
• But every framework is able to execute on a CPU as well – so why are we not using them?
  – Performance has been "terrible", and several studies have reported significant degradation when using CPUs (see nvidia.qwiklab.com)
• But there is hope :-) – MKL-DNN, just like cuDNN, has definitely rekindled this!!
  – Coupled with Intel Xeon Phi (Knights Landing or KNL) and MCDRAM, the landscape for CPU-based DL looks promising...
The DL Framework(s) in discussion: Caffe and friends
• Caffe is a popular and widely used framework; it has many forks (friends)
• NVIDIA-Caffe and BVLC-Caffe (the official Caffe) are almost similar – NVIDIA-Caffe is cutting edge though! (Tensor Cores, Volta, Drive PX, etc.)
• Intel-Caffe is optimized for CPU-based Deep Learning
• OSU-Caffe is a multi-node multi-GPU variant that we have worked on at OSU

Caffe Variant | Multi-GPU Support | Multi-node Support | Multi-node Communication
BVLC-Caffe    | Yes               | No                 | N/A
NVIDIA-Caffe  | Yes               | No                 | N/A
Intel-Caffe   | N/A               | Yes                | Intel MLSL 2017.1.016 (with Intel MPI 2017)
OSU-Caffe     | Yes               | Yes                | MVAPICH2-GDR 2.2
Agenda
• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion
The Key Question!
Can we provide a holistic yet comprehensive view of DNN training performance for a diverse set of hardware architectures, including Intel Xeon Phi (KNL) processors and NVIDIA Pascal GPUs?
Research Challenges
Let us bring HPC and DL "together"!
• Computation and communication characteristics of DL workloads
• Various datasets and networks handled differently in DL frameworks
• Possible strategies to evaluate the performance of DL frameworks
• Performance trends that can be observed for a single node
• Scale-out of DNN training for CPU-based and GPU-based DNN training
• Performance behavior for hardware features like MCDRAM
Agenda
• Introduction
• Research Challenges
• Design Discussion
  – Caffe Architecture
  – Understanding the Impact of Execution Environments
  – Multi-node Training: Intel-Caffe, OSU-Caffe, and MPI
• Performance Characterization
• Conclusion
Caffe Architecture
[Diagram: Caffe's multi-GPU training loop, Loop { 1. Data Propagation: the packed_comm_buff of Params is broadcast (Bcast) from GPU 0 to GPUs 1-3; 2. Forward/Backward Pass: each GPU runs the forward (F) and backward (B) passes over layers L1..Ln to produce Gradients; 3. Gradient Aggregation: each GPU's packed_reduce_buff is reduced (Reduce) to GPU 0, which applies the updates }]
http://hidl.cse.ohio-state.edu
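The loop in the diagram can be summarized in a few lines of code. Below is a minimal single-process sketch of the three phases (broadcast the packed parameters, forward/backward per device, reduce gradients and apply the update), with a toy linear least-squares model standing in for the DNN; the name train_step and the shard layout are illustrative, not Caffe's actual API:

```python
import numpy as np

def train_step(params, worker_batches, lr=0.01):
    """One iteration of Caffe-style data-parallel training (simplified sketch).
    1. Data propagation: broadcast a copy of the packed parameters to each worker.
    2. Forward/backward: each worker computes gradients on its own batch.
    3. Gradient aggregation: reduce the per-worker gradients, apply one update.
    The "model" here is linear least squares, so gradients are analytic."""
    # 1. Bcast: every worker receives an identical copy of the packed buffer
    local_params = [params.copy() for _ in worker_batches]
    # 2. Forward/backward pass on each worker's shard
    grads = []
    for p, (X, y) in zip(local_params, worker_batches):
        pred = X @ p                              # forward pass
        grads.append(X.T @ (pred - y) / len(y))   # backward pass (MSE gradient)
    # 3. Reduce: aggregate the per-worker buffers, then apply updates once
    total_grad = np.sum(grads, axis=0) / len(grads)
    return params - lr * total_grad

# Toy run: 4 "GPUs", each holding a quarter of a synthetic regression dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
shards = [(X[i::4], y[i::4]) for i in range(4)]
w = np.zeros(3)
for _ in range(500):
    w = train_step(w, shards, lr=0.1)
```

With noise-free targets, repeated calls to train_step drive w to the true weights; that is the essence of the loop above: each iteration applies exactly one update computed from the aggregated gradients, so the data-parallel run matches single-device SGD on the combined batch.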
Understanding the Impact of Execution Environments
Performance is dependent on:
1. Hardware architectures
   – GPUs
   – Multi-/many-core CPUs
2. Software libraries
   – cuDNN (for GPUs)
   – MKL-DNN / MKL 2017 (for CPUs)
3. Hardware/software co-design
   – Software libraries optimized for one platform will not help the other!
   – cuDNN vs. MKL-DNN

[Diagram: the execution stack. DL applications (image recognition, speech processing, etc.) run on DL frameworks (Caffe, TensorFlow, etc.), which call a generic, MKL-optimized, or cuDNN-optimized convolution layer; these sit on BLAS libraries (MKL 2017, cuDNN/cuBLAS, OpenBLAS, ATLAS, other BLAS libraries), which in turn run on the hardware (multi-/many-core Xeon and Xeon Phi, many-core GPUs such as the Pascal P100, and other processors).]
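The stack above boils down to a per-layer dispatch: the framework selects whichever convolution implementation matches the hardware, because a library tuned for one platform does not help the other. A hypothetical sketch of that selection logic (the function name, labels, and preference order are illustrative, not taken from any real framework):

```python
def pick_conv_engine(hardware, available):
    """Choose a convolution engine for a layer, mirroring the stack above:
    GPUs prefer cuDNN, Intel Xeon / Xeon Phi prefer MKL-DNN with a BLAS
    fallback, and anything else falls through to the generic implementation."""
    prefs = {
        "gpu": ["cudnn", "generic"],
        "xeon": ["mkldnn", "openblas", "generic"],
        "xeon_phi": ["mkldnn", "openblas", "generic"],
    }
    for engine in prefs.get(hardware, ["generic"]):
        # "generic" is always buildable; optimized engines need their library
        if engine == "generic" or engine in available:
            return engine
    return "generic"
```

The point of the sketch is the asymmetry: installing cuDNN changes nothing for the "xeon" path, and installing MKL-DNN changes nothing for the "gpu" path, which is exactly why the paper characterizes the two stacks separately.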
Intel-Caffe and Intel MKL
• MKL-DNN: the key performance difference for CPU-based DNN training!
• Does that really work in practice?
• Intel MKL claims to offer much better performance
• Intel MLSL promises multi-node training

[Figure: multi-node scaling using Intel Omni-Path on AlexNet. Courtesy: http://www.techenablement.com/accelerating-python-deep-learning/]
So what to use for Scale-out with Intel-Caffe?
• We need a communication library for scale-out
  – Message Passing Interface (MPI) libraries like MVAPICH, Intel MPI, etc.
  – NVIDIA NCCL, Facebook Gloo, Baidu-allreduce, etc.
  – Intel Machine Learning Scaling Library (MLSL, a higher-level library built on top of MPI)
• How to choose?
  – For GPU-based frameworks: CUDA-Aware MPI, NCCL, and Gloo
  – For CPU-based frameworks: any MPI library will do
    • MLSL offers something more
• MLSL is sort of a DL framework API that can be used inside the framework
  – But it can be used in a stand-alone format too!
OSU-Caffe: Co-design to Tackle New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers
• State-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance, for small and medium message sizes only!
• Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: scale-up vs. scale-out performance landscape. cuDNN, cuBLAS, and NCCL rank high on scale-up but low on scale-out; MPI, gRPC, and Hadoop rank high on scale-out but low on scale-up; the proposed co-designs target both.]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).
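The large-message reductions mentioned above are commonly implemented as a chunked ring allreduce (the pattern popularized by Baidu-allreduce and NCCL). The following is a single-process simulation of the ring over P equal-length gradient buffers; it sketches the algorithm itself, not MVAPICH2's implementation:

```python
import numpy as np

def ring_allreduce(buffers):
    """Ring allreduce over P gradient buffers, simulated in one process.
    Each buffer is split into P chunks. P-1 reduce-scatter steps leave each
    rank owning the full sum of one chunk; P-1 allgather steps then circulate
    those reduced chunks until every rank holds the complete sum."""
    P = len(buffers)
    chunks = [np.array_split(np.asarray(b, dtype=float), P) for b in buffers]
    # Reduce-scatter: at step s, rank r "sends" chunk (r - s) % P to rank r+1,
    # which accumulates it. Snapshot the sent chunks to model simultaneity.
    for s in range(P - 1):
        sent = [chunks[r][(r - s) % P].copy() for r in range(P)]
        for r in range(P):
            chunks[r][(r - 1 - s) % P] += sent[(r - 1) % P]
    # Allgather: at step s, rank r forwards chunk (r + 1 - s) % P; the receiver
    # overwrites its own stale copy with the fully reduced one.
    for s in range(P - 1):
        sent = [chunks[r][(r + 1 - s) % P].copy() for r in range(P)]
        for r in range(P):
            chunks[r][(r - s) % P] = sent[(r - 1) % P]
    return [np.concatenate(chunks[r]) for r in range(P)]
```

Splitting the message into P chunks is what makes the pattern attractive for megabyte-scale gradients: each of the 2(P-1) steps moves only 1/P of the buffer, so every link in the ring stays busy and bandwidth use is balanced, rather than funneling the whole message through one root as a naive reduce would.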
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking):
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 15th: 241,108-core Pleiades at NASA
    • 20th: 462,462-core Stampede at TACC
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
  – To Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)
Scale-out for GPU-based Training
[Figure: OSU micro-benchmarks comparing MV2-(NO-GDR) and MV2-GDR-2.3a across message sizes. GPU-GPU inter-node latency: 1.88 us, up to 11X better; GPU-GPU inter-node bandwidth: up to 9X; GPU-GPU inter-node bi-bandwidth: up to 10X.]
Platform: MVAPICH2-GDR-2.3a; Intel Haswell (E5-2687W) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct-RDMA.
MVAPICH2-GDR: performance that meets Deep Learning requirements!
Agenda
• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
  – Single-node Performance
  – Multi-node Performance
• Conclusion
Performance Characterization
• Several GPU generations and CPU architectures
• Single-node results for AlexNet and ResNet-50
  – Impact of the MKL engine
  – Impact of MCDRAM
  – Layer-wise breakdown
  – P100 vs. KNL
• Multi-node results using Intel-Caffe and OSU-Caffe
  – Weak scaling
  – ResNet-50 and AlexNet
Performance Characterization: Various Architectures

Name (Label) | Processor Architecture (Description)     | No. of Cores    | No. of Sockets
Haswell1     | Intel [email protected]                   | 20 (2*10)       | 2
Haswell2     | Intel [email protected]                   | 20 (2*10)       | 2
Broadwell    | Intel Xeon CPU [email protected]          | 28 (2*14)       | 2
KNL          | Intel Xeon [email protected]              | 68 (1*68)       | 1
K40          | NVIDIA Tesla K40 11.8 GB @ 0.75 GHz      | 2880 CUDA cores | N/A
K80          | NVIDIA Tesla K80 11.8 GB @ 0.82 GHz      | 2496 CUDA cores | N/A
P100         | NVIDIA Tesla P100-PCIE 16 GB @ 1.33 GHz  | 3584 CUDA cores | N/A
Single-node: Impact of the MKL engine in Intel-Caffe
• Comparison of the optimized MKL engine and the default Caffe engine
• The MKL engine is up to 3X better than the default Caffe engine
• Biggest gains on the Intel Xeon Phi (many-core) architecture
• Both the Haswell and Broadwell architectures get significant speedups (up to 1.5X)

[Figure: training time (ms) across CPU architectures]
Single-node: Impact of Utilizing MCDRAM
• "MCDRAM as Cache" and "MCDRAM-All" offer very similar performance
• We chose to use MCDRAM as Cache for all the subsequent results
• On average, DDR-All is up to 1.5X slower than MCDRAM

[Figure: forward and backward training time (ms) for the memory configurations DDR-All, MCDRAM-All, and MCDRAM as Cache]
Diving Deeper: Layer-wise Breakdown
• The full landscape for AlexNet: forward and backward pass
• Faster convolutions -> faster training
• Most performance gains come from conv2 and conv3 for AlexNet

[Figure: two charts of per-layer time (ms) for conv1-conv5, one for the forward pass and one for the backward pass]
Diving Deeper: P100 vs. KNL (AlexNet)
• Fully connected layers are much slower on KNL compared to P100
• conv1 and conv3 also contribute to the degradation on KNL
• conv2 is faster on KNL compared to P100
• ResNet-50 has some surprises (not shown on this slide)
  – KNL performs significantly better than P100
  – Difficult to visualize, as there are several layers in ResNet-50

[Figure: per-layer time (ms) for conv1-conv5, fc6, and fc7 on P100 vs. KNL-Opt]
Multi-node Results: ResNet-50
• All results are weak scaling
  – The batch size remains constant per solver
  – But the overall batch size increases by: batch-size * (#nodes or #gpus)
• Images/second is a derived metric, but it is more meaningful for understanding scalability
• Efficiency is another story [1]
  – Larger DNN architectures -> less scalability due to communication overhead

[Figure: ResNet-50 with Intel-Caffe. Training time (seconds) and images/second for 2, 4, 8, 16, 20, and 32 nodes.]

1. Experiences of Scaling TensorFlow On Up to 512 Nodes On CORI Supercomputer, Intel HPC Dev. Con., https://www.intel.com/content/www/us/en/events/hpcdevcon/overview.html
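The weak-scaling setup in the bullets above can be made concrete with a tiny helper: the per-solver batch is fixed, the global batch grows with the node count, and images/second is derived from the global batch and the time per iteration. The name weak_scaling_metrics and the numbers in the example are illustrative, not taken from the slide:

```python
def weak_scaling_metrics(per_solver_batch, num_solvers, iter_time_s):
    """Weak scaling: each solver keeps its own batch size constant, so the
    effective (global) batch grows as batch-size * (#nodes or #gpus).
    Images/second is then a derived metric: global batch / iteration time."""
    global_batch = per_solver_batch * num_solvers
    images_per_sec = global_batch / iter_time_s
    return global_batch, images_per_sec

# e.g. 32 images per solver on 16 nodes, 2 s per iteration
gb, ips = weak_scaling_metrics(32, 16, 2.0)
```

This is why images/second is the more meaningful scalability metric here: under weak scaling, a flat time-per-iteration curve already implies linearly growing throughput, and any drop below linear exposes the communication overhead the slide attributes to larger DNN architectures.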
Multi-node Results: AlexNet Comparison
• OSU-Caffe vs. Intel-Caffe
  – Different frameworks, so not directly comparable
  – A rough comparison can still help in understanding scalability trends
  – The design of the framework can affect performance for distributed training
    • MPI (or the communication runtime) can cause a marked difference

[Figure: images per second and training time (seconds) for 1, 2, 4, 8, 16, 20, and 32 nodes, OSU-Caffe (GPU) vs. Intel-Caffe (CPU)]
Agenda
• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion
Conclusion
• CPU is very comparable to GPU for DNN training workloads if appropriate optimizations are exploited
• GPUs are still faster than CPUs in general
• KNL beats P100 for one case, but P100 beats KNL for most cases
• Evaluating the performance of a DL framework:
  – The hardware architecture matters
  – But the software stack has a higher and more significant impact than the hardware
  – The full execution environment and communication runtime need to be evaluated to ensure fairness in comparisons
Future Work
• Evaluate with upcoming architectures
  – Volta GPUs
  – DGX-1V system
  – Intel Nervana Neural Network Processor
• Verify the hypothesis using other DL frameworks
  – TensorFlow
  – Intel Neon
  – Nervana Graph
• Investigate new designs with MVAPICH2 and other MPI stacks to support faster DNN training
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
http://web.cse.ohio-state.edu/~awan.10
Please join us for other events at SC '17
• Workshops
  – ESPM2 2017: Third International Workshop on Extreme Scale Programming Models and Middleware
• Tutorials
  – InfiniBand, Omni-Path, and High-Speed Ethernet for Dummies
  – InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage
• BoFs
  – MPICH BoF: MVAPICH2 Project: Latest Status and Future Plans
• ACM SRC Posters
  – Co-designing MPI Runtimes and Deep Learning Frameworks for Scalable Distributed Training on GPU Clusters
  – High-Performance and Scalable Broadcast Schemes for Deep Learning on GPU Clusters
• Booth Talks
  – The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
  – Exploiting Latest Networking and Accelerator Technologies for MPI, Streaming, and Deep Learning: An MVAPICH2-Based Approach
  – Accelerating Deep Learning with MVAPICH
  – MVAPICH2-GDR Library: Pushing the Frontier of HPC and Deep Learning

Please refer to http://mvapich.cse.ohio-state.edu/talks/ for more details