Accelerating Deep Learning with MVAPICH
Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. Panda
Network-Based Computing Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
OSU Booth Talk (SC'17)
Agenda
• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe
DL Frameworks and Trends
• Caffe, TensorFlow, CNTK, and many more...
• Most frameworks are exploiting GPUs to accelerate training
• Diverse applications – Image Recognition, Cancer Detection, Self-Driving Cars, Speech Processing, etc.
https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/
GPUs are great for Deep Learning
• NVIDIA GPUs have been the main driving force for faster training of Deep Neural Networks (DNNs)
  – The ImageNet Challenge (ILSVRC)
  – 90% of the ImageNet teams used GPUs in 2014*
  – DL models like AlexNet, GoogLeNet, and VGG
  – A natural fit for DL due to the throughput-oriented nature of GPUs
  – GPUs are also growing in the HPC arena!
*https://blogs.nvidia.com/blog/2014/09/07/imagenet/
https://www.top500.org/statistics/list/
And CPUs are catching up fast
• Intel CPUs are everywhere, and many-core CPUs are emerging according to Top500.org
• Host CPUs exist even on the GPU nodes
  – Many-core Xeon Phis are increasing
• The Xeon Phi 1st generation was a co-processor
  – Unlike the Xeon Phi 2nd generation, which is a self-hosted processor!
• Usually, we hear CPUs are 10x–100x slower than GPUs [1-3] – but can we do better?
[Figure: System count for Xeon Phi on the Top500 list]
[1] https://dl.acm.org/citation.cfm?id=1993516
[2] http://ieeexplore.ieee.org/abstract/document/5762730/
[3] https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1
https://www.top500.org/statistics/list/
What to use for scale-out? (Distributed training of Neural Nets)
• What is the Message Passing Interface (MPI)?
  – A de-facto standard for expressing distributed-memory parallel programs
  – Used for communication between processes in multi-process applications
• MVAPICH2 is a high-performance implementation of the MPI standard
• What can MPI do for Deep Learning?
  – MPI has been used for large-scale scientific applications
  – Deep Learning can also exploit MPI to perform high-performance communication
• Why do I need communication in Deep Learning?
  – If you use one GPU or one CPU, you do not need communication
  – But one GPU or CPU is not enough!
  – DL wants as many compute elements as it can get!
  – MPI is a great fit – Broadcast, Reduce, and Allreduce are what most DL workloads require (see the sketch below)
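As a concrete illustration, here is a minimal sketch of how data-parallel training can average gradients across ranks with MPI_Allreduce. The buffer size and host-side averaging are illustrative assumptions, not part of any specific framework; with a CUDA-aware MPI such as MVAPICH2-GDR, a GPU buffer could be passed directly.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nparams = 1 << 20;               /* e.g., 1M model parameters */
    float *grads = malloc(nparams * sizeof(float));
    /* ... each rank fills grads[] from its local mini-batch backward pass ... */

    /* Sum gradients across all ranks, in place on every rank. */
    MPI_Allreduce(MPI_IN_PLACE, grads, nparams, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
    for (int i = 0; i < nparams; i++)
        grads[i] /= size;                       /* average before the SGD update */

    free(grads);
    MPI_Finalize();
    return 0;
}
```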
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI+PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking)
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 15th: 241,108-core Pleiades at NASA
    • 20th: 462,462-core Stampede at TACC
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
  – To Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)
Deep Learning Frameworks – CPUs or GPUs?
• There are several Deep Learning (DL) or DNN training frameworks – Caffe, Cognitive Toolkit, TensorFlow, MXNet, and counting...
• Every (almost every) framework has been optimized for NVIDIA GPUs – cuBLAS and cuDNN have led to significant performance gains!
• But every framework is able to execute on a CPU as well
  – So why are we not using them?
  – Performance has been "terrible," and several studies have reported significant degradation when using CPUs (see nvidia.qwiklab.com)
• But there is hope, actually a lot of great progress here!
  – MKL-DNN, just like cuDNN, has definitely rekindled this!!
  – Coupled with Intel Xeon Phi (Knights Landing or KNL) and MCDRAM, the landscape for CPU-based DL looks promising.
The Key Question!
How to efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources like GPUs and Xeon Phi(s)?
Research Challenges
Let us bring HPC and DL "together"!
• Computation and communication characteristics of DL workloads?
• Various datasets and networks handled differently in DL frameworks
• Possible strategies to evaluate the performance of DL frameworks
• Performance trends that can be observed for a single node
• Scale-out of DNN training for CPU-based and GPU-based training
• Performance behavior for hardware features
Agenda
• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe
Caffe Architecture
[Figure: Caffe's multi-GPU training loop. Each iteration: (1) Data Propagation – model parameters are broadcast from GPU 0 to all GPUs (Bcast(GPU0) over packed_comm_buff); (2) Forward/Backward Pass – each GPU runs the forward (F) and backward (B) passes over layers L1..Ln on its local data; (3) Gradient Aggregation – gradients are reduced onto GPU 0 via packed_reduce_buff (Reduce(GPU0)), updates are applied, and the loop repeats.]
http://hidl.cse.ohio-state.edu
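A minimal sketch of this loop in MPI terms, assuming a CUDA-aware MPI (e.g., MVAPICH2-GDR) so device buffers can be passed directly; forward_backward() and apply_updates() are hypothetical placeholders for the framework's compute steps, not real Caffe APIs.

```c
#include <mpi.h>

/* Hypothetical framework hooks (placeholders, not real Caffe APIs). */
extern void forward_backward(float *d_params, float *d_grads);
extern void apply_updates(float *d_params, float *d_grads);

void train_loop(float *d_params, float *d_grads, int nparams, int iters) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < iters; i++) {
        /* 1. Data propagation: rank 0 (GPU 0) broadcasts the packed parameters. */
        MPI_Bcast(d_params, nparams, MPI_FLOAT, 0, MPI_COMM_WORLD);
        /* 2. Forward/backward pass over layers L1..Ln on the local mini-batch. */
        forward_backward(d_params, d_grads);
        /* 3. Gradient aggregation: reduce packed gradients onto rank 0. */
        void *sendbuf = (rank == 0) ? MPI_IN_PLACE : (void *)d_grads;
        MPI_Reduce(sendbuf, d_grads, nparams, MPI_FLOAT, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        /* Apply updates on the root; the next Bcast redistributes them. */
        if (rank == 0)
            apply_updates(d_params, d_grads);
    }
}
```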
OSU-Caffe: Co-design to Tackle New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers
• Existing state-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication (see the sketch below)
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?
[Figure: Scale-up performance vs. scale-out performance. cuDNN, cuBLAS, and NCCL rank high on scale-up; MPI, gRPC, and Hadoop rank high on scale-out; the proposed co-designs target both.]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).
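One way to picture the overlap goal, as a hedged sketch rather than the actual S-Caffe design: start a non-blocking reduction on each layer's gradients as soon as its backward step finishes, so communication for later layers proceeds while earlier layers are still computing. layer_backward() is a hypothetical placeholder.

```c
#include <mpi.h>

#define MAX_LAYERS 64  /* assumes nlayers <= MAX_LAYERS */

/* Hypothetical per-layer backward step (placeholder, not a real API). */
extern void layer_backward(int layer, float *d_grads);

void backward_with_overlap(float *d_grads[], const int counts[], int nlayers) {
    MPI_Request reqs[MAX_LAYERS];
    /* The backward pass runs from the last layer down to the first. */
    for (int l = nlayers - 1; l >= 0; l--) {
        layer_backward(l, d_grads[l]);
        /* Reduce this layer's gradients while earlier layers keep computing. */
        MPI_Iallreduce(MPI_IN_PLACE, d_grads[l], counts[l], MPI_FLOAT,
                       MPI_SUM, MPI_COMM_WORLD, &reqs[nlayers - 1 - l]);
    }
    MPI_Waitall(nlayers, reqs, MPI_STATUSES_IGNORE);
}
```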
MVAPICH2-GDR: Scale-out for GPU-based Distributed Training
[Figure: OSU micro-benchmarks comparing MV2 (no GDR) with MV2-GDR 2.3a for small and medium messages. GPU-GPU inter-node latency drops to 1.88 us (11X better); inter-node bandwidth improves by ~9x and bi-directional bandwidth by ~10x.]
Platform: MVAPICH2-GDR 2.3a; Intel Haswell (E5-2687W) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct RDMA.
MVAPICH2-GDR: performance that meets Deep Learning requirements!
OSU-Caffe 0.9: Scalable Deep Learning on GPU Clusters
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enable scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
[Figure: GoogLeNet (ImageNet) training time (seconds) vs. number of GPUs (8-128) for default Caffe, OSU-Caffe with batch size 1024, and OSU-Caffe with batch size 2048; default Caffe beyond a single node is marked as an invalid use case.]
OSU-Caffe 0.9 is available from the HiDL site.
Efficient Broadcast for MVAPICH2-GDR using NVIDIA NCCL
• NCCL has some limitations
  – Only works within a single node; thus, no scale-out on multiple nodes
  – Degradation across IOH (socket) for scale-up (within a node)
• We propose an optimized MPI_Bcast
  – Communication of very large GPU buffers (order of megabytes)
  – Scale-out on a large number of dense multi-GPU nodes
• Hierarchical communication that efficiently exploits (see the sketch below):
  – CUDA-Aware MPI_Bcast in MV2-GDR (across nodes)
  – The NCCL broadcast primitive (within a node)
[Figure: OSU micro-benchmark latency (log scale) for MV2-GDR vs. MV2-GDR-Opt across message sizes from 1 byte to 128 MB, showing up to 100x improvement.]
Performance benefits: Microsoft CNTK DL framework, 25% avg. improvement on 2-64 GPUs.
A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, "Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning," The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up].
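A hedged sketch of what such a hierarchical broadcast could look like; this is an illustration under stated assumptions, not the MVAPICH2-GDR implementation. It assumes leader_comm groups one leader rank per node, node_comm groups the ranks within a node (leader is node rank 0 and NCCL rank 0), and a CUDA-aware MPI so device pointers can be passed to MPI_Bcast.

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

void hierarchical_bcast(float *d_buf, size_t count,
                        MPI_Comm leader_comm, MPI_Comm node_comm,
                        ncclComm_t nccl_comm, cudaStream_t stream) {
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Step 1: inter-node broadcast among the per-node leader ranks
     * (GPU buffers passed directly via CUDA-aware MPI). */
    if (node_rank == 0)
        MPI_Bcast(d_buf, (int)count, MPI_FLOAT, 0, leader_comm);

    /* Step 2: intra-node broadcast from each node's leader over
     * NVLink/PCIe using NCCL's broadcast primitive. */
    ncclBcast(d_buf, count, ncclFloat, 0, nccl_comm, stream);
    cudaStreamSynchronize(stream);
}
```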
Pure MPI Large Message Broadcast
• MPI_Bcast: design and performance tuning for DL workloads
  – Design ring-based algorithms for large messages (see the sketch below)
  – Harness a multitude of algorithms and techniques for the best performance across the full range of message sizes and process/GPU counts
• Performance benefits
  – Performance comparable to or better than NCCL-augmented approaches for large messages
  – Up to 10X improvement for small/medium message sizes with micro-benchmarks
  – Up to 7% improvement for VGG training
[Figure: MPI_Bcast micro-benchmark on 128 GPUs (8 nodes) – latency (ms, log scale) for MV2-GDR-NCCL vs. MV2-GDR-Opt from 1 byte to 128 MB; VGG training time with CNTK on 2-128 GPUs.]
A. A. Awan, C.-H. Chu, H. Subramoni, and D. K. Panda, "Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?," arXiv '17 (https://arxiv.org/abs/1707.09414).
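To make the idea concrete, here is a minimal, hedged sketch of a chunked ring broadcast of the kind such designs build on; the real MVAPICH2-GDR algorithms are considerably more sophisticated (pipelining depth, topology awareness, GPU staging), and the chunk size here is an arbitrary illustrative choice.

```c
#include <mpi.h>

/* Stream a large buffer around a ring in fixed-size chunks: the root
 * injects each chunk, every other rank receives it from its predecessor
 * and forwards it, so chunks pipeline across the ring. */
void ring_bcast(float *buf, int count, int root, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (size == 1) return;

    const int chunk = 1 << 18;                /* 256K floats (1 MB) per chunk */
    int prev = (rank - 1 + size) % size;
    int next = (rank + 1) % size;

    for (int off = 0; off < count; off += chunk) {
        int n = (count - off < chunk) ? count - off : chunk;
        /* Every rank except the root receives from its predecessor. */
        if (rank != root)
            MPI_Recv(buf + off, n, MPI_FLOAT, prev, 0, comm,
                     MPI_STATUS_IGNORE);
        /* Forward unless the successor is the root (ring is complete). */
        if (next != root)
            MPI_Send(buf + off, n, MPI_FLOAT, next, 0, comm);
    }
}
```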
Large Message Allreduce: MVAPICH2-GDR vs. Baidu-allreduce
• Performance gains for MVAPICH2-GDR 2.3a* compared to Baidu-allreduce
[Figure: Allreduce latency (us, log scale) vs. message size on 8 GPUs (4 nodes), Baidu-allreduce vs. MVAPICH2-GDR; annotations show ~30X better latency and an ~11% improvement.]
*Available with MVAPICH2-GDR 2.3a
Large Message Optimized Collectives for Deep Learning
• MVAPICH2-GDR provides optimized collectives for large message sizes (a timing sketch follows below)
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available in MVAPICH2-GDR 2.2 and higher
[Figure: Latency (ms) across 2-128 MB message sizes for Reduce on 192 GPUs, Allreduce on 64 GPUs, and Bcast on 64 GPUs; and latency vs. GPU count for Reduce (64 MB, 128-192 GPUs), Allreduce (128 MB, 16-64 GPUs), and Bcast (128 MB, 16-64 GPUs).]
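For context, a minimal sketch of an OSU-benchmark-style timing loop for a large-message Allreduce, using host buffers for simplicity; the numbers plotted above come from the actual OSU micro-benchmarks, not from this sketch.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t count = 128 * 1024 * 1024 / sizeof(float); /* 128 MB buffer */
    const int warmup = 5, iters = 20;
    float *buf = calloc(count, sizeof(float));

    /* Warm-up iterations are excluded from the timing. */
    for (int i = 0; i < warmup; i++)
        MPI_Allreduce(MPI_IN_PLACE, buf, (int)count, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(MPI_IN_PLACE, buf, (int)count, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg 128 MB Allreduce latency: %.2f ms\n",
               (t1 - t0) / iters * 1e3);

    free(buf);
    MPI_Finalize();
    return 0;
}
```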
Agenda
• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe
Understanding the Impact of Execution Environments
• Performance depends on many factors
• Hardware architectures
  – GPUs
  – Multi-/many-core CPUs
• Software libraries: cuDNN (for GPUs), MKL-DNN/MKL 2017 (for CPUs)
• Hardware and software co-design
  – Software libraries optimized for one platform will not help the other!
  – cuDNN vs. MKL-DNN
[Figure: The DL execution stack – DL applications (image recognition, speech processing, etc.) run on DL frameworks (Caffe, TensorFlow, etc.), which call BLAS libraries: MKL 2017 (MKL-optimized convolution layer) on multi-/many-core Xeon and Xeon Phi; cuDNN/cuBLAS (cuDNN-optimized convolution layer) on many-core GPUs (Pascal P100); or other BLAS libraries such as OpenBLAS and ATLAS (generic convolution layer) on other processors.]
A. A. Awan, H. Subramoni, and D. K. Panda, "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures," 3rd Workshop on Machine Learning in High Performance Computing Environments (MLHPC), held in conjunction with SC17, Nov 2017.
Impact of the MKL engine and MCDRAM for Intel-Caffe
• We use MCDRAM as cache for all of the subsequent results
• On average, DDR-All is up to 1.5X slower than MCDRAM
• The MKL engine is up to 3X better than the default Caffe engine
• Biggest gains for the Intel Xeon Phi (many-core) architecture
• Both Haswell and Broadwell architectures get significant speedups (up to 1.5X)
[Figure: Training time (ms), split into forward and backward, across CPU architectures and across memory configurations (DDR-All, MCDRAM-All, MCDRAM as Cache).]
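MCDRAM-as-cache needs no code changes; in flat mode, high-bandwidth memory must be requested explicitly. As a hedged sketch, assuming the memkind library's hbwmalloc API is available on the KNL node (and not implying Intel-Caffe works this way internally):

```c
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory allocator */
#include <stdlib.h>

/* Allocate a buffer from MCDRAM when flat-mode HBM is present,
 * falling back to ordinary DDR otherwise. */
float *alloc_hbm_buffer(size_t n) {
    float *p = NULL;
    if (hbw_check_available() == 0)          /* 0 => HBM nodes exist */
        p = (float *)hbw_malloc(n * sizeof(float));
    if (p == NULL)                           /* fall back to DDR */
        p = (float *)malloc(n * sizeof(float));
    return p;
}
```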
The Full Landscape for AlexNet Training
• Convolutions in the forward and backward pass
• Faster convolutions → faster training
• Most performance gains are based on conv2 and conv3
[Figure: Per-layer time (ms) for conv1-conv5 in the forward pass and the backward pass.]
Multi-node Results: ResNet-50
• All results are weak scaling
  – The batch size remains constant per solver but increases overall by batch_size × #nodes or batch_size × #gpus (a worked example follows below)
• Images/second is a derived metric, but it is more meaningful for understanding scalability
• Efficiency is another story [1]
  – Larger DNN architectures → less scalability due to communication overhead
[Figure: ResNet-50 with Intel-Caffe – training time (seconds) and images/second vs. number of nodes (2-32).]
1. Experiences of Scaling TensorFlow On Up to 512 Nodes On CORI Supercomputer, Intel HPC Dev. Con., https://www.intel.com/content/www/us/en/events/hpcdevcon/overview.html
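A small, hedged sketch of how the weak-scaling batch size and the derived images/second metric relate; all numbers here are illustrative, not measurements from the figure.

```c
#include <stdio.h>

int main(void) {
    int batch_per_solver = 32;   /* constant per solver under weak scaling */
    int num_nodes = 32;          /* one solver per node in this example   */
    double iter_time_s = 2.0;    /* hypothetical time per training iteration */

    /* Overall batch grows with the node count: 32 * 32 = 1024 images. */
    int global_batch = batch_per_solver * num_nodes;
    /* Derived throughput: images processed per second of wall time. */
    double images_per_sec = global_batch / iter_time_s;   /* 1024 / 2 = 512 */

    printf("global batch = %d, throughput = %.0f images/s\n",
           global_batch, images_per_sec);
    return 0;
}
```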
Summary
• Deep Learning is on the rise
  – Rapid advances in software and hardware and the availability of large datasets are driving it
• A single node or a single GPU is not enough for Deep Learning workloads
• We need to focus on distributed Deep Learning, but there are many challenges
• MPI offers a great abstraction for communication in DL training tasks
• A co-design of Deep Learning frameworks and communication runtimes will be required to make DNN training scalable
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning (HiDL) Project: http://hidl.cse.ohio-state.edu/
http://web.cse.ohio-state.edu/~awan.10
Please join us for other events at SC'17
• Workshops
  – ESPM2 2017: Third International Workshop on Extreme Scale Programming Models and Middleware
• Tutorials
  – InfiniBand, Omni-Path, and High-Speed Ethernet for Dummies
  – InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage
• BoFs
  – MPICH BoF: MVAPICH2 Project: Latest Status and Future Plans
• ACM SRC Posters
  – Co-designing MPI Runtimes and Deep Learning Frameworks for Scalable Distributed Training on GPU Clusters
  – High-Performance and Scalable Broadcast Schemes for Deep Learning on GPU Clusters
• Booth Talks
  – The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
  – Exploiting Latest Networking and Accelerator Technologies for MPI, Streaming, and Deep Learning: An MVAPICH2-Based Approach
  – Accelerating Deep Learning with MVAPICH
  – MVAPICH2-GDR Library: Pushing the Frontier of HPC and Deep Learning
Please refer to http://mvapich.cse.ohio-state.edu/talks/ for more details.