
NAS Technical Report NAS-2018-01

May 2018

Evaluating the Suitability of Commercial Clouds for NASA’s High Performance Computing Applications: A Trade Study S. Chang1, R. Hood1, H. Jin2, S. Heistand1, J. Chang1, S. Cheung1, J. Djomehri1, G. Jost3, D. Kokron1

NASA Advanced Supercomputing Division, NASA Ames Research Center

Executive Summary

NASA's High-End Computing Capability (HECC) Project is periodically asked if it could be more cost effective through the use of commercial cloud resources. To answer the question, HECC's Application Performance and Productivity (APP) team undertook a performance and cost evaluation comparing three domains: two commercial cloud providers, Amazon and Penguin, and HECC's in-house resources—the Pleiades and Electra systems.

In the study, the APP team used a combination of the NAS Parallel Benchmarks (NPB) and six full applications from NASA's workload on Pleiades and Electra to compare performance of nodes based on three different generations of Intel Xeon processors—Haswell, Broadwell, and Skylake. Because of export control limitations, the most heavily used applications on Pleiades and Electra could not be used in the cloud; therefore, only one of the applications, OpenFOAM, represents work from the Aeronautics Research Mission Directorate and the Human Exploration and Operations Mission Directorate. The other five applications are from the Science Mission Directorate.

In addition to gathering performance information, the APP team also calculated costs as of May 2018 for the runs. In the case of work done on Pleiades and Electra, it used the "full cost" of running, based on the Project's total annual budget (including all hardware, software, power, maintenance, staff, and facility costs, etc.). In the case of the commercial cloud providers, the team calculated only the compute cost of each run, using published rates and including publicly-known discounts as appropriate. Other infrastructure costs of running in the cloud—such as near-term storage, deep archive storage, network bandwidth, software licensing, and staffing costs for security and security monitoring, application porting and support, problem tracking and resolution, program management and support, and sustaining engineering—were not considered in this study. These "full cloud costs" are likely significant.

While hardware differences across the three domains make it difficult to compare performance in an apples-to-apples manner, the APP team can make some general observations nonetheless. All runs on HECC resources were faster, and sometimes significantly faster, than runs on the most closely matching Amazon Web Services (AWS) resources. The largest differences are most likely due to Amazon's lack of a true high-performance processor interconnect, which provides a communication fabric between cores on different compute nodes.

For some of the NPBs, Penguin On-Demand (POD) resources yielded faster runs; for others it had slower runs than HECC. Differences in the processor interconnects and communication library are likely the causes of the performance differences. For the application benchmarks, performance of resources at Penguin always lagged behind similar resources at HECC.

1 An employee of CSRA LLC, a General Dynamics Information Technology company.
2 Point of contact; please direct any correspondence about the work to [email protected].
3 An employee of Supersmith.

In all cases, the full cost of running on HECC resources was less than the lowest-possible compute-only cost of running on AWS. To run the full set of the NPBs, AWS was 5.8–12 times more expensive than HECC, depending on the processor type used. For the full-sized applications, AWS was in the best case 1.9 times more expensive.

The NPB runs at POD were 4.7 times more expensive than equivalent runs at HECC. The full-sized applications were 5.3 times more expensive.

Based on this analysis of current performance and cost data, this study finds:

Finding 1: Tightly-coupled, multi-node applications from the NASA workload take somewhat more time when run on cloud-based nodes connected with HPC-level interconnects; they take significantly more time when run on cloud-based nodes that use conventional, Ethernet-based interconnects.

Finding 2: The per-hour full cost of HECC resources is cheaper than the (compute-only) spot price of similar resources at AWS and significantly cheaper than the (compute-only) price of similar resources at POD.

Finding 3: Commercial clouds do not offer a viable, cost-effective approach for replacing in-house HPC resources at NASA.

While the study shows conclusively that it would not be cost effective to run the entire HECC workload on the commercial cloud, there may be cases where cloud resources would prove to be a cost-effective supplement to HECC resources. For example, it may make economic sense to use the cloud to provide access to resources that would be underutilized at HECC, such as GPU-accelerated nodes used for data analytics, physics-based modeling and simulation, or machine/deep learning. It also may make sense to offload some of the HECC workload that would run reasonably well on cloud resources—namely single-node jobs—in order to reduce wait times for remaining HECC jobs. In this case, the economic argument for running small jobs in the cloud is based on the opportunity cost of keeping those jobs at HECC. This analysis leads to the fourth finding:

Finding 4: Commercial clouds provide a variety of resources not available at HECC. Several use cases, such as machine learning, were identified that may be cost effective to run there.

As a result of this study, the APP team has identified and started work on three actions:

Action 1: Get a better understanding of the potential benefits and costs that might accrue from running a portion of the HECC workload in the cloud. This has two parts:

• Determine the scheduling impact to large jobs of running 1-node jobs in house.
• Understand the performance characteristics of jobs that might be run on the cloud.

Action 2: Define a comprehensive model that allows accurate comparisons of cost between HECC in-house jobs and jobs running in the cloud.

Action 3: Prepare for a broadening of services offered by HECC to include a portion of its workload running on commercial cloud resources. This has two parts:

• Conduct a pilot study, providing some HECC users with cloud access.
• Develop an approach where HECC can act as a broker for HPC use of cloud resources, including user support and an integrated accounting model.


1. Introduction

Background

After years of development, technologies for cloud computing have become mature, and clouds are being used in a variety of problem domains, including physics-based simulations. There are currently many commercial cloud providers that have the potential to host High Performance Computing (HPC) workloads, including Amazon Web Services (AWS), Penguin On-Demand (POD), Microsoft Azure, and Google Cloud. In addition, several other vendors such as Rescale have begun packaging and reselling cloud resources. To make their offerings attractive, the resellers usually provide a user interface for submitting jobs that simplifies the choice of destination for a job, including the possibility of using in-house computing resources.

The cloud providers and resellers have been aggressive about marketing their products to HPC consumers, usually with analyses of how much money can be saved by moving HPC workloads to the cloud. They make a compelling case for cost efficiency in situations where HPC demand is highly variable. In the case of an HPC facility with a uniformly high utilization rate, however, their argument is less clear and is worthy of investigation.

The Department of Energy undertook such an investigation in 2011 with their Magellan Project [1]. Their comprehensive study found that while clouds have certain features that are attractive for use cases requiring on-demand access to computing resources, the performance of cloud resources was severely lacking for typical HPC workloads with moderate to high levels of communication or I/O. They conclude: "Our detailed performance analysis, use case studies, and cost analysis shows that DOE HPC centers are significantly more cost-effective than public clouds for many scientific workloads, but this analysis should be reevaluated for other use cases and workloads."

The High-End Computing Capability (HECC) project's Application Performance and Productivity (APP) team has periodically evaluated the claims of cloud providers to be an effective substitute for HPC for NASA. In 2011, APP compared the performance of similar computational resources from the Nebula cloud at NASA Ames, AWS, and HECC's Pleiades system. The study [2, 3, 4] used a variety of benchmarks, including full applications from the Pleiades workload, and established that the performance of Nebula for HPC applications was worse than that of the HPC instances of Amazon EC2, which in turn was worse than that of Pleiades, particularly at higher core counts.

A major contributor to the poor performance of Nebula was virtualization of hardware resources. In addition, performance on both cloud platforms suffered from rudimentary processor interconnects that, while cost effective for conventional IT applications, are not sufficient for NASA's HPC workload. 90% or more of that workload is from physics-based Modeling and Simulation (M&S) applications that run on more than one node. By spreading out the work to cores on multiple nodes—potentially a thousand or more—the computationally intensive applications can be sped up to run in a reasonable amount of time. Such multi-node applications typically need to exchange data between cores running on different nodes and use a Message Passing Interface (MPI) library for that purpose [5]. Most of the HPC applications running at NASA exhibit "tightly-coupled" communication patterns, meaning that there is a significant amount of communication interleaved with the computational components. In tightly-coupled executions, the rudimentary interconnects provided by many commercial cloud providers cannot keep up, and application performance is adversely impacted.


Since 2011, the APP team has been monitoring performance of AWS resources. In 2013, POD resources were included in the tests. From 2014, internal studies of cloud resources looked at new features vendors were providing to see how performance compared to equivalent HECC hardware. Evaluations included aspects such as I/O speeds and batch scheduler feature sets. The various tests done in this timeframe confirmed the original finding that cloud-based resources could not match the performance of Pleiades on the benchmarks used.

In this new study, the APP team is evaluating the proposition that modern cloud-based resources can be cost-effective in comparison to in-house HPC resources. This study differs from previous ones in that it introduces cost into the investigation and it considers circumstances under which the most cost-effective strategy for delivering HPC cycles might include the use of cloud-based resources.

Goals of the Study

The main question to be addressed in this study is whether commercial cloud resources should be used as part of a cost-effective implementation of HECC. Specifically, its first goal is to determine whether commercial cloud resources would make a viable, cost-effective substitute for running all of the current HECC workload. In the event that is not the case, there are two additional goals of the study. One is to determine under what conditions commercial cloud resources might make a viable, cost-effective supplement for running some of the current HECC workload. The other is to provide guidance for a follow-up study to identify which elements of the HECC workload could be offloaded to a commercial cloud and what steps would be necessary to add cloud-based resources to HECC.

2. Approach

Evaluation Workload

The common benchmarks used by HECC's APP team in recent years for architecture evaluation include:

• the NAS Parallel Benchmarks [6], a set of codes that are portable and useful for showing performance of different types of full applications,

• the NAS Technology Refresh (NTR) procurement suite, a set of full-sized applications dating to 2007, and

• HECC's Standard Billing Unit (SBU4) suites—both the original one from 2011 [7] and a more recent version from 2017—which have been used to establish the billing rates for HECC resources.

All of the codes in these suites use MPI for inter-process communication. This suite of codes is sized well below the average job workload size on HECC systems, which is currently between 2048 and 4096 ranks, each of which would run on a separate core.

Since the codes would be running on public clouds, the Computational Fluid Dynamics (CFD) codes OVERFLOW, FUN3D, and USM3D, which are in the NTR and SBU suites, could not be used because they are export controlled (ITAR/EAR99). The open source application OpenFOAM was used to represent the CFD members instead. The various benchmarks used in the study are described in the remainder of this section.

4 The HECC project and the NASA Center for Climate Simulation (NCCS) use the SBU for allocating and tracking computer resource usage across dissimilar architectures. Representative codes are run on each architecture and their run times compared to the baseline values calculated on a Pleiades Westmere node, whose SBU rate is defined to be 1.0. For example, an Electra Skylake node has an SBU rate of 6.36, as it was measured to be 6.36 times more effective at processing the benchmark suite than a Pleiades Westmere node.

NAS Parallel Benchmarks (NPBs)

The NPBs are a set of benchmarks designed to help evaluate the performance of parallel supercomputers. Eight of the benchmarks (BT, CG, EP, FT, LU, MG, IS, and SP) mimic the computation and data movement in CFD applications. Class C and Class D are, respectively, the third- and second-largest problem sizes defined for each benchmark. Two of the benchmarks, BT and SP, require the number of MPI ranks to be a perfect square; the other six require a power of two. For example, Class C runs of BT and SP are performed using 16, 25, 36, 64, 121, 256, 484, and 1024 MPI ranks. For the other six benchmarks, Class C runs are performed using 16, 32, 64, 128, 256, 512, and 1024 ranks.
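These decomposition rules are easy to check mechanically. The following is a small illustrative helper in Python (not part of the NPB distribution) that validates a rank count against them:

```python
import math

def valid_npb_ranks(benchmark: str, nranks: int) -> bool:
    """Check an MPI rank count against the NPB decomposition rules above:
    BT and SP need a perfect square; the other six need a power of two."""
    if benchmark.upper() in ("BT", "SP"):
        root = math.isqrt(nranks)
        return root * root == nranks
    return nranks > 0 and (nranks & (nranks - 1)) == 0

# The Class C rank counts used in the study all satisfy these rules:
assert all(valid_npb_ranks("BT", n) for n in (16, 25, 36, 64, 121, 256, 484, 1024))
assert all(valid_npb_ranks("CG", n) for n in (16, 32, 64, 128, 256, 512, 1024))
```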

ATHENA++

ATHENA++ is an astrophysical magneto-hydrodynamics (MHD) code in C++. It is fairly heavily used on Pleiades and is a candidate for future SBU suites. Runs with 512, 1024, and 2048 MPI ranks were attempted at AWS and HECC; not all the AWS runs were successful.

ECCO (MITgcm)

One of the largest users of SBUs in the Science Mission Directorate on Pleiades is the Estimating the Circulation & Climate of the Ocean (ECCO) project. The primary code used in the project is based on the MIT general circulation model (MITgcm), a numerical model designed for study of the atmosphere, ocean, and climate. The test case used for this study is one that was included in the NTR1 (NAS Technology Refresh) benchmarks. Runs with 120 and 240 MPI ranks were performed at AWS and HECC. In the rest of this study, "ECCO" is the label used for the MITgcm results and analysis.

Enzo

Enzo is one of the benchmarks in the SBU2 suite. It is a community-developed, adaptive mesh refinement simulation code, designed for rich, multi-physics hydrodynamic astrophysical calculations. Runs with 196 MPI ranks were performed using the SBU2 dataset.

FVCore

FVCore is one of the benchmarks in the SBU1 suite. It is obtained by extracting the dynamic core of the fvGCM algorithm in GEOS-5, used for Earth science research. The term "fvGCM" has been historically used to refer to the model which was developed in the 1990s at NASA's Goddard Space Flight Center. The grid is a Cubed-Sphere, so there is no singularity at the poles, and it can be partitioned in ways to test the communication performance between regions as well as the computational performance on the core. The vertical direction was set to be 26 layers, and for each layer the horizontal grid is divided into 6 tiles (6 faces in a cube). Runs with 1176 MPI ranks were performed at AWS and HECC using the SBU1 dataset.

WRF / NU-WRF

The Weather Research and Forecasting (WRF) application is an observation-driven regional Earth system modeling and assimilation system at satellite-resolvable scale. Runs with the WRF SBU1 benchmark used 384 MPI ranks.

NASA-Unified Weather Research and Forecasting (NU-WRF) is one of the benchmarks in the SBU2 suite. Runs with 1700 MPI ranks with the SBU2 dataset were attempted; there was a problem running on AWS Skylake nodes, however.


OpenFOAM

OpenFOAM is an open source Computational Fluid Dynamics (CFD) application. The Channel395 test case from the OpenFOAM tutorial was used for this study. Runs with 48, 144, and 288 MPI ranks were made on Haswell-based nodes at HECC and AWS.

Evaluation Systems

The HECC production systems currently have six different generations of Intel Xeons, from Westmere through Skylake. This study compares HECC resources based on the three most recent processor types—Haswell, Broadwell, and Skylake—to similar systems available from cloud providers.

In-house solution: HECC Pleiades and Electra systems

Processor Types and Interconnect: Three types of Intel Xeon processors were tested:

1. Haswell – Intel Xeon E5-2680v3 CPU @2.5GHz with 24 cores and 128 GiB of memory per node, and InfiniBand 4X-FDR 56-Gbps interconnect

2. Broadwell – Intel Xeon E5-2680v4 CPU @2.4GHz with 28 cores and 128 GiB of memory per node, and InfiniBand 4X-FDR 56-Gbps interconnect

3. Skylake – Intel Xeon Gold 6148 CPU @2.4GHz with 40 cores and 192 GiB of memory per node, and InfiniBand 4X-EDR 100-Gbps interconnect

Storage Resources: HECC provides:

(i) 680 GB of space for a /nasa filesystem that hosts the public software modules,
(ii) six /home filesystems, each about 2 TB, and
(iii) six Lustre shared /nobackup filesystems with sizes between 1.7 PB and 20 PB.

The testing for this study was performed mainly with /home7 and /nobackupp8.

Operating System: The SUSE Linux Enterprise Server 12 operating system, provided and supported by HPE, was used on these systems.

MPI Library: MPI applications running on HECC systems are required to use the HPE MPT for its proven scaling performance and prevention of network instability.

Job Scheduling: Test jobs are submitted from one of the seven Pleiades front-end systems (pfe[21-27]) and scheduled to run by the PBS batch job scheduler.

Commercial cloud offering: AWS public cloud available through NASA CSSO/EMCC5

The public cloud resources available for this study were from the AWS US West (Oregon) region. The following instance types were used in this study.

5 All NASA acquisition of cloud-based resources is required to go through the agency OCIO's Cloud Services Service Office (CSSO), which has developed the Enterprise Managed Cloud Computing (EMCC) framework.

Processor Types and Interconnect: Three types of instances based on Intel Xeon processors were tested:


1. c4.8xlarge – Haswell processors, Intel Xeon E5-2666v3 CPU @2.9GHz with 18 physical cores per instance, 60 GiB of memory, no local temporary storage per node, and 10-Gbps interconnect

2. m4.16xlarge – Broadwell processors, Intel Xeon E5-2686v4 CPU @2.3GHz with 32 physical cores per instance, 256 GiB of memory, no local temporary storage per node, and 25-Gbps interconnect

3. c5.18xlarge – Skylake processors, Intel Xeon Platinum 8124M CPU @3.0GHz with 36 physical cores per instance, 144 GiB of memory, no local temporary storage per node, and 25-Gbps interconnect

Storage Resources: A 50-GB EBS volume was configured for the /nasa and /home filesystems, and five 250-GB EBS volumes were bundled together and exported as an NFS /nobackup for testing.

Operating System: The Amazon Linux AMI, which has all the AWS-specific tunings, changes, and drivers needed to run well, was chosen for this study. The AWS cost is smaller using this OS compared to, for example, CentOS or SLES.

MPI Library: There is no HPE MPT available on the AWS cloud. Intel MPI was used for all MPI applications. A user-supplied license is required to build Intel MPI applications on AWS.

Job Scheduling: Without a job scheduler in the test environment, a script called spot_manager was created and used on a front-end instance (hostname nasfe01) for requesting and managing the compute instances used for each run.
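The spot_manager script itself is not reproduced in this report. As a rough illustration of the kind of EC2 calls such a script wraps, the following boto3 sketch requests spot instances for a run and terminates them afterward; the AMI ID, key name, and placement group name are hypothetical placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # US West (Oregon)

# Request one-time spot instances for a run (AMI ID, key name, and
# placement group are placeholders; c4.8xlarge matches the Haswell runs).
resp = ec2.request_spot_instances(
    InstanceCount=4,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-XXXXXXXX",
        "InstanceType": "c4.8xlarge",
        "KeyName": "nasfe01-key",
        "Placement": {"GroupName": "hpc-placement-group"},
    },
)
request_ids = [r["SpotInstanceRequestId"] for r in resp["SpotInstanceRequests"]]

# Block until the requests are fulfilled, then look up the instance IDs
# so the job launcher can reach the nodes and, later, release them.
ec2.get_waiter("spot_instance_request_fulfilled").wait(
    SpotInstanceRequestIds=request_ids)
fulfilled = ec2.describe_spot_instance_requests(
    SpotInstanceRequestIds=request_ids)["SpotInstanceRequests"]
instance_ids = [r["InstanceId"] for r in fulfilled]

# ... run the MPI job on the instances ...

ec2.terminate_instances(InstanceIds=instance_ids)
```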

Commercial cloud offering: POD

The POD resources available for this study are from a "Proof of Concept" arrangement where the usage was tracked but no actual charges were made. The following resources were used in this study.

Processor Types and Interconnect: Three types of Intel Xeon processors were tested:

1. Haswell processors in MT1 location – Intel Xeon E5-2660v3 CPU @2.6GHz with 20 cores and 128 GiB of memory per node, and InfiniBand 4X-QDR 40-Gbps interconnect

2. Broadwell processors in MT2 location – Intel Xeon E5-2680v4 CPU @2.4GHz with 28 cores and 256 GiB of memory per node, and Intel Omni-Path 100-Gbps interconnect

3. Skylake processors in MT2 location (early access prior to public release) – Intel Xeon Skylake CPUs with 40 cores per node, and Intel Omni-Path 100-Gbps interconnect

Storage Resources: Both MT1 (NAS filesystem) and MT2 (Lustre filesystem) have about 500 TB of total storage space visible to the public. 1 TB was allocated for this testing.

Operating System: POD configures all MT1 resources with CentOS Linux version 6 and MT2 resources with CentOS Linux version 7.


MPI Library: There is no HPE MPT available on POD. Intel MPI is available via their pre-installed Intel compiler modules. However, a user-supplied license is required to use the Intel compiler and MPI library. There are also multiple modules of OpenMPI.

Job Scheduling: A PBS-like batch job system is available. The commands qsub, qstat, and qdel can be used to manage jobs by each user.

Cost Basis

For each application in the workload, the cost associated with each run was calculated as follows:

HECC Cost Calculation: A job run on HECC systems was charged in SBUs [8], which are calculated from

(i) a conversion factor for each processor type (Haswell 3.34; Broadwell 4.04; Skylake 6.36),
(ii) the number of dedicated nodes used, and
(iii) the duration of the job in hours.

The current cost of 1 SBU is $0.16 [8], which is calculated as the total annual HECC costs divided by the total number of SBUs promised to the Mission Directorates in FY18. Note that the total HECC costs utilized in this calculation include:

• the total annual HECC budget, which covers the costs of operating the HECC facility, including hardware and software costs, maintenance, support staff, facility maintenance, and electrical power costs, and
• a $1 million facility depreciation cost (based on 30-year amortization of the initial capital cost of $30 million for the building hosting the resource).

In the current modularized approach to expanding the facility beyond the current building, the costs of the containers that host the HPC hardware are also being paid by the project. Thus, the total annual budget used in the above cost-per-SBU calculation also includes the cost of infrastructure facility development and expansion.
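The charging formula above is simple enough to capture in a few lines. A minimal sketch (a hypothetical helper, using the conversion factors and the $0.16/SBU rate quoted above):

```python
# SBU conversion factors and FY18 full-cost rate, as quoted in the text [8].
SBU_FACTOR = {"haswell": 3.34, "broadwell": 4.04, "skylake": 6.36}
COST_PER_SBU = 0.16  # dollars per SBU

def hecc_job_cost(processor: str, nodes: int, hours: float) -> float:
    """Full cost in dollars of a job on dedicated HECC nodes:
    conversion factor x nodes x hours x $/SBU."""
    sbus = SBU_FACTOR[processor] * nodes * hours
    return sbus * COST_PER_SBU

# Example: one Skylake node-hour costs 6.36 SBUs x $0.16, i.e. about $1.018,
# matching the Skylake/40 full-cost entry in Table 1.
print(f"${hecc_job_cost('skylake', 1, 1.0):.3f}")
```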

AWS Cost Calculation: Because of the difficulty of assessing front-end, storage and bandwidth, technical support, and CSSO/EMCC overhead costs discussed in Appendix I, a full-blown cost for using the AWS resources to run each job is not available for this study. Only the cost of using the compute instances is presented. Depending on the AWS region and the purchase type, availability and cost of an instance type can vary significantly. For example, spot instances are not available for GovCloud. For regions where they are available, the spot price fluctuates; it can go above the on-demand price or go lower than 30% of the on-demand price. Thus, four representative prices (with the Amazon Linux OS) will be calculated:

• the on-demand price of US West (Oregon),
• a sample spot price based on 30% of the US West (Oregon) on-demand price (see Appendix I),
• the on-demand price of GovCloud, and
• a sample pre-leasing price estimated as 70% of the GovCloud on-demand price (see Appendix I).
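All four price points derive from the two published on-demand rates. A small sketch, using the c4.8xlarge (Haswell) rates from Table 1 as input, reproduces the Table 1 entries up to rounding:

```python
def aws_price_points(on_demand: float, govcloud_on_demand: float) -> dict:
    """The four representative AWS hourly prices described above."""
    return {
        "on-demand (US West Oregon)": on_demand,
        "sample spot (30% of on-demand)": 0.30 * on_demand,
        "GovCloud on-demand": govcloud_on_demand,
        "sample pre-leasing (70% of GovCloud)": 0.70 * govcloud_on_demand,
    }

# c4.8xlarge: $1.591 on-demand, $1.915 GovCloud on-demand (Table 1);
# yields $0.477 spot and ~$1.341 pre-leasing estimates.
for label, rate in aws_price_points(1.591, 1.915).items():
    print(f"{label}: ${rate:.3f}")
```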


Table 1 summarizes the various hourly rates of the instance types used in this study. Note that c5 instances are not yet offered for the GovCloud, so no pricing information is available.

Table 1: Hourly costs of compute nodes used in study. AWS costs are compute-only.†

Model/Cores  | HECC Full Cost | AWS Instance | AWS On-Demand | AWS Spot Price | AWS GovCloud On-Demand | AWS GovCloud Pre-Leasing | POD Compute Only
Haswell/18   |                | c4.8xlarge   | $1.591        | $0.477         | $1.915                 | $1.341                   |
Haswell/20   |                |              |               |                |                        |                          | $1.800
Haswell/24   | $0.534         |              |               | $0.636*        |                        |                          | $2.160*
Broadwell/28 | $0.646         |              |               | $0.840*        |                        |                          | $2.800
Broadwell/32 |                | m4.16xlarge  | $3.200        | $0.960         | $4.032                 | $2.822                   |
Skylake/36   |                | c5.18xlarge  | $3.060        | $0.918**       | N/A                    | N/A                      |
Skylake/40   | $1.018         |              |               | $1.020* **     |                        |                          | $4.400

* AWS spot prices marked with an asterisk are approximations for comparison purposes and are pro-rated based on relative core counts versus HECC. The POD 24-core Haswell price has been similarly converted for comparison to HECC.
** This estimated AWS spot price for Skylake is unlikely to be available—see Appendix I.
† Full cost information unavailable, but expected to be significantly higher than compute-only cost.

POD Cost Calculation: Similar to AWS, it is difficult to assess the POD login node and storage costs on a per-job basis. Thus, a full-blown cost for using the POD resources to run each job is not available for this study. Only the cost of using the compute resources is presented. For Haswell, Broadwell, and Skylake, the published rates as of May 2018 of $0.09, $0.10, and $0.11 per core-hour, respectively, will be used in this report. Note that government and volume discounts may be available.

Although POD advertises per-core charging, it turns out that requesting fewer cores than the maximum number of cores per node is not allowed for multi-node jobs. Essentially, the cost of the whole node will be charged: $1.80 per 20-core Haswell node, $2.80 per 28-core Broadwell node, and $4.40 per 40-core Skylake node.
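The effective node-hour rates quoted above follow directly from the per-core rates and node sizes. A small sketch of this whole-node charging rule (a hypothetical helper, not a POD API):

```python
# Published POD per-core rates (May 2018) and cores per node.
POD_CORE_RATE = {"haswell": 0.09, "broadwell": 0.10, "skylake": 0.11}  # $/core-hour
POD_CORES_PER_NODE = {"haswell": 20, "broadwell": 28, "skylake": 40}

def pod_job_cost(processor: str, nodes: int, hours: float) -> float:
    """Compute-only cost in dollars of a multi-node POD job, which is
    billed for whole nodes regardless of the cores actually requested."""
    node_rate = POD_CORE_RATE[processor] * POD_CORES_PER_NODE[processor]
    return node_rate * nodes * hours

# Implied node-hour rates: $1.80 (Haswell), $2.80 (Broadwell), $4.40 (Skylake).
# Example: an 8-node, 2-hour Broadwell job: pod_job_cost("broadwell", 8, 2.0) -> $44.80
```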

3. Results

Performance and Cost

One of the goals of the study was to determine if clouds can provide a cost-effective substitute for running the HECC workload. To this end, a suite of non-ITAR workloads, including the NPBs, ATHENA++, ECCO, Enzo, FVCore, WRF/NU-WRF, and OpenFOAM, was used to assess the performance and cost effectiveness on AWS and POD versus HECC systems.

The NPB Class C benchmark runs, with core counts ranging from 16 to 1024, were performed to check scaling performance. In comparison with HECC systems, the results show that the computation performance scales well for runs on AWS, while communication generally does not, especially across multiple instances. (Scaling graphs for the BT.C benchmark illustrate this behavior; see Appendix III.) The poor communication scaling indicates that the 25-Gbps interconnect (the maximum of current AWS instances) hinders efficient communication between MPI processes across multiple AWS instances. Details of the Class C NPB runs can be found in Appendix III (Figure 2 and Table 6). The scaling results for the Class D NPBs can also be found there (Tables 7 and 8).

Table 2 compares the cost of running the Class D benchmarks on Intel Broadwell-based nodes at HECC, AWS, and POD. For each benchmark, the scaling runs are summarized on a single line, showing the total time and the total number of nodes required by all runs and the total cost for the node hours used by the runs. (See Table 7 in Appendix III for cost details for each scaling run.) In the case of HECC, a "full cost" is used, as was detailed in the Cost Basis discussed above.

Table 2 shows an estimate for the compute charges if the AWS 30% spot pricing is available ($16.42). It also shows an estimate, $48.26, for running in a pre-leased block of nodes on the government resources at AWS (i.e., 70% of the cost of AWS Gov on-demand instances). In the best case, when spot pricing is used, AWS is 5.8 times more expensive. At the published price, POD is 4.7 times more expensive than HECC.

Table 3 shows similar comparative data for Skylake-based nodes at HECC and AWS. Note that the compute-only cost of the NPB Class D runs on AWS is more than 12 times the full cost of the same runs on HECC Skylakes.

Table 2: Selected NPB Class D performance and cost using Broadwell processors on HECC (28 cores/node), POD (28 cores/node), and AWS (32 cores/instance).

Benchmark | # of HECC or POD Broadwell nodes | # of AWS m4.16xlarge instances | Total HECC time (sec) | HECC full cost | Total AWS time (sec) | AWS Oregon compute cost | AWS Gov compute cost | Total POD time (sec) | POD compute cost
bt.D | 47 | 40 | 135.37 | $0.40 | 327.3 | $5.55 | $6.99 | 168.30 | $2.37
cg.D | 140 | 120 | 192.01 | $0.99 | 629.75 | $15.12 | $19.05 | 150.68 | $3.41
ep.D | 140 | 120 | 10.62 | $0.04 | 17.11 | $0.30 | $0.38 | 13.84 | $0.25
ft.D | 29 | 24 | 95 | $0.20 | 552.59 | $5.32 | $6.70 | 58.50 | $0.59
is.D | 29 | 24 | 10.89 | $0.02 | 58.09 | $0.56 | $0.71 | 7.09 | $0.08
lu.D | 140 | 120 | 147.3 | $0.62 | 633.12 | $17.07 | $21.51 | 185.14 | $3.64
mg.D | 140 | 120 | 17.59 | $0.08 | 52.07 | $1.03 | $1.30 | 19.61 | $0.38
sp.D | 47 | 40 | 152.11 | $0.46 | 555.84 | $9.77 | $12.31 | 171.54 | $2.46
Total Cost | | | | $2.81 | | $54.72 | $68.94 | | $13.18
Estimated AWS spot cost (30% of on-demand cost): $16.42
Estimated AWS pre-leasing cost (70% of US-Gov cost): $48.26


Table 4 shows similar data for the application benchmarks described in Section 2, comparing the full cost of running on Haswell-based nodes at HECC to the compute-only cost on similar nodes at AWS. Like Tables 2 and 3, it aggregates multiple scaling runs into a single line for each benchmark; the full data can be found in Appendix III (Table 9). Table 5 shows the data for two applications run on HECC and POD, using Haswell-, Broadwell-, and Skylake-based nodes at both sites. In this case, multiple runs are not being aggregated into a single line.

In the best case, where spot pricing is available to run the entire workload, the compute-only AWS cost is still 1.9 times more than the full cost of running on HECC in-house resources. With a pre-leasing option on GovCloud, the AWS compute cost is about 5.2 times that of the HECC full cost. The POD compute cost is about 5.3 times that of the HECC full cost.

Analysis

The compute-only cost of using the AWS US West (Oregon) spot instances to run the workloads is still a few times more expensive than the full-blown HECC cost. For example, as seen in Table 2, for NPB Class D on the Broadwell nodes, using spot pricing costs $16.42 (compute-only) on AWS and $2.81 (full-blown) on HECC, a ratio of 5.8x. For the NPB Class D on the Skylake nodes, as seen in Table 3, it costs $18.29 (compute-only) on AWS, which is ~12x the HECC full-blown cost of $1.50. For the six full-sized applications using Haswell nodes, as depicted in Table 4, the compute-only cost on AWS, $141.77, is 1.9x the HECC full-blown cost of $76.18.

Table 3: Selected NPB Class D performance and cost using Skylake processors on HECC (40 cores/node) and AWS (36 cores/instance).

Benchmark | # of HECC Skylake nodes | # of AWS c5.18xlarge instances | Total HECC time (sec) | HECC full cost | Total AWS time (sec) | AWS Oregon compute cost
bt.D | 33 | 37 | 100.35 | $0.31 | 238.81 | $3.27
cg.D | 46 | 52 | 65.05 | $0.23 | 1984 | $30.20
ep.D | 46 | 52 | 8.5 | $0.03 | 8.58 | $0.09
ft.D | 46 | 52 | 50.3 | $0.17 | 1118.4 | $14.02
is.D | 46 | 52 | 4.93 | $0.02 | 164.5 | $2.20
lu.D | 46 | 52 | 107.12 | $0.36 | 327.69 | $4.64
mg.D | 46 | 52 | 13.36 | $0.04 | 71.99 | $0.88
sp.D | 33 | 37 | 121.84 | $0.35 | 383.5 | $5.68
Total Cost | | | | $1.50 | | $60.98
Estimated AWS spot cost (30% of on-demand cost): $18.29

Table 4: Selected MPI application performance and cost using Haswell processors on HECC (24 cores/node) and AWS (18 cores/instance).

Benchmark | Case | # of HECC Haswell nodes | # of AWS c4.8xlarge instances | HECC time (sec) | Total HECC full cost | AWS time (sec) | Total AWS Oregon compute cost | Total AWS Gov compute cost
ATHENA++ | SBU2 | 129 | 171 | 3445 | $29.51 | 3672 | $127.11 | $153.00
ECCO | NTR1 | 15 | 21 | 185 | $0.19 | 313 | $1.41 | $1.69
Enzo | SBU2 | 9 | 11 | 1827 | $2.44 | 2266 | $11.02 | $13.26
FVCore | SBU1 | 49 | 66 | 1061 | $7.72 | 1104 | $32.20 | $38.76
nuWRF | SBU2 | 71 | 95 | 529 | $5.58 | 1302 | $54.66 | $65.80
OpenFOAM | Channel395 | 20 | 27 | 27500 | $30.74 | 51430 | $246.31 | $296.46
Total Cost | | | | | $76.18 | | $472.71 | $568.96
Estimated AWS spot cost (30% of on-demand cost): $141.77
Estimated AWS pre-leasing cost (70% of US-Gov cost): $398.27


The ratios are even higher when using the AWS US West (Oregon) on-demand, the US-Gov on-demand, or pre-leasing US-Gov instances (see Appendix I for more details). In addition, AWS does not offer spot instances in their government cloud. Paying for on-demand instances on their public or government cloud or the pre-leasing option would not be cost effective.

The additional cost of utilizing front-ends, filesystems, data transfer, software licenses, and support on AWS is harder to quantify on a per-job basis. One can, however, examine the total charge in different categories. For example, the total AWS charge for the month of December 2017 in conducting the benchmark evaluation was $1944, of which $1068 was spent on using various compute instances, $136 on disk space usage, $738 on having the front-end available (at a $4.256 per hour rate), and about $2 on data transfer. Therefore, there was approximately 82% extra cost ($1944/$1068 = 1.82) beyond the compute-instance cost. Similarly, for January 2018, there was a 97% ($1654/$840) extra cost on top of the compute-instance cost. Since the workloads used in this evaluation neither use very much storage nor perform large file transfers, it is anticipated that in a production environment this overhead due to storage and file transfer will probably increase. On the other hand, the overhead due to the use of the front-end will likely decrease if there are many users sharing the front-end system. Also, since the AWS resources used for this evaluation are through CSSO/EMCC, there will be an additional CSSO/EMCC overhead (which includes fees paid to AWS for support services and other operational costs of CSSO/EMCC), which was not yet reported and thus not included in this cost assessment.

For POD, there is a front-end cost and storage cost in addition to the compute cost. Storage is $0.10 per GB-month, but government discounts will likely apply. There is no cost for data transfer or getting technical support. POD also offers additional volume discounts, but the specifics would not be known until HECC negotiates a cloud services contract with POD.

Even though the POD MT2 resources may provide competitive performance, it is not cost effective compared to running on the HECC resources. For example, as seen in Table 2, the total cost of running the various NPB Class D benchmarks on POD Broadwell nodes ($13.18, compute-only) is 4.7 times that of the HECC Broadwell full-blown cost ($2.81). The two full-sized application benchmarks, Enzo and WRF, when run on three node types, were ~5.3 times higher on POD than the full-blown HECC costs (see Table 5). Note that the storage cost incurred on POD was minimal during this evaluation period. Given that the cost of the POD login node and file transfer will likely be free and there will be volume and government discounts with storage, it is expected that the extra cost on top of compute in a production environment will be smaller with POD than with AWS.

Table 5: Enzo and WRF performance and cost using Haswell, Broadwell, and Skylake processors on HECC and POD. HECC: Haswell – 24 cores/node, Broadwell – 28 cores/node, Skylake – 40 cores/node; POD: Haswell – 20 cores/node, Broadwell – 28 cores/node, Skylake – 40 cores/node.

APP | Case | NCPUS | # of HECC nodes | # of POD nodes | HECC time (sec) | Total HECC full cost | POD time (sec) | Total POD compute cost
ENZO (HAS) | SBU2 | 196 | 9 | 10 | 1827 | $2.44 | 2355 | $11.78
ENZO (BRO) | SBU2 | 196 | 7 | 7 | 1625 | $2.04 | 1870 | $10.18
ENZO (SKY) | SBU2 | 196 | 5 | 5 | 1519 | $2.15 | 1751 | $10.70
WRF (HAS) | SBU1 | 384 | 16 | 20 | 1243 | $2.95 | 1802 | $18.02
WRF (BRO) | SBU1 | 384 | 14 | 14 | 1225 | $3.08 | 1436 | $15.64
WRF (SKY) | SBU1 | 384 | 10 | 10 | 1069 | $3.02 | 1352 | $16.52
Total Cost | | | | | | $15.68 | | $82.84


Based on the results presented above, the HECC systems are not only much better in their performance but also more cost effective than existing AWS and POD offerings. Neither AWS nor POD would be a viable substitute for HECC for running the traditional HPC MPI (i.e., multi-node) applications.

Usability and Potential Technical Issues

If HECC is going to offer cloud-based resources, then usability is a major concern. The study examines multiple areas in that regard. The complete findings can be found in Appendix IV. The key findings are summarized here:

• Porting of NASA's traditional HPC MPI applications to the cloud is not trivial, and the process can encounter frequent failures due to:
(i) system and runtime configuration differences, especially when an application requires a large number of libraries (e.g., for the cases of GEOS-5 from the SBU2 benchmark suite and WRF),
(ii) failures to run for some cases of an application due to unknown issues (e.g., ATHENA++), and
(iii) unexpectedly poor performance whose cause needs further investigation (e.g., NPB CG and FT, WRF, OpenFOAM).
(See Appendix IV for details.) The time, effort, and cost for NASA scientists, HECC staff, or commercial cloud support staff to investigate and resolve such issues can be significant when migrating to the cloud. In addition, the use of licensed software libraries will increase the cost of running in the cloud in addition to the cost of running at HECC.

• AWS offers many operating system options. Since SLES 12 is the current OS used on HECC systems, it is the first choice for trying on AWS, since it is likely to introduce fewer porting issues. However, on AWS, the SLES OS is more expensive than the basic Amazon Linux OS used in this evaluation. POD offers only CentOS.

• AWS offers no pre-installed software stack. POD has pre-installed more than 250 software modules for HPC use. For commercial software, users have to rely on using licenses provided by HECC or they must bring their own to either AWS or POD.

• AWS does not provide a job scheduling system. The evaluation on AWS used a script-based method created by HECC staff to request, check, and stop instances. POD uses the Moab/Torque management system, which is very similar to the PBS batch scheduler used on HECC systems.

• Given the additional work needed for porting, it may be worthwhile to use "container" technology, such as Docker, to package a job's image so that the job can run on both HECC and the cloud. Getting this technology to work properly and satisfy NASA security requirements still requires significant development and testing.

• Enabling a hybrid environment, where job bursting can occur from HECC to the cloud, will require a working job scheduling system. Altair has PBS 18 in beta test with Microsoft Azure. Adaptive Computing advertises availability of their Moab/NODUS cloud bursting solution on AWS, Google Cloud, AliCloud, DigitalOcean, and others. HECC has yet to test the feasibility of these technologies.

• Authorization to Operate (ATO) will be needed to operate on cloud resources such as those at AWS or POD. The best option is likely to extend the NAS moderate security plan, under which HECC operates, to include the use of resources from specific commercial cloud vendors. At that point, export-controlled (ITAR/EAR99) data would be permitted.

Other Considerations

Another consideration for using clouds is that they may be able to provide access to state-of-the-art resources, such as GPUs, that are not readily available at HECC. To this end, the study found:

• POD lagged behind AWS in offering the Intel Skylake processors. AWS released Skylake on November 7, 2017 for use on their public cloud. POD released Skylake for general use on March 12, 2018. Note, however, that the AWS Skylake processors are in high demand, and it can take a long time to get spot instances if one does not want to pay the on-demand price. Also note that no Skylake processors are yet available (as of May 2018) through the AWS government cloud.

• AWS lags behind POD in offering a higher-performance interconnect. The POD MT2 site already offers a 100-Gbps Omni-Path interconnect, while AWS currently offers at most a 25-Gbps interconnect.

• POD lags behind AWS in offering new GPU processors. AWS has the Nvidia M60, K80, and V100, while POD has only the Nvidia K40. Currently, HECC also has the Nvidia K40.

• POD offers the Intel KNL but AWS does not. HECC also has the Intel KNL.

4. Findings for NASA's Current HPC Workload

Note that the findings presented in this section reflect the performance and cost at the time of the study (May 2018). Changes in cloud offerings, especially with regard to pricing, may necessitate a reexamination in the future.

Performance

All runs on HECC resources were faster, and sometimes significantly faster, than runs on the most closely matching Amazon resources. The largest differences are most likely due to Amazon's lack of a true high-performance interconnect for processors.

For some of the NPBs, Penguin resources yielded faster runs; for others it had slower runs than HECC. Differences in the processor interconnects and MPI library are likely the causes of the performance differences. For the application benchmarks, performance of resources at Penguin always lagged behind similar resources at HECC. These performance differences lead to the first finding:

Finding 1: Tightly-coupled, multi-node applications from the NASA workload take somewhat more time when run on cloud-based nodes connected with HPC-level interconnects; they take significantly more time when run on cloud-based nodes that use conventional, Ethernet-based interconnects.

Cost

With the exception of Skylake processors, Table 1 in Section 2 shows that the base compute-only cost of a node-hour at AWS using spot pricing is more expensive than the full cost of a node-hour on a similar resource at HECC. Skylakes are new and, as shown in Appendix I, the spot pricing estimate is very optimistic. Jobs paying that price are likely to be evicted before completion because of the high demand for the nodes.


Table 1 also shows that the compute-only cost of resources at POD is 4.0–4.3 times more expensive than the full cost for similar resources at HECC. These observations lead to the second finding:

Finding 2: The per-hour full cost of HECC resources is cheaper than the (compute-only) spot price of similar resources at AWS and significantly cheaper than the (compute-only) price of similar resources at POD.

Cost Effectiveness

In all cases, the full cost of running on HECC resources was less than the lowest-possible compute-only cost of running on AWS. To run the full set of the NPBs, AWS was 5.8–12 times more expensive than HECC, depending on the processor type used. The full-sized applications were in the best case 1.9 times more expensive.

The NPB runs at Penguin were ~4.7 times more expensive than equivalent runs at HECC. The full-sized applications were ~5.3 times more expensive.

The average use of HECC resources is for jobs using 4,000 cores and running for more than 60 hours. At least three-fourths of the workload is from multi-process jobs using tightly coupled communications, and the SBU suite has been defined to reflect that usage. The requirements of the HECC workload, together with the cost and performance data in this report, lead to the third finding:

Finding 3: Commercial clouds do not offer a viable, cost-effective approach for replacing in-house HPC resources at NASA.

This finding is based on the expectation that other cloud providers can provide on-demand resources only at or above the spot-price level of AWS. In addition, providers who do not use HPC-level node interconnects will face similar performance penalties as seen on AWS.

Clouds as a Supplement to HECC Resources

While it is not cost effective to use clouds to replace in-house HPC resources at NASA, there may be circumstances where it makes economic sense to use them for augmenting in-house resources. In identifying candidate uses for clouds, it is worthwhile to start with an examination of why HECC costs are so low.

The main factor that keeps HECC costs low is its very high overall average utilization rate—more than 80% of the theoretical maximum number of SBUs in 2017 were delivered to users. Coupled with low hardware acquisition costs and a relatively low ratio of staff to hardware infrastructure, it is very difficult for cloud vendors to compete with HECC's cost per node-hour. Multiple cloud vendors have acknowledged this fact with comments such as, "If you (HECC) can keep systems 70% utilized or more, then our pricing cannot compete with yours."

Another factor that helps drive down HECC costs is its standard practice of adding compute capacity without increasing staffing costs. This pattern is expected to continue with the NAS Facility Expansion effort, which is adding additional floor space and compute resources to HECC. Over time, the plan calls for doubling the space available and equipping it with compute resources, with no additional staffing costs.

Using utilization and responsiveness as drivers, there are two areas where clouds might be useful to HECC:


• When HECC has the need for resources whose long-term demand is sporadic or unknown, it may be more cost effective to use the cloud for those requirements. For example, testing a new architecture often requires a period of high demand followed by a lengthy period of almost no activity. Other examples involve users with requirements for GPU-accelerated nodes, such as for machine learning. Here, the demand is likely long-term, but the extent of the demand is unknown at present.

• When the running of many loosely-coupled (or serial) jobs significantly impacts the scheduling of HECC's primary workload—large, tightly-coupled computations—then it may be worth paying the price of running the small jobs in the cloud to improve responsiveness for all users.

This analysis, and the observation that commercial cloud providers offer many more types of compute resources than HECC is able to (see Appendix I), lead to the fourth finding:

Finding 4: Commercial clouds provide a variety of resources not available at HECC. Several use cases, such as machine learning, were identified that may be cost effective to run there.

5. Discussion and Further Work

This section discusses potential cloud use cases in further detail, which leads to some actions the APP team is taking to improve NASA's ability to deliver compute capacity in the most cost-effective way possible.

Using Cloud Resources When User Demand is Unknown or Sporadic

One of the scenarios where it may make sense for HECC to use the cloud rather than acquiring its own resources is when a resource would be lightly or sporadically utilized. For example, HECC is continually procuring small systems to test the usefulness of new technologies for NASA applications, such as the many-integrated-cores architecture-based Intel Xeon Phi processor or the ARM processor. Unless these new technologies are seen to perform very well on a large number of applications, their utilization may be limited to a small number of users even after the testing phase is completed. In such cases, where ongoing, long-term demand for the systems is uncertain, HECC should evaluate the cost effectiveness of using cloud-based resources instead of acquiring its own in-house systems.6

6 Note that given the relatively high cost of commercial clouds (at least 1.9 times on AWS for applications), utilization of an in-house resource has to be below 42% before it becomes cost-effective to use a commercial cloud for the resource.

Another scenario is the use of GPUs for modeling and simulation (M&S) applications. During the time that user demand for GPU cycles for M&S is developing, HECC should consider using cloud-based resources for those jobs. One advantage for users would be access to newer hardware types than HECC has in-house. For example, AWS offers newer GPU processors than either HECC or POD. Although the AWS GPU price is expensive (see Appendix I) and multi-node GPU jobs would face the sort of performance penalties that multi-CPU jobs face, it may still be more cost effective than regularly acquiring new in-house GPUs, and it certainly provides the ability to opportunistically access them as needed.

Similarly, data analytics applications, e.g., machine-learning algorithms, have been shown to run effectively on GPUs. However, since the demand for GPU cycles for such applications is still developing, HECC should consider running such jobs in the cloud. Since the computations are usually loosely coupled, cost effectiveness would likely benefit from running on nodes with many GPUs. Such nodes are particularly expensive and would represent a large investment that might only be used a fraction of the time. Running in the cloud could be the most effective solution for such a workload until such time as demand is high enough to justify an on-premises resource. Note that running data analytics applications in the cloud may also face a performance penalty depending on where the data is located. For example, a data analytics application using datasets already at HECC would face additional cost in accessing that data when running on cloud resources.

In any of the above cases, when user demand grows to the point that it is more economical to run on premises, then bringing a resource in-house can be justified using a purely economic argument.

Using Cloud Resources to Improve Responsiveness of HECC In-House Systems

Opportunity Cost of Running Small Jobs on Pleiades/Electra

Experience with the porting and performance of multi-node MPI jobs on the cloud leads to the conclusion that they are not as well suited to it as single-node jobs may be. A significant portion of usage of HECC in-house resources is by single-node jobs. In 2017, 1.87M one-node jobs (or subjobs) from NASA's Science Mission Directorate (SMD) consumed a total of 9M SBUs on Pleiades and Electra. They represented about 85% of the total number of jobs on those systems but only 3% of the total SBU usage.

A question that then naturally comes up in light of these numbers is how the scheduling of the one-node jobs from SMD would affect the other 15% of the jobs on the system. In particular, do wider jobs, i.e., ones using a larger number of processors, get delayed significantly because of the existence of one-node jobs?

While it seems that the relatively small usage (3%) could be accomplished by finding "holes" in the scheduling of wider jobs, the sheer volume of single-node jobs makes the question worth investigating.

Analysis of Leasing Cloud Resources for Small Jobs

If the productivity impact of leaving all of the small jobs on HECC in-house resources is found to be significant, NASA might consider leasing resources in the cloud in order to run those jobs. Some tradeoff analysis of using leased resources was therefore conducted.

Analysis of PBS historical data shows that the majority of one-node jobs on HECC systems belong to NASA SMD. In addition, it is likely that the SMD jobs have less stringent security requirements. When the effects of delaying the scheduling of wider jobs are considered, migrating some of the one-node SMD jobs from HECC to a commercial cloud might prove to be cost effective for the whole HECC workload. To this end, the study conducted some experiments to determine the utilization percentage and waiting time on the cloud resource that would be seen by HECC when single-node jobs are moved from in-house resources to a fixed-size, leased resource in the cloud. Historical job submission data from all one-node SMD jobs in 2017 was used. The size of the leased resource was varied from 2 to 8 racks of 72 nodes each. The results shown in Figure 1 clearly indicate that it would be difficult to size a lease to handle all single-node jobs from SMD and have the leased resource both be responsive to user requirements and be well utilized. For example, a lease of 4 racks (288 nodes) would be about 95% utilized but would have an average waiting time of more than 750 hours. If 5 racks (360 nodes) were leased, the wait time would drop to about 150 hours, but it would cost 25% more. Thus, any efficient solution to run single-node jobs on a leased resource would need to have a fallback (in the on-demand cloud or on in-house resources) to reduce waiting times when demand exceeds the capability of the leased resources.
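The simulation code itself is not included in the report. The following minimal sketch shows one way such a replay could be structured, assuming FIFO ordering of one-node jobs against a fixed pool of leased nodes (both assumptions, since the actual scheduling policy used is not described):

```python
import heapq

def simulate_lease(jobs, n_nodes):
    """Replay one-node jobs (submit_time_hr, run_hours) FIFO against a
    fixed-size leased pool; return (average wait in hours, utilization)."""
    jobs = sorted(jobs)                    # FIFO by submission time
    free_at = [0.0] * n_nodes              # earliest time each node is free
    heapq.heapify(free_at)
    waits, busy = [], 0.0
    for submit, hours in jobs:
        start = max(submit, heapq.heappop(free_at))  # wait for a free node
        waits.append(start - submit)
        busy += hours                       # accumulate busy node-hours
        heapq.heappush(free_at, start + hours)
    makespan = max(free_at) - jobs[0][0]
    return sum(waits) / len(waits), busy / (n_nodes * makespan)

# e.g., sweep lease sizes of 2-8 racks (72 nodes each) over a 2017 job log:
# for racks in range(2, 9):
#     wait, util = simulate_lease(job_log, racks * 72)
```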

A cost comparison of using 144 Haswell nodes for one year among HECC in-house, AWS, and POD is summarized below. Note that these Haswell nodes are not identical, and the costs presented should be used with caution when drawing conclusions. See Appendix III for more detailed information on how the costs are obtained.

• HECC: using 144 existing Haswell nodes (each with 24 cores and 128 GiB of memory), an "unlimited" amount of storage space, and multiple front-end nodes costs $674K.

• AWS: pre-leasing up-front 144 c4.8xlarge instances (each with 18 cores and 60 GiB of memory per instance) with the Amazon Linux OS in the US West (Oregon) region, 100 TB of EBS gp2 volumes, and 1 m4.16xlarge as a login node costs $1.35M. Pre-leasing instances with another OS (such as SLES, CentOS, etc.), with no-up-front or partial-up-front payment, from a different AWS region, or using EFS instead of EBS will cost more. For example, pre-leasing no-up-front 144 c4.8xlarge instances in the GovCloud region, 100 TB of EFS, and 1 m4.16xlarge as a login node costs $1.92M. Note that the data transfer cost is not counted. Additional overhead from CSSO/EMCC is not included in the estimate.

• POD: reserving 144 Haswell nodes (each with 20 cores and 128 GiB of memory) and 100 TB of storage costs $2.39M. Note that this cost does not take into account possible discounts that POD will offer for government agencies, high-volume usage, long-term contracts, or dedicated resources. Also, data transfer is free.

As detailed above, the cost of pre-leasing commercial cloud resources for one-node jobs is at least double the cost of using existing HECC in-house resources. However, it may be that commercial cloud pricing will become more competitive in the future. NASA should monitor performance and costs as the market changes. NASA should also evaluate the cost effectiveness of running its own small-job cluster in-house. The per-node cost of such a cluster would likely be lower than HECC's typical cluster because the node interconnect would not need to support tightly-coupled communications.

Figure 1: Results of simulations where all one-node jobs in SMD that were submitted in CY 2017 are run instead on leased cloud resources, demonstrating how the size of the leased resource (in number of nodes) gives rise to a trade-off in utilization of the leased resource vs. the average wait time for each job.

Page 19: Evaluating the Suitability of Commercial Clouds for NASA’s ...€¦ · HECC workload on the commercial cloud, there may be cases where cloud resources would prove to be a cost-effective

NAS 2018-01—HECC Cloud Trade Study May 2018 19

Further Work

With a goal of helping HECC use the most cost-effective way to meet NASA's HPC requirements, the APP team has identified and begun work on three actions:

Action 1: Get a better understanding of the potential benefits and costs that might accrue from running a portion of the HECC workload in the cloud. This has two parts:

• Determine the scheduling impact to large jobs of running 1-node jobs in house. • Understand the performance characteristics of jobs that might be run on the cloud.

HECC is already in the process of conducting research into how the one-node jobs affect scheduling. HECC's systems team is simulating how scheduling would have occurred over a historical period of job submissions if single-node jobs from SMD were run on cloud resources instead of HECC in-house systems. The result of this study will quantify an "opportunity cost" of delayed scheduling of wide jobs in order to run one-node jobs in-house. That information can be used to help decide whether moving the one-node jobs is worth the likely increase in cost to do so.

In addition, HECC will develop benchmarks to represent workflows that might be run on the cloud. In the case where cloud resources are used to satisfy a sporadic demand, the benchmarks will be used periodically to evaluate the cost of jobs in the cloud and on potential in-house resources to determine when demand has reached that tipping point. The new benchmarks will also be used together with the ones from this study to extend performance evaluations to providers such as Google and Microsoft and to keep results current as providers have new offerings.

In general, such an evaluation for moving cloud computations to in-house resources should take into account factors such as:

• User demand, current and potential, for the resource
• Up-front cost of acquiring and maintaining the resource in house versus the cost of acquiring the resource via a commercial cloud provider
• Need to test the resource within the HECC environment, e.g., to test its interaction with the in-house file system
• Possibility of repurposing the hardware after the completion of testing

Action 2: Define a comprehensive model that allows accurate comparisons of cost between HECC in-house jobs and jobs running in the cloud.
If HECC is using the cloud as a cost-effective way to provide access to resources that would have low utilization rates, it will periodically need to evaluate when demand is sufficient to warrant purchasing hardware and bringing the computations on premises. When comparing cloud costs to projected in-house costs, it will be important to use apples-to-apples metrics. This study compares the most pessimistic cost of running on HECC resources to the most optimistic cost of running on AWS or POD. The full cost of running on AWS needs to add charges for storage, network bandwidth, and staffing support. It may need to use a higher price for compute resources as well if spot pricing is not available. The full cost of running on POD may need to add charges for staffing support and for storage beyond the modest amount included for free.

An alternative way of comparing costs would be to use a marginal cost for computing at HECC. In this approach, the additional costs associated with adding resources to existing HECC resources (mostly capital costs for compute racks) would be used. Then, the additional cost would be divided by the expected number of SBUs that equipment would deliver, to arrive at the marginal cost of providing additional SBUs.

As an example, if HECC added one HPE E-Cell with 288 Skylake nodes, the estimated marginal cost for the new SBUs provided would be in the range of $0.09–0.10/SBU, depending on the cost of additional storage and network hardware required. This assumes that the new equipment would be utilized at a rate of 80% over a three-year period. This rate is about 60% of the full cost rate per SBU (see footnote 7). Before using this approach to make financial decisions, the true marginal costs associated with storage, networking, and support need to be determined.
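
The marginal-cost arithmetic can be sketched as follows. Note that the capital cost used here is a hypothetical placeholder, since this report does not publish the E-Cell price; the Skylake SBU factor of 6.36 is the one listed in Appendix III.

    # Hedged sketch of the marginal-cost-per-SBU arithmetic. The E-Cell
    # capital cost is a hypothetical placeholder; 6.36 is the Skylake SBU
    # factor from Appendix III.
    def marginal_cost_per_sbu(capital_cost, nodes, sbu_factor, years, utilization):
        # SBUs delivered = nodes x SBU factor x hours in service x utilization
        sbus_delivered = nodes * sbu_factor * years * 8760 * utilization
        return capital_cost / sbus_delivered

    # A hypothetical $3.7M capital cost for 288 Skylake nodes at 80%
    # utilization over three years lands near the $0.09-0.10/SBU range cited.
    print(f"${marginal_cost_per_sbu(3_700_000, 288, 6.36, 3, 0.80):.3f}/SBU")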

In comparison, using a spot-pricing estimate of 30% of the on-demand rate, an equivalent number of Skylake node-hours at AWS would cost 60–80% more than at HECC. (Note that the spot-pricing rate is less than the rate for leasing resources long term from AWS; see Appendix I.) Using POD resources would cost more than 6 times the marginal HECC cost, but government and volume discounts would lower that factor somewhat.

In addition, in order to accurately estimate costs of running in the cloud, HECC needs to gain a better understanding of storage and network I/O bandwidth requirements as they interact with cloud offerings. HECC also needs to understand the requirements for a deep tape archive and the long-term preservation of data, and how those could be integrated with the cloud.

Action 3: Prepare for a broadening of services offered by HECC to include a portion of its workload running on commercial cloud resources. This has two parts:

• Conduct a pilot study, providing some HECC users with cloud access.
• Develop an approach where HECC can act as a broker for NASA's HPC use of cloud resources, including user support and an integrated accounting model.

There are a number of areas of work needed to integrate cloud-based resources into HECC production:

• While NASA's CSSO/EMCC has existing agreements with AWS, HECC should evaluate other potential suppliers of cloud resources, potentially through an RFP. Note that NASA requires that money be funneled through CSSO/EMCC, so that organization would likely need to be involved in the procurement.

• Using cloud-based resources in production will require modifications to the NAS Security Plan under which the HECC resources operate. NAS/HECC should extend their moderate plan to cloud-based resources from the vendors NASA chooses to work with. This will likely involve having those vendors be responsible for some of the security controls.

• HECC needs to consider how its users would access cloud-based resources. While many vendors, especially resellers, offer web-based job submission mechanisms, HECC users are accustomed to script-based job submissions. HECC needs to perform a requirements analysis on a group of users to determine whether web-based systems are preferable before investing money in a software license.

7 The assumption of equipment being in production for three years is pessimistic when compared to typical HECC practices. It is more likely that the equipment would be in use for 5+ years, thereby lowering the marginal cost of an SBU substantially, probably to something in the range of $0.05–0.06, depending on costs incurred for maintenance after three years.

Page 21: Evaluating the Suitability of Commercial Clouds for NASA’s ...€¦ · HECC workload on the commercial cloud, there may be cases where cloud resources would prove to be a cost-effective

NAS 2018-01—HECC Cloud Trade Study May 2018 21

• HECC also needs to consider how to support cloud users, especially with setting up images and porting applications.

In considering moving some of the HECC workload to the cloud, the pilot study is pursuing the following:

• Studying the workflows of candidate SMD one-node applications to identify and resolve potential issues for migration.

• Investigating the feasibility of using container technology, e.g., Charliecloud, for moving one-node jobs between HECC and AWS, POD, or other cloud offerings.

• Testing the cloud bursting capability with the PBS 18 and/or Moab/NODUS job scheduler as a mechanism for HECC users to submit jobs to run in the cloud.

• Designing mechanisms for accounting for cloud usage that integrate cleanly into HECC's current accounting user interface.

To act as a broker for NASA's HPC use of clouds, HECC's approach needs to include:

• Processes for identifying cloud-appropriate workflows, both from current HECC users and others who approach HECC asking for help,

• Processes, documentation, and training for porting applications to run effectively in the cloud, and

• Mechanisms, potentially through a user portal, for non-HECC users to incorporate cloud usage as part of their workflow.

6. Conclusions
The APP team used a combination of the NAS Parallel Benchmarks (NPB) and six full applications from NASA's workload on Pleiades and Electra to compare the performance of nodes based on three different generations of Intel Xeon processors: Haswell, Broadwell, and Skylake. With the exception of an open-source CFD code standing in for export-controlled codes that could not be run in the cloud, the full applications represent typical work done across NASA's Mission Directorates.

In addition to gathering performance information, the APP team also calculated costs for the runs. In the case of work done on Pleiades and Electra, the "full cost" of running jobs was determined. In the case of the commercial cloud providers, the team calculated only the compute cost of each run, using published rates and including publicly known discounts as appropriate. Other infrastructure costs of running in the cloud, such as near-term storage, deep archive storage, network bandwidth, software licensing, and staffing costs for security and security monitoring, application porting and support, problem tracking and resolution, program management and support, and sustaining engineering, were not considered in this study. These "full cloud costs" are likely significant.

Results show that large applications with tightly coupled communications perform worse on cloud resources than on similar resources at HECC. In addition, per-hour use of cloud resources is more expensive than the full cost of using similar resources at HECC. Taken in combination, the data as of May 2018 leads to the conclusion that commercial clouds do not offer a viable, cost-effective approach for replacing in-house HPC resources at NASA.

While it is not cost effective to use clouds to replace in-house HPC resources at NASA, there may be circumstances where it makes economic sense to use them to augment in-house resources. This study identified three actions for HECC in anticipation of clouds being useful for some of the HECC workload. The main themes of the actions are for NASA to better understand the impact and cost of running in the cloud, to define a comprehensive model for more accurate comparisons with HECC in-house resources, and to prepare for the cloud's likely eventual use for a certain fraction of the agency's HPC workload.

Acknowledgments
This study was funded by the NASA High-End Computing Capability, including work under task ARC013.11.01 on contract NNA07CA29C.

Penguin Computing made their cloud-based resources available to our evaluation at no cost; in addition, they provided access to their Skylake processors prior to public release. The authors appreciate their involvement.

The authors would like to thank R. Biswas, B. Ciotti, L. Hogle, P. Mehrotra, and W. Thigpen for their helpful comments and suggestions.


Acronyms
AMI    Amazon Machine Image
AWS    Amazon Web Services
CSSO   (NASA) Computer Services Service Organization
EAR99  A classification of items subject to Export Administration Regulations
EBS    (Amazon) Elastic Block Store
EC2    (Amazon) Elastic Compute Cloud
EFS    (Amazon) Elastic File System
EMCC   (NASA) Enterprise Managed Cloud Computing
HECC   (NASA) High-End Computing Capability
HPC    High Performance Computing
HPE    Hewlett Packard Enterprise
ITAR   International Traffic in Arms Regulations
M&S    (Physics-Based) Modeling & Simulation
MPI    Message Passing Interface, used for parallel programming
NAS    "NASA Advanced Supercomputing Division" or "Network Attached Storage" (depending on context)
NFS    Network File System
PBS    Portable Batch System
POD    Penguin On-Demand
SBU    (HECC) Standard Billing Unit
SMD    (NASA) Science Mission Directorate


References
1. U.S. Department of Energy. The Magellan Report on Cloud Computing for Science. Office of Advanced Scientific Computing Research (ASCR), December 2011.
2. P. Mehrotra, R. Hood, J. Chang, S. Cheung, J. Djomehri, S. Heistand, H. Jin, S. Saini, J. Yan, and R. Biswas. Evaluating NASA's Nebula Cloud for Applications from High Performance Computing. NASA Advanced Supercomputing, 2011.
3. P. Mehrotra, J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff, S. Saini, and R. Biswas, "Performance Evaluation of Amazon EC2 for NASA HPC Applications," 21st International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'12), Delft, the Netherlands, June 18-22, 2012.
4. S. Saini, S. Heistand, H. Jin, J. Chang, R. Hood, P. Mehrotra, and R. Biswas, "Application Based Performance Evaluation of Nebula, NASA's Cloud Computing Platform," 14th International Conference on High Performance Computing and Communications (HPCC'12), Liverpool, United Kingdom, June 25-27, 2012.
5. Message Passing Interface Standard, Version 3.1, https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
6. NAS Parallel Benchmarks, https://www.nas.nasa.gov/publications/npb.html
7. SBU Suite 2011, https://www.hec.nasa.gov/news/features/2011/sbus_042111.html
8. Standard Billing Units, https://www.hec.nasa.gov/user/policies/sbus.html
9. Amazon white paper, April 2017, "Overview of Amazon Web Services", https://aws.amazon.com/whitepapers/overview-of-amazon-web-services/
10. AWS current instance types, https://aws.amazon.com/ec2/instance-types/
11. Amazon white paper, December 2016, "AWS Storage Services Overview", https://aws.amazon.com/whitepapers/storage-options-aws-cloud/
12. AWS Support Plans, https://aws.amazon.com/premiumsupport/compare-plans/
13. Pricing information for AWS on-demand and reserved instances, https://aws.amazon.com/ec2/pricing/reserved-instances/pricing/
14. AWS Accelerated Computing, https://aws.amazon.com/blogs/aws/new-p2-instance-type-for-amazon-ec2-up-to-16-gpus/
15. AWS EBS Volume Types, https://aws.amazon.com/ebs/details/#VolumeTypes
16. AWS EFS Pricing, https://aws.amazon.com/efs/pricing/
17. AWS GovCloud (US) Data Transfer Pricing, https://aws.amazon.com/govcloud-us/pricing/data-transfer/
18. AWS Data Transfer Costs and How to Minimize Them, https://datapath.io/resources/blog/what-are-aws-data-transfer-costs-and-how-to-minimize-them/
19. What is AWS Direct Connect?, https://docs.aws.amazon.com/directconnect/latest/UserGuide/Welcome.html
20. Penguin Computing On-Demand (POD) HPC Cloud, https://www.penguincomputing.com/hpc-cloud/pod-penguin-computing-on-demand/
21. POD Pricing, https://www.penguincomputing.com/hpc-cloud/pricing/
22. Private communication with POD staff, January and February 2018.
23. Moab/NODUS Cloud Bursting Solution, http://www.adaptivecomputing.com
24. AWS Public Sector Contract Center, https://aws.amazon.com/contract-center/


Appendix I – Amazon Web Services

Resource Overview
The AWS Cloud [9] offers more than 90 AWS services in four main categories: computing power, storage options, networking, and databases. Within each category, there are multiple options to choose from for matching different workloads. For example, in the computing power category, AWS EC2 provides resources in instances, where an instance is a virtual server. Loosely speaking, the allocation of AWS resources in units of instances is equivalent to HECC's resource allocation in units of nodes. The chart from [10] below shows the currently available AWS EC2 instance families and instance types. Four instance families (general purpose, compute-optimized, memory-optimized, and storage-optimized) currently use Intel Xeon processors exclusively. For the compute-optimized instances, the c3, c4, and c5 instances use Ivy Bridge, Haswell, and Skylake processors, respectively. The accelerated computing family currently provides Nvidia Tesla processors (K80, M60, and V100) in addition to the Intel Xeon processors on the instance.

As seen in the following table, obtained through the spot_manager script created by HECC staff, each instance type is characterized by:

(i) amount of memory, (ii) number of cores, (iii) number of GPU processors included, (iv) whether there is local temporary storage included and its size, (v) network performance, and (vi) whether launching the instances in a placement group is allowed.

There are two types of placement groups: a "cluster" group for clustering instances into a low-latency group, and a "spread" group for spreading instances across underlying hardware to reduce the risk of simultaneous failures. For network performance, most instance types are configured with 10 Gbps or less. A few instance types are configured with a 25-Gbps interconnect.

Type | Memory | Cores | GPUs | Local Temp Storage | Network Perf | Enhanced Networking | Placement Group
d2.xlarge | 30.5 GiB | 2 | | 6000 GiB | Moderate | Yes | Yes
d2.2xlarge | 61.0 GiB | 4 | | 12000 GiB | High | Yes | Yes
d2.4xlarge | 122.0 GiB | 8 | | 24000 GiB | High | Yes | Yes
d2.8xlarge | 244.0 GiB | 18 | | 48000 GiB | 10 Gigabit | Yes | Yes
r3.large | 15.25 GiB | 1 | | 32 GiB | Moderate | Yes | Yes
r3.xlarge | 30.5 GiB | 2 | | 80 GiB | Moderate | Yes | Yes
r3.2xlarge | 61.0 GiB | 4 | | 160 GiB | High | Yes | Yes
r3.4xlarge | 122.0 GiB | 8 | | 320 GiB | High | Yes | Yes
r3.8xlarge | 244.0 GiB | 16 | | 640 GiB | 10 Gigabit | Yes | Yes
i3.large | 15.25 GiB | 1 | | 475 GiB | Up to 10 Gigabit | Yes | Yes
i3.xlarge | 30.5 GiB | 2 | | 950 GiB | Up to 10 Gigabit | Yes | Yes
i3.2xlarge | 61.0 GiB | 4 | | 1900 GiB | Up to 10 Gigabit | Yes | Yes
i3.4xlarge | 122.0 GiB | 8 | | 3800 GiB | Up to 10 Gigabit | Yes | Yes
i3.8xlarge | 244.0 GiB | 16 | | 7600 GiB | 10 Gigabit | Yes | Yes
i3.16xlarge | 488.0 GiB | 32 | | 15200 GiB | 25 Gigabit | Yes | Yes
c3.large | 3.75 GiB | 1 | | 32 GiB | Moderate | Yes | Yes
c3.xlarge | 7.5 GiB | 2 | | 80 GiB | Moderate | Yes | Yes
c3.2xlarge | 15.0 GiB | 4 | | 160 GiB | High | Yes | Yes
c3.4xlarge | 30.0 GiB | 8 | | 320 GiB | High | Yes | Yes
c3.8xlarge | 60.0 GiB | 16 | | 640 GiB | 10 Gigabit | Yes | Yes
r4.large | 15.25 GiB | 1 | | | Up to 10 Gigabit | Yes | Yes
r4.xlarge | 30.5 GiB | 2 | | | Up to 10 Gigabit | Yes | Yes
r4.2xlarge | 61.0 GiB | 4 | | | Up to 10 Gigabit | Yes | Yes
r4.4xlarge | 122.0 GiB | 8 | | | Up to 10 Gigabit | Yes | Yes
r4.8xlarge | 244.0 GiB | 16 | | | 10 Gigabit | Yes | Yes
r4.16xlarge | 488.0 GiB | 32 | | | 25 Gigabit | Yes | Yes
p3.2xlarge | 61.0 GiB | 4 | 1 | | Up to 10 Gigabit | Yes | Yes
p3.8xlarge | 244.0 GiB | 16 | 4 | | 10 Gigabit | Yes | Yes
p3.16xlarge | 488.0 GiB | 32 | 8 | | 25 Gigabit | Yes | Yes
x1.16xlarge | 976.0 GiB | 32 | | 1920 GiB | 10 Gigabit | Yes | Yes
x1.32xlarge | 1952.0 GiB | 64 | | 3840 GiB | 25 Gigabit | Yes | Yes
x1e.32xlarge | 3904.0 GiB | 64 | | 3840 GiB | 25 Gigabit | Yes | No
t2.large | 8.0 GiB | 1 | | | Low to Moderate | No | No
t2.xlarge | 16.0 GiB | 2 | | | Moderate | No | No
t2.2xlarge | 32.0 GiB | 4 | | | Moderate | No | No
p2.xlarge | 61.0 GiB | 2 | 1 | | High | Yes | Yes
p2.8xlarge | 488.0 GiB | 16 | 8 | | 10 Gigabit | Yes | Yes
p2.16xlarge | 732.0 GiB | 32 | 16 | | 25 Gigabit | Yes | Yes
g3.4xlarge | 122.0 GiB | 8 | 1 | | Up to 10 Gigabit | Yes | Yes
g3.8xlarge | 244.0 GiB | 16 | 2 | | 10 Gigabit | Yes | Yes
g3.16xlarge | 488.0 GiB | 32 | 4 | | 25 Gigabit | Yes | Yes
m4.large | 8.0 GiB | 1 | | | Moderate | Yes | Yes
m4.xlarge | 16.0 GiB | 2 | | | High | Yes | Yes
m4.2xlarge | 32.0 GiB | 4 | | | High | Yes | Yes
m4.4xlarge | 64.0 GiB | 8 | | | High | Yes | Yes
m4.10xlarge | 160.0 GiB | 20 | | | 10 Gigabit | Yes | Yes
m4.16xlarge | 256.0 GiB | 32 | | | 25 Gigabit | Yes | Yes
c4.large | 3.75 GiB | 1 | | | Moderate | Yes | Yes
c4.xlarge | 7.5 GiB | 2 | | | High | Yes | Yes
c4.2xlarge | 15.0 GiB | 4 | | | High | Yes | Yes
c4.4xlarge | 30.0 GiB | 8 | | | High | Yes | Yes
c4.8xlarge | 60.0 GiB | 18 | | | 10 Gigabit | Yes | Yes
c5.large | 4.0 GiB | 1 | | | Up to 10 Gbps | Yes | Yes
c5.xlarge | 8.0 GiB | 2 | | | Up to 10 Gbps | Yes | Yes
c5.2xlarge | 16.0 GiB | 4 | | | Up to 10 Gbps | Yes | Yes
c5.4xlarge | 32.0 GiB | 8 | | | Up to 10 Gbps | Yes | Yes
c5.9xlarge | 72.0 GiB | 18 | | | 10 Gigabit | Yes | Yes
c5.18xlarge | 144.0 GiB | 36 | | | 25 Gigabit | Yes | Yes

Similarly, there are multiple storage services [11] to choose from, as shown in this chart:

Among these services, AWS EBS volumes can be used for persistent local storage. Data in EBS volumes cannot be shared between instances unless exported as an NFS file system from the instance the EBS volume is mounted on. The EBS bandwidth has limits depending on the volume and instance sizes. EFS is an NFS file system; thus, the data is available to one or more EC2 instances and across multiple Availability Zones (AZs) in the same region. S3 is a scalable and durable object-based storage system; data in S3 can be made accessible from any internet location. Point-in-time snapshots of EBS and EFS volumes can be taken and stored in S3. AWS Glacier provides long-term archival storage.


The AWS services are available in more than 16 regions around the world, each with a few AZs. Each region is completely independent. The two regions closest to the HECC facility are US West (N. California) and US West (Oregon). There is also a region marked as AWS GovCloud (US), which is designed to address stringent U.S. government security and compliance requirements. A region has multiple AZs. Each AZ is isolated, but the AZs in a region are connected through low-latency links.

Pricing
Usage on AWS is charged based on three fundamental characteristics: compute, storage, and data transfer out. There is no charge for inbound data transfer or for data transfer between other Amazon Web Services within the same region. In addition, AWS offers four different customer support plans [12]: basic, developer, business, and enterprise. The basic support plan is included but provides no access to technical support resources.

Cost of compute instances
There are four purchase types for AWS EC2 instances:

• On-demand instances: full price. The customer pays for compute capacity by the hour with no long-term commitments. Amazon may change the price once a year.

• Reserved instances: discounted rate from full price. Instances are required to be reserved for 1–3 years. Customers may get volume discounts of up to 10% when reserving more.

• Spot instances: bid price. The customer bids for unused instances, and prices fluctuate about every 5 minutes. Note that it is possible for the spot price to go above the on-demand price. Spot instances can be interrupted by EC2 with 2 minutes of notification when EC2 needs the capacity back.

• Dedicated hosts: on-demand price for instances on physical servers dedicated for your use.

Prices [13] vary with AWS region, OS (categorized as Amazon Linux, RHEL, SLES, Windows, etc.; RHEL and SLES are more expensive than Amazon Linux), number of cores, memory, and other factors. Usage is billed in one-second increments, with a minimum of 60 seconds.
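
As an illustration of this billing rule, the short Python sketch below (illustrative only; the sample hourly rate is an assumption for the example, not a quoted figure) computes the charge for a run:

    # Illustrative sketch of AWS per-second billing with a 60-second minimum.
    def instance_charge(seconds_used, hourly_rate):
        billable_seconds = max(seconds_used, 60)   # 60-second minimum applies
        return billable_seconds * hourly_rate / 3600.0

    # e.g., a 45-second run at a sample rate of $1.591/hour is billed as 60 s
    print(f"${instance_charge(45, 1.591):.4f}")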

The following three charts show sample pricing for a c4.8xlarge instance with the Linux OS in three regions: (i) GovCloud (US), (ii) US West (N. California), and (iii) US West (Oregon). In each chart, reserved and on-demand pricing for a standard 1-year term and a convertible 1-year term are shown. The convertible 1-year term allows the flexibility to change to different instance families, OS, etc. As seen in these charts, pricing for the US West (Oregon) region is lower than for the GovCloud (US) region, while the US West (N. California) region is the most expensive of the three.


Pricing for GovCloud (US)

Pricing for US West (N. California)

Pricing for US West (Oregon)


Note that spot instances are not offered in GovCloud (US). Pricing for spot instances in the AWS public cloud changes frequently. The two charts below show sample pricing for a c4.8xlarge instance and a c5.18xlarge instance in the US West (Oregon) region over a 1-month period. When demand is lowest, one can get a deep discount using spot instances versus on-demand instances. The tricky part of the game is to predict when demand will be low.

• Spot pricing for a c4.8xlarge instance

• Spot pricing for a c5.18xlarge instance

Cost of GPU instances
Current AWS Accelerated Computing instances [14] include the Intel Xeon E5-2686 v4 (Broadwell) CPU and various GPUs. Pricing in the following chart is on a per-hour basis:


Cost for storage space
The charge for storage is based on the amount allocated, not the amount actually used. Pricing is tiered; it is cheaper per gigabyte when more storage is allocated. If no long-term storage such as S3 is needed, the options are EBS and EFS:

EBS
This chart [15] from AWS shows the description, characteristics, and pricing of various EBS volume types.

EFS
AWS EFS functions as a shared file system, which provides a common data source for workloads and applications running on more than one EC2 instance. EFS file systems can be mounted on on-premises servers to migrate data over to EFS. This enables cloud-bursting scenarios.

With AWS EFS, a customer pays only for the amount of file system storage used per month. There is no minimum fee and there are no set-up charges. There are no charges for bandwidth or requests. Pricing for EFS storage in US West (Oregon) is $0.30/GB-month [16]. There is also a charge of $0.01/GB for syncing data to EFS.

Note: getting accurate usage accounting per user is going to be extremely difficult in a shared environment. Thus, the cost of storage space is probably best treated as an overhead of the operation.

Cost for transferring data
In most cases, AWS only charges for transferring data out of an AWS service.

AWS websites only list data transfer pricing [17] for the AWS GovCloud (US) region. This chart shows sample pricing for transferring data from AWS S3 to the internet:

There are also costs for data transfers between EC2 instances, between AZs in the same region, and between different AWS regions [18].

Outbound data transfer is aggregated across multiple Amazon services, such as Amazon EC2, Amazon S3, etc., and then charged at the outbound data transfer rate. This charge appears on the monthly statement as "AWS Data Transfer Out".

Also, outbound data transfer costs are reduced when going over a direct connection into a customer site. There is the cost of having the direct connect [19], as well as a small charge for all data going over the direct connect. The benefit of using a direct connect is small in terms of cost but huge in terms of security. There is a 1-Gbps direct connection between NASA's CSSO/EMCC and AWS.

The cost for outbound data transfer is generally small enough that it is better included as overhead of the operation.


Appendix II – Penguin On Demand (POD)

Resource Overview
Penguin Computing offers on-demand public cloud (i.e., POD) and private cloud resources [20]. POD uses bare-metal (non-virtualized), InfiniBand- and Omni-Path-based clusters in two locations, referred to as MT1 and MT2, which are accessible through a single POD Portal. Each location has its own localized storage. There are high-speed interconnects to facilitate easy migration of data from one location to another.

The MT1 location provides Intel Westmere, Sandy Bridge, Haswell, and Sandy Bridge + Nvidia K40 processors with an InfiniBand QDR 40-Gbps interconnect and high-speed Network Attached Storage volumes. No storage quota is set by default.

The MT2 location provides Intel Broadwell, Skylake (released on March 12, 2018), and KNL processors, all with an Intel Omni-Path 100-Gbps interconnect and a Lustre file system. No storage quota is set by default.

POD also offers customized large file systems, such as GPFS, if needed.

Different login nodes can be created in each location; four choices of login nodes are available. POD also offers the Scyld Cloud Workstation, a remote, 3D-accelerated visualization solution for pre- and post-processing tasks on POD.

For the private cloud, the Penguin team will install and configure the cluster for the customer and take care of all operational and maintenance tasks. Such private clouds can take the form of a portion of POD's public cloud on a monthly or yearly basis, or be designed and hosted with a specific configuration tailored to meet the customer's needs.

Pricing
POD pricing information, except for their Intel KNL offering, is published online [21].

One free login node is allowed per management account (note that there could be multiple users under one management account) per POD location. Other types of login nodes are charged per server-hour.

For most compute resources, per-core-hour charging is advertised. Usage of the Nvidia K40 resources is charged by node-hour.


A user's home directory comes with 1 GB of free storage. Storage beyond 1 GB is charged $0.10 per GB-month. There is no charge for data transfer in and out of POD.

[email protected].

POD may offer discounts for government agencies, volume, long-term contracts, and dedicated resources [22]; thus, the published public prices of $0.09, $0.10, and $0.11 per core-hour for POD Haswell, Broadwell, and Skylake, respectively, and the $0.10 per GB-month storage cost could be reduced.
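
Applied to a full year of dedicated use, these published rates [21] imply the per-node annual costs sketched below (illustrative only; the cores-per-node figures are the queue configurations listed in Appendix IV):

    # Illustrative sketch: POD published per-core-hour rates applied to a
    # year of dedicated use, before any discounts [22].
    RATE = {"Haswell": 0.09, "Broadwell": 0.10, "Skylake": 0.11}  # $/core-hour
    CORES = {"Haswell": 20, "Broadwell": 28, "Skylake": 40}       # cores/node

    for proc in RATE:
        annual = RATE[proc] * CORES[proc] * 24 * 365
        print(f"{proc}: ${annual:,.0f} per node-year")
    # Haswell works out to $15,768/node-year, matching Appendix III.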


Appendix III – Performance and Cost

Cost Basis
See Section 2 for the cost basis used for this study.

Performance and Cost of Workloads
NPB Scaling Study
The NPBs are commonly used to check the scaling performance of a system. Runs were performed with the Broadwell processors offered by HECC (2.4 GHz, 28 cores, 128 GiB, 56 Gbps), AWS (2.3 GHz, 32 cores, 256 GiB, 10 Gbps), and POD (2.4 GHz, 28 cores, 256 GiB, 100 Gbps). Since many of the NPB benchmarks require 2^n MPI ranks and would fit nicely in a 32-core node, using the AWS 32-core Broadwell instances results in needing fewer AWS instances, and possibly fewer occurrences of MPI inter-node communication, than using the 28-core HECC or POD Broadwell nodes for some runs.
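
The node and instance counts in Tables 6 and 7 follow directly from this arithmetic; the short Python sketch below (illustrative only) reproduces the 1,024-rank case:

    # Illustrative sketch of the node-count arithmetic behind Tables 6 and 7:
    # nodes needed = MPI ranks / cores per node, rounded up.
    from math import ceil

    def nodes_needed(ranks, cores_per_node):
        return ceil(ranks / cores_per_node)

    # 1,024 ranks need 37 of HECC's 28-core Broadwell nodes but only 32 of
    # AWS's 32-core m4.16xlarge instances.
    print(nodes_needed(1024, 28), nodes_needed(1024, 32))  # 37 32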

To examine the scaling performance, runs using the eight Class C micro-benchmarks (BT, CG, EP, FT, IS, LU, MG, and SP) from a 16-CPU count up to a 1,024-CPU count were performed on HECC and AWS. The timing of each run was separated into two components: computation and communication. The study's results show that the scaling performance of the computation component, as represented by BT.C in the lower-left graph, is quite good on both the HECC and AWS Broadwell systems:

On the contrary, the scaling performance of the communication component on HECC and AWS Broadwell (as shown on the right above) is drastically different. Figure 2 shows the communication performance for all NPB Class C benchmarks on HECC Broadwell and AWS Broadwell. With the only exception being the EP "embarrassingly parallel" benchmark, the AWS communication times are substantially higher. These results clearly demonstrate the superiority of the HECC Broadwell with 56-Gbps InfiniBand versus the AWS Broadwell with a 25-Gbps network. Table 6 shows the Class C results in tabular form. The results for the Class D benchmarks for AWS in Table 7 show similar characteristics to the Class C results.


Figure 2: NPB Class C, time spent in communication. [Eight log-scale panels, one per benchmark (BT.C, CG.C, FT.C, IS.C, LU.C, MG.C, SP.C, and EP.C), plot time (sec) against NCPUS up to 1,024 for Electra-Bro and AWS-m4.16xlarge.]


On the POD Broadwell system, the story is somewhat different. The Class D results in Table 7 show that the POD Broadwell system with a 100-Gbps network performs quite well. For some benchmarks, such as FT.D and IS.D, POD performs better than HECC. For others, such as EP.D and LU.D, the reverse is true. It should be noted that of the two POD locations available for testing, the MT2 site offers resources with a 100-Gbps Omni-Path interconnect. The NPB Class D performance on POD MT2 Broadwell nodes is closer to the performance on HECC's Broadwell nodes, which have a 4x-FDR (56-Gbps) InfiniBand interconnect (see Table 7). An additional factor is that the HPE MPT MPI library used on HECC scales better than the Intel MPI on POD.

Table 6: Scaling Runs of NPB Class C on Broadwell.

Benchmark | NCPUS | # HECC Broadwell nodes | # AWS m4.16xlarge instances | HECC time (sec) | AWS time (sec)
bt.C | 16 | 1 | 1 | 83.33 | 90.14
bt.C | 25 | 1 | 1 | 52.52 | 61.13
bt.C | 36 | 2 | 2 | 32.81 | 41.51
bt.C | 64 | 3 | 2 | 17.66 | 22.72
bt.C | 121 | 5 | 4 | 8.89 | 19.63
bt.C | 256 | 10 | 8 | 4.87 | 10.86
bt.C | 484 | 18 | 16 | 2.60 | 25.54
bt.C | 1024 | 37 | 32 | 1.58 | 17.14
cg.C | 16 | 1 | 1 | 20.62 | 21.73
cg.C | 32 | 2 | 1 | 7.32 | 9.71
cg.C | 64 | 3 | 2 | 4.37 | 6.87
cg.C | 128 | 5 | 4 | 2.32 | 3.72
cg.C | 256 | 10 | 8 | 1.45 | 2.71
cg.C | 512 | 19 | 16 | 1.05 | 1.74
cg.C | 1024 | 37 | 32 | 1.08 | 1.69
ep.C | 16 | 1 | 1 | 5.54 | 6.11
ep.C | 32 | 2 | 1 | 2.77 | 3.09
ep.C | 64 | 3 | 2 | 1.38 | 1.57
ep.C | 128 | 5 | 4 | 0.69 | 0.84
ep.C | 256 | 10 | 8 | 0.38 | 0.43
ep.C | 512 | 19 | 16 | 0.22 | 0.21
ep.C | 1024 | 37 | 32 | 0.13 | 0.11
ft.C | 16 | 1 | 1 | 17.80 | 16.54
ft.C | 32 | 2 | 1 | 9.45 | 11.29
ft.C | 64 | 3 | 2 | 5.35 | 9.81
ft.C | 128 | 5 | 4 | 3.01 | 19.09
ft.C | 256 | 10 | 8 | 1.79 | 12.28
ft.C | 512 | 19 | 16 | 0.92 | 4.27
ft.C | 1024 | 37 | 32 | 0.86 | 3.06
is.C | 16 | 1 | 1 | 1.34 | 1.22
is.C | 32 | 2 | 1 | 0.72 | 0.77
is.C | 64 | 3 | 2 | 0.47 | 0.96
is.C | 128 | 5 | 4 | 0.31 | 1.48
is.C | 256 | 10 | 8 | 0.17 | 0.65
is.C | 512 | 19 | 16 | 0.14 | 1.02
is.C | 1024 | 37 | 32 | 0.19 | 1.93
lu.C | 16 | 1 | 1 | 50.11 | 46.82
lu.C | 32 | 2 | 1 | 27.07 | 32.66
lu.C | 64 | 3 | 2 | 14.08 | 18.60
lu.C | 128 | 5 | 4 | 8.06 | 24.44
lu.C | 256 | 10 | 8 | 3.74 | 26.19
lu.C | 512 | 19 | 16 | 1.93 | 37.26
lu.C | 1024 | 37 | 32 | 1.20 | 50.81
mg.C | 16 | 1 | 1 | 6.75 | 4.31
mg.C | 32 | 2 | 1 | 3.20 | 4.19
mg.C | 64 | 3 | 2 | 1.49 | 2.08
mg.C | 128 | 5 | 4 | 0.79 | 1.46
mg.C | 256 | 10 | 8 | 0.45 | 0.78
mg.C | 512 | 19 | 16 | 0.22 | 0.67
mg.C | 1024 | 37 | 32 | 0.13 | 0.39
sp.C | 16 | 1 | 1 | 114.10 | 71.34
sp.C | 25 | 1 | 1 | 64.84 | 62.23
sp.C | 36 | 2 | 2 | 38.91 | 48.28
sp.C | 64 | 3 | 2 | 17.92 | 23.38
sp.C | 121 | 5 | 4 | 9.43 | 30.61
sp.C | 256 | 10 | 8 | 4.93 | 47.47
sp.C | 484 | 18 | 16 | 2.81 | 48.36
sp.C | 1024 | 37 | 32 | 2.02 | 31.18

In addition to the timing comparison, Table 7 also includes the cost of running the benchmarks on Broadwell nodes. On the HECC Broadwell nodes, the sum of the full-blown costs of the benchmarks is $2.81. This is ~4.7x cheaper than the compute-only cost on POD of $13.19. Of the four AWS costs listed [(i) the US West (Oregon) on-demand price, (ii) the US West (Oregon) spot price sampled as 30% of the on-demand price, (iii) the GovCloud on-demand price, and (iv) the GovCloud pre-leasing price sampled as 70% of the GovCloud on-demand price], the lowest is the compute-only US West (Oregon) spot price of $16.42, which is still 5.8x the HECC full-blown cost of $2.81. The other three AWS costs are all even more expensive.
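
The sampled estimates in Table 7 are simple scalings of the summed on-demand costs; the following sketch (illustrative only) reproduces them:

    # Illustrative sketch of the sampled AWS cost estimates in Table 7:
    # spot is 30% of the Oregon on-demand total, pre-leasing is 70% of the
    # GovCloud on-demand total.
    oregon_on_demand, govcloud_on_demand, hecc_full = 54.72, 68.94, 2.81

    spot_estimate = 0.30 * oregon_on_demand        # ~$16.42
    prelease_estimate = 0.70 * govcloud_on_demand  # ~$48.26
    print(f"spot ~${spot_estimate:.2f}, {spot_estimate / hecc_full:.1f}x HECC")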

A further comparison was performed between the HECC Skylake with a 100-Gbps network and the AWS Skylake with a 25-Gbps network. The performance and cost results using selected NPB Class D benchmarks are detailed in Table 8. The lowest AWS Skylake total compute-only cost of $18.29, based on a sample spot price, is about 12x more expensive than the $1.50 HECC full-blown cost. Examined individually, some of the benchmarks, such as CG.D and FT.D, perform worse on AWS Skylake than on AWS Broadwell. This indicates that the new AWS Skylake installation may need some tuning, or that there was something wrong with the study's runs.

In conclusion, this NPB study among HECC, AWS, and POD provides a strong argument that the HECC systems not only provide good scaling performance but are also more cost effective than existing AWS and POD offerings.

Table 7: Scaling Runs of NPB Class D on Broadwell.

Benchmark | NCPUS | # HECC/POD Broadwell nodes | # AWS m4.16xlarge instances | HECC time (sec) | HECC full cost | AWS time (sec) | AWS Oregon compute cost | AWS Gov compute cost | POD time (sec) | POD compute cost
bt.D | 256 | 10 | 8 | 104.03 | $0.19 | 176.23 | $1.25 | $1.58 | 117.72 | $0.92
bt.D | 1024 | 37 | 32 | 31.34 | $0.21 | 151.07 | $4.30 | $5.41 | 50.58 | $1.46
cg.D | 256 | 10 | 8 | 71.12 | $0.13 | 230.79 | $1.64 | $2.07 | 56.96 | $0.44
cg.D | 512 | 19 | 16 | 44.98 | $0.15 | 130.84 | $1.86 | $2.34 | 33.91 | $0.50
cg.D | 1024 | 37 | 32 | 45.18 | $0.30 | 127.95 | $3.64 | $4.59 | 33.93 | $0.98
cg.D | 2048 | 74 | 64 | 30.73 | $0.41 | 140.17 | $7.97 | $10.05 | 25.88 | $1.49
ep.D | 256 | 10 | 8 | 5.64 | $0.01 | 6.86 | $0.05 | $0.06 | 6.32 | $0.05
ep.D | 512 | 19 | 16 | 2.82 | $0.01 | 5.91 | $0.08 | $0.11 | 3.25 | $0.05
ep.D | 1024 | 37 | 32 | 1.43 | $0.01 | 2.88 | $0.08 | $0.10 | 3.16 | $0.09
ep.D | 2048 | 74 | 64 | 0.73 | $0.01 | 1.46 | $0.08 | $0.10 | 1.11 | $0.06
ft.D | 256 | 10 | 8 | 58.54 | $0.11 | 357.7 | $2.54 | $3.20 | 38.52 | $0.30
ft.D | 512 | 19 | 16 | 36.46 | $0.09 | 194.89 | $2.77 | $3.49 | 19.98 | $0.30
is.D | 256 | 10 | 8 | 6.67 | $0.01 | 37.02 | $0.26 | $0.33 | 3.55 | $0.03
is.D | 512 | 19 | 16 | 4.22 | $0.01 | 21.07 | $0.30 | $0.38 | 3.54 | $0.05
lu.D | 256 | 10 | 8 | 68.15 | $0.12 | 135.26 | $0.96 | $1.21 | 75.07 | $0.58
lu.D | 512 | 19 | 16 | 40.32 | $0.14 | 170.15 | $2.42 | $3.05 | 48.18 | $0.71
lu.D | 1024 | 37 | 32 | 23.46 | $0.16 | 174.14 | $4.95 | $6.24 | 42.19 | $1.21
lu.D | 2048 | 74 | 64 | 15.37 | $0.20 | 153.57 | $8.74 | $11.01 | 19.70 | $1.13
mg.D | 256 | 10 | 8 | 8.89 | $0.02 | 22.16 | $0.16 | $0.20 | 8.93 | $0.07
mg.D | 512 | 19 | 16 | 4.47 | $0.02 | 12.6 | $0.18 | $0.23 | 4.29 | $0.06
mg.D | 1024 | 37 | 32 | 2.81 | $0.02 | 10.21 | $0.29 | $0.37 | 4.26 | $0.12
mg.D | 2048 | 74 | 64 | 1.42 | $0.02 | 7.1 | $0.40 | $0.51 | 2.13 | $0.12
sp.D | 256 | 10 | 8 | 112.29 | $0.20 | 283.02 | $2.01 | $2.54 | 117.81 | $0.92
sp.D | 1024 | 37 | 32 | 39.82 | $0.26 | 272.82 | $7.76 | $9.78 | 53.73 | $1.55

Total cost: HECC $2.81; AWS Oregon $54.72; AWS Gov $68.94; POD $13.19
Estimated AWS spot cost (30% of on-demand cost): $16.42
Estimated AWS pre-leasing cost (70% of US-Gov cost): $48.26


NTR1/SBU1/SBU2 Workload Performance and Cost Comparison
In addition to NPB, a performance and cost comparison was also made for a few full-sized applications, including ATHENA++, ECCO, Enzo, FVCore, NU-WRF, and OpenFOAM. Table 9 shows the results for the HECC and AWS Haswell systems.

As shown in that table, the HECC full-blown cost was lower than the AWS Oregon compute-only on-demand cost for each of the applications tested. Even if one were able to get the low 30% compute-only spot price, the HECC full-blown cost would still be cheaper. Using the sums, there is a cost ratio of 1.9x ($141.77 AWS compute-only cost versus $76.18 HECC full-blown cost).

Finding an AWS full-blown cost on a per-job basis is not possible. An estimate of the extra cost on top of the compute instance cost was made based on the reported total cost for December 2017 AWS usage. Of the $1,944 charged for December 2017, $136 was spent on disk space usage, $738 on having the front-end available at $4.256 per hour, and less than $2 on data transfer. The rest, $1,068, was spent on various compute instances. Therefore, there is ~82% ($1,944/$1,068 = 1.82) extra cost on top of the compute instance cost. Similarly, for the January 2018 AWS usage, of the $1,654 charged, $137 was spent on disk space usage, $665 on the front-end, $12 on data transfer, and $840 on compute instances. Thus, there was ~97% ($1,654/$840 = 1.97) extra cost on top of the compute instance cost. There is also additional CSSO/EMCC overhead, which has not yet been reported and thus is not included in this calculation.
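
The overhead factors quoted above follow from dividing the total monthly charge by its compute-instance portion, as this illustrative sketch shows:

    # Illustrative sketch of the overhead-factor arithmetic above.
    months = {"Dec 2017": (1944, 1068), "Jan 2018": (1654, 840)}

    for month, (total, compute) in months.items():
        print(f"{month}: {total / compute:.2f}x compute "
              f"(~{(total / compute - 1) * 100:.0f}% extra)")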

Table 9: All Application Benchmark Results for HECC and AWS.

Benchmark | Case | NCPUS | # HECC Haswell nodes | # AWS c4.8xlarge (Haswell) instances | HECC time (sec) | HECC full cost | AWS time (sec) | AWS Oregon compute cost | AWS Gov compute cost
ATHENA++ | SBU2 | 1024 | 43 | 57 | 2268 | $14.48 | 2298 | $57.89 | $69.68
ATHENA++ | SBU2 | 2048 | 86 | 114 | 1177 | $15.03 | 1374 | $69.22 | $83.32
ECCO | NTR1 | 120 | 5 | 7 | 120 | $0.09 | 173 | $0.54 | $0.64
ECCO | NTR1 | 240 | 10 | 14 | 65 | $0.10 | 140 | $0.87 | $1.04
ENZO | SBU2 | 196 | 9 | 11 | 1827 | $2.44 | 2266 | $11.02 | $13.26
FVCore | SBU1 | 1176 | 49 | 66 | 1061 | $7.72 | 1104 | $32.20 | $38.76
nuWRF | SBU2 | 1700 | 71 | 95 | 529 | $5.58 | 1302 | $54.66 | $65.80
OpenFOAM | Channel395 | 48 | 2 | 3 | 4759 | $1.41 | 7646 | $10.14 | $12.20
OpenFOAM | Channel395 | 144 | 6 | 8 | 12547 | $11.17 | 20771 | $73.44 | $88.39
OpenFOAM | Channel395 | 288 | 12 | 16 | 10194 | $18.16 | 23013 | $162.73 | $195.87

Total cost: HECC $76.18; AWS Oregon $472.71; AWS Gov $568.96
Estimated AWS spot cost (30% of on-demand cost): $141.77
Estimated AWS pre-leasing cost (70% of US-Gov cost): $398.27

Table 8: Scaling Runs of NPB Class D on Skylake.

Benchmark | NCPUS | # HECC Skylake nodes | # AWS c5.18xlarge (Skylake) instances | HECC time (sec) | HECC full cost | AWS time (sec) | AWS Oregon compute cost
bt.D | 256 | 7 | 8 | 79.83 | $0.16 | 146.73 | $1.00
bt.D | 1024 | 26 | 29 | 20.52 | $0.15 | 92.08 | $2.27
cg.D | 256 | 7 | 8 | 34.3 | $0.07 | 614.2 | $4.18
cg.D | 512 | 13 | 15 | 17.24 | $0.06 | 650.58 | $8.29
cg.D | 1024 | 26 | 29 | 13.51 | $0.10 | 719.22 | $17.73
ep.D | 256 | 7 | 8 | 4.82 | $0.01 | 4.93 | $0.03
ep.D | 512 | 13 | 15 | 2.44 | $0.01 | 2.46 | $0.03
ep.D | 1024 | 26 | 29 | 1.24 | $0.01 | 1.19 | $0.03
ft.D | 256 | 7 | 8 | 27.41 | $0.05 | 526.08 | $3.58
ft.D | 512 | 13 | 15 | 14.84 | $0.05 | 349.26 | $4.45
ft.D | 1024 | 26 | 29 | 8.05 | $0.06 | 243.06 | $5.99
is.D | 256 | 7 | 8 | 2.64 | $0.01 | 71.5 | $0.49
is.D | 512 | 13 | 15 | 1.45 | $0.01 | 48.28 | $0.62
is.D | 1024 | 26 | 29 | 0.84 | $0.01 | 44.72 | $1.10
lu.D | 256 | 7 | 8 | 57.58 | $0.11 | 121.6 | $0.83
lu.D | 512 | 13 | 15 | 32.37 | $0.12 | 106.79 | $1.36
lu.D | 1024 | 26 | 29 | 17.17 | $0.13 | 99.3 | $2.45
mg.D | 256 | 7 | 8 | 8.18 | $0.02 | 39.99 | $0.27
mg.D | 512 | 13 | 15 | 3.36 | $0.01 | 15.56 | $0.20
mg.D | 1024 | 26 | 29 | 1.82 | $0.01 | 16.44 | $0.41
sp.D | 256 | 7 | 8 | 101.52 | $0.20 | 211.48 | $1.44
sp.D | 1024 | 26 | 29 | 20.32 | $0.15 | 172.02 | $4.24

Total cost: HECC $1.50; AWS Oregon $60.98
Estimated AWS spot cost (30% of on-demand cost): $18.29

For the comparison between HECC (with HPE MPT) and POD (with Intel MPI), only Enzo and WRF were tested. The results are summarized in Table 5 of Section 3. The compute-only costs on POD were 5.3x higher than the full-blown HECC costs. Note that the storage costs incurred on POD were minimal during this evaluation period. Given that the POD login node and file transfer will likely be free and there will be volume discounts for storage, it is expected that the extra cost on top of compute in a production environment would be smaller with POD than with AWS.

In conclusion, running the MPI workloads on AWS or POD is not cost effective compared to running them on the HECC systems.

Cost Comparison: Pre-leasing versus On-Premises for 1-Node Jobs
Section 4 raised the possibility that there could be a cost advantage in moving some one-node jobs to the cloud. To compare the annual cost of using in-house resources versus commercial cloud offerings for some of the 1-node jobs, a resource of 144 nodes was assumed.

Cost of using HECC in-house resources
For existing HECC resources, the full-blown cost is based on an SBU rate of $0.16 and an SBU factor. The following table summarizes the annual cost of 144 nodes for 3 processor types:

Processor Type | SBU Factor | Annual Cost
Haswell | 3.34 | $674,114
Broadwell | 4.04 | $815,395
Skylake | 6.36 | $1,283,641

Cost for pre-leasing 144 AWS compute instances, 1 front-end, and 100-TB storage
The c4.8xlarge instances (with 18 cores and 60 GiB of memory per instance) are used in the cost estimate. Other types of instances, such as c5.18xlarge or m4.16xlarge, would cost more. The cost of data transfer is not counted.

The compute-only cost varies among the three regions:

• Cost for leasing instances from the GovCloud (US) region, for 1 instance:
  No Upfront: $883.30 × 12 months = $10,599.60
  Partial Upfront: $5,055 + $421.21 × 12 months = $10,109.52
  All Upfront: $9,907
  With 144 instances for 1 year, All Upfront, the cost is $1,426,608.


• Cost for leasing instances from the US West (N. California) region, for 1 instance:
  No Upfront: $993.53 × 12 months = $11,922.36
  Partial Upfront: $5,690 + $474.14 × 12 months = $11,379.68
  All Upfront: $11,152
  With 144 instances for 1 year, All Upfront, the cost is $1,605,888.

• Cost for leasing instances from the US West (Oregon) region, for 1 instance:
  No Upfront: $735.84 × 12 months = $8,830.08
  Partial Upfront: $4,231 + $352.59 × 12 months = $8,462.08
  All Upfront: $8,293
  With 144 instances for 1 year, All Upfront, the cost is $1,194,192.

Cost for 1 front-end running
In the test environment, an r4.16xlarge on-demand instance (at $4.256 per hour, with 32 cores, 488 GiB of memory, and a 25-Gbps interconnect) is used as a front-end. Having a front-end running for 1 year will cost $4.256 × 365 × 24 = $37,282.56.

Cost for storage space
The charge for storage is based on the amount allocated, not the amount actually used.

• AWS EBS gp2 volumes are used in the test environment. A capacity of 100 TB using gp2 volumes for a production environment will cost:
  Sample charge for provisioning 100,000 GB (100 TB) of EBS gp2 volumes:
    For 1 month: $0.10 × 100,000 GB = $10,000/month
    For 1 year: $0.10 × 100,000 GB × 12 months = $120,000/year

• If EFS is used instead of EBS, the cost is:
  Sample charge for storing 100 TB of data on EFS:
    $0.30/GB-month × 100,000 GB = $30,000/month
    $30,000/month × 12 months = $360,000/year

Total Cost
The minimum annual cost of the compute, front-end, and storage combined will be $1,194,192 + $37,282 + $120,000 = $1,351,474.
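
The following illustrative sketch totals the three components using the figures above:

    # Illustrative sketch of the minimum AWS pre-leasing total: 144
    # all-upfront c4.8xlarge instances (Oregon), one r4.16xlarge front-end,
    # and 100 TB of EBS gp2 storage.
    compute = 144 * 8_293                 # all-upfront compute, Oregon
    front_end = 4.256 * 24 * 365          # on-demand front-end for a year
    storage = 0.10 * 100_000 * 12         # EBS gp2 at $0.10/GB-month
    print(f"${compute + front_end + storage:,.0f}")
    # ~$1,351,475; the $1,351,474 in the text truncates the front-end cost.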

Cost for pre-leasing 144 POD compute instances, 1 login node, and 100 TB storage
If the use of a free login node is adequate for the 1-node job workload, the cost will include just the 144 compute nodes and the storage. The POD Haswell nodes (20 cores and 128 GiB of memory per node) are used in the cost estimate. Possible discounts (government, volume, long-term contract, or dedicated resources) for both compute and storage resources are not reflected in the estimate.

• Cost for pre-leasing 144 Haswell nodes for 1 year:
  1 node: 20 cores/node × $0.09/core-hour × 24 × 365 = $15,768
  144 nodes: $15,768 × 144 = $2,270,592
• Cost for allocating 100 TB of storage for 1 year:
  $0.10/GB-month × 100,000 GB × 12 = $120,000

The total cost of the compute and storage combined will be $2,270,592 + $120,000 = $2,390,592.


Appendix IV – Usability Considerations
This section documents the study's experiences, besides performance and cost, in using AWS and POD.

Account Creation
AWS – Accounts for HECC staff who participated in the testing were created by an HECC staff member who has been acting as support staff for CSSO/EMCC's AWS cloud usage. The SSH public key of each user was copied from a Pleiades front-end node (PFE) to AWS to facilitate direct login from a PFE to an AWS front-end instance (nasfe01). In the event that accounts for many users are to be created, the administrators supporting NASA AWS should be able to handle it without user involvement.

POD – Through the POD web portal, one can request an individual account and be responsible for the cost. One can also get an account through an invitation from an existing user; in that case, the usage by the invitee will be charged to the inviter. After an account is created (which requires an email address and a 14-to-64-character password), one has to upload an SSH public key for use on MT1 and/or MT2 and choose storage and login node preferences through the web portal. In the event that there are many users from a site, POD staff can take care of sending email invitations to the individual users if an Excel spreadsheet with the users' email addresses is provided. POD staff can provide free training to users on how to use the web portal to create passwords and upload their SSH public keys.

Front-end
AWS – The nasfe01 front-end set up for this testing incurs a $4.256/hr charge. To reduce cost, there is regularly scheduled downtime for nasfe01. Restarting it can be done through a web control interface at https://gpmcepubweb.aws.nasa.gov/ using a username and password.

POD – Each management account can create its own login node. On MT1, a basic, free login node has 1 core and 256 MiB of memory. On MT2, the free login node has 1 core and 2 GiB of memory. Other, larger login nodes cost money, to prevent abuse. In the event that many users under a management account need a large login node, POD may be able to offer it for free [22].

Connection to either an AWS front-end or a POD login node from a user's desktop or a Pleiades front-end is through authentication with the user's SSH public key. The study's benchmarkers had an issue where SSH from HECC PFE nodes to the MT2 free login node would hang, but no issue with the MT1 free login node. It was resolved within a few days by "forcing the MTU to 1500 on POD's uplinks to make sure that anything leaving POD's network has a max MTU of 1500". Similarly, POD staff resolved another issue with scp between a PFE and the MT2 login node by adjusting some network settings.

Filesystem
AWS – AWS offers many storage choices (EBS, EFS, S3, Glacier, etc.) for various needs. /home and /nobackup file systems using EBS volumes were set up by HECC staff for this study. Storage cost is based on the size of the file system allocated for HECC, not the amount actually used. In production, hourly usage for each user would be recorded, and that could be used to allocate the overall storage costs if necessary.

POD – POD offers NAS storage on MT1 and Lustre on MT2. Storage is charged at $0.10 per GB per month. Volume discounts are available [22]. POD can also provide customized large file systems if needed.


Data Transfer
AWS – SSH from a PFE to nasfe01 works fine, and there is no issue with data transfer. Data transfer from a PFE to AWS is free and currently uses CSSO/EMCC's 1-Gbps direct connect. Outbound transfer from AWS is not free. The cost of individual outbound transfers is not tracked by CSSO/EMCC. However, the cost of outbound transfer is generally small enough that CSSO/EMCC includes it in the overhead cost. In the event that the data transfer cost is not small, it would need to be written off as HECC overhead or evenly split across all users.

POD – With a proper SSH public key, one can scp from a PFE to the MT1 and MT2 login nodes. There is no cost for data transfer.

Operating System
AWS – In the testing environment, Amazon Linux was chosen for testing. Other operating systems such as CentOS, Ubuntu, RHEL, SLES, and Windows are also supported by AWS, but they cost more.

POD – MT1 and MT2 are fixed with CentOS 6 and CentOS 7, respectively.

Software Stack
AWS – The test environment does not have any software packages pre-installed. Any software packages required for testing, such as the Intel compiler, netcdf, hdf4/5, Intel MPI, OpenMPI, Metis, and ParMetis, were installed by HECC staff.

POD – More than 250 commercial and third-party software packages have been pre-installed on the MT1 and MT2 clusters. There are small differences in what is available on MT1 and MT2. The POD technical support team will help with installing additional software packages upon request.

For commercial software packages (including the Intel compilers and MPI), either HECC or the individual users will have to provide licenses for use on either AWS or POD.

Batch Job Management
AWS
• The use of the spot_manager script, created by HECC staff, to (i) check pricing and what resources are available, (ii) request/use instances, (iii) terminate instances, and (iv) check the cost of a run, is a temporary solution for staff testing on the AWS cloud. Regular HECC users would have difficulty using the AWS cloud lacking a PBS-like job scheduler; there is no job ID associated with a "job".

POD
• The use of the PBS-like Moab scheduler and Torque resource manager provides a familiar environment for submitting, monitoring, and terminating batch jobs with the qsub, qstat, and qdel commands. Each job has a job ID. The POD-provided utility podstatus shows what resources are not being used, and podstart provides an estimate of when a submitted job will start running. Below is a summary of the available queues. The B30, S30, and Intel KNL queues are on MT2; the rest are on MT1.

  Queue      Compute Nodes                   Cores/Node  RAM/Node
  ----------------------------------------------------------------
  FREE       Free 5 minute, 24 core jobs     12          48GB
  M40        2.9GHz Intel Westmere           12          48GB
  H30        2.6GHz Intel Sandy Bridge       16          64GB
  T30        2.6GHz Intel Haswell            20          128GB
  H30G       H30 with two NVIDIA K40 GPUs    2           64GB
  S30        2.4GHz Intel Skylake            40          384GB
  B30        2.4GHz Intel Broadwell          28          256GB
  Intel KNL  1.3GHz Intel Xeon Phi           256         112GB

• Although POD advertises a per-core charge for all processor types except H30G, our attempt to request fewer cores than the per-node maximum failed for multi-node jobs. For example, for the Enzo 196 test case, we tried to request 14 cores from each of the 28-core Broadwell nodes and got an error. POD support confirmed that this is the intended behavior.

  #PBS -l nodes=14:ppn=14
  qsub error: ERROR: * ERROR: Multinode jobs in B30 must be ppn=28 (full node)

• For 1-node jobs, we verified that requesting fewer cores than the maximum in a node is allowed, and charging is based on the number of cores requested.

Ease of Getting the Resources
AWS – It is difficult to get the new Skylake instances without paying the on-demand price.

POD – Compute resources in most queues are frequently 100% taken. The Haswell nodes seem to be the most likely to be available. Access to the Skylake nodes was given on March 5, one week before the public release date of March 12, 2018.

Porting Experience
Porting MPI applications from HECC to either AWS or POD may run into difficulties in three different ways: (i) compiling/linking issues caused by many package dependencies, (ii) failure to run, and (iii) poor performance from unknown causes. Debugging each of these issues is non-trivial and puts a significant burden on the HECC support staff or the users. Below are some examples.

AWS
• Porting GEOS-5 (one of the SBU-2 applications) to AWS was difficult due to its many dependencies. The HPE MPT used at HECC is not available in the AWS cloud. Instead, one has to rebuild with Intel MPI. This will prevent a seamless migration of MPI applications to the cloud.

• Even for MPI applications that overcame the porting issues with Intel MPI on the AWS cloud, there were cases where some applications (such as GEOS-5) would fail with an Intel MPI error. Some applications (such as ATHENA++) would run with some core counts while failing with others. For example, using the AWS c4.8xlarge instances, the 1024-core case of ATHENA++ consistently runs fine, while the 512-core and 2048-core cases would sometimes fail with an Intel MPI error in socksm.c. On the other hand, using AWS c5.18xlarge instances, ATHENA++ runs with 512 cores but fails with 1024- and 2048-core counts.

• For some applications that ran properly, performance was not great. One example is cg.D and ft.D on AWS Skylake. Another example is WRF on AWS Haswell and Broadwell (not attempted on Skylake), which would take a few days to complete instead of the 1–2 hours of our earlier runs. After experimenting with mpiexec options, we were able to reduce the WRF run time on AWS Skylake significantly, but it still took more than 2x longer than the run on HECC's Skylake.

POD
• The fact that POD's software stack already includes many familiar commercial and third-party software packages helps reduce the amount of work needed during a porting process. For example, POD has provided ready-to-use WRF and OpenFOAM modules with OpenMPI. Some issues we encountered are:


• The POD-provided OpenFOAM module with OpenMPI does not provide good performance for our test case. Performance analysis with tools will be needed to understand this behavior.

• Building WRF with Intel MPI ourselves failed on the POD MT2 system but not on MT1. With help from POD technical staff, this issue was resolved by changing some hard-coded cpp commands.

State-of-the-Art Architecture
AWS – The new Skylake processors were released in early November 2017. Nvidia Tesla GPGPU processors K80, M60, and V100 are available.

POD – POD lags behind AWS in providing state-of-the-art architectures. The Skylake processors were released for public use on March 12, 2018. Intel KNL and Nvidia Tesla K40 are available, but POD currently has no time frame for offering newer Nvidia GPGPU processors.

Cloud Bursting Readiness
The AWS and POD testing systems used for this study do not yet have a mechanism in place to allow seamless migration of jobs from a data center to their clouds. A few vendors have solutions for cloud bursting. For example, Altair PBS has cloud-bursting beta testing with Microsoft Azure. Adaptive Computing advertises availability of their Moab/NODUS cloud bursting solution on AWS, Google Cloud, AliCloud, DigitalOcean, and others [23]. When this technology matures, there will be many issues to work out on both the HECC and cloud ends. In the meantime, a job can be migrated manually if it is set up properly to run under both the HECC and the cloud environments. For example, the NASA NEX group has started using Docker to build "containers" for their jobs that can be run on either HECC or AWS. However, building such containers on a system requires root privileges. Granting root access to general users on HECC systems for building the containers is impossible due to security concerns.

Technical Support
AWS – CSSO/EMCC has a support contract with AWS. We do not know the cost of this contract.

POD – Technical support is available 6 AM to 5 PM PDT. Requests for help are made by email to [email protected]. There is also an online support portal (https://penguincomputing.my.salesforce.com/secur/login_portal.jsp?orgId=00D300000000y1I&portalId=06050000000D4C9) where one can open and track support cases.

Usage/Cost Tracking System
AWS – In the test environment, after stopping a job, the spot_manager script returns the compute cost associated with the run. An HECC staff member also has access to historical data where the cost can be looked up based on the date/time and instance type. There is also a monthly cost report where the total charge is separated into compute instances, front-end, storage, etc. Note that the monthly charge does not include the CSSO/EMCC overhead and the AWS support contract.

POD – The POD web portal provides a "My Account Usage" page for each user. There are downloadable Excel spreadsheets that show usage by job, usage by day, user summary, group summary, and project summary. For users under a managed account, by default, the per-job cost is only available to the person who owns the managed account.


Authorization to Operate (ATO)
AWS – This testing was done on the AWS public cloud under the ATO of CSSO/EMCC. EMCC also has resources on AWS GovCloud, which in principle allows processing ITAR/EAR99 data. However, the CSSO/EMCC project manager does not yet allow anyone to run ITAR/EAR99 work on AWS except in tightly controlled situations. If HECC were to use AWS outside of CSSO/EMCC, a separate ATO between the HECC Security Lead and NASA HQ would be needed.

POD – This testing was done under a "proof of concept" arrangement at no cost. No ATO has been filed by CSSO/EMCC or HECC for using POD in production. POD is SSAE (Statement on Standards for Attestation Engagements) SOC 1 and SOC 2 compliant.

Sales Channel
AWS – Government agencies have many options for buying AWS cloud services, either directly from AWS or through a range of contract vehicles [24]. NASA's Solutions for Enterprise-Wide Procurement (SEWP) works with multiple AWS resellers to provide services to federal agencies. For CSSO/EMCC AWS, the reseller used is Four Points Technology.

POD – Sales are handled by POD's Director of Sales and Federal Accounting Rep.