
CERN-IT evaluation of Microsoft Azure cloud IaaS

C. Cordeiro(1), A. Di Girolamo(1), L. Field(1), D. Giordano(1), H. Riahi(1), J. Schovancova(1), A. Valassi(1), L. Villazon(1)

Reviewed and approved by: S. Coriani(2), N. Erfurth(2), P. Koen(2), X. Pillon(2), H. Scherer(2), R. Stütz(2)

(1) CERN-IT, (2) Microsoft Azure

CERN, 14-02-2016

1 Introduction

This document reports the experience acquired from using the Microsoft Azure cloud Infrastructure as a Service (IaaS) within the distributed computing environments of the LHC experiments. The activity has been conducted in the framework of a joint collaboration between Microsoft and CERN-IT, initiated by the round table meeting of 2 February 2015 [1].

1.1 Goals

As agreed in the aforementioned round table meeting, the collaboration foresees four phases:

1. informal technical evaluation of Azure,
2. deployment at progressively larger scales,
3. initial assessment of costs and TCO, and
4. investigation of procurement models.

In this document we summarize the experience acquired in the execution of Phase 1. The objective of Phase 1 is "[…] to achieve an informal technical evaluation of Azure and Azure Batch to understand the technology, the APIs, ease-of-use, any obvious barrier to integration with existing WLCG tools and processes. […] the resulting feedback will give both parties a chance to assess the basic technical feasibility for more advanced activities" [1].

1 Contact: domenico.giordano@cern.ch, cristovao.cordeiro@cern.ch


2 Work organization

This activity has been conducted by members of the CERN-IT Support for Distributed Computing group involved in similar projects of integration, deployment and operation of commercial cloud resources within the computing workload systems of the LHC experiments [2]. Guidance and support have been provided by Microsoft Azure Solution Architects (Mr. H. Scherer et al.).

2.1 Support meetings

Weekly synchronization meetings (one-hour Skype calls) were held throughout the full duration of the activity to report progress, discuss issues and exchange feedback. Participants:

• CERN: D. Giordano, C. Cordeiro, L. Villazon, and
• Microsoft: H. Scherer, P. Koen, X. Pillons, N. Erfurth.

2.2 Dedicated meetings

Two dedicated topical meetings were also organized.

2.2.1 Microsoft Azure introduction (Apr. 10)

A day-long introduction to Azure IaaS was given by Mr. H. Scherer on site at CERN. The Azure service offering was discussed, including Azure Batch, as well as the processes to configure Azure accounts, define storage accounts, import images and manage user data and secrets. The different data centres and locations were also covered. During the tutorial the Azure web portal (https://manage.windowsazure.com) was extensively used and illustrated. The future availability of a new portal, the Azure Preview Portal (https://portal.azure.com/), was also announced.

2.2.2 CernVM deployment support (Aug. 21)

A full-afternoon hands-on remote meeting, held via Skype, was organized to evaluate the integration of the CernVM image in Azure IaaS.

3 Timeline and milestones

The activity held a kick-off meeting on March 17th, during which the list of test cases was drafted and the initial technical aspects of acquiring access to the Azure resources were covered. In particular, the creation of a Microsoft Azure Sponsored account was proposed to enable large-scale tests. A Microsoft Outlook account (cern-mscloudtest@outlook.com) was created and connected to the sponsorship. For development and small-scale tests, individual CERN MSDN subscriptions ($50 of credit per month) were considered sufficient to start with.

Table 1 reports the major milestones and achievements during the full activity.


Date       | Milestone
Feb. 02    | Round table meeting
Mar. 17    | Technical kick-off meeting
Mar. 19    | Configured CERN MSDN accounts ($50/month) for initial test activity
Mar. 23    | Sponsored Account ready ($10k until end of June); signature of Loan Agreement needed
Apr. 10    | Full-day introduction to Azure
May 11     | Start of migration to Azure Resource Manager
June 16    | Start of usage of the Sponsored Account; Sponsored Account extended to end of September and increased to $40k
Aug. 14    | CernVM image for Azure IaaS is available
Aug. 21    | CernVM deployment support meeting
Sep. 4     | Start of large-scale tests
Sep. 22    | Sponsored Account extended to end of November
Sep. 28    | About 2,700 vCPUs provisioned across 3 data centres
Nov. 18    | Completed large-scale test of the D3 VM series
Nov. 27-29 | Large-scale test: 4,600 vCPUs provisioned. All VMs steadily run workloads of the LHC experiments
Nov. 30    | Sponsored Account termination

Table 1: List of major milestones during the full activity

4 Technical design

4.1 Architecture

Figure 1 shows the architecture of the set-up used to provision resources within Azure, to steer LHC experiment workflows to the running VMs and to monitor the VM status and usage accounting. VMs are provisioned using an agent able to interact with the Azure API. Two different agents, communicating with two different APIs, have been used (namely Vcycle and the CERN-ARM wrapper, see later) in order to respectively evaluate and adopt the two available Azure provisioning models, Service Manager2 and Resource Manager3. Two different VM images were also used: CernVM [3] and CentOS 6 [4]. In both cases the images included a minimal amount of packages and basic services such as CVMFS [5], the XRootD [6] client and the Ganglia [7] monitoring daemon. The experiment-related libraries and configuration data are accessed by the applications at runtime through the HTTP-based CVMFS read-only file system.

2 https://msdn.microsoft.com/en-us/library/jj838706.aspx
3 https://azure.microsoft.com/en-us/documentation/articles/resource-group-overview/


The CERN EOS [8] data storage system is used, when needed, to access input files and to store output files, exploiting remote data access across the WAN using the XRootD protocol.

4.2 VM image and size

The size and OS selected for the provisioned VMs are based on the technical specifications already adopted in other commercial cloud contexts, which proved to be satisfactory for running Monte Carlo simulation jobs of LHC events. The minimum size is a single-vCPU VM with 2 GB of RAM, 20 GB of free disk space and a public IPv4 address. This last requirement on public IPv4 addresses is based on the similar request made in recent price enquiries for the acquisition of commercial cloud resources, and corresponds to an analogous configuration adopted for the CERN resources. Multi-vCPU VMs have also been tested.

Figure 1. Architecture of the set-up used to provision resources within Microsoft Azure, to steer LHC experiment workflows to the running VMs and to monitor the VM status and usage accounting.


4.2.1 CernVM image

CernVM is a virtual machine image based on Scientific Linux 6 combined with a custom, virtualization-friendly Linux kernel. It is based on the µCernVM boot-loader, distributed as a read-only image of about 20 MBytes containing a Linux kernel and the CernVM-FS client. The rest of the operating system is downloaded and cached on demand by CernVM-FS. The virtual machine still requires a hard disk as a persistent cache, but this hard disk is initially empty and can be created instantaneously, instead of being pre-created and distributed. Since August a CernVM image for the Azure platform has been made available by the CERN CernVM team4. The image, imported into the Sponsored Account, has been extensively used for the tests described in Section 5.

4.2.2 CentOS image

CentOS is an alternative image option which is also based on the Linux OS family. In Azure this image is natively provided by OpenLogic in different versions, containing the installation of the Basic Server packages. Among the different versions offered by the provider, the CentOS 6.x versions are the preferred ones according to the technical specifications mentioned above.

Since this image does not include any LHC experiment-related environment, an initial setup is required. This can be achieved either through a contextualization process at VM startup or by snapshotting the image after installation of the needed packages. To avoid heavy routines running systematically at each VM provisioning, and since Azure allows the usage of custom images, the latter option was adopted. The resulting snapshot contains the minimal amount of software and configuration, taken from previous cloud experiences [9], allowing a correct execution of the experiment workloads.

4.2.3 Microsoft Linux Agent

The Microsoft Azure Linux Agent (waagent)5 manages the interaction between the virtual machines and the Azure Fabric Controller. It provides many functions for Linux and FreeBSD IaaS deployments in several areas, including image provisioning, networking, diagnostics and VM extensions. As the component responsible for the communication between the platform and the VMs, this agent needs to be addressed when working with custom OS images. This means that, unlike the CentOS snapshot where waagent was already set up in the base OS, the CernVM image had to be modified in order to include this agent.

4 http://cernvm.cern.ch/portal/downloads
5 https://github.com/Azure/WALinuxAgent


4.3 VM Contextualization

As mentioned above, the experiments' software and configuration data are set up at runtime, more specifically when the VM starts. This process is referred to as the contextualization stage.

Azure offers different ways to provide a start-up script to the VM, depending mainly on the type of OS in use. In the CentOS snapshot use case, the de facto contextualization package Cloud-init6 was not available, leaving users to rely on Azure's data injection mechanisms such as Custom Data7 and VM Extensions8. With CernVM, on the other hand, the image was prepared in such a way that Cloud-init is also enabled, allowing for a more versatile generation of the contextualization user data.

With any of the specified mechanisms, the provided user data always performs the same actions; only its input format and internal interpretation change.
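As an illustration, the following is a minimal sketch in Python of how such user data could be prepared; the cloud-config commands and service names are placeholders, not the contextualization scripts actually used in this activity.

```python
# Minimal sketch: build a Cloud-init "user data" document and the base64-encoded
# CustomData payload expected when Cloud-init is not available in the image.
# The runcmd entries (CVMFS setup, starting gmond) are illustrative placeholders.
import base64

user_data = """#cloud-config
runcmd:
  - [ cvmfs_config, setup ]   # mount the experiment software repositories
  - [ service, gmond, start ] # start the Ganglia monitoring daemon
"""

# CernVM (Cloud-init enabled): the script can be passed as-is at provisioning time.
# CentOS snapshot (no Cloud-init): inject it through Azure CustomData, base64-encoded.
custom_data = base64.b64encode(user_data.encode("utf-8")).decode("ascii")
print(custom_data[:60], "...")
```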

4.4 Infrastructure modus operandi

The user interaction with the Azure IaaS differs significantly from other cloud IaaS, in the sense that Azure gives the user greater flexibility in configuring the desired resources. On the other hand, this implies that every underlying resource necessary for the successful instantiation of virtual machines must be properly provisioned. For every VM, users have to adequately create and link components such as Storage Accounts, Network Interface Cards, Dynamic IPs, Virtual Networks and Resource Groups or Cloud Services, the latter two being essentially resource containers.

4.5 Azure API

Like other cloud providers, Azure offers several ways to interact with the infrastructure. The already cited web user interfaces are good options for monitoring and managing resources at small scale, but prove to be suboptimal for large-scale and automated deployments. For this use case other options are available, including two REST APIs, a Command Line Interface and an SDK for Python, the latter two being wrappers on top of the REST APIs.

The two REST APIs correspond to two different management mechanisms:

• Azure Service Management (ASM)2, and
• Azure Resource Management (ARM)3.

6 https://help.ubuntu.com/community/CloudInit
7 https://azure.microsoft.com/en-us/blog/custom-data-and-cloud-init-on-windows-azure/
8 https://msdn.microsoft.com/en-us/library/azure/dn832621.aspx


The ARM API is newer than the ASM API and cross-compatibility between them is not fully ensured. The ARM API is officially the primary API, while the ASM API is called the "classic" API.

4.5.1 Azure Service Management

The ASM API is often referred to as the "classic" way of handling resources, through a dedicated management portal9. It is a REST API where all operations are performed over SSL and mutually authenticated using X.509 v3 certificates. Based on the feedback received from the Microsoft Azure Solution Architects, this API should in time be phased out and all ASM-provisioned resources migrated to the ARM model.

Within the ASM model, a key component is the Cloud Service, which represents a "container" for the application resources. A Cloud Service can host a maximum of 50 VMs and the maximum number of Cloud Services per subscription is 200 (non-modifiable limits)10. The Cloud Service has a public dynamic IP address and acts as a gateway for all the underlying VMs within its private network, accessible through a 1:N port mapping. This networking model does not fit the CERN requirement of having a dynamic public IP address per VM, and the workaround of having a single VM per Cloud Service would limit the maximum capacity to 200 VMs. Most of these limitations are not present in the alternative provisioning model (ARM); therefore the large-scale tests have been performed using the ARM API.
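As an illustration of the ASM API usage described above, the snippet below is a minimal sketch based on the legacy Azure SDK for Python; the subscription identifier and certificate path are placeholders, and the calls follow the public SDK documentation rather than the connector described later in Section 4.6.1.

```python
# Minimal ASM sketch with the legacy Azure SDK for Python: authenticate with a
# management certificate and enumerate the Cloud Services ("hosted services")
# of the subscription. All identifiers below are placeholders.
from azure.servicemanagement import ServiceManagementService

subscription_id = "00000000-0000-0000-0000-000000000000"   # placeholder
certificate_path = "/path/to/management_certificate.pem"   # placeholder

sms = ServiceManagementService(subscription_id, certificate_path)

# Each Cloud Service can host at most 50 VMs, and a subscription at most
# 200 Cloud Services, which is why this model was dropped for the scale tests.
for service in sms.list_hosted_services():
    print(service.service_name, service.hosted_service_properties.location)
```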

4.5.2 Azure Resource Management

The ARM API is a JSON-driven REST API and is also linked to its own dedicated web portal11. The biggest change and advantage of this API is the JSON template deployment model. The template is a JSON file that contains cloud resource definitions and allows the user to request multiple services and resources in one single call.

The template provisioning mechanism is a declarative method to instantiate resources, where all the underlying deployment instructions are moved to the infrastructure side. The provisioned resources fall under containers called Resource Groups, which stand for a logical set of correlated cloud resources.
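To illustrate the template mechanism, the sketch below expresses a minimal ARM template as a Python dictionary; the resource types and the 2015-06-15 API version follow the public ARM documentation of that period and are assumptions, not an excerpt of the templates used in this activity.

```python
# Minimal sketch of an ARM deployment template, built as a Python dict and
# serialized to JSON. Only two of the per-VM resources of Section 4.4 are shown;
# the NIC, virtual network and virtual machine resources would follow, each
# declaring its dependencies through a "dependsOn" list.
import json

template = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {"vmName": {"type": "string"}},
    "resources": [
        {   # Storage Account that will hold the VM OS disk (assumed API version)
            "type": "Microsoft.Storage/storageAccounts",
            "apiVersion": "2015-06-15",
            "name": "[concat(parameters('vmName'), 'stor')]",
            "location": "[resourceGroup().location]",
            "properties": {"accountType": "Standard_LRS"},
        },
        {   # one dynamic public IPv4 address per VM, as required in Section 4.2
            "type": "Microsoft.Network/publicIPAddresses",
            "apiVersion": "2015-06-15",
            "name": "[concat(parameters('vmName'), '-ip')]",
            "location": "[resourceGroup().location]",
            "properties": {"publicIPAllocationMethod": "Dynamic"},
        },
    ],
}

print(json.dumps(template, indent=2))
```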

4.6 VM provisioning system

This section describes the two provisioning applications used to automatically interact with the Azure APIs.

9 https://manage.windowsazure.com
10 https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#subscription-limits
11 https://portal.azure.com/


4.6.1 Vcycle

Vcycle [10] is an open-source VM lifecycle manager that implements the VAC model12 on IaaS cloud providers, allowing for automated creation and destruction of VMs. It is one of several tools used in the HEP community to deliver elastic capacity from cloud providers. Vcycle supervises the VMs and instantiates or shuts down VMs depending on the state of the experiments' central task queue. This approach enables elastic capacity management of the available cloud resources.

Vcycle interacts with different clouds via specific Python plugins (connectors) exploiting the preferred cloud API. We have developed the connector for Azure. This Vcycle connector was developed using the Azure SDK13 for Python, focusing on the ASM API. It is available in a public GitHub repository14.
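The following sketch only illustrates the general shape of such a connector; the class and method names are hypothetical and do not reproduce the actual Vcycle plugin interface or the code in the repository.

```python
# Hypothetical, simplified connector shape: a per-subscription object wrapping
# the legacy ASM SDK so that a lifecycle manager such as Vcycle can list,
# create and destroy VMs. Method names are illustrative only.
from azure.servicemanagement import ServiceManagementService

class AzureConnector:
    def __init__(self, subscription_id, certificate_path):
        # One connector instance per Azure subscription.
        self.sms = ServiceManagementService(subscription_id, certificate_path)

    def list_machines(self):
        # With one VM per Cloud Service (Section 4.5.1), listing the hosted
        # services enumerates the VMs managed through this connector.
        return [s.service_name for s in self.sms.list_hosted_services()]

    def create_machine(self, name, image, size):
        # A real connector creates a hosted service and then a virtual machine
        # deployment inside it; omitted here to keep the sketch minimal.
        raise NotImplementedError

    def delete_machine(self, name):
        # Removing the hosted service tears down the single VM it contains.
        self.sms.delete_hosted_service(name)
```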

4.6.2 ARM-CERN wrapper

Another provisioning option is to use what is already provided by Azure. With the ARM API all the provisioning steps are on the provider side, leaving the consumer with JSON template generation only. In other cloud contexts this approach is often called "orchestration", because the effective provisioning handling, fault tolerance and retries are moved to the provider side.

In order to use the ARM approach, a custom wrapper (the ARM-CERN wrapper) was developed around the Azure XPlat-CLI15, with the goal of allowing a user to build JSON templates and issue the provisioning calls automatically, based only on configuration. This wrapper is prepared exclusively for the Azure infrastructure, in order to set up the experiments' environment in Azure from a custom image. When working with custom OS images, it takes care of copying the OS image into each created Storage Account.

The number of VMs provisioned per Storage Account is configured to respect the suggested maximum of 40 VMs, as noted in the Azure Service Limits documentation: "You can roughly calculate the number of highly utilized disks supported by a single storage account based on the transaction limit. For example, for a Basic Tier VM, the maximum number of highly utilized disks is about 66 (20,000/300 IOPS per disk), and for a Standard Tier VM, it is about 40 (20,000/500 IOPS per disk). However, note that the Storage Account can support a larger number of disks if they are not all highly utilized at the same time."
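A minimal sketch of this pattern is given below; it assumes the ARM mode of the Azure XPlat-CLI and its group/deployment verbs, and it is not the actual ARM-CERN wrapper code.

```python
# Minimal sketch of a wrapper step: derive the number of Storage Accounts from
# the 40-VMs-per-account guideline, write the generated JSON template to disk
# and hand it to the Azure XPlat-CLI (assumed CLI verbs, ARM mode).
import json
import subprocess

VMS_PER_STORAGE_ACCOUNT = 40  # suggested maximum quoted above

def deploy(resource_group, location, template, n_vms):
    n_storage_accounts = (n_vms + VMS_PER_STORAGE_ACCOUNT - 1) // VMS_PER_STORAGE_ACCOUNT
    # Hypothetical template parameter used here only to carry the chunking result.
    template.setdefault("parameters", {})["storageAccountCount"] = {
        "type": "int", "defaultValue": n_storage_accounts}

    with open("deployment.json", "w") as f:
        json.dump(template, f, indent=2)

    subprocess.check_call(["azure", "config", "mode", "arm"])
    subprocess.check_call(["azure", "group", "create", resource_group, location])
    subprocess.check_call(["azure", "group", "deployment", "create",
                           "-g", resource_group, "-f", "deployment.json"])
```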

12 https://www.gridpp.ac.uk/vac/
13 https://azure.microsoft.com/en-us/documentation/articles/python-how-to-install/
14 https://github.com/vacproject/vcycle
15 https://github.com/Azure/azure-xplat-cli


4.7 VM Monitoring

The status of the VMs has been monitored in the provisioning and operation phases by means of several monitoring systems, namely the Azure web portals and a CERN-hosted Ganglia monitoring instance.

4.7.1 Azure monitoring

As reported above, both available Azure portals have been used to interact with the IaaS, either to manually provision a few resources or to monitor the status of the provisioned resources. The views offered by the monitoring dashboard are extremely detailed, with the possibility to navigate down to the level of each single resource component whilst keeping the full connection to the dependency tree. The availability of alarms and alert reports is beneficial for issue tracking and debugging.

Figure 2. Snapshot of the Azure portal used to manage the Sponsored Account.

4.7.2 Ganglia

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It has already been demonstrated to work in cloud scenarios, and how to scale the system is understood [11]. The VMs run a Ganglia gmond service that communicates directly with a receiving gmond collector sitting on the head node. The Ganglia gmetad service points to these collectors as data sources, fetching all the metric data from them through a TCP channel and storing it in a local Round-Robin database. This data is then interpreted and presented in a web interface provided by Ganglia's web frontend.
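As an illustration of this data flow, the sketch below pulls the XML metric dump from a gmond collector over its default TCP reporting port; the head-node hostname is a placeholder.

```python
# Minimal sketch: gmond publishes its full metric tree as XML to any client
# that connects to its TCP reporting channel (port 8649 by default); gmetad
# polls the collectors in the same way before filling its Round-Robin database.
import socket

def dump_gmond_xml(host, port=8649, timeout=10):
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

# Hypothetical head-node name, for illustration only:
# print(dump_gmond_xml("ganglia-headnode.example.cern.ch")[:400])
```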

For this specific activity a dedicated Ganglia server has been instantiated16, monitoring separately the Azure VMs provisioned for each experiment (ATLAS, CMS, LHCb). The monitoring metrics per VM cover CPU-related quantities (load, percentage of CPU time spent in idle, system, nice, etc.), memory-related quantities (free memory, swap, cache, etc.), network-related quantities (inbound and outbound traffic) and disk-related quantities (disk size, free, full, etc.). Figure 3 shows a snapshot of the Ganglia Web UI monitoring VMs provisioned in Azure.

16 http://azucern-ganglia-monitor.cern.ch

5 Results

This section describes the three test cases performed: scale tests, resource profiling and the ability to run experiment jobs. Three different Microsoft data centres have been targeted, namely Central US, North Europe and West Europe, with a maximum allowed capacity of 1500, 1500 and 1000 VMs respectively.

Figure 3. Snapshot of the CERN Ganglia monitoring instance for the Azure cloud. The plots show the Azure resources provisioned for the ATLAS VO during a scale-out test.

5.1 Scale tests

Scale tests have been performed with the goal of testing the infrastructure performance in regimes with a large number of provisioned VMs. The capacity targeted is the maximum available capacity of the sponsored subscription per data centre. The adopted approach consists of firing requests for a number of Resource Groups, each containing a fraction of the total number of targeted VMs. The fractioning reflects the suggested limit of 40 VMs per Storage Account and the default limit of 100 Storage Accounts per subscription. The CernVM image has been used for most of the tests. The small size of the image (around 20 MBytes) eases the process of copying the image into each Storage Account and duplicating it as many times as the number of requested VMs.
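The chunking described above can be summarized with a small back-of-the-envelope sketch; the per-data-centre targets are the capacities quoted at the start of this section, and the function only reproduces the arithmetic, not the provisioning code.

```python
# Back-of-the-envelope partitioning for the scale tests: the suggested limit of
# 40 VMs per Storage Account and the default 100 Storage Accounts per
# subscription cap a single subscription at 4000 single-vCPU VMs.
VMS_PER_STORAGE_ACCOUNT = 40
MAX_STORAGE_ACCOUNTS = 100

def storage_accounts_needed(n_vms):
    return (n_vms + VMS_PER_STORAGE_ACCOUNT - 1) // VMS_PER_STORAGE_ACCOUNT

for data_centre, target_vms in (("Central US", 1500),
                                ("North Europe", 1500),
                                ("West Europe", 1000)):
    accounts = storage_accounts_needed(target_vms)
    assert accounts <= MAX_STORAGE_ACCOUNTS
    print(f"{data_centre}: {target_vms} VMs -> {accounts} Storage Accounts")
```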

Figure 4. Ganglia plot of the number of single-vCPU VMs provisioned (red curve) in different scale tests (left). A detail of the largest test is also shown (right).

Figure 4 shows two Ganglia plots of the ramp-up and ramp-down phases in different scale tests. A maximum of 4,600 vCPUs has been provisioned across the three available data centres, namely 2,000 vCPUs in each of North Europe and Central US and 600 vCPUs in West Europe.

5.2 Profiling resources

Performance measurements and monitoring are essential for the efficient use of computing resources, as they allow selecting and validating the most effective resources for a given processing workflow. In a commercial cloud environment an exhaustive resource profiling has additional benefits, due to the intrinsic variability of a virtualized environment. Ultimately it provides additional information to compare the presumed performance of the invoiced resources with the actual delivered performance as perceived by the consumer.

In this phase, all provisioned VMs were profiled using different benchmark metrics and the results were analysed. The adopted benchmarks span from generic open-source benchmarks (an encoding algorithm and kernel compilation) to LHC-experiment-specific benchmarks (ATLAS KitValidation [12]) and fast benchmarks based on random number generation. Profiling has been performed across the three targeted data centres and different flavours of VMs (Basic A1, Standard A1 and A3, Standard D1 and D3)17.

Figure 5 shows the comparison among VMs of type Standard_A1 provisioned in the three data centres. Data have been aggregated per data centre and per CPU type. The two separate peaks are representative of two different Intel series. Even if all VMs have been provisioned under the class Standard_A1, it is evident that a set of VMs, mainly provisioned in the Central US data centre, has a performance that is about 50% worse than the rest of the provisioned VMs.

17 https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-size-specs/

Figure 5. Benchmark performance for VMs of the Standard A1 series, aggregated per data centre (left) and per CPU type (right).

Figure 6 shows the comparison between two benchmarks used to profile resources, namely a fast benchmark based on random number generation and the KV benchmark, which simulates collisions inside the ATLAS detector. The correlation between the two measurements appears to be good, even considering the large measurement range due to the different CPU architectures.

Figure 6. Comparison of the two benchmark measurements (fast benchmark and KV benchmark) performed iteratively in VMs of class Standard_A1 provisioned in the North Europe data centre. The scatter plot of the two variables is shown (bottom left), as well as the two projections on the x-axis (bottom right) and the y-axis (top left). A profile histogram with a linear fit is also shown (top right).


Figure 7. Comparison of VM performance on the basis of the KV benchmark, measuring the CPU processing time per event. Benchmarked VMs are classified per data centre (North Europe, West Europe and Central US), VM series (D3 and A3) and CPU model (Intel Xeon E5-2673 v3, Intel Xeon E5-2660 v0, and AMD 4141 HE). The left plot shows the average performance of each VM class, ordered from the fastest to the slowest model (blue line), together with the 50% inter-quartile range and the total value range as a box plot. The number of measurements collected for each class is superimposed (green histogram, right y-axis). The cost per event for the same VM classes is also reported (right plot).

A cost-to-benefit analysis of various VM offers has also been performed. For this purpose VMs of type Standard_D3 and Standard_A3 have been provisioned in the three available data centres, each VM consisting of 4 vCPUs. By provisioning several hundred VMs, it has been possible to probe different CPU models, namely Intel Xeon E5-2673 v3, Intel Xeon E5-2660 v0, and AMD 4141 HE. This was an observation made during the test phase; according to Microsoft it was due to new hardware deployed for the new Dv2 series. The VM performance has been measured by running the KV benchmark. Each KV test consisted of running four parallel instances of the benchmark application, in order to load the 4 vCPUs available in each VM. The measurements have been repeated multiple times in each VM in order to average out fluctuations due to uncontrollable effects, such as resource sharing with other customers in the public cloud. Figure 7 (left) shows the average performance of each VM class, ordered from the fastest to the slowest model (blue line), together with the 50% inter-quartile range and the total value range as a box plot. The number of measurements collected for each class is superimposed (green histogram, right y-axis). The plot shows that the performance of D-series VMs depends on the CPU model, with the Intel Xeon CPU E5-2660 v0 @ 2.20 GHz being 50% slower than the other probed model, the Intel Xeon CPU E5-2673 v3 @ 2.40 GHz. A-series VMs follow in the performance ranking. In addition, the probed VMs have been compared on a cost-per-event basis, obtained by multiplying the measured CPU time per event by the VM cost per hour, as reported in the Azure catalogue17. The cost per event for each VM class is reported in Figure 7 (right). It shows that the cost per event of Standard_D3 VMs based on the Intel Xeon CPU E5-2660 v0 is higher than the cost per event processed on Standard_A3 VMs, across the three data centres.
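As a worked example of the cost metric used in Figure 7, the sketch below multiplies the measured CPU time per event by the hourly VM price; the numbers are purely hypothetical, since the underlying measurements and catalogue prices are not quoted in this report.

```python
# Worked example of the Figure 7 cost metric with hypothetical numbers
# (the report does not quote the underlying measurements or prices).
cpu_time_per_event_s = 30.0   # hypothetical KV result, CPU seconds per event
vm_price_per_hour = 0.30      # hypothetical catalogue price, USD per VM-hour

# Cost per event = CPU time per event (in hours) times the VM cost per hour;
# how the price is shared among the 4 parallel KV instances is not specified here.
cost_per_event = (cpu_time_per_event_s / 3600.0) * vm_price_per_hour
print(f"cost/event = {cost_per_event:.6f} USD")
```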

5.3 Experience in executing experiment workloads

5.3.1 ATLAS

The Workload Management System of the ATLAS experiment, PanDA [13], utilizes a pilot-based approach to exploit the available resources. In this scale test, pilots were provisioned by HTCondor [14]. The AutoPilot Factory submits a pilot wrapper to the HTCondor master instance, which then submits a condor job with a pilot wrapper executable to an ATLAS distributed computing centre, where it is executed on a Worker Node (WN). The pilot on a WN contacts the PanDA server with a request for a payload, which it then executes. The job output is transferred from the WN to permanent storage, in this case the EOS [15] storage at CERN.

Each WN ran an HTCondor client that communicated with the HTCondor master instance. Network-wise, the VMs came with a public IP address but were situated in a private address space. ATLAS leveraged the HTCondor Connection Broker (CCB) to successfully operate the available resources with minimal additional operational overhead. With HTCondor CCB, the WNs joined the pool of CERN resources. The relevant PanDA resource was configured in the ATLAS Grid Information System [16].

In the last 3 days of November 2015, ATLAS contributed to the scale test of the Azure IaaS by submitting simulation jobs, characterized by low disk I/O and a high ratio of CPU time over wall-clock time. In total ~2 million events were processed, running for the equivalent of 206k wall-clock hours. At the peak, ATLAS benefited from up to 4.6k cores simultaneously. The resources were stable, with a low failure rate of ~3.2% of wall-clock time, mainly caused by "lost heartbeat" errors between PanDA and the Azure VMs during the controlled VM termination for the ramp-down of the cluster.

Figure 8 shows two monitoring plots of the job activity. The hourly number of finished jobs shows the successful and failed jobs, in green and red respectively. The two red peaks correspond to the external termination of the VMs. On the right the cumulative number of processed events is reported.

Figure 8. Number of ATLAS finished jobs per hour (left), where successful jobs are reported in green and failed jobs in red. The cumulative number of processed events during the scale test is also reported (right).

5.3.2 CMS

The CMS Workload Management system includes WMAgent [17] for central data production and processing and CRAB [18] for distributed data analysis. It relies on glideinWMS [19], a pilot-based submission system built upon HTCondor [14]. The main elements of glideinWMS are factories for pilot submission to distributed sites and a glideinWMS frontend to request pilots according to the need for resources in the underlying HTCondor pool. The HTCondor pool itself consists of the job queues and a central manager, which matches queued jobs and resources.

Within the third-party VM factory model, VMs are provisioned independently from the job queue. The CMS pilot script, aka Glidein, is downloaded and executed after the contextualization of the VM, and then retrieves and processes a job from the job queue. A Grid site in CMS provides computing and storage capacity to the experiment. Since the new cloud site hosted in the Azure infrastructure is disk-less, it has been configured accordingly in the CMS information system, so that the jobs' output is transferred to persistent remote storage in the Grid.

The workflows supported by CRAB3 are end-user analysis and private Monte Carlo production. Both have been integrated to run in Azure. The job status is monitored via the CMS Dashboard Task monitoring [20], as for any other CMS Grid job. Figure 9 shows the execution results of 100 private Monte Carlo production jobs submitted with CRAB3, which were all successfully executed.


Figure 9. Execution results of 100 CMS Monte Carlo production jobs.

5.3.3 LHCb

The Workload Management System of the LHCb experiment is based on the DIRAC [21] community Grid solution and on its LHCb-specific extension, LHCbDIRAC [22]. LHCbDIRAC uses a pilot-based approach to exploit the available resources. Each pilot contacts the LHCbDIRAC server with a request for a payload to be executed. The payload queues for all workflow types and the execution status of each payload can be monitored using the LHCbDIRAC web portal.

On resources where batch systems are installed, batch jobs are submitted with a pilot wrapper executable. In the case of the Azure IaaS, similarly to what is done on other cloud resources available to LHCb, a different approach was adopted, where a process responsible for spawning DIRAC pilots was launched on each provisioned VM immediately during the contextualization stage. More precisely, on each of the four logical processors available on the Azure IaaS VMs, a benchmarking process and a DIRAC pilot pulling single-processor payloads were executed one after the other. When a DIRAC pilot ended, either because its payload completed execution or because no payload job was found in the queue, another benchmarking process was started, and so on in an endless loop.

LHCb used the Azure IaaS to execute simulation payloads, characterized by low disk I/O and a high ratio of CPU time over wall-clock time. Every payload included the full chain of event generation, detector simulation and event reconstruction, each of these steps involving a different application executable. The output from each application step was transferred from the VM to permanent storage, in this case the EOS storage at CERN.


The largest scale test of the Azure IaaS by LHCb was performed during the last week of November 2015. At the peak, LHCb executed 1.3k single-processor simulation payloads simultaneously on as many cores. This is illustrated in Figure 10, which shows the rapid ramp-up of the number of LHCb simulation jobs as the number of provisioned VMs was increased.

Figure 10. Ganglia monitoring plot for the LHCb scale test on 28-29 November 2015.

6 Gap Analysis and Recommendations

In this section we summarize our findings on the usage of Azure resources and their integration with the computing frameworks of the LHC experiments. The major showstoppers identified are reported, as well as the mitigation solutions and the forthcoming tools and improvements introduced by the Azure Solution Architects. Table 2 summarizes the major issues faced during the evaluation activity.

6.1 Procurement

6.1.1 Sponsored Subscription

Delays in the initial usage of a Sponsored Subscription have been caused by the need to check and agree on rules and conditions between Microsoft and CERN for sponsorships with Public Sector/Research customers and international organizations. Some of those delays were caused by the need to provide a credit card number in the subscription portal, even for contracts based on an invoice. In our opinion, a prerequisite for the next phase is to tackle the contractual aspects and define a Microsoft-CERN agreement well in advance of the technical activity.


Date    | Issue                                                                                                                                  | Class         | Time to solve
Mar. 23 | Sponsored Subscription ready, but pending approval from the MS Compliance department                                                  | Procurement   | 1 month
Apr. 20 | Loan Agreement CERN-Microsoft for the Sponsored Account to be reviewed and signed                                                     | Procurement   | 1 month
Jun. 9  | Configuration of the account for the Sponsored Subscription requires modification from Credit Card to Invoice method                  | Procurement   | 8 days
Aug. 3  | Increase subscription default limit on number of cores                                                                                | Configuration | 10 days
Aug. 27 | Increase subscription default limit on number of dynamic public IPv4 addresses: default 60, needed 1000 per data centre               | Configuration | 12 days
Sep. 21 | Increase subscription limit on number of Network Interface Cards (NIC) per region per subscription: default 300, needed 1000 per data centre | Configuration | 2 days
Sep. 24 | Failures in VM provisioning seen systematically in deployments based on CernVM (~30% of requests); solved by properly resizing the CernVM image | Provisioning  | 1 month

Table 2. List of major issues faced, with the time needed to obtain a working solution.

6.2 Configuration

6.2.1 Subscription limits

As a consequence of the Azure IaaS design described in Section 4.4, subscription limits are applied to essentially each of the multiple resources needed to build a VM (storage, network, IP addresses, etc.). Some of those limits cannot be modified, such as the number of Cloud Services in the ASM model; this has been the main reason for the migration to the ARM model. In the ARM model other limits hold, but most of them can be increased. The overall number of Storage Accounts accepted is 100 per subscription. Following the suggested practice of having 40 VMs per Storage Account, this leads to a maximum of 4000 VMs that can be provisioned per subscription. Given that a sponsorship is in general bound to a single subscription, this sets a limit on the capacity achievable per Sponsored Subscription.

6.2.2 Public IP addresses

As reported in Section 4.2, a public IPv4 address per VM is required. This requirement corresponds to a similar configuration adopted for the CERN resources. A similar request has been made in the recent price enquiries for the acquisition of commercial cloud resources.

In the context of the Azure evaluation, the limitation on public IPv4 addresses caused a delay in the tests at a reasonably large scale. The initial limit of 60 dynamic IPv4 addresses per region per subscription has been exceptionally increased, for a limited amount of time, to reach 1500 (Central US and North Europe) and 1000 (West Europe) IPv4 addresses per region.

As suggested by the Solution Architects, this limitation will be addressed in the future with the adoption of virtual networking appliances leveraging VPNs and relays/gateways. It has been highlighted that large public IP availability (even with IPv6) is not an option.

6.3 Capacity Management

The ARM model is the suggested approach for the acquisition of large capacity in Azure, through its template functionality that enables faster VM provisioning. As highlighted by the Azure Solution Architects, new features will be made available soon to allow parallel provisioning while avoiding the creation of many resource groups. This new capability is named VM Scale Sets and has been in public preview since November 1st for compute-based scenarios.

6.4 Monitoring

Monitoring of resources and actions through the Azure web portals is very advanced and in line with expectations. Aggregation of metrics per VM, summarized per Resource Group, would be beneficial too. The Azure monitoring, used in conjunction with client-side monitoring tools such as Ganglia, proves to be effective in monitoring the full VM lifecycle and in verifying that the delivered resources match the accounted resources.

6.5 Accounting

The accounting report of the used resources is available through the Azure portal18. The daily and hourly usage breakdown per computing resource (CPU, storage and network) is available in CSV format. The cost of the used resources was not available in the sponsored subscription and therefore neither in the CSV nor in the summary billing history section of the portal (see Figure 11). In general, prices are subject to contract and specific per volume. The Monthly Sponsorship Statement is received on the 18th of every month and covers the activity of the previous month.

18 https://account.windowsazure.com/Subscriptions/billinghistory


This will change once we move from a sponsored subscription to an Enterprise Agreement contract.

As stated in the Azure Sponsorship Offer: "The special pricing will terminate and your subscription(s) under the Microsoft Azure Sponsorship offer will be converted automatically to the Pay-As-You-Go offer upon the earlier occurrence of (1) when your total cumulative usage reaches the Usage Cap (specified above) at standard Pay-As-You-Go rates prior to application of any discount or (2) when you reach your End Date (specified above)."

Given the large-scale tests CERN executes, in our opinion this lack of ongoing feedback on the remaining credit can represent a financial liability for CERN in case the full credit is prematurely exhausted.

Figure 11. Snapshot of the reported billing history for the used resources.

6.6 Azure Batch

Azure Batch19 introduces a concept well known in the workload management systems of HEP computing, namely workload splitting and scheduling. Azure Batch includes the concepts of jobs and tasks, job splitting, scheduling and data-driven scaling out. Figure 12 shows an example of a parallel workload on Azure Batch, as reported in the documentation19. Even if we consider this approach interesting to investigate, we regard its integration within the current workload management systems of the experiments as an additional indirection layer that would introduce more complexity than benefits. Each experiment has already adopted a system to schedule resources and split workloads into jobs, and its interplay with a similar system such as Azure Batch would not be trivial. The alternative of adopting Azure Batch itself to schedule resources and split workloads into jobs within each experiment would bring more benefits.
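For reference, the snippet below is a minimal sketch of the job/task model using the azure-batch Python SDK; the account credentials, URL, pool name and command line are placeholders, and the calls follow the public SDK quickstarts of the period rather than anything deployed in this activity.

```python
# Minimal sketch of the Azure Batch job/task model with the azure-batch Python
# SDK (calls follow the public quickstarts; account details are placeholders).
import azure.batch.batch_service_client as batch
import azure.batch.batch_auth as batch_auth
import azure.batch.models as batchmodels

credentials = batch_auth.SharedKeyCredentials("batchaccount", "accountkey")  # placeholders
client = batch.BatchServiceClient(
    credentials, base_url="https://batchaccount.westeurope.batch.azure.com")  # placeholder URL

# A job is bound to a pool of VMs; tasks are the split units of work that Batch
# schedules onto the pool, which is the role played today by the experiment WMS.
client.job.add(batchmodels.JobAddParameter(
    id="simulation-job",
    pool_info=batchmodels.PoolInformation(pool_id="existing-pool")))

client.task.add_collection("simulation-job", [
    batchmodels.TaskAddParameter(id=f"task-{i:03d}",
                                 command_line="/bin/bash -c 'echo run payload here'")
    for i in range(10)])
```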

19 https://azure.microsoft.com/en-us/documentation/articles/batch-technical-overview/


Figure 12. Scaling out a parallel workload on Azure Batch.

7 References

[1] A. Di Meglio, CERN IT-Microsoft meeting minutes (restricted access)
[2] CERN-IT SDC Cloud resource integration, https://twiki.cern.ch/twiki/bin/view/ITSDC/WLCGResourceIntegration
[3] CernVM, http://cernvm.cern.ch/portal/; image download page: http://cernvm.cern.ch/portal/downloads
[4] https://www.centos.org/
[5] P. Buncic et al., 2010 J. Phys.: Conf. Ser. 219 042003
[6] F. Furano and A. Hanushevsky, 2010 J. Phys.: Conf. Ser. 219 072005
[7] M. Massie et al., 2012, Monitoring with Ganglia (O'Reilly Media)
[8] A. J. Peters et al., 2014 J. Phys.: Conf. Ser. 331 052015
[9] D. Giordano et al., 2015 J. Phys.: Conf. Ser. 664 022019
[10] A. McNab et al., 2014 J. Phys.: Conf. Ser. 513 032065
[11] C. Cordeiro et al., 2015 J. Phys.: Conf. Ser. 664 022013
[12] A. De Salvo and F. Brasolin, 2010 J. Phys.: Conf. Ser. 219 042037
[13] T. Maeno et al., 2008 J. Phys.: Conf. Ser. 119 062036
[14] E. Fajardo et al., 2015 J. Phys.: Conf. Ser. 664 022014
[15] X. Espinal et al., 2014 J. Phys.: Conf. Ser. 513 042017
[16] A. Anisenkov et al., 2012 J. Phys.: Conf. Ser. 396 032006
[17] E. Fajardo et al., 2012 J. Phys.: Conf. Ser. 396 042018
[18] M. Mascheroni et al., 2015 J. Phys.: Conf. Ser. 664 062038
[19] J. Letts et al., 2015 J. Phys.: Conf. Ser. 664 062031
[20] E. Karavakis et al., 2010 J. Phys.: Conf. Ser. 219 072038
[21] A. Tsaregorodtsev et al., 2014 J. Phys.: Conf. Ser. 513 032096
[22] F. Stagni et al., 2012 J. Phys.: Conf. Ser. 396 032104