Performance Analysis and Tuning in China Mobile's 1000-Node OpenStack Cloud
October 2016
Contents

1 Introduction
2 Performance Study – What, Why and How?
  2.1 China Mobile 1000-Node Production Cluster
  2.2 Low Overhead Instrumentation
  2.3 Virtual Machine Launch Latency
  2.4 OpenStack Nova Architecture
  2.5 Experiments
3 Performance Test Insights
  3.1 Execution Times under Varying Request Pressure
  3.2 Execution Times under Varying Schedulers
  3.3 Launch Errors – Number and Nature
  3.4 Configuration Fixes
  3.5 Service Level Profiling
  3.6 Analysis
4 Conclusions and Future Work

Figures

Figure 2-1: Architecture
Figure 3-1: Send 5 ~ 2000 concurrent requests to 530 compute nodes
Figure 3-2: Varying scheduler settings
Figure 3-3: Trace failure requests
Figure 3-4: Service quality improvements, before and after
Figure 3-5: Controller, keystone, RabbitMQ and MySQL utilization as a function of time

Tables

Table 1: Errors and Categories
Table 2: OpenStack components' peak consumption of CPU, memory & network resources
1 INTRODUCTION

Developers, architects and operators can face challenges when analyzing OpenStack* cloud software performance. Those who develop, deploy and operate cloud infrastructures must ensure their clouds meet service-level agreements (SLAs). Developers want to quickly locate and resolve design and implementation issues that limit cloud scaling. Deployers want to set up services, networks, and nodes to meet scale needs. Operators also want to establish the real capacity of their cloud under heavy workload conditions. Organizations considering deploying a private cloud need data to make educated decisions about whether OpenStack solutions can meet their needs.

OpenStack black-box testing often uses the rally [1] project, but fine-grained analysis requires constant polling, which would introduce an artificial load on the system and muddy the analysis. An alternative is to simulate the test system. While attractive in needing less hardware, simulation requires accurate modelling of inter-component interactions and of the time and resources spent in each component. For this case study, we were fortunate to gain access to China Mobile's production environment before it went live, which allowed us to test on real hardware. We then introduced limited extra logging to minimize any artificial system load.

This paper shares lessons learned from a careful, component-level analysis of China Mobile's 1000-node OpenStack cloud. We uncovered three major issues in the course of this study, all addressed through OpenStack configuration changes. The result: a more stable and performant China Mobile OpenStack cloud, and insights into scale bottlenecks and areas for future work.
2 Performance Study – What, Why and How?
2.1 China Mobile 1000-Node Production Cluster

China Mobile's 1000-node production cloud runs OpenStack Newton RC2 (13.0.0.rc2). The cloud includes 530 compute hosts, five controller nodes, a three-node keystone cluster for identity and authentication services, a three-node active-active Galera Cluster for MySQL database services, a three-node RabbitMQ cluster for inter-component messaging support, and the remaining nodes set up for storage and network services. Note: because the cloud runs on multi-core machines, multi-threaded processes can consume more than 100% of CPU. Server configuration: 2x Intel® Xeon® Processor E5-2680 v3 (30M Cache, 2.50 GHz), 128 GB RAM (8x 16 GB HP® DDR4), 3 TB disk.
2.2 Low Overhead Instrumentation

Given the "slated-for-production" status of the cluster under test, we introduced additional logging, taking care to minimize overhead. That enabled us to further analyze the logs as a post-processing step, without introducing latencies during execution.

Python-based OpenStack code lends itself to monkey-patching [3], enabling the easy introduction and elimination of monitoring instrumentation.
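As a sketch of the technique (not the actual instrumentation used in this study; the service and method names below are hypothetical stand-ins), a timing wrapper can be patched over a target method at runtime and removed just as easily:

```python
import functools
import logging
import time

log = logging.getLogger("perf-trace")

def timed(label):
    """Return a decorator that logs wall-clock entry/exit times for `label`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.info("enter %s %f", label, time.time())
            try:
                return func(*args, **kwargs)
            finally:
                log.info("exit %s %f", label, time.time())
        return wrapper
    return decorator

# Monkey-patch: swap a target method for its timed version at runtime.
# `SomeService.build_instance` is a stand-in for a real nova entry point.
class SomeService(object):
    def build_instance(self):
        return "built"

SomeService.build_instance = timed("build_instance")(SomeService.build_instance)
```

Because the patch is applied at import or startup time, removing the instrumentation is just a matter of not applying the wrapper, leaving the production code untouched.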
Even though Network Time Protocol was used to synchronize the nodes in the distributed cluster, its size made the synchronization precision inadequate to study fine-grained timing. To resolve the issue, we devised a method to adjust timestamps using causal constraint heuristics.
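One simple causal-constraint heuristic, shown here only to illustrate the idea rather than as the exact method we used: a message cannot be received before it is sent, so every cross-node send/receive pair bounds the receiver's clock offset from below, and the tightest bound yields a correction to apply to that node's timestamps.

```python
def estimate_offset(pairs):
    """Estimate a lower-bound clock offset for a receiver node.

    `pairs` is a list of (send_ts, recv_ts) tuples for messages from a
    reference node to the receiver, each stamped by its local clock.
    Causality requires recv_ts + offset >= send_ts, so the offset must be
    at least max(send_ts - recv_ts); we take that tightest bound.
    """
    return max(s - r for s, r in pairs)

def adjust(timestamps, offset):
    """Shift a node's local timestamps onto the reference clock."""
    return [t + offset for t in timestamps]
```

For example, if messages sent at local times 10.0 and 11.0 were logged as received at 9.2 and 10.5 on the remote node, the offsets implied are 0.8 and 0.5, so the receiver's clock is corrected forward by at least 0.8 seconds.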
2.3 Virtual Machine Launch Latency

We focused on virtual machine (VM) launch latency as the most important feature of an Infrastructure-as-a-Service (IaaS) platform. Launch latency also involves most core OpenStack services and their components. Our work seeks to quantify performance as a function of the number of concurrent requests.
2.4 OpenStack Nova Architecture

To facilitate future discussions, we briefly delve into the OpenStack nova architecture, which handles VM provisioning. Incoming requests first arrive at the nova-api service, which in turn hands off the task to the nova-conductor service, a layer of software that envelopes the database. The nova-conductor invokes the nova-scheduler, which attempts to identify a hospitable host for the workload. Once it identifies a host, the scheduler transfers control to the nova-compute service running on the selected host to launch the workload. The orange state transitions in Figure 2-1 (below) represent error states that might stem from incorrectly formulated requests, input constraints that cannot be satisfied, and/or a lack of resources.

Figure 2-1: Architecture
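For orientation, the happy path described above can be reduced to a small transition table (a sketch only; the real state machine in Figure 2-1 carries additional error states and transitions):

```python
# Happy-path flow of a VM launch request through the nova services.
FLOW = {
    "request": "nova-api",
    "nova-api": "nova-conductor",       # API validates and hands off
    "nova-conductor": "nova-scheduler", # conductor envelopes the database
    "nova-scheduler": "nova-compute",   # scheduler picks a hospitable host
    "nova-compute": "active",           # selected host spawns the VM
}

def trace(start="request"):
    """Follow the happy path from `start` until the VM is active."""
    path, state = [start], start
    while state in FLOW:
        state = FLOW[state]
        path.append(state)
    return path
```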
2.5 Experiments

We study performance under various load conditions, in particular handling concurrent VM launch requests ranging from five to 2000.

We further explore performance under the same load conditions but using different types and numbers of nova-scheduler services, to handle the volume of requests through horizontal scaling.

Using Zabbix [5], an open source monitoring tool, we study system resource consumption across the various cloud components with respect to time once requests enter the system.

Most valuable of all, we analyzed the logs over four separate runs, each a 20-minute time window with 2000 concurrent requests. We collected data on how many requests completed successfully, how many failed, and in what manner.

In the next section we share our findings.
3 Performance Test Insights

3.1 Execution Times under Varying Request Pressure

Figure 3-1 shows nova component-level costs when the cloud handles five to 2000 concurrent launch requests. Comparing our results to those of an earlier study [2] with simulated compute hosts, we found major differences with respect to the cost in compute services and the scheduler saturation point. Real compute nodes (i.e., real hypervisors) take significantly longer to spawn VMs than we assumed, driving up the compute-host time component. Observing best practices for production environments, the message queue and database services were deployed in clustered configurations to deliver the required stability and performance. Note that the scheduler service time increases with load, but plateaus around 800 concurrent requests. As the number of concurrent requests rises beyond 800, saturating the scheduler, the cost of message handling increases. While the nova-api service time rises with load, it never consumes more time than the scheduler component. The cloud succeeds in handling 1.78 requests per second.
Figure 3-1: Send 5 ~ 2000 concurrent requests to 530 compute nodes
3.2 Execution Times under Varying Schedulers

Next, we explore the effect of five scheduler instances, versus 10 scheduler instances, versus five instances of a new caching scheduler. The caching scheduler eliminates the need to visit the database to retrieve resource usage updates prior to scheduling.

The graphs in Figure 3-2 below support our hypothesis: using 10 scheduler instances instead of five roughly halves the time to tackle launch requests. As one might expect, using a caching scheduler radically shrinks the time for a host to schedule a workload. But introducing more scheduler instances impacts the nova-api component by causing more retries stemming from stale resource views.

The graph's red regions indicate failed schedule attempts in nova-api, and the dark grey regions indicate the time spent in retries, which is puzzling given the availability of physical resources on the compute hosts. With five filter schedulers, the failure rate reached 25.7 percent and the retry rate reached 29.0 percent. Further, the time to spawn VM images fluctuates more with a larger number of schedulers, sometimes swinging from 6.45 seconds to 354 seconds (nearly six minutes), which calls for further investigation.
Figure 3-2: Varying scheduler settings
3.3 Launch Errors – Number and Nature

To better understand the launch process and the nature of errors, we analyzed detailed logs obtained from four runs of 2000 requests each. Of the 8000 total requests, 7953 were successfully tracked, with 4646 successfully completing, yielding a success rate of only 58.45 percent.

First, to obtain the necessary time synchronization precision, we adjusted log timestamps using causality heuristics. Then we developed a detailed state machine using log events such as enter-nova-api, exit-nova-api, and others. We next traced each event in this state space and classified the errors using their message strings.
Figure 3-3: Trace failure requests
Figure 3-3 highlights a failed request in a 2000-request test run. The request with ID "p1966" failed on compute node "YW-SV153" with the error logged as "loss of database connection." The log files revealed that the failure occurred when the compute node tried to update its resource usage through nova-conductor. Further analysis revealed that nova-conductor ran into a database deadlock when it ran the "compute node update" operation. The logs show that after the request is received by the compute node, it takes approximately 30 seconds before the database operation fails, characteristic of a load-related failure.
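A minimal version of the event-tracing and classification approach might look like the following; the event names and error patterns here are illustrative stand-ins, not the exact strings from the production logs:

```python
import re

# Expected event order for a successful launch (illustrative names).
HAPPY_PATH = ["enter-nova-api", "exit-nova-api", "enter-scheduler",
              "exit-scheduler", "enter-compute", "vm-active"]

# Map error-message patterns to categories (example patterns only).
ERROR_PATTERNS = [
    (re.compile(r"does not return rows|deadlock", re.I), "database"),
    (re.compile(r"creating port", re.I), "neutron"),
    (re.compile(r"not authorized|HTTP 500", re.I), "keystone/auth"),
]

def classify(events):
    """Walk a request's event list; return 'success' or an error category."""
    step = 0
    for ev in events:
        if step < len(HAPPY_PATH) and ev == HAPPY_PATH[step]:
            step += 1
            continue
        for pattern, category in ERROR_PATTERNS:
            if pattern.search(ev):
                return category
        return "unknown"
    return "success" if step == len(HAPPY_PATH) else "incomplete"
```

Running every tracked request through such a classifier is what de-duplicates the raw log messages into the error categories of Table 1.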
The following table shows each error and its category and sub-categories based on the error message string. The state machine-based analysis helped eliminate duplicate and irrelevant error messages.

Table 1: Errors and Categories
Component Error Message Count
API This result object does not return rows. 2284
API Not authorized for image 845
API HTTP Internal Server Error (HTTP 500) 133
API Unknown 8
Compute Node neutron error creating port on network, retry 1222
Compute Node Unknown, retry 12
Compute Node Build of instance … Failure prepping block device 5
Compute Node Build of instance … Failed to allocate the network(s), not rescheduling 1
Compute Node Remote error: DB Error This connection is closed 1
3.3.1 TOP THREE ERRORS
• Database deadlocks: Deadlocks caused the majority of database failures reported by the nova-api service and occurred when running the "quota reserve" operation. After reaching the deadlock retry limit, the request failed with the error message "no rows returned."
• neutron port creation failure: A substantial portion of the retries stemmed from the OpenStack neutron service failing to allocate a port for a VM. This issue caused at least 1222 compute-node retries and one compute-node failure.
• keystone authentication failure: Every stage in the OpenStack service pipeline attempts to authenticate the request with the OpenStack identity service, keystone. Under heavy load, authentication often times out. Of the 8000 total requests, this accounted for 47 failures to authenticate due to time-outs and ultimately caused 845 image access errors and 133 HTTP internal server errors.
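The deadlock-retry behavior behind the first error can be sketched generically (nova's real retry logic lives in oslo.db; the exception class and retry limit below are illustrative): once the retry budget is exhausted, the deadlock surfaces to the caller, which is how the "no rows returned" failures arise.

```python
class DeadlockError(Exception):
    """Stand-in for a driver-level database deadlock exception."""

def with_deadlock_retry(func, max_retries=5):
    """Re-run `func` when it raises DeadlockError, up to `max_retries` tries.

    If the final attempt still deadlocks, the exception propagates to the
    caller instead of being swallowed.
    """
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except DeadlockError:
                if attempt == max_retries - 1:
                    raise
    return wrapper
```

Under heavy concurrent load on a multi-writer Galera Cluster, conflicting quota reservations kept deadlocking faster than this retry budget could absorb, which is why the configuration fix in the next section targets the write strategy rather than the retry count.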
3.4 Configuration Fixes

Further investigation revealed that the three major types of errors stemmed from misconfigurations and improper deployment decisions, not from software bugs:

• Galera Cluster configuration: Misconfiguration of the Galera Cluster caused the database deadlocks. The original configuration used cluster-wide optimistic locking, resulting in failure and roll-back of the distributed quota reservations in the nova-api services, particularly as the launch request pressure increased. Using a single-node write strategy instead of allowing writes to all the Galera nodes resolved the issue.
• Using Fernet tokens instead of PKI tokens in keystone: Fernet tokens perform better than PKI tokens because of their smaller payload and non-persistent implementation. Moving to Fernet tokens resolved the authentication failures.
• Setting "scheduler host subset size" [4] to more than the number of schedulers decreases the chances of conflict and spreads the workload among compute nodes.
• Setting "rpc response timeout" to a longer interval allows the nova-conductor to wait longer for the RPC response from the nova-schedulers in a large cluster.
• Increasing "rpc conn pool size" allows the nova-scheduler service to concurrently accept more requests.
• Increasing "max pool size" and "max overflow" in the nova-api database and in the database configuration groups (given that the scheduler service directly connects to the database) allows a larger number of simultaneous connections to the databases.
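Gathered into one place, the tuning knobs above might look like the following nova.conf fragment. The section placement follows the Newton configuration reference; the values are illustrative, not the exact production settings used in this study:

```ini
[DEFAULT]
# Consider more than one acceptable host per scheduling decision, so
# concurrent schedulers are less likely to race for the same "best" host.
scheduler_host_subset_size = 12

# Let nova-conductor wait longer for scheduler RPC replies at this scale.
rpc_response_timeout = 180

# Allow nova-scheduler to accept more concurrent RPC connections.
rpc_conn_pool_size = 60

[database]
# Raise SQLAlchemy connection-pool limits for nova's database access.
max_pool_size = 30
max_overflow = 60

[api_database]
max_pool_size = 30
max_overflow = 60
```

The Galera single-node write strategy and the keystone Fernet token switch are made in the database load-balancer and keystone configuration respectively, outside nova.conf.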
Figure 3-4: Service quality improvements, before and after
Figure 3-4 illustrates the overall improvements after troubleshooting and tuning. Resolving the three major root causes eliminated all API failures (in red) and unexpected retries (in grey). More randomly distributing workloads improved stability, including reducing the probability of port allocation time-outs. Switching to a single-node write strategy in the Galera Cluster reduced the time spent in nova scheduling, resulting in fewer time-outs, fewer rollbacks, and fewer retries.
• The failure rate in nova-api decreased from 25.7 percent to 0 percent.
• The retry rate decreased from 29.0 percent to 0.83 percent.
• The overall output increased from 1.31 requests/second to 4.35 requests/second.
3.5 Service Level Profiling

For service-level profiling we used Zabbix, an open source monitoring tool with a variety of plugins to monitor software applications such as the MySQL database and RabbitMQ, in addition to being able to monitor system-level activities of individual processes. We set up monitoring agents on the servers hosting the OpenStack controllers, keystone (identity), the database, and the message queue. We examined usage under various loads, namely idle, 800, and 2000 concurrent requests in our 1000-node cluster. Our purpose in sharing this data is to illustrate what one might anticipate in terms of resource consumption in a 1000-node cluster, in order to guide hardware sizing efforts.
Table 2: OpenStack components’ peak consumption of CPU, Memory, Network resources
Item    Idle (min value)    800 requests (peak value)    2000 requests (peak value)
Nova CPU usage (nova-api) 58% 172% 219%
Nova CPU usage (nova-scheduler) 22% 95% 95%
Nova CPU usage (nova-conductor) 81% 270% 329%
Nova Memory usage (nova-api) 60GiB
Nova Memory usage (nova-scheduler) 1.5GiB
Nova Memory usage (nova-conductor) 16GiB
Nova Incoming network (controller node) 527Kbps 35Mbps 41Mbps
Nova Outgoing network (controller node) 301Kbps 3.1Mbps 5.2Mbps
Keystone CPU usage 0% 169% 202%
Keystone Memory usage 68GiB
Keystone Incoming network (controller node) 27Kbps 6.2Mbps 12Mbps
Keystone Outgoing network (controller node) 86Kbps 3.1Mbps 8.6Mbps
Database MySQL bytes received 250KBps 9.3MBps 6.9MBps
Database MySQL bytes sent 400KBps 33MBps 39MBps
Database MySQL queries rate 92qps 40Kqps 57Kqps
Database Incoming network 1.5Mbps 32Mbps 49Mbps
Database Outgoing network 2.5Mbps 180Mbps 222Mbps
Messaging RabbitMQ receive rate 19msgs/s 1.7Kmsgs/s 2.1Kmsgs/s
Messaging RabbitMQ deliver rate 32msgs/s 6.4Kmsgs/s 7.5Kmsgs/s
Messaging RabbitMQ file descriptors used 8.2K
Messaging RabbitMQ sockets used 8.1K
Messaging Incoming network 6.5Mbps 71Mbps 94Mbps
Messaging Outgoing network 11.9Mbps 125Mbps 179Mbps
Figure 3-5: Controller, keystone, RabbitMQ and MySQL utilization as a function of time
Figure 3-5 shows that the nova-api and keystone services consume processing resources as requests arrive into the system. As nova-scheduler resource requirements climb, they also shift further to the right in time. In addition, nova-scheduler directly communicates with the database to refresh its resource status cache, visible as a matched rise in database processing needs. The resource usage of nova-conductor rises only during the deployment phase (peaking at 270 percent), and thus it matches the rise of resource usage on the nova-compute nodes. Database resource usage remains high in the deployment phase because nova-conductor keeps it busy updating the compute node records. The deliver rate and receive rate of RabbitMQ start to rise after the first request leaves nova-api, and they peak when requests go to the compute nodes. The cloud ran on multi-core machines, allowing usage to exceed 100 percent. Note that because the nova-scheduler is inherently single-threaded, its CPU usage cannot exceed 100 percent.
3.6 Analysis

The system profiling illustrates the varied performance footprints of the nova services:

• nova-api imposes pressure upon the keystone and MySQL services with numerous SQL operations, indicating a high degree of difficulty for performance optimization due to complex interactions with other services.
• nova-scheduler configured with the filter scheduler driver requires outbound-intensive MySQL operations, and it cannot leverage multiple CPU cores. We began investigating a caching scheduler to reduce database burden and speed scheduling, and we have obtained promising preliminary results.
• RabbitMQ pressure: the decision delivery from nova-conductor to the compute nodes imposes pressure upon the RabbitMQ services, which seems inevitable. We propose investigating a lighter-weight messaging protocol.
• VM provisioning by the nova-compute services relies on the nova-conductor services to update the database. The number of nova-conductor services should be matched to the number of nova-compute services to avoid bottlenecks.
Zhen Yu Shen of China Mobile gives local media reporters a tour of one of several multi-thousand-node OpenStack clusters.
4 Conclusions and Future Work

Our lightweight instrumentation, along with a technique for obtaining more precise time synchronization, enabled tracking of every VM launch request in nova. By analyzing and classifying each failure, we identified three major issues and addressed them through configuration changes, obtaining a more stable and performant cloud. We experimented with different cloud deployments and analyzed their merits. We confirmed that the filter scheduler creates the most significant bottleneck in a large OpenStack cloud, and that it stems from the cache-refresh design that imposes heavy pressure on the database. Switching to a caching scheduler relieves database pressure but also carries the risk of operating with stale resource views. More efficient database indexing mechanisms show promise as a way to reduce database pressure by pruning the list of hosts before reaching out to the database. Alternatively, having the compute nodes update the schedulers on any resource change boosts performance by eliminating the calls that refresh the scheduler's cached resource view on each schedule request. For those willing to relax the need to find the optimal host for scheduling, using cells and/or partitioning the hosts into subsets can speed scheduling, especially with known short-running workloads. Other areas to explore include a lighter messaging component. Our system-level profiling of OpenStack services showed patterns, exposed potential bottlenecks, and offers areas to explore for redesign and improvement.

References:
[1] https://wiki.openstack.org/wiki/Rally
[2] https://www.openstack.org/assets/presentation-media/7129-Dive-into-nova-scheduler-performance-summit.pdf
[3] http://en.wikipedia.org/wiki/Monkey_patch
[4] http://docs.openstack.org/mitaka/config-reference/compute/scheduler.html
[5] http://www.zabbix.com/

Acknowledgements: Hao Li of China Mobile helped prepare the experiments. Qing Wang and Jian-Feng Ding of Intel organized the collaboration with China Mobile. Maggie Liang of Intel developed the case study, with Dan Fineberg editing for publication.
5 Legal Notices

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Testing methods and tools:

• Performance studied under varying load conditions, handling concurrent VM requests ranging from five to 2000.
• Performance also explored under the same load conditions but using different types and numbers of nova-scheduler services to handle the volume of requests through horizontal scaling.
• System resource consumption studied across the various cloud components using Zabbix [5], an open source monitoring tool.
• Log analysis over a 20-minute time window with 2000 concurrent requests. Data collected on how many requests completed successfully, how many failed, and in what manner.

Configuration:

• The 1000-node experiments were carried out by the authors at China Mobile's data center.
  o The 1000-node cloud runs OpenStack Newton RC2 (13.0.0.rc2).
  o It includes 530 compute hosts, five controller nodes, a three-node keystone cluster for identity and authentication services, a three-node Galera Cluster for MySQL database services, a three-node RabbitMQ cluster for inter-component messaging support, and the remaining nodes set up for storage and network services.
• Servers:
  o CPUs: 2x Intel® Xeon® Processor E5-2680 v3 (30M Cache, 2.50 GHz)
  o Memory: 128 GB RAM (8x 16 GB HP® DDR4)
  o Storage: 3 TB local disk

For more information on performance of Intel technologies, go to www.intel.com/performance. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2016 Intel Corporation