Performance Analysis and Tuning in China Mobile's 1000-Node OpenStack Cloud
October 2016
Contents

1 Introduction
2 Performance Study – What, Why and How?
  2.1 China Mobile 1000-Node Production Cluster
  2.2 Low Overhead Instrumentation
  2.3 Virtual Machine Launch Latency
  2.4 OpenStack Nova Architecture
  2.5 Experiments
3 Performance Test Insights
  3.1 Execution Times under Varying Request Pressure
  3.2 Execution Times under Varying Schedulers
  3.3 Launch Errors – Number and Nature
  3.4 Configuration Fixes
  3.5 Service Level Profiling
  3.6 Analysis
4 Conclusions and Future Work

Figures

Figure 2-1: Architecture
Figure 3-1: Send 5 ~ 2000 concurrent requests to 530 compute nodes
Figure 3-2: Varying scheduler settings
Figure 3-3: Trace failure requests
Figure 3-4: Service quality improvements, before and after
Figure 3-5: Controller, keystone, RabbitMQ and MySQL utilization as a function of time

Tables

Table 1: Errors and Categories
Table 2: OpenStack components' peak consumption of CPU, memory & network resources
1 INTRODUCTION

Developers, architects and operators can face challenges when analyzing OpenStack* cloud software performance. Those who develop, deploy and operate cloud infrastructures must ensure their clouds meet service-level agreements (SLAs). Developers want to quickly locate and resolve design and implementation issues that limit cloud scaling. Deployers want to set up services, networks, and nodes to meet scale needs. Operators also want to establish the real capacity of their cloud under heavy workload conditions. Organizations considering deploying a private cloud need data to make educated decisions about whether OpenStack solutions can meet their needs.

OpenStack black-box testing often uses the rally [1] project, but fine-grained analysis requires constant polling, which would introduce an artificial load on the system and muddy the analysis. An alternative is to simulate the test system. While attractive in needing less hardware, simulation requires accurate modelling of inter-component interactions and of the time and resources spent in each component. For this case study, we were fortunate to gain access to China Mobile's production environment before it went live, which allowed us to test on real hardware. We then introduced limited extra logging to minimize any artificial system load.

This paper shares lessons learned from a careful, component-level analysis of China Mobile's 1000-node OpenStack cloud. We uncovered three major issues in the course of this study, all addressed through OpenStack configuration changes. The result: a more stable and performant China Mobile OpenStack cloud, and insights into scale bottlenecks and areas for future work.
2 Performance Study – What, Why and How?
2.1 China Mobile 1000-Node Production Cluster

China Mobile's 1000-node production cloud runs OpenStack Newton RC2 (13.0.0.rc2). The cloud includes 530 compute hosts, five controller nodes, a three-node keystone cluster for identity and authentication services, a three-node active-active Galera Cluster for MySQL database services, a three-node RabbitMQ cluster for inter-component messaging support, and the remaining nodes set up for storage and network services. Note: because the cloud runs on multi-core machines, multi-threaded processes can consume more than 100% of CPU. Server configuration: 2x Intel® Xeon® Processor E5-2680 v3 (30M Cache, 2.50 GHz), 128 GB RAM (8x 16 GB HP® DDR4), 3 TB disk.
2.2 Low Overhead Instrumentation

Given the "slated-for-production" status of the cluster under test, we introduced additional logging, taking care to minimize overhead. That enabled us to further analyze the logs as a post-processing step, without introducing latencies during execution.

Python-based OpenStack code lends itself to monkey-patching [3], enabling the easy introduction and elimination of monitoring instrumentation.
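As a sketch of the technique (not the actual instrumentation used in this study; the service and method names below are hypothetical stand-ins), a timing wrapper can be patched over a target method at runtime and removed just as easily:

```python
import functools
import logging
import time

log = logging.getLogger("perf-trace")

def timed(label):
    """Return a decorator that logs wall-clock entry/exit times for `label`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.info("enter %s %f", label, time.time())
            try:
                return func(*args, **kwargs)
            finally:
                log.info("exit %s %f", label, time.time())
        return wrapper
    return decorator

# Monkey-patch: swap a target method for its timed version at runtime.
# `SomeService.build_instance` is a stand-in for a real nova entry point.
class SomeService(object):
    def build_instance(self):
        return "built"

SomeService.build_instance = timed("build_instance")(SomeService.build_instance)
```

Because the patch is applied at import or startup time, removing the instrumentation is just a matter of not applying the wrapper, leaving the production code untouched.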
Even though Network Time Protocol was used to synchronize the nodes in the distributed cluster, its size made the synchronization precision inadequate to study fine-grained timing. To resolve the issue, we devised a method to adjust timestamps using causal constraint heuristics.
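One simple causal-constraint heuristic, shown here only to illustrate the idea rather than as the exact method we used: a message cannot be received before it is sent, so every cross-node send/receive pair bounds the receiver's clock offset from below, and the tightest bound yields a correction to apply to that node's timestamps.

```python
def estimate_offset(pairs):
    """Estimate a lower-bound clock offset for a receiver node.

    `pairs` is a list of (send_ts, recv_ts) tuples for messages from a
    reference node to the receiver, each stamped by its local clock.
    Causality requires recv_ts + offset >= send_ts, so the offset must be
    at least max(send_ts - recv_ts); we take that tightest bound.
    """
    return max(s - r for s, r in pairs)

def adjust(timestamps, offset):
    """Shift a node's local timestamps onto the reference clock."""
    return [t + offset for t in timestamps]
```

For example, if messages sent at local times 10.0 and 11.0 were logged as received at 9.2 and 10.5 on the remote node, the offsets implied are 0.8 and 0.5, so the receiver's clock is corrected forward by at least 0.8 seconds.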
2.3 Virtual Machine Launch Latency

We focused on virtual machine (VM) launch latency as the most important feature of an Infrastructure-as-a-Service (IaaS) platform. Launch latency also involves most core OpenStack services and their components. Our work seeks to quantify performance as a function of the number of concurrent requests.
2.4 OpenStack Nova Architecture

To facilitate future discussions, we briefly delve into the OpenStack nova architecture, which handles VM provisioning. Incoming requests first arrive at the nova-api service, which in turn hands off the task to the nova-conductor service, a layer of software that envelopes the database. The nova-conductor invokes the nova-scheduler, which attempts to identify a hospitable host for the workload. Once it identifies a host, the scheduler transfers control to the nova-compute service running on the selected host to launch the workload. The orange state transitions in Figure 2-1 (below) represent error states that might stem from incorrectly formulated requests, input constraints that cannot be satisfied, and/or a lack of resources.

Figure 2-1: Architecture
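For orientation, the happy path described above can be reduced to a small transition table (a sketch only; the real state machine in Figure 2-1 carries additional error states and transitions):

```python
# Happy-path flow of a VM launch request through the nova services.
FLOW = {
    "request": "nova-api",
    "nova-api": "nova-conductor",       # API validates and hands off
    "nova-conductor": "nova-scheduler", # conductor envelopes the database
    "nova-scheduler": "nova-compute",   # scheduler picks a hospitable host
    "nova-compute": "active",           # selected host spawns the VM
}

def trace(start="request"):
    """Follow the happy path from `start` until the VM is active."""
    path, state = [start], start
    while state in FLOW:
        state = FLOW[state]
        path.append(state)
    return path
```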
2.5 Experiments

We study performance under various load conditions, in particular handling concurrent VM launch requests ranging from five to 2000.

We further explore performance under the same load conditions but using different types and numbers of nova-scheduler services, to handle the volume of requests through horizontal scaling.

Using Zabbix [5], an open source monitoring tool, we study system resource consumption across the various cloud components with respect to time once requests enter the system.

Most valuable of all, we analyzed the logs over four separate runs, each a 20-minute time window with 2000 concurrent requests. We collected data on how many requests completed successfully, how many failed, and in what manner.

In the next section we share our findings.
3 Performance Test Insights

3.1 Execution Times under Varying Request Pressure

Figure 3-1 shows nova component-level costs when the cloud handles five to 2000 concurrent launch requests. Comparing our results to those of an earlier study [2] with simulated compute hosts, we found major differences with respect to the cost in compute services and the scheduler saturation point. Real compute nodes (i.e., real hypervisors) take significantly longer to spawn VMs than we assumed, driving up the compute-host time component. Observing best practices for production environments, the message queue and database services were deployed in clustered configurations to deliver the required stability and performance. Note that the scheduler service time increases with load, but plateaus around 800 concurrent requests. As the number of concurrent requests rises beyond 800, saturating the scheduler, the cost of message handling increases. While the nova-api service time rises with load, it never consumes more time than the scheduler component. The cloud succeeds in handling 1.78 requests per second.
Figure 3-1: Send 5 ~ 2000 concurrent requests to 530 compute nodes
3.2 Execution Times under Varying Schedulers

Next, we explore the effect of five scheduler instances, versus 10 scheduler instances, versus five instances of a new caching scheduler. The caching scheduler eliminates the need to visit the database to retrieve resource usage updates prior to scheduling.

The graphs in Figure 3-2 below support our hypothesis: using 10 scheduler instances instead of five roughly halves the time to tackle launch requests. As one might expect, using a caching scheduler radically shrinks the time for a host to schedule a workload. But introducing more scheduler instances impacts the nova-api component by causing more retries stemming from stale resource views.

The graph's red regions indicate failed schedule attempts in nova-api, and the dark grey regions indicate the time spent in retries, which is puzzling given the availability of physical resources on the compute hosts. With five filter schedulers, the failure rate reached 25.7 percent and the retry rate reached 29.0 percent. Further, the time to spawn VM images fluctuates more with a larger number of schedulers, sometimes swinging from 6.45 seconds to 354 seconds (nearly six minutes), which calls for further investigation.
Figure 3-2: Varying scheduler settings
3.3 Launch Errors – Number and Nature

To better understand the launch process and the nature of errors, we analyzed detailed logs obtained from four runs of 2000 requests each. Of the 8000 total requests, 7953 were successfully tracked, with 4646 successfully completing, yielding a success rate of only 58.45 percent.

First, to obtain the necessary time synchronization precision, we adjusted log timestamps using causality heuristics. Then we developed a detailed state machine using log events such as enter-nova-api, exit-nova-api, and others. We next traced each event in this state space and classified the errors using their message strings.
Figure 3-3: Trace failure requests
Figure 3-3 highlights a failed request in a 2000-request test run. The request with ID "p1966" failed on compute node "YW-SV153" with the error logged as "loss of database connection." The log files revealed that the failure occurred when the compute node tried to update its resource usage through nova-conductor. Further analysis revealed that nova-conductor ran into a database deadlock when it ran the "compute node update" operation. The logs show that after the request is received by the compute node, it takes approximately 30 seconds before the database operation fails, characteristic of a load-related failure.
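A minimal version of the event-tracing and classification approach might look like the following; the event names and error patterns here are illustrative stand-ins, not the exact strings from the production logs:

```python
import re

# Expected event order for a successful launch (illustrative names).
HAPPY_PATH = ["enter-nova-api", "exit-nova-api", "enter-scheduler",
              "exit-scheduler", "enter-compute", "vm-active"]

# Map error-message patterns to categories (example patterns only).
ERROR_PATTERNS = [
    (re.compile(r"does not return rows|deadlock", re.I), "database"),
    (re.compile(r"creating port", re.I), "neutron"),
    (re.compile(r"not authorized|HTTP 500", re.I), "keystone/auth"),
]

def classify(events):
    """Walk a request's event list; return 'success' or an error category."""
    step = 0
    for ev in events:
        if step < len(HAPPY_PATH) and ev == HAPPY_PATH[step]:
            step += 1
            continue
        for pattern, category in ERROR_PATTERNS:
            if pattern.search(ev):
                return category
        return "unknown"
    return "success" if step == len(HAPPY_PATH) else "incomplete"
```

Running every tracked request through such a classifier is what de-duplicates the raw log messages into the error categories of Table 1.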
The following table shows each error and its category and sub-categories based on the error message string. The state machine-based analysis helped eliminate duplicate and irrelevant error messages.

Table 1: Errors and Categories
Component Error Message Count
API This result object does not return rows. 2284
API Not authorized for image 845
API HTTP Internal Server Error (HTTP 500) 133
API Unknown 8
Compute Node neutron error creating port on network, retry 1222
Compute Node Unknown, retry 12
Compute Node Build of instance … Failure prepping block device 5
Compute Node Build of instance … Failed to allocate the network(s), not rescheduling 1
Compute Node Remote error: DB Error This connection is closed 1
3.3.1 TOP THREE ERRORS
• Database deadlocks: Deadlocks caused the majority of database failures reported by the nova-api service and occurred when running the "quota reserve" operation. After reaching the deadlock retry limit, the request failed with the error message "no rows returned."
• neutron port creation failure: A substantial portion of the retries stemmed from the OpenStack neutron service failing to allocate a port for a VM. This issue caused at least 1222 compute-node retries and one compute-node failure.
• keystone authentication failure: Every stage in the OpenStack service pipeline attempts to authenticate the request with the OpenStack identity service, keystone. Under heavy load, authentication often times out. Of the 8000 total requests, this accounted for 47 failures to authenticate due to time-outs and ultimately caused 845 image access errors and 133 HTTP internal server errors.
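The deadlock-retry behavior behind the first error can be sketched generically (nova's real retry logic lives in oslo.db; the exception class and retry limit below are illustrative): once the retry budget is exhausted, the deadlock surfaces to the caller, which is how the "no rows returned" failures arise.

```python
class DeadlockError(Exception):
    """Stand-in for a driver-level database deadlock exception."""

def with_deadlock_retry(func, max_retries=5):
    """Re-run `func` when it raises DeadlockError, up to `max_retries` tries.

    If the final attempt still deadlocks, the exception propagates to the
    caller instead of being swallowed.
    """
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except DeadlockError:
                if attempt == max_retries - 1:
                    raise
    return wrapper
```

Under heavy concurrent load on a multi-writer Galera Cluster, conflicting quota reservations kept deadlocking faster than this retry budget could absorb, which is why the configuration fix in the next section targets the write strategy rather than the retry count.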
3.4 Configuration Fixes

Further investigation revealed that the three major types of errors stemmed from misconfigurations and improper deployment decisions, not from software bugs:

• Galera Cluster configuration: Misconfiguration of the Galera Cluster caused the database deadlocks. The original configuration used cluster-wide optimistic locking, resulting in failure and roll-back of the distributed quota reservations in the nova-api services, particularly as the launch request pressure increased. Using a single-node write strategy instead of allowing writes to all the Galera nodes resolved the issue.
• Using Fernet tokens instead of PKI tokens in keystone: Fernet tokens perform better than PKI tokens because of their smaller payload and non-persistent implementation. Moving to Fernet tokens resolved the authentication failures.
• Setting "scheduler host subset size" [4] to more than the number of schedulers decreases the chances of conflict and spreads the workload among compute nodes.
• Setting "rpc response timeout" to a longer interval allows the nova-conductor to wait longer for the RPC response from the nova-schedulers in a large cluster.
• Increasing "rpc conn pool size" allows the nova-scheduler service to concurrently accept more requests.
• Increasing "max pool size" and "max overflow" in the nova-api database and in the database configuration groups (given that the scheduler service directly connects to the database) allows a larger number of simultaneous connections to the databases.
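Gathered into one place, the tuning knobs above might look like the following nova.conf fragment. The section placement follows the Newton configuration reference; the values are illustrative, not the exact production settings used in this study:

```ini
[DEFAULT]
# Consider more than one acceptable host per scheduling decision, so
# concurrent schedulers are less likely to race for the same "best" host.
scheduler_host_subset_size = 12

# Let nova-conductor wait longer for scheduler RPC replies at this scale.
rpc_response_timeout = 180

# Allow nova-scheduler to accept more concurrent RPC connections.
rpc_conn_pool_size = 60

[database]
# Raise SQLAlchemy connection-pool limits for nova's database access.
max_pool_size = 30
max_overflow = 60

[api_database]
max_pool_size = 30
max_overflow = 60
```

The Galera single-node write strategy and the keystone Fernet token switch are made in the database load-balancer and keystone configuration respectively, outside nova.conf.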
Figure 3-4: Service quality improvements, before and after
Figure 3-4 illustrates the overall improvements after troubleshooting and tuning. Resolving the three major root causes eliminated all API failures (in red) and unexpected retries (in grey). More randomly distributing workloads improved stability, including reducing the probability of port allocation time-outs. Switching to a single-node write strategy in the Galera Cluster reduced the time spent in nova scheduling, resulting in fewer time-outs, fewer rollbacks, and fewer retries.
• The failure rate in nova-api decreased from 25.7 percent to 0 percent.
• The retry rate decreased from 29.0 percent to 0.83 percent.
• The overall output increased from 1.31 requests/second to 4.35 requests/second.
3.5 Service Level Profiling

For service-level profiling we used Zabbix, an open source monitoring tool with a variety of plugins to monitor software applications such as the MySQL database and RabbitMQ, in addition to being able to monitor system-level activities of individual processes. We set up monitoring agents on the servers hosting the OpenStack controllers, keystone (identity), the database, and the message queue. We examined usage under various loads, namely idle, 800, and 2000 concurrent requests in our 1000-node cluster. Our purpose in sharing this data is to illustrate what one might anticipate in terms of resource consumption in a 1000-node cluster, in order to guide hardware sizing efforts.
Table 2: OpenStack components’ peak consumption of CPU, Memory, Network resources
Item    Idle (min value)    800 requests (peak value)    2000 requests (peak value)
Nova CPU usage (nova-api) 58% 172% 219%
Nova CPU usage (nova-scheduler) 22% 95% 95%
Nova CPU usage (nova-conductor) 81% 270% 329%
Nova Memory usage (nova-api) 60GiB
Nova Memory usage (nova-scheduler) 1.5GiB
Nova Memory usage (nova-conductor) 16GiB
Nova Incoming network (controller node) 527Kbps 35Mbps 41Mbps
Nova Outgoing network (controller node) 301Kbps 3.1Mbps 5.2Mbps
Keystone CPU usage 0% 169% 202%
Keystone Memory usage 68GiB
Keystone Incoming network (controller node) 27Kbps 6.2Mbps 12Mbps
Keystone Outgoing network (controller node) 86Kbps 3.1Mbps 8.6Mbps
Database MySQL bytes received 250KBps 9.3MBps 6.9MBps
Database MySQL bytes sent 400KBps 33MBps 39MBps
Database MySQL queries rate 92qps 40Kqps 57Kqps
Database Incoming network 1.5Mbps 32Mbps 49Mbps
Database Outgoing network 2.5Mbps 180Mbps 222Mbps
Messaging RabbitMQ receive rate 19msgs/s 1.7Kmsgs/s 2.1Kmsgs/s
Messaging RabbitMQ deliver rate 32msgs/s 6.4Kmsgs/s 7.5Kmsgs/s
Messaging RabbitMQ file descriptors used 8.2K
Messaging RabbitMQ sockets used 8.1K
Messaging Incoming network 6.5Mbps 71Mbps 94Mbps
Messaging Outgoing network 11.9Mbps 125Mbps 179Mbps
Figure 3-5: Controller, keystone, RabbitMQ and MySQL utilization as a function of time
Figure 3-5 shows that the nova-api and keystone services consume processing resources as requests arrive into the system. As nova-scheduler resource requirements climb, they also shift further to the right in time. In addition, nova-scheduler directly communicates with the database to refresh its resource status cache, visible as a matched rise in database processing needs. The resource usage of nova-conductor rises only during the deployment phase (peaking at 270 percent), and thus it matches the rise of resource usage on the nova-compute nodes. Database resource usage remains high in the deployment phase because nova-conductor keeps it busy updating the compute node records. The deliver rate and receive rate of RabbitMQ start to rise after the first request leaves nova-api, and they peak when requests go to the compute nodes. The cloud ran on multi-core machines, allowing usage to exceed 100 percent. Note that because the nova-scheduler is inherently single-threaded, its CPU usage cannot exceed 100 percent.
3.6 Analysis

The system profiling illustrates the varied performance footprints of the nova services:

• nova-api imposes pressure upon the keystone and MySQL services with numerous SQL operations, indicating a high degree of difficulty for performance optimization due to complex interactions with other services.
• nova-scheduler configured with the filter scheduler driver requires outbound-intensive MySQL operations, and it cannot leverage multiple CPU cores. We began investigating a caching scheduler to reduce database burden and speed scheduling, and we have obtained promising preliminary results.
• RabbitMQ pressure: the decision delivery from nova-conductor to the compute nodes imposes pressure upon the RabbitMQ services, which seems inevitable. We propose investigating a lighter-weight messaging protocol.
• VM provisioning by the nova-compute services relies on the nova-conductor services to update the database. The number of nova-conductor services should be matched to the number of nova-compute services to avoid bottlenecks.
Zhen Yu Shen of China Mobile gives local media reporters a tour of one of several multi-thousand-node OpenStack clusters.
4 Conclusions and Future Work

Our lightweight instrumentation, along with a technique for obtaining more precise time synchronization, enabled tracking of every VM launch request in nova. By analyzing and classifying each failure, we identified three major issues and addressed them through configuration changes, obtaining a more stable and performant cloud. We experimented with different cloud deployments and analyzed their merits. We confirmed that the filter scheduler creates the most significant bottleneck in a large OpenStack cloud, and that it stems from the cache-refresh design that imposes heavy pressure on the database. Switching to a caching scheduler relieves database pressure but also carries the risk of operating with stale resource views. More efficient database indexing mechanisms show promise as a way to reduce database pressure by pruning the list of hosts before reaching out to the database. Alternatively, having the compute nodes update the schedulers on any resource change boosts performance by eliminating the calls that refresh the scheduler's cached resource view on each schedule request. For those willing to relax the need to find the optimal host for scheduling, using cells and/or partitioning the hosts into subsets can speed scheduling, especially with known short-running workloads. Other areas to explore include a lighter messaging component. Our system-level profiling of OpenStack services showed patterns, exposed potential bottlenecks, and offers areas to explore for redesign and improvement.

References:
[1] https://wiki.openstack.org/wiki/Rally
[2] https://www.openstack.org/assets/presentation-media/7129-Dive-into-nova-scheduler-performance-summit.pdf
[3] http://en.wikipedia.org/wiki/Monkey_patch
[4] http://docs.openstack.org/mitaka/config-reference/compute/scheduler.html
[5] http://www.zabbix.com/

Acknowledgements: Hao Li of China Mobile helped prepare the experiments. Qing Wang and Jian-Feng Ding of Intel organized the collaboration with China Mobile. Maggie Liang of Intel developed the case study, with Dan Fineberg editing for publication.
5 Legal Notices

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Testing methods and tools:

• Performance studied under varying load conditions, handling concurrent VM requests ranging from five to 2000.
• Performance also explored under the same load conditions but using different types and numbers of nova-scheduler services to handle the volume of requests through horizontal scaling.
• System resource consumption studied across the various cloud components using Zabbix [5], an open source monitoring tool.
• Log analysis over a 20-minute time window with 2000 concurrent requests. Data collected on how many requests completed successfully, how many failed, and in what manner.

Configuration:

• The 1000-node experiments were carried out by the authors at China Mobile's data center.
  o The 1000-node cloud runs OpenStack Newton RC2 (13.0.0.rc2).
  o It includes 530 compute hosts, five controller nodes, a three-node keystone cluster for identity and authentication services, a three-node Galera Cluster for MySQL database services, a three-node RabbitMQ cluster for inter-component messaging support, and the remaining nodes set up for storage and network services.
• Servers:
  o CPUs: 2x Intel® Xeon® Processor E5-2680 v3 (30M Cache, 2.50 GHz)
  o Memory: 128 GB RAM (8x 16 GB HP® DDR4)
  o Storage: 3 TB local disk

For more information on performance of Intel technologies, go to www.intel.com/performance. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2016 Intel Corporation