Experience of Development and Deployment of a Large-Scale Ceph-Based Data Storage System at RAL
Bruno Canning, Scientific Computing Department
UK Tier 1, STFC Rutherford Appleton Laboratory
HEPiX Fall 2016, LBNL
Outline
• A bit about me
• Brief recap:
  – Ceph work in SCD at RAL
  – Echo cluster design brief
  – Cluster components and services offered
• Our experience of developing and deploying Ceph-based storage systems
  – Not a status report but a presentation of our experience
Hello!
• LHC Data Store System Administrator
  – 3 years at RAL in SCD
  – Based in Data Services Group, who provide Tier 1 storage, yet embedded in Fabric Team
  – Previously worked on CASTOR, now work on Ceph
  – Specialise in:
    • Linux system administration
    • Configuration management
    • Data storage
    • Server hardware
    • Fabric support
The Story so Far
• Two pre-production Ceph clusters in SCD: Echo and Sirius
• Echo designed for high-bandwidth file/object storage
  – LHC and non-LHC VOs
• Sirius designed for low-latency block device storage
  – Departmental cloud service and local facilities
• Use case and architecture of both discussed previously in greater detail by our James Adams:
  – https://indico.cern.ch/event/466991/contributions/2136880/
Echo Cluster Components
• Cluster managed via Quattor and its Aquilon framework
• 3 × monitor nodes (ceph-mon)
  – First monitor is also Ceph's deploy host
  – ncm-ceph automates ceph-deploy
    • Prepares and distributes ceph.conf to all nodes
    • Manages the CRUSH map and OSD deployment
• 3 × gateway nodes (xrootd, globus-gridftp-server, ceph-radosgw)
• 63 × storage nodes (ceph-osd)
  – 36 × 6.0 TB HDD for data, total count = 2268
  – Capacity: 12.1 PiB (13.6 PB) raw; 8.8 PiB (9.9 PB) usable
• No SRM
XrootD Plugin
• XrootD plugin for Ceph written by Sébastien Ponce, CERN
  – https://indico.cern.ch/event/330212/contributions/1718786/
• Uses libradosstriper, contributed by Sébastien to the Ceph project
  – Available since the Giant release
• Plugin itself contributed to the XrootD project
• First demonstration by our George Vasilakakos with the Giant release in the first half of 2015, but not packaged
• The xrootd-ceph RPM is not distributed; the version of Ceph in EPEL (Firefly release) predates libradosstriper
• Needed to build xrootd RPMs against Ceph development packages to get the xrootd-ceph plugin (see the sketch below)
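A minimal sketch of that build workaround, assuming a Ceph release repository is configured and upstream package names (our exact build recipe is not shown here):

    # Install libradosstriper headers from a Ceph repo; EPEL's Firefly
    # packages predate libradosstriper, so they are not sufficient:
    yum install -y librados2-devel libradosstriper1-devel
    # Rebuild the xrootd source RPM; with the striper headers present,
    # the build also produces the xrootd-ceph plugin package:
    rpmbuild --rebuild xrootd-<version>.src.rpm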
Starting the XrootD Daemon
• Following the upgrade to the Ceph Hammer release, the xrootd daemon will not start and reports an LTTng error
• Caused by the registration of tracepoint probes with the same name in both the rados and libradosstriper code by their respective contributors
• Patch contributed (May 2015), merged into master after 0.94.5 was released (October 2015)
• Needed to build Ceph RPMs to get the patches and meet testing deadlines
• Demonstrated working xrdcp in and out of the cluster (November 2015) with X.509 authentication
• Patches incorporated into the 0.94.6 release (February 2016)
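For illustration, the demonstration amounted to copies of this shape (gateway hostname and pool/path are hypothetical, not our production names):

    # Copy a file into the cluster through the XrootD gateway, then back out:
    xrdcp /tmp/testfile root://gw.example.ac.uk//vo_pool:test/testfile
    xrdcp root://gw.example.ac.uk//vo_pool:test/testfile /tmp/testfile.out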
Data Integrity – XrootD (1)
• Problems with files copied out with multiple streams
• Incorrect checksum of large (4 GiB, DVD ISO) files returned in 81% of 300 attempts, but always the same size as the original
• Created a test file containing 1069 × 4 MiB blocks
• Test file: 16-byte lines of text in 1069 × 4 MiB blocks; can identify the line number in a block and the block number (a generator is sketched after this list)
• Identified the following:
  – 1 to 3 blocks contained incorrect data, appearing at random
  – Incorrect blocks contained 2 MiB duplicated from another block
  – Overlapped exactly with either the 1st or 2nd half of a block
  – Typically from the next block, but could be from up to 11 blocks behind or 12 blocks in front of the bad block
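A minimal sketch of such a self-describing test file (our actual tool is not shown; any generator with this layout works): every 16-byte line records its block and line number, so a duplicated 2 MiB half-block immediately reveals where it came from.

    # 1069 blocks x 4 MiB, each block made of 16-byte lines "<block> <line>\n"
    # (4 MiB / 16 B = 262144 lines per block):
    awk 'BEGIN { for (b = 0; b < 1069; b++)
                   for (l = 0; l < 262144; l++)
                     printf "%06d %08d\n", b, l }' > testfile.dat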
Data Integrity – XrootD (2)
• Communicated findings to Sébastien
• He was quickly able to reproduce the problem
• He determined that a race condition occurred due to the use of non-atomic operations in a threaded context
• Patch committed to GitHub; rebuilt the XrootD RPMs at RAL; problem solved
• Resolved in one week after initial contact with Sébastien
• Great team effort and collaboration with partners
• Happy sysadmins and relieved project managers at RAL
GridFTP Plugin
• Started by Sébastien Ponce, continued by Ian Johnson at RAL
  – https://github.com/ijjorama/gridFTPCephPlugin
• Also uses libradosstriper
• Uses XrootD's AuthDB for authorisation (an illustrative fragment follows this list)
• Uses xrdacctest to return the authorisation decision
• No problems with single-stream functionality…
• …but out-of-order data delivery with parallel writes in MODE E, as used by FTS transfers, is problematic
• Erasure-coded pools in Ceph don't support partial writes, hence they don't support non-contiguous writing
• We now have a fix that is undergoing testing
Plugin Summary
• XrootD and GridFTP plugins available and maturing
• Plugins talk directly to Ceph
• Plugins are interoperable
• Removed the requirement for a leading '/' in the object name from both
  – Comes from a history of use with POSIX filesystems
  – It would require VO pool names to have a leading '/'
  – This character is supported, but we are concerned this may change
  – The GridFTP plugin works around this
  – XrootD added support for object IDs
• We have working authentication and authorisation with both
Ceph Version Upgrades
• Typical sequence*: monitors, storage nodes, metadata servers (if present), then gateways
• Restart the daemons on one node type, then upgrade the next type
• Performed whilst the service is online
• Very easy from one version within a release to the next
• Just change the RPM version number
• Can skip some intermediate versions
• Takes c. 30 minutes and can be performed by one sysadmin
• Much simpler than CASTOR upgrades
  – * A change in the Hammer OSDMap format with 0.94.7 requires a different order; our thanks to Dan van der Ster
  – https://www.mail-archive.com/[email protected]/msg32830.html
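On each node a point-release upgrade is essentially the following (repo pinning and health checks elided; the restart command differs between SL6 init scripts and SL7 systemd units):

    yum update -y ceph       # picks up the new pinned RPM version
    service ceph restart     # SL6; on SL7: systemctl restart ceph.target
    ceph -s                  # confirm HEALTH_OK before the next node type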
Ceph Release Upgrades
• Change the RPM version and repository
• Firefly to Giant: easy
• Giant to Hammer: easy, but introduced the LTTng problem
• Hammer to Jewel: more involved (see the sketch after this list):
  – Jewel requires SL7
  – Daemons are now separate, not one daemon with many roles
  – Daemons are now managed by systemd…
  – …and run as the 'ceph' user, not the 'root' user
  – Needed to define ulimits for the 'ceph' user (nproc, nofile)
  – Change ownership of /var/lib/ceph
• Upgraded to SL7 with Hammer first, then moved to Jewel
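A sketch of the Jewel-specific host preparation (the limit values are illustrative, not our production numbers):

    # /etc/security/limits.d/ceph.conf
    ceph  soft  nproc   1048576
    ceph  hard  nproc   1048576
    ceph  soft  nofile  1048576
    ceph  hard  nofile  1048576

    # Daemons now run as 'ceph' rather than 'root', so the state
    # directories must change hands before the daemons start:
    chown -R ceph:ceph /var/lib/ceph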
New Operating System – Installations
• Upgrade to Jewel required us to support a new OS major version before we were really ready for it
• Work required to adapt the deployment configuration:
  – Previously reliant on DHCP for network config during PXE and kickstart; Quattor then updates this to the proper, static config
  – New (again) NIC naming convention with SL7; we control NIC names throughout the entire installation (kernel arguments sketched below)
  – However, nodes in subnets other than the deploy server's also need routing information in order to install; currently not configured in pxelinux.cfg or kickstart
• SL7 installations still need supervising and often need help; SL6 installations just work
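For illustration, the installer can be steered with kernel arguments along these lines (addresses are documentation placeholders, not our site config):

    # pxelinux.cfg fragment: pin legacy NIC names and give the installer a
    # static address plus gateway so off-subnet nodes can reach the deploy server:
    append ... net.ifnames=0 biosdevname=0 \
        ip=192.0.2.10::192.0.2.1:255.255.255.0:node1:eth0:none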
New Operating System – Site Config
• Work required to adapt the site-wide configuration for SL7 and for Ceph
  – /etc/rsyslog.conf (a sketch follows this list)
    • We send core system logs and certain application logs to central loggers
    • Fewer modules are loaded by default with SL7; required modules (e.g. imuxsock) must be declared in the file, modules for the systemd journal must be added (imjournal), as must other directives
  – /etc/sudoers
    • Long-standing requirement to support Ceph as a second use case of sudo
    • Used for nagios tests executed via NRPE; sudoers config in an RPM
    • Deploy host needs sudoers config on all nodes for ceph-deploy
    • ncm-ceph provides this via ncm-sudo
    • Conflict between the RPM and ncm-sudo
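A minimal rsyslog fragment of the kind required on SL7 (the central logger address is hypothetical):

    # /etc/rsyslog.conf: modules are no longer loaded implicitly
    $ModLoad imuxsock                      # local socket (e.g. logger) input
    $ModLoad imjournal                     # read from the systemd journal
    $ImjournalStateFile imjournal.state    # remember journal position
    *.info @@loggers.example.ac.uk:514     # forward to central loggers (TCP)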
New Operating System – Alerting
• Using nagios RPMs from EPEL
• Older than the latest version, but we don't need new features yet
• NSCA
  – Version difference between EPEL 6 (2.7) and EPEL 7 (2.9)
  – Versions are incompatible; the packet size expected by the server differs
  – Packaged nsca-client 2.7 for SL7 hosts
• NRPE
  – The NRPE daemon runs as the 'nrpe' user, not the 'nagios' user
  – Reconfigured NRPE to run as the 'nagios' user via a systemd unit file
  – Added a unit file under /etc/systemd/system/nrpe.service.d/ (drop-in sketched below)
  – No need to modify the existing sudoers config for SL7
• Would have preferred to upgrade the site infrastructure to SL7 first
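The override amounts to a small systemd drop-in (the file name is illustrative):

    # /etc/systemd/system/nrpe.service.d/user.conf
    # Run NRPE as 'nagios' so the existing sudoers rules keep working:
    [Service]
    User=nagios
    Group=nagios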
Upgrade to SL7 and Jewel
• After the upgrade to SL7, OSD daemons start to crash
• First OSD crashes after c. 2 days of normal operation
• Crashes continue one at a time, but hundreds can build up
• Have to periodically start OSDs, but can restore full operation with patience
• Problem became much worse during the upgrade to Jewel:
  – Could not keep all OSDs running despite best efforts
  – Over half the OSDs stopped running; client I/O halted
  – Ready to dismantle the existing cluster (FY'14 storage) and build a new cluster (FY'15 storage) anyway
  – Most interested in gaining upgrade experience
  – Decided to declare the service dead to the test community and build a new cluster
• But what was the cause?
The XFS Bug
• XFS reports a possible memory allocation deadlock in kmem_alloc on the storage nodes
• Causes XFS to hang, which causes the OSD to crash
• Bug already reported:
  – http://tracker.ceph.com/issues/6301
  – http://oss.sgi.com/pipermail/xfs/2013-December/032346.html
• Bug present in kernels shipped with SL5 and SL6, fixed in kernels shipped with SL7
• Filesystems on data disks will need to be recreated on SL7
• Will affect all storage systems
• Our thanks to Dan van der Ster at CERN for communicating this to us
Deployment of FY'15 Storage
• Supermicro: 36 HDDs on an HBA, with one internal OS HDD
• Specified SL6 for the ITT and used it for pre-deployment testing
• Pre-deployment 'hammer' testing proceeded smoothly
• Deployment into Ceph with SL7 was problematic:
  – OS disk is now the last element in the list, /dev/sdak
  – Config management assumes the OS disk is /dev/sda
  – Deploy host profile contains the disk layout of all storage nodes, read from the storage node profiles
  – Changed config management to programmatically identify the boot disk for all nodes and exclude it from the list of data disks (idea sketched after this list)
  – A simple change, but one that can change the profile of every node in the data centre
  – Test, test and test again to get it right
• Deployment delayed, but proceeded as expected once able
• Service now very stable
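The idea, expressed as a runtime sketch rather than our actual Pan/Quattor code: derive the boot disk from the mounted root filesystem instead of assuming /dev/sda.

    # Find which disk backs '/', e.g. 'sdak' on these nodes:
    BOOTDISK=$(lsblk -no PKNAME "$(findmnt -no SOURCE /)")
    # Everything else on the HBA is a candidate data disk:
    lsblk -dno NAME | grep -vx "$BOOTDISK"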
Reflections
• We can build services users want based on Ceph storage
• The challenge was always going to be technical in nature, but…
• …we did not appreciate the development effort required early on in the project
• Ceph administration is a learning curve
• Online documentation is generally good, but the "really dangerous" features proved useful to us, so it could improve further
• Outlook is promising
Acknowledgements
• RAL team:
  – George Vasilakakos, Ian Johnson, Alison Packer, James Adams, Alastair Dewhurst
  – Previous contributions: George Ryall, Tom Byrne, Shaun de Witt
• Collaborators:
  – Dan van der Ster and Sébastien Ponce at CERN
  – Andrew Hanushevsky at SLAC
  – Brian Bockelman at University of Nebraska-Lincoln
• And everyone who made this happen:

Thank you
Backup Slides
Echo Cluster
• Brief: provide a replacement for CASTOR disk-only storage
• Motivation:
  – Reaching the limitations of CASTOR performance
  – CASTOR requires both sysadmins and DBAs
  – The CASTOR system has several components
  – Small community: cannot recruit experts, and it is time-consuming to train them
  – Reduced support: CERN have moved to EOS for disk storage
• Requirements:
  – Performance must continue to scale with workload
  – Only require one specialism (sysadmins)
  – Must support established data transfer protocols used by the WLCG
• Benefits:
  – Larger community: can hire experts, reduced training time; Ceph specialises in disk storage
  – Less effort to operate and maintain
Services Offered
• Ceph 10.2.2 (Jewel, an LTS release) running on Scientific Linux 7.x
• One pool per VO, created with 8+3 erasure coding (see the sketch after this list)
• Custom gateways: GridFTP and XrootD
  – Plugin architecture; talk natively to Ceph
  – Both built on libradosstriper
  – Both endpoints are interoperable for puts/gets
• Authentication via X.509 certificate, currently a grid-mapfile
• Authorisation via XrootD's AuthDB for both GridFTP and XrootD
  – Naming scheme: <pool_name>:<space_token>/<file_name>
  – Grants rw access to production users and ro access to ordinary users for data space, and rw access to all users for scratch space
• Also offering direct S3/Swift access to trusted users
• Testing Dynafed
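Per-VO pool creation is along these lines (profile name, pool name and PG count are illustrative):

    # Define an 8+3 erasure-code profile, then create the VO's pool with it:
    ceph osd erasure-code-profile set ec8p3 k=8 m=3
    ceph osd pool create atlas 2048 2048 erasure ec8p3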
Networking Considerations
• Experience with CASTOR: the network is much like a utility to which we connect our servers, like power
• With Ceph, you need to consider networking as part of your Ceph infrastructure
• Large/busy clusters will require dual networks on the storage nodes (ceph.conf sketch after this list):
  – Public or client network
  – Cluster or private network
• Networking diagram given in the site report by Martin Bly:
  – https://indico.cern.ch/event/531810/contributions/2302103/
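In ceph.conf the split looks like this (subnets are documentation placeholders):

    [global]
    public network  = 192.0.2.0/24      # client-facing traffic
    cluster network = 198.51.100.0/24   # replication and recovery between OSDs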
The SIRIUS Incident (1)
• Occasionally all OSDs crash on one storage node that never was "right in the head"
• No previous impact on the service, but a nuisance to the service manager
• Repairs undertaken and the node returned to service, but still not fixed
• OSDs shut down, but the node not removed from the cluster
• Rolling reboots for kernel updates accidentally brought the node back into the cluster
• OSDs crash again, but this time leave 4 PGs in a stale+incomplete state
• These PGs are unable to serve I/O
• Affects the usability of every running VM; VMs need to be paused
• Tried to fix the problem, but the Ceph admin procedures do not help
• Problematic storage node removed from the cluster and new VMs created in a new pool
• However, many people's work is potentially lost
The SIRIUS Incident (2)
• Despite best efforts, no resolution
• Other deadlines divert attention
• Discussed the incident with Dan at CERN and set him up with root access
• Dan was aware of an undocumented procedure that tells the OSD to ignore history and use the best information available (sketched below)
• Performed the change on the OSDs
• PGs spend some time backfilling but eventually go active+clean
• All old VMs can now be recovered; however, six weeks have passed
• Many were not required, but some users were glad to recover work they feared was lost
• Reputation of the cloud service somewhat diminished, but we did demonstrate that we can recover from a disaster
• Personally glad we took the hard way out, as we had considered deleting the pool
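The talk does not name the setting; as an assumption on our part, the OSD option commonly cited for exactly this stale+incomplete situation is sketched here:

    # Hypothetical reconstruction (option name is our assumption, not
    # confirmed by the talk): let the affected OSDs ignore peering history
    # and take the best info available, then watch the PGs recover:
    ceph tell osd.* injectargs '--osd-find-best-info-ignore-history-les=true'
    ceph -w    # PGs backfill, eventually going active+clean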
Hardware
• 3 × monitor nodes:
  – Dell R420; CPU: 2 × Intel Xeon E5-2430 v2; RAM: 64 GiB
• 3 × gateway nodes:
  – Dell R430; CPU: 2 × Intel Xeon E5-2650 v3; RAM: 128 GiB
• 63 × storage nodes:
  – XMA (Supermicro X10DRi); CPU: as gateways; RAM: 128 GiB
  – OS disk: 1 × 233 GiB 2.5" HDD; data disks: 36 × 5.46 TiB HDD (WD6001F9YZ) via a SAS HBA
• Total raw storage = 12.1 PiB (13.6 PB)
• Usable storage = 8.8 PiB (9.9 PB)