Transcript of a 28-slide presentation given at HEPiX Fall 2016, LBNL.

Page 1: Title

Experience of Development and Deployment of a Large-Scale Ceph-Based Data Storage System at RAL

Bruno Canning, Scientific Computing Department
UK Tier 1, STFC Rutherford Appleton Laboratory
HEPiX Fall 2016, LBNL
[email protected]

Page 2: Outline

• A bit about me
• Brief recap:
  – Ceph work in SCD at RAL
  – Echo cluster design brief
  – Cluster components and services offered
• Our experience of developing and deploying Ceph-based storage systems
  – Not a status report but a presentation of our experience

Page 3: Hello!

• LHC Data Store System Administrator
  – 3 years at RAL in SCD
  – Based in Data Services Group, who provide Tier 1 storage, yet embedded in the Fabric Team
  – Previously worked on CASTOR, now work on Ceph
  – Specialise in:
    • Linux system administration
    • Configuration management
    • Data storage
    • Server hardware
    • Fabric support

Page 4: The Story so Far

• Two pre-production Ceph clusters in SCD: Echo and Sirius
• Echo designed for high-bandwidth file/object storage
  – LHC and non-LHC VOs
• Sirius designed for low-latency block device storage
  – Departmental cloud service and local facilities
• Use case and architecture of both discussed previously in greater detail by our James Adams:
  – https://indico.cern.ch/event/466991/contributions/2136880/

Page 5: Echo Cluster Components

• Cluster managed via Quattor and its Aquilon framework
• 3× monitor nodes (ceph-mon)
  – First monitor is also Ceph's deploy host
  – ncm-ceph automates ceph-deploy
    • Prepares and distributes ceph.conf to all nodes
    • Manages the CRUSH map and OSD deployment
• 3× gateway nodes (xrootd, globus-gridftp-server, ceph-radosgw)
• 63× storage nodes (ceph-osd)
  – 36× 6.0 TB HDD for data per node, total count = 2268
  – Capacity = 12.1 PiB (13.6 PB) raw, 8.8 PiB (9.9 PB) usable (arithmetic sketched below)
• No SRM
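The headline capacity figures follow directly from the drive counts above. A minimal worked check of the arithmetic, assuming the 8+3 erasure-coding layout described on the "Services Offered" backup slide:

```python
# Sanity check of the Echo capacity figures quoted above.
nodes, disks_per_node, tb_per_disk = 63, 36, 6.0

drives = nodes * disks_per_node                 # 2268 data drives
raw_tb = drives * tb_per_disk                   # decimal terabytes
raw_pib = raw_tb * 1e12 / 2**50                 # binary pebibytes

k, m = 8, 3                                     # 8+3 erasure coding: 8 data + 3 coding chunks
usable_tb = raw_tb * k / (k + m)
usable_pib = raw_pib * k / (k + m)

print(f"drives: {drives}")
print(f"raw:    {raw_tb / 1000:.1f} PB  ({raw_pib:.1f} PiB)")
print(f"usable: {usable_tb / 1000:.1f} PB  ({usable_pib:.1f} PiB)")
# -> 2268 drives, 13.6 PB (12.1 PiB) raw, 9.9 PB (8.8 PiB) usable
```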

Page 6: XrootD Plugin

• XrootD plugin for Ceph written by Sébastien Ponce, CERN
  – https://indico.cern.ch/event/330212/contributions/1718786/
• Uses libradosstriper, contributed by Sébastien to the Ceph project
  – Available since the Giant release
• Plugin itself contributed to the XrootD project
• First demonstration by our George Vasilakakos with the Giant release in the first half of 2015, but not packaged
• The xrootd-ceph RPM is not distributed; the version of Ceph in EPEL (Firefly release) predates libradosstriper
• Needed to build xrootd RPMs against Ceph development packages to get the xrootd-ceph plugin

Page 7: Starting the XrootD Daemon

• Following the upgrade to the Ceph Hammer release, the xrootd daemon would not start, reporting an LTTng error
• Caused by the registration of tracepoint probes with the same name in both the rados and libradosstriper code by their respective contributors
• Patch contributed (May 2015), merged into master after 0.94.5 was released (October 2015)
• Needed to build Ceph RPMs to get the patches and meet testing deadlines
• Demonstrated working xrdcp in and out of the cluster (November 2015) with X.509 authentication
• Patches incorporated into the 0.94.6 release (February 2016)

Page 8: Data Integrity – XrootD (1)

• Problems with files copied out with multiple streams
• Incorrect checksum of a large (4 GiB, DVD ISO) file returned in 81% of 300 attempts, but always the same size as the original
• Created a test file containing 1069 × 4 MiB blocks
• Test file: 16-byte lines of text in 1069 × 4 MiB blocks; can identify the line number within a block and the block number (see the sketch below)
• Identified the following:
  – 1 to 3 blocks contained incorrect data, appearing at random
  – Incorrect blocks contained 2 MiB duplicated from another block
  – Overlapped exactly with either the 1st or 2nd half of the block
  – Typically from the next block, but could be from up to 11 blocks behind or 12 blocks in front of the bad block
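A minimal sketch of how a test file of this shape could be generated and checked after a copy. The exact line format is an assumption, not the one used at RAL; the idea is that every 16-byte line names its own block, so a displaced 2 MiB half-block immediately reveals where it came from:

```python
# Sketch: self-describing test file of 1069 x 4 MiB blocks built from
# 16-byte lines, plus a checker that reports blocks containing foreign data.
BLOCK = 4 * 1024 * 1024          # 4 MiB
LINE = 16                        # bytes per line, including the newline
LINES_PER_BLOCK = BLOCK // LINE  # 262144
BLOCKS = 1069

def write_test_file(path):
    with open(path, "wb") as f:
        for b in range(BLOCKS):
            # 15 characters + '\n' = 16 bytes, e.g. "B0042 L0001234\n"
            block = b"".join(f"B{b:04d} L{l:07d}\n".encode()
                             for l in range(LINES_PER_BLOCK))
            f.write(block)

def check_copy(path):
    """Report blocks whose first or second 2 MiB half carries another block's data."""
    bad = []
    with open(path, "rb") as f:
        for b in range(BLOCKS):
            data = f.read(BLOCK)
            for half in (0, 1):
                line = data[half * BLOCK // 2 : half * BLOCK // 2 + LINE]
                found = int(line[1:5])   # block number stamped into the line
                if found != b:
                    bad.append((b, half, found))
    return bad  # e.g. [(17, 1, 18)] -> 2nd half of block 17 duplicates block 18
```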

Page 9: Data Integrity – XrootD (2)

• Communicated the findings to Sébastien
• He was quickly able to reproduce the problem
• He determined that a race condition occurred due to the use of non-atomic operations in a threaded context
• Patch committed to GitHub, XrootD RPMs rebuilt at RAL, problem solved
• Resolved in one week after initial contact with Sébastien
• Great team effort and collaboration with partners
• Happy sysadmins and relieved project managers at RAL

Page 10: GridFTP Plugin

• Started by Sébastien Ponce, continued by Ian Johnson at RAL
  – https://github.com/ijjorama/gridFTPCephPlugin
• Also uses libradosstriper
• Uses XrootD's AuthDB for authorisation
• Uses xrdacctest to return the authorisation decision
• No problems with single-stream functionality…
• …but out-of-order data delivery with parallel writes in MODE E, as used by FTS transfers, is problematic
• Erasure-coded pools in Ceph don't support partial writes, hence they don't support non-contiguous writing
• We now have a fix that is undergoing testing

Page 11: Plugin Summary

• XrootD and GridFTP available and maturing
• Plugins talk directly to Ceph
• Plugins are interoperable
• Removed the requirement for a leading '/' in object names from both
  – Comes from a history of use with POSIX file systems
  – This would require VO pool names to have a leading '/'
  – This character is supported, but we are concerned this may change
  – The GridFTP plugin works around this
  – XrootD added support for object IDs
• We have working authentication and authorisation with both

Page 12: Ceph Version Upgrades

• Typical sequence*: monitors, storage nodes, metadata servers (if present), then gateways (see the sketch below)
• Restart the daemons on one node type, then upgrade the next type
• Performed whilst the service is online
• Very easy from one version within a release to the next
• Just change the RPM version number
• Can skip some intermediate versions
• Takes c. 30 minutes and can be performed by one sysadmin
• Much simpler than CASTOR upgrades
  – * Change in Hammer OSDMap format with 0.94.7 requires a different order; our thanks to Dan van der Ster
  – https://www.mail-archive.com/[email protected]/msg32830.html

Page 13: Ceph Release Upgrades

• Change the RPM version and the repository
• Firefly to Giant: easy
• Giant to Hammer: easy, but introduced the LTTng problem
• Hammer to Jewel: more involved:
  – Jewel requires SL7
  – Daemons are now separate, not one daemon with many roles
  – Daemons are now managed by systemd…
  – …and run as the 'ceph' user, not the 'root' user
  – Needed to define ulimits for the 'ceph' user (nproc, nofile)
  – Change ownership of /var/lib/ceph (both steps sketched below)
• Upgraded to SL7 with Hammer first, then moved to Jewel
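A sketch of the per-host preparation implied by the last two sub-points. The limit values and file name are assumptions for illustration, not the exact Quattor-managed configuration:

```python
# Hypothetical sketch: raise ulimits for the 'ceph' user and hand
# /var/lib/ceph over to it before starting the Jewel daemons.
import subprocess
from pathlib import Path

LIMITS = """\
ceph  soft  nproc   1048576
ceph  hard  nproc   1048576
ceph  soft  nofile  1048576
ceph  hard  nofile  1048576
"""

# pam_limits picks up files in limits.d; the values above are placeholders.
Path("/etc/security/limits.d/91-ceph.conf").write_text(LIMITS)

# Recursive chown of the monitor/OSD data directories; this can take a
# while on a full storage node.
subprocess.run(["chown", "-R", "ceph:ceph", "/var/lib/ceph"], check=True)
```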

Page 14: New Operating System – Installations

• The upgrade to Jewel required us to support a new OS major version before we were really ready for it
• Work required to adapt the deployment configuration:
  – Previously reliant on DHCP for network config during PXE and kickstart; Quattor then updates to the proper, static config
  – New (again) NIC naming convention with SL7; NIC names are now controlled throughout the entire installation
  – However, nodes in subnets other than the deploy server's also need routing information in order to install, currently not configured in pxelinux.cfg or kickstart
• SL7 installations still need supervising and often need help; SL6 installations just work

Page 15: New Operating System – Site Config

• Work required to adapt the site-wide configuration for SL7 and for Ceph
  – /etc/rsyslog.conf
    • We send core system logs and certain application logs to central loggers
    • Fewer modules are loaded by default with SL7: required modules (e.g. imuxsock) must be declared in the file, the module for the systemd journal (imjournal) must be added, as must other directives
  – /etc/sudoers
    • Long-standing requirement to support Ceph as a second use case of sudo
    • Used for nagios tests executed via NRPE; that sudoers config ships in an RPM
    • The deploy host needs sudoers config on all nodes for ceph-deploy
    • ncm-ceph provides this via ncm-sudo
    • Conflict between the RPM and ncm-sudo

Page 16: New Operating System – Alerting

• Using nagios RPMs from EPEL
• Older than the latest version, but we don't need new features yet
• NSCA
  – Version difference between EPEL 6 (2.7) and EPEL 7 (2.9)
  – Versions are incompatible: the packet size expected by the server differs
  – Packaged nsca-client 2.7 for SL7 hosts
• NRPE
  – The NRPE daemon runs as the 'nrpe' user, not the 'nagios' user
  – Reconfigured NRPE to run as the 'nagios' user via a systemd unit file (see the sketch below)
  – Add a unit file under /etc/systemd/system/nrpe.service.d/
  – No need to modify the existing sudoers config for SL7
• Would have preferred to upgrade the site infrastructure to SL7 first
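A sketch of the systemd drop-in mentioned above. The drop-in file name is arbitrary and the exact contents are an assumption; the slide only states that NRPE is switched to the 'nagios' user via a unit file under nrpe.service.d:

```python
# Hypothetical sketch: systemd drop-in so the EPEL nrpe daemon runs as the
# 'nagios' user, leaving the packaged unit file untouched.
import subprocess
from pathlib import Path

dropin_dir = Path("/etc/systemd/system/nrpe.service.d")
dropin_dir.mkdir(parents=True, exist_ok=True)

# Override only the User= setting of the packaged nrpe.service.
(dropin_dir / "user.conf").write_text(
    "[Service]\n"
    "User=nagios\n"
)

subprocess.run(["systemctl", "daemon-reload"], check=True)
subprocess.run(["systemctl", "restart", "nrpe"], check=True)
```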

Page 17: Upgrade to SL7 and Jewel

• After the upgrade to SL7, OSD daemons start to crash
• First OSD crashes after c. 2 days of normal operation
• Crashes continue one at a time, but hundreds can build up
• Have to periodically start OSDs, but can restore full operation with patience
• The problem became much worse during the upgrade to Jewel:
  – Could not keep all OSDs running despite best efforts
  – Over half the OSDs stopped running; client I/O halted
  – Ready to dismantle the existing cluster (FY'14 storage) and build a new cluster (FY'15 storage) anyway
  – Most interested in gaining upgrade experience
  – Decided to declare the service dead to the test community and build a new cluster
• But what was the cause?

Page 18: The XFS Bug

• XFS reports "possible memory allocation deadlock in kmem_alloc" on storage nodes
• Causes XFS to hang, which causes the OSD to crash
• Bug already reported:
  – http://tracker.ceph.com/issues/6301
  – http://oss.sgi.com/pipermail/xfs/2013-December/032346.html
• Bug present in the kernels shipped with SL5 and SL6, fixed in the kernels shipped with SL7
• File systems on data disks will need to be recreated on SL7
• Will affect all storage systems
• Our thanks to Dan van der Ster at CERN for communicating this to us

Page 19: Deployment of FY'15 Storage

• Supermicro: 36 HDDs on an HBA with one internal OS HDD
• Specified SL6 for the ITT and used it for pre-deployment testing
• Pre-deployment 'hammer' testing proceeded smoothly
• Deployment into Ceph with SL7 was problematic:
  – The OS disk is now the last element in the list, /dev/sdak
  – Config management assumes the OS disk is /dev/sda
  – The deploy host profile contains the disk layout of all storage nodes, read from the storage node profiles
  – Changed config management to programmatically identify the boot disk for all nodes and exclude it from the list of data disks (see the sketch below)
  – A simple change, but it can change the profile of every node in the data centre
  – Test, test and test again to get it right
• Deployment delayed, but proceeded as expected once able
• Service now very stable
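Our change actually lives in the Quattor/Aquilon profiles, but the idea can be sketched as below: ask the OS which device holds the root filesystem instead of assuming /dev/sda. The commands are standard util-linux tools; the surrounding logic is illustrative only:

```python
# Hypothetical sketch: identify the boot disk and build the data-disk list
# by exclusion, so /dev/sdak (or anything else) is handled automatically.
import subprocess

def run(*cmd):
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout.strip()

def boot_disk():
    root_src = run("findmnt", "-n", "-o", "SOURCE", "/")       # e.g. /dev/sdak2
    parents = run("lsblk", "-no", "PKNAME", root_src).split()  # parent device of that partition
    return parents[0] if parents else root_src.split("/")[-1]  # e.g. 'sdak'

def data_disks():
    boot = boot_disk()
    disks = []
    for line in run("lsblk", "-dno", "NAME,TYPE").splitlines():
        name, dev_type = line.split()
        if dev_type == "disk" and name != boot:
            disks.append(f"/dev/{name}")
    return disks

if __name__ == "__main__":
    print(data_disks())
```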

Page 20: Reflections

• We can build services users want based on Ceph storage
• The challenge was always going to be technical in nature, but…
• …we did not appreciate the development effort required early on in the project
• Ceph administration is a learning curve
• Online documentation is generally good, but the "really dangerous" features proved useful, so it could improve further
• The outlook is promising

Page 21: Acknowledgements

• RAL team:
  – George Vasilakakos, Ian Johnson, Alison Packer, James Adams, Alastair Dewhurst
  – Previous contributions: George Ryall, Tom Byrne, Shaun de Witt
• Collaborators:
  – Dan van der Ster and Sébastien Ponce at CERN
  – Andrew Hanushevsky at SLAC
  – Brian Bockelman at University of Nebraska-Lincoln
• And everyone who made this happen:

Thank you

Page 22: Backup Slides

Backup Slides

Page 23: Echo Cluster

• Brief: provide a replacement for CASTOR disk-only storage
• Motivation:
  – Reaching the limitations of CASTOR performance
  – CASTOR requires sysadmins and DBAs
  – Several components to the CASTOR system
  – Small community: cannot recruit experts, and it is time-consuming to train them
  – Reduced support: CERN have moved to EOS for disk storage
• Requirements:
  – Performance must continue to scale with workload
  – Only require one specialism (sysadmins)
  – Must support established data transfer protocols used by the WLCG
• Benefits:
  – Larger community: can hire experts, reduced training time; Ceph specialises in disk storage
  – Less effort to operate and maintain

Page 24: Services Offered

• Ceph 10.2.2 (Jewel, an LTS release) running on Scientific Linux 7.x
• One pool per VO, created with 8+3 erasure coding (see the sketch below)
• Custom gateways: GridFTP and XrootD
  – Plugin architecture, talk natively to Ceph
  – Both built on libradosstriper
  – Both endpoints are interoperable for puts/gets
• Authentication via X.509 certificate, currently a grid-mapfile
• Authorisation via XrootD's AuthDB for both GridFTP and XrootD
  – Naming scheme: <pool_name>:<space_token>/<file_name>
  – Grants rw access to production users and ro access to ordinary users for data space, and rw access to all users for scratch space
• Also offering direct S3/Swift access to trusted users
• Testing Dynafed
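For context, a sketch of how one 8+3 erasure-coded pool per VO can be created with the Jewel-era CLI. The profile name, VO list and placement-group counts below are placeholders, not the values used on Echo:

```python
# Hypothetical sketch: one 8+3 erasure-coded pool per VO (Jewel-era commands).
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# 8 data chunks + 3 coding chunks: survives 3 failures, usable capacity is 8/11 of raw.
ceph("osd", "erasure-code-profile", "set", "ec-8-3", "k=8", "m=3")

for vo, pgs in [("atlas", 2048), ("cms", 2048), ("lhcb", 1024)]:  # placeholder VOs / pg counts
    ceph("osd", "pool", "create", vo, str(pgs), str(pgs), "erasure", "ec-8-3")
```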

Page 25: Networking Considerations

• Experience with CASTOR: the network is much like a utility to which we connect our servers, like power
• With Ceph, you need to consider networking as part of your Ceph infrastructure
• Will require dual networks on storage nodes in large/busy clusters:
  – Public or client network
  – Cluster or private network
• Networking diagram given in the site report by Martin Bly:
  – https://indico.cern.ch/event/531810/contributions/2302103/

Page 26: The SIRIUS Incident (1)

• Occasionally all OSDs crash on a storage node that never was "right in the head"
• No previous impact on the service, but a nuisance to the service manager
• Repairs undertaken and the node returned to service, but still not fixed
• OSDs shut down, but the node not removed from the cluster
• Rolling reboots for kernel updates accidentally brought the node back into the cluster
• OSDs crash again, leaving 4 PGs in a stale+incomplete state
• These PGs are unable to serve I/O
• Affects the usability of every running VM; VMs need to be paused
• Tried to fix the problem, but the Ceph admin procedures do not help
• The problematic storage node was removed from the cluster and new VMs created in a new pool
• However, many people's work is potentially lost

Page 27: The SIRIUS Incident (2)

• Despite best efforts, no resolution
• Other deadlines divert attention
• Discussed the incident with Dan at CERN and set him up with root access
• Dan is aware of an undocumented procedure that tells the OSD to ignore history and use the best information available
• He performs the change on the OSDs
• PGs spend some time backfilling but eventually go active+clean
• All old VMs can now be recovered; however, six weeks have passed
• Many are not required, but some users were glad to recover work they feared was lost
• The reputation of the cloud service is somewhat diminished, but we did demonstrate that we can recover from a disaster
• Personally glad we took the hard way out, as we had considered deleting the pool

Page 28: Hardware

• 3× monitor nodes:
  – Dell R420, CPU: 2× Intel Xeon E5-2430 v2, RAM: 64 GiB
• 3× gateway nodes:
  – Dell R430, CPU: 2× Intel Xeon E5-2650 v3, RAM: 128 GiB
• 63× storage nodes:
  – XMA (Supermicro X10DRi), CPU: as gateways, RAM: 128 GiB
  – OS disk: 1× 233 GiB 2.5" HDD; data disks: 36× 5.46 TiB HDD (WD6001F9YZ) via a SAS HBA
• Total raw storage = 12.1 PiB (13.6 PB)
• Usable storage = 8.8 PiB (9.9 PB)