Post on 26-Jan-2017
Founding a Hadoop Lab
EVERYTHING YOU ALWAYS WANTED TO KNOW, BUT WERE AFRAID TO ASK, ABOUT FINDING SUCCESS WITH HADOOP IN YOUR ORGANIZATION
Andre Langevin
langevin@utilis.ca

A Short Introduction to Your Speaker
My Adventures in Hadoop
◦ Led Hadoop adoption at three Canadian banks
◦ Established a successful Hadoop COE
◦ Advisory roles on Hadoop in finance
My Career in Finance
◦ Four banks, one stock exchange, one pension fund
◦ Capital markets, retail banking, enterprise risk roles
◦ Founder of two IT departments
◦ Technology leader in Risk Systems for 15 years:
  ◦ Architect, Enterprise Risk Systems
  ◦ Architect, Front Office Risk Systems
  ◦ Program Manager, Portfolio Management Systems
  ◦ Head of Risk Systems
  ◦ Head of Hadoop COE

Agenda
What role will your Hadoop Lab play?
◦ Defining objectives, building a team and forming partnerships
◦ Foundational work to set a path to success
What is a reasonable budget?
◦ Calculating your “room” based on industry benchmarks
◦ Capacity planning, charge-out, and the central capital account
Real-life Lessons Learned
◦ Setting up infrastructure to take advantage of Hadoop’s unique properties
◦ Creating a practice that fits your users’ work styles
Projects that Succeed
◦ Ideas for a quick win to keep everyone motivated
◦ Medium risk projects aligned to current business problems

What role will your Hadoop Lab play?
Will your organization’s Hadoop Lab be a control function, or a thought leader?
Control functions
◦ Operational controls, compliance and auditing
◦ Budgeting
◦ Architecture gating
◦ Data governance
Thought leadership
◦ Design patterns and solution architecture
◦ Demonstration projects and proofs-of-concept
◦ Filling up the talent pool using training, workshops and user groups
◦ Educating on best practices and success stories to motivate adoption

Foundational Work
Invest in user-friendly operational management
◦ Design a simple multi-tenancy plan based on group membership
◦ Include share of execution queues, directory structures and cascading permissions
◦ Set up self-serve user on-boarding through your organization’s Help Desk
◦ Implement single sign-on for Kerberos-secured clusters
Manage expectations by monitoring performance
◦ Set service level objectives for both interactive and application uses
◦ Use “showback” reporting to monitor performance against objectives
Implement access control governance as a basic service
◦ Generate access control matrix audits centrally for all grid users
◦ Reporting from Ranger’s database works well and is easy to build
◦ Set policy and prepare reports for periodic attestation/user account reviews

Maximizing Exposure to Change
Hadoop is an exceptionally fast moving technology, and so needs a different approach
◦ Maximize your ability to deploy the changes in the Hadoop platform
  ◦ Invest in continuous integration and automated regression testing for your development teams
  ◦ Establish a better-than-quarterly release cycle
  ◦ Publish a checklist of acceptable open source licenses (or a blacklist of prohibited ones)
◦ Encourage use of Hadoop as an application container
◦ Set up lab environments
Discourage practices that prevent your organization from keeping pace
◦ Avoid encapsulating Hadoop with frameworks or wrapping Hadoop inside applications
◦ Avoid proprietary add-ons: they don’t get as much collaboration in the open source community
◦ Prohibit equipment “carve outs” from your shared grid
  ◦ Include the cost of additional equipment in the business case, co-locate, and charge out accordingly

Building a Team
Data Engineers are the key to the successful adoption of a data lake
◦ Data engineers are a hybrid of intermediate developer and junior data scientist
◦ Good data engineering accelerates data science, and the ability to deploy data science to production
Other roles to consider
◦ A few versatile senior developers to give you the ability to execute POCs
◦ Data Librarian to manage the metadata catalogue and documentation
◦ Data Steward to manage the data governance process
Keep a few consultants on speed dial
◦ Hadoop security experts, preferably from an audit-capable firm
◦ Compliance and fair usage experts, particularly for external data from the web and social media
Fund the Hadoop and Linux administrators, but leave them in the infrastructure team
◦ They need the administrative access that these teams are allowed

Your New Best Friends
Give all of your stakeholders a chance to participate, by forming a working group
◦ Exposure to business stakeholders is particularly valuable for technology teams
Enlist the Capital Markets infrastructure team to build and manage the Hadoop grid
◦ It is worth solving the accounting problems to get their expertise
Co-opt your existing data hub’s team to operate your new Data Lake’s processes
◦ BCBS-239 projects have provided an excellent opportunity to do this
Adopting a secondary SQL on Hadoop solution helps to transfer skills as well as code
◦ IBM DB2 is available for Hadoop: a great way to move a bank’s data warehouse over to the Lab
◦ Other ANSI-compliant solutions include HAWQ, Vertica, Polybase*

Understanding the Customers
Before setting a budget, decide who you’re going to charge for your Hadoop Lab
◦ Data producers will see Hadoop as a cost-reduction opportunity
  ◦ Most front-end systems have dozens of outbound feeds that they have to support and maintain. Offer them the chance to drop off a single comprehensive feed to Hadoop so that consumers can build and manage their own outbound feeds
  ◦ Consuming systems also have support teams managing inbound feeds, so they won’t see a significant change in support costs
◦ Data consumers will see Hadoop as improving their capabilities
  ◦ Traditional data supply chain is very long: source system feeds an EDW, which feeds a data mart accessed by data scientists
  ◦ Asking for “one more field” requires source to send it, EDW to model and document it, data mart to provision it, and then finally a data scientist gets to consume it
  ◦ Giving data scientists access to the raw data makes them more efficient, even though less effort goes into providing the data!
Align the funding model to the benefits realized by the participants:
◦ One-time costs to on-board new data should come from the producer of the data
◦ On-going operating costs for the Hadoop grid should be shared by the consumers of grid services

Setting a Budget for a Hadoop Lab
Annual cost of Hadoop is widely quoted as US$1,000/TB
◦ This compares favorably to US$5K for a SAN, and US$12K for a traditional database
◦ Cost based on “balanced” reference configurations: “compute” is more, “storage” is less
Use this well-known industry benchmark to set your budget
◦ Fully loaded costs for a bank-sized Hadoop grid in a bank data centre are around US$550/TB per year
  ◦ Capital charges for infrastructure costs, including servers and dedicated network switching, are amortized over three years
  ◦ Premises costs for data centre include bare racks, power and network backbone
  ◦ On-going support subscriptions for operating systems and Hadoop, and next-day hardware replacement included
◦ This creates around US$450/TB per year of budget room for your Hadoop Lab to claim
  ◦ A typical bank-sized Hadoop grid is 2-4 PB, which yields a Lab budget of US$1MM-$2MM per year
  ◦ This budget funds a staff of 10-20 based on typical budgeting numbers of US$100K/FTE per year

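The budget arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the figures quoted on the slide; the function name `lab_budget` and the choice of 1 PB = 1,000 TB are assumptions made for illustration.

```python
# Figures quoted on the slide; names are illustrative, not from the original deck.
BENCHMARK_COST_PER_TB = 1_000   # US$/TB/year, widely quoted industry benchmark
LOADED_COST_PER_TB = 550        # US$/TB/year, fully loaded cost in a bank data centre
COST_PER_FTE = 100_000          # US$/FTE/year, typical budgeting number


def lab_budget(grid_size_pb):
    """Estimate the annual budget room and headcount for a grid of the given size."""
    grid_size_tb = grid_size_pb * 1_000  # assumption: 1 PB counted as 1,000 TB
    room_per_tb = BENCHMARK_COST_PER_TB - LOADED_COST_PER_TB
    annual_budget = room_per_tb * grid_size_tb
    return {
        "room_per_tb": room_per_tb,
        "annual_budget": annual_budget,
        "fte": annual_budget / COST_PER_FTE,
    }


for size_pb in (2, 4):
    estimate = lab_budget(size_pb)
    print(f"{size_pb} PB grid: US${estimate['annual_budget']:,.0f} per year, "
          f"about {estimate['fte']:.0f} FTE")
```

For 2 PB and 4 PB this gives US$900,000 and US$1,800,000 per year (9 and 18 FTE), which the slide rounds to US$1MM-$2MM and a staff of 10-20.
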
Financing Shared Hadoop Grids
Establish a usage driven charge-out model for consumers of the service
◦ Charging based on a blend of CPU and storage consumption will balance compute and data uses
◦ Consider charging consumers by service quality if your service agreements permit
◦ Service quality can be designed into your multi-tenancy solution
Create a central capital account managed by the Hadoop Lab
◦ Pre-authorize incremental expansion of the data lake to stay within service objectives
◦ Amortization of capital account will smooth out charges to avoid penalizing early adopters

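One way to picture a blended charge-out is to split the grid's annual cost according to each tenant's share of CPU and of storage. The sketch below is an illustration under stated assumptions, not a prescribed model: the 50/50 weighting, the units (core-hours and TB), the tenant names and the function `allocate_chargeout` are all hypothetical.

```python
def allocate_chargeout(total_cost, usage, cpu_weight=0.5):
    """Split total_cost across tenants by a weighted blend of CPU and storage share.

    usage maps tenant name -> (cpu_core_hours, storage_tb).
    cpu_weight is the fraction of the cost driven by CPU; the rest follows storage.
    """
    if not 0.0 <= cpu_weight <= 1.0:
        raise ValueError("cpu_weight must be between 0 and 1")
    total_cpu = sum(cpu for cpu, _ in usage.values())
    total_storage = sum(storage for _, storage in usage.values())
    if total_cpu <= 0 or total_storage <= 0:
        raise ValueError("total CPU and storage usage must both be positive")

    charges = {}
    for tenant, (cpu, storage) in usage.items():
        blended_share = (cpu_weight * cpu / total_cpu
                         + (1.0 - cpu_weight) * storage / total_storage)
        charges[tenant] = total_cost * blended_share
    return charges


# Hypothetical tenants: one compute-heavy, one storage-heavy.
usage = {
    "risk": (600.0, 100.0),       # (core-hours, TB)
    "marketing": (400.0, 300.0),
}
for tenant, charge in allocate_chargeout(1_000_000, usage).items():
    print(f"{tenant}: US${charge:,.0f}")
```

Because the blended shares always sum to one, the full cost is recovered. Moving `cpu_weight` towards 1 shifts cost onto compute-heavy tenants, and moving it towards 0 shifts cost onto data-heavy ones.
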
Creative Project Financing
Management loves to approve “self-funding projects”
◦ Use the cost differential of storage on Hadoop to fund intra-year work
◦ Migrate historical content from operating databases to Hadoop to save on database “tier one” SAN costs
◦ Capture grid compute outputs to Hadoop instead of NAS devices
◦ Storing database back-ups on Hadoop can be cheaper than tapes
Establish an internal “venture capital” fund in your Hadoop Lab
◦ Budget “seed money” to spend with the application maintenance teams
◦ Most applications have “lights on” funding insufficient to support the POCs needed to explore Hadoop adoption
◦ Set aside funding to pay for cross-team charges for participation in a POC
◦ Use the POCs to support project proposals based on cost reduction
◦ Staffing the Hadoop Lab with a small team of versatile developers completes this capability

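The “self-funding” argument rests on simple arithmetic about the storage cost differential. The sketch below assumes the SAN and Hadoop figures quoted earlier in the deck (US$5K and US$1K) are both per TB per year, and that savings start accruing the month after a migration completes. The function names and the 50 TB example are hypothetical.

```python
SAN_COST_PER_TB = 5_000      # US$/TB/year, assumed unit for the deck's SAN figure
HADOOP_COST_PER_TB = 1_000   # US$/TB/year, the widely quoted Hadoop benchmark


def annual_storage_saving(tb_migrated,
                          from_cost_per_tb=SAN_COST_PER_TB,
                          to_cost_per_tb=HADOOP_COST_PER_TB):
    """Full-year saving from moving tb_migrated TB to the cheaper platform."""
    return tb_migrated * (from_cost_per_tb - to_cost_per_tb)


def in_year_saving(tb_migrated, migration_month):
    """Saving realised in the current year if migration completes at the end of
    migration_month (1 = January, 12 = December)."""
    if not 1 <= migration_month <= 12:
        raise ValueError("migration_month must be between 1 and 12")
    months_remaining = 12 - migration_month
    return annual_storage_saving(tb_migrated) * months_remaining / 12


# Hypothetical example: 50 TB of historical content moved off tier-one SAN by end of March.
print(f"Full-year saving: US${annual_storage_saving(50):,.0f}")
print(f"Saving this year: US${in_year_saving(50, 3):,.0f}")
```

In this example the migration frees about US$150,000 within the year, which is the kind of figure that can be put against intra-year POC work.
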
Real-Life Lessons Learned
“NOTHING IS LESS PRODUCTIVE THAN TO MAKE MORE EFFICIENT WHAT SHOULD NOT BE DONE AT ALL” - PETER DRUCKER

Save Money by Letting it Break
It’s OK if a node breaks. In fact, it is better to have a dead Hadoop node than a wounded one
Educate your infrastructure team to prevent them from over-engineering your Hadoop grids
◦ HDFS implements a RAID strategy in software: use local disks instead of SAN for data nodes
◦ YARN is clever about parallelizing work: don’t use high-speed drives when cheap ones will do
◦ Don’t pay for “critical care” hardware support when next-day will be fine
Appliances and virtualization break the economics of Hadoop
◦ Equipment failure in an appliance is all-or-nothing
  ◦ Centralizing the Hadoop grid into one appliance increases the need for expensive fault tolerance
  ◦ Unit prices increase as a result: annual costs on appliances barely stay under the $1K/TB benchmark
◦ Your virtualization farm duplicates all of the fault-tolerance in Hadoop, and slows Hadoop down
  ◦ Vendor benchmarks show that virtualization is now almost as performant as bare-metal Hadoop grids
  ◦ Virtual servers are smaller and so you end up with more node-count-driven Hadoop costs

Networks Really Matter
The quality of the network is more important than the quality of the machines
◦ MapReduce “brings compute to the data,” but Hadoop still generates lots of internal network traffic
◦ Data hub and ETL offload patterns will generate a lot of traffic into and out of the grid
◦ Legacy tools, most notably SAS, will try to pull large data sets out of Hadoop across the network
Invest in top-of-rack switching or converged infrastructure
◦ Most data centres have 1Gb backbones connecting higher speed sub-networks
◦ Bonded 40Gb uplinks within the Hadoop grid and across racks are well worth the added cost
Spend the money and time to co-locate the consuming systems within the Hadoop sub-network
◦ This will mean a “re-racking” exercise for some appliances and existing servers

Differing Appetites for Change
Everyone’s first idea is to have one great, shared, co-operative data lake, and it doesn't work!
◦ The more successful you are in on-boarding data producers, the greater the difficulty of updating the Data Lake’s Hadoop distribution: the incentive to “stand pat” grows
  ◦ Even worse if you’re using third-party tools for ingestion: it creates an external stakeholder which can block change!
◦ The more successful you are in on-boarding data consumers, the greater the demand to update the Data Lake’s Hadoop distribution: data scientists always want the most current version of everything
Separate the interactive users from the applications with a federated deployment model
◦ Put all of the applications onto a Hadoop grid which is updated very infrequently
  ◦ Static workloads also allow tight management of performance against service agreements
◦ Put all of the data scientists onto their own grid that updates with the Hadoop distribution
  ◦ Self-serve data provisioning to small grids in a cloud also works really well from the consumer’s view
◦ Make sure you have a great network so that moving data between the grids is painless

Hadoop is Not a Database
Projects that attempt to replace a database server with Hadoop usually fail
◦ Avoid transactional applications
◦ Do not replace the database tier in an N-tier application with Hadoop
  ◦ Think of Hadoop as a container instead, and re-architect the application to run inside Hadoop
◦ Do not use Hadoop to host highly normalized data warehouse models
  ◦ De-normalized data models are much more efficient on Hadoop
◦ Do not create abstraction layers using layered Hive views
The best design patterns for Hadoop are often misused
◦ “ETL Off-Load” often turns into Hadoop as an FTP drop zone
◦ “Bring Compute to Data” doesn’t mean using a data node to host an application server
◦ Map/Reduce should be run with MapReduce, not using Hive to call UDFs

Internal Data is More Difficult to Access
Think of your 360° view of a customer as being 180° of transactions and 180° of interactions
Data governance, compliance, and security will inhibit the use of the transactional data
◦ Internal data sources are also usually high-cost data sources to access
Interaction data, particularly web and social media, is surprisingly easy to access
◦ Social media data is actually considered “public,” and so is entirely ungoverned
  ◦ There is a wealth of open source social media ingestion and analysis tools available
◦ IVR systems are linked to customers and capture a significant amount of customer interaction
  ◦ Major IVR systems discard their operating data after 3-4 months rather than warehousing it
◦ Call Centre recordings are a wealth of internal sentiment data
  ◦ Open source speech-to-text and natural language processing tools are available in Python
◦ Web site clicks and usage can be analyzed for price optimization and used for push marketing
  ◦ Most web site usage is analyzed through vendors, but setting up an inbound feed is easy

Data Science is Unstructured Work
Data scientists don’t work the way IT expects them to
◦ Traditional data warehousing patterns are the data science anti-pattern
  ◦ Data scientists don’t know what their requirements are until they’ve done their work: their job is to experiment
  ◦ Data scientists hate prepared views because they don’t know what logic creates them
◦ Don’t waste (too much) time on central data quality: they’re just going to re-do it anyway
  ◦ “Correct” data is subjective by study, so there isn’t an answer to implement centrally
  ◦ Preparing a time series includes data quality suitable to data science, regardless of how good the starting data is
◦ Data scientists probably know the data better than the data modelers

Data Science Labs
Data scientists want to develop analytics using production data, which breaks lots of policies
Support the creation of a Data Science Lab environment
◦ Lead a “once and forever” platform security review that all Hadoop users can reference
◦ Implement data governance that facilitates “window shopping” for content, even when governance will initially prohibit using the content
Invest in advanced data masking
◦ Invest in advanced data masking to prepare production data for the data science lab
◦ Advanced data masking retains the statistical properties of the underlying data
Buy a self-serve data provisioning tool
◦ Data scientists love to “shop” for data and love to “engineer” data using query-by-example tools
◦ The good tools turn the “shopping trip” into deployable code that you can package for deployment or automation easily

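To make the data masking point concrete, here is a deliberately simple stand-in for what commercial masking tools do: shuffling a sensitive column across records. It keeps that column's distribution (mean, spread, every individual value) while breaking the link between a value and the customer it came from. This is an assumption-laden sketch, not the method of any particular product. It does not preserve correlations between columns and is not sufficient on its own as a privacy control. The records and field names are hypothetical.

```python
import random
import statistics


def shuffle_mask(records, column, seed=None):
    """Return copies of records with the values of `column` permuted across rows.

    The column's overall distribution is unchanged, but a value no longer
    sits on the record it originally belonged to.
    """
    rng = random.Random(seed)
    values = [record[column] for record in records]
    rng.shuffle(values)
    return [{**record, column: value} for record, value in zip(records, values)]


# Hypothetical production extract.
customers = [
    {"customer_id": 1, "balance": 1200.0},
    {"customer_id": 2, "balance": 56000.0},
    {"customer_id": 3, "balance": 310.0},
    {"customer_id": 4, "balance": 8900.0},
]

masked = shuffle_mask(customers, "balance", seed=42)

original_balances = [c["balance"] for c in customers]
masked_balances = [c["balance"] for c in masked]
print("mean preserved:",
      statistics.mean(original_balances) == statistics.mean(masked_balances))
print("same set of values:", sorted(original_balances) == sorted(masked_balances))
```
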
Quick Wins
Finding a quick win or two will keep your organization motivated to adopt Hadoop
Massively parallel back-testing of StreamBase algorithms
◦ StreamBase is a real-time workflow platform widely used in program trading
◦ MapReduce can encapsulate StreamBase in order to run hundreds of copies in parallel
Targeting ads on social media
◦ Both Twitter and Facebook have very good APIs that you can quickly use to build a feed
◦ Python-based tools can be paired with some basic data science to find “life events”
Trend Analysis on Risk Data
◦ Simulation outputs from CVA, VAR, CCR, LRM are often discarded after one day due to their size
◦ Archiving on HDFS permits trend analysis at the trade level for diagnostics and capital planning

Mid-Sized Projects
Many current focus areas in finance lend themselves to achievable Hadoop projects
Volcker Rule
◦ Volcker Rule metrics require an enormous amount of data, which is expensive to store
◦ Retention is required for five years of calendar days
◦ Computations can be implemented in SQL and will run well in Hive
Customer 360
◦ Hadoop is a natural platform to consolidate interaction records with transactional data
Daily Liquidity Management
◦ Running the calculations before pooling facilitates drill-down and analysis
◦ Tableau on Hadoop works very well for daily dashboards