Founding a Hadoop Data Science Lab

Founding a Hadoop Lab
EVERYTHING YOU ALWAYS WANTED TO KNOW, BUT WERE AFRAID TO ASK, ABOUT FINDING SUCCESS WITH HADOOP IN YOUR ORGANIZATION

Andre Langevin
langevin@utilis.ca

A Short Introduction to Your Speaker

My Adventures in Hadoop
◦ Led Hadoop adoption at three Canadian banks
◦ Established a successful Hadoop COE
◦ Advisory roles on Hadoop in finance

My Career in Finance
◦ Four banks, one stock exchange, one pension fund
◦ Capital markets, retail banking, enterprise risk roles
◦ Founder of two IT departments
◦ Technology leader in Risk Systems for 15 years:
  ◦ Architect, Enterprise Risk Systems
  ◦ Architect, Front Office Risk Systems
  ◦ Program Manager, Portfolio Management Systems
  ◦ Head of Risk Systems
  ◦ Head of Hadoop COE

Agenda

What role will your Hadoop Lab play?
◦ Defining objectives, building a team and forming partnerships
◦ Foundational work to set a path to success

What is a reasonable budget?
◦ Calculating your “room” based on industry benchmarks
◦ Capacity planning, charge-out, and the central capital account

Real-life Lessons Learned
◦ Setting up infrastructure to take advantage of Hadoop’s unique properties
◦ Creating a practice that fits your users’ work styles

Projects that Succeed
◦ Ideas for a quick win to keep everyone motivated
◦ Medium risk projects aligned to current business problems

What role will your Hadoop Lab play?
“YOU CAN’T SHRINK YOUR WAY TO GREATNESS” - TOM PETERS

What role will your Hadoop Lab play?

Will your organization’s Hadoop Lab be a control function, or a thought leader?

Control functions
◦ Operational controls, compliance and auditing
◦ Budgeting
◦ Architecture gating
◦ Data governance

Thought leadership
◦ Design patterns and solution architecture
◦ Demonstration projects and proofs-of-concept
◦ Filling up the talent pool using training, workshops and user groups
◦ Educating on best practices and success stories to motivate adoption

Foundational Work

Invest in user-friendly operational management
◦ Design a simple multi-tenancy plan based on group membership
  ◦ Include share of execution queues, directory structures and cascading permissions
◦ Set up self-serve user on-boarding through your organization’s Help Desk
◦ Implement single sign-on for Kerberos-secured clusters
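
One way to picture a group-based multi-tenancy plan is as a small table derived from each group's name: an execution queue with a capacity share, a directory root, and permissions that cascade to everything beneath it. The sketch below is illustrative only. The group names, share percentages, the `root.<group>` queue naming and the `/data/<group>` directory layout are all assumptions, not details from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenant:
    group: str        # directory-service group that defines the tenant
    queue_share: int  # percent of execution queue capacity (assumed numbers)


def tenancy_plan(tenants):
    """Derive queue, directory and permission settings from group membership."""
    total = sum(t.queue_share for t in tenants)
    if total != 100:
        raise ValueError(f"queue shares must sum to 100, got {total}")
    plan = []
    for t in tenants:
        plan.append({
            "group": t.group,
            "queue": f"root.{t.group}",       # hypothetical queue naming
            "queue_share_pct": t.queue_share,
            "directory": f"/data/{t.group}",  # hypothetical directory layout
            # Group members get full access, everyone else gets none, and new
            # sub-directories inherit the same rule (the "cascading" part).
            "permissions": {"group": "rwx", "other": "---", "inherit": True},
        })
    return plan


if __name__ == "__main__":
    tenants = [Tenant("risk", 50), Tenant("marketing", 30), Tenant("lab", 20)]
    for row in tenancy_plan(tenants):
        print(row)
```

Because everything is keyed on the group name, on-boarding a new user reduces to adding them to a group, which is what makes Help Desk self-service practical.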

Manage expectations by monitoring performance
◦ Set service level objectives for both interactive and application uses
◦ Use “showback” reporting to monitor performance against objectives

Implement access control governance as a basic service
◦ Generate access control matrix audits centrally for all grid users
  ◦ Reporting from Ranger’s database works well and is easy to build
◦ Set policy and prepare reports for periodic attestation/user account reviews
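
An access control matrix is a users-by-resources grid of granted permissions. The sketch below pivots a flat list of grants into that grid and prints one line per user and resource for an attestation review. The rows are made-up sample data: in practice they would come from a query against the policy store (the talk suggests Ranger's database), whose schema is neither shown nor assumed here.

```python
from collections import defaultdict


def access_matrix(grants):
    """Pivot (user, resource, permission) rows into {user: {resource: [perms]}}."""
    matrix = defaultdict(lambda: defaultdict(set))
    for user, resource, permission in grants:
        matrix[user][resource].add(permission)
    return {
        user: {res: sorted(perms) for res, perms in resources.items()}
        for user, resources in matrix.items()
    }


def format_report(matrix):
    """Render one tab-separated line per user/resource pair, sorted for review."""
    lines = []
    for user in sorted(matrix):
        for resource in sorted(matrix[user]):
            perms = ",".join(matrix[user][resource])
            lines.append(f"{user}\t{resource}\t{perms}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [  # hypothetical grants, not real policy data
        ("alice", "/data/risk", "read"),
        ("alice", "/data/risk", "write"),
        ("bob", "/data/lab", "read"),
    ]
    print(format_report(access_matrix(sample)))
```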

Maximizing Exposure to Change

Hadoop is an exceptionally fast-moving technology, and so needs a different approach
◦ Maximize your ability to deploy the changes in the Hadoop platform
◦ Invest in continuous integration and automated regression testing for your development teams
◦ Establish a better-than-quarterly release cycle
◦ Publish a checklist of acceptable open source licenses (or a blacklist of prohibited ones)
◦ Encourage use of Hadoop as an application container
◦ Set up lab environments

Discourage practices that prevent your organization from keeping pace
◦ Avoid encapsulating Hadoop with frameworks or wrapping Hadoop inside applications
◦ Avoid proprietary add-ons: they don’t get as much collaboration in the open source community
◦ Prohibit equipment “carve-outs” from your shared grid
  ◦ Include the cost of additional equipment in the business case, co-locate, and charge out accordingly

Building a Team

Data Engineers are the key to the successful adoption of a data lake
◦ Data engineers are a hybrid of intermediate developer and junior data scientist
◦ Good data engineering accelerates data science, and the ability to deploy data science to production

Other roles to consider
◦ A few versatile senior developers to give you the ability to execute POCs
◦ Data Librarian to manage the metadata catalogue and documentation
◦ Data Steward to manage the data governance process

Keep a few consultants on speed dial
◦ Hadoop security experts, preferably from an audit-capable firm
◦ Compliance and fair usage experts, particularly for external data from the web and social media

Fund the Hadoop and Linux administrators, but leave them in the infrastructure team
◦ They need the administrative access that these teams are allowed

Your New Best Friends

Give all of your stakeholders a chance to participate, by forming a working group
◦ Exposure to business stakeholders is particularly valuable for technology teams

Enlist the Capital Markets infrastructure team to build and manage the Hadoop grid
◦ It is worth solving the accounting problems to get their expertise

Co-opt your existing data hub’s team to operate your new Data Lake’s processes
◦ BCBS-239 projects have provided an excellent opportunity to do this

Adopting a secondary SQL on Hadoop solution helps to transfer skills as well as code
◦ IBM DB2 is available for Hadoop: a great way to move a bank’s data warehouse over to the Lab
◦ Other ANSI-compliant solutions include HAWQ, Vertica, Polybase*

What is a reasonable budget?
“PRICE IS WHAT YOU PAY. VALUE IS WHAT YOU GET.” - WARREN BUFFETT

Understanding the Customers

Before setting a budget, decide who you’re going to charge for your Hadoop Lab
◦ Data producers will see Hadoop as a cost-reduction opportunity
  ◦ Most front-end systems have dozens of outbound feeds that they have to support and maintain. Offer them the chance to drop off a single comprehensive feed to Hadoop so that consumers can build and manage their own outbound feeds
  ◦ Consuming systems also have support teams managing inbound feeds, so they won’t see a significant change in support costs
◦ Data consumers will see Hadoop as improving their capabilities
  ◦ The traditional data supply chain is very long: a source system feeds an EDW, which feeds a data mart accessed by data scientists
  ◦ Asking for “one more field” requires the source to send it, the EDW to model and document it, the data mart to provision it, and then finally a data scientist gets to consume it
  ◦ Giving data scientists access to the raw data makes them more efficient, even though less effort goes into providing the data!

Align the funding model to the benefits realized by the participants:
◦ One-time costs to on-board new data should come from the producer of the data
◦ On-going operating costs for the Hadoop grid should be shared by the consumers of grid services

Setting a Budget for a Hadoop Lab

Annual cost of Hadoop is widely quoted as US$1,000/TB
◦ This compares favorably to US$5K/TB for a SAN, and US$12K/TB for a traditional database
◦ Cost based on “balanced” reference configurations: “compute” is more, “storage” is less

Use this well-known industry benchmark to set your budget
◦ Fully loaded costs for a bank-sized Hadoop grid in a bank data centre are around US$550/TB per year
  ◦ Capital charges for infrastructure costs, including servers and dedicated network switching, are amortized over three years
  ◦ Premises costs for the data centre include bare racks, power and network backbone
  ◦ On-going support subscriptions for operating systems and Hadoop, and next-day hardware replacement, are included
◦ This creates around US$450/TB per year of budget room for your Hadoop Lab to claim
  ◦ A typical bank-sized Hadoop grid is 2-4 PB, which yields a Lab budget of US$1MM-$2MM per year
  ◦ This budget funds a staff of 10-20 based on typical budgeting numbers of US$100K/FTE per year
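
The budget arithmetic above can be checked in a few lines. This is a minimal sketch using only the figures quoted on the slide; the function name is made up for illustration, and 1 PB is taken as 1,000 TB. The unrounded results (US$0.9MM to US$1.8MM, or 9 to 18 FTEs) are what the slide rounds to US$1MM-$2MM and a staff of 10-20.

```python
def lab_budget(grid_tb, benchmark_per_tb=1_000, loaded_cost_per_tb=550,
               cost_per_fte=100_000):
    """Estimate annual Lab budget room from grid size (all amounts in US$)."""
    # "Room" is the gap between the quoted industry benchmark and the
    # fully loaded cost of running the grid in a bank data centre.
    room_per_tb = benchmark_per_tb - loaded_cost_per_tb
    annual_budget = grid_tb * room_per_tb
    return {
        "room_per_tb": room_per_tb,
        "annual_budget": annual_budget,
        "staff": annual_budget / cost_per_fte,
    }


if __name__ == "__main__":
    for petabytes in (2, 4):
        result = lab_budget(petabytes * 1_000)  # assumes 1 PB = 1,000 TB
        print(petabytes, "PB:", result)
```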

Financing Shared Hadoop Grids

Establish a usage-driven charge-out model for consumers of the service
◦ Charging based on a blend of CPU and storage consumption will balance compute and data uses
◦ Consider charging consumers by service quality if your service agreements permit
  ◦ Service quality can be designed into your multi-tenancy solution

Create a central capital account managed by the Hadoop Lab
◦ Pre-authorize incremental expansion of the data lake to stay within service objectives
◦ Amortization of capital account will smooth out charges to avoid penalizing early adopters
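
A blended charge-out can be as simple as weighting each consumer's share of CPU and share of storage. The sketch below is one possible formula, not the speaker's: the 50/50 weighting, the consumer names and the usage figures are all assumptions.

```python
def charge_out(total_cost, usage, cpu_weight=0.5):
    """Split total_cost across consumers by a blend of CPU and storage share.

    usage maps consumer name -> (cpu_hours, tb_stored).
    cpu_weight is the fraction of cost allocated by CPU; the rest follows storage.
    """
    if not 0.0 <= cpu_weight <= 1.0:
        raise ValueError("cpu_weight must be between 0 and 1")
    total_cpu = sum(cpu for cpu, _ in usage.values())
    total_tb = sum(tb for _, tb in usage.values())
    if total_cpu <= 0 or total_tb <= 0:
        raise ValueError("total CPU and storage usage must be positive")
    charges = {}
    for name, (cpu, tb) in usage.items():
        share = cpu_weight * cpu / total_cpu + (1 - cpu_weight) * tb / total_tb
        charges[name] = round(total_cost * share, 2)
    return charges


if __name__ == "__main__":
    usage = {  # hypothetical consumers: (CPU hours, TB stored)
        "risk": (300, 100),
        "marketing": (100, 300),
    }
    print(charge_out(1_000_000, usage))
```

With an even weighting, a compute-heavy consumer and a storage-heavy consumer of similar overall footprint pay similar amounts, which is the balancing effect the slide describes.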

Creative Project Financing

Management loves to approve “self-funding projects”
◦ Use the cost differential of storage on Hadoop to fund intra-year work
  ◦ Migrate historical content from operating databases to Hadoop to save on database “tier one” SAN costs
  ◦ Capture grid compute outputs to Hadoop instead of NAS devices
  ◦ Storing database back-ups on Hadoop can be cheaper than tapes
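
To size a self-funding project, multiply the volume migrated by the per-TB cost gap. The sketch below uses the US$5K (SAN) and US$1K (Hadoop) annual per-TB figures quoted earlier in the talk; the 50 TB volume, the six-month proration and the function names are illustrative assumptions.

```python
def annual_saving(tb_migrated, source_cost_per_tb=5_000, hadoop_cost_per_tb=1_000):
    """Yearly saving (US$) from moving data off a costlier tier onto Hadoop."""
    return tb_migrated * (source_cost_per_tb - hadoop_cost_per_tb)


def intra_year_funding(tb_migrated, months_remaining, **costs):
    """Saving available in the current budget year after a mid-year migration."""
    if not 0 <= months_remaining <= 12:
        raise ValueError("months_remaining must be between 0 and 12")
    return annual_saving(tb_migrated, **costs) * months_remaining / 12


if __name__ == "__main__":
    print(annual_saving(50))          # full-year saving for 50 TB
    print(intra_year_funding(50, 6))  # saving if migrated with 6 months left
```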

Establish an internal “venture capital” fund in your Hadoop Lab
◦ Budget “seed money” to spend with the application maintenance teams
  ◦ Most applications have “lights on” funding insufficient to support the POCs needed to explore Hadoop adoption
  ◦ Set aside funding to pay for cross-team charges for participation in a POC
  ◦ Use the POCs to support project proposals based on cost reduction
◦ Staffing the Hadoop Lab with a small team of versatile developers completes this capability

Real-Life Lessons Learned
“NOTHING IS LESS PRODUCTIVE THAN TO MAKE MORE EFFICIENT WHAT SHOULD NOT BE DONE AT ALL” - PETER DRUCKER

Save Money by Letting it Break

It’s OK if a node breaks. In fact, it is better to have a dead Hadoop node than a wounded one

Educate your infrastructure team to prevent them from over-engineering your Hadoop grids
◦ HDFS implements a RAID-like replication strategy in software: use local disks instead of SAN for data nodes
◦ YARN is clever about parallelizing work: don’t use high-speed drives when cheap ones will do
◦ Don’t pay for “critical care” hardware support when next-day will be fine
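
HDFS protects data by keeping several copies of each block on different nodes (three by default), which is why plain local disks are enough. The capacity cost of that choice, and the modest impact of losing a node, are easy to estimate. In this sketch the node count, disks per node, disk size and the 25% headroom reserved for temporary data are illustrative assumptions.

```python
def usable_tb(nodes, disks_per_node, tb_per_disk, replication=3, headroom=0.25):
    """Approximate usable HDFS capacity in TB for a grid of identical data nodes."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    raw_tb = nodes * disks_per_node * tb_per_disk
    # Every block is stored `replication` times, so divide raw space by it.
    return raw_tb * (1 - headroom) / replication


if __name__ == "__main__":
    full = usable_tb(nodes=20, disks_per_node=12, tb_per_disk=4)
    degraded = usable_tb(nodes=19, disks_per_node=12, tb_per_disk=4)
    print("usable TB with all nodes:", full)
    print("usable TB with one node down:", degraded)
```

Losing one node out of twenty removes about 5% of capacity while the data stays available from the remaining replicas, which is the economic argument for next-day rather than critical-care support.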

Appliances and virtualization break the economics of Hadoop
◦ Equipment failure in an appliance is all-or-nothing
  ◦ Centralizing the Hadoop grid into one appliance increases the need for expensive fault tolerance
  ◦ Unit prices increase as a result: annual costs on appliances barely stay under the $1K/TB benchmark
◦ Your virtualization farm duplicates all of the fault-tolerance in Hadoop, and slows Hadoop down
  ◦ Vendor benchmarks show that virtualization is now almost as performant as bare-metal Hadoop grids
  ◦ Virtual servers are smaller and so you end up with more node-count-driven Hadoop costs

Networks Really Matter

The quality of the network is more important than the quality of the machines
◦ MapReduce “brings compute to the data,” but Hadoop still generates lots of internal network traffic
◦ Data hub and ETL offload patterns will generate a lot of traffic into and out of the grid
◦ Legacy tools, most notably SAS, will try to pull large data sets out of Hadoop across the network

Invest in top-of-rack switching or converged infrastructure
◦ Most data centres have 1Gb backbones connecting higher speed sub-networks
◦ Bonded 40Gb uplinks within the Hadoop grid and across racks are well worth the added cost

Spend the money and time to co-locate the consuming systems within the Hadoop sub-network
◦ This will mean a “re-racking” exercise for some appliances and existing servers

Differing Appetites for Change

Everyone’s first idea is to have one great, shared, co-operative data lake, and it doesn't work!
◦ The more successful you are in on-boarding data producers, the greater the difficulty of updating the Data Lake’s Hadoop distribution: the incentive to “stand pat” grows
  ◦ Even worse if you’re using third-party tools for ingestion: it creates an external stakeholder which can block change!
◦ The more successful you are in on-boarding data consumers, the greater the demand to update the Data Lake’s Hadoop distribution: data scientists always want the most current, or even the next, version of everything

Separate the interactive users from the applications with a federated deployment model
◦ Put all of the applications onto a Hadoop grid which is updated very infrequently
  ◦ Static workloads also allow tight management of performance against service agreements
◦ Put all of the data scientists onto their own grid that updates with the Hadoop distribution
  ◦ Self-serve data provisioning to small grids in a cloud also works really well from the consumer’s view
◦ Make sure you have a great network so that moving data between the grids is painless

Hadoop is Not a Database

Projects that attempt to replace a database server with Hadoop usually fail
◦ Avoid transactional applications
◦ Do not replace the database tier in an N-tier application with Hadoop
  ◦ Think of Hadoop as a container instead, and re-architect the application to run inside Hadoop
◦ Do not use Hadoop to host highly normalized data warehouse models
  ◦ De-normalized data models are much more efficient on Hadoop
◦ Do not create abstraction layers using layered Hive views

The best design patterns for Hadoop are often misused
◦ “ETL Off-Load” often turns into Hadoop as an FTP drop zone
◦ “Bring Compute to Data” doesn’t mean using a data node to host an application server
◦ Map/Reduce should be run with MapReduce, not by using Hive to call UDFs

Internal Data is More Difficult to Access

Think of your 360° view of a customer as being 180° of transactions and 180° of interactions

Data governance, compliance, and security will inhibit the use of the transactional data
◦ Internal data sources are also usually high-cost data sources to access

Interaction data, particularly web and social media, is surprisingly easy to access
◦ Social media data is actually considered “public,” and so is entirely ungoverned
  ◦ There is a wealth of open source social media ingestion and analysis tools available
◦ IVR systems are linked to customers and capture a significant amount of customer interaction
  ◦ Major IVR systems discard their operating data after 3-4 months rather than warehousing it
◦ Call Centre recordings are a wealth of internal sentiment data
  ◦ Open source speech-to-text and natural language processing tools are available in Python
◦ Website clicks and usage can be analyzed for price optimization and used for push marketing
  ◦ Most website usage is analyzed through vendors, but setting up an inbound feed is easy

Data Science is Unstructured Work

Data scientists don’t work the way IT expects them to
◦ Traditional data warehousing patterns are the data science anti-pattern
  ◦ Data scientists don’t know what their requirements are until they’ve done their work: their job is to experiment
  ◦ Data scientists hate prepared views because they don’t know what logic creates them
◦ Don’t waste (too much) time on central data quality: they’re just going to re-do it anyway
  ◦ “Correct” data is subjective by study, so there isn’t an answer to implement centrally
  ◦ Preparing a time series includes data quality suitable to data science, regardless of how good the starting data is
◦ Data scientists probably know the data better than the data modelers

Data Science Labs

Data scientists want to develop analytics using production data, which breaks lots of policies

Support the creation of a Data Science Lab environment
◦ Lead a “once and forever” platform security review that all Hadoop users can reference
◦ Implement data governance that facilitates “window shopping” for content, even when governance will initially prohibit using the content

Invest in advanced data masking
◦ Use it to prepare production data for the data science lab
◦ Advanced data masking retains the statistical properties of the underlying data
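
As a toy illustration of masking that keeps statistical properties, the sketch below permutes one column's values across records. The column's distribution (mean, spread, histogram) is unchanged, but a value can no longer be tied to its original record. This is a deliberately simple stand-in, not what commercial masking tools do, and it does not preserve correlations between the shuffled column and other columns. The field names and values are made up.

```python
import random
import statistics


def shuffle_mask(records, column, seed=None):
    """Return copies of records with `column` values permuted across rows."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    # Build new dicts so the original production records are left untouched.
    return [{**r, column: v} for r, v in zip(records, values)]


if __name__ == "__main__":
    accounts = [  # hypothetical sample records
        {"customer": "C1", "balance": 1200},
        {"customer": "C2", "balance": 50},
        {"customer": "C3", "balance": 9800},
        {"customer": "C4", "balance": 430},
    ]
    masked = shuffle_mask(accounts, "balance", seed=7)
    print(statistics.mean(r["balance"] for r in accounts))
    print(statistics.mean(r["balance"] for r in masked))
```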

Buy a self-serve data provisioning tool
◦ Data scientists love to “shop” for data and love to “engineer” data using query-by-example tools
◦ The good tools turn the “shopping trip” into deployable code that you can package for deployment or automation easily

Projects that Succeed
“RISK COMES FROM NOT KNOWING WHAT YOU’RE DOING” - WARREN BUFFETT

Quick Wins

Finding a quick win or two will keep your organization motivated to adopt Hadoop

Massively parallel back-testing of StreamBase algorithms
◦ StreamBase is a real-time workflow platform widely used in program trading
◦ MapReduce can encapsulate StreamBase in order to run hundreds of copies in parallel

Targeting ads on social media
◦ Both Twitter and Facebook have very good APIs that you can quickly use to build a feed
◦ Python-based tools can be paired with some basic data science to find “life events”

Trend Analysis on Risk Data
◦ Simulation outputs from CVA, VAR, CCR, LRM are often discarded after one day due to their size
◦ Archiving on HDFS permits trend analysis at the trade level for diagnostics and capital planning

Mid-Sized Projects

Many current focus areas in finance lend themselves to achievable Hadoop projects

Volcker Rule
◦ Volcker Rule metrics require an enormous amount of data, which is expensive to store
◦ Retention is required for five years of calendar days
◦ Computations can be implemented in SQL and will run well in Hive

Customer 360
◦ Hadoop is a natural platform to consolidate interaction records with transactional data

Daily Liquidity Management
◦ Running the calculations before pooling facilitates drill-down and analysis
◦ Tableau on Hadoop works very well for daily dashboards

Thank You for Your Time
