The Workflow Abstraction

Strata 2013 talk, "The Workflow Abstraction" by Paco Nathan.

Transcript of The Workflow Abstraction

  • 1. The Workflow Abstraction
    Strata SC, 2013-02-28
    Paco Nathan, Concurrent, Inc., San Francisco, CA
    @pacoid
    Copyright @2013, Concurrent, Inc.
    Friday, 01 March 13

    Notes: Background: dual in quantitative and distributed systems. I've spent the past decade leading innovative Data teams responsible for many successful large-scale apps -

  • 2. The Workflow Abstraction
    [Diagram: workflow with labels Document Collection, Tokenize, Scrub token, Regex token, HashJoin Left, Stop Word List (RHS), GroupBy token, Count, Word Count]

    Agenda:
      1. Funnel
      2. Circa 2008
      3. Cascading
      4. Sample Code
      5. Workflows
      6. Abstraction
      7. Trendlines

    Notes: This talk is about the workflow abstraction:
    * the business process of structuring data
    * the practices of building robust apps at scale
    * the open source projects for Enterprise Data Workflows
    We'll consider some theory, examples, best practices, trendlines -- what are the drivers that brought us here, and where is this work heading? Most of all, make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.

  • 3. Marketing Funnel: overview
    In reference to Making Data Work.

    Almost every business uses a model similar to this, give or take a few steps. Customer leads go in at the top, those get refined through several stages, then results flow out the bottom.

    [Diagram: Customers enter a funnel with stages Campaigns, Awareness, Interest, Evaluation, Conversion, Referral, Repeat]

    Notes: Let's consider one of the most fundamental predictive models used in business: a marketing funnel. This is an exercise which I've had to run through at nearly every firm in recent years -- analytics for the marketing funnel.

  • 4. Marketing Funnel: clickstream
    Different funnel stages get represented in ecommerce by events captured in log files, as a class of machine data called clickstream:
    * ad impressions
    * URL clicks
    * landing page views
    * new user registrations
    * session cookies
    * online purchases
    * social network activity

    [Diagram: funnel stages with associated events: Impression (Awareness), Click (Interest), Sign Up (Evaluation), Purchase (Conversion), "Like" etc. (Referral)]

    Notes: Online advertising involves what we call clickstream data, lots of events in log files -- i.e., lots of unstructured data.
  • 5. Marketing Funnel: metrics
    A variety of clickstream metrics can be used as performance indicators at different stages of the funnel:
    * CPM: cost per thousand
    * CTR: click-through rate
    * CPA: cost per action
    * etc.

    [Diagram: funnel stages with events and metrics: Impression / Awareness / CPM; Click / Interest / CTR; Sign Up / Evaluation / behaviors; Purchase / Conversion / CPA; "Like" / Referral / NPS, social graph, etc.; Repeat / loyalty, win back, etc.]

    Notes: The many different highly-nuanced metrics which apply are mind-boggling :)

  • 6. Marketing Funnel: example calculations

    metric   cost     events    formula                   rate
    CPM      $4,000   10^6      $4,000 / (10^6 / 10^3)    $4.00
    CTR      -        3×10^3    3×10^3 / 10^6             0.3%
    CPA      -        20        $4,000 / 20               $200

    Notes: Here are examples of the kinds of calculations performed...

  • 7. Marketing Funnel: predictive model
    Given these metrics, we can go further to estimate cost per paying user (CPP), customer lifetime value (LTV), etc.

    Then we can build a predictive model for return on investment (ROI) per customer, summarizing the funnel performance:

        ROI = (LTV - CPP) / CPP

    As an example, after crunching lots of logs, suppose that

        CPP = $200
        LTV = $2000
        ROI = ($2000 - $200) / $200

    for a 9x multiple.

    Notes: For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers, which describes the efficiency of the marketing funnel at different stages.
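The example calculations and the ROI model from the two slides above can be checked with a few lines of Python. This is only a minimal sketch: the function and parameter names (`cpm`, `ctr`, `cpa`, `roi`, `impressions`, `clicks`, `actions`) are hypothetical, chosen for illustration, and reading the three "events" counts as impressions, clicks and actions is an assumption based on the metric definitions rather than something the slides state.

```python
# Funnel metrics sketch. All names here are illustrative assumptions;
# only the input figures and the formulas come from the slides.

def cpm(cost, impressions):
    """Cost per thousand impressions."""
    return cost / (impressions / 1_000)

def ctr(clicks, impressions):
    """Click-through rate, as a fraction of impressions."""
    return clicks / impressions

def cpa(cost, actions):
    """Cost per action."""
    return cost / actions

def roi(ltv, cpp):
    """Return on investment per customer: (LTV - CPP) / CPP."""
    return (ltv - cpp) / cpp

if __name__ == "__main__":
    cost = 4_000.0           # campaign cost in dollars
    impressions = 10**6      # events used for CPM and CTR (assumed impressions)
    clicks = 3 * 10**3       # events used for CTR (assumed clicks)
    actions = 20             # events used for CPA (assumed actions)

    print(f"CPM: ${cpm(cost, impressions):.2f}")   # CPM: $4.00
    print(f"CTR: {ctr(clicks, impressions):.2%}")  # CTR: 0.30%
    print(f"CPA: ${cpa(cost, actions):.2f}")       # CPA: $200.00
    print(f"ROI: {roi(2_000.0, 200.0):.1f}x")      # ROI: 9.0x
```

Keeping each metric as its own small function mirrors the table layout: one row, one formula. At real scale the inputs would come from aggregated log counts rather than constants.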
  • 8. Marketing Funnel: example architecture
    Let's consider an example architecture for calculating, reporting, and taking action on funnel metrics, based on large-scale clickstream data.

    [Diagram: architecture with labels Customers, funnel stages (Campaigns, Awareness, Interest, Evaluation, Conversion, Referral, Repeat), Web App, logs, Cache, Logs, Support, source tap, trap tap, sink tap, Data Workflow, Modeling, PMML, Analytics Cubes, customer profile DBs, Customer Prefs, Hadoop Cluster, Reporting]

    Notes: Here's an example architecture of using clickstream metrics within an online business.

  • 9. Marketing Funnel: complexities
    Multiple ad partners, different contract terms, reporting different metrics at different times, click scrubs, etc.

    Campaigns target specific geo/demo, test alternate landing pages, probably need to segment customer base.

    These issues make clickstream data large and yet sparse.

    Other issues:
    * seasonal variation
    * fluctuating currency exchange rates
    * distortions due to credit card fraud
    * diminishing returns
    * forecasting requirements

    [Diagram: funnel stages with associated events and metrics]

    Notes: However, real life intercedes. In many businesses, this is a complicated model to calculate correctly.
    * scrubs
    * many vendors, data sources, different metrics to be aligned
    * lots of roll-ups
    * Bayesian point estimates
    * forecasts and dashboards
    * social dimension makes this convoluted
    * not simple

  • 10. Marketing Funnel: very large scale
    Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend.

    Ad networks attempt to simplify and optimize parts of the funnel process as a value-add.

    The need for these insights has been a driver for Hadoop-related technologies.

    [Diagram: funnel stages with associated events and metrics]
    Notes: The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related Big Data technologies.

  • 11. Marketing Funnel: very large scale
    Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend.

    Ad networks attempt to simplify and optimize parts of the funnel process as a value-add.

    The need for these insights has been a driver for Hadoop-related technologies.

    Callout: funnel modeling and optimization requires complex data workflows to obtain the required insights

    [Diagram: funnel stages with associated events and metrics]

    Notes: These needs imply complex data workflows. It's not about doing a BI query or a pivot table; that's how retailers were thinking when Amazon came along.

  • 12. The Workflow Abstraction
    [Diagram: workflow with labels Document Collection, Tokenize, Scrub token, Regex token, HashJoin Left, Stop Word List (RHS), GroupBy token, Count, Word Count]

    Agenda:
      1. Funnel
      2. Circa 2008
      3. Cascading
      4. Sample Code
      5. Workflows
      6. Abstraction
      7. Trendlines

    Notes: A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
  • 13. Circa 2008: Hadoop at scale
    Scenario: Analytics team at a large ad network
    * Company had invested $MM capex in a large data warehouse across LOBs
    * Mission-critical app had been written as a large SQL workflow in the DW
    * Marketing funnel metrics were estimated for many advertisers, many campaigns, many publishers, many customers -- billions of calculations daily
    * Predictive models matched publisher ~ advertiser and campaign ~ user, to optimize marketing funnel performance

    [Diagram: clickstream, query/load, RDBMS, roll-ups, collab filter, per-user recommends, feeding the marketing funnel]

    Notes: Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network. Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.

  • 14. Circa 2008: Hadoop at scale
    Issues:
    * critical app had hit hard limits for scalability
    * several TB data, 100s of servers
    * batch window length vs. failure rate vs. SLA in the context of business growth posed an existential risk

    We built out a team to address these issues as rapidly as possible.

    Needed to re-create that data workflow based on Enterprise requirements.

    [Diagram: clickstream, query/load, RDBMS, roll-ups, collab filter, per-user recommends, feeding the marketing funnel]

    Notes: Marching orders:
    * 5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City;
    * 5 weeks to reverse engineer the mission-critical app without any access to its author;
    * 5 weeks to implement a Hadoop version which could scale-out on EC2.
    We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
  • 15. Circa 2008: Hadoop at scale
    Approach:
    * reverse-engineered business process from ~1500 lines of undocumented SQL
    * created a large, multi-step Apache Hadoop app on AWS
    * leveraged cloud strategy to trade $MM capex for lower, scalable opex
    * Amazon identified our app as one of the largest Hadoop deployments on EC2
    * our app became a case study for AWS prior to Elastic MapReduce launch

    [Diagram: clickstream, msg queue, HDFS, roll-ups, collab filter, per-user recommends, query/load, RDBMS]

    Notes: Our solution involved dependencies among more than a dozen Hadoop job steps.

  • 16. Circa 2008: Hadoop at scale
    Unresolved:
    * ETL was still a separate app
    * difficult to handle exceptions, notifications, debugging, etc., across the entire workflow
    * data scientists wore beepers since Ops lacked visibility into business process
    * coding directly in MapReduce created a staffing bottleneck

    [Diagram: clickstream, msg queue, HDFS, roll-ups, collab filter, per-user recommends, query/load, RDBMS]

    Notes: This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM -- for troubleshooting, handling exceptions, notifications, etc. Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, whi