
Transcript
Page 1: The Workflow Abstraction

Copyright ©2013, Concurrent, Inc.

Strata SC, 2013-02-28

Paco Nathan, Concurrent, Inc., San Francisco, CA, @pacoid

“The Workflow Abstraction”

Friday, 01 March 13

Background: dual in quantitative and distributed systems. I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps…

Page 2: The Workflow Abstraction

[Flow diagram: Document Collection → Tokenize (Regex token) → Scrub token → HashJoin Left against a Stop Word List (RHS) → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

The Workflow Abstraction

1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines

This talk is about the workflow abstraction:

• the business process of structuring data

• the practices of building robust apps at scale

• the open source projects for Enterprise Data Workflows

We’ll consider some theory, examples, best practices, trendlines -- what are the drivers that brought us, and where is this work heading toward?

Most of all, make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.

Page 3: The Workflow Abstraction

[Funnel diagram: Campaigns → Customers, through the stages Awareness, Interest, Evaluation, Conversion, Referral, Repeat]

Marketing Funnel – overview

In reference to Making Data Work…

Almost every business uses a model similar to this – give or take a few steps.

Customer leads go in at the top, those get refined through several stages,then results flow out the bottom.

Let’s consider one of the most fundamental predictive models used in business: a marketing funnel.

This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.

Page 4: The Workflow Abstraction

[Funnel diagram, as before, with clickstream events mapped to stages: Impression, Click, Sign Up, Purchase, "Like"]

Marketing Funnel – clickstream

Different funnel stages get represented in ecommerce by events captured in log files, as a class of machine data called clickstream:

• ad impressions

• URL clicks

• landing page views

• new user registrations

• session cookies

• online purchases

• social network activity

• etc.

Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.

Page 5: The Workflow Abstraction

[Funnel diagram, as before, annotated with metrics at each stage: CPM, CTR, CPA; NPS, social graph, etc.; loyalty, win back, etc.; behaviors]

Marketing Funnel – metrics

A variety of clickstream metrics can be used as performance indicators at different stages of the funnel:

• CPM: cost per thousand

• CTR: click-through rate

• CPA: cost per action

• etc.

The many different highly-nuanced metrics which apply are mind-boggling :)

Page 6: The Workflow Abstraction

Marketing Funnel – example calculations

[Funnel diagram, as before]

metric | cost   | events | formula                | rate
CPM    | $4,000 | 10^6   | $4,000 ÷ (10^6 ÷ 10^3) | $4.00
CTR    | -      | 3∙10^3 | 3∙10^3 ÷ 10^6          | 0.3%
CPA    | -      | 20     | $4,000 ÷ 20            | $200

Here are examples of the kinds of calculations performed...
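The table's rows reduce to simple arithmetic. As a sketch (the class and method names here are illustrative, not from the deck), using the deck's example figures:

```java
// Sketch of the slide's funnel-metric arithmetic.
// Class and method names are illustrative, not from the deck.
public class FunnelMetrics {

    // CPM: cost per thousand impressions
    public static double cpm(double cost, long impressions) {
        return cost / (impressions / 1000.0);
    }

    // CTR: click-through rate, as a fraction of impressions
    public static double ctr(long clicks, long impressions) {
        return (double) clicks / impressions;
    }

    // CPA: cost per action (e.g., per conversion)
    public static double cpa(double cost, long actions) {
        return cost / actions;
    }

    public static void main(String[] args) {
        System.out.println(cpm(4000.0, 1_000_000)); // 4.0, i.e. $4.00 CPM
        System.out.println(ctr(3_000, 1_000_000));  // 0.003, i.e. 0.3% CTR
        System.out.println(cpa(4000.0, 20));        // 200.0, i.e. $200 CPA
    }
}
```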

Page 7: The Workflow Abstraction

[Funnel diagram, as before]

Marketing Funnel – predictive model

Given these metrics, we can go further to estimate cost per paying user (CPP), customer lifetime value (LTV), etc.

Then we can build a predictive model for return on investment (ROI) per customer, summarizing the funnel performance:

ROI = (LTV - CPP) ∕ CPP

As an example, after crunching lots of logs, suppose that…

CPP = $200
LTV = $2000
ROI = ($2000 - $200) ∕ $200

for a 9x multiple

For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers, which describes the efficiency of the marketing funnel at different stages.
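The slide's ROI model is a one-liner; a minimal sketch (again with illustrative names) reproducing the "9x multiple" example:

```java
// Sketch of the slide's ROI model; names are illustrative, not from the deck.
public class FunnelRoi {

    // ROI = (LTV - CPP) / CPP
    public static double roi(double ltv, double cpp) {
        return (ltv - cpp) / cpp;
    }

    public static void main(String[] args) {
        // CPP = $200, LTV = $2000, per the slide's example
        System.out.println(roi(2000.0, 200.0)); // 9.0 -- the "9x multiple"
    }
}
```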

Page 8: The Workflow Abstraction

[Architecture diagram: Logs and customer profile DBs feed source taps into a Data Workflow on a Hadoop Cluster, with a trap tap for exceptions; sink taps feed Customer Prefs, a Cache behind a customer Support WebApp, Reporting / Analytics Cubes, and PMML Modeling]

Marketing Funnel – example architecture

Let’s consider an example architecture for calculating, reporting, and taking action on funnel metrics, based on large-scale clickstream data…

[Funnel diagram, as before]

Here’s an example architecture of using clickstream metrics within an online business.

Page 9: The Workflow Abstraction

[Funnel diagram with metrics, as before; × marks indicate points of complication]

Marketing Funnel – complexities

Multiple ad partners, different contract terms, reporting different metrics at different times, click scrubs, etc.

Campaigns target specific geo/demo, test alternate landing pages, probably need to segment the customer base…

These issues make clickstream data large and yet sparse.

Other issues:

• seasonal variation

• fluctuating currency exchange rates

• distortions due to credit card fraud

• diminishing returns

• forecasting requirements

However, real life intercedes. In many businesses, this is a complicated model to calculate correctly.

• scrubs

• many vendors, data sources, different metrics to be aligned

• lots of roll-ups

• Bayesian point estimates

• forecasts and dashboards

The social dimension makes this convoluted, not simple.

Page 10: The Workflow Abstraction

[Funnel diagram with metrics, as before]

Marketing Funnel – very large scale

Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend.

Ad networks attempt to simplify and optimize parts of the funnel process as a value-add.

The need for these insights has been a driver for Hadoop-related technologies.

The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.

Page 11: The Workflow Abstraction

[Funnel diagram with metrics, as before]

Marketing Funnel – very large scale


funnel modeling and optimization requires complex data workflows to obtain the required insights

These needs imply complex data workflows.

It’s not about doing a BI query or a pivot table; that’s how retailers were thinking when Amazon came along.

Page 12: The Workflow Abstraction

The Workflow Abstraction

[Flow diagram, as on page 2]

1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines

A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.

Page 13: The Workflow Abstraction

[Workflow diagram: clickstream → query/load → RDBMS → roll-ups → collab filter → per-user recommends]

Circa 2008 – Hadoop at scale

Scenario: Analytics team at a large ad network…

Company had invested $MM capex in a large data warehouse across LOBs

Mission-critical app had been written as a large SQL workflow in the DW

Marketing funnel metrics were estimated for many advertisers, many campaigns, many publishers, many customers – billions of calculations daily

Predictive models matched publisher ~ advertiser and campaign ~ user, to optimize marketing funnel performance

[Funnel diagram, as before]

Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network…

Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.

Page 14: The Workflow Abstraction

[Workflow diagram, as before, with the failing step marked ×]

Circa 2008 – Hadoop at scale

Issues:

• critical app had hit hard limits for scalability

• several TB of data, 100s of servers

• batch window length vs. failure rate vs. SLA in the context of business growth posed an existential risk

We built out a team to address these issues as rapidly as possible…

Needed to re-create that data workflow based on Enterprise requirements.

[Funnel diagram, as before]

Marching orders: 5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City; 5 weeks to reverse engineer the mission-critical app without any access to its author; 5 weeks to implement a Hadoop version which could scale-out on EC2.

We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.

Page 15: The Workflow Abstraction

[Workflow diagram, as before, now with HDFS and a msg queue added]

Circa 2008 – Hadoop at scale

Approach:

• reverse-engineered business process from ~1500 lines of undocumented SQL

• created a large, multi-step Apache Hadoop app on AWS

• leveraged cloud strategy to trade $MM capex for lower, scalable opex

• Amazon identified our app as one of the largest Hadoop deployments on EC2

• our app became a case study for AWS prior to Elastic MapReduce launch

Our solution involved dependencies among more than a dozen Hadoop job steps.

Page 16: The Workflow Abstraction

[Workflow diagram, as before (HDFS, msg queue)]

Circa 2008 – Hadoop at scale

Unresolved:

• ETL was still a separate app

• difficult to handle exceptions, notifications, debugging, etc., across the entire workflow

• data scientists wore beepers since Ops lacked visibility into business process

• coding directly in MapReduce created a staffing bottleneck

This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM -- for troubleshooting, handling exceptions, notifications, etc.

Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea.

Three issues about Enterprise workflows:

• staffing bottleneck unless there’s a good abstraction layer

• operational complexity, mostly due to lack of transparency

• system integration problems *are* the main problem to solve

Page 17: The Workflow Abstraction

Circa 2008 – Hadoop at scale

Unresolved:

• ETL was still a separate app

• difficult to handle exceptions, notifications, debugging, etc., across the entire workflow

• data scientists wore beepers since Ops lacked visibility into the app’s business logic

• coding directly in MapReduce created a staffing bottleneck

[Workflow diagram, as before (HDFS, msg queue)]

a good solution for a large, commercial Apache Hadoop deployment, but workflow management lacked crucial features…

which led to a search for a better workflow abstraction

While leading this team, I sought out other ways of managing a complex workflow involving Hadoop.

I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.

Page 18: The Workflow Abstraction

The Workflow Abstraction

[Flow diagram, as on page 2]

1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines

Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.

Page 19: The Workflow Abstraction

Cascading – origins

API author Chris Wensel worked as a system architect at an Enterprise firm well-known for several popular data products.

Wensel was following the Nutch open source project – before Hadoop even had a name.

He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop – a potential blocker for leveraging this new open source technology.

Cascading initially grew from interaction with the Nutch project, before Hadoop had a name.

API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context, without an abstraction layer.

Page 20: The Workflow Abstraction

Cascading – functional programming

Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.

To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:

• leverages the JVM and Java-based tools without any need to create an entirely new language

• allows many programmers who have J2EE expertise to build apps that leverage the economics of Hadoop clusters

Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.

Page 21: The Workflow Abstraction

quotes…

“Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.”

CIO, Thor Olavsrud, 2012-06-06
cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading

“Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.”

2012 BOSSIE Awards, James Borck, 2012-09-18
infoworld.com/slideshow/65089

Industry analysts are picking up on the staffing costs related to Hadoop: “no free lunch.”

The issues:

• staffing bottleneck

• operational complexity

• system integration

Page 22: The Workflow Abstraction

Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.

• partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera

• 5+ year history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org

• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc.

Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases.

Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.

Page 23: The Workflow Abstraction

examples…

• Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments

• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:

Cascalog in Clojure (2010)
Scalding in Scala (2012)

github.com/nathanmarz/cascalog/wiki

github.com/twitter/scalding/wiki

Many case studies, many Enterprise production deployments now for 5+ years.

Page 24: The Workflow Abstraction

examples…


Cascading as the basis for workflow abstractions atop Hadoop and more, with a 5+ year history of production deployments across multiple verticals

Cascading as a basis for workflow abstraction, for Enterprise data workflows.

Page 25: The Workflow Abstraction

The Workflow Abstraction

[Flow diagram, as on page 2]

1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines

Code samples in Cascading / Cascalog / Scalding, based on Word Count.

Page 26: The Workflow Abstraction

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

The Ubiquitous Word Count

Definition: count how often each word appears in a collection of text documents

This simple program provides an excellent test case for parallel processing, since it:

• requires a minimal amount of code

• demonstrates use of both symbolic and numeric values

• shows a dependency graph of tuples as an abstraction

• is not many steps away from useful search indexing

• serves as a “Hello World” for Hadoop apps

Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
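To make the pseudocode above concrete, here is a single-process sketch of the same Tokenize → GroupBy token → Count pipeline using plain Java collections -- no Hadoop, and the class name is illustrative, not from the deck:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-process Word Count sketch; illustrative, not the Cascading version.
public class WordCountLocal {

    // Tokenize, group by token, count -- the same pipeline the diagram shows
    public static Map<String, Long> wordCount(String text) {
        return Arrays.stream(text.toLowerCase().split("[ \\[\\]\\(\\),.]+"))
                     .filter(token -> !token.isEmpty())
                     .collect(Collectors.groupingBy(
                         token -> token,          // GroupBy token
                         TreeMap::new,            // sorted, like the shuffle into reduce
                         Collectors.counting())); // Count
    }

    public static void main(String[] args) {
        System.out.println(wordCount("a rain shadow, a dry land"));
        // {a=2, dry=1, land=1, rain=1, shadow=1}
    }
}
```

The `groupingBy` / `counting` collectors play the role of the shuffle and reduce phases; a distributed framework differs mainly in partitioning this grouping across machines.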

[Flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...

Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.

Page 27: The Workflow Abstraction

[Flow diagram, as before]

1 map, 1 reduce, 18 lines of code: gist.github.com/3900702

word count – conceptual flow diagram

cascading.org/category/impatient

Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.

Page 28: The Workflow Abstraction

word count – Cascading app in Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

[Flow diagram, as before]

Based on a Cascading implementation of Word Count, here is sample code -- approx 1/3 the code size of the Word Count example from Apache Hadoop.

2nd to last line: generates a DOT file for the flow diagram

Page 29: The Workflow Abstraction

word count – generated flow diagram

[DOT output generated by the app: [head] → Hfs['TextDelimited[['doc_id', 'text']]']['data/rain.txt'] → map: Each('token')[RegexSplitGenerator[decl:'token']] → GroupBy('wc')[by:['token']] → reduce: Every('wc')[Count[decl:'count']] → Hfs['TextDelimited[['token', 'count']]']['output/wc'] → [tail]]

As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan -- generated by the app itself.

Page 30: The Workflow Abstraction

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient

word count – Cascalog / Clojure

[Flow diagram, as before]

Here is the same Word Count app written in Clojure, using Cascalog.

Page 31: The Workflow Abstraction

github.com/nathanmarz/cascalog/wiki

• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language

• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL

• composable subqueries, used for test-driven development (TDD) practices at scale

• Leiningen build: simple, no surprises, in Clojure itself

• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog

• has a learning curve, limited number of Clojure developers

• aggregators are the magic, and those take effort to learn

word count – Cascalog / Clojure

[Flow diagram, as before]

From what we see about language features, customer case studies, and best practices in general -- Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.

Great for large-scale, complex apps, where small teams must limit the complexities in their process.

Page 32: The Workflow Abstraction

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}

word count – Scalding / Scala

[Flow diagram, as before]

Here is the same Word Count app written in Scala, using Scalding.

Very compact, easy to understand; however, also more imperative than Cascalog.

Page 33: The Workflow Abstraction

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading

• code is compact, easy to understand

• nearly 1:1 between elements of conceptual flow diagram and function calls

• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.

• significant investments by Twitter, Etsy, eBay, etc.

• great for data services at scale

• less learning curve than Cascalog, not as much of a high-level language

word count – Scalding / Scala

[Flow diagram, as before]

If you wanted to see what a data services architecture for machine learning at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.

Page 34: The Workflow Abstraction

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading

• code is compact, easy to understand

• nearly 1:1 between elements of conceptual flow diagram and function calls

• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.

• significant investments by Twitter, Etsy, eBay, etc.

• great for data services at scale (imagine SOA infra @ Google as an open source project)

• less learning curve than Cascalog, not as much of a high-level language

word count – Scalding / Scala

[Flow diagram, as before]

Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process

Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…

Page 35: The Workflow Abstraction

The Workflow Abstraction

[Flow diagram, as on page 2]

1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines

Tracking back to the Marketing Funnel as an example workflow… Let’s consider how Cascading apps incorporate other components beyond Hadoop.

Page 36: The Workflow Abstraction

[Architecture diagram, as on page 8]

Enterprise Data Workflows

Back to our marketing funnel, let’s consider an example app… at the front end

LOB use cases drive demand for apps

LOB use cases drive the demand for Big Data apps.

Page 37: The Workflow Abstraction

[Architecture diagram, as on page 8]

Enterprise Data Workflows

An example… in the back office

Organizations have substantial investments in people, infrastructure, process

Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes.

Page 38: The Workflow Abstraction

[Architecture diagram, as on page 8]

Enterprise Data Workflows

An example… for the heavy lifting!

“Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out

“Main Street” firms have invested in Hadoop to address Big Data needs, off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.

Page 39: The Workflow Abstraction

[Architecture diagram, as on page 8]

Cascading workflows – taps

• taps integrate other data frameworks, as tuple streams

• these are “plumbing” endpoints in the pattern language

• sources (inputs), sinks (outputs), traps (exceptions)

• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.

• data serialization: Avro, Thrift, Kryo, JSON, etc.

• extend a new kind of tap in just a few lines of Java

schema and provenance get derived from analysis of the taps
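Schematically, a tap adapts some external store to a stream of tuples at a workflow endpoint. The interface below is a hypothetical sketch of that idea only -- it is NOT the actual Cascading Tap API, and all names in it are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the "tap" idea -- NOT the actual Cascading Tap API.
// A tap is a workflow endpoint that adapts an external resource to a tuple stream.
interface TupleTap {
    Iterator<List<Object>> openForRead(); // source: yields tuples
    void write(List<Object> tuple);       // sink: accepts tuples
}

// A trivial in-memory tap, standing in for HDFS, JDBC, Memcached, HBase, etc.
class InMemoryTap implements TupleTap {
    private final List<List<Object>> store = new ArrayList<>();

    @Override public Iterator<List<Object>> openForRead() { return store.iterator(); }
    @Override public void write(List<Object> tuple) { store.add(tuple); }
}

public class TapSketch {
    public static void main(String[] args) {
        TupleTap sink = new InMemoryTap();
        sink.write(List.of("rain", 2));
        sink.write(List.of("shadow", 1));
        sink.openForRead().forEachRemaining(System.out::println);
    }
}
```

The point of the pattern: the workflow only ever sees tuple streams, so swapping the backing store (text files, JDBC, HBase, etc.) means swapping the tap, not the workflow.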

Speaking of system integration, taps provide the simplest approach for integrating different frameworks.

Page 40: The Workflow Abstraction

Cascading workflows – taps

[Word Count source code, as on page 28, with the source and sink tap definitions highlighted]

source and sink taps for TSV data in HDFS

Here are the taps in the WordCount source.

Page 41: The Workflow Abstraction

[Architecture diagram, as on page 8]

Cascading workflows – topologies

• topologies execute workflows on clusters

• flow planner is like a compiler for queries

- Hadoop (MapReduce jobs)

- local mode (dev/test or special config)

- in-memory data grids (real-time)

• flow planner can be extended to support other topologies

blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)

Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.

Page 42: The Workflow Abstraction

Cascading workflows – topologies

[Word Count source code, as on page 28, with the HadoopFlowConnector flow planner highlighted]

flow planner for Apache Hadoop topology

Here is the flow planner for Hadoop in the WordCount source.

Page 43: The Workflow Abstraction

example topologies…

Here are some examples of topologies for distributed computing -- Apache Hadoop being the first supported by Cascading, followed by local mode, and now a tuple space (IMDG) flow planner in the works.

Several other widely used platforms would also be likely suspects for Cascading flow planners.

Page 44: The Workflow Abstraction


Cascading workflows – ANSI SQL

• collab with Optiq – industry-proven code base

• ANSI SQL parser/optimizer atop Cascading flow planner

• JDBC driver to integrate into existing tools and app servers

• relational catalog over a collection of unstructured data

• SQL shell prompt to run queries

• enable analysts without retraining on Hadoop, etc.

• transparency for Support, Ops, Finance, et al.

• a language for queries – not a database, but ANSI SQL as a DSL for workflows

ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration.

Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.

BTW, most of the SQL in the world is written by machines. This is not a database; this is about making machine-to-machine communications simpler and more robust at scale.

Page 45: The Workflow Abstraction

ANSI SQL – CSV data in local file system

cascading.org/lingual

The test database for MySQL is available for download from https://launchpad.net/test-db/

Here we have a bunch o’ CSV flat files in a directory in the local file system.

Use the “lingual” command line interface to overlay DDL to describe the expected table schema.

Page 46: The Workflow Abstraction

ANSI SQL – shell prompt, catalog

cascading.org/lingual

Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.

Page 47: The Workflow Abstraction

ANSI SQL – queries

cascading.org/lingual

Here’s an example SQL query on that “employee” test database from MySQL.

Page 48: The Workflow Abstraction


Cascading workflows – machine learning

• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML

• Cascading creates parallelized models to run at scale on Hadoop clusters

• Random Forest, Logistic Regression, GLM, Decision Trees, K-Means, Hierarchical Clustering, etc.

• integrate with other libraries (Matrix API, etc.) and great open source tools (R, Weka, KNIME, RapidMiner, etc.)

• 2 lines of code or pre-built JAR

Run multiple variants of models as customer experiments

PMML has been around for a while, and export is supported by nearly every commercial analytics platform, covering a wide variety of predictive modeling algorithms.

Cascading reads PMML, building out workflows under the hood which run efficiently in parallel.

Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;)

Several companies are collaborating on this open source project, https://github.com/Cascading/cascading.pattern

Page 49: The Workflow Abstraction

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

model creation in R

cascading.org/pattern

Sample code in R for generating a predictive model for anti-fraud, based on a machine learning algorithm called Random Forest.

Page 50: The Workflow Abstraction

[diagram: Customer Orders → Classify (PMML Model) → Scored Orders; GroupBy token → Count; Assert with Failure Traps; Confusion Matrix; M/R phase boundaries marked]

model run at scale as a Cascading app

cascading.org/pattern

Conceptual flow diagram for a Cascading app which runs a PMML model at scale, while trapping data exceptions (e.g., regression tests) and tallying a “confusion matrix” for quantifying the model performance.
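The “confusion matrix” tally itself can be sketched in a few lines of plain Java: count how often each predicted label co-occurs with each true label. The class and method names here are hypothetical illustrations, not part of the cascading.pattern API.

```java
import java.util.*;

// Minimal sketch of tallying a confusion matrix from predicted vs. true
// labels, as the Cascading app does at scale. (Illustrative stdlib code.)
public class ConfusionMatrix {
  static Map<String, Integer> tally(List<String> predicted, List<String> actual) {
    Map<String, Integer> cells = new TreeMap<>();
    for (int i = 0; i < predicted.size(); i++) {
      // key each cell as "predicted/actual", e.g. "1/0" is a false positive
      String cell = predicted.get(i) + "/" + actual.get(i);
      cells.merge(cell, 1, Integer::sum);
    }
    return cells;
  }

  public static void main(String[] args) {
    List<String> pred  = Arrays.asList("1", "0", "1", "1");
    List<String> truth = Arrays.asList("1", "0", "0", "1");
    System.out.println(tally(pred, truth));  // {0/0=1, 1/0=1, 1/1=2}
  }
}
```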

Page 51: The Workflow Abstraction

public class Main {
  public static void main( String[] args ) {
    String pmmlPath = args[ 0 ];
    String ordersPath = args[ 1 ];
    String classifyPath = args[ 2 ];
    String trapPath = args[ 3 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
    Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
    Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

    // define a "Classifier" model from PMML to evaluate the orders
    ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
    Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
      .addSource( classifyPipe, ordersTap )
      .addTrap( classifyPipe, trapTap )
      .addSink( classifyPipe, classifyTap );

    // write a DOT file and run the flow
    Flow classifyFlow = flowConnector.connect( flowDef );
    classifyFlow.writeDOT( "dot/classify.dot" );
    classifyFlow.complete();
  }
}

model run at scale as a Cascading app

Source code for a simple Cascading app that runs PMML models in general.

Page 52: The Workflow Abstraction

PMML support…

Popular tools which can create predictive models for export as PMML.

Page 53: The Workflow Abstraction


Cascading workflows – test-driven development

• assert patterns (regex) on the tuple streams

• adjust assert levels, like log4j levels

• trap edge cases as “data exceptions”

• TDD at scale:

1. start from raw inputs in the flow graph

2. define stream assertions for each stage of transforms

3. verify exceptions, code to remove them

4. when impl is complete, app has full test coverage

• TDD follows from Cascalog’s composable subqueries

• redirect traps in production to Ops, QA, Support, Audit, etc.
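The trap idea above can be sketched in plain Java: records that fail a stream assertion are diverted to a trap rather than failing the whole run. The class, method, and validation regex here are hypothetical illustrations, not the Cascading assertion/trap API.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of a "failure trap": records failing a stream assertion are
// diverted instead of killing the whole job. (Illustrative stdlib code.)
public class TrapSketch {
  static final Pattern VALID = Pattern.compile("\\d+");  // assert: field must be numeric

  static List<String> process(List<String> records, List<String> trap) {
    List<String> good = new ArrayList<>();
    for (String rec : records) {
      if (VALID.matcher(rec).matches()) {
        good.add(rec);   // passes the assertion: continue downstream
      } else {
        trap.add(rec);   // "data exception": redirect to Ops, QA, Support, etc.
      }
    }
    return good;
  }

  public static void main(String[] args) {
    List<String> trap = new ArrayList<>();
    List<String> good = process(Arrays.asList("42", "oops", "7"), trap);
    System.out.println(good + " / trapped: " + trap);  // [42, 7] / trapped: [oops]
  }
}
```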

TDD is not usually high on the list when people start discussing Big Data apps.

The notion of a “data exception” was introduced into Cascading, based on setting stream assertion levels as part of the business logic of an application.

Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology, in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.

Page 54: The Workflow Abstraction


Cascading workflows – TDD meets API principles

• specify what is required, not how it must be achieved

• plan far ahead, before consuming cluster resources – fail fast prior to submit

• fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale

• same JAR, any scale – app does not require a recompile to change data taps or cluster topologies

Some of the design principles for the pattern language.

Page 55: The Workflow Abstraction

Two Avenues…

[chart: complexity plotted against scale]

Enterprise: must contend with complexity at scale every day…

incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…

new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

Enterprise data workflows are observed in two modes: start-ups approaching complexity, and incumbent firms grappling with complexity.

Page 56: The Workflow Abstraction

Two Avenues…


Enterprise: must contend with complexity at scale every day…

incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…

new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps

Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple of different ways to arrive at the party.

Page 57: The Workflow Abstraction

The Workflow Abstraction

[diagram: the WordCount flow: Document Collection → Tokenize (Regex token) → HashJoin Left with Stop Word List (RHS) → Scrub token → GroupBy token → Count → Word Count; M/R phases marked]

1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines

Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.

Page 58: The Workflow Abstraction

Cascading workflows – pattern language

Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.


Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps.

In formal terms, this provides a pattern language.
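As a rough sketch of the plumbing metaphor, two small operations can be composed like pipe segments that tuples flow through. This is stdlib Java with hypothetical names, standing in for the Cascading Pipe/Tap API rather than reproducing it.

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

// "Plumbing" sketch: tokenize, then scrub stop words, composed like pipe
// segments. (Illustrative stdlib code, not the Cascading API.)
public class PipeSketch {
  static List<String> run(String line, Set<String> stopWords) {
    Function<String, Stream<String>> tokenize =
        l -> Arrays.stream(l.split("\\s+"));
    UnaryOperator<Stream<String>> scrub =
        tokens -> tokens.filter(t -> !stopWords.contains(t));
    // compose the segments: the token stream flows through each in turn
    return scrub.apply(tokenize.apply(line)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(run("the rain in a shadow", Set.of("a", "the")));
    // [rain, in, shadow]
  }
}
```

The functional-composition style is the point: each operation is small, reusable, and testable on its own, and the assembly reads like the flow diagram.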

A pattern language, based on the metaphor of “plumbing”.

Page 59: The Workflow Abstraction

references…

pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices.

amazon.com/dp/0195019199

design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by “Gang of Four”.

amazon.com/dp/0201633612

Chris Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.

Page 60: The Workflow Abstraction

Cascading workflows – pattern language

Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.


Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps.

In formal terms, this provides a pattern language.

design principles of the pattern language ensure best practices for robust, parallel data workflows at scale

The pattern language provides a structured method for solving large, complex design problems, where the syntax of the language promotes use of best practices – which also addresses staffing issues.

Page 61: The Workflow Abstraction

Cascading workflows – literate programming

Cascading workflows generate their own visual documentation: flow diagrams

In formal terms, flow diagrams leverage a methodology called literate programming

Provides intuitive, visual representations for apps, great for cross-team collaboration.


Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming.

Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- expert developers generally ask a novice to provide a flow diagram first.

Page 62: The Workflow Abstraction

references…

by Don Knuth

Literate Programming
Univ of Chicago Press, 1992

literateprogramming.com/

“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.

Page 63: The Workflow Abstraction

examples…

• Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams – excellent elision and literate representation

• noticed on cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app’s flow diagram (generated as a DOT file), sometimes in lieu of showing code

In formal terms, a flow diagram is a directed, acyclic graph (DAG) on which lots of interesting math applies for query optimization, predictive models about app execution, parallel efficiency metrics, etc.

[flow diagram (DOT) for WordCount: map phase: Hfs source tap ['doc_id', 'text'] ('data/rain.txt') → Each('token')[RegexSplitGenerator[decl:'token']]; reduce phase: GroupBy('wc')[by:['token']] → Every('wc')[Count[decl:'count']] → Hfs sink tap ['token', 'count'] ('output/wc')]
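One small example of the “interesting math” available on a DAG: a planner can topologically sort the operations so each step runs only after its inputs exist. A minimal sketch, assuming an acyclic dependency map and using hypothetical names; the real Cascading planner does far more (splitting phases, optimizing, etc.).

```java
import java.util.*;

// Topologically sort a DAG of flow steps: each node is emitted only after
// all of its upstream dependencies. Assumes acyclic input. (Illustrative
// stdlib code, not the Cascading flow planner.)
public class TopoSort {
  static List<String> sort(Map<String, List<String>> deps) {
    List<String> order = new ArrayList<>();
    Set<String> done = new LinkedHashSet<>();
    for (String node : deps.keySet()) visit(node, deps, done, order);
    return order;
  }

  static void visit(String node, Map<String, List<String>> deps,
                    Set<String> done, List<String> order) {
    if (done.contains(node)) return;
    done.add(node);
    for (String dep : deps.getOrDefault(node, List.of())) visit(dep, deps, done, order);
    order.add(node);  // emit after all upstream dependencies
  }

  public static void main(String[] args) {
    // WordCount flow: source -> tokenize -> group -> count -> sink
    Map<String, List<String>> deps = new LinkedHashMap<>();
    deps.put("sink", List.of("count"));
    deps.put("count", List.of("group"));
    deps.put("group", List.of("tokenize"));
    deps.put("tokenize", List.of("source"));
    System.out.println(sort(deps));  // [source, tokenize, group, count, sink]
  }
}
```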

Literate programming examples observed on the email list are some of the best illustrations of this methodology.

Page 64: The Workflow Abstraction

Cascading workflows – business process

Following the essence of literate programming, Cascading workflows provide statements of business process

This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)

As a separation of concerns between business process and implementation details (Hadoop, etc.)

This is especially apparent in large-scale Cascalog apps:

“Specify what you require, not how to achieve it.”

By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale.

Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL).

Page 65: The Workflow Abstraction

references…

by Edgar Codd

“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685

Rather than arguing between SQL vs. NoSQL…structured vs. unstructured data frameworks… this approach focuses on:

the process of structuring data

That’s what apps do – Making Data Work

Focus on *the process of structuring data*, which must happen before the large-scale joins, predictive models, visualizations, etc.

Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work.

BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)

Page 66: The Workflow Abstraction

Cascading workflows – functional relational programming

The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL.

Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in:

Moseley &amp; Marks, 2006: “Out of the Tar Pit”
goo.gl/SKspn

A more contemporary statement along similar lines...

Page 67: The Workflow Abstraction

Cascading workflows – functional relational programming

The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL.

Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in:

Moseley &amp; Marks, 2006: “Out of the Tar Pit”
goo.gl/SKspn

several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows


Page 68: The Workflow Abstraction

The Workflow Abstraction

[diagram: the WordCount flow; M/R phases marked]

1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines

Let’s consider a trendline subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes and commercialized Big Data. Where did Big Data come from, and where is this kind of work headed?

Page 69: The Workflow Abstraction

Q3 1997: inflection point

Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware.

This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG

MapReduce and the Apache Hadoop open source stack emerged from this.

Q3 1997: Greg Linden, et al., @ Amazon; Randy Shoup, et al., @ eBay -- independent teams arrived at the same conclusion:

parallelize workloads onto clusters of commodity servers to scale-out horizontally. Google and Inktomi (YHOO Search) were working along the same lines.

Page 70: The Workflow Abstraction

[diagram, circa 1996: Customers → Web App → transactions → RDBMS; BI Analysts run SQL queries, turning result sets into Excel pivot tables and PowerPoint slide decks for Stakeholders; Product sets strategy, Engineering takes requirements and ships optimized code]

Circa 1996: pre-inflection point

Ah, teh olde days - Perl and C++ for CGI :)

Feedback loops shown in red represent data innovations at the time…

Characterized by slow, manual processes: data modeling / business intelligence; “throw it over the wall”… this thinking led to impossible silos

Page 71: The Workflow Abstraction

[diagram, circa 2001: Customers → Web Apps → customer transactions and Logs (event history); aggregation feeds Algorithmic Modeling, producing recommenders + classifiers deployed through Middleware (servlets, models); DW/ETL loads an RDBMS queried for SQL result sets; dashboards serve Product, Engineering, UX, and Stakeholders]

Circa 2001: post big-ecommerce successes

Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models -- e.g., ad networks automating parts of the marketing funnel, as in our case study.

LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.

Page 72: The Workflow Abstraction

[diagram, circa 2013: Web Apps, Mobile, etc. serve Data Products to Customers; transactions/content, social interactions, and Log Events flow into RDBMS, an In-Memory Data Grid, and Hadoop, etc., spanning near-time, batch, and services topologies; a Planner and Cluster Scheduler use app History for optimized capacity and taps; Data Scientists, App Devs, Ops, and Domain Experts cover discovery + modeling, s/w dev, dashboard metrics, and business process; introduced capability alongside the existing SDLC (Prod, Eng, DW)]

Circa 2013: clusters everywhere

Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric.

Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment.

We see this feeding into cluster optimization in YARN, Mesos, etc.

Page 73: The Workflow Abstraction

Asymptotically…

• long-term trends toward more instrumentation of Enterprise data workflows:

- workflow abstraction enables business cases

- more machine data collected about apps

- flow diagram (DAG) as unit of work (abstract type for machine data)

- evolving feedback loops convert machine data into actionable insights and optimizations

• industry moves beyond common needs of ad-hoc queries on logs and basic reporting, as a new class of complex data workflows emerges to provide the insights required by Enterprise

• end game is less about “bigness” of data, more about managing complexity in the process of structuring data

[diagram: Workflow → DSL → Planner/Optimizer → Cluster Scheduler → Cluster, with App History fed back into the planner]

In summary…

Page 74: The Workflow Abstraction

by Leo Breiman

Statistical Modeling: The Two Cultures
Statistical Science, 2001

bit.ly/eUTh9L

references…

Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).

Page 75: The Workflow Abstraction

Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM

Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx

“The Birth of Google” – John Battelle
wired.com/wired/archive/13.08/battelle.html

references…

In their own words…

Page 76: The Workflow Abstraction

by Paco Nathan

Enterprise Data Workflows with Cascading

O’Reilly, 2013
amazon.com/dp/1449358721

references…

Some of this material comes from an upcoming O’Reilly book: “Enterprise Data Workflows with Cascading”.

This should be in Rough Cuts soon - scheduled to be out in print this June.

Many thanks to my wonderful editor, Courtney Nash.

Page 77: The Workflow Abstraction

blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:

cascading.org

zest.to/group11

github.com/Cascading

conjars.org

goo.gl/KQtUL

concurrentinc.com

join us for very interesting work!

drill-down…

Copyright @2013, Concurrent, Inc.

Links to our open source projects, developer community, etc…

contact me @pacoid
http://concurrentinc.com/
(we're hiring too!)