State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain 2014

STATE OF PLAY

SEAN OWEN, DIRECTOR OF DATA SCIENCE, CLOUDERA

State of Play: Data Science on Hadoop in 2015

Sean Owen // Director, Data Science @ Cloudera

2

About …
• Engineer
• Data Science @ Cloudera
• Oryx project founder
• Committer, erstwhile VP Apache Mahout
• Apache Spark contributor / personality
• Co-author, Mahout in Action / Advanced Analytics on Spark
• [email protected] / @sean_r_owen

3

Where Is My Magic Wand?

4

We Like Hadoop Because …

It’s Aspirational
• (Was) Shiny New Toy
• Be Like Yahoo, Google, FB
• Data as Strategy

It Costs Less
• Free – Just Add Hardware
• Open, Standard
• Cost-Savings Projects

We Get More Computing
• Bigger and Faster is Better
• Fewer Hacks to Survive Scale
• Do The Previously Impossible

www.avalonconsulting.net/blog/485-thinking-beyond-shiny-and-new
www.pianta.co.uk/massive-sale-now-on/
www.google.com/about/careers/locations/mayes-county/

5

Incremental Today vs. Revolutionary Tomorrow

Incremental Today
• We set up a prototype Hadoop cluster as part of a big data POC
• We cut our IT budget by 22% by moving some operations to Hadoop
• Our SQL queries are 3 times faster and overnight reports finish in 39 minutes now
• We do the same things with data, but do them notably better.

Revolutionary Tomorrow
• We want to become a real-time product business that reacts to new machine sensor data in seconds, not days
• We want to predict which merchants will take out a business loan this month
• We want a complete customer profile that “understands” what they want at any time
• We think there is a magic wand available?

6

Phase 1. Collect Data Phase 2. Data Science? Phase 3. Profit!

7

Demystifying with Data Science
• Machine Learning is not new
• Big Machine Learning is qualitatively different
– More data beats algorithm improvement
– Scale trumps noise and sample size effects
– Can brute-force manual tasks
  • Feature selection
  • Hyperparameter tuning
• Engineering “Big” is Difficult
– Build new scalable data platforms
– Re-engineering parallel algorithms
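The "brute-force hyperparameter tuning" point above can be sketched as a plain grid search. This is only an illustration and not something from the talk: the names `GridSearch`, `fit`, `mse`, and `best` are hypothetical, and the model (1-D ridge regression without an intercept) was chosen only because it has a one-line closed form. The idea is to train one model per candidate value and keep the one with the lowest validation error. With a cluster available, the loop over candidates is the part one would spread across machines.

```scala
object GridSearch {
  // Closed-form 1-D ridge regression without intercept: w = sum(x*y) / (sum(x*x) + lambda).
  def fit(train: Seq[(Double, Double)], lambda: Double): Double = {
    val sxy = train.map { case (x, y) => x * y }.sum
    val sxx = train.map { case (x, _) => x * x }.sum
    sxy / (sxx + lambda)
  }

  // Mean squared error of the model y = w * x on a data set.
  def mse(w: Double, data: Seq[(Double, Double)]): Double =
    data.map { case (x, y) => math.pow(y - w * x, 2) }.sum / data.size

  // Brute force: fit one model per candidate lambda and keep the
  // candidate whose model has the lowest validation error.
  def best(train: Seq[(Double, Double)],
           valid: Seq[(Double, Double)],
           grid: Seq[Double]): Double =
    grid.minBy(lambda => mse(fit(train, lambda), valid))
}
```

For example, `GridSearch.best(train, valid, Seq(0.0, 1.0, 10.0))` evaluates three candidate values and returns the winner. Nothing about the search is clever; it simply spends more compute.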

8

What is Data Science?
What skill sets does it require?
What tools are commonly used?
How do we architect data products?
How do we get started?

9

Three Camps

10

s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html

11

Business

12

Business

13

Engineering vs. Statistics

Engineering
• Programming languages
• Systems languages
• Latency, throughput
• Huge data
• Online problems
• Automated
• Developers, Engineers

vs.

Statistics
• Statistical environments, BI tools
• High-level languages
• Accuracy
• Medium-sized data
• Offline work
• Ad-hoc
• Statisticians, Analysts

14

Data Science + Hadoop

15

Engineering, Statistics & Hadoop: Before

Gap.

16

Engineering, Statistics & Hadoop: 2014

[Diagram labels: YARN, R, M]

17

Apache Spark: Something for Everyone
• Now Apache TLP
– From UC Berkeley AMPLab
– … inspired by MS DryadLINQ
• Scala-based
– Expressive, efficient
– JVM-based
• Scala-like abstractions
– RDD: Resilient Distributed (immutable) Dataset
– Distributed works like local
– Like Apache Crunch is Collection-like
• Read-Evaluate-Print-Loop
– Interactive
– No compile/deploy cycle needed
• Python API too
• Natively Distributed
• Hadoop-friendly
– Integrate with where data already is
– ETL no longer separate
• Subprojects: MLlib and more
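To make "distributed works like local" concrete, here is a small word count written against ordinary Scala collections only, so it runs without Spark. The object and method names are hypothetical. The point is the shape of the code: an RDD offers the same `flatMap`, `filter`, and `map` calls, and on an RDD of pairs the final grouping step would typically be written with `reduceByKey` instead of `groupBy`.

```scala
object LocalLikeDistributed {
  // Word count on a local Seq; the same chain of transformations
  // reads almost identically when the collection is an RDD.
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)         // drop empty tokens
      .map(word => (word, 1))     // pair each word with a count of 1
      .groupBy(_._1)              // local stand-in for a by-key shuffle
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```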

18

Statisticians: Shell, Concise Syntax

<row Id="4"...Tags="...c#...winforms..."/>

(4,"c#")(4,"winforms")...

(4,3104,1.0)(4,2148819,1.0)...

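scala> val postIDTags = postsXML.flatMap { line =>
         // Capture the post ID and the raw Tags attribute from one <row .../> line
         val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
         // Assumes tags are entity-escaped in the XML, e.g. &lt;c#&gt;&lt;winforms&gt;
         val tagRegex = "&lt;([^&]+)&gt;".r
         idTagRegex.findFirstMatchIn(line) match {
           case None => Nil  // line has no Id/Tags pair: emit nothing
           case Some(m) => {
             val postID = m.group(1).toInt
             val tagsString = m.group(2)
             val tags = tagRegex.findAllMatchIn(tagsString)
               .map(_.group(1)).toList
             // One (postID, tag) pair per tag on the post
             tags.map((postID, _))
           }
         }
       }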
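The slide also shows a third form of the data, `(4,3104,1.0)` and `(4,2148819,1.0)`, without showing how it is produced. The sketch below is one guess at that step, run on local collections so it works without Spark. The sample row is a simplified, hypothetical stand-in for a real row, and `nnHash` is an assumed helper, not something shown in the deck. Masking `String.hashCode` to a non-negative value with `0x7FFFFF` happens to reproduce both numbers on the slide, which is the only reason it was chosen.

```scala
object TagRatings {
  private val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  private val tagRegex = "&lt;([^&]+)&gt;".r

  // Same parsing logic as the shell session, as a standalone function.
  def parse(line: String): List[(Int, String)] =
    idTagRegex.findFirstMatchIn(line) match {
      case None => Nil
      case Some(m) =>
        val postID = m.group(1).toInt
        tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList.map((postID, _))
    }

  // Assumption: map each tag to a non-negative Int ID by masking its hash code.
  def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFF

  // (postID, tagID, 1.0): an implicit "rating" of 1.0 for each tag on a post.
  def ratings(lines: Seq[String]): Seq[(Int, Int, Double)] =
    lines.flatMap(parse).map { case (id, tag) => (id, nnHash(tag), 1.0) }
}
```

Calling `TagRatings.ratings` on `<row Id="4" Tags="&lt;c#&gt;&lt;winforms&gt;"/>` yields the two triples shown on the slide. Hashing tags to integers can collide, so this is a convenience for getting numeric IDs quickly rather than a lossless encoding.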

19

Engineers: Distributed, Manageable

20

2015 is Time to Operationalize

21

From Exploratory to Operational

Exploratory Analytics Operational Analytics

Explore Data

Pick Model

Build Model at Scale, Offline

Continuously Update Model

Score Model in Real-Time

22

Lambda Architecture (λ), noun. 1. Name of a design idea you’ve had before but didn’t realize was a thing that needed a name.

23

Lambda Architecture

λ:Streaming

• Lambda Architecture
– Batch Layer: compute full answer offline, in batch
– Speed Layer: compute approximate answer online, in near-real-time
– Serving Layer: stitch speed/batch answers together in real-time
• Great fit for big, real-time ML
• Ecosystem has right components now
– Batch: Spark + MLlib
– Speed: Spark Streaming
– Serving: Tomcat / Jetty
– Data Fabric: Kafka, HDFS
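The three layers above can be sketched in a few lines of plain Scala. This is a toy, not Oryx or any real deployment: every name is hypothetical, the "model" is just a per-key event count, and in-memory maps stand in for Spark, Spark Streaming, and a web server. In an ML setting the speed layer would apply an approximate model update rather than the exact count used here.

```scala
object LambdaSketch {
  type Counts = Map[String, Long]

  // Batch layer: recompute the full answer from all historical events.
  def batchView(history: Seq[String]): Counts =
    history.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

  // Speed layer: fold one newly arrived event into a small incremental view.
  def speedUpdate(view: Counts, event: String): Counts =
    view.updated(event, view.getOrElse(event, 0L) + 1L)

  // Serving layer: stitch the batch and speed views together at query time.
  def query(batch: Counts, speed: Counts, key: String): Long =
    batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)
}
```

A usage sketch: build `batchView` from yesterday's events, fold today's events into an empty map with `speedUpdate`, and answer requests with `query`. When the next batch run finishes, the speed view is discarded and restarted from empty.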

24

Oryx 2: Lambda for ML (alpha)

github.com/OryxProject/oryx

Thank you
[email protected] / @sean_r_owen

17th ~ 18th Nov 2014, Madrid (Spain)
