State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain 2014


Page 1

STATE OF PLAY

SEAN OWEN, DIRECTOR OF DATA SCIENCE, CLOUDERA

Page 2

State of Play: Data Science on Hadoop in 2015

Sean Owen // Director, Data Science @ Cloudera

Page 3

About …
• Engineer
• Data Science @ Cloudera
• Oryx project founder
• Committer, erstwhile VP Apache Mahout
• Apache Spark contributor / personality
• Co-author, Mahout in Action / Advanced Analytics on Spark
• [email protected] / @sean_r_owen

Page 4

Where Is My Magic Wand?

Page 5

We Like Hadoop Because …

It’s Aspirational
• (Was) Shiny New Toy
• Be Like Yahoo, Google, FB
• Data as Strategy

It Costs Less
• Free – Just Add Hardware
• Open, Standard
• Cost-Savings Projects

We Get More Computing
• Bigger and Faster is Better
• Fewer Hacks to Survive Scale
• Do The Previously Impossible

www.avalonconsulting.net/blog/485-thinking-beyond-shiny-and-new
www.pianta.co.uk/massive-sale-now-on/
www.google.com/about/careers/locations/mayes-county/

Page 6

Incremental Today vs. Revolutionary Tomorrow

Incremental Today
• We set up a prototype Hadoop cluster as part of a big data POC
• We cut our IT budget by 22% by moving some operations to Hadoop
• Our SQL queries are 3 times faster and overnight reports finish in 39 minutes now
• We do the same things with data, but do them notably better.

Revolutionary Tomorrow
• We want to become a real-time product business that reacts to new machine sensor data in seconds, not days
• We want to predict which merchants will take out a business loan this month
• We want a complete customer profile that “understands” what they want at any time
• We think there is a magic wand available?

Page 7

Phase 1. Collect Data
Phase 2. Data Science?
Phase 3. Profit!

Page 8

Demystifying with Data Science
• Machine Learning is not new
• Big Machine Learning is qualitatively different
  – More data beats algorithm improvement
  – Scale trumps noise and sample size effects
  – Can brute-force manual tasks
    • Feature selection
    • Hyperparameter tuning
• Engineering “Big” is Difficult
  – Build new scalable data platforms
  – Re-engineering parallel algorithms
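The "brute-force" idea for hyperparameter tuning can be illustrated with a small grid search. This is only a sketch under stated assumptions: `validationError` is a hypothetical toy function standing in for "train a model with these settings and measure its error on held-out data", and the parameter names `lambda` and `rank` are illustrative, not taken from the talk.

```scala
object GridSearchSketch {
  // Hypothetical stand-in for training a model and measuring held-out error.
  // This toy surface is smallest at lambda = 0.1 and rank = 20.
  def validationError(lambda: Double, rank: Int): Double =
    math.pow(math.log10(lambda) + 1.0, 2) + math.pow((rank - 20) / 10.0, 2)

  // Brute force: evaluate every combination and keep the one with the lowest error.
  def bestParams(lambdas: Seq[Double], ranks: Seq[Int]): (Double, Int) = {
    val candidates = for (l <- lambdas; r <- ranks) yield (l, r)
    candidates.minBy { case (l, r) => validationError(l, r) }
  }

  def main(args: Array[String]): Unit = {
    val best = bestParams(Seq(0.001, 0.01, 0.1, 1.0), Seq(10, 20, 50))
    println(s"best (lambda, rank) = $best")
  }
}
```

Each candidate is evaluated independently of the others, which is why this kind of search parallelizes easily across a cluster.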

Page 9

What is Data Science?
What skill sets does it require?
What tools are commonly used?
How do we architect data products?
How do we get started?

Page 10

Three Camps

Page 11

s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html

Page 12

Business

Page 13

Business

Page 14

Engineering vs. Statistics

Engineering
• Programming languages
• Systems languages
• Latency, throughput
• Huge data
• Online problems
• Automated
• Developers, Engineers

vs.

Statistics
• Statistical environments, BI tools
• High-level languages
• Accuracy
• Medium-sized data
• Offline work
• Ad-hoc
• Statisticians, Analysts

Page 15

Data Science + Hadoop

Page 16

Engineering, Statistics & Hadoop: Before

Gap.

Page 17

Engineering, Statistics & Hadoop: 2014

[Figure; legible labels: YARN, R, M]

Page 18

Apache Spark: Something for Everyone
• Now Apache TLP
  – From UC Berkeley AMPLab
  – … inspired by MS DryadLINQ
• Scala-based
  – Expressive, efficient
  – JVM-based
• Scala-like abstractions
  – RDD: Resilient Distributed (immutable) Dataset
  – Distributed works like local
  – Like Apache Crunch is Collection-like
• Read-Evaluate-Print-Loop
  – Interactive
  – No compile/deploy cycle needed
• Python API too
• Natively Distributed
• Hadoop-friendly
  – Integrate with where data already is
  – ETL no longer separate
• Subprojects: MLlib and more
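To make "distributed works like local" concrete, here is a minimal sketch that counts tags using only local Scala collections, with no Spark dependency. The object and method names are illustrative assumptions. On a Spark RDD the chain would look much the same, using `flatMap`, `filter` and `map`, followed by something like `reduceByKey` in place of the local `groupBy` step.

```scala
object LocalLikeDistributed {
  // Count whitespace-separated tags with ordinary collection operations.
  def tagCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // one element per tag
      .filter(_.nonEmpty)         // drop empty tokens
      .map(tag => (tag, 1))       // pair each tag with a count of 1
      .groupBy(_._1)              // local grouping by tag
      .map { case (tag, pairs) => (tag, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    println(tagCounts(Seq("c# winforms", "c# scala")))
  }
}
```

Because each step returns a new immutable collection rather than mutating the input, the same style carries over to an immutable, partitioned dataset.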

Page 19

Statisticians: Shell, Concise Syntax

<row Id="4"...Tags="...c#...winforms..."/>

(4,"c#")(4,"winforms")...

(4,3104,1.0)(4,2148819,1.0)...

scala> val postIDTags = postsXML.flatMap { line =>
         // Capture the post ID and the raw (XML-escaped) Tags attribute
         val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
         // Each tag appears as &lt;tag&gt; inside the Tags attribute
         val tagRegex = "&lt;([^&]+)&gt;".r
         idTagRegex.findFirstMatchIn(line) match {
           case None => None
           case Some(m) => {
             val postID = m.group(1).toInt
             val tagsString = m.group(2)
             val tags = tagRegex.findAllMatchIn(tagsString).map(_.group(1)).toList
             // Emit one (postID, tag) pair per tag
             tags.map((postID,_))
           }
         }
       }
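The same extraction logic can be tried on a local list without a cluster. The sketch below reuses the two regular expressions from the slide. The sample lines and the object name are assumptions made for illustration, and `Nil` is used in place of `None` so that the function has a plain `List` return type.

```scala
object TagExtraction {
  // Same regular expressions as on the slide.
  private val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  private val tagRegex = "&lt;([^&]+)&gt;".r

  // Turn one XML row into (postID, tag) pairs; rows without tags yield nothing.
  def postIDTags(line: String): List[(Int, String)] =
    idTagRegex.findFirstMatchIn(line) match {
      case None => Nil
      case Some(m) =>
        val postID = m.group(1).toInt
        tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList.map((postID, _))
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample rows, shaped like the row shown on the slide.
    val sample = List(
      """<row Id="4" Title="x" Tags="&lt;c#&gt;&lt;winforms&gt;"/>""",
      """<row Id="5" Title="no tags here"/>"""
    )
    println(sample.flatMap(postIDTags))
  }
}
```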

Page 20

Engineers: Distributed, Manageable

Page 21

2015 is Time to Operationalize

Page 22

From Exploratory to Operational

Exploratory Analytics → Operational Analytics

Explore Data
Pick Model
Build Model at Scale, Offline
Continuously Update Model
Score Model in Real-Time

Page 23

Lambda (λ) Architecture
noun. 1. Name of a design idea you’ve had before but didn’t realize was a thing that needed a name.

Page 24

Lambda Architecture

λ: Streaming

• Lambda Architecture
  – Batch Layer: compute full answer offline, in batch
  – Speed Layer: compute approximate answer online, in near-real-time
  – Serving Layer: stitch speed/batch answers together in real-time
• Great fit for big, real-time ML
• Ecosystem has right components now
  – Batch: Spark + MLlib
  – Speed: Spark Streaming
  – Serving: Tomcat / Jetty
  – Data Fabric: Kafka, HDFS
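The three layers can be sketched in a few lines of plain Scala. This is a deliberately simplified, single-process illustration with hypothetical names, not how Oryx or Spark implement it: the "batch" view is a full recount of historical events, the "speed" view counts only events that arrived since the last batch run, and the "serving" function stitches the two together at query time.

```scala
object LambdaSketch {
  // Batch layer: recompute the full answer from all historical events.
  def batchView(events: Seq[String]): Map[String, Long] =
    events.groupBy(identity).map { case (k, v) => (k, v.size.toLong) }

  // Speed layer: incrementally track events seen since the last batch run.
  final class SpeedLayer {
    private val recent =
      scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
    def update(event: String): Unit = recent(event) += 1
    def view: Map[String, Long] = recent.toMap
  }

  // Serving layer: combine the batch and speed answers for one key.
  def serve(batch: Map[String, Long], speed: Map[String, Long], key: String): Long =
    batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)

  def main(args: Array[String]): Unit = {
    val batch = batchView(Seq("a", "a", "b"))
    val speed = new SpeedLayer
    speed.update("a")
    speed.update("c")
    println(serve(batch, speed.view, "a"))
  }
}
```

In a real deployment the speed layer's answer is typically approximate and is discarded once the next batch run has absorbed the same events.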

Page 25

Oryx 2: Lambda for ML (alpha)

github.com/OryxProject/oryx

Page 26

Thank You
[email protected]
@sean_r_owen

Page 27

17TH ~ 18TH NOV 2014, MADRID (SPAIN)