State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain 2014
Transcript of State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain 2014
STATE OF PLAY
SEAN OWEN, DIRECTOR OF DATA SCIENCE, CLOUDERA
State of Play: Data Science on Hadoop in 2015
Sean Owen // Director, Data Science @ Cloudera
2
About …
• Engineer
• Data Science @ Cloudera
• Oryx project founder
• Committer, erstwhile VP Apache Mahout
• Apache Spark contributor / personality
• Co-author, Mahout in Action / Advanced Analytics with Spark
• [email protected] / @sean_r_owen
3
Where Is My Magic Wand?
4
We Like Hadoop Because …
It’s Aspirational
• (Was) Shiny New Toy
• Be Like Yahoo, Google, FB
• Data as Strategy
It Costs Less
• Free – Just Add Hardware
• Open, Standard
• Cost-Savings Projects
We Get More Computing
• Bigger and Faster is Better
• Fewer Hacks to Survive Scale
• Do The Previously Impossible
www.avalonconsulting.net/blog/485-thinking-beyond-shiny-and-new
www.pianta.co.uk/massive-sale-now-on/
www.google.com/about/careers/locations/mayes-county/
5
Incremental Today vs. Revolutionary Tomorrow
• We set up a prototype Hadoop cluster as part of a big data POC
• We cut our IT budget by 22% by moving some operations to Hadoop
• Our SQL queries are 3 times faster and overnight reports finish in 39 minutes now
• We do the same things with data, but do them notably better.
• We want to become a real-time product business that reacts to new machine sensor data in seconds, not days
• We want to predict which merchants will take out a business loan this month
• We want a complete customer profile that “understands” what they want at any time
• We think there is a magic wand available?
6
Phase 1. Collect Data
Phase 2. Data Science?
Phase 3. Profit!
7
Demystifying with Data Science
• Machine Learning is not new
• Big Machine Learning is qualitatively different
  – More data beats algorithm improvement
  – Scale trumps noise and sample size effects
  – Can brute-force manual tasks
    • Feature selection
    • Hyperparameter tuning
• Engineering “Big” is Difficult
  – Build new scalable data platforms
  – Re-engineering parallel algorithms
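A minimal sketch of what "brute-forcing" feature selection and hyperparameter tuning can look like: try every (feature, lambda) combination and keep the one with the lowest error on held-out data. Everything here is an assumption made for illustration and not from the talk: the names (`Example`, `fit`, `mse`, `search`), the one-feature ridge model, and the toy data. Plain Scala collections stand in for a cluster.

```scala
object BruteForceTuning {
  // One labeled example: a feature vector and a target value.
  final case class Example(features: Vector[Double], label: Double)

  // Closed-form ridge fit on a single feature: w = sum(x*y) / (sum(x*x) + lambda).
  def fit(train: Seq[Example], feature: Int, lambda: Double): Double = {
    val xy = train.map(e => e.features(feature) * e.label).sum
    val xx = train.map(e => e.features(feature) * e.features(feature)).sum
    xy / (xx + lambda)
  }

  // Mean squared error of the model y = w * x(feature) on the given data.
  def mse(data: Seq[Example], feature: Int, w: Double): Double =
    data.map { e =>
      val err = w * e.features(feature) - e.label
      err * err
    }.sum / data.size

  final case class Candidate(feature: Int, lambda: Double, validationMse: Double)

  // Exhaustive search: evaluate every (feature, lambda) pair on the
  // validation set and keep the pair with the lowest error.
  def search(train: Seq[Example], valid: Seq[Example],
             features: Seq[Int], lambdas: Seq[Double]): Candidate = {
    val candidates =
      for (f <- features; l <- lambdas)
        yield Candidate(f, l, mse(valid, f, fit(train, f, l)))
    candidates.minBy(_.validationMse)
  }

  // Toy data (made up): the label is exactly 2 * feature 0; feature 1 is unrelated.
  val train: Seq[Example] = Seq(
    Example(Vector(1.0, 3.0), 2.0), Example(Vector(2.0, -1.0), 4.0),
    Example(Vector(3.0, 2.0), 6.0), Example(Vector(4.0, -2.0), 8.0))
  val valid: Seq[Example] = Seq(
    Example(Vector(5.0, 1.0), 10.0), Example(Vector(6.0, -3.0), 12.0))

  val best: Candidate = search(train, valid, Seq(0, 1), Seq(0.0, 0.1, 1.0, 10.0))
}
```

Each candidate is evaluated independently of the others, which is why this kind of search is easy to spread across many machines.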
8
What is Data Science?
What skill sets does it require?
What tools are commonly used?
How do we architect data products?
How do we get started?
9
Three Camps
10
s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html
11
Business
12
Business
13
Engineering vs. Statistics
Programming languages
Systems languages
Latency, throughput
Huge data
Online problems
Automated
Developers, Engineers
vs.
Statistical environments, BI tools
High-level languages
Accuracy
Medium-sized data
Offline work
Ad-hoc
Statisticians, Analysts
14
Data Science + Hadoop
15
Engineering, Statistics & Hadoop: Before
Gap.
16
Engineering, Statistics & Hadoop: 2014
[Figure: diagram with labels YARN, R, M]
17
Apache Spark: Something for Everyone
• Now Apache TLP
  – From UC Berkeley AMPLab
  – … inspired by MS DryadLINQ
• Scala-based
  – Expressive, efficient
  – JVM-based
• Scala-like abstractions
  – RDD: Resilient Distributed (immutable) Dataset
  – Distributed works like local
  – Like Apache Crunch is Collection-like
• Read-Evaluate-Print-Loop
  – Interactive
  – No compile/deploy cycle needed
• Python API too
• Natively Distributed
• Hadoop-friendly
  – Integrate with where data already is
  – ETL no longer separate
• Subprojects: MLlib and more
18
Statisticians: Shell, Concise Syntax
<row Id="4"...Tags="...c#...winforms..."/>
(4,"c#")(4,"winforms")...
(4,3104,1.0)(4,2148819,1.0)...
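scala> val postIDTags = postsXML.flatMap { line =>
  // Matches Id="..." ... Tags="..." within one <row .../> line
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // Assumes tags are XML-escaped in the dump, e.g. &lt;c#&gt;&lt;winforms&gt;
  val tagRegex = "&lt;([^&]+)&gt;".r
  idTagRegex.findFirstMatchIn(line) match {
    case None => Nil  // line has no Id/Tags attributes: emit nothing
    case Some(m) =>
      val postID = m.group(1).toInt
      val tagsString = m.group(2)
      val tags = tagRegex.findAllMatchIn(tagsString).map(_.group(1)).toList
      tags.map((postID, _))  // one (postID, tag) pair per tag
  }
}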
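The three data lines on this slide (XML row, (id, tag) pairs, (id, number, 1.0) triples) can be reproduced on a local collection. The sketch below is not the talk's code: `parse`, `nnHash` and the two input lines are illustrative, the XML-escaped tag format is an assumption, and the 23-bit hash mask is an assumption chosen because it reproduces the numbers shown on the slide. A plain `Seq` stands in for the RDD; `flatMap` and `map` exist under the same names on an RDD.

```scala
object TagPairs {
  // Matches Id="..." ... Tags="..." within one <row .../> line, as on the slide.
  private val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // ASSUMPTION: tags are XML-escaped in the input, e.g. &lt;c#&gt;&lt;winforms&gt;.
  private val tagRegex = "&lt;([^&]+)&gt;".r

  // One (postID, tag) pair per tag on the line; empty if the line does not match.
  def parse(line: String): List[(Int, String)] =
    idTagRegex.findFirstMatchIn(line) match {
      case None => Nil
      case Some(m) =>
        val postID = m.group(1).toInt
        tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList.map((postID, _))
    }

  // ASSUMPTION: a non-negative 23-bit hash of the tag string. It is used here
  // only because it yields the IDs printed on the slide.
  def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFF

  // Hypothetical input lines in the shape of the slide's <row .../> example.
  val postsXML: Seq[String] = Seq(
    "<row Id=\"4\" PostTypeId=\"1\" Tags=\"&lt;c#&gt;&lt;winforms&gt;\" />",
    "<row Id=\"7\" PostTypeId=\"2\" />")

  // (4,"c#"), (4,"winforms")
  val postIDTags: Seq[(Int, String)] = postsXML.flatMap(parse)

  // (4,3104,1.0), (4,2148819,1.0)
  val ratings: Seq[(Int, Int, Double)] =
    postIDTags.map { case (id, tag) => (id, nnHash(tag), 1.0) }
}
```

The second input line has no Tags attribute, so it contributes nothing to the output.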
19
Engineers: Distributed, Manageable
20
2015 is Time to Operationalize
21
From Exploratory to Operational
Exploratory Analytics
Operational Analytics
Explore Data
Pick Model
Build Model at Scale, Offline
Continuously Update Model
Score Model in Real-Time
22
Lambda (λ) Architecture
noun. 1. Name of a design idea you’ve had before but didn’t realize was a thing that needed a name.
23
Lambda Architecture
λ:Streaming
• Lambda Architecture
  – Batch Layer: compute full answer offline, in batch
  – Speed Layer: compute approximate answer online, in near-real-time
  – Serving Layer: stitch speed/batch answers together in real-time
• Great fit for big, real-time ML
• Ecosystem has right components now
  – Batch: Spark + MLlib
  – Speed: Spark Streaming
  – Serving: Tomcat / Jetty
  – Data Fabric: Kafka, HDFS
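A toy sketch of how the three layers fit together, using simple per-key counts in place of a model. All names (`Event`, `batchView`, `speedView`, `query`) and the data are hypothetical, and local collections stand in for the components named on the slide. The speed view here happens to be exact; in a real system it would typically be an approximation.

```scala
object LambdaSketch {
  // An event: some key was observed once at a given time.
  final case class Event(key: String, timestamp: Long)

  // Batch layer: recompute the full answer from all data up to the cutoff.
  def batchView(history: Seq[Event], cutoff: Long): Map[String, Long] =
    history.filter(_.timestamp <= cutoff)
      .groupBy(_.key)
      .map { case (k, es) => k -> es.size.toLong }

  // Speed layer: fold events newer than the cutoff into a small incremental view.
  def speedView(recent: Seq[Event], cutoff: Long): Map[String, Long] =
    recent.filter(_.timestamp > cutoff)
      .foldLeft(Map.empty[String, Long]) { (view, e) =>
        view.updated(e.key, view.getOrElse(e.key, 0L) + 1L)
      }

  // Serving layer: answer a query by stitching the two views together.
  def query(batch: Map[String, Long], speed: Map[String, Long], key: String): Long =
    batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)

  // Made-up events; the last batch run covered everything up to timestamp 3.
  val events: Seq[Event] = Seq(
    Event("scala", 1L), Event("spark", 2L), Event("scala", 3L),
    Event("scala", 5L), Event("hadoop", 6L))
  val cutoff: Long = 3L

  val batch: Map[String, Long] = batchView(events, cutoff)
  val speed: Map[String, Long] = speedView(events, cutoff)
}
```

When the next batch run finishes, the cutoff moves forward and the speed view can be discarded and rebuilt from the newer events only.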
24
Oryx 2: Lambda for ML (alpha)
github.com/OryxProject/oryx
Thank You
[email protected]
@sean_r_owen
17TH ~ 18TH NOV 2014, MADRID (SPAIN)