Cloudera User Group - From the Lab to the Factory
![Page 1: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/1.jpg)
From the Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera
![Page 2: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/2.jpg)
One Other Thing About Me
![Page 3: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/3.jpg)
Data Science: Another Definition
![Page 4: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/4.jpg)
Data Scientists Build Data Products.
![Page 5: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/5.jpg)
A Shift In Perspective
Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine
Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions
![Page 6: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/6.jpg)
All* Products Become Data Products
![Page 7: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/7.jpg)
Identifying the Bottlenecks
![Page 8: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/8.jpg)
Oryx: Model Building and Serving
• Algorithms
  • ALS Recommenders
  • K-Means Parallel
  • RDF (random decision forests)
• Batch model building via MapReduce*
• Server for real-time scoring and updates
• PMML 4.1 Models
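
The split between batch model building and a real-time serving layer can be illustrated with a toy example. This is a minimal sketch, not Oryx's API: the names `build_model`, `nearest`, and `ServingLayer` are hypothetical, plain Lloyd's k-means with random initialization stands in for K-Means Parallel, and an in-process function stands in for the MapReduce job.

```python
import math
import random


def nearest(centroids, point):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))


def build_model(points, k, iterations=10, seed=0):
    """Batch step: run Lloyd's k-means over the full data set.

    Assumption: stands in for the offline model-building job.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(centroids, p)].append(p)
        # Recompute each centroid as the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


class ServingLayer:
    """Serving step: answers scoring requests from the current model generation."""

    def __init__(self, centroids):
        self.centroids = centroids

    def score(self, point):
        """Return the cluster index for a single request."""
        return nearest(self.centroids, point)

    def load_generation(self, centroids):
        """Swap in a newly built model without restarting the server."""
        self.centroids = centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
server = ServingLayer(build_model(points, k=2))
```

The serving object only ever reads a finished model, so a new generation can be built offline and swapped in with `load_generation`.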
![Page 9: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/9.jpg)
Oryx Design
![Page 10: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/10.jpg)
Generational Thinking
![Page 11: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/11.jpg)
The Limits of Our Models
![Page 12: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/12.jpg)
Space Exploration
![Page 13: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/13.jpg)
Data Science Needs DevOps
![Page 14: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/14.jpg)
Introducing Gertrude
• Multivariate Testing
  • Define and explore a space of parameters
• Overlapping Experiments
  • Tang et al. (2010)
  • Runs multiple independent experiments on every request
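
Running several independent experiments on one request is usually done by hashing the request into a bucket once per layer, in the style of Tang et al. (2010). The sketch below assumes a hypothetical `LAYERS` table and function names; it is not Gertrude's API. Each layer uses its own salt, so a request's bucket in one layer says nothing about its bucket in another.

```python
import hashlib

NUM_BUCKETS = 1000

# Hypothetical layers. Each layer owns a disjoint set of parameters and
# lists its experiments as (name, first_bucket, end_bucket_exclusive).
LAYERS = {
    "ranking": [("control", 0, 500), ("new_model", 500, 1000)],
    "ui": [("control", 0, 900), ("big_button", 900, 1000)],
}


def bucket(request_id: str, layer_salt: str) -> int:
    """Deterministically map a request to a bucket within one layer."""
    digest = hashlib.sha256(f"{layer_salt}:{request_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def divert(request_id: str) -> dict:
    """Pick at most one experiment per layer for this request."""
    assignment = {}
    for layer, experiments in LAYERS.items():
        b = bucket(request_id, layer)
        for name, start, end in experiments:
            if start <= b < end:
                assignment[layer] = name
                break
    return assignment


assignment = divert("user-42")
```

Because the hash is deterministic, the same request identifier always lands in the same experiments, which keeps the user experience stable across requests.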
![Page 15: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/15.jpg)
Simple Conditional Logic
• Declare experiment flags in compiled code
  • Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
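
The flag-plus-config pattern above might look roughly like this. It is a sketch under stated assumptions: `Flag`, `CONFIG`, and `evaluate` are hypothetical names, the rule format (an ordered list of condition/value pairs, first match wins) is invented for illustration, and the config is an in-memory dict rather than a file.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Flag:
    """A setting declared in application code with a compiled-in default."""
    name: str
    default: Any


# Flags declared in code.
MAX_RESULTS = Flag("max_results", 10)
USE_NEW_RANKER = Flag("use_new_ranker", False)

# Hypothetical config contents: for each flag, ordered (condition, value) rules.
# A condition matches when every key/value pair equals the request's attribute.
CONFIG = {
    "max_results": [
        ({"country": "de"}, 20),
        ({"experiment": "big_page"}, 50),
    ],
    "use_new_ranker": [
        ({"experiment": "new_ranker"}, True),
    ],
}


def evaluate(flag: Flag, request: dict, config: dict = CONFIG) -> Any:
    """Return the flag value for this request; fall back to the default."""
    for condition, value in config.get(flag.name, []):
        if all(request.get(key) == expected for key, expected in condition.items()):
            return value
    return flag.default


page_size = evaluate(MAX_RESULTS, {"country": "de"})
```

Keeping the rules as data means a flag's behavior can change per request without touching the code that reads it.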
![Page 16: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/16.jpg)
Separate Data Push from Code Push
• Validate config files and push updates to servers
  • ZooKeeper via Curator
  • File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations
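
For the file-based variant, validate-then-push and server-side reload could be sketched as follows. Everything here is an assumption made for illustration: the JSON layout with `version` and `flags` keys, and the names `validate`, `push`, and `ConfigHolder`. The ZooKeeper path is not shown.

```python
import json
import os
import tempfile
import threading


def validate(config: dict) -> None:
    """Reject configs that lack the (hypothetical) expected shape."""
    if not isinstance(config.get("version"), int):
        raise ValueError("config must contain an integer 'version'")
    if not isinstance(config.get("flags"), dict):
        raise ValueError("config must contain a 'flags' object")


def push(config: dict, path: str) -> None:
    """Validate, then write atomically so readers never see a partial file."""
    validate(config)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(config, handle)
    os.replace(tmp_path, path)


class ConfigHolder:
    """Server-side view of the config; call refresh() periodically."""

    def __init__(self, path: str):
        self.path = path
        self._lock = threading.Lock()
        self._config = {"version": -1, "flags": {}}

    def refresh(self) -> bool:
        """Load the file and swap it in if its version changed.

        Raises (leaving the current config in place) if the file is invalid.
        """
        with open(self.path) as handle:
            candidate = json.load(handle)
        validate(candidate)
        with self._lock:
            if candidate["version"] == self._config["version"]:
                return False
            self._config = candidate
            return True

    def get(self) -> dict:
        with self._lock:
            return self._config
```

Since the config travels separately from the binary, an experiment can be started, resized, or stopped without a code deployment.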
![Page 17: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/17.jpg)
The Experiments Dashboard
![Page 18: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/18.jpg)
A Few Links I Love
• http://research.google.com/pubs/pub36500.html
  • The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
  • Collection of all of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/
  • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
![Page 19: Cloudera User Group - From the Lab to the Factory](https://reader033.fdocuments.in/reader033/viewer/2022052910/559b53811a28ab9b4e8b4819/html5/thumbnails/19.jpg)
Josh Wills, Director of Data Science, Cloudera @josh_wills
Thank you!