Yahoo! Hack India: Hyderabad 2013 | Building Data Products At Scale
Building Data Products
![Page 1: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/1.jpg)
Building Data Products
Josh Wills, Senior Director of Data Science
![Page 2: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/2.jpg)
About Me
![Page 3: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/3.jpg)
What Do Data Scientists Do?
![Page 4: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/4.jpg)
What I Think I Do
![Page 5: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/5.jpg)
What Other People Think I Do
![Page 6: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/6.jpg)
What I Actually Do
![Page 7: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/7.jpg)
Data Science and Data Products
![Page 8: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/8.jpg)
Thinking About Data Products
![Page 9: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/9.jpg)
The Best Way To Find Insights
![Page 10: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/10.jpg)
Build A Team
![Page 11: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/11.jpg)
Measure Everything
![Page 12: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/12.jpg)
Solve the Right Problem
![Page 13: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/13.jpg)
Building Data Products with Hadoop
![Page 14: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/14.jpg)
Hadoop as a Platform for Data Products
![Page 15: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/15.jpg)
ETL, Data Science, and Machine Learning
![Page 16: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/16.jpg)
Changing the Unit of Analysis
![Page 17: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/17.jpg)
Machine Learning and You
![Page 18: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/18.jpg)
The Five Questions
1. When should I use it?
2. What does the input look like?
3. What does the output look like?
4. How many parameters do I have to tune?
5. Why will it fail?
![Page 19: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/19.jpg)
1. Collaborative Filtering
![Page 20: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/20.jpg)
Collaborative Filtering (cont.)
1. To see things that are hidden.
2. <user_id>,<item_id>,<weight>
3. <item1>,<item2>,<score>
4. The distance metric and the weight calculations.
5. If the input data is too sparse.
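To make the input and output shapes above concrete, here is a minimal sketch of item-to-item similarity in plain Python. It assumes cosine similarity as the distance metric and takes the weights as given; both choices, and the function name `item_similarities`, are illustrative assumptions rather than anything the slides specify.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(ratings):
    """Map (user_id, item_id, weight) triples to (item1, item2, score) triples."""
    by_user = defaultdict(dict)  # user_id -> {item_id: weight}
    for user_id, item_id, weight in ratings:
        by_user[user_id][item_id] = weight

    norms = defaultdict(float)  # item_id -> sum of squared weights
    dots = defaultdict(float)   # (item1, item2) -> dot product over shared users
    for items in by_user.values():
        for item_id, weight in items.items():
            norms[item_id] += weight * weight
        for a, b in combinations(sorted(items), 2):
            dots[(a, b)] += items[a] * items[b]

    # Cosine similarity; pairs with no shared user never appear in `dots`.
    return [
        (a, b, dot / sqrt(norms[a] * norms[b]))
        for (a, b), dot in sorted(dots.items())
        if norms[a] > 0 and norms[b] > 0
    ]
```

Items that never share a user get no score at all, which is one way the sparse-input failure in point 5 shows up in practice.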
![Page 21: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/21.jpg)
Collaborative Filtering on Hadoop
![Page 22: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/22.jpg)
2. K-Means Clustering
![Page 23: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/23.jpg)
K-Means Clustering (cont.)
1. To find anomalous events.
2. Vectors of normally distributed values.
3. Cluster centroids.
4. The choice(s) of K.
5. The points aren’t even remotely normally distributed.
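The anomaly-detection use above can be sketched as follows: fit centroids on historical points with plain Lloyd's algorithm, then score a new event by its distance to the nearest centroid. The function names, the fixed iteration count, and the seeded random initialization are assumptions made for illustration; this is not the Hadoop implementation from the talk.

```python
import random
from math import dist


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm over a list of tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids


def anomaly_score(point, centroids):
    """Distance to the nearest centroid; larger means more unusual."""
    return min(dist(point, c) for c in centroids)
```

A new event would be flagged when its score is well above the scores seen on the training points. The threshold, like K itself, is something to tune.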
![Page 24: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/24.jpg)
K-Means on Hadoop
![Page 25: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/25.jpg)
3. Random Forests
![Page 26: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/26.jpg)
Random Forests (cont.)
1. To classify and predict.
2. A dependent variable and many independent variables.
3. Lots and lots of little trees.
4. The number of variables to consider at each level.
5. Too many independent variables.
![Page 27: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/27.jpg)
Random Forests on Hadoop
• R’s randomForest and RHadoop tools
• Map: partition the input data among the reducers
• Reduce: fit the random forests to each partition
• Re-combine the resulting trees in the client
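The partition, fit, and re-combine steps above can be sketched locally in plain Python. This is a toy stand-in under loud assumptions: the talk uses R's randomForest on Hadoop, while this sketch replaces each tree with a one-split decision stump on a single randomly chosen feature, uses round-robin partitioning, and runs the "reducers" in a simple loop. All function names are hypothetical.

```python
import random
from collections import Counter


def map_partition(rows, n_partitions):
    """Map step (assumed round-robin): spread rows across partitions."""
    return [rows[i::n_partitions] for i in range(n_partitions)]


def fit_stump(rows, feature):
    """Toy 'tree': the single threshold on one feature with best training accuracy."""
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = Counter(y for x, y in rows if x[feature] <= threshold)
        right = Counter(y for x, y in rows if x[feature] > threshold)
        left_label = left.most_common(1)[0][0]
        right_label = right.most_common(1)[0][0] if right else left_label
        correct = left[left_label] + right[right_label]
        if best is None or correct > best[0]:
            best = (correct, feature, threshold, left_label, right_label)
    return best[1:]


def reduce_fit(partition, n_trees, n_features, rng):
    """Reduce step: fit stumps on bootstrap samples of one partition."""
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(partition) for _ in partition]
        trees.append(fit_stump(sample, rng.randrange(n_features)))
    return trees


def fit_forest(rows, n_partitions=2, trees_per_partition=5, seed=0):
    """Client side: concatenate the trees fitted on every partition."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for partition in map_partition(rows, n_partitions):
        if partition:  # skip empty partitions
            forest.extend(reduce_fit(partition, trees_per_partition, n_features, rng))
    return forest


def predict(forest, x):
    """Majority vote over all trees in the combined forest."""
    votes = Counter(left if x[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]
```

Rows are `(features, label)` pairs, for example `((3, 6), "lo")`. Because each partition only sees part of the data, the per-partition trees are weaker than trees fit on everything; the vote over the combined forest is what recovers accuracy.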
![Page 28: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/28.jpg)
The Art of Model Design
![Page 29: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/29.jpg)
Caution: Mind the Gap
![Page 30: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/30.jpg)
The Joy of Experiments
![Page 31: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/31.jpg)
Introduction to Data Science: Building Recommender Systems
http://university.cloudera.com/
![Page 32: Building Data Products](https://reader031.fdocuments.in/reader031/viewer/2022032217/55a963d91a28ab4a108b4624/html5/thumbnails/32.jpg)
Josh Wills, Director of Data Science, Cloudera @josh_wills
Thank you!