MLconf NYC 0xdata

4/23/13 Movie Night: Data Science - Even On Our Night Off May 27, 2014


Transcript of MLconf NYC 0xdata

Page 1: MLconf NYC 0xdata

4/23/13

Movie Night: Data Science - Even On Our Night Off

May 27, 2014

Page 2: MLconf NYC 0xdata

Anqi and Irene – (H2O)

• Anqi is the in-house R expert and is responsible for K-means and PCA

• Irene is the pencil and paper stats nerd and technical writer

• Part of a data science team that’s 75% women, and on a technical team that’s 23% women (well above average).

Sergei – (Collective)

• VP, Data Sciences at Collective, where he is responsible for the architecture, development and scaling of data-driven technology products for digital advertising.

Page 3: MLconf NYC 0xdata

What is H2O?

• Same statistics - new volumes of data

• On a distributed cluster, models on a terabyte of data can finish in minutes.

• Provide an interface to give more people the power of data science.

• Also hook H2O into R and Scala

Page 4: MLconf NYC 0xdata

Overview

Walk through the practical problem of what movie to go see together.

Examine the workflow from data to prediction, and let the best model inform our choice.

Extend to production applications with a customer use case.

Page 5: MLconf NYC 0xdata

MovieLens Data

The data is the 100,000-observation MovieLens data set.

Demographic Features:

• State: Factor, 62 levels; largest class: California

• Age: Integer, range (7, 73); mean: 32.9

• Occupation: Factor, 21 levels; largest class: Student

• Gender: Factor, 2 levels; M:F is about 3:1

Page 6: MLconf NYC 0xdata

Movie Classes

Movies are classified by type; types are not mutually exclusive.

Page 7: MLconf NYC 0xdata

Dependent Variable

Users rated movies on a Likert scale of 1 to 5.

We converted this to a binomial indicator:

Ratings >= 4: recoded to 1, indicating liked movie

Ratings < 4: recoded to 0, indicating disliked the movie
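The recoding above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk: the function name `binarize_rating` and the sample ratings are hypothetical.

```python
def binarize_rating(rating, threshold=4):
    """Return 1 (liked) for ratings >= threshold, else 0 (disliked)."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    return 1 if rating >= threshold else 0


# Hypothetical sample of raw 1-5 ratings, for illustration only.
ratings = [5, 3, 4, 1, 2, 4]
liked = [binarize_rating(r) for r in ratings]
print(liked)
```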

Page 8: MLconf NYC 0xdata

Super Models

Both models are predicting the same dependent variable as a function of the same set of features.

First model: tree-based GBM. Start simple and let the model get as complex as it needs to with depth.

Alternative model: regularized GLM. Start with complexity and let the model generalize with regularization.

Page 9: MLconf NYC 0xdata

WWIM: Using Gradient Boosted Classification on two classes

GBM is nonparametric, great when there’s no theoretical model.

Accounts for complex interaction

Control overfitting with learning rate
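To make the role of the learning rate concrete, here is a simplified, pure-Python sketch of gradient boosting for two classes. It is not H2O's implementation: it uses depth-1 trees (stumps), fits each stump to the log-loss residuals, and takes plain mean-residual leaf values. All function names and the toy data (age, an action-movie flag) are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Pick the one-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(xs[0])):
        for t in sorted({x[j] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[j] <= t]
            right = [r for x, r in zip(xs, residuals) if x[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]  # (feature index, threshold, left value, right value)


def stump_predict(stump, x):
    j, t, left_value, right_value = stump
    return left_value if x[j] <= t else right_value


def fit_gbm(xs, ys, n_trees=50, learning_rate=0.1):
    """Boost stumps on the residuals y - p, shrinking each step by learning_rate."""
    p = sum(ys) / len(ys)
    base = math.log(p / (1 - p))  # start from the log-odds of class 1
    scores = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, x)
                  for s, x in zip(scores, xs)]
    return base, stumps, learning_rate


def predict_proba(model, x):
    base, stumps, learning_rate = model
    return sigmoid(base + learning_rate * sum(stump_predict(s, x) for s in stumps))


# Hypothetical toy data: (age, is_action_movie) -> liked (1) or disliked (0).
xs = [(18, 1), (22, 0), (25, 1), (30, 0), (45, 1), (50, 0), (55, 1), (60, 0)]
ys = [1, 1, 1, 1, 0, 0, 0, 0]

fast = fit_gbm(xs, ys, n_trees=50, learning_rate=0.1)
slow = fit_gbm(xs, ys, n_trees=50, learning_rate=0.01)
print(predict_proba(fast, (20, 0)), predict_proba(slow, (20, 0)))
```

With the same number of trees, the smaller learning rate moves the predictions away from 0.5 more slowly, which is the sense in which it limits how closely the model can fit the training data.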

Page 10: MLconf NYC 0xdata

WWAM: Alternative – Logistic GLM

Logistic binomial regression

End model has interpretability

Control overfitting by introducing a penalty into the objective function; this aids in feature selection and generalizability

Ridge regression: all L2 penalty
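As a rough illustration of an all-L2 penalty, the sketch below fits a logistic regression by gradient descent on the average log-loss plus (l2 / 2) * ||w||^2. It is a minimal stand-in, not H2O's GLM solver; the function name `fit_ridge_logistic`, its parameters, and the toy data are assumptions. The intercept is left unpenalized, which is a common convention.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_ridge_logistic(xs, ys, l2=0.1, step=0.1, n_iters=2000):
    """Gradient descent on mean log-loss + (l2 / 2) * sum(w_j ** 2)."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(n_iters):
        grad_w = [0.0] * d
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        # The L2 penalty contributes l2 * w_j to each weight's gradient.
        w = [wj - step * (gj + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


def l2_norm(w):
    return math.sqrt(sum(wj * wj for wj in w))


# Hypothetical toy data: (centered feature, binary flag) -> liked (1) or not (0).
xs = [(-2.0, 1), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 0)]
ys = [0, 0, 1, 0, 1, 1]

weak_w, weak_b = fit_ridge_logistic(xs, ys, l2=0.01)
strong_w, strong_b = fit_ridge_logistic(xs, ys, l2=1.0)
print(l2_norm(weak_w), l2_norm(strong_w))
```

A larger `l2` shrinks the coefficients toward zero, trading some fit on the training data for stability on new data.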

Page 11: MLconf NYC 0xdata

Rubber; Meet Road

Comparison of error rates on holdout set

                       GBM Model   GLM Model
Error on Dislike (0)   28%         30%
Error on Like (1)      18%         50%
Overall                22%         40%
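For readers who want to reproduce this kind of comparison on their own holdout set, per-class and overall error rates can be computed as sketched below. The function name `error_rates` and the labels are hypothetical; the numbers are not the talk's data.

```python
def error_rates(actual, predicted):
    """Return the misclassification rate for class 0, class 1, and overall."""
    rates = {}
    for cls in (0, 1):
        idx = [i for i, a in enumerate(actual) if a == cls]
        wrong = sum(1 for i in idx if predicted[i] != cls)
        rates[cls] = wrong / len(idx)
    mistakes = sum(1 for a, p in zip(actual, predicted) if a != p)
    rates["overall"] = mistakes / len(actual)
    return rates


# Hypothetical holdout labels (1 = liked, 0 = disliked) and model predictions.
actual = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

rates = error_rates(actual, predicted)
print(rates)
```

The overall rate is the per-class rates weighted by how many holdout observations fall in each class, so it always lies between the two.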

Page 12: MLconf NYC 0xdata

GBM Predictions

Like: 300, Her, Need For Speed

Dislike: Frozen, Peabody

GLM Predictions

Like: 300, Her, Capt. America

Dislike: Frozen, Divergent

Page 13: MLconf NYC 0xdata

Lights Out - Some Closing Points

We didn't address a serious problem here - but this is the general process used in a production environment.

To give you a sense for the real world implementation, we’ve asked one of our users to share his use case with you.

Page 14: MLconf NYC 0xdata

Stories change people, while statistics gives them something to argue about

- Bernie Siegel

Page 15: MLconf NYC 0xdata

[Diagram labels: Ad Server (publisher), Ad Server (advertiser), Agency, Browser, Brands, Publishers, Content, Inventory, Ads, Audience]

Page 16: MLconf NYC 0xdata

Audience Modeling

1. Build the Audience Cloud of stable cookies.

2. Define target audience using cookie-level data.

3. Assemble 1,000s of features on every cookie.

4. Build a predictive model using machine learning.

5. Score every cookie in the Audience Cloud.

6. Create a targetable segment with the top X users.

7. Adjust X daily to optimize delivery & performance.

8. Rebuild models weekly (daily if warranted).
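Steps 5 to 7 above amount to scoring every cookie and keeping the top X. A minimal sketch follows, assuming a model is available as a function from a feature dictionary to a score. The names `score_cookies`, `top_x_segment`, and `toy_model`, along with the cookie IDs and features, are hypothetical and not Collective's implementation.

```python
import heapq


def score_cookies(features_by_cookie, model):
    """Apply the model to every cookie's features (step 5)."""
    return {cid: model(feats) for cid, feats in features_by_cookie.items()}


def top_x_segment(scores, x):
    """Return the IDs of the x highest-scoring cookies, best first (step 6)."""
    best = heapq.nlargest(x, scores.items(), key=lambda item: item[1])
    return [cid for cid, _ in best]


def toy_model(feats):
    # Hypothetical stand-in for a trained model: a fixed linear score.
    return 0.5 * feats["visits"] + 2.0 * feats["clicked_toy_ad"]


# Hypothetical cookie-level features, for illustration only.
features_by_cookie = {
    "c1": {"visits": 1, "clicked_toy_ad": 0},
    "c2": {"visits": 4, "clicked_toy_ad": 1},
    "c3": {"visits": 2, "clicked_toy_ad": 0},
    "c4": {"visits": 6, "clicked_toy_ad": 1},
}

scores = score_cookies(features_by_cookie, toy_model)
segment = top_x_segment(scores, 2)
print(segment)
```

Adjusting X (step 7) only changes the cut-off passed to `top_x_segment`; the scores themselves change when the model is rebuilt (step 8).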

[Diagram labels: Audience Cloud (200M+ Stable Cookies), Target Audience (100K Cookies), 1M Cookies, 3M Cookies]

bit.ly/MLatScale: Preprint of paper submitted to KDD’14

Audience Extension: audiences (age 25-40, buys toys, watches TNT)

Audience Optimization: actions (clicks, online purchases)

Page 17: MLconf NYC 0xdata

Modeling Platform

MODEL BUILDING: Computing predictive models on

Current / Future

DATA SIZES: Size of data (1 million, 1,000; 1 billion, 100,000)

ALGORITHM: Complexity and performance (GBM, glmnet)

SCORING: Predicting outcomes (Batch, Real Time)

+ H2O