FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook...

25
FeatureHub: towards collaborative data science Micah J. Smith, Roy Wedge, Kalyan Veeramachaneni MIT IEEE DSAA 2017 Tokyo, Japan

Transcript of FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook...

Page 1: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

FeatureHub: towards collaborative data scienceMicah J. Smith, Roy Wedge, Kalyan VeeramachaneniMIT

IEEE DSAA 2017Tokyo, Japan

Page 2: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

A tale of two systems

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 2

Page 3: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

Massive Open Data Science

Thousands of

collaborators

Single solution

Range of expertise

Natural abstractions

Machine-driven

automation

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 3

Page 4: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

The state of collaborative systems

� ease of use� share results

� no collaboration� not scalable

� integrated solution� ecosystem of collaboration

� wrong abstractions� difficult to use

� ease of use� bookkeeping

�not open�expensive

� many competitors � many solutions� no additional structure

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 4

Page 5: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

Current collaborative approaches

Massive open data science

Towards this vision

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 5

Page 6: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

The FeatureHub paradigm

Towards collaboration at scale through feature engineering• Isolate and structure feature engineering• Parallelize across people and features• Minimize redundant work• Automate everything else

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 6

Page 7: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

What is a feature?

A feature is a quantitative, measurable property of a particular entity.

id Closest traffic light (meters)

Beacon St @ Prentiss 470

Vassar St @ Main 25

Newbury St @ Mass Ave 0

…Memorial Drive @ Ames 130

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 7

Page 8: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

What is a feature?

feature

feature semantics

feature values

feature function

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 8

Page 9: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

What is feature engineering?

Feature engineering is the process of ideating feature semantics, and writing feature functions to extract feature values from a raw data source.

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 9

Page 10: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

Why feature engineering?

• Features very important to modeling success• Challenging!

▫ Needs human intuition and domain expertise▫ Automation difficult in many circumstances▫ Collaboration can help uncover key ideas

• Can structure into more natural units of work

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 10

Page 11: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

Our goal

Develop a system to enable collaborative data science under the FeatureHub paradigm.

11

Page 12: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

How it works

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 12

Page 13: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

LAUNCH

• setup: Setup problem and platform• prepare_dataset: Minimal cleaning, extract metadata• preextract_features: Preprocess features

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 13

Page 14: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

CREATE: Scaffolding feature functions

• Input: single collection of data tables• Output: single column of values – one value per entity

Bookkeeping• Actually “works”• Self-contained

1 def hi_lo_age(dataset):2 """Whether users are older than 30 years"""3 from sklearn.preprocessing import binarize4 threshold = 305 return binarize(dataset["users"]["age"], threshold)

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 14

Page 15: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

CREATE

• Log in to hosted Jupyter Notebook environment• get_dataset: Acquire dataset• discover_features: Collaborate on new features at integrated forum, “fork” existing features• evaluate: Write and evaluate features• submit: Submit feature functions (source code) to evaluation system and feature database

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 15

Page 16: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

COMBINE

• extract_features: Automatically execute feature functions to extract values on train and test sets

• learn_model: Automatically build and evaluate models using AutoML

• Automatically produce solution (predictions on new data points)

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 16

Page 17: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

Implementation challenges

• Integrating untrusted source code▫ Quality▫ Security

• High-quality contributions▫ Metrics to reward good work▫ Adversarial behavior

• Minimize redundant work while scaling• Appropriate use of automation technologies

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 17

Page 18: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

Platform architecture

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 18

Page 19: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

ExperimentsHired 41 crowd data scientist workers from Upwork• Beginner to intermediate experience/skill, hourly rates between 7 to 45 USD per hour• Write features on FeatureHub: two prediction problems, five hours total

▫ airbnb: Predict the destination country of Airbnb users (Source: Kaggle)▫ sberbank: Predict selling price for houses and apartments (Source: Kaggle)

• Assign to experimental groups to assess different collaborative functionality• Bonus payments for high quality features

Data collected• 171 hours spent on platform• 1952 features submitted• Detailed survey administered

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 19

Page 20: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

Experiments

Combined model competes with expert data scientists• Pitted FeatureHub predictions against those of “expert” data scientists on Kaggle

• Model uses combined feature matrix with 6 hours of auto-sklearn

• With these limited resources, beats 25% of experts and scores within 0.03 to 0.05 points of winning solution

airbnb sberbank

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 20

Page 21: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

Experiments

Substantially decreases “time to solution”• Achieve potential turnaround time of <1 day

Competition launches

Competitor submits

solution 1

Competitor downloads materials

Competitor submits

solution N

Competition ends

t=0 10 weeks5 days 2 weeks 4 weeks

What can we accomplish with FeatureHub?

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 21

Page 22: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni)

Experiments

Substantially decreases “time to solution”• Achieve potential turnaround time of <1 day

Competitor downloads materials

Competition ends

LAUNCH CREATE COMBINE

Competitor submits

solution N

Competitor submits

solution 1

+3 hours +2.5 hours +6 hours

5 days 2 weeks

12 hours

Competition launches

21

Page 23: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

Experiments

Substantially decreases “time to solution”• (Very conservatively) 47% of experts are not able to achieve

FeatureHub-level performance as quickly

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 22

Page 24: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

Summary• Propose a new approach to collaborative feature engineering• The approach is simple but powerful:

1. Focus creative effort of data scientists working in parallel on feature engineering

2. Integrate source code contributions into a single model3. Automate everything else and produce output quickly

• Engineer a cloud platform to do crowdsourced feature engineering with automated modeling

• Experimental results show we can leverage crowd data scientists using FeatureHub to generate competitive predictive models using limited resources

FeatureHub: towards collaborative data science (Smith, Wedge, Veeramachaneni) 23

Page 25: FeatureHub : towards collaborative data science … · • Log in to hosted Jupyter Notebook environment ... Automate everything else and produce output quickly • Engineer a cloud

FeatureHub: towards collaborative data scienceMicah J. Smith, Roy Wedge, Kalyan VeeramachaneniMIT

Source code: https://github.com/HDI-Project/FeatureHubCorrespondence: Micah Smith ([email protected], @micahjsmith)