From data to deployment- full stack data science


Transcript of From data to deployment- full stack data science

Page 1: From data to deployment- full stack data science

From Data to Deployment: Full Stack Data Science

Page 2: From data to deployment- full stack data science

Ben Link, Data Scientist

Page 3: From data to deployment- full stack data science

Indeed is the #1 external source of hire

64% of US job searchers search on Indeed each month

80.2M unique US visitors per month

16M jobs

50+ countries

28 languages

200M unique visitors

[Chart: Unique Visitors (millions) by year, 2009 to 2015; y-axis from 0 to 200]

Page 4: From data to deployment- full stack data science

We help people get jobs.

Page 5: From data to deployment- full stack data science

Data Science @ Indeed

Page 6: From data to deployment- full stack data science
Page 7: From data to deployment- full stack data science
Page 8: From data to deployment- full stack data science
Page 9: From data to deployment- full stack data science
Page 10: From data to deployment- full stack data science

Applicant Quality

Page 11: From data to deployment- full stack data science

Job / Employer

Application Model

Resume / Job Seeker

Good Fit?

Page 12: From data to deployment- full stack data science
Page 13: From data to deployment- full stack data science
Page 14: From data to deployment- full stack data science
Page 15: From data to deployment- full stack data science

What does a data scientist do at Indeed?

Page 16: From data to deployment- full stack data science

Gather Data

Label Data

Prototype Models

Generate Features

Model Review

Choose final parameters

A/B Test Model

Hypothesis Formulation

Explore Data

Analyze Labels

Analyze Features

Label Hold-out Data

Deploy Model

Monitor Model

Repeat

Evaluate Model

Page 20: From data to deployment- full stack data science

Gather Data

Label Data

Hypothesis Formulation

Explore Data

Page 21: From data to deployment- full stack data science

Prototype Models

Generate Features

Analyze Labels

Analyze Features

Page 22: From data to deployment- full stack data science

Evaluate Model

Model Review

Deploy Model

Label Hold-out Data

Page 23: From data to deployment- full stack data science

Choose Final Parameters

A/B Test Model

Monitor Model

Repeat

Page 24: From data to deployment- full stack data science

Full-stack data scientists

1 Prevent handoff mistakes

2 Can contribute on any team

3 Have big picture in mind

Page 25: From data to deployment- full stack data science

Prevent handoff mistakes

1

Page 26: From data to deployment- full stack data science

IPython

Model

Feature Extraction

Model Building

DB

Raw Data

Page 27: From data to deployment- full stack data science

DB

Web Infrastructure

Model

Feature Extraction

Raw Data

Page 28: From data to deployment- full stack data science

DB

Web Infrastructure

Model

Feature Extraction

JSON Data

Data Service

Page 29: From data to deployment- full stack data science

DB

NoSQL

Web Infrastructure

Model

Feature Extraction

JSON Data

Data Service

Page 30: From data to deployment- full stack data science

Web Infrastructure

Model

Feature Extraction

JSON Data

Data Service

NoSQL

Page 31: From data to deployment- full stack data science

New Service

Web Infrastructure

Model

Feature Extraction

JSON Data

Data Service

NoSQL

Page 32: From data to deployment- full stack data science

Web Infrastructure

New Service

Model

JSON Data

NoSQL

Feature Extraction

Page 33: From data to deployment- full stack data science

Web Infrastructure

New Service

Model

JSON Data

NoSQL

Java Feature Extraction

Page 34: From data to deployment- full stack data science

Contribute on any team

2

Page 35: From data to deployment- full stack data science

Drive logging of data

Page 36: From data to deployment- full stack data science

Drive product decisions using external data

Page 37: From data to deployment- full stack data science

Get first data science solution into production quickly

Page 38: From data to deployment- full stack data science

Iterate on existing solutions

Page 39: From data to deployment- full stack data science

Recognize deployment costs during feature / model development

Page 40: From data to deployment- full stack data science

Think Big

3

Page 41: From data to deployment- full stack data science

Focus on right problem

Page 42: From data to deployment- full stack data science

Aware of big picture

Page 43: From data to deployment- full stack data science

Practical Data Science

Page 44: From data to deployment- full stack data science

Job Description Classifiers

Page 45: From data to deployment- full stack data science
Page 46: From data to deployment- full stack data science

Predicting (min) years of experience from a job description

Page 47: From data to deployment- full stack data science

Simple features for first models

{ 'regex:5+': 1, 'tfidf:expert': 1.75, 'tfidf:advanced': 0.93, 'tfidfBigram:5 years': 2.25 }
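A minimal sketch of how a feature dictionary of this shape might be produced. The feature-name prefixes mirror the slide; the regular expression, the tiny corpus, and the smoothed IDF formula are assumptions chosen for illustration, not Indeed's implementation.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus, used only to compute document frequencies.
CORPUS = [
    "Expert Java developer with 5+ years of experience",
    "Entry level cashier, no experience required",
    "Advanced SQL analyst, 5 years experience preferred",
]

TOKEN_RE = re.compile(r"[a-z0-9+]+")
YEARS_RE = re.compile(r"\b5\+")  # assumed pattern behind 'regex:5+'


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def bigrams(tokens):
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def idf(term, doc_sets):
    # Smoothed inverse document frequency (one common variant).
    df = sum(1 for doc in doc_sets if term in doc)
    return math.log((1 + len(doc_sets)) / (1 + df)) + 1.0


def tfidf(terms, doc_sets, prefix):
    counts = Counter(terms)
    return {prefix + t: (n / len(terms)) * idf(t, doc_sets) for t, n in counts.items()}


def extract_features(text, corpus=CORPUS):
    corpus_tokens = [tokenize(doc) for doc in corpus]
    tokens = tokenize(text)
    features = {}
    if YEARS_RE.search(text):
        features["regex:5+"] = 1
    features.update(tfidf(tokens, [set(t) for t in corpus_tokens], "tfidf:"))
    features.update(
        tfidf(bigrams(tokens), [set(bigrams(t)) for t in corpus_tokens], "tfidfBigram:")
    )
    return features
```

Returning a flat name-to-value dictionary keeps the first model simple: any new feature is just another key.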

Page 48: From data to deployment- full stack data science

Label data before, during, and after you build a model

Extract features in one place

Reuse your model building code

Release softly and log everything

Validate and review every model

Monitor after deploying

Retrain when needed

Page 50: From data to deployment- full stack data science

The best way to understand your problem is to label your own data

The fastest way to get labels for your data is to label your own data

The easiest way to know your labels are consistent is to label your own data

Page 51: From data to deployment- full stack data science

Labeling encourages feature development

Page 52: From data to deployment- full stack data science

Labeling creates a human performance benchmark

Page 53: From data to deployment- full stack data science

Labeling throughout gives you indications of shifting data

Page 54: From data to deployment- full stack data science
Page 55: From data to deployment- full stack data science

Is the job part time, full time, or both?

Page 56: From data to deployment- full stack data science

Sometimes you don’t need much data

Page 57: From data to deployment- full stack data science

Need to only do better than a simple heuristic
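One way to make that bar concrete is to score a keyword heuristic on hand-labeled examples and require any model to beat it. The sketch below does this for the part time / full time question; the patterns and the four labeled examples are invented for illustration.

```python
import re

PART_RE = re.compile(r"\bpart[\s-]?time\b", re.IGNORECASE)
FULL_RE = re.compile(r"\bfull[\s-]?time\b", re.IGNORECASE)


def heuristic_job_type(description):
    """Keyword baseline: returns 'part', 'full', 'both', or 'unknown'."""
    part = bool(PART_RE.search(description))
    full = bool(FULL_RE.search(description))
    if part and full:
        return "both"
    if part:
        return "part"
    if full:
        return "full"
    return "unknown"


def accuracy(predict, labeled):
    correct = sum(1 for text, label in labeled if predict(text) == label)
    return correct / len(labeled)


# Hypothetical hand-labeled examples.
LABELED = [
    ("Full-time nurse needed", "full"),
    ("Part time barista, weekends", "part"),
    ("Full time and part-time openings", "both"),
    ("40 hours per week, benefits included", "full"),
]

baseline = accuracy(heuristic_job_type, LABELED)
```

The last example is the kind of case the heuristic misses and a trained model could pick up.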

Page 58: From data to deployment- full stack data science

[Figure: Learning Curve; x-axis: Training Samples (0 to 7000); y-axis: Score (0.84 to 1.00); series: Training score, Cross-validation score]

Page 59: From data to deployment- full stack data science

Now train others to label

Page 60: From data to deployment- full stack data science

Or use experts

Page 61: From data to deployment- full stack data science

Check their consistency
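A common way to quantify labeler consistency is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The talk does not say which measure was used, so treat this as one possible check rather than the speaker's method.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if both labelers chose independently
    # according to their own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa of 1.0 means perfect agreement and values near 0 mean agreement no better than chance.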

Page 62: From data to deployment- full stack data science

Can build next generation model quickly

Page 63: From data to deployment- full stack data science

Always flag weird data

Page 64: From data to deployment- full stack data science

Label data before, during, and after you build a model

Extract features in one place

Reuse your model building code

Release softly and log everything

Validate and review every model

Monitor after deploying

Retrain when needed

Page 65: From data to deployment- full stack data science

Feature Extraction

Features

Model Builder

Model

Model Predictor

Predictions

Page 66: From data to deployment- full stack data science

Prevents feature inconsistency between train / serve time
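A sketch of the idea: the model builder and the model predictor both call the same extraction function, so a feature cannot be computed one way in training and another way in serving. The three features and the nearest-centroid "model" are placeholders invented for this example.

```python
from collections import defaultdict


def extract_features(text):
    """Single source of truth for features, used at train and serve time."""
    words = text.lower().split()
    return {
        "numWords": float(len(words)),
        "hasYears": 1.0 if "years" in words else 0.0,
        "hasExpert": 1.0 if "expert" in words else 0.0,
    }


def build_model(labeled_texts):
    """Toy model builder: one centroid (mean feature vector) per label."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for text, label in labeled_texts:
        counts[label] += 1
        for name, value in extract_features(text).items():
            sums[label][name] += value
    return {
        label: {name: total / counts[label] for name, total in feats.items()}
        for label, feats in sums.items()
    }


def predict(model, text):
    """Toy predictor: nearest centroid, using the same extractor."""
    feats = extract_features(text)

    def distance(centroid):
        return sum((feats[k] - centroid[k]) ** 2 for k in feats)

    return min(model, key=lambda label: distance(model[label]))
```

Changing a feature now means editing one function, which is also what makes feature iteration faster.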

Page 67: From data to deployment- full stack data science

Allows faster feature iteration

Page 68: From data to deployment- full stack data science

Encourages feature extraction reuse

Page 69: From data to deployment- full stack data science

Deploy feature extraction services

Page 70: From data to deployment- full stack data science

Features

Model Builder

Model

Feature Extraction

Page 71: From data to deployment- full stack data science
Page 72: From data to deployment- full stack data science

Job Description

Feature Extractor

Page 73: From data to deployment- full stack data science

{ "tfidf:experience": 0.007,

"bigramTfidf:5 years": 0.049,

"bigramTfidf:experience in": 0.006,

"tfidf:expert": 0.026,

"averageWordLength": 5.506,

"tfidf:2": 0.017,

"tfidf:5": 0.029,

"tfidf:years": 0.017,

... }

Page 74: From data to deployment- full stack data science

Label data before, during, and after you build a model

Extract features in one place

Reuse your model building code

Release softly and log everything

Validate and review every model

Monitor after deploying

Retrain when needed

Page 75: From data to deployment- full stack data science

Features Model

● feature sampling

● feature scaling

● feature selection

Model Builder

● test/train splits

● cross validation

● generate plots

● email results

● export model

Page 76: From data to deployment- full stack data science

input_file=job_decription_years_exp.gz
output_dir=output/job_description_years_exp_model_builds

model_name=JobExperience
model_version=1.2

model_type=RandomForestClassifier
model_params=[{'n_estimators':[100, 125, 150], 'max_depth':[3, 4, 5, 6]}]

downsampling_ratio=1.75
use_feature_selection=True
feature_selection_variance_retained=0.9
plot_learning_curve=True

[email protected]
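A sketch of how a model builder might read a properties file like this and expand `model_params` into candidate parameter sets. It assumes one `key=value` pair per line and that `model_params` is a Python-style literal; the parsing functions are hypothetical and not taken from Indeed's code. The list-of-dicts shape resembles the parameter grids accepted by scikit-learn's `GridSearchCV`.

```python
import ast
import itertools

# Abbreviated copy of the configuration, one key=value pair per line.
PROPERTIES_TEXT = """
model_name=JobExperience
model_version=1.2
model_type=RandomForestClassifier
model_params=[{'n_estimators': [100, 125, 150], 'max_depth': [3, 4, 5, 6]}]
use_feature_selection=True
"""


def parse_properties(text):
    """Parse key=value lines into a dict of strings."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def expand_grid(grids):
    """Yield every parameter combination from a list of {name: [values]}."""
    for grid in grids:
        names = sorted(grid)
        for combo in itertools.product(*(grid[n] for n in names)):
            yield dict(zip(names, combo))


props = parse_properties(PROPERTIES_TEXT)
candidates = list(expand_grid(ast.literal_eval(props["model_params"])))
```

Here the grid expands to 3 x 4 = 12 candidate settings, each of which would be cross-validated.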

Page 77: From data to deployment- full stack data science

[Figure: ROC Curve; x-axis: False Positive Rate (0.0 to 1.0); y-axis: True Positive Rate (0.0 to 1.0)]

Page 78: From data to deployment- full stack data science

Feature Name         Feature Importance
experience           0.27
5 years              0.19
experience in        0.17
expert               0.16
averageWordLength    0.11
years                0.08
...                  ...

Class        Precision  Recall  F1-Score  Support
1.0          0.92       0.90    0.91      353
2.0          0.87       0.92    0.90      310
5.0          0.90       0.86    0.88      213
avg / total  0.90       0.90    0.90      876
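For reference, the per-class columns in a report like this can be computed directly from true and predicted labels. The function below is a small standard-library sketch of those definitions, not the model builder's actual reporting code.

```python
from collections import Counter


def classification_report(y_true, y_pred):
    """Per-class precision, recall, F1, and support."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an instance of the true class
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        p_den = tp[cls] + fp[cls]
        r_den = tp[cls] + fn[cls]
        precision = tp[cls] / p_den if p_den else 0.0
        recall = tp[cls] / r_den if r_den else 0.0
        f1_den = precision + recall
        f1 = 2 * precision * recall / f1_den if f1_den else 0.0
        report[cls] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": r_den,  # number of true instances of the class
        }
    return report
```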

Page 79: From data to deployment- full stack data science

Output your models into a standard format

Page 80: From data to deployment- full stack data science

Deploy quickly

Page 81: From data to deployment- full stack data science

Model

Model Predictor

Feature Extraction

Predictions

Page 82: From data to deployment- full stack data science

Putting it all together

Page 83: From data to deployment- full stack data science

Feature Extraction

Features

Model Builder

Model

Model Predictor

Predictions

Page 84: From data to deployment- full stack data science

Label data before, during, and after you build a model

Extract features in one place

Reuse your model building code

Release softly and log everything

Validate and review every model

Monitor after deploying

Retrain when needed

Page 85: From data to deployment- full stack data science
Page 86: From data to deployment- full stack data science

viewjobeval_en_US JUDY-419: Proctor test for viewjob evaluation test

control test1

Page 87: From data to deployment- full stack data science

viewjobeval_en_US JUDY-419: Proctor test for viewjob evaluation test

control test1

Page 88: From data to deployment- full stack data science

viewjobeval_en_US JUDY-419: Proctor test for viewjob evaluation test

test1 - 50%
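Proctor is Indeed's open source A/B testing framework. The sketch below is not Proctor's API; it only illustrates the general idea behind a soft release, where each user is deterministically hashed into a bucket according to an allocation such as 50% for `test1`. The function name and the allocation table are hypothetical.

```python
import hashlib

# Hypothetical allocation: bucket name -> share of traffic (sums to 1.0).
ALLOCATION = [("control", 0.5), ("test1", 0.5)]


def assign_bucket(test_name, user_id, allocation=ALLOCATION):
    """Deterministically map a user to a bucket for a given test."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for bucket, share in allocation:
        cumulative += share
        if point < cumulative:
            return bucket
    return allocation[-1][0]  # guard against floating-point rounding
```

Because the assignment depends only on the test name and user id, a user sees the same bucket on every request, and raising the `test1` share gradually widens the release.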

Page 89: From data to deployment- full stack data science

Log everything

Page 90: From data to deployment- full stack data science

uid=1b0un002j1jfi8mp&type=judyQoaEvalFeatures&appdcname=aus&appinstance=judy&tk=1b0un002d1jfid0o&locale=en_US&f.jdTfidf%3A794=0.07931499364678474&f.candidateResumeRead=0.0&f.trigramJDTfidf%3A2365=0.03493229123324494&f.trigramJDTfidf%3A1135=0.03964128705308954&f.jdTfidf%3A1618=0.08411276446891801&f.jdTfidf%3A2025=0.07554196313862578&f.jdTfidf%3A796=0.10368340560564313&f.trigramJDTfidf%3A1324=0.023586131767642488&f.trigramJDTfidf%3A1300=0.013675981072748583&f.jobApplicantDistance=25000.0&f.tfidfBestFitJobsJobDescriptionSimilarity=0.0&f.jdTfidf%3A2357=0.12212208847891733&f.jdTfidf%3A1786=0.24798453870628528&f.jdTfidf%3A1583=0.11102969484158107&f.trigramJDTfidf%3A440=0.009580278396637679&f.bestFitJobsJobDescriptionSimilarity=0.0&f.jdTfidf%3A16=0.09676734768924529&f.trigramJDTfidf%3A342=0.052695755493244574&f.jdTfidf%3A2961=0.12933227874206563&f.jdTfidf%3A2559=0.0781937359029168&f.coverLetterJobTitleSimilarity=0.0&f.jdTfidf%3A313=0.13274661170267346&f.trigramJDTfidf%3A2844=0.011672658147330478&f.jdTfidf%3A1228=0.0826878541112167&f.jdTfidf%3A386=0.09321074430754722&f.jdTfidf%3A587=0.09338485474725206&f.trigramJDTfidf%3A2007=0.03398987646377408&f.jdTfidf%3A25=0.08485085553898714&f.trigramJDTfidf%3A743=0.052044363109186274&f.trigramJDTfidf%3A742=0.00936380975357828&f.jdTfidf%3A21=0.08956959630539192&f.trigramJDTfidf%3A1465=0.05695667014121465&f.trigramJDTfidf%3A170=0.019054361889691666&f.trigramJDTfidf%3A2041=0.07867225222073676&f.jdTfidf%3A178=0.06740515563149391&f.trigramJDTfidf%3A1348=0.020307558998175355&f.yearsOfWorkExperience=0.0&f.trigramJDTfidf%3A2874=0.021452684048600148&f.trigramJDTfidf%3A2739=0.008846404277542146&f.jtYrsExpRegex%3A0=0.0&f.pastJobTitleSimilarity%3A0=0.0&f.pastJobTitleSimilarity%3A1=0.0&f.tfidfResumeJobDescriptionSimilarity=0.020420184609032756&f.jdTfidf%3A276=0.0865108192737853&f.pastJobTitleSimilarity%3A2=0.0&f.jdTfidf%3A882=0.09227660841710272&f.trigramJDTfidf%3A904=0.028517392545983834&f.applicantsPerJob=0.0&f.majorJobDescriptionSimilarity=0.018518518518518517&f.jobDescriptionCharacterLength=501.0&f.trigramJDTfidf%3A221=0.03856671987843533&f.jdSupervisorTitleRegex%3A3=1.0&f.jdSupervisorTitleRegex%3A1=0.0&f.jdSupervisorTitleRegex%3A2=0.0&f.jdSupervisorTitleRegex%3A0=0.0&f.jdTfidf%3A1937=0.10276933510059638&f.jdTfidf%3A2240=0.16550210190515535&f.jdTfidf%3A264=0.1061544307504775&f.jdTfidf%3A1933=0.08140883446275106&f.trigramJDTfidf%3A2932=0.04909455318062527&f.jdTfidf%3A1082=0.09783192017828135&f.jdTfidf%3A2454=0.08232280250175841&f.jdLicenceRegexp%3A2=0.0&f.tfidfCoverLetterJobDescriptionSimilarity=0.0&f.jdTfidf%3A485=0.11773996424853242&f.trigramJDTfidf%3A1942=0.035001333061248&f.jdLicenceRegexp%3A0=0.0&f.jdLicenceRegexp%3A1=0.0&f.jdTfidf%3A299=0.08046452951090553&f.trigramJDTfidf%3A2261=0.0539089291266305&f.jdTfidf%3A872=0.08711259378092336&f.trigramJDTfidf%3A1377=0.037898645513041965&f.trigramJDTfidf%3A487=0.022278961460829243&f.trigramJDTfidf%3A485=0.029495461171052794&f.numMonthsExperience=134.0&f.trigramJDTfidf%3A207=0.040840685741050896&f.trigramJDTfidf
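The log entry above is URL-encoded `key=value` pairs, with feature values under an `f.` prefix. A short sketch of reading such a line back into metadata and a feature dictionary follows; the sample line is a shortened, made-up entry in the same shape, and the function name is hypothetical.

```python
from urllib.parse import parse_qsl

# Shortened, hypothetical log line in the same shape as the example above.
LOG_LINE = (
    "uid=abc123&type=judyQoaEvalFeatures&locale=en_US"
    "&f.jdTfidf%3A794=0.0793&f.candidateResumeRead=0.0"
    "&f.jobApplicantDistance=25000.0&f.numMonthsExperience=134.0"
)


def parse_feature_log(line):
    """Split a URL-encoded log line into metadata and float features."""
    meta, features = {}, {}
    for key, value in parse_qsl(line, keep_blank_values=True):
        if key.startswith("f."):
            features[key[2:]] = float(value)  # '%3A' has been decoded to ':'
        else:
            meta[key] = value
    return meta, features


meta, features = parse_feature_log(LOG_LINE)
```

Logging the exact feature values used at prediction time is what lets the same entries be reused later as training data.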

Page 91: From data to deployment- full stack data science

Reuse logs for future models

Page 92: From data to deployment- full stack data science

Logs give us insight into changing data

Page 93: From data to deployment- full stack data science

Logs allow us to see what went wrong

Page 94: From data to deployment- full stack data science

Label data before, during, and after you build a model

Extract features in one place

Reuse your model building code

Release softly and log everything

Validate and review every model

Monitor after deploying

Retrain when needed

Page 95: From data to deployment- full stack data science

Quantitative Validation

Page 96: From data to deployment- full stack data science

Training Set

class        precision  recall  f1-score  support
0.0          1.00       1.00    1.00      448
1.0          0.99       1.00    1.00      663
2.0          1.00       0.98    0.99      269
avg / total  1.00       1.00    1.00      1380

[ 2015-12-15 21:42:27,537 INFO ] [indeed.model_builder]

Test Set

class        precision  recall  f1-score  support
0.0          0.85       0.90    0.87      146
1.0          0.92       0.96    0.94      226
2.0          0.91       0.70    0.79      88

Page 97: From data to deployment- full stack data science

[Figure: ROC Curve; x-axis: False Positive Rate (0.0 to 1.0); y-axis: True Positive Rate (0.0 to 1.0)]

Page 98: From data to deployment- full stack data science

Qualitative Validation

Page 99: From data to deployment- full stack data science
Page 100: From data to deployment- full stack data science
Page 101: From data to deployment- full stack data science
Page 102: From data to deployment- full stack data science

Review your Models

Page 103: From data to deployment- full stack data science

Another perspective

Page 104: From data to deployment- full stack data science

Transparency and Reproducibility

Page 105: From data to deployment- full stack data science

Awareness

Page 106: From data to deployment- full stack data science

1 Context

2 Data

3 Response variable

4 Features

5 Model selection and performance

6 Transparency and recommendations

Page 107: From data to deployment- full stack data science

Context

What should this model enable us to do (highlighting, filtering, sorting, etc.)?

What products / interfaces / workflows will initially use this model?

Page 108: From data to deployment- full stack data science

Data

What queries and filters were used?

From what time range did your data originate?

Did you sample your dataset?

Page 109: From data to deployment- full stack data science

Response variable

How was the response variable labeled or collected?

What do the model outputs (predictions) represent, and how should they be scaled or thresholded?

Page 110: From data to deployment- full stack data science

Features

How were your features generated?

Which features were most important?

Page 111: From data to deployment- full stack data science

Model selection and performance

Performance reports on train / test sets

Overall CV search strategy and scoring function

Other performance tests (e.g. newer hold out sets, stress testing)

Expected model performance

Page 112: From data to deployment- full stack data science

Transparency and recommendations

Properties files for Model Builder

Link to branch of Model Builder code

Examples of Model Predictions

Possible directions for future improvements

A couple of sentences on why you think the model is ready for production

Page 113: From data to deployment- full stack data science

Label data before, during, and after you build a model

Extract features in one place

Reuse your model building code

Release softly and log everything

Validate and review every model

Monitor after deploying

Retrain when needed

Page 114: From data to deployment- full stack data science

Features and data are hard dependencies

Page 115: From data to deployment- full stack data science

Need a post deploy plan

Page 116: From data to deployment- full stack data science

Use log data to check for feature changes

Page 117: From data to deployment- full stack data science

[Figure: histogram of tfidf:'excel'; y-axis: Bucket count]

Test Name  p-value
ttest_ind  3.79e-09
ks_2samp   0.00021
mannwhit   8.41e-05
levene     3.79e-09
ranksums   0.00017
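Test names such as `ttest_ind`, `ks_2samp`, `levene`, and `ranksums` match two-sample tests in `scipy.stats`. As a dependency-free illustration of one of them, the sketch below computes the two-sample Kolmogorov-Smirnov statistic for a feature's training values against its logged values and compares it with the usual large-sample critical value. The function names and the default 0.05 significance level are assumptions for this example.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def feature_drifted(train_values, live_values, alpha=0.05):
    """Flag drift when D exceeds the large-sample critical value."""
    n, m = len(train_values), len(live_values)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(train_values, live_values) > critical
```

Running a check like this per feature on a schedule gives an early warning before prediction quality visibly degrades.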

Page 118: From data to deployment- full stack data science

Check prediction class distributions
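One simple way to run this check is to compare the share of each predicted class in production against a baseline, for example the predictions on the validation set. The sketch uses total variation distance; both the metric and the 0.1 alert threshold are illustrative choices, not something stated in the talk.

```python
from collections import Counter


def class_distribution(predictions):
    """Fraction of predictions falling in each class."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def total_variation(dist_p, dist_q):
    """Half the L1 distance between two discrete distributions (0 to 1)."""
    classes = set(dist_p) | set(dist_q)
    return 0.5 * sum(abs(dist_p.get(c, 0.0) - dist_q.get(c, 0.0)) for c in classes)


def distribution_shifted(baseline_preds, live_preds, threshold=0.1):
    # The 0.1 threshold is an arbitrary illustration, not a recommendation.
    distance = total_variation(
        class_distribution(baseline_preds), class_distribution(live_preds)
    )
    return distance > threshold
```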

Page 119: From data to deployment- full stack data science

Label data before, during, and after you build a model

Extract features in one place

Reuse your model building code

Release softly and log everything

Validate and review every model

Monitor after deploying

Retrain when needed

Page 120: From data to deployment- full stack data science

Every model should be validated, so retraining is expensive in time

Page 121: From data to deployment- full stack data science

Use feature monitoring to determine feature stability

Page 122: From data to deployment- full stack data science

Choose less sensitive features

Page 123: From data to deployment- full stack data science

Avoid counts

Page 124: From data to deployment- full stack data science

Full stack data scientists

Page 125: From data to deployment- full stack data science

Full stack data science organizations

Page 126: From data to deployment- full stack data science

More Indeed Engineering

Careers: indeed.jobs

Twitter: @IndeedEng

Engineering Blog & Talks: indeed.tech

Open Source: opensource.indeedeng.io

Page 127: From data to deployment- full stack data science

Questions?

Label data before, during, and after you build a model

Extract features in one place

Reuse your model building code

Release softly and log everything

Validate and review every model

Monitor after deploying

Retrain when needed