Post on 19-Aug-2015
©2013 LinkedIn Corporation. All Rights Reserved.
Hacking Data Science
Patrick Philips
Vitaly Gordon
Overview of ML pipeline
Gather data
Feature engineering
Model fitting
Evaluation
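The four stages above can be sketched end to end in a few lines. This is only an illustration: the toy comments, the bag-of-words features, and the token-overlap scorer are all assumptions made up for this sketch, not the pipeline used in the talk.

```python
from collections import Counter, defaultdict

# 1. Gather data: hypothetical (text, label) pairs, invented for illustration.
DATA = [
    ("buy cheap watches now", "spam"),
    ("cheap pills buy now", "spam"),
    ("great article thanks for sharing", "ok"),
    ("thanks for the insightful post", "ok"),
]

# 2. Feature engineering: bag-of-words token counts.
def featurize(text):
    return Counter(text.lower().split())

# 3. Model fitting: count how often each token appears under each label.
def fit(examples):
    token_counts = defaultdict(Counter)
    for text, label in examples:
        token_counts[label].update(featurize(text))
    return token_counts

def predict(model, text):
    features = featurize(text)
    # Score each label by the overlap between the text's tokens and that label's counts.
    scores = {
        label: sum(counts[tok] * n for tok, n in features.items())
        for label, counts in model.items()
    }
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(scores), key=scores.get)

# 4. Evaluation: accuracy on held-out examples.
def accuracy(model, examples):
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)

model = fit(DATA)
held_out = [("buy cheap pills", "spam"), ("thanks for sharing", "ok")]
print(accuracy(model, held_out))
```

Each stage is a separate function, so any one of them can be swapped out without touching the others.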
Understanding Seniority
Companies are not standard
Titles are not enough
Things change
Learning to target better
Classifying names to genders
Let’s look at Monica again
Not so fast …
Not so fast …
Even slower …
Sometimes the answer is right under your nose
Comment Spam on Influencer content
Challenge 1: Binary tasks are too guessable
Challenge 2: Context matters
Spam Comment Annotation Task
Quality: Gold distributions and skewed datasets
Using results to evaluate new features
Model          ΔP   ΔR     ΔPRC
Baseline       -    -      -
Variation 1    +    -      +
Variation 2    -    +      +
Variation 3    -    ++     --
Variation 4    -    +++    ++
Variation 5    -    +++    ++
Variation 6    -    +++    ++
Variation 7    -    ++++   +++
Variation 8    -    ++++   +++
Variation 9    -    ++++   +++
Variation 10   -    ++++   +++
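A minimal sketch of how ΔP and ΔR against a baseline could be computed from one shared set of crowd-labeled comments. The label and prediction vectors are invented for illustration, and the function names are hypothetical. The ΔPRC column (presumably a precision-recall-curve summary) is not computed here because it needs classifier scores rather than hard predictions.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for binary labels, where 1 means spam."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def deltas(labels, baseline_preds, variant_preds):
    """Return (delta precision, delta recall) of a variant relative to the baseline."""
    base_p, base_r = precision_recall(labels, baseline_preds)
    var_p, var_r = precision_recall(labels, variant_preds)
    return var_p - base_p, var_r - base_r

# Hypothetical data: crowd labels plus two classifiers' hard predictions.
labels   = [1, 1, 1, 1, 0, 0, 0, 0]
baseline = [1, 1, 0, 0, 0, 0, 0, 0]  # precise, but misses half the spam
variant  = [1, 1, 1, 1, 1, 0, 0, 0]  # catches all spam, with one false positive

delta_p, delta_r = deltas(labels, baseline, variant)
print(f"dP={delta_p:+.2f} dR={delta_r:+.2f}")
```

In this toy case the variant trades a small loss in precision for a large gain in recall, the same pattern as most rows in the table.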
“As simple as possible, but not simpler”
LinkedIn Channels
Labels aren’t free
Suggest likely candidates for topics then expand
Evaluate suggested article-topic pairs
Using results to evaluate new implementations of the spam classifier: improve precision without a drop in recall
18k comments labeled in 54 hrs for $180
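The figures above work out to a unit cost and a throughput. The sketch below is plain arithmetic on the three numbers from the slide. Whether each comment received one judgment or several is not stated, so the results are per comment, not per judgment.

```python
# Figures taken from the slide.
comments_labeled = 18_000
hours_elapsed = 54
total_cost_usd = 180

cost_per_comment = total_cost_usd / comments_labeled
comments_per_hour = comments_labeled / hours_elapsed

print(f"${cost_per_comment:.2f} per comment, {comments_per_hour:.0f} comments per hour")
```

That is about one cent per labeled comment, at roughly 333 comments per hour.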
Quality: Not by Gold alone
Using results to evaluate existing classification framework
“Help your helpers”
Search is a major portal to information
LI Search is personalized
Evaluation is still possible
Search Evaluation – WTF@1
Quality: Behavioral metrics are good too!
“Pick a solvable problem”
Standardizing titles
Which question is easier?
1. Find a better name for the title “account executive”?
2. How similar are “account executive” and “sales executive”?
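The second question is easier partly because it reduces to a pairwise score. One simple way to compute such a score is Jaccard overlap on the words of each title. This is an assumption chosen for illustration, not the similarity measure used in the talk, and `title_similarity` is a hypothetical helper name.

```python
def title_similarity(title_a, title_b):
    """Jaccard similarity between the sets of lowercase words in two titles."""
    tokens_a = set(title_a.lower().split())
    tokens_b = set(title_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty titles: define similarity as 0 to avoid dividing by zero
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

score = title_similarity("account executive", "sales executive")
print(round(score, 3))
```

The two titles share one of three distinct words, so the score is 1/3. Word overlap cannot tell that "account" and "sales" are related, which is why a human judgment on the same pair is still useful.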
Notable Experts
First attempt
Second attempt
Third attempt
What makes the best data mining expert?
Education?
Industry experience?
Number of publications?
Communication skills?
Hacking skills?
Knowledge of statistics?
Number of endorsements?
“More bad data != better data”
Summary
1. Use the data you already have
2. Keep it simple, but not too simple
3. Pick a solvable problem
4. Help your helpers
5. Sample intelligently
6. More (bad) data != better data
Questions?