Data Science Competition
![Page 1: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/1.jpg)
Data Science Competition
February 25, 2017
The 27th Annual KSEA South-Western Regional Conference
Jeong-Yoon Lee, Ph.D.
![Page 2: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/2.jpg)
Chief Data Scientist, Conversion Logic
Ph.D. in Computer Science, USC
M.S. in Electrical Engineering, USC
B.S. in Electrical Engineering, SNU
KDD Cup Winner 2012 & 2015
Top 10, Kaggle 2015
Jeong-Yoon Lee, Ph.D.
![Page 3: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/3.jpg)
Why Data Science Competition
![Page 4: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/4.jpg)
Why Compete
• For fun
• For experience
• For learning
• For networking
![Page 5: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/5.jpg)
Fun
• Competing with others
• Incremental improvement
![Page 6: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/6.jpg)
Experience
![Page 7: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/7.jpg)
Learning
![Page 8: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/8.jpg)
Learning
![Page 9: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/9.jpg)
Networking
![Page 10: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/10.jpg)
![Page 11: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/11.jpg)
Data Science Competition
![Page 12: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/12.jpg)
Data Science Competitions
• KDD Cup: since 1997
• Netflix Prize: 2006 - 2009
• Kaggle: since 2010
![Page 13: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/13.jpg)
Competition Structure
• Training Data: Feature and Label, both provided
• Test Data: Feature provided; Label predicted in a Submission, scored as Public LB Score and Private LB Score
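The public/private leaderboard (LB) scoring of a submission can be sketched as below. This is a minimal illustration, not any platform's actual scoring code: the function names, the 30% public fraction, and the accuracy metric are all assumptions made for the example, and real competitions choose their own split and metric.

```python
import random


def split_leaderboard(n_test, public_fraction=0.3, seed=0):
    """Randomly split test row indices into public and private LB portions.

    public_fraction is a hypothetical value chosen for illustration.
    """
    rng = random.Random(seed)
    rows = list(range(n_test))
    rng.shuffle(rows)
    cut = round(n_test * public_fraction)
    return sorted(rows[:cut]), sorted(rows[cut:])


def accuracy(y_true, y_pred, rows):
    """Share of correct predictions on the given subset of test rows."""
    return sum(y_true[i] == y_pred[i] for i in rows) / len(rows)


def score_submission(y_true, y_pred, public_rows, private_rows):
    """Score one submission on both portions of the hidden test labels."""
    return {
        "public_lb": accuracy(y_true, y_pred, public_rows),
        "private_lb": accuracy(y_true, y_pred, private_rows),
    }
```

Competitors see only the public score while the competition runs, so a model tuned too closely to it can drop on the private score.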
![Page 14: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/14.jpg)
Kaggle
• 250+ competitions since 2010
• 500K+ users
• 50K+ competitors
• $3MM+ in prizes paid out
![Page 15: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/15.jpg)
Kaggle
![Page 16: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/16.jpg)
Kaggle
![Page 17: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/17.jpg)
Misconceptions about Competitions
![Page 18: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/18.jpg)
Misconceptions about Competitions
• No ETL
• No EDA
• Not worth it
• Not for production
![Page 19: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/19.jpg)
No ETL? - Deloitte Western Australia Rental Prices
![Page 20: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/20.jpg)
No ETL? - Outbrain Click Prediction
2B page views. 16.9MM clicks. 700MM users. 560 sites
![Page 21: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/21.jpg)
No ETL? - YouTube-8M Video Understanding Challenge
1.7TB frame-level data. 31GB video-level data.
![Page 22: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/22.jpg)
No ETL?
![Page 23: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/23.jpg)
No EDA?
• Most competitions provide actual labels - typical EDA
• Anonymized data - more creative EDA
o People decode age, states, time intervals, income, etc.
![Page 24: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/24.jpg)
No EDA?
• Anonymized data - more creative EDA
![Page 25: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/25.jpg)
Not worth it?
• Performance matters
• You walk easier when you can run
![Page 26: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/26.jpg)
Not for Production?
• Kaggle Kernel
o Max execution time: 10 minutes
o Max file output: 500MB
o Memory limit: 8GB
![Page 27: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/27.jpg)
Ensemble Pipeline at Conversion Logic
![Page 28: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/28.jpg)
Best Practices
![Page 29: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/29.jpg)
Best Practices
• Feature Engineering
• Algorithms
• Cross Validation
• Ensemble
![Page 30: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/30.jpg)
Feature Engineering
• Numerical - Log, Log(1 + x), Normalization, Binarization
• Categorical - One-hot encoding, TF-IDF (text), Weight-of-Evidence
• Timeseries - Stats, FFT, MFCC, ERP (EEG)
• Numerical/Timeseries to Categorical - RF/GBM*
• Dimensionality Reduction - PCA, SVD, Autoencoder
* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
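A few of the numerical and categorical transforms listed above can be sketched in plain Python. This is a minimal illustration under assumptions: the function names are hypothetical, only the standard library is used so the block stays self-contained, and in practice a library implementation would normally be preferred.

```python
import math


def log1p_transform(values):
    """Log(1 + x): compresses long-tailed, non-negative values; defined at x = 0."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Normalization to zero mean and unit variance (population std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0] * len(values)  # constant column carries no information
    return [(v - mean) / std for v in values]


def binarize(values, threshold):
    """Binarization: 1 if the value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def one_hot(categories):
    """One-hot encoding: one 0/1 column per distinct category level."""
    levels = sorted(set(categories))
    column = {level: i for i, level in enumerate(levels)}
    rows = []
    for c in categories:
        row = [0] * len(levels)
        row[column[c]] = 1
        rows.append(row)
    return levels, rows
```

For example, `one_hot(["b", "a", "b"])` yields the levels `["a", "b"]` and one indicator row per input value.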
![Page 31: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/31.jpg)
Algorithms
| Algorithm | Tool | Note |
| --- | --- | --- |
| Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions |
| Random Forests | Scikit-Learn, randomForest | Used to be popular before GBM |
| Extremely Randomized Trees | Scikit-Learn | |
| Neural Networks / Deep Learning | Keras, MXNet, CNTK, Torch | Blends well with GBM. Best at image and speech recognition competitions |
| Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble. |
| Support Vector Machine | Scikit-Learn | |
| FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions |
| Factorization Machine | libFM | Winning solution for KDD Cup 2012 |
| Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu) |
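As an illustration of one entry in the list, the prediction of a second-order factorization machine can be sketched as below. This is not libFM's code or API: the function name and argument layout are hypothetical, and only the standard model equation (bias + linear terms + pairwise interactions through latent factors) is assumed.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction for one example.

    x  : feature values, length n
    w0 : global bias
    w  : linear weights, length n
    V  : latent factors, n rows of k values each
    """
    n = len(x)
    k = len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    # The pairwise term sum_{i<j} <V[i], V[j]> x[i] x[j] is computed in
    # O(n * k) via the identity 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2).
    pairwise = 0.0
    for f in range(k):
        total = sum(V[i][f] * x[i] for i in range(n))
        total_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (total * total - total_sq)
    return linear + pairwise
```

The latent factors let the model estimate an interaction between two features even when they rarely co-occur, which is one reason the approach suits sparse CTR data.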
![Page 32: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/32.jpg)
Cross Validation
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
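A stratified five-fold split of this kind can be sketched as follows. The function name and the seed are hypothetical, and the labels are assumed to be class values such as a 0/1 dropout indicator; it is a minimal sketch, not the pipeline used in the talk.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_folds=5, seed=42):
    """Assign row indices to folds so each fold keeps roughly the same
    size and the same share of every label value."""
    rng = random.Random(seed)
    rows_by_label = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_label[label].append(row)

    folds = [[] for _ in range(n_folds)]
    position = 0  # running counter keeps total fold sizes balanced
    for label in sorted(rows_by_label):
        rows = rows_by_label[label]
        rng.shuffle(rows)
        for row in rows:
            folds[position % n_folds].append(row)
            position += 1
    return folds
```

Each fold then serves once as the validation set while the other four are used for training.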
![Page 33: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/33.jpg)
![Page 34: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/34.jpg)
Ensemble
* for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
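The slide's diagram is not reproduced in this transcript. As one simple type of ensemble (not necessarily the one pictured), averaging and rank averaging of model predictions can be sketched as below; the function names are hypothetical.

```python
def average_blend(predictions, weights=None):
    """Weighted average of several models' predictions for the same rows.

    predictions : list of per-model prediction lists, all of equal length
    weights     : optional per-model weights; defaults to equal weights
    """
    n_models = len(predictions)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_rows = len(predictions[0])
    return [
        sum(weight * model[i] for weight, model in zip(weights, predictions))
        for i in range(n_rows)
    ]


def to_ranks(scores):
    """Replace scores by their ranks scaled to [0, 1]; ties keep input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, i in enumerate(order):
        ranks[i] = rank / (len(scores) - 1) if len(scores) > 1 else 0.0
    return ranks


def rank_average(predictions):
    """Average ranks instead of raw scores, so models with different
    output scales contribute equally."""
    return average_blend([to_ranks(model) for model in predictions])
```

Rank averaging only makes sense for rank-based metrics such as AUC, since it discards the calibration of the original scores.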
![Page 35: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/35.jpg)
KDDCup 2015 Solution
![Page 36: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/36.jpg)
Why Compete
• For fun
• For experience
• For learning
• For networking
![Page 37: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/37.jpg)
One Last Thing
Google: 20K applications per week
Conversion Logic: 200 applications per week
![Page 38: Data Science Competition](https://reader036.fdocuments.in/reader036/viewer/2022070517/58cea68c1a28abb26e8b6335/html5/thumbnails/38.jpg)
Thank You