How to get started in Kaggle competition

Merja Kajava, September 1, 2015, Helsinki Data Analytics and Science Meetup


Page 1: How to get started in Kaggle competition


Page 2

Kaggle is a competition platform for (aspiring) data scientists

Page 4

Why participate in Kaggle
The data
Competition steps
Tips

Page 5

Why participate in a Kaggle competition?

Page 6

1. Learn from the best.

Forums
Scripts
Solutions from prize winners

Page 7

2. Work with cool datasets.

Flights in GE Flight Quest
Driver telematics analysis
Amazon employee access rights

Page 8

+1 You can also win money

Page 9

What kinds of competitions does Kaggle have?

Page 10

Public

In-class

Private

Page 11

What languages can you use?

Any open-source language (sometimes also the sponsor's proprietary languages)

GNU Octave (no MATLAB)

Page 12

What is the data like?

Page 13

Data comes from companies and non-profit organizations

Page 14

Data sizes vary

Zip files from ~1 MB to ~6 GB

Page 15

Data comes in all shapes

Customer data
Log files
Time series
HTML pages
Images
Documents

Page 16

How does a Kaggle competition work?

Page 17

Competition flow

Duration typically 4 to 8 weeks
Max. 5 entries per day

Page 18

Build prediction model

Diagram: train data -> Model; test data + Model -> Predict -> submission.csv

Calculate a cross-validation (CV) score to validate the model locally
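The flow on this slide (train a model, cross-validate it locally, predict on the test set, write submission.csv) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the speaker's code: the "model" is a toy that always predicts the training mean, and the column names `Id` and `Prediction` are assumptions, since every competition defines its own submission format.

```python
import csv
import io
import random
import statistics


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and split them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean_model(targets):
    """Toy stand-in for a real model: always predict the training mean."""
    return statistics.fmean(targets)


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


def cross_validate(targets, k=5):
    """Return the mean RMSE over k held-out folds."""
    scores = []
    for fold in k_fold_indices(len(targets), k):
        held_out = set(fold)
        train = [t for i, t in enumerate(targets) if i not in held_out]
        valid = [targets[i] for i in fold]
        prediction = fit_mean_model(train)
        scores.append(rmse(valid, [prediction] * len(valid)))
    return statistics.fmean(scores)


def write_submission(ids, predictions, out):
    """Write an id/prediction table; the header names are hypothetical."""
    writer = csv.writer(out)
    writer.writerow(["Id", "Prediction"])
    writer.writerows(zip(ids, predictions))


# Made-up data for illustration only.
train_targets = [3.0, 5.0, 4.0, 6.0, 2.0, 5.0, 4.0, 3.0, 6.0, 2.0]
test_ids = [101, 102, 103]

cv_score = cross_validate(train_targets, k=5)
model = fit_mean_model(train_targets)

# StringIO stands in for open("submission.csv", "w", newline="").
buffer = io.StringIO()
write_submission(test_ids, [model] * len(test_ids), buffer)
submission_text = buffer.getvalue()
```

In practice the toy mean model would be replaced by a real learner, while the cross-validation loop and the submission writer keep the same shape.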

Page 19

Evaluate submission

Typical evaluation metrics

Area under the ROC curve
Normalized Gini coefficient
RMSLE (root mean squared logarithmic error)
…
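To make the three metrics named on this slide concrete, here is a small pure-Python sketch of each. These are illustrative implementations under simple assumptions (binary 0/1 labels for AUC, ties in the Gini ordering broken by row position); each competition's evaluation page defines the exact variant it uses.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def auc(labels, scores):
    """Area under the ROC curve as the share of (positive, negative)
    pairs ranked correctly; tied scores count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def gini(actual, predicted):
    """Unnormalized Gini: sort rows by prediction (descending) and
    accumulate the share of the actual total captured so far."""
    n = len(actual)
    order = sorted(range(n), key=lambda i: (-predicted[i], i))
    total = sum(actual)
    running = 0.0
    gini_sum = 0.0
    for i in order:
        running += actual[i]
        gini_sum += running / total
    gini_sum -= (n + 1) / 2
    return gini_sum / n


def normalized_gini(actual, predicted):
    """Gini of the predictions divided by the Gini of a perfect ordering."""
    return gini(actual, predicted) / gini(actual, actual)


# Made-up labels and scores for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc(labels, scores)
gini_value = normalized_gini(labels, scores)
```

For binary labels without tied scores, the normalized Gini coefficient equals 2 * AUC - 1, which is a handy sanity check when moving between the two.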

Page 20

Submit entry: submission.csv

Public leaderboard: scored on ~10-30% of test data
Private leaderboard: scored on the rest of the test data

Page 21

Choose two entries for the final ranking

Practical choice: the best entry on the public leaderboard + the entry with the best local CV score
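The practical choice described here can be expressed as a small selection rule over your own record of entries. The sketch below assumes a hypothetical list of entries with made-up ids and scores, and a metric where higher is better; for error metrics such as RMSLE, `max` would become `min`.

```python
# Hypothetical record of entries; ids and scores are made up.
entries = [
    {"id": "sub_01", "cv": 0.371, "public_lb": 0.380},
    {"id": "sub_02", "cv": 0.385, "public_lb": 0.379},
    {"id": "sub_03", "cv": 0.379, "public_lb": 0.388},
]


def choose_final_entries(entries):
    """Pick the best public-leaderboard entry, then the best local-CV
    entry among the remaining ones (assumes higher scores are better)."""
    best_public = max(entries, key=lambda e: e["public_lb"])
    remaining = [e for e in entries if e is not best_public]
    best_cv = max(remaining, key=lambda e: e["cv"])
    return best_public["id"], best_cv["id"]


finals = choose_final_entries(entries)
```

Picking the second entry from the remaining ones avoids selecting the same submission twice when one entry leads on both scores.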

Page 22

Tips

Page 23

Look at data. Visualize it.

Sources:
https://www.kaggle.com/justfor/liberty-mutual-group-property-inspection-prediction/explore-data/notebook
https://www.kaggle.com/odiseo1982/liberty-mutual-group-property-inspection-prediction/compare-variables-between-train-and-test/files

Page 24

Focus on feature engineering

Feature selection
Feature construction

Dates
Locations
Categories
Segmentation
Statistics
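As one illustration of feature construction from dates and categories, the sketch below derives calendar features from a timestamp string and replaces each category value with its frequency. The timestamp format and the feature names are assumptions made for this example; real competition data will need its own parsing.

```python
from collections import Counter
from datetime import datetime


def date_features(timestamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string (assumed format) into
    calendar features a model can use directly."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": dt.year,
        "month": dt.month,
        "weekday": dt.weekday(),  # Monday is 0, Sunday is 6
        "hour": dt.hour,
        "is_weekend": int(dt.weekday() >= 5),
    }


def frequency_encode(values):
    """Replace each category value by how often it occurs in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]


# Made-up example inputs.
features = date_features("2015-09-01 18:30:00")
encoded = frequency_encode(["red", "blue", "red", "green"])
```

Frequency encoding is only one of several ways to turn categories into numbers; one-hot encoding and target statistics are common alternatives.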

Page 25

Build different models

3 target variables x 4 cities = build 12 models

Source: https://www.kaggle.com/c/see-click-predict-fix/visualization/1390
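The "3 target variables x 4 cities" idea amounts to looping over every combination and fitting one model per segment. A minimal sketch follows; the city and target names are placeholders, the data is generated on the spot, and the per-segment "model" is again a toy mean predictor rather than anything from the cited competition.

```python
from itertools import product
from statistics import fmean

# Placeholder names; a real competition supplies its own cities and targets.
cities = ["city_a", "city_b", "city_c", "city_d"]
targets = ["target_1", "target_2", "target_3"]

# Made-up training rows: two rows per city.
rows = [
    {"city": city, "target_1": i, "target_2": 2 * i, "target_3": 3 * i}
    for i, city in enumerate(cities * 2)
]


def fit_mean_model(values):
    """Toy stand-in for a real model: predict the segment mean."""
    return fmean(values)


# One model per (city, target) combination: 4 x 3 = 12 models.
models = {}
for city, target in product(cities, targets):
    segment = [row[target] for row in rows if row["city"] == city]
    models[(city, target)] = fit_mean_model(segment)
```

Keying the models by `(city, target)` makes it straightforward to look up the right model for each test row at prediction time.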

Page 26

Try different algorithms

Random forest
Vowpal Wabbit
GBM
XGBoost

Page 27

Build ensembles

Average of submissions
Weighted average of submissions
Ranked average of submissions

Stacked generalization
Blending
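The three averaging schemes listed above can be sketched in a few lines each. This is an illustrative version that operates on plain lists of predictions, one list per submission and all in the same row order; the weights are made up, and the rank average breaks ties by row position, which is a simplification. Stacked generalization and blending need a second-level model and are not shown.

```python
def average(predictions):
    """Plain average across submissions, row by row."""
    return [sum(col) / len(col) for col in zip(*predictions)]


def weighted_average(predictions, weights):
    """Weighted average across submissions; weights need not sum to 1."""
    total = sum(weights)
    return [sum(w * p for w, p in zip(weights, col)) / total
            for col in zip(*predictions)]


def normalized_ranks(values):
    """Map values to ranks scaled to [0, 1]; ties are broken by position.
    Requires at least two values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position / (len(values) - 1)
    return ranks


def rank_average(predictions):
    """Average the normalized ranks instead of the raw predictions."""
    return average([normalized_ranks(p) for p in predictions])


# Made-up predictions from two hypothetical submissions.
model_a = [0.1, 0.5, 0.9]
model_b = [0.3, 0.7, 0.5]

avg = average([model_a, model_b])
wavg = weighted_average([model_a, model_b], weights=[3, 1])
ravg = rank_average([model_a, model_b])
```

Rank averaging is mainly useful for ranking metrics such as AUC, where only the order of the predictions matters and the submissions may be on different scales.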

Page 28

Keep track of your submissions

Submission id
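One lightweight way to keep track of submissions is a CSV log keyed by submission id. The sketch below is an assumption about what such a log might contain (the field names, ids, and scores are hypothetical) and writes to a temporary directory so that it runs anywhere.

```python
import csv
import os
import tempfile

# Hypothetical log columns; adapt them to what you want to remember.
FIELDS = ["submission_id", "description", "cv_score", "public_lb_score"]


def log_submission(path, record):
    """Append one submission record, writing the header for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(record)


def read_log(path):
    """Read the whole log back as a list of dicts (values are strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


log_path = os.path.join(tempfile.mkdtemp(), "submissions_log.csv")

# Made-up entries for illustration.
log_submission(log_path, {
    "submission_id": "sub_01",
    "description": "baseline, mean prediction",
    "cv_score": 0.371,
    "public_lb_score": 0.380,
})
log_submission(log_path, {
    "submission_id": "sub_02",
    "description": "added date features",
    "cv_score": 0.385,
    "public_lb_score": 0.379,
})

log_rows = read_log(log_path)
```

Recording the local CV score next to the public leaderboard score makes it easier to choose the final entries at the end of the competition.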

Page 29

Next steps

Page 30

Start competing

Create a Kaggle account
Choose a competition
Go for it!

Page 31

Useful links

Kaggle Blog

http://blog.kaggle.com

Kaggle Competitions: Where to begin

http://www.analyticsvidhya.com/blog/2015/06/start-journey-kaggle/

Kaggle Feature Engineering

http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

Kaggle Ensembling Guide

http://mlwave.com/kaggle-ensembling-guide/