Machine learning for developers

83
@SrcMinistry @MariuszGil Machine Learning for Developers Data processing

Transcript of Machine learning for developers

Page 1: Machine learning for developers

@SrcMinistry @MariuszGil

Machine Learning for Developers

Data processing

Page 2: Machine learning for developers

@SrcMinistry

Page 3: Machine learning for developers

My story

Page 4: Machine learning for developers

Data classification

Page 5: Machine learning for developers

Bot detection

Page 6: Machine learning for developers

Minimize risk of error

Page 7: Machine learning for developers

Predictions

Page 8: Machine learning for developers

Click probability

Page 9: Machine learning for developers

Maximize CTR or eCPM

Page 10: Machine learning for developers

A lot of data

Page 11: Machine learning for developers

data + algo = result

Page 12: Machine learning for developers

Real problem

Page 13: Machine learning for developers
Page 14: Machine learning for developers

+ value estimator

Page 15: Machine learning for developers

+ chance of sell

Page 16: Machine learning for developers

+ $ optimization

Page 17: Machine learning for developers

Tens of thousands historical transactions

Page 18: Machine learning for developers

Tens of data components

Page 19: Machine learning for developers

Hundreds of data components

Page 20: Machine learning for developers

HOW?

Page 21: Machine learning for developers

Machine Learning Theory

Page 22: Machine learning for developers

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E

Tom M. Mitchell

Page 23: Machine learning for developers

Task

Page 24: Machine learning for developers

Typical ML techniquesClassification Regression Clustering Dimensionality reduction Association learning

Page 25: Machine learning for developers

oooo

ooo

ooo oo

o oo

oo o o oo

ooo

oo

o

feature 1

feat

ure

2

Page 26: Machine learning for developers

oooo

ooo

ooo oo

o oo

oo o o oo

ooo

oo

o

feature 1

feat

ure

2

Page 27: Machine learning for developers

oooo

ooo

ooo oo

o oo

oo o o oo

ooo

oo

o

feature 1

feat

ure

2

Page 28: Machine learning for developers

Experience

Page 29: Machine learning for developers

Typical ML paradigmsSupervised learning Unsupervised learning Reinforcement learning

Page 30: Machine learning for developers

Accuracy

Page 31: Machine learning for developers

Substantive Expertise

Hacking skillsMath & Statistics

Knowledge

TraditionalResearch

DangerZone!

MachineLearning

DataScience

Page 32: Machine learning for developers

Substantive

Expertise

Hacking skillsMath & Statistics

Knowledge

Evil

Outside CommitteeMember

Not that dangerous, in retrospect

MachineLearning

James BondVillain

NSA

Data Science

That Guy Who Stole Your Online Identity Thesis Advisor

Grad School Mate

Page 33: Machine learning for developers

Machine Learning Practice

Page 34: Machine learning for developers
Page 35: Machine learning for developers

data + algo = result

Page 36: Machine learning for developers

+-------+--------+------+--------+---------+-------+ | brand | model | year | milage | service | price | +-------+--------+------+--------+---------+-------+ | ford | mondeo | 2005 | 123000 | 9900 | 67000 | +-------+--------+------+--------+---------+-------+ | ford | mondeo | 2005 | 175000 | 9900 | 30000 | +-------+--------+------+--------+---------+-------+ | ford | focus | 2010 | 45000 | 6700 | 30000 | +-------+--------+------+--------+---------+-------+

Page 37: Machine learning for developers

Learning Data

Algorithm Learning

Classifier ModelReal Data Classification

Page 38: Machine learning for developers

Failure recipe

Page 39: Machine learning for developers

+-------+--------+------+--------+---------+-------+ | brand | model | year | milage | service | price | +-------+--------+------+--------+---------+-------+ | ford | mondeo | 2005 | 123000 | 9900 | 67000 | +-------+--------+------+--------+---------+-------+ | ford | mondeo | 2005 | 175000 | 9900 | 30000 | +-------+--------+------+--------+---------+-------+ | ford | focus | 2010 | 45000 | 6700 | 30000 | +-------+--------+------+--------+---------+-------+

Page 40: Machine learning for developers

+-------+--------+------+--------+---------+--------+-------+ | brand | model | year | milage | service | repair | price | +-------+--------+------+--------+---------+--------+-------+ | ford | mondeo | 2005 | 123000 | 9000 | 900 | 67000 | +-------+--------+------+--------+---------+--------+-------+ | ford | mondeo | 2005 | 175000 | 900 | 9000 | 30000 | +-------+--------+------+--------+---------+--------+-------+ | ford | focus | 2010 | 45000 | 3700 | 3000 | 30000 | +-------+--------+------+--------+---------+--------+-------+

Page 41: Machine learning for developers

+-------+--------+------+--------+---------+--------+-------+ | brand | model | year | milage | service | repair | price | +-------+--------+------+--------+---------+--------+-------+ | ford | mondeo | 2005 | 123000 | 9000 | 900 | 67000 | +-------+--------+------+--------+---------+--------+-------+ | ford | mondeo | 2005 | 175000 | 900 | 9000 | 30000 | +-------+--------+------+--------+---------+--------+-------+ | ford | mondeo | 2005 | 175000 | 900 | 9000 | 45000 | +-------+--------+------+--------+---------+--------+-------+ | ford | focus | 2010 | 45000 | 3700 | 3000 | 30000 | +-------+--------+------+--------+---------+--------+-------+

Page 42: Machine learning for developers

+-------+--------+-----+------+--------+---------+--------+-------+ | brand | model | gen | year | milage | service | repair | price | +-------+--------+-----+------+--------+---------+--------+-------+ | ford | mondeo | 4 | 2005 | 123000 | 9000 | 900 | 67000 | +-------+--------+-----+------+--------+---------+--------+-------+ | ford | mondeo | 3 | 2005 | 175000 | 900 | 9000 | 30000 | +-------+--------+-----+------+--------+---------+--------+-------+ | ford | mondeo | 4 | 2005 | 175000 | 900 | 9000 | 45000 | +-------+--------+-----+------+--------+---------+--------+-------+ | ford | focus | 4 | 2010 | 45000 | 3700 | 3000 | 30000 | +-------+--------+-----+------+--------+---------+--------+-------+

Page 43: Machine learning for developers

+-------+--------+-----+------+--------+---------+--------+------+---------------+-------+ | brand | model | gen | year | milage | service | repair | igla | crying German | price | +-------+--------+-----+------+--------+---------+--------+------+---------------+-------+ | ford | mondeo | 4 | 2005 | 123000 | 9000 | 900 | 0 | 0 | 67000 | +-------+--------+-----+------+--------+---------+--------+------+---------------+-------+ | ford | mondeo | 3 | 2005 | 175000 | 900 | 9000 | 1 | 1 | 30000 | +-------+--------+-----+------+--------+---------+--------+------+---------------+-------+ | ford | mondeo | 4 | 2005 | 175000 | 900 | 9000 | 0 | 0 | 45000 | +-------+--------+-----+------+--------+---------+--------+------+---------------+-------+ | ford | focus | 4 | 2010 | 45000 | 3700 | 3000 | 1 | 0 | 30000 | +-------+--------+-----+------+--------+---------+--------+------+---------------+-------+

Page 44: Machine learning for developers

Understand your data first

Page 45: Machine learning for developers

Exploratory analysis

Page 46: Machine learning for developers

http://blogs.adobe.com/digitalmarketing/wp-content/uploads/2013/08/aq2.jpg

Page 47: Machine learning for developers

ML pipeline

Page 48: Machine learning for developers

Raw Data Collection

Pre-processing

Sampling

Training Dataset

Algorithm Training

Optimization

Post-processing

Final model

Pre-processingFeature Selection

Feature Scaling

Dimensionality Reduction

Performance Metrics

Model Selection

Test Dataset

Cro

ss V

alid

atio

n

Final ModelEvaluation

Pre-processing Classification

Missing Data

Feature Extraction

DataSplit

Data

Page 49: Machine learning for developers

Raw Data Collection

Pre-processing

Sampling

Training Dataset

Algorithm Training

Optimization

Final model

Pre-processingFeature Selection

Feature Scaling

Dimensionality Reduction

Performance Metrics

Model Selection

Test Dataset

Cro

ss V

alid

atio

n

Final ModelEvaluation

Pre-processing Classification

Missing Data

Feature Extraction

DataSplit

Post-processing

Data

Page 50: Machine learning for developers

Classification algorithmsLinear Classification Logistic Regression Linear Discriminant Analysis PLS Discriminant Analysis

Non-Linear Classification Mixture Discriminant Analysis Quadratic Discriminant Analysis Regularized Discriminant Analysis Neural Networks Flexible Discriminant Analysis Support Vector Machines k-Nearest Neighbor Naive Bayes

Decission Trees for Classification Classification and Regression Trees C4.5 PART Bagging CART Random Forest Gradient Booster Machines Boosted 5.0

Page 51: Machine learning for developers

Regression algorithmsLinear Regiression Ordinary Least Squares Regression Stepwise Linear Regression Prinicpal Component Regression Partial Least Squares Regression

Non-Linear Regression / Penalized Regression Ridge Regression Least Absolute Shrinkage ElasticNet Multivariate Adaptive Regression Support Vector Machines k-Nearest Neighbor Neural Network

Decission Trees for Regression Classification and Regression Trees Conditional Decision Tree Rule System Bagging CART Random Forest Gradient Boosted Machine Cubist

Page 52: Machine learning for developers

Algorithm is only element in the ML chain

Page 53: Machine learning for developers

Demo #1

Page 54: Machine learning for developers

> dataset(iris)

> head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa

> tail(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 145 6.7 3.3 5.7 2.5 virginica 146 6.7 3.0 5.2 2.3 virginica 147 6.3 2.5 5.0 1.9 virginica 148 6.5 3.0 5.2 2.0 virginica 149 6.2 3.4 5.4 2.3 virginica 150 5.9 3.0 5.1 1.8 virginica

> plot(iris[,1:4])

Page 55: Machine learning for developers
Page 56: Machine learning for developers

> library(mclust)

> class = iris$Species

> mod2 = MclustDA(iris[,1:4], class, modelType = „EDDA")

> table(class)

class setosa versicolor virginica 50 50 50

Page 57: Machine learning for developers

> summary(mod2)

------------------------------------------------ Gaussian finite mixture model for classification ------------------------------------------------

EDDA model summary:

log.likelihood n df BIC -187.7097 150 36 -555.8024 Classes n Model G setosa 50 VEV 1 versicolor 50 VEV 1 virginica 50 VEV 1

Training classification summary:

Predicted Class setosa versicolor virginica setosa 50 0 0 versicolor 0 47 3 virginica 0 0 50

Training error = 0.02

> plot(mod2, what = "scatterplot")

Page 58: Machine learning for developers
Page 59: Machine learning for developers

Demo #2

Page 60: Machine learning for developers

> head(titanic.raw)

Class Sex Age Survived 1 3rd Male Child No 2 3rd Male Child No 3 3rd Male Child No 4 3rd Male Child No 5 3rd Male Child No 6 3rd Male Child No

> tail(titanic.raw)

Class Sex Age Survived 2196 Crew Female Adult Yes 2197 Crew Female Adult Yes 2198 Crew Female Adult Yes 2199 Crew Female Adult Yes 2200 Crew Female Adult Yes 2201 Crew Female Adult Yes

> summary(titanic.raw)

Class Sex Age Survived 1st :325 Female: 470 Adult:2092 No :1490 2nd :285 Male :1731 Child: 109 Yes: 711 3rd :706 Crew:885

Page 61: Machine learning for developers

> library(arules)

Ładowanie wymaganego pakietu: Matrix Dołączanie pakietu: ‘arules’

> rules <- apriori(titanic.raw)

Apriori

Parameter specification: confidence minval smax arem aval originalSupport support minlen maxlen target ext 0.8 0.1 1 none FALSE TRUE 0.1 1 10 rules FALSE

Algorithmic control: filter tree heap memopt load sort verbose 0.1 TRUE TRUE FALSE TRUE 2 TRUE

Absolute minimum support count: 220

set item appearances ...[0 item(s)] done [0.00s]. set transactions ...[10 item(s), 2201 transaction(s)] done [0.00s]. sorting and recoding items ... [9 item(s)] done [0.00s]. creating transaction tree ... done [0.00s]. checking subsets of size 1 2 3 4 done [0.00s]. writing ... [27 rule(s)] done [0.00s]. creating S4 object ... done [0.00s].

Page 62: Machine learning for developers

> rules <- apriori(titanic.raw, + parameter = list(minlen=2, supp=0.005, conf=0.8), + appearance = list(rhs=c("Survived=No", "Survived=Yes"), + default="lhs"), + control = list(verbose=F))

> rules.sorted <- sort(rules, by="lift")

> subset.matrix <- is.subset(rules.sorted, rules.sorted)

> subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA

> redundant <- colSums(subset.matrix, na.rm=T) >= 1

> which(redundant)

{Class=2nd,Sex=Female,Age=Child,Survived=Yes} 2 {Class=1st,Sex=Female,Age=Adult,Survived=Yes} 4 {Class=Crew,Sex=Female,Age=Adult,Survived=Yes} 7 {Class=2nd,Sex=Female,Age=Adult,Survived=Yes} 8

> rules.pruned <- rules.sorted[!redundant]

Page 63: Machine learning for developers

> inspect(rules.pruned)

lhs rhs support confidence lift 1 {Class=2nd,Age=Child} => {Survived=Yes} 0.010904134 1.0000000 3.095640 4 {Class=1st,Sex=Female} => {Survived=Yes} 0.064061790 0.9724138 3.010243 2 {Class=2nd,Sex=Female} => {Survived=Yes} 0.042253521 0.8773585 2.715986 5 {Class=Crew,Sex=Female} => {Survived=Yes} 0.009086779 0.8695652 2.691861 9 {Class=2nd,Sex=Male,Age=Adult} => {Survived=No} 0.069968196 0.9166667 1.354083 3 {Class=2nd,Sex=Male} => {Survived=No} 0.069968196 0.8603352 1.270871 12 {Class=3rd,Sex=Male,Age=Adult} => {Survived=No} 0.175829169 0.8376623 1.237379 6 {Class=3rd,Sex=Male} => {Survived=No} 0.191731031 0.8274510 1.222295

Page 64: Machine learning for developers

Applications

Page 65: Machine learning for developers
Page 66: Machine learning for developers
Page 67: Machine learning for developers
Page 68: Machine learning for developers
Page 69: Machine learning for developers

Tools

Page 70: Machine learning for developers
Page 71: Machine learning for developers
Page 72: Machine learning for developers
Page 73: Machine learning for developers
Page 74: Machine learning for developers
Page 75: Machine learning for developers
Page 76: Machine learning for developers
Page 77: Machine learning for developers
Page 78: Machine learning for developers

Benefits & Problems

Page 79: Machine learning for developers

oooo

ooo

ooo oo

o oo

oo o o oo

ooo

oo

o

feature 1

feat

ure

2

o

o

Page 80: Machine learning for developers

Does it do well onthe training data?

Does it do well onthe test data?

Better features /Better parameters

More data

Done!

No No

Yes

by Andrew Ng

Page 81: Machine learning for developers

Understand your needs first

Page 82: Machine learning for developers

Tools will change Ideas are immortal

Page 83: Machine learning for developers

@SrcMinistry

Thanks!

@MariuszGil