QCon Rio - Machine Learning for Everyone


@dhianadeva

MACHINE LEARNING FOR EVERYONE

Demystifying machine learning!

AGENDA

Goal:

Encourage you to start a machine learning project. Today!

● About me
● About you
● Machine Learning
● Problems
● Design
● Algorithms
● Evaluation
● Code snippets
● Pay-as-you-go
● Competitions

ABOUT ME

Electronics Engineering, Software Development and Data Science… Why not?

DHIANA DEVA

NEURALTB

CERN

NEURALRINGER

DJBRAZIL

HIGGS CHALLENGE

ABOUT YOU

You can do it!

FOR ALL

MASSIVE ONLINE OPEN COURSES

OPEN SOURCE TOOLS

OPEN SOURCE PYTHON TOOLS

PAY-AS-YOU-GO SERVICES

MACHINE LEARNING

Learning, machine learning!

EXPECTATIONS

REALITY

FEATURE EXTRACTION

Item → { Feature 1, Feature 2, …, Feature N }

FEATURE SPACE

SUPERVISED LEARNING

Diagram: items represented as feature vectors (e.g. 458316,86.513,24.312,64.983,65.8) plus their expected labels (e.g. 133, 150, 623, 74) feed the machine learning algorithm, which produces a predictive model; the feature vector of a new item then goes through the model to obtain its expected label.
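
A minimal scikit-learn sketch of this flow (the feature vectors and labels below are made up for illustration):

from sklearn.ensemble import RandomForestClassifier

# Items as feature vectors, with one expected label per item
X = [[86.513, 24.312, 64.983, 65.8],
     [135.493, 2.204, 101.966, 46.5],
     [91.803, 113.007, 120.0, 33.1]]
y = [133, 150, 623]

model = RandomForestClassifier(n_estimators=10).fit(X, y)  # learn the predictive model
print(model.predict([[100.0, 50.0, 80.0, 60.0]]))          # expected label for a new item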

UNSUPERVISED LEARNING

Diagram: items represented as feature vectors only (no labels) feed the machine learning algorithm, which produces a predictive model; the model maps the feature vector of a new item to a better representation.
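
A minimal sketch of the same flow without labels, using K-Means from scikit-learn as a stand-in for the algorithm (data is made up):

from sklearn.cluster import KMeans

X = [[86.513, 24.312], [135.493, 2.204], [91.803, 113.007], [88.0, 25.0]]  # no labels
model = KMeans(n_clusters=2, n_init=10).fit(X)   # learn structure from the items alone
print(model.predict([[90.0, 30.0]]))             # better representation: cluster id of a new item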

MODELS

BIOLOGICAL MOTIVATION

PROBLEMS

I've got 99 problems, but machine learning ain't one!

CLASSIFICATION


REGRESSION


CLUSTERING


DIMENSIONALITY REDUCTION


DESIGN DECISIONS

1, 2 steps!

NORMALIZATION

● min-max

● z-score
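
Both options in scikit-learn (the values below are made up):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = [[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]]
print(MinMaxScaler().fit_transform(X))    # min-max: rescale each feature to [0, 1]
print(StandardScaler().fit_transform(X))  # z-score: zero mean, unit variance per feature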

TRAINING

REGULARIZATION

RELEVANCE ANALYSIS

CROSS VALIDATION
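
A minimal cross-validation sketch with scikit-learn (synthetic data; the model choice is just for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # 5-fold cross validation
print(scores.mean(), scores.std())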

ALGORITHMS

Cheat sheet included!

CHEAT SHEET


ALGORITHMS PT. I

Linear Regression

Decision Trees, Random Forest
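
A sketch of the three models in scikit-learn, which share the same fit/predict interface (toy data):

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = [[1], [2], [3], [4]], [2.0, 4.1, 5.9, 8.2]
for model in (LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(n_estimators=10)):
    print(model.__class__.__name__, model.fit(X, y).predict([[5]]))  # fit, then predict a new point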

ALGORITHMS PT. II

K-Nearest Neighbors, K-Means
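
A sketch contrasting the two in scikit-learn: K-Nearest Neighbors needs labels, K-Means does not (toy data):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[0, 0], [0, 1], [10, 10], [10, 11]]
knn = KNeighborsClassifier(n_neighbors=1).fit(X, ['A', 'A', 'B', 'B'])  # supervised: needs labels
print(knn.predict([[9, 9]]))
km = KMeans(n_clusters=2, n_init=10).fit(X)                             # unsupervised: finds groups on its own
print(km.labels_)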

NEURAL NETWORKS
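
A small feed-forward network sketch using scikit-learn's MLPClassifier as a stand-in for the networks on the slides (toy XOR data):

from sklearn.neural_network import MLPClassifier

X, y = [[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 0]
net = MLPClassifier(hidden_layer_sizes=(5,), solver='lbfgs',
                    max_iter=5000, random_state=0).fit(X, y)  # one hidden layer of 5 units
print(net.predict(X))                                         # predictions for the training points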

SELF-ORGANIZING MAPS

PRINCIPAL COMPONENTS ANALYSIS

T-SNE
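
A sketch projecting items down to two dimensions with PCA and t-SNE in scikit-learn (the Iris dataset stands in for real data):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = load_iris().data
print(PCA(n_components=2).fit_transform(X)[:3])               # linear projection
print(TSNE(n_components=2, init='pca').fit_transform(X)[:3])  # non-linear embedding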

EVALUATION

How you doin'?

PRECISION AND ACCURACY

CONFUSION MATRIX

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = 2 * precision * recall / (precision + recall)

TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives
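
The same metrics computed with scikit-learn (the labels below are made up):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))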

ROC CURVE
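
A minimal ROC sketch with scikit-learn (made-up scores):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
print(fpr, tpr, roc_auc_score(y_true, scores))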

A/B TESTS

Code Snippets

"Hello, Machine Learning"

MATLAB 101

[x,y] = ovarian_dataset;
net = patternnet(5);
[net,tr] = train(net,x,y);
testX = x(:,tr.testInd);
testY = net(testX);

MATLAB 201

net = patternnet(14);
net.inputs{1}.processFcns = {'mapminmax', 'fixunknowns', 'processpca'};
net.inputs{1}.processParams{3}.maxfrac = 0.02;
net.trainFcn = 'trainlm';
net.performFcn = 'mse';
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
[net, tr] = train(net, train_inputs, train_targets);
outputs = net(test_inputs);

R

library(randomForest)
raw.orig <- read.csv(file="train.txt", header=T, sep="\t")
frmla <- Metal ~ OTW + AirDecay + Koc
fit.rf <- randomForest(frmla, data=raw.orig)
print(fit.rf)
importance(fit.rf)

SCIKIT LEARN

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

dataset = pd.read_csv('Data/train.csv')
target = dataset.Activity.values
train = dataset.drop('Activity', axis=1).values
test = pd.read_csv('Data/test.csv').values
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf.fit(train, target)
predicted_probs = [x[1] for x in rf.predict_proba(test)]
importances = rf.feature_importances_

PAY-AS-YOU-GO SERVICES

Amazon Machine Learning

AMAZON MACHINE LEARNING

Five easy steps:
1. Upload CSV dataset to Amazon S3
2. Create Datasource with metadata about the uploaded dataset
3. Create ML Model with configurations for model training
4. Create Evaluation to analyse and tune model efficiency
5. Create Prediction to use the trained model with new data
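
A rough sketch of steps 2-5 with boto3's MachineLearning client; every ID, S3 path, and feature name below is a placeholder, and the exact parameters should be checked against the AWS documentation:

import boto3

ml = boto3.client('machinelearning')

ml.create_data_source_from_s3(                       # step 2: datasource over the uploaded CSV
    DataSourceId='my-datasource',
    DataSpec={'DataLocationS3': 's3://my-bucket/train.csv',
              'DataSchemaLocationS3': 's3://my-bucket/train.csv.schema'},
    ComputeStatistics=True)
ml.create_ml_model(                                  # step 3: train a model on the datasource
    MLModelId='my-model', MLModelType='BINARY',
    TrainingDataSourceId='my-datasource')
ml.create_evaluation(                                # step 4: evaluate on a held-out datasource
    EvaluationId='my-eval', MLModelId='my-model',
    EvaluationDataSourceId='my-eval-datasource')
prediction = ml.predict(                             # step 5: real-time prediction for a new record
    MLModelId='my-model',                            # (requires a real-time endpoint created beforehand)
    Record={'feature_1': '86.513'},
    PredictEndpoint='https://realtime.machinelearning.us-east-1.amazonaws.com')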

EVALUATION

SDKs

DATA SCIENCE COMPETITIONS

Challenge accepted!

KAGGLE

SPONSORED

END TO END

Pipeline: TRAIN.CSV → train → TRAINED.DAT; TRAINED.DAT + TEST.CSV → run → SOLUTION.CSV

THANK YOU

Questions?

Dhiana Deva

ddeva@thoughtworks.com