Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data....

86
Data science using Big Data. Pragmatic approach. Dmitri Babaev Pavel Mesentsev

Transcript of Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data....

Page 1: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

 Data science using Big Data. Pragmatic approach.

Dmitri Babaev Pavel Mesentsev

Page 2: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Predictive models design

Page 3: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Predictive models design

Page 4: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Predictive models design

Page 5: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Predictive models design

Page 6: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Predictive models design

Page 7: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Small data case

Page 8: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach
Page 9: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach
Page 10: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach
Page 11: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach
Page 12: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach
Page 13: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach
Page 14: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach
Page 15: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach
Page 16: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach
Page 17: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Big data case

Page 18: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Big data case

Page 19: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Big data case

Page 20: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Big data case

Page 21: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Big data case

Page 22: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Big data case

Page 23: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Apache Spark case

Page 24: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 25: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 26: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 27: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 28: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 29: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 30: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 31: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 32: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 33: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 34: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 35: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 36: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 37: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 38: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 39: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 40: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 41: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 42: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 43: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 44: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 45: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

How does its works

Page 46: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 47: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 48: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 49: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 50: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 51: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 52: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 53: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 54: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 55: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 56: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 57: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 58: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 59: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 60: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 61: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark Sql

Page 62: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Apache Spark case

Page 63: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Apache Spark in

Large Scale Machine Learning

Page 64: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

LSML issues•Too much samples to classify

•Training data does not fit in memory

•Too much training samples

•Too much models to train

Page 65: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Why not MLLib?• MLLib is less stable

• too few algorithms comparing to scikit-learn

• ML pipelines are not so mature than in scikit-learn

• e. g. there is no simple way to use logistic regression for feature selection

• MLLib python API falls behind Java/Scala API

• MLLib is actively developed and may be feasible choice in near future

Page 66: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark + scikit-learn = ?• Parallel training

• meta-parameter grid search

• parallel one-vs-rest for multi-class models

• same features but different targets

• parallel bagging and ensembles

• parallel learning for multi-step classification

• Parallel prediction

Page 67: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed learning

Page 68: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed learning

Page 69: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed learning

Page 70: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed learning

Page 71: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed learning

Page 72: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed learning

Page 73: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed learning

Page 74: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed learning

Page 75: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed learning

Page 76: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed prediction

Page 77: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed prediction

Page 78: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed prediction

Page 79: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed prediction

Page 80: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed prediction

Page 81: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed prediction

Page 82: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Distributed prediction

Page 83: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Models management•A complex ML task can be expressed as scikit-

learn pipeline

•e. g. feature scaling, then LR feature selection

then GBT classification

•Trained models can be stored on HDFS / S3 along

with training report and loaded for prediction

Page 84: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Alternative approaches to LSML•partial / online learning

•e. g. stochastic gradient descent

•distributed stochastic gradient descent

• average sub-gradients to get the gradient

• is implemented in MLLib

Page 85: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Spark is everywhere

•Mahout on spark aka Samsara

•H20 Sparkling water

Page 86: Дмитрий Бабаев (ft. Павел Мезенцев): Data science using Big Data. Pragmatic approach

Thank you!Questions?

Dmitri Babaev [email protected] Pavel Mezentsev [email protected]