Spark & Machine Learning Workflows
Juliet Hougland @j_houg
© Cloudera, Inc. All rights reserved.
Spark Execution Model
Modeling Lifecycle
Historic Data → Model Training → Persisted Model
Persisted Model + New Data → Model Scoring → Model Result
Model Training
Historic Data → Training Data, Test Data → Model Pipeline: Featurization, Model Fitting → Persisted Model, Evaluation
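A minimal sketch of the split step in the training diagram, assuming an 80/20 split like the `test_size=0.2` used in the scikit-learn snippets in this deck. The function name `split_historic_data` and the fixed seed are hypothetical, and the rounding rule is a simple stand-in rather than scikit-learn's exact behavior.

```python
import random

def split_historic_data(rows, test_size=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as test data."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)          # reproducible shuffle
    n_test = int(round(len(rows) * test_size)) # size of the held-out set
    return rows[n_test:], rows[:n_test]        # (training data, test data)

training_data, test_data = split_historic_data(range(10))
```

The training rows feed the model pipeline, and the held-out rows are only used for evaluation.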
Pipelines
Real Example: Churn Prediction for a Telco
The Dataset
KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False
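The sample rows can be read with the standard `csv` module. This is only a sketch: it assumes the last field is the churn label (the slides do not name the columns), and it strips the trailing period that appears after most labels. The helper name `parse_rows` is hypothetical.

```python
import csv
import io

# Two of the sample rows, copied verbatim from the slide.
RAW_ROWS = """\
KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False
"""

def parse_rows(text):
    """Split each record into (feature strings, churned flag)."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    records = []
    for fields in reader:
        if not fields:
            continue                               # skip blank lines
        label = fields[-1].strip().rstrip(".")     # "False." -> "False"
        records.append((fields[:-1], label == "True"))
    return records

records = parse_rows(RAW_ROWS)
```

Each record keeps its 20 feature values as strings, so casting to numeric types is left to the featurization step.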
Scikit-learn Pipelines
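from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

X, Y = get_data()
gbr = GradientBoostingClassifier()
# Hold out 20% of the data for testing
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
gbr.fit(X_train, Y_train)
# A classifier produces predictions with predict(), not transform()
Y_predicted = gbr.predict(X_test)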
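Scikit-learn Pipelines
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X, Y = get_data()
# One-hot encode the categorical columns, then fit the classifier.
# categorical_features was removed in scikit-learn 0.22; newer releases use ColumnTransformer.
pipeline = Pipeline([
    ('ohe', OneHotEncoder(categorical_features=[0, 20])),
    ('gbr', GradientBoostingClassifier()),
])

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
pipeline.fit(X_train, Y_train)
# A pipeline that ends in a classifier predicts; it has no transform()
Y_predicted = pipeline.predict(X_test)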
Apache Spark MLLib Pipelines
MLLib Pipelines
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

# Index the string label and the categorical plan column
label_indexer = StringIndexer(inputCol = 'churned', outputCol = 'label')
plan_indexer = StringIndexer(inputCol = 'intl_plan', outputCol = 'intl_plan_indexed')

# reduced_numeric_cols is assumed to be a list of numeric column names defined earlier
assembler = VectorAssembler(
    inputCols = ['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol = 'features')
classifier = DecisionTreeClassifier(labelCol = 'label', featuresCol = 'features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])
Deploy!
Historic Data → Model Training → Persisted Model
Persisted Model + New Data → Model Scoring → Model Result
Well, how did you save your model?
You have a few options:
• Pickle
• Joblib
• PMML
• Custom
“Pickles are for delis”
• Insecure
• Not Portable
• Big
• Slow
http://pyvideo.org/pycon-us-2014/pickles-are-for-delis-not-software.html
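To make the "Insecure" point concrete, here is a small sketch of why loading an untrusted pickle is risky: the pickle stream itself names a callable to run at load time. The class name `Payload` is hypothetical, and `print` is a harmless stand-in for whatever an attacker might choose.

```python
import pickle

class Payload:
    """An object whose pickle tells the loader which callable to invoke."""
    def __reduce__(self):
        # Harmless stand-in: any importable callable could be named here.
        return (print, ("this ran during unpickling",))

blob = pickle.dumps(Payload())

# Loading calls print(...) and returns its result instead of a Payload.
result = pickle.loads(blob)
```

Here `result` is whatever the named callable returned, not a `Payload` instance, which is why pickled models should only be loaded from trusted sources.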
Storing Models as PMML
// Export a Spark MLLib model to a local file in PMML format
pipeline.toPMML("/path/to_my_file.xml")

# Export a scikit-learn model to a file in PMML format
from sklearn2pmml import sklearn2pmml
sklearn2pmml(iris_pipeline, "DecisionTreeIris.pmml", with_repr = True)
Spark PMML Export Supported Models
Distributed Model Fitting
Modeling Lifecycle
Historic Data → Model Training → Persisted Model
Persisted Model + New Data → Model Scoring → Model Result
Model Training
Historic Data → Training Data, Test Data → Model Pipeline: Featurization, Model Fitting → Persisted Model, Evaluation
MLLib Pipelines
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

# Index the string label and the categorical plan column
label_indexer = StringIndexer(inputCol = 'churned', outputCol = 'label')
plan_indexer = StringIndexer(inputCol = 'intl_plan', outputCol = 'intl_plan_indexed')

# reduced_numeric_cols is assumed to be a list of numeric column names defined earlier
assembler = VectorAssembler(
    inputCols = ['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol = 'features')
classifier = DecisionTreeClassifier(labelCol = 'label', featuresCol = 'features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])
Distributed Grid Search
Modeling Lifecycle
Historic Data → Model Training → Persisted Model
Persisted Model + New Data → Model Scoring → Model Result
Model Training
[Diagram: six parallel copies of the flow Training Data, Test → Model Pipeline → Persisted Model, Evaluation]
Fit multiple models… Serially
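from sklearn import ensemble, model_selection
from sklearn.model_selection import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

# One value per parameter except loss, so only two candidates
tuned_parameters = {
    "n_estimators": [ 300 ],
    "max_depth" : [ 4 ],
    "learning_rate": [ 0.01 ],
    "min_samples_split" : [ 1 ],
    "loss" : [ 'ls', 'lad' ]}

# Note: 'ls'/'lad' and median absolute error are regression settings;
# a classifier needs classification losses and a classification metric.
gbr = ensemble.GradientBoostingClassifier()
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters, scoring="neg_median_absolute_error")
clf.fit(X_train, Y_train)  # fit() returns the fitted search object, not predictions
best = clf.best_estimator_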
Fit multiple models… in Parallel
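from sklearn import ensemble, model_selection
from sklearn.model_selection import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [ 300, 400, 200 ],
    "max_depth" : [ 4, 3 ],
    "learning_rate": [ 0.01, 0.05, 0.001 ],
    "min_samples_split" : [ 1, 3 ],
    "loss" : [ 'ls', 'lad' ]}

# Note: 'ls'/'lad' and median absolute error are regression settings;
# a classifier needs classification losses and a classification metric.
gbr = ensemble.GradientBoostingClassifier()
# n_jobs=10 runs up to ten fits at once on the local machine
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters, scoring="neg_median_absolute_error", n_jobs=10, pre_dispatch=2)
clf.fit(X_train, Y_train)  # fit() returns the fitted search object, not predictions
best = clf.best_estimator_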
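The jump from the serial grid to the parallel one is easier to appreciate when the number of fits is counted. The sketch below copies the two parameter grids from the slides and counts candidates with `itertools.product`; the helper name `count_fits` is hypothetical, and the count covers cross-validation fits only (by default `GridSearchCV` also refits the best candidate once at the end).

```python
from itertools import product

# Grids copied from the serial and parallel examples.
serial_grid = {
    "n_estimators": [300],
    "max_depth": [4],
    "learning_rate": [0.01],
    "min_samples_split": [1],
    "loss": ["ls", "lad"],
}
parallel_grid = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [1, 3],
    "loss": ["ls", "lad"],
}

def count_fits(param_grid, cv):
    """Return (number of candidates, number of cross-validation fits)."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * cv

serial_candidates, serial_fits = count_fits(serial_grid, cv=3)
parallel_candidates, parallel_fits = count_fits(parallel_grid, cv=3)
```

With `cv=3`, the first grid has 2 candidates (6 fits) and the second has 72 candidates (216 fits). Each fit is independent of the others, which is what makes the search easy to spread across cores or a cluster.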
Fit multiple models… Distributed
https://bigdatapix.tumblr.com/
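from sklearn import ensemble, model_selection
# Same interface as scikit-learn's GridSearchCV, but the fits run on a Spark cluster
from spark_sklearn import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [ 300, 400, 200 ],
    "max_depth" : [ 4, 3 ],
    "learning_rate": [ 0.01, 0.05, 0.001 ],
    "min_samples_split" : [ 1, 3 ],
    "loss" : [ 'ls', 'lad' ]}

# Note: 'ls'/'lad' and median absolute error are regression settings;
# a classifier needs classification losses and a classification metric.
gbr = ensemble.GradientBoostingClassifier()
# Depending on the spark_sklearn version, the SparkContext may be required as the first argument
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters, scoring="neg_median_absolute_error")
clf.fit(X_train, Y_train)  # fit() returns the fitted search object, not predictions
best = clf.best_estimator_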
Distributed Model Scoring
What do you mean by “Deploy?”
Historic Data → Model Training → Persisted Model
Persisted Model + New Data → Model Scoring → Model Result
Scoring with REST Server
HTTP Request → Model Scoring (using the Persisted Model) → HTTP Response
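A rough sketch of the request/response flow using only the Python standard library. Everything model-related here is a hypothetical stand-in: `load_model`, the `service_calls` field, and the threshold rule are illustrations, not the talk's model. A real server would load the persisted model (for example from PMML) once at startup.

```python
import json
from http.server import BaseHTTPRequestHandler

def load_model():
    """Hypothetical stand-in for loading a persisted model from disk."""
    def predict(features):
        # Toy rule for illustration only: flag 4 or more service calls.
        return features.get("service_calls", 0) >= 4
    return predict

MODEL = load_model()  # loaded once, reused for every request

def score(body):
    """Turn a JSON request body into a JSON response body."""
    features = json.loads(body)
    return json.dumps({"churn": MODEL(features)}).encode("utf-8")

class ScoringHandler(BaseHTTPRequestHandler):
    """Answers each POST with the model's score for the posted features."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = score(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it, the handler can be passed to `http.server.HTTPServer` along with an address of your choice and started with `serve_forever()`. Keeping `score` separate from the handler makes the scoring logic testable without a running server.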
Distributed Batch Model Scoring: With REST Server
Distributed Batch Model Scoring
Historic Data → Model Training → Persisted Model
Persisted Model + New Data → Model Scoring → Model Result
Distributed Batch Model Scoring: With REST Server
Distributed Batch Model Scoring: With Spark + JPMML
File pmmlFile = ...;
Evaluator evaluator = EvaluatorUtil.createEvaluator(pmmlFile);
TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator)
    .withTargetCols()
    .withOutputCols()
    .exploded(false);
Transformer pmmlTransformer = pmmlTransformerBuilder.build();
Modeling Lifecycle
Historic Data → Model Training → Persisted Model
Persisted Model + New Data → Model Scoring → Model Result
Juliet Hougland @j_houg
Thank You!