Machine Learning with Apache Spark
Example of a complete ML pipeline
Matteo Migliorini, University of Padua, CERN IT-DB
Use case
• Topology classification with deep learning to improve real-time event selection at the LHC
[https://arxiv.org/abs/1807.00083]
• Improve the purity of data samples selected in real time at the Large Hadron Collider
• Different data representations were considered to train different multi-class classifiers:
• Both raw data and high-level features are utilised
Machine Learning Pipeline
The goals of this work are:
• Produce an example of a ML pipeline using Spark
• Test the performance of Spark at each stage
Data Ingestion
• Read ROOT files from EOS
• Produce HLF and LLF datasets
Feature Preparation
• Produce the input for each classifier
Model Development
• Find the best model using Grid/Random search
Training
• Train the best model on the entire dataset
Data Ingestion
EOS
Events filtering + HLF and LLF dataframes
• Electron or Muon with pT > 23 GeV
• Particle-based isolation < 0.45
• …
• LLF: list of 801 particles, each characterised by 19 features
• 14 HLF features
Input Size: ~2 TB
Output Size: 750 GB
Elapsed time: 4h
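The two selection cuts listed for the events filtering can be sketched in plain Python. This is only an illustration of the logic, not the Spark code used in the pipeline: the record layout and field names (`leptons`, `kind`, `pt`, `iso`) are hypothetical, the cuts elided by "…" are not covered, and applying both cuts to the same lepton is an assumption.

```python
from typing import Dict, List

# Thresholds quoted on the slide.
PT_MIN_GEV = 23.0
ISO_MAX = 0.45


def passes_selection(event: Dict) -> bool:
    """Return True if the event contains an electron or muon with
    pT above PT_MIN_GEV and particle-based isolation below ISO_MAX.

    Field names are hypothetical; both cuts are assumed to apply to
    the same lepton.
    """
    return any(
        lep["kind"] in ("electron", "muon")
        and lep["pt"] > PT_MIN_GEV
        and lep["iso"] < ISO_MAX
        for lep in event["leptons"]
    )


# Toy events, invented for illustration.
events: List[Dict] = [
    {"id": 1, "leptons": [{"kind": "muon", "pt": 30.0, "iso": 0.10}]},
    {"id": 2, "leptons": [{"kind": "electron", "pt": 20.0, "iso": 0.10}]},  # fails pT cut
    {"id": 3, "leptons": [{"kind": "muon", "pt": 40.0, "iso": 0.60}]},      # fails isolation cut
    {"id": 4, "leptons": []},                                               # no lepton at all
]

selected_ids = [e["id"] for e in events if passes_selection(e)]
print(selected_ids)
```

Only the first toy event survives both cuts.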
Feature Preparation
• Start from the output of the previous stage
• Prepare the input for each classifier and shuffle the dataframe
• Produce samples of different sizes
  • Undersampled dataset
  • Test dataset
Elapsed time: 2h
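A minimal sketch of what undersampling and shuffling mean here, assuming the undersampled dataset keeps an equal number of events per class. It uses plain Python lists rather than Spark dataframes, and the class names and row layout are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_key, seed=0):
    """Keep the same number of rows for every class (the size of the
    smallest class), then shuffle the result."""
    rng = random.Random(seed)

    # Group rows by class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    # Every class is reduced to the size of the rarest one.
    n_min = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))

    # Shuffle so that classes are interleaved.
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced sample with hypothetical class names.
rows = (
    [{"label": "class_0", "x": i} for i in range(50)]
    + [{"label": "class_1", "x": i} for i in range(20)]
    + [{"label": "class_2", "x": i} for i in range(5)]
)

balanced = undersample(rows, "label")
counts = Counter(row["label"] for row in balanced)
print(counts)
```

Each class ends up with as many rows as the rarest class, at the cost of discarding events from the larger classes.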
Model Development
Test dataset
100k events
Tests made with the HLF classifier:
• Trained 162 different models, changing topology and training parameters
• 3-fold cross validation
• Each node (executor) has two cores
Model #1
Model #2
Best Model
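To illustrate how a grid search combined with 3-fold cross validation is organised, here is a plain-Python sketch. The search space is hypothetical: its parameter names and values were chosen only so that the number of combinations matches the 162 models quoted above, and the actual grid is not given in the slides.

```python
from itertools import product

# Hypothetical search space (3 * 3 * 3 * 3 * 2 = 162 combinations).
grid = {
    "hidden_layers": [1, 2, 3],
    "units": [20, 50, 100],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [64, 128, 256],
    "optimizer": ["adam", "sgd"],
}

# One dict per candidate model configuration.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]


def k_fold_indices(n, k=3):
    """Yield (train_indices, validation_indices) for k contiguous folds
    over n samples. Each sample is used for validation exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


folds = list(k_fold_indices(10, k=3))

# Assuming every configuration is fitted once per fold.
n_fits = len(configs) * len(folds)
print(len(configs), n_fits)
```

Because the fits are independent of each other, they can be distributed across executors, which is what makes this stage a natural match for a cluster.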
Training
• Once the best model is found, we can train it on the full dataset
• Different tools can be used to train the best model
Results of the first tests
• Trained the HLF classifier on the “Undersampled dataset” (equal number of events for each class)
• Training with ~4M events took 17 minutes using dist-keras with 20 executors (2 cores each)
Promising results!
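As a rough back-of-the-envelope reading of the training figures, the quoted numbers can be turned into a throughput. This sketch assumes exactly 4M events and 17 minutes (both are approximate on the slide) and counts dataset events per second of wall-clock time; the number of training epochs is not stated, so this is not a count of samples processed.

```python
# Figures quoted on the slide (approximate).
n_events = 4_000_000          # ~4M events
elapsed_s = 17 * 60           # 17 minutes
n_executors = 20
cores_per_executor = 2

total_cores = n_executors * cores_per_executor

# Dataset events per second of wall-clock time, overall and per core.
events_per_s = n_events / elapsed_s
events_per_s_per_core = events_per_s / total_cores

print(f"{events_per_s:.0f} events/s, {events_per_s_per_core:.0f} events/s per core")
```

That is roughly 3,900 events per second across 40 cores, or about 98 events per second per core.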
Comments
• The pipeline works!
• For the first three stages Spark works very well
✓ Data Ingestion, Feature Preparation, Model Development
• There is still room for improvement: ongoing studies on UDF performance
๏ Training at scale
• It is possible to deploy an end-to-end ML pipeline using Spark
Easy to use!
• Use industry standard tools
• Python & Spark
• Notebooks [https://github.com/Mmiglio/SparkML]
• Notebooks are a great tool!
• They help keep the code organised and embed documentation and graphs
• Easy to share and collaborate
Further work
• Train the HLF classifier on a bigger sample and test different configurations (#Executor/#Cores)
• Test Particle-Sequence and Inclusive classifiers on a bigger dataset
• Results obtained using a small sample are consistent with the ones from the paper
• Train them on a bigger dataset
• Add the image classifier