Overview of ML and Big Data Tools at HEP Experiments
LHCP2019, Puebla, Mexico
Alfredo Castaneda, University of Sonora
Overview of ML and Big data Tools 1
With results from ATLAS, CMS, LHCb and ALICE
How are Machine Learning and Big Data connected?
• Machine learning (ML) is a branch of Artificial Intelligence (AI)
• ML uses algorithms and statistical models to train a computer to perform a task without being explicitly programmed to do so
• ML needs large datasets (big data) in order to achieve good accuracy in its predictions
• Big-data experiments are an excellent laboratory for testing ML algorithms on complex problems that could not be solved by traditional methods
• Manipulating big data requires several tools to process the information (most of them developed by data scientists)
The LHC: A big data experiment
• The Large Hadron Collider (LHC) is the world's largest and most powerful particle accelerator
• Two high-energy particle beams travel at close to the speed of light
• The beams are made to collide at four locations around the accelerator ring, corresponding to the positions of the four particle detectors (ATLAS, CMS, ALICE, LHCb)
• The volume of data collected at the LHC represents a considerable challenge for processing and storage
• The dataflow from all experiments during Run 2 was about 25 GB/s
In June 2017, CERN passed the milestone of 200 petabytes of data permanently archived in its tape libraries
Challenges for the HL-LHC
• A factor-of-10 increase in the particle rate is expected during the High-Luminosity LHC (HL-LHC) era
• The vast amount of data will be used to perform precision Higgs boson measurements and searches for new physics phenomena
The 10x increase in particle rate will demand new tools for data processing, particle identification, offline analysis, and many other tasks. Machine learning and big-data tools will be crucial during this period.
Representation of a pp collision during the HL-LHC era
ML and Big data methods currently used in HEP
Machine Learning
Artificial Neural Networks (ANNs)
Boosted Decision Trees (BDTs)
Convolutional Neural Networks (CNNs)
Recurrent Neural Networks (RNNs)
Generative Adversarial Networks (GANs)
Big Data
WLCG (Worldwide LHC Computing Grid)
Parallel computing
Databases (Oracle)
GPU resources
Hadoop services at CERN
For several years now, various ML and big-data methods/tools have been implemented in HEP experiments, in particular the LHC experiments. In ML the most important are related to artificial neural networks, where several architectures have been used. In big data, the parallelization of processes, distributed analysis, and the recent introduction of GPU resources are becoming crucial for day-to-day activities.
ML and big data areas of development in HEP
PID, reconstruction and triggering
Simulation
DQM
Hardware triggering
Computing
PID, reconstruction, Triggering
Jet classification
• Jet reconstruction/classification is one of the most important tasks in modern HEP experiments
• Jets can reveal important information about the nature of the process they originated from, whether it was:
• A quark-initiated jet
• Heavy quarks (b, c)
• Light quarks (u, d, s)
• A gluon-initiated jet
• The hadronic decay of vector bosons (W, Z), tops, and the Higgs
• Methods based on multivariate analysis and deep learning have shown good discrimination power between the different types of jets
Jet substructure using CNNs
• Differentiating a quark-initiated jet from a gluon-initiated one has broad applications in SM measurements and beyond-SM searches
• The approach is based on a state-of-the-art image-classification technique (CNNs), which is compared against classical jet-tagging schemes
(Figure: average jet images for a quark-initiated and a gluon-initiated jet)
ATL-PHYS-PUB-2017-017
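The idea behind jet-image classification can be sketched with NumPy alone: treat the calorimeter energy deposits around the jet axis as a 2-D image and pass it through a convolution, a non-linearity, and a pooled output score. The image, filters, and weights below are random stand-ins for illustration, not the trained ATLAS tagger:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation, as in most DL libraries)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def cnn_score(image, kernels, w_out, b_out):
    """Conv -> ReLU -> global average pool -> sigmoid: one score per jet image."""
    feats = np.array([np.maximum(conv2d(image, k), 0.0).mean() for k in kernels])
    return 1.0 / (1.0 + np.exp(-(feats @ w_out + b_out)))

# Toy 33x33 "jet image": energy deposits on an eta-phi grid around the jet axis
image = rng.random((33, 33))
kernels = rng.standard_normal((8, 3, 3))   # 8 random 3x3 filters
w_out = rng.standard_normal(8)
score = cnn_score(image, kernels, w_out, b_out=0.0)
print(score)   # a quark/gluon-like probability in (0, 1)
```

A real tagger would learn the filter and output weights from labeled simulated jets; only the forward pass is shown here.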
Heavy flavor jet tagging (b-tagging)
• The task of separating jets that contain b-hadrons from those initiated by lighter quarks is called b-tagging
• The significance of b-tagging is due to the prominence of b-hadron jets in the final states of interesting physics processes
• A typical heavy-hadron topology features one or more vertices that are displaced from the hard-scatter interaction point
• The longer b-hadron lifetime corresponds to a decay that will be associated with a secondary vertex
b-tagging using deep learning in ATLAS
• RNNIP is a Recurrent Neural Network-based b-tagging classifier
• It is based on the traditional IP3D tagger, which uses the impact parameter as the principal discriminator
• RNNIP represents jets as a sequence of tracks ordered by absolute transverse impact-parameter significance
• Each track is described by a vector of features: transverse and longitudinal impact parameters, their significances, jet pT fraction, etc.
• This four-class tagger predicts a probability for each flavor (b, c, tau, and light)
Diagram of the RNNIP tagger: Michela Paganini and ATLAS Collaboration 2018 J. Phys.: Conf. Ser. 1085 042031
(Figures: light-jet and c-jet rejection, RNNIP vs IP3D)
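The recurrent structure described above can be sketched in plain NumPy: order the tracks by absolute impact-parameter significance, feed them one at a time through a recurrent cell, and read out four flavour probabilities. The weights and track features below are random stand-ins, not the trained RNNIP model:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_flavour_probs(tracks, Wx, Wh, Wo):
    """Run a simple (Elman-style) RNN over a jet's track sequence and
    return probabilities for four flavour classes (b, c, tau, light)."""
    # Order tracks by |transverse IP significance|, as RNNIP does
    tracks = tracks[np.argsort(-np.abs(tracks[:, 0]))]
    h = np.zeros(Wh.shape[0])
    for x in tracks:                       # one recurrent step per track
        h = np.tanh(Wx @ x + Wh @ h)
    return softmax(Wo @ h)

n_feat, n_hidden = 4, 16   # per-track features: d0 sig, z0 sig, pT fraction, ...
Wx = rng.standard_normal((n_hidden, n_feat)) * 0.1
Wh = rng.standard_normal((n_hidden, n_hidden)) * 0.1
Wo = rng.standard_normal((4, n_hidden)) * 0.1

jet_tracks = rng.standard_normal((7, n_feat))   # 7 tracks in this toy jet
probs = rnn_flavour_probs(jet_tracks, Wx, Wh, Wo)
print(probs)   # four probabilities summing to 1
```

The recurrent cell lets the tagger handle jets with any number of tracks, which is the main advantage over fixed-input architectures.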
b-tagging using deep learning in CMS
• DeepCSV is a common algorithm for performing b-tagging
• It uses a deep neural network
• The input is the same set of observables used by the CSVv2 b-tagger: secondary-vertex and track-based lifetime information
• It uses four hidden layers (each 100 nodes wide)
CMS PAS BTV-15-001
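A network with the architecture quoted on the slide (four hidden layers of 100 nodes) can be set up in a few lines with scikit-learn. The inputs and labels here are synthetic stand-ins for the CSVv2 observables, so this only illustrates the shape of the model, not CMS's training:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)

# Synthetic stand-ins for the CSVv2 input observables
# (track lifetime information, secondary-vertex kinematics, ...)
n_jets, n_obs = 2000, 20
X = rng.standard_normal((n_jets, n_obs))
# Toy labels: pretend a combination of observables separates b from light jets
y = (X[:, :5].sum(axis=1) + 0.5 * rng.standard_normal(n_jets) > 0).astype(int)

# Four hidden layers of 100 nodes each, as on the slide
clf = MLPClassifier(hidden_layer_sizes=(100, 100, 100, 100),
                    max_iter=50, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)
print(f"training accuracy: {acc:.2f}")
```

In practice the discriminant output (predict_proba) is what analyses cut on, with the cut chosen to fix a target light-jet mistag rate.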
ML for object reconstruction in LHCb
• LHCb is a single-arm forward spectrometer
• Its main goal is to search for indirect evidence of new physics in CP violation and rare decays of beauty and charm baryons
• Since 2015 (Run 2) the experiment has employed a real-time analysis strategy for a large fraction of its physics programme
• Full physics analyses are performed directly on the objects reconstructed in the final stage of the software trigger, reducing the output event size
Deep learning in LHCb
Fake Track Rejection with Multilayer Perceptron
LHCb-PUB-2017-011
• A neural-network-based algorithm identifies fake tracks in the LHCb pattern recognition
• This algorithm ("ghost probability") retains more than 99% of well-reconstructed tracks while reducing the number of fake tracks by 60%
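A working point like the one quoted (keep ~99% of good tracks, reject a large fraction of ghosts) can be illustrated with a small multilayer perceptron on synthetic track features; the features, network size, and separation below are invented for the sketch and are not the LHCb implementation:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)

# Toy track features (chi2/ndf, hit counts, ...): real tracks vs ghosts
n = 4000
X_real = rng.normal(0.0, 1.0, size=(n, 6))
X_fake = rng.normal(1.0, 1.0, size=(n, 6))   # shifted, so partially separable
X = np.vstack([X_real, X_fake])
y = np.r_[np.zeros(n), np.ones(n)]            # 1 = ghost

clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=300, random_state=0)
clf.fit(X, y)
ghost_prob = clf.predict_proba(X)[:, 1]

# Choose the cut that keeps 99% of real tracks, then check ghost rejection
cut = np.quantile(ghost_prob[y == 0], 0.99)
real_kept = (ghost_prob[y == 0] <= cut).mean()
ghosts_removed = (ghost_prob[y == 1] > cut).mean()
print(f"real tracks kept: {real_kept:.3f}, ghosts removed: {ghosts_removed:.3f}")
```

The achievable ghost rejection at fixed signal retention depends entirely on how separable the feature distributions are.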
Jet tagging
• Boosted Decision Trees (BDTs) are used to identify b, c, and light jets
• Kinematic observables of the secondary vertex are used as input
• The efficiency for identifying a b (c) jet is about 65% (25%), with a probability of misidentifying a light-parton jet of 0.3%
Charged-particle ID
• ProbNN, a neural network with one hidden layer
JINST 10 (2015) P06013
LHCb detector performance: https://doi.org/10.1142/S0217751X15300227
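A BDT jet tagger of the kind described above can be sketched with scikit-learn's gradient-boosted trees on secondary-vertex observables; the observables and class separations below are synthetic, chosen only to make the three jet flavours distinguishable:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)

# Toy secondary-vertex observables per jet: SV mass, SV pT,
# flight-distance significance, track multiplicity (all synthetic here)
def make_jets(n, mass_mean, fd_mean):
    return np.column_stack([
        rng.normal(mass_mean, 0.8, n),   # SV mass [GeV]
        rng.normal(20.0, 5.0, n),        # SV pT [GeV]
        rng.normal(fd_mean, 2.0, n),     # flight-distance significance
        rng.poisson(3, n),               # SV track multiplicity
    ])

X = np.vstack([make_jets(1000, 4.5, 8.0),   # b jets
               make_jets(1000, 1.5, 4.0),   # c jets
               make_jets(1000, 0.5, 1.0)])  # light jets
y = np.repeat([0, 1, 2], 1000)

bdt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
bdt.fit(X, y)
acc = bdt.score(X, y)
print(f"training accuracy: {acc:.2f}")
```

BDTs remain popular in HEP because they train quickly on tabular inputs like these and are robust to weakly informative features.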
Charmed Baryon production in ALICE
• ALICE is the dedicated heavy-ion experiment at the LHC
• Study of the Quark-Gluon Plasma (QGP)
• Due to the short lifetime of the Λc baryon and statistical limitations, the reconstruction was particularly challenging
• BDTs were used to separate background from signal
JHEP 1804 (2018) 108
Top tagging with deep learning
• Tagging of highly energetic jets from top-quark decay based on image techniques
• Using high-level jet-substructure variables
• The jet classification method achieves a background rejection of 45 at a 50% efficiency operating point for reconstruction-level jets with pT in the range of 600 to 2500 GeV
(Figure: average jet images for top-quark signal and QCD background, simulation)
https://arxiv.org/abs/1704.02124v2
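The "background rejection at a 50% efficiency operating point" figure of merit can be computed directly from classifier score distributions. The Gaussian scores below are toy stand-ins, not the paper's tagger outputs; the point is the working-point arithmetic:

```python
import numpy as np

rng = np.random.default_rng(5)

def background_rejection(sig_scores, bkg_scores, sig_eff=0.50):
    """Background rejection (1 / background efficiency) at a given signal efficiency."""
    cut = np.quantile(sig_scores, 1.0 - sig_eff)   # keep the top sig_eff of signal
    bkg_eff = (bkg_scores >= cut).mean()
    return 1.0 / bkg_eff

# Toy classifier outputs for top-jet signal and QCD background
sig = rng.normal(2.0, 1.0, 100_000)
bkg = rng.normal(0.0, 1.0, 100_000)
r = background_rejection(sig, bkg)
print(f"rejection at 50% efficiency: {r:.1f}")
```

Scanning sig_eff over (0, 1) traces out the full ROC curve, of which this is one point.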
Simulation production in HEP
• Production of Monte Carlo simulation samples is a crucial task in HEP experiments
• Detailed simulation is needed to optimize the event selection for SM and new-physics searches
• To measure the performance of new detector technologies
• Samples with large statistics are usually needed in order to reduce the analysis uncertainties
• A lot of resources (computational, human) are needed to fulfill the requirements of entire experimental collaborations
For the past few years Monte Carlo simulations have represented more than 50% of the WLCG (LHC computing grid) workload
GANs for fast simulation
• Generative Adversarial Networks (GANs) are a type of deep neural networks with an architecture comprised of two nets, pitting one against the other (“adversarial”)
• One neural network, the “generator”, generates new data instances, while the other, the “discriminator”, evaluates their authenticity by comparing them with the training data
• The goal is to train the generator so that the discriminator cannot tell whether the information it receives is genuine
Introduced by Ian Goodfellow in 2014; referred to as “the most interesting idea in the last 10 years of ML”
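The generator/discriminator game can be shown end-to-end in a 1-D toy: a linear generator tries to imitate samples from N(3, 1), while a logistic discriminator tries to tell them apart. All gradients are written out by hand, and every hyperparameter here is invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

MU_REAL, BATCH, LR, STEPS = 3.0, 128, 0.05, 3000

# Generator g(z) = a*z + b and discriminator D(x) = sigmoid(w*x + c)
a, b = 1.0, 0.0
w, c = 0.0, 0.0

for _ in range(STEPS):
    real = rng.normal(MU_REAL, 1.0, BATCH)
    z = rng.standard_normal(BATCH)
    fake = a * z + b

    # --- discriminator step: push D(real) -> 1 and D(fake) -> 0 ---
    s_r, s_f = sigmoid(w * real + c), sigmoid(w * fake + c)
    grad_w = np.mean(-(1 - s_r) * real + s_f * fake)   # d(BCE)/dw
    grad_c = np.mean(-(1 - s_r) + s_f)
    w -= LR * grad_w
    c -= LR * grad_c

    # --- generator step: push D(fake) -> 1 (non-saturating loss) ---
    s_f = sigmoid(w * fake + c)
    grad_fake = -(1 - s_f) * w                          # dL_G/d(fake)
    a -= LR * np.mean(grad_fake * z)
    b -= LR * np.mean(grad_fake)

samples = a * rng.standard_normal(10_000) + b
print(f"generated mean: {samples.mean():.2f} (target {MU_REAL})")
```

At convergence the discriminator can no longer separate real from generated samples, which is exactly the criterion stated on the slide.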
CaloGAN: a calorimetry fast simulation
Average gamma shower from Geant4 (top) and CaloGAN (bottom)
Shower shape variables for e+, gamma and pi+, comparison Geant4 and CaloGAN
• A simulation of the calorimeter using GANs (CaloGAN)
• From a series of simulated showers, the CaloGAN is tasked with learning the simulated data distribution for gammas, e+, and pi+
• The training dataset is represented in image format by three images of dimensions 3x96, 12x12, and 12x6, each representing the shower energy depositions
Phys. Rev. D 97, 014021 (2018)
Calorimeter FastSim in LHCb
Showers generated with Geant4 (top) and showers simulated with GANs (bottom)
• A new deep-learning framework based on GANs
• Faster than traditional simulation methods by 5 orders of magnitude
• Reasonable accuracy
• This approach could allow physicists to produce the amount of simulated data needed during the HL-LHC
https://arxiv.org/abs/1812.01319
DQM
Data Quality monitoring
• Data quality monitoring is a crucial task for every large-scale High-Energy Physics experiment
• The system still relies on evaluation by experts
• An automated system can save resources and improve the quality of the collected data
• A detector measures the physical properties of proton-collision products
• When a subdetector exhibits abnormal behavior, it is reflected in the measured or reconstructed properties
doi:10.1088/1742-6596/898/9/092041
Data Quality monitoring
• The primary goal of the system is to assist the data-quality managers by filtering the most obvious cases, both positive and negative
• The system's objective is to minimize the fraction of data samples passed on for human evaluation (i.e. the rejection rate)
• The evaluation of the algorithm was done using CMS experimental data published in the CERN Open Data portal
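Automated DQM filtering of the kind described above amounts to anomaly detection on per-lumisection summary features. As a stand-in for the cited system, here is a sketch using scikit-learn's IsolationForest trained on "good" periods only; the features and shift are synthetic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Per-lumisection summary features (toy): e.g. means/widths of key histograms
good = rng.normal(0.0, 1.0, size=(500, 8))    # normal detector behaviour
bad = rng.normal(4.0, 1.0, size=(20, 8))      # a misbehaving subdetector

clf = IsolationForest(random_state=0).fit(good)   # train on good data only
flags_bad = clf.predict(bad)                       # -1 = anomaly, +1 = normal
flags_good = clf.predict(good)

bad_flag_rate = (flags_bad == -1).mean()
good_flag_rate = (flags_good == -1).mean()
print(f"bad lumisections flagged: {bad_flag_rate:.2f}")
print(f"good lumisections flagged: {good_flag_rate:.2f}")
```

Only the lumisections the model flags (plus a borderline band) would then be passed on for expert evaluation, which is how the rejection rate is reduced.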
Hardware triggering
Data processing with FPGAs
• The L1 trigger decision (hardware, FPGA-based) must be made within 4 microseconds
• Could an ML model be fast enough to make a decision in such a short time window?
Field Programmable Gate Arrays (FPGAs)
• Reprogrammable fabric of logic cells embedded with DSPs, BRAMs, and high-speed I/O
• Low power consumption compared to CPU/GPU
• Massively parallel
ML interface for FPGAs
• A regular neural-network model is trained on a CPU/GPU
• The model is loaded into firmware via the hls4ml interface
• The decision is made in the FPGA
https://arxiv.org/abs/1804.06913
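A key step in mapping a trained network onto an FPGA is replacing floating-point weights with fixed-point values (hls4ml's ap_fixed types). The NumPy sketch below illustrates that quantization step and its bounded error; it is an illustration of the idea, not hls4ml's actual API:

```python
import numpy as np

def quantize_fixed(x, total_bits=10, int_bits=4):
    """Round to a signed fixed-point grid, like an FPGA ap_fixed<total,int> type."""
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

rng = np.random.default_rng(8)
weights = rng.standard_normal(1000)           # stand-in trained weights
q = quantize_fixed(weights, total_bits=10, int_bits=4)

# For in-range values the quantization error is bounded by half an LSB
lsb = 2.0 ** -(10 - 4)
print(f"max |error|: {np.abs(q - weights).max():.4f}, half LSB: {lsb / 2:.4f}")
```

Choosing the bit widths is the accuracy/resource trade-off: fewer bits mean fewer DSPs and lower latency, at the cost of a coarser grid.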
FPGAs for Phase-2 ATLAS Muon barrel
ATLAS-TDR-026
• The ATLAS Level-0 muon trigger will undergo a complete upgrade for the HL-LHC
• New trigger processor• FPGA based system
• New trigger station• New RPC Layer
• Improved trigger algorithms
• They need to be very fast and flexible
A neural network is a good candidate
From strip maps to images: CNNs
GRID, Parallel computing, GPUs, Hadoop
The Worldwide LHC Computing Grid (WLCG)
• The WLCG is the world's largest computing grid
• It is supported by many associated national and international grids across the world and coordinated by CERN
• 42 countries
• 170 computing centers
• Over 2 million tasks run every day
• 1 million computer cores
• 1 exabyte of storage
Hadoop and Spark services at CERN
• Hadoop is a collection of open-source programs that carry out the tasks essential for a computer system designed for big-data analytics
• Hadoop services at CERN started in 2013
• The systems have evolved as the requirements on the data-storage backend have increased
• More data are generated (rates > 100 GB/day)
• Analytics, machine learning (the initial design is not optimal for these)
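The computational model underlying Hadoop is map-reduce over partitioned data: each node processes its own partition independently, and partial results are merged. A toy version in standard-library Python (a made-up particle-label dataset, not a CERN workload) looks like this:

```python
from collections import Counter
from functools import reduce

# Toy "dataset" split into partitions, as HDFS would store blocks across nodes
partitions = [
    ["muon", "electron", "muon", "jet"],
    ["jet", "jet", "muon"],
    ["electron", "tau"],
]

def map_phase(partition):
    """Each node counts occurrences in its own partition independently."""
    return Counter(partition)

def reduce_phase(counts_a, counts_b):
    """Partial results are merged pairwise into a global count."""
    return counts_a + counts_b

totals = reduce(reduce_phase, map(map_phase, partitions))
print(dict(totals))   # {'muon': 3, 'jet': 3, 'electron': 2, 'tau': 1}
```

Because the map phase never needs data from other partitions, the same pattern scales from three lists to petabytes spread over a cluster.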
Big data projects with Hadoop and Spark
New CERN IT monitoring infrastructure
Next-generation archiver for accelerator logs
The ATLAS event index
SWAN service
• Fully integrated with Spark and Hadoop at CERN
• A modern, powerful, and scalable platform for data analysis
GPU resources at CERN
• Graphical Processing Units (GPUs) are fundamental for the training of machine-learning models
• With GPUs the training time can be reduced by several orders of magnitude with respect to a normal CPU
• HEP workloads can benefit from massive parallelism
• Current challenges such as TrackML motivate CERN to consider GPU provisioning in the OpenStack cloud
• GPUs can be accessed via virtual machines
• The user can run standard CUDA applications and frameworks such as TensorFlow
• The resources are so far quite limited but are intended to grow in the coming years
Conclusions and Perspectives
• ML and big-data tools will be crucial to solve many of the challenges of the HL-LHC era
• Expert supervision will not be removed but rather complemented with a new set of tools
• Current tools are used for PID, triggering, reconstruction, and many other data-processing tasks
• MC simulation production could benefit from the use of GANs, which can speed up the simulation time, allowing much faster and more optimal delivery of results
• The integration of ML algorithms at the hardware level (FPGAs) will potentially allow better decisions at the L1 trigger
• CERN is investing in new devices (hardware, software) to integrate all these tools
• Accomplishing the different tasks will require a joint effort from physicists, engineers, and IT people
Backup
Multilayer neural network
Deep tagging for double quark jet
• The DeepDoubleBvL tagger is a tool used to distinguish jets that originate from the merged decay products of resonances from jets that originate from single partons
• Simulated samples of H->bb, H->cc, and QCD multijets were used for the performance studies
DP-2018/033
CaloGAN neural network architectures
GANs for data-simulation corrections
CMS DP-2019/003
• The scale factor (SF) for the b-tag discriminant (DeepCSV) shape is derived from the data/simulation ratio
• The “deep scale factor” (DeepSF) approach uses a neural network
• The discriminator network models the residual data/simulation ratio
• The DeepSF network trains through adversarial feedback
• The final scale factor is built from an ensemble (unweighted mean) of 25 separate trainings (different seeds)
• Ensemble fluctuations (standard error on the mean) indicate the stability of the procedure
Cherenkov Detectors FastSim with GANs
• A typical Cherenkov detector simulation is quite slow: propagating the Cherenkov photons through the media requires expensive calculations
• This work concentrates on simulating the Detector of Internally Reflected Cherenkov light (DIRC) of BaBar and SuperB
Cherenkov Detectors Fast Simulations Using Neural Networks, 10th International Workshop on Ring Imaging Cherenkov Detectors
CERN openlab initiative
• Most of the new developments in big-data and ML tools are now carried out by the CERN openlab initiative
• CERN openlab is a public-private partnership that works to accelerate the development of cutting-edge ICT solutions for the worldwide LHC community
• Some of the ongoing projects are:
• Oracle cloud
• REST services, JavaScript, and JVM performance
• Quantum computing for high-energy physics
• High-throughput computing collaboration
• Code modernization: fast simulation
• Intel big-data analytics
• Oracle big-data analytics
• Yandex data popularity and anomaly detection