1
Optimising CERN systems through ML & DA
using controls data
BE ML and DA Forum Workshop
Filippo Tilaro, M. Bengulescu, Fernando Varela (BE/ICS)
Manuel Gonzalez (BE-BI)
in collaboration with Siemens AG CT Munich, St. Petersburg, Brasov
CERN – 28th May 2019
2
CERN Industrial Control Systems
3
BE-ICS interest in Big Data
Exploit the Big Data volume produced by our systems to:
• Extend the monitoring capabilities of the control systems
  • By detecting symptomatic effects in the data which do not trigger alarms
• Reduce operational and maintenance costs
  • Increase availability, stability and performance of the processes
  • Detect anomalous behavior and over-usage of sensors, actuators and industrial devices
• Predictive maintenance
  • Anticipate when sensors and actuators have to be replaced
• Render Industrial Control Systems smarter
  • Guide engineers and operators in order to take corrective actions
The work of BE-ICS in Big Data spans well beyond the Department boundaries
Control System
Data Analytics
4
BE-ICS Collaborations
• CERN collaboration with Siemens since 2011 to work in areas of common interest in the domains of:
  • Evolution of SCADA systems
    • 1 Fellow entirely paid by Siemens
  • Big Data
    • 1 LD post, which will now be temporarily converted into 2 Fellow positions
  • Access to a network of researchers at Siemens specialized in ML techniques
• Collaboration with University of Valladolid, Spain for PID performance monitoring
• Potential collaboration with University of Marburg, Germany on Distributed Complex Event Processing
5
Selected work done so far
Some of the work done so far in the area of Big Data:
• Identification of CERN use-cases (>40) for Big Data:
  • Offline
  • Stream-data
  • Model trained on archived data but works on stream-data
• Implementation of algorithms to tackle some selected cases
• Joint development of a platform for Distributed Complex Event Processing
• Evaluation of Siemens solutions
  • Mindsphere, IoT and edge devices, …
=> Need for Edge Computing
• Anomaly Detection based on ML
• Process optimization based on ML
• Distributed Complex Event Processing
• Root-cause analysis
6
Our vision
• Combining cloud and edge computing into a single analytical framework
• Stream analysis
• Central rules deployment
• Distributed computational load across multiple nodes
• Support for multiple data ingestion protocols
• Users
  • Advanced: Jupyter & Python
  • Dummy: Simple web UIs to select type of analysis, time-windows and sets of signals for the analysis
[Diagram: architecture of the analytical framework. Labels: Fieldbus, Middleware, WinCC OA, Archiving, NXCALS, VM drivers, Broker, Analytics workers, Edge computing, Cloud & Edge Link, Broker, Analytics master, UI, Rules, Cloud computing]
7
Identified Use-Cases: Partial list
Categories: Online Monitoring, Faults diagnosis, Engineering design
• Cryogenics
  • Detection of valves oscillation
  • Anomaly detection for sensors and actuators
  • Compensation of heat excess in LHC magnets due to e-cloud
  • Vibration analysis for cryocompressors
  • PID performance evaluation
• Cooling & Ventilation
  • Detection of tank leaks
  • Detect PLC anomalies
  • Assessment of dynamic vs fixed thresholds
• Machine protection
  • LHC circuit monitoring
  • Identify causes for QPS data loss
  • Data loss detection
• CERN Power Grid
  • Forecast of the control system behaviour
  • Electrical power quality of service
  • Analysis of electrical power cuts
  • Recommendation system for WinCC OA users
• Vacuum
  • Vacuum leaks
  • Understanding the degradation of the vacuum system due to leaks
  • Anomalies in process regulation
• LHC Experiments (Gas Control Systems)
  • Alarms flood management
  • Root-cause fault analysis in Gas Control Systems
  • Analysis of OPC-CAN middleware
8
UC1: Oscillation analysis for cryogenics valves
Goal: detect anomalous oscillation of valves
• Lifespan of valves in km!
Impact on:
• Process system stability and safety
• Communication load
• Maintenance (overuse of valves)
• Performance (physics time)
Why data analytics?
• General algorithm to detect different oscillations
• Monitoring several thousands of signals (not manually!)
  • Over 34000 physical instruments and channels
  • 12136 AI, 4856 AO, 4536 DI, 1568 DO, 8000 spare and virtual channels, ~4000 analog control loops
  • More than 120 PLCs
    • Siemens S7-416-2DP, 30000 conceptual objects/parameters
[Figure: examples of abnormal behaviour of valves]
9
UC1: Oscillation detection to minimize operational and maintenance cost
Results:
› ~10% of the CRYO valves showed abnormal oscillations
› Multiple anomalies per valve
  up to 20 hours/month
› Wide range of anomalous frequencies:
  oscillation periods from 2 hours down to seconds
Actions:
Improve tuning of control loops to deal with external disturbances & unexpected interoperations
Achievements
Reduce maintenance cost by extending valves’ operational life
More stable system
Status: In production
~5000 signals analysed every 24h
Continuous anomaly detection analysis
[Diagram labels: CALS, Hadoop cluster, Web report, System expert]
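The slides do not say which algorithm detects the valve oscillations. As a minimal sketch only, assuming a uniformly sampled valve position and a simple spectral approach (the function name dominant_oscillation and the min_amplitude threshold are hypothetical), one way to find a dominant oscillation and its period could look like this:

```python
import numpy as np

def dominant_oscillation(position, sample_period_s, min_amplitude=1.0):
    """Return (period_s, amplitude) of the strongest oscillation, or None.

    Assumption: `position` is a uniformly sampled valve position (e.g. % opening).
    `min_amplitude` is an illustrative threshold, not a value from the slides.
    """
    x = np.asarray(position, dtype=float)
    x = x - x.mean()                                    # remove the steady opening level
    spectrum = np.abs(np.fft.rfft(x)) * 2.0 / len(x)    # approximate amplitude per bin
    freqs = np.fft.rfftfreq(len(x), d=sample_period_s)  # bin frequencies in Hz
    k = int(np.argmax(spectrum[1:])) + 1                # skip the zero-frequency bin
    if spectrum[k] < min_amplitude:
        return None                                     # no significant oscillation
    return 1.0 / freqs[k], spectrum[k]
```

Running such a check over a sliding window per signal would fit the "every 24h" batch pattern mentioned above, but the window length and threshold would need tuning per valve type.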
10
UC2: Anomaly detection in CRYO signals
Presence of different anomalies not detected by the control systems!
Possible causes:
• hardware failures/degradations
• wrong tuning/structure
• false measurements…
Impact
• Process stability and safety
• Maintenance (overuse of valves)
• Performance and downtime
Why data analytics?
• Too complex to embed calculations into the control systems
• Learn from historical data the group of signals with similar behaviour
Valves CV910 positions in L2 (26th June 2017)
Direct impact on the operational cost!!!
Beam dump!
11
UC2: Anomaly detection in CRYO signals
Signals Correlation and K-NN in action!
• Signal offset detection
• Flipping fault detection
• Oscillation detection
• Faulty amplitude detection
› Multi-purpose algorithms
  Avoid lots of specialized analyses that are difficult to maintain
› ~5000 signals analysed continuously
› Able to detect faults not foreseen by experts
12
UC2: Anomaly detection in CRYO signals
• Learn the groups of sensors/actuators which behave similarly
• Physical and logical relations
• Exploit historical data (~4GB/day for Cryo)
• Combine Machine Learning techniques with Experts’ knowledge
• Build a model to detect abnormal system behaviours
Challenges:
• Model not specific to a domain/system
• Different types of anomalies, duration and noise
• No precise boundaries between normal/anomalous
• Mostly unsupervised training: no database of faults!
• Dynamic system => dynamic model
[Figure label: Model building]
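The slides name signal correlation and K-NN but give no implementation detail. The sketch below is one loose interpretation, not the production method: it learns, from historical data, the k most correlated "neighbour" signals of each signal, then flags a signal whose correlation with those neighbours collapses in a recent window. The function names, k and min_corr are all hypothetical.

```python
import numpy as np

def learn_neighbours(history, k):
    """history: array of shape (n_samples, n_signals).

    Returns, for each signal, the indices of its k most correlated signals.
    """
    corr = np.abs(np.corrcoef(np.asarray(history, dtype=float), rowvar=False))
    np.fill_diagonal(corr, -1.0)              # a signal is not its own neighbour
    return np.argsort(-corr, axis=1)[:, :k]

def anomalous_signals(window, neighbours, min_corr=0.5):
    """Indices of signals that no longer follow their learned neighbours.

    Note: a signal that is constant over the window yields NaN correlations
    and is not handled in this sketch.
    """
    corr = np.abs(np.corrcoef(np.asarray(window, dtype=float), rowvar=False))
    # The median keeps one broken neighbour from dragging down healthy signals.
    scores = np.array([np.median(corr[i, nbrs]) for i, nbrs in enumerate(neighbours)])
    return np.flatnonzero(scores < min_corr)
```

Because the groups are learned rather than configured, this kind of model is not tied to one domain, which matches the "model not specific to a domain/system" challenge listed above.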
13
UC4: Root-cause analysis in Experiments Gas Control Systems
28 gas systems deployed around LHC
4 Data Servers, 51 PLCs (29 for process control, 22 for flow-cells handling)
Essential for particle detection
Reliability and stability are critical
Any variation in the gas composition can affect the accuracy of the acquired data
~18 000 physical sensors / actuators
[Figure labels: 7 Apps, 9 Apps, 6 Apps, 6 Apps]
14
UC4: Root-cause analysis in Experiments Gas Control Systems
› Diagnosing a fault is complex: it may take weeks!
  Alarms flood: a single fault can generate up to thousands of events
  The 1st alarm is not necessarily the most relevant for the diagnosis
  The same fault generates different event sequences depending on the system status
  A single fault can stop the whole control process
Misleading feedback! The actual problem is in the distribution and not in the pump
[Figure labels: Fault in the distribution system, Alarms flooding, Domino effect, Diagnose, Alarm flood]
15
UC4: Root-cause analysis in Experiments Gas Control Systems
Event stream analysis
Stages: Analyze, Learn, Diagnose
• Identify and detect fault / abnormal pattern for Diagnosis and Prognostics
• Provide experts with Root-cause and Gap Analysis using Rules and Patterns Mining
• Forecasts, Trends and Early-Warnings to increase Operating Hours
[Figure: event lists generated by the same fault. Data (event stream): X T C D F A A E D N D B K D F A A B K D. Alarm pattern: A A B]
Achievements:
Identification of the root of the problem
Algorithm learns patterns and uses them to forecast possible faults
Early warning to operators to intervene
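The pattern mining algorithm itself is not described on the slides. As an illustration only, the sketch below shows how an already known alarm pattern could be located in an event stream when unrelated events are interleaved, using the event letters from the figure. The function name find_pattern and the max_gap parameter are hypothetical, and the greedy search is a simplification that can miss some matches.

```python
def find_pattern(events, pattern, max_gap):
    """Find non-overlapping occurrences of `pattern` in `events`.

    At most `max_gap` unrelated events may sit between consecutive pattern
    symbols. Returns a list of (start_index, end_index) pairs.
    """
    matches = []
    i = 0
    while i < len(events):
        if events[i] != pattern[0]:
            i += 1
            continue
        pos, matched = i, True
        for symbol in pattern[1:]:
            window = events[pos + 1 : pos + 2 + max_gap]   # next allowed positions
            if symbol in window:
                pos = pos + 1 + window.index(symbol)       # take the earliest hit
            else:
                matched = False
                break
        if matched:
            matches.append((i, pos))
            i = pos + 1                                    # continue after this match
        else:
            i += 1
    return matches

# Event stream from the slide; the pattern A A B appears twice with different gaps.
stream = "X T C D F A A E D N D B K D F A A B K D".split()
print(find_pattern(stream, ["A", "A", "B"], max_gap=4))    # [(5, 11), (15, 17)]
```

The same fault producing A A B with and without intervening events is the situation described above, where one fault generates different event sequences depending on the system status.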
16
UC5: Evaluation of PID performance
BE-ICS in collaboration with the University of Valladolid (not an openlab activity with Siemens)
Based on: “Performance monitoring of industrial controllers based on the predictability of controller behaviour”, R. Ghraizi, E. Martinez, C. de Prada
› Impact on the regulation of the entire control systems
› Too many PIDs to check manually!
› A general method to assess different PID structures
› Many sources of faults/malfunctions
  System status dependency
  External disturbances/factors
  Bad tuning/Wrong controller type/structure
  Slow degradation
Assist system engineering
[Diagram: control loop with Controller and Process blocks; signal labels: w, u, y, v, SP, MV, CV]
17
UC5: Evaluation of PID performance
› PID anomaly detection: assess each PID model based on the historical data
  Simple performance index
› Efficiency of control process: time/actions taken/energy consumed to reach steady points
  Stability of the controlled variable
› Improvement of ~10% of the analysed control loops
[Figure labels: Bad, Good]
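The slides cite a predictability-based method but do not give the index formula. The sketch below is a simplified illustration of the general idea, not the formula from the cited paper: if the control error (SP minus CV) can still be predicted several samples ahead from its own past, the controller is leaving correctable structure in the error. The function name, the autoregressive order and the horizon are assumptions.

```python
import numpy as np

def predictability_index(error, order=5, horizon=5):
    """Fraction of control-error variance predictable `horizon` samples ahead.

    Near 0: the error looks like unpredictable noise (well-behaved loop).
    Near 1: the error is highly predictable (e.g. a sustained oscillation).
    """
    e = np.asarray(error, dtype=float)
    e = e - e.mean()
    times = range(order - 1, len(e) - horizon)
    # Each row holds the `order` most recent error samples, newest first.
    X = np.array([e[t - order + 1 : t + 1][::-1] for t in times])
    y = np.array([e[t + horizon] for t in times])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares linear predictor
    residual = y - X @ coef
    return 1.0 - residual.var() / y.var()
```

A single scalar per loop makes it feasible to rank thousands of PIDs and send only the worst ones to an engineer, which is the point of the "too many PIDs to check manually" remark.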
18
UC7: LHC circuit monitoring
Condition monitoring analysis (in collaboration with TE-MPE)
› Main Goal: evaluation of the superconducting circuits health
Degradation after 20 years of operations
Monitoring conditions: anomalous change of current flows, impedance, circuit functioning …
› What to monitor?
Electrical circuits
magnets, power converters, switches …
› Control system: 16 WinCC OA servers
44 industrial FECs
2800 radiation-hard devices
~500M signals
Readout (from 10 kHz to 1 Hz)
19
UC7: LHC circuit monitoring
• Current analysis workflow is inefficient
  • Manual data extraction, transformation and load
  • Many independent scripts
  • Time consuming
• New expert system as common framework
  • Translate experts’ knowledge into formulation sets / rules
  • Central knowledge database
  • Rule template to be reused, parametrized, validated
• Domain specific language for simple formulation:
  • Time reasoning and temporal expression
  • Mathematical and logical functions
• Status: under development [lab testing]!
Rule definition: Truth(sma(I_Meas, 1m30s) > I_Threshold): duration(>=1h)
[Figure labels: Distributed complex event processing, Rules, List of similar assets]
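To make the semantics of the rule definition on this slide concrete, here is a hedged Python sketch of how it might be evaluated on a uniformly sampled signal. It is not the DSL engine under development. It assumes that sma means a trailing simple moving average and that duration(>=1h) means the condition must hold continuously for one hour. The function names are hypothetical.

```python
import numpy as np

def sma(values, window):
    """Trailing simple moving average; the first window-1 outputs are NaN."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    if window <= len(values):
        csum = np.cumsum(np.insert(values, 0, 0.0))
        out[window - 1:] = (csum[window:] - csum[:-window]) / window
    return out

def rule_fires(values, sample_period_s, threshold,
               sma_window_s=90, min_duration_s=3600):
    """True if the 1m30s moving average stays above threshold for >= 1h."""
    window = max(1, int(round(sma_window_s / sample_period_s)))
    needed = max(1, int(round(min_duration_s / sample_period_s)))
    with np.errstate(invalid="ignore"):
        above = sma(values, window) > threshold   # NaN compares as False
    run = 0
    for flag in above:
        run = run + 1 if flag else 0              # length of the current streak
        if run >= needed:
            return True
    return False
```

Keeping the window, threshold and duration as parameters mirrors the idea of a rule template that can be reused and parametrized across similar assets.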
20
UC3: Feed analytical results into the control system
Visualize the results of the analysis for the operators so that they can take the proper actions!
Status: Working prototype
21
Fault detection applied to industrial process
Collaboration with U. of Valladolid, Spain
Application of PCA (Principal Component Analysis) to detect faults or degradation as early as possible. Early detection allows either preventive maintenance or an optimal corrective action by the operators, increasing the uptime of an industrial plant.
• PCA:
  • an unsupervised, non-parametric statistical technique,
  • used in machine learning to reduce the dimensionality of datasets to the most relevant components
• Applications: CV, CRYO
Contact: Enrique Blanco
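The slide does not describe how PCA is turned into a fault detector. A common textbook approach, assumed here rather than taken from the collaboration's work, is to fit PCA on data from normal operation and flag samples whose reconstruction error (squared prediction error) exceeds a limit learned from that data. The function names and the quantile-based limit are hypothetical.

```python
import numpy as np

def fit_pca_monitor(normal_data, n_components, quantile=0.99):
    """Fit a PCA model on normal-operation data (n_samples, n_signals).

    Assumes no signal is constant (the standard deviation must be non-zero).
    """
    X = np.asarray(normal_data, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / std                              # standardise each signal
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    P = vt[:n_components].T                           # retained principal directions
    spe = np.sum((Z - Z @ P @ P.T) ** 2, axis=1)      # squared prediction error
    return {"mean": mean, "std": std, "P": P, "limit": np.quantile(spe, quantile)}

def is_faulty(model, samples):
    """Boolean array: True where a sample breaks the learned correlations."""
    Z = (np.asarray(samples, dtype=float) - model["mean"]) / model["std"]
    residual = Z - Z @ model["P"] @ model["P"].T
    return np.sum(residual ** 2, axis=1) > model["limit"]
```

With a 0.99 quantile, roughly 1% of normal samples are expected to exceed the limit, so in practice an alarm would usually require several consecutive exceedances.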
22
Pattern-based KPI discovery via CEP
Collaboration: Marburg University, Germany
Extract and identify relevant KPIs from alarms and data via online exploration, thanks to pre-emptive data indexing and CEP pattern matching techniques being researched at Marburg University (ChronicleDB).
• Compare online query performance between Spark + HBase / Kudu / Impala and Marburg’s ChronicleDB, and apply online KPI extraction techniques for predictive improvements of alarms
• Applications: EL distribution, Access Control
Contact: Matthias Braegger, Brice Copy
23
Next use-cases
• Linac3 ion beam source optimization (in collaboration with BE-ABP):
  • Find the optimal settings of control inputs to
    • Optimize the ion current in the beam transformer of LINAC3
    • Minimize the variance of the ion current
  • Learn from ~10 years of LINAC3 operation data
  • Assist the operators to choose the best settings for operation
• Vacuum leak detection (in collaboration with TE-VSC):
  • Critical for the proper operation of the accelerators and LHC machine
  • Initially for SPS, then for all the other vacuum systems
  • Historical analysis of pressure sensors (Pirani and Penning gauges) combined with:
    • beam energy, beam mode and
    • temperature sensors
  • Inform operators to take the proper actions
24
Conclusions
• Data Analytics already adds significant value today in understanding the behaviour of complex systems and optimizing them
  • Big impact on operation, and on running and maintenance costs
• BE-ICS has been working on Data Analytics with Siemens for the last 5 years
  • Openlab collaboration
• Growing community of users in different Groups and Departments
  • Very distinct use-cases, not only related to controls
• General approach for multi-domain applications
  • Reusability of the developed analyses
25
Use-cases: a partial list
› Online monitoring
  • Control System Health
  • Electrical power quality of service
  • Looking for heat in superconducting magnets
  • Oscillation in cryogenics valves
  • Discharge of superconducting magnets heaters
  • Trending and forecast of the control process behavior
  • Vacuum Leak detection
› Faults diagnosis
  • Anomalies in the process regulation
  • PLC anomalies
  • Data loss detection
  • Root-cause analysis for complex WinCC OA installations
  • Analysis of sensors functioning and data quality
  • Analysis of OPC-CAN middleware
  • Analysis of electrical power cuts
  • Cryogenic system breakdowns
› Engineering design
  • Electrical consumption forecast
  • Efficiency of electric network
  • Predictive maintenance of control systems elements
  • LINAC3 ion beam source optimization
  • Vibration analysis
  • Efficiency of control process
  • …
Thank you!
CERN BE-ICS: https://be-dep-ics.web.cern.ch/
26
UC1: Compensation of the e-cloud thermal effect
› LHC vacuum chamber
  Cold bore at 1.9 K
  Beam screen (5-20 K) to intercept heat load
› Interference with Cryo control system
Thermal resistance, R = 6 K/W
Thermal capacitance, C = 1200 J/K
› Main issue: temperature increase close to the quench level trigger!
[Figure labels: Ideal measurement cycle, Heat load of the screen]
In collaboration with TE-CRG (Benjamin Bradu)
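The thermal resistance and capacitance quoted on this slide suggest a lumped first-order thermal model, although the slide does not state the model explicitly. Under that assumption (C dT/dt = Q - T/R, with T the temperature rise above the coolant), the time constant is R × C = 6 K/W × 1200 J/K = 7200 s, i.e. 2 hours. The sketch below only illustrates this arithmetic; the function names and the 1 W heat load in the usage are hypothetical.

```python
R_K_PER_W = 6.0       # thermal resistance quoted on the slide
C_J_PER_K = 1200.0    # thermal capacitance quoted on the slide

def time_constant_s(r=R_K_PER_W, c=C_J_PER_K):
    """Time constant of the assumed first-order model, in seconds."""
    return r * c

def step_response(heat_load_w, duration_s, dt_s=1.0, r=R_K_PER_W, c=C_J_PER_K):
    """Temperature rise (K) after a constant heat load is applied.

    Forward-Euler integration of C dT/dt = Q - T/R, starting from T = 0.
    """
    rise = 0.0
    for _ in range(int(duration_s / dt_s)):
        rise += dt_s * (heat_load_w - rise / r) / c
    return rise

print(time_constant_s())   # 7200.0
```

A slow response of this order is one reason a feed-forward term, which acts on the estimated heat load before the temperature has moved, can help keep the temperature away from a trigger level.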
27
UC1: Compensation of e-cloud thermal effect
Qdbs = heat load on the beam screen
Qsr = synchrotron radiation
Qic = image current
Qec = electron cloud
Currently used in production for Cryo:
Keep temperature away from the quench level trigger
Data analytics techniques to reduce computing time from weeks to hours!
• Cloud computing to parallelize and distribute the computation
Compensation due to Feed Forward loops
Feed-Forward loops to compensate electron cloud heat load
28
UC6: Leak detection in Cooling and ventilation systems
• Problem:
  • Manually set alarm thresholds
  • Changing filling conditions
• Anomaly detection based on historical data
  • Detection of “large” leaks: anomalous valve opening time
  • Detection of “small” leaks: anomalous frequency of valve openings
• Achievements:
  • Identification of anomalous behaviours
  • Improved threshold settings
[Figure: distribution of valve openings [FSED_001_VMA400]]
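The two detection criteria above can be sketched as thresholds derived from historical distributions instead of manually set values. This is an illustration under assumptions, not the deployed analysis: the function name, the use of daily opening counts and the 0.99 quantile are all hypothetical choices.

```python
import numpy as np

def leak_flags(history_durations_s, history_daily_counts,
               new_duration_s, new_daily_count, quantile=0.99):
    """Compare a new observation against limits learned from history.

    "Large" leak: a single filling-valve opening lasts unusually long.
    "Small" leak: the valve opens unusually often in a day.
    """
    duration_limit = np.quantile(history_durations_s, quantile)
    count_limit = np.quantile(history_daily_counts, quantile)
    return {
        "large_leak": bool(new_duration_s > duration_limit),
        "small_leak": bool(new_daily_count > count_limit),
    }
```

Recomputing the limits periodically from recent history is one way such thresholds could follow changing filling conditions.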