Optimising CERN systems through ML & DA using …...1 Optimising CERN systems through ML & DA using...

1

Optimising CERN systems through ML & DA using controls data

BE ML and DA Forum Workshop

Filippo Tilaro, M. Bengulescu, Fernando Varela (BE/ICS)

Manuel Gonzalez (BE-BI)

in collaboration with Siemens AG CT Munich, St. Petersburg, Brasov

CERN – 28th May 2019

2

CERN Industrial Control Systems

3

BE-ICS interest in Big Data

Exploit the Big Data volume produced by our systems to:

• Extend the monitoring capabilities of the control systems
  • By detecting symptomatic effects in the data which do not trigger alarms

• Reduce operational and maintenance costs
  • Increase availability, stability and performance of the processes
  • Detect anomalous behaviour and over-usage of sensors, actuators and industrial devices

• Predictive maintenance
  • Anticipate when sensors and actuators have to be replaced

• Render Industrial Control Systems smarter
  • Guide engineers and operators in taking corrective actions

The work of BE-ICS in Big Data spans well beyond the Department boundaries

Control System

Data Analytics

4

BE-ICS Collaborations

• CERN collaboration with Siemens since 2011 to work in areas of common interest in the domains of:
  • Evolution of SCADA systems
    • 1 Fellow entirely paid by Siemens
  • Big Data
    • 1 LD post, which will now be temporarily converted into 2 Fellow positions
    • Access to a network of researchers at Siemens specialized in ML techniques

• Collaboration with University of Valladolid, Spain for PID performance monitoring

• Potential collaboration with University of Marburg, Germany on Distributed Complex Event Processing

5

Selected work done so far

Some of the work done so far in the area of Big Data:

• Identification of CERN use-cases (>40) for Big Data:
  • Offline
  • Stream-data
  • Model trained on archived data but works on stream-data

• Implementation of algorithms to tackle some selected cases

• Joint development of a platform for Distributed Complex Event Processing

• Evaluation of Siemens solutions
  • Mindsphere, IoT and edge devices, …

⇒ Need for Edge Computing

• Anomaly Detection based on ML
• Process optimization based on ML
• Distributed Complex Event Processing
• Root-cause analysis

6

Our vision

• Combining cloud and edge computing into a single analytical framework

• Stream analysis

• Central rules deployment

• Distributed computational load across multiple nodes

• Support for multiple data ingestion protocols

• Users
  • Advanced: Jupyter & Python
  • Dummy: Simple web UIs to select type of analysis, time-windows and sets of signals for the analysis

[Figure: architecture diagram. Labels: Fieldbus, Middleware, WinCC OA, Archiving (NXCALS), VM Driver (x2), Broker, Analytics worker (x2), Cloud & Edge Link, Edge computing, Broker, Analytics master, UI, Rules, Cloud computing]

7

Identified Use-Cases: Partial list

Control Systems | Online Monitoring | Faults diagnosis | Engineering design

Cryogenics | 1. Detection of valve oscillations; 2. Anomaly detection for sensors and actuators | 1. Compensation of heat excess in LHC magnets due to e-cloud; 2. Vibration analysis for cryocompressors | PID performance evaluation

Cooling & Ventilation | Detection of tank leaks | Detect PLC anomalies | Assessment of dynamic vs fixed thresholds

Machine protection | LHC circuit monitoring | Identify causes for QPS data loss | Data loss detection

CERN Power Grid | Forecast of the control system behaviour | 1. Electrical power quality of service; 2. Analysis of electrical power cuts | Recommendation system for WinCC OA users

Vacuum | Vacuum leaks | Understanding the degradation of the vacuum system due to leaks | Anomalies in process regulation

LHC Experiments (Gas Control Systems) | Alarm flood management | Root-cause fault analysis in Gas Control Systems | Analysis of OPC-CAN middleware

8

Goal: detect anomalous oscillation of valves

• Lifespan of valves in km!

Impact on:

• Process system stability and safety

• Communication load

• Maintenance (overuse of valves)

• Performance (physics time)

Why data analytics?

• General algorithm to detect different oscillations

• Monitoring several thousands of signals (not manually!)
  • Over 34000 physical instruments and channels
    • 12136 AI, 4856 AO, 4536 DI, 1568 DO, 8000 spare and virtual channels, ~4000 analog control loops
  • More than 120 PLCs
    • Siemens S7-416-2DP, 30000 conceptual objects/parameters

UC1: Oscillation analysis for cryogenics valves

Examples of abnormal behaviour of valves

9

UC1: Oscillation detection to minimize operational and maintenance cost

Results:

› ~10% of the CRYO valves showed abnormal oscillations

› Multiple anomalies per valve, up to 20 hours/month

› Wide range of anomalous oscillation periods: from 2 hours down to seconds
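The slides do not describe the detection algorithm itself. As a minimal sketch of one generic way to find oscillations over a wide range of periods, the example below looks for a dominant spectral peak in a valve position signal. The function names, the 0.5 power-fraction threshold and the synthetic signal are all assumptions made for illustration, not the method used in production.

```python
import numpy as np

def dominant_period(signal, sample_period_s):
    """Return (period_s, power_fraction) of the strongest non-DC spectral peak."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                          # remove the DC component
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=sample_period_s)
    spectrum[0] = 0.0                         # ignore the zero-frequency bin
    total = spectrum.sum()
    if total == 0.0:
        return None, 0.0                      # constant signal: nothing to detect
    k = int(np.argmax(spectrum))
    return 1.0 / freqs[k], spectrum[k] / total

def is_oscillating(signal, sample_period_s, min_power_fraction=0.5):
    """Flag a signal whose energy is concentrated at a single frequency."""
    _, fraction = dominant_period(signal, sample_period_s)
    return fraction >= min_power_fraction

# Synthetic valve position (hypothetical): 50 % opening with a 10-minute
# oscillation plus measurement noise, sampled every 10 s for 6 hours.
rng = np.random.default_rng(0)
t = np.arange(0, 6 * 3600, 10.0)
valve = 50 + 5 * np.sin(2 * np.pi * t / 600) + rng.normal(0, 0.5, t.size)
period, fraction = dominant_period(valve, 10.0)
white_noise = rng.normal(0, 1, t.size)
```

A spectral test of this kind is independent of the oscillation period, which is why it can serve the same role for periods of seconds and of hours, provided the analysis window is long enough to contain several cycles.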

Actions:

Improve tuning of control loops to deal with external disturbances & unexpected interoperations

Achievements

Reduce maintenance cost by extending valves’ operational life

More stable system

CALS Hadoop cluster

Status: In production

~5000 signals analysed every 24h

Continuous anomaly detection analysis

Web report

System expert

10

UC2: Anomaly detection in CRYO signals

Presence of different anomalies not detected by the control systems!

Possible causes:

• hardware failures/degradations

• wrong tuning/structure

• false measurements…

Impact

• Process stability and safety

• Maintenance (overuse of valves)

• Performance and downtime

Why data analytics?

• Too complex to embed calculations into the control systems

• Learn from historical data the group of signals with similar behaviour

Valves CV910 positions in L2 (26th June 2017)

Direct impact on the operational cost!!!

Beam dump!

11

[Figure panels: Signal offset detection, Flipping fault detection, Oscillation detection, Faulty amplitude detection]

Signals Correlation and K-NN in action!

› Multi-purpose algorithms: avoid lots of specialized analyses that are difficult to maintain

› ~5000 signals analysed continuously

› Able to detect faults not foreseen by experts
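The slide names K-NN but gives no implementation details. The sketch below shows one common way a k-nearest-neighbour distance can act as a multi-purpose anomaly score: each time window is reduced to a few features and scored by its distance to the k-th closest window from healthy history. The feature choice (mean and standard deviation), the window width, the threshold factor and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def knn_anomaly_scores(reference, candidates, k=3):
    """Distance from each candidate vector to its k-th nearest reference vector."""
    ref = np.asarray(reference, dtype=float)
    cand = np.asarray(candidates, dtype=float)
    # Pairwise Euclidean distances, shape (n_candidates, n_reference)
    dists = np.linalg.norm(cand[:, None, :] - ref[None, :, :], axis=2)
    return np.sort(dists, axis=1)[:, k - 1]

def window_features(signal, width):
    """Split a signal into non-overlapping windows; return (mean, std) per window."""
    x = np.asarray(signal, dtype=float)
    n = x.size // width
    windows = x[: n * width].reshape(n, width)
    return np.column_stack([windows.mean(axis=1), windows.std(axis=1)])

rng = np.random.default_rng(1)
healthy = rng.normal(20.0, 0.2, 6000)        # hypothetical healthy history
recent = rng.normal(20.0, 0.2, 600)
recent[300:400] += 3.0                       # injected offset fault in window 3

ref_features = window_features(healthy, 100)
scores = knn_anomaly_scores(ref_features, window_features(recent, 100), k=3)
# Threshold from the reference set itself; k=4 skips each point's zero
# distance to itself so that three true neighbours are considered.
threshold = 5 * np.median(knn_anomaly_scores(ref_features, ref_features, k=4))
anomalous_windows = np.flatnonzero(scores > threshold)
```

Because the score only measures how unusual a window is relative to history, the same code reacts to offsets, amplitude changes and oscillations without a dedicated rule for each.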

12


• Learn the groups of sensors/actuators which behave similarly

• Physical and logical relations

• Exploit historical data (~4GB/day for Cryo)

• Combine Machine Learning techniques with Experts’ knowledge

• Build a model to detect abnormal system behaviours
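As a rough illustration of learning which sensors and actuators behave similarly, the sketch below groups signals by thresholding their pairwise Pearson correlation and taking connected components. The 0.9 threshold, the signal names and the synthetic data are hypothetical; the slides do not state which grouping method is actually used.

```python
import numpy as np

def correlation_groups(signals, names, min_corr=0.9):
    """Group signals whose absolute Pearson correlation is at least min_corr.

    Groups are the connected components of the thresholded correlation graph.
    """
    corr = np.abs(np.corrcoef(np.asarray(signals, dtype=float)))
    unvisited = set(range(len(names)))
    groups = []
    while unvisited:
        stack = [unvisited.pop()]
        component = []
        while stack:
            i = stack.pop()
            component.append(names[i])
            linked = [j for j in unvisited if corr[i, j] >= min_corr]
            for j in linked:
                unvisited.remove(j)
            stack.extend(linked)
        groups.append(sorted(component))
    return sorted(groups)

# Hypothetical signals: two temperature sensors following one process
# variable and two pressure sensors following another (one inverted).
rng = np.random.default_rng(2)
t = np.linspace(0, 20 * np.pi, 2000)
temperature, pressure = np.sin(t), np.cos(3 * t)
signals = [
    temperature + rng.normal(0, 0.05, t.size),
    2 * temperature + rng.normal(0, 0.05, t.size),
    pressure + rng.normal(0, 0.05, t.size),
    -pressure + rng.normal(0, 0.05, t.size),
]
groups = correlation_groups(signals, ["TT1", "TT2", "PT1", "PT2"])
```

Once groups are known, a member that stops following the rest of its group is a candidate anomaly, which is one way expert knowledge about physical and logical relations can be cross-checked against data.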

Challenges:

• Model not specific to a domain/system

• Different types of anomalies, duration and noise

• No precise boundaries between normal and anomalous behaviour

• Mostly unsupervised training: no database of faults!

• Dynamic system => dynamic model

Model building

13

UC4: Root-cause analysis in Experiments Gas Control Systems

28 gas systems deployed around LHC

4 Data Servers, 51 PLCs (29 for process control, 22 for flow-cells handling)

Essential for particle detection

Reliability and stability are critical

Any variation in the gas composition can affect the accuracy of the acquired data

~18 000 physical sensors / actuators

[Figure labels: 7 Apps, 9 Apps, 6 Apps, 6 Apps]

14

Fault in the distribution system Alarms flooding

Domino effect

› Diagnosing a fault is complex: it may take weeks!

Alarms flood: a single fault can generate up to thousands of events

The 1st alarm is not necessarily the most relevant for the diagnosis

The same fault generates different event sequences depending on the system status

A single fault can stop the whole control process

UC4: Root-cause analysis in Experiments Gas Control Systems

Diagnose Alarm flood

Misleading feedback! The actual problem is in the distribution and not in the pump

15

Identify and detect fault / abnormal pattern for Diagnosis and Prognostics

Analyze

Provide experts with Root-cause and Gap Analysis using Rules and Patterns Mining

Learn

Forecasts, Trends and Early-Warnings to increase Operating Hours

[Figure: event lists generated by the same fault (sequences of event codes such as X T C D F A A …) and the alarm pattern extracted from them; stage labels: Diagnose, Data]

UC4: Root-cause analysis in Experiments Gas Control Systems

Event stream analysis

Achievements:

Identification of the root of the problem

Algorithm learns patterns and uses them to forecast possible faults

Early warning to operators to intervene
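A minimal sketch of the pattern-mining idea, assuming a fault signature can be approximated by a contiguous run of event codes shared by several alarm floods caused by the same fault. The event codes, the pattern length and the support threshold below are hypothetical; the actual algorithm is not described in the slides.

```python
from collections import Counter

def common_patterns(event_lists, length, min_support):
    """Return contiguous event patterns present in at least min_support lists."""
    support = Counter()
    for events in event_lists:
        seen = {tuple(events[i:i + length])
                for i in range(len(events) - length + 1)}
        support.update(seen)                 # count each pattern once per list
    return {p: c for p, c in support.items() if c >= min_support}

def matches(pattern, stream):
    """True if the pattern occurs contiguously in a live event stream."""
    n = len(pattern)
    return any(tuple(stream[i:i + n]) == tuple(pattern)
               for i in range(len(stream) - n + 1))

# Three hypothetical alarm floods recorded for the same underlying fault
floods = [
    "X T C D F A A E".split(),
    "D N D B K D F A A B".split(),
    "K D F A A T".split(),
]
patterns = common_patterns(floods, length=4, min_support=3)
early_warning = any(matches(p, "Q D F A A Z".split()) for p in patterns)
```

Matching the learned pattern against the live event stream is what would allow an early warning before the full flood develops. A real system would also have to tolerate gaps and reordering, which this sketch does not.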

16

UC5: Evaluation of PID performance

BE-ICS in collaboration with the University of Valladolid (not an openlab activity with Siemens)

Based on: “Performance monitoring of industrial controllers based on the predictability of controller behaviour”, R. Ghraizi, E. Martinez, C. de Prada

› Impact on the regulation of the entire control systems

› Too many PIDs to check manually!

› A general method to assess different PID structures

› Many sources of faults/malfunctions:
  System status dependency
  External disturbances/factors
  Bad tuning/Wrong controller type/structure
  Slow degradation

[Figure: control loop block diagram with Controller and Process blocks; signal labels SP, MV, CV, u, w, y, v]

Assist system engineering

17

Bad

Good

› PID anomaly detection: assess each PID model based on the historical data

Simple performance index

› Efficiency of control process: time/actions taken/energy consumed to reach steady points

Stability of the controlled variable

› Improvement of ~10% of the analysed control loops
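To make the idea of a predictability-based performance index concrete, here is a hedged sketch: if the control error of a loop can be predicted several steps ahead from its own past, the controller is leaving correctable structure in the error. The autoregressive formulation, the horizon and the model order below are assumptions chosen for illustration and are not claimed to reproduce the index defined in the cited paper.

```python
import numpy as np

def predictability_index(error, horizon=5, order=10):
    """Fraction of control-error variance predictable `horizon` steps ahead.

    Near 0: the error looks like unpredictable noise (loop performing well).
    Near 1: the error is strongly predictable (oscillation, sluggish tuning).
    """
    e = np.asarray(error, dtype=float)
    e = e - e.mean()
    rows = e.size - order - horizon + 1
    # Regressors: `order` consecutive past samples per row
    X = np.column_stack([e[i:i + rows] for i in range(order)])
    # Target: the sample `horizon` steps after the last regressor
    y = e[order + horizon - 1: order + horizon - 1 + rows]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coeffs
    return float(1.0 - residual.var() / y.var())

rng = np.random.default_rng(3)
n = np.arange(2000)
good_loop_error = rng.normal(0, 1, n.size)                    # white noise
bad_loop_error = np.sin(2 * np.pi * n / 50) + rng.normal(0, 0.1, n.size)
good_index = predictability_index(good_loop_error)
bad_index = predictability_index(bad_loop_error)
```

An index of this kind needs only the archived setpoint and controlled variable, so it can be run over thousands of loops without knowing each controller's structure.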

UC5: Evaluation of PID performance

18

UC7: LHC circuit monitoring

Condition monitoring analysis (in collaboration with TE-MPE)

› Main Goal: evaluation of the superconducting circuits health

Degradation after 20 years of operations

Monitoring conditions: anomalous change of current flows, impedance, circuit functioning …

› What to monitor?

Electrical circuits

magnets, power converters, switches …

› Control system: 16 WinCC OA servers

44 industrial FECs

2800 radiation-hard devices

~ 500M Signals

Readout (from 10 kHz to 1 Hz)

19

Rule definition: Truth(sma(I_Meas, 1m30s) > I_Threshold): duration(>=1h)

UC7: LHC circuit monitoring

• The current analysis workflow is inefficient

• Manual data extraction, transformation and load

• Many independent scripts

• Time consuming

• New expert system as common framework

• Translate experts’ knowledge into formulation sets / rules

• Central knowledge database

• Rule template to be reused, parametrized, validated

• Domain specific language for simple formulation:

• Time reasoning and temporal expression

• Mathematical and logical functions

• Status: under development [lab testing]!
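The example rule on this slide, Truth(sma(I_Meas, 1m30s) > I_Threshold) with duration(>=1h), can be read as "the 90-second simple moving average of the measured current stays above a threshold for at least one hour". The sketch below evaluates that reading on a uniformly sampled signal. The interpretation of the DSL, the function names and the sample data are assumptions, since the rule language is still under development and its semantics are not specified here.

```python
import numpy as np

def sma(values, window):
    """Trailing simple moving average; the first window-1 outputs are NaN."""
    x = np.asarray(values, dtype=float)
    out = np.full(x.size, np.nan)
    if x.size >= window:
        csum = np.cumsum(np.insert(x, 0, 0.0))
        out[window - 1:] = (csum[window:] - csum[:-window]) / window
    return out

def rule_fires(i_meas, i_threshold, sample_period_s,
               sma_window_s=90, min_duration_s=3600):
    """True if sma(I_Meas, 90 s) > I_Threshold holds continuously for >= 1 h."""
    window = int(round(sma_window_s / sample_period_s))
    needed = int(round(min_duration_s / sample_period_s))
    above = sma(i_meas, window) > i_threshold    # NaN compares as False
    run = 0
    for flag in above:
        run = run + 1 if flag else 0             # length of the current run
        if run >= needed:
            return True
    return False

# Hypothetical current readings sampled every 10 s
long_excursion = np.concatenate([np.full(720, 100.0), np.full(540, 120.0)])
short_excursion = np.concatenate(
    [np.full(720, 100.0), np.full(180, 120.0), np.full(720, 100.0)])
fires_long = rule_fires(long_excursion, 110.0, sample_period_s=10)
fires_short = rule_fires(short_excursion, 110.0, sample_period_s=10)
```

Keeping the window, threshold and duration as parameters mirrors the slide's point about rule templates that can be reused and parametrized across similar assets.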

Distributed complex event processing

Rules

List of similar assets

20

Visualize the results of the analysis to the operators in order to take the proper actions!

UC3: Feed analytical results into the control system

Status: Working prototype

21

Application of PCA (Principal Component Analysis) to detect faults or degradation as early as possible, either to allow preventive maintenance or to alert operators so that they can take optimal corrective action, increasing the uptime of an industrial plant

• PCA:
  • an unsupervised, non-parametric statistical technique
  • used in machine learning to reduce the dimensionality of datasets to the most relevant components

• Applications: CV, CRYO
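A minimal sketch of PCA-based fault detection, assuming the usual approach of fitting PCA on data from normal operation and flagging samples whose reconstruction error (squared prediction error, SPE) exceeds an empirical limit. The class name, the 99th-percentile limit and the synthetic three-sensor process are illustrative assumptions; the slides do not say which monitoring statistic is used.

```python
import numpy as np

class PCAMonitor:
    """PCA model of normal operation with a squared-prediction-error score."""

    def fit(self, X, n_components):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        Z = (X - self.mean_) / self.scale_
        # Rows of Vt are principal directions, ordered by explained variance
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        self.components_ = Vt[:n_components]
        # Empirical control limit: 99th percentile of SPE on training data
        self.limit_ = np.percentile(self.spe(X), 99)
        return self

    def spe(self, X):
        """Squared distance between each sample and its PCA reconstruction."""
        Z = (np.asarray(X, dtype=float) - self.mean_) / self.scale_
        reconstructed = Z @ self.components_.T @ self.components_
        return np.sum((Z - reconstructed) ** 2, axis=1)

# Hypothetical process: three sensors driven by one underlying variable
rng = np.random.default_rng(4)
latent = rng.normal(size=1000)
normal_data = np.column_stack([latent, 2 * latent, -latent])
normal_data += rng.normal(0, 0.1, normal_data.shape)
monitor = PCAMonitor().fit(normal_data, n_components=1)

consistent_sample = np.array([[1.0, 2.0, -1.0]])
faulty_sample = np.array([[1.0, 2.0, 1.0]])      # third sensor breaks the pattern
fault_detected = bool(monitor.spe(faulty_sample)[0] > monitor.limit_)
```

The faulty sample has individually plausible readings; it is flagged only because it violates the correlation structure learned from normal operation, which is the kind of fault fixed per-signal thresholds cannot see.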

Collaboration with U. of Valladolid, Spain

Fault detection applied to industrial process

Contact: Enrique Blanco

22

Extract and identify relevant KPIs from alarms and data via online exploration, thanks to pre-emptive data indexing and CEP pattern matching techniques being researched at Marburg University (ChronicleDB).

• Compare online query performance between Spark + HBase / Kudu / Impala and Marburg’s ChronicleDB, and apply online KPI extraction techniques for predictive improvements of alarms

• Applications: EL distribution, Access Control

Collaboration: Marburg University, Germany

Pattern-based KPI discovery via CEP

Contact: Matthias Braegger, Brice Copy

23

Next use-cases

• Linac3 ion beam source optimization (in collaboration with BE-ABP)
  • Find the optimal settings of control inputs to:
    • Optimize the ion current in the beam transformer of LINAC3
    • Minimize the variance of the ion current
  • Learn from the ~10 years of LINAC3 operation data
  • Assist the operators in choosing the best settings for operation

• Vacuum leak detection (in collaboration with TE-VSC)
  • Critical for the proper operation of the accelerators and LHC machine
  • Initially for SPS, then for all the other vacuum systems
  • Historical analysis of pressure sensors (Pirani and Penning gauges) combined with:
    • beam energy, beam mode and
    • temperature sensors
  • Inform operators to take the proper actions

24

Conclusions

• Data Analytics has an important added value already today to understand the behaviour and optimize complex systems

• Big impact on operation, running and maintenance costs

• BE-ICS working on Data Analytics with Siemens for the last 5 years

• Openlab collaboration

• Growing community of users in different Groups and Departments

• Very distinct use-cases, not only related to controls

• General approach for multi domains application

• Reusability of the developed analysis

25

Use-cases: a partial list

› Online monitoring
  • Control System Health
  • Electrical power quality of service
  • Looking for heat in superconducting magnets
  • Oscillation in cryogenics valves
  • Discharge of superconducting magnets heaters
  • Trending and forecast of the control process behavior
  • Vacuum Leak detection

› Faults diagnosis
  • Anomalies in the process regulation
  • PLC anomalies
  • Data loss detection
  • Root-cause analysis for complex WinCC OA installations
  • Analysis of sensors functioning and data quality
  • Analysis of OPC-CAN middleware
  • Analysis of electrical power cuts
  • Cryogenic system breakdowns

› Engineering design
  • Electrical consumption forecast
  • Efficiency of electric network
  • Predictive maintenance of control systems elements
  • LINAC3 ion beam source optimization
  • Vibration analysis
  • Efficiency of control process
  • …

Thank you!

CERN BE-ICS: https://be-dep-ics.web.cern.ch/

26

UC1: Compensation of the e-cloud thermal effect

› LHC vacuum chamber: cold bore at 1.9 K

Beam screen (5-20 K) to intercept heat load

› Interference with Cryo control system

Ideal measurement cycle

Heat load of the screen

Thermal resistance, R = 6 K/W

Thermal capacitance, C = 1200 J/K

› Main issue: temperature increase close to the quench level trigger!
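To give a feel for the time scales involved, the sketch below treats the beam screen as a single lumped thermal node, C · dT/dt = Q − (T − T_ref)/R, using the resistance and capacitance quoted on the slide (reading "6k/W" as 6 K/W). The lumped first-order model, the 1 W step and the 10 K reference temperature are assumptions for illustration, not the model used by the cryogenics team.

```python
import math

R_THERMAL = 6.0      # K/W, from the slide (read here as kelvin per watt)
C_THERMAL = 1200.0   # J/K, from the slide

def simulate_temperature(heat_load_w, t_start_k, t_ref_k, duration_s, dt_s=1.0):
    """Euler integration of C * dT/dt = Q - (T - T_ref) / R, constant Q."""
    temperature = t_start_k
    for _ in range(int(duration_s / dt_s)):
        d_temp = (heat_load_w - (temperature - t_ref_k) / R_THERMAL) / C_THERMAL
        temperature += d_temp * dt_s
    return temperature

tau_s = R_THERMAL * C_THERMAL            # time constant: 7200 s, i.e. 2 hours

# Hypothetical 1 W step in heat load on a screen initially at its 10 K reference
t_after_tau = simulate_temperature(1.0, 10.0, 10.0, tau_s)
# Closed-form step response of the same model, evaluated at t = tau
analytic = 10.0 + 1.0 * R_THERMAL * (1.0 - math.exp(-1.0))
```

Under these assumptions the screen responds with a time constant of about two hours, which illustrates why a purely reactive feedback loop is slow to counter a sudden electron-cloud heat load.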

In collaboration with TE-CRG

(Benjamin Bradu)

27

UC1: Compensation of e-cloud thermal effect

Qdbs = heat load on the beam screen

Qsr = synchrotron radiation

Qic = image current

Qec = electron cloud

Currently used in production for Cryo:

Keep temperature away from the quench level trigger

Data analytics techniques to reduce computing time from weeks to hours!

• Cloud computing to parallelize and distribute

Compensation due to Feed Forward loops

Feed-Forward loops to compensate electron cloud heat load

28

UC6: Leak detection in Cooling and ventilation systems

• Problem:

• Manually set alarm thresholds

• Changing filling conditions

• Anomaly detection based on historical data

• Detection of “large” leaks:

Anomalous valve opening time

• Detection of “small” leaks:

Anomalous frequency of valve openings

• Achievements:

• Identification of anomalous behaviours

• Improved threshold setting
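A minimal sketch of replacing manually set thresholds with limits learned from historical data, applied to the two indicators named above: valve opening time (large leaks) and frequency of valve openings (small leaks). The percentile choice and the synthetic histories are assumptions for illustration.

```python
import numpy as np

def learn_limits(history, lower_pct=1.0, upper_pct=99.0):
    """Derive data-driven (low, high) limits from a historical distribution."""
    low, high = np.percentile(history, [lower_pct, upper_pct])
    return float(low), float(high)

def is_anomalous(value, limits):
    """True if the value falls outside the learned limits."""
    low, high = limits
    return bool(value < low or value > high)

# Hypothetical history for one filling valve
rng = np.random.default_rng(5)
opening_time_s = rng.normal(60.0, 5.0, 1000)     # duration of each opening
openings_per_day = rng.poisson(4, 365)           # daily number of openings

time_limits = learn_limits(opening_time_s)
count_limits = learn_limits(openings_per_day)

large_leak = is_anomalous(120.0, time_limits)    # valve open far too long
small_leak = is_anomalous(20, count_limits)      # valve opening far too often
```

Recomputing the limits on a rolling history is one simple way to follow changing filling conditions, at the cost of slowly adapting to a leak that develops gradually.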

Distribution of valve openings [FSED_001_VMA400]
