
CERN-ACC-NOTE-2018-0072

Machine learning for early fault detection in accelerator systems

Andrea Apollonio, Thomas Cartier-Michaud, Lukas Felsberger, Andreas Mueller, Benjamin Todd

Keywords: Fault detection, machine learning, reliability


INTRODUCTION

With the development of systems based on a combination of mechanics, electronics and, increasingly, software components, growing system complexity is a de facto trend in the engineering world. Particle accelerators are no exception to this paradigm. The continuous push for higher energies driven by particle physics implies that next-generation machines will be at least one order of magnitude larger and more complex than present ones, posing unprecedented challenges in terms of beam performance and availability. The two most promising approaches discussed at CERN as next-generation projects are the Future Circular Collider (FCC) and the Compact Linear Collider (CLIC), with a size of 100 km and 48 km, respectively (see Fig. 1 and Fig. 2).

Fig. 1 Foreseen layout of the Future Circular Collider.

Fig. 2 Foreseen layout of the Compact Linear Collider.

Maximizing the scientific output of future colliders for a given time and cost directly relates to minimizing their downtime, i.e. the time during which the accelerator is not available for operation. For circular hadron accelerators, a critical failure makes a preventive beam dump by the machine protection system necessary, which requires the accelerator cycle to be run through again before collisions can be re-established. Each of these events has an impact of at least several hours


of operation, with a significant loss of physics production. For linear lepton colliders a relevant aspect is the recovery time required to optimize collisions at the interaction point, which is very sensitive to misalignment and ground motion. Every failure requiring several hours of repairs may therefore also incur a significant set-up time. High availability must therefore be achieved by reducing the number of failures and by limiting the number of interventions, especially considering that the large size of future machines will have an impact on logistics and equipment maintainability. Systems are already required to be remotely maintainable and sufficiently redundant to ensure their functionality even in case of a failure (degraded operation). In this context, advanced system monitoring becomes fundamental to guarantee safe operation of the accelerator. Today most accelerators are run with a corrective maintenance approach during predefined periods of operation (typically several weeks), interleaved with scheduled technical stops for periodic maintenance (a few days); see for example an extract of the 2018 LHC schedule (Fig. 3).

Fig. 3 Extract of the 2018 LHC schedule. Operation is interleaved with scheduled maintenance periods (MD) and Technical Stops (TS).

In most cases, system status is not monitored to a sufficient level of detail to perform condition-based maintenance. In the future, predictive maintenance based on system performance measurements will become a necessity to minimize downtime, with the goal of optimizing interventions and synchronizing them with scheduled technical stops. Methods for the timely detection of upcoming failures will be fundamental to achieve this goal. In this note we therefore explore new methods for early failure prediction based on machine learning techniques. The basic idea of the present study is to identify a use case to demonstrate the potential application of machine learning for failure prediction in the context of particle accelerators.


Fig. 4 Available data sources at CERN used in the present study.

The available data sources for the study are the Accelerator Fault Tracker (AFT), the Alarm System (LASER - LHC Alarm Service Project [1]) and individual system databases, plus the Logging System. The Accelerator Fault Tracker represents the primary source of information for faults causing accelerator downtime. LASER is the main tool used by operators in the CERN Injector Complex for fast failure diagnostics, displaying all the system alarms which led to a failure. Individual system databases or issue tracking tools (e.g. Jira) potentially complement the information from the first two tools with additional details from system experts, when applicable. Data stored in the Logging System provides all ancillary information to retrieve the context in which failures occurred (e.g. beam modes, beam parameters, etc., see Fig. 4).

USE CASE: THE PSB POWER CONVERTER SYSTEM

The Proton Synchrotron Booster (PSB) is the second element in the CERN injector chain, following – until the end of 2018 – Linac2. As of Run 3, in the context of the LHC Injectors Upgrade, Linac4 will replace Linac2 as the first element in the injector chain. The PSB has a radius of 25 m and is composed of four superimposed rings, providing beams to ISOLDE or to the Proton Synchrotron (PS). The PSB has been operational at CERN since 1972, delivering excellent availability performance for the CERN complex. Since 2017, faults in the PSB have been tracked by means of the AFT, which ensures consistency of the data and allows for a direct comparison with the alarms registered and managed by operators via LASER, which is used on a daily basis for fault diagnostics. Figures 5 and 6 show the availability and downtime statistics for the PSB in 2017. The PSB weekly availability ranged from a minimum of 75 % to a maximum of close to 100 %, with an average of 96.11 %. The downtime distribution shows that power converters are the second biggest source of downtime. In this note, a case study for demonstrating the feasibility of predictive fault diagnostics by means of machine learning techniques has been chosen from the available power converter data (both in AFT and LASER).


Fig. 5 PSB weekly availability in 2017.

Fig. 6 PSB system downtime distribution in 2017.

Power converters are among the mission-critical elements of the PSB, and several reported PSB outages were caused by power converter problems. Therefore, these elements are studied in this note for their suitability for predictive maintenance. The list of power converters in the PSB as of August 2019 is shown in Table 1.


Model | Quantity | Where in PSB | In use before LS2 | Controller | Data in LASER
A1/IP | 3 | transfer to PS | yes | G64/MIL1553 | (no)
ACAPULCO | 114 | booster ring | yes | FGC3.1 | yes
BR-MPS | 1 | booster ring | yes | G64/MIL1553 | (no)
BR-Q | 2 | booster ring | yes | G64/MIL1553 | (no)
BR-TRIM | 1 | booster ring | yes | G64/MIL1553 | (no)
CANCUN 50 | 16 | booster ring | yes | FGC3.1 | yes
COMET | 4 | booster ring | yes | FGC3.1 | no
ENE-1 | 10 | booster ring | yes | G64/MIL1553 | (no)
MEGADISCAP_24KA_2015 | 3 | injection | yes | FGC3.1 (not operated yet) | no
POPS-B | 2 | booster ring | (no) | FGC3.1 (but new just before LS2) | no
SIRIUS_P2P_3400 | 12 | injection | no | FGC3.1 | no
SIRIUS_P2P_6700 | 4 | injection | no | FGC3.1 | no
QSTRIP | <9 | booster ring | yes | G64/MIL1553 | yes

Table 1: List of power converters in the PSB as of August 2019.

The data in Table 1 was extracted from the EPC database, an expert database of the TE-EPC group for the operation and maintenance of power converters. The table indicates whether the power converters were in use before Long Shutdown 2 (LS2), as we only consider data from before it. For some power converters it is not yet evident whether data is available in the LASER database. Therefore, in the column 'Data in LASER', an entry '(no)' indicates that LASER data could not be associated with specific power converter modules due to different naming conventions, but there is reason to believe that the data is in principle available in LASER.


1. Existing Workflow

Depending on the accelerator and the system type, the LASER alarm logging system is used in different ways. For power converters, a common workflow is shown in figure 7.

Fig. 7 Existing workflow for fault analysis and recovery.

The workflow is as follows: a problem in some subsystem of the machine triggers an alarm. The triggering mechanism is typically pre-defined by equipment experts. These alarms are signaled to machine operators. If the problem can be understood, they act accordingly. If the problem cannot be understood, the equipment experts are informed. They investigate the alarms, typically using additional information from other databases, such as the Logging database or equipment-specific interfaces, to assess the problem and provide a solution to the operators. Two main limitations of the current workflow have been reported to the authors. Firstly, operators can be overwhelmed by the number of alarms and, as a result, cannot draw useful conclusions from them. Secondly, the current workflow does not enable operators to act preventively to alleviate problems before they occur.

2. Opportunities of Prognostics and Diagnostics

We investigate the feasibility and potential of setting up a preventive warning system. Ideally, such a system should allow operators and equipment experts to identify problems before they cause downtime of an accelerator and help to reduce the number of incidents significantly. In an off-line setting, frequent downtime events can be studied, their root causes investigated, and subsequently mitigated. In an on-line setting, a preventive warning system could continuously analyse system alarms and data to warn of future problems, inform operators and equipment experts, and guide their search for preventive measures, as shown in figure 8.

Fig. 8 Idealized future workflow for fault analysis and recovery.


AVAILABLE DATA FOR THE PSB POWER CONVERTER SYSTEM

LASER is a database managed by the BE-CO-APS Acquisition Team. It serves as the diagnostic tool for operators and focuses on alarms that require human intervention. LASER is not used for interlock purposes. The whole accelerator chain and the technical infrastructure are monitored by means of LASER. As a first step, we decided to focus on data analysis for the Proton Synchrotron Booster (PSB), as this is one of the machines where LASER is exploited to its full potential on a daily basis. For the LHC, for example, the number of alarms is too large to be continuously processed by operators during shifts. If the framework discussed in this note proves any predictive ability, the LHC use case would become an excellent application. From 2011 up to 2017, about 4.365 million raw events have been recorded for the PSB. Each event is broken down into 30 fields, some of which are very detailed. For the purpose of this study, we focus on the following six fields:

FAULT_FAMILY: the type of equipment sending the alarm
FAULT_MEMBER: the name of the component sending the alarm
FAULT_CODE: a code mapped to an error message
PRIORITY: an evaluation of the criticality of the alarm, ranging from 0 to 3
ACTIVE: a flag indicating whether the event is the beginning or the end of the alarm
SYSTEM_TS: the timestamp assigned by the LASER database when the event is received

Each combination (FAULT_FAMILY, FAULT_MEMBER, FAULT_CODE) defines a unique element in the set of alarm definitions. The field PRIORITY is of importance as it indicates the impact or severity of an alarm and its consequences. Therefore, a later goal will be to predict alarms of high priority (associated with the value 3). For extracting the data from LASER, a connection to the Technical Network (TN) is necessary. Although a web interface exists, we used an SQL client and automation scripts writing Comma Separated Values (CSV) files to retrieve the large amounts of data. A simple full outer join has to be performed between a table containing the list of events and a table containing the definition of events. Even if only 6 fields are currently used in our framework, we extracted more fields, which can be used for further studies. It should be noted that this database was initially designed to display alarms to operators in real time, without any plans to use it as a logging system for historical data analysis. Therefore, the consistency of some parameter names over time was not a design principle at first. This can become relevant when extracting data sets covering long periods for a given system. Having extracted as many fields as possible could help us to check the integrity of the data. The 4.356 million events extracted in total for the PSB represent 37 GB stored in raw CSV files. This large size is due to several fields containing long strings describing the event and the repair. As those fields are not filled for each event, it is possible to map timestamps using a small dictionary. A simple zip compression reduced the file size to 173 MB, i.e. from about 9 KB per event to about 40 B. Considering only the 6 mentioned fields reduces the alarm data to 600 MB of raw files and 35 MB of zipped files, which corresponds to 150 B and 8 B per event for raw and zipped files, respectively.
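To illustrate the data reduction step, the following sketch shows how such a CSV extract could be reduced to the six fields of interest with pandas; the file names and the derived ALARM_ID column are hypothetical and only serve to exemplify the workflow described above.

```python
import pandas as pd

# Hypothetical file name for the raw CSV extract produced by the SQL client.
RAW_FILE = "psb_laser_events.csv"

# The six fields used in this study (names taken from the LASER extract).
FIELDS = ["FAULT_FAMILY", "FAULT_MEMBER", "FAULT_CODE",
          "PRIORITY", "ACTIVE", "SYSTEM_TS"]

events = pd.read_csv(RAW_FILE, usecols=FIELDS, parse_dates=["SYSTEM_TS"])

# A unique alarm definition is the triple (FAULT_FAMILY, FAULT_MEMBER, FAULT_CODE);
# here it is collapsed into a single identifier column for convenience.
events["ALARM_ID"] = (events["FAULT_FAMILY"].astype(str) + "/"
                      + events["FAULT_MEMBER"].astype(str) + "/"
                      + events["FAULT_CODE"].astype(str))

# Store the reduced data set in compressed form (gzip) to save disk space.
events.to_csv("psb_laser_events_reduced.csv.gz", index=False, compression="gzip")
```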


ANALYSIS METHODS

In this section, the methodological approach is described. It contains a mathematical formulation of the problem, a description of the machine learning approach used to reach this goal, and the underlying computing infrastructure. The primary objectives are to (1) predict high-impact alarms ahead of time and (2) provide the necessary information to operators and equipment experts to identify preventive measures, as illustrated conceptually in figure 9.

Fig. 9: Simplified illustration of the cross-system prediction approach. Here five alarms in the green system (pre-cursors) induce a severe problem in the blue system.

Mathematical formulation

These objectives can in principle be achieved if predictive patterns within the data can be identified and models of future system behaviour can be learned. We can learn such predictive models by framing the problem as a supervised machine learning problem, as illustrated in figure 10.

Fig. 10: Sliding window approach to frame time-series prediction as a supervised classification problem. Adapted from [AS2019].

The x-axis corresponds to time and the y-axis lists different system alarms. The data points indicate whether at a certain time a specific alarm is active or not. At every time step t_i = i × Δt, we take a multivariate data window of size n covering all system alarms from time step t_{i−n} to t_i as input data. The output data is usually based on a single time step. The latter could be extended to a time window, under the condition that the output window is still reduced to a single vector of output values; here, the value of each alarm is defined by its maximum value inside the output window. The output window is taken with a delay of t_p = p × Δt, the prediction time. Thereby, the mathematical structure of the problem is similar to an image classification problem. Supplying pairs of input and target data for many time steps to machine learning algorithms allows identifying predictive models. When supplied with sufficient learning data, a general approximator, such as a deep neural network, is able to learn any predictive interdependency


between subsystems leading to specific system alarms. Therefore, objective (1), the prediction of high-impact alarms, can be fulfilled. To satisfy objective (2), providing the necessary information to operators and equipment experts to identify preventive measures, we employ methods that quantify the relevance of certain inputs towards the classification output. In this context, we quantify the relevance of specific alarms at certain times in the past (the input data) towards the occurrence of a certain critical alarm in the future output data. This should help system experts and operators in understanding the problem and finding measures to prevent system downtime. Such methods have successfully been applied in image recognition to explain the classification of images [L2019].

Machine Learning Approach

The general machine learning workflow consists of several steps, which are described in more detail with respect to the considered problem in the following. To obtain an unbiased estimate of the predictive ability of models learned from existing data, a part of the data is split off, serving as a validation set to estimate the prediction accuracy of the final model. Since the data is generated from a system undergoing changes over time, several slices of data at different times need to be removed from the learning data for later validation. Visualization of the data allows obtaining further insights into the problem. Figure 11 shows the activity of different alarms (sorted by electrical circuits and alarms on the horizontal axis) as a function of time (vertical axis, sampling time 24 hours) from 2011 to 2018 for data generated by FGC-controlled converters and gateways in the PSB, where FGC is a generic power converter controller platform provided by the TE-EPC group.


Fig. 11: Illustration of system alarms from 2011 to 2018 for FGC-controlled power converters. The color depicts the count of a certain alarm within a discrete time window of 24 hours. For better visualization, counts higher than three are displayed as three. The x-axis depicts different alarms (numbered from 0 upwards) and the y-axis shows a time scale with calendar dates.

Regular patterns can be identified within the data set, and they change abruptly with time. To better investigate individual signals, a random selection of eight alarm signals over a shorter time span is plotted in figure 12.

Fig. 12: Zoom into the illustration of system alarms, selecting eight alarms over a shorter time span. The color depicts the count of a certain alarm within a discrete time window of 24 hours. For better visualization, counts higher than three are displayed as three.

It can easily be seen that certain signals have a pairwise similarity over time. To quantify a measure of linear similarity, the correlation coefficient can be calculated; it is illustrated in figure 13.
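As a minimal sketch of this step, the correlation matrix can be computed directly with pandas, assuming the 24-hour alarm counts shown in figures 11 and 12 are held in a DataFrame with one column per alarm signal (the variable name alarm_counts is a placeholder):

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_alarm_correlations(alarm_counts: pd.DataFrame) -> pd.DataFrame:
    """Compute and display the pairwise Pearson correlation matrix of the
    alarm count signals (rows: 24 h time bins, columns: alarm signals)."""
    corr = alarm_counts.corr()  # linear similarity between alarm signals
    plt.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    plt.colorbar(label="correlation coefficient")
    plt.xticks(range(len(corr)), corr.columns, rotation=90)
    plt.yticks(range(len(corr)), corr.columns)
    plt.tight_layout()
    plt.show()
    return corr
```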


Fig. 13: Correlation matrix of the eight selected alarm signals.

As shown in figure 10, the time-series problem can be formulated as a supervised classification problem using a sliding window approach. This transforms a time series of N discrete time steps into N pairs of input and target data for supervised learning. The input data consists of a 2D window containing several alarms over discrete time steps. The target data is a label, which is '0' if there is no severe alarm or outage within the specified future time window and '1' otherwise, as shown in figure 14.

Figure 14: Examples of "class 0" and "class 1" are provided, using the same conventions as in figure 10. An input window is labeled as class 0 if no priority 3 alarm is found in the output window, and as class 1 otherwise.
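A minimal sketch of this sliding-window transformation is given below; the function name and arguments are illustrative, assuming the alarm activity has already been discretised into a matrix with one row per time step Δt and one column per alarm.

```python
import numpy as np


def build_windows(activity, is_priority3, n_in, p, n_out):
    """Frame a multivariate alarm time series as a supervised classification set.

    activity     -- array of shape (T, n_alarms): alarm activity per time step
    is_priority3 -- boolean array of shape (T,): a priority 3 alarm is active
    n_in         -- input window size in time steps
    p            -- prediction delay in time steps
    n_out        -- output window size in time steps
    """
    X, y = [], []
    T = activity.shape[0]
    for t in range(n_in, T - p - n_out + 1):
        X.append(activity[t - n_in:t])            # input window ending at t
        out = is_priority3[t + p:t + p + n_out]   # delayed output window
        y.append(1 if out.any() else 0)           # class 1 if any severe alarm
    return np.asarray(X), np.asarray(y)
```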

Training of predictive models can be obstructed by heavily imbalanced data sets [HHY2013], i.e. when there are many fewer input-target pairs labelled '1' than pairs labelled '0'. This is often the case when working with real data (in our case no alarm, '0', is generally much more frequent than alarm, '1'). Therefore, so-called re-sampling can be employed to correct for this. In the simplest case this is done by randomly removing input-target pairs of the more frequent class (or randomly replicating the less frequent class). Furthermore, to make the predictions more robust, it is possible to use data augmentation, e.g. by shifting and scaling the input data in time to emphasize the interdependencies between data points instead of their exact timing. Data filtering is used to reduce the data size without reducing its information content. For example, there might be alarms that replicate each other. Such alarms can be removed as they do not add additional insight and would only slow down the computational training process.
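A simple sketch of the random undersampling mentioned above could look as follows (the ratio argument and function name are illustrative):

```python
import numpy as np


def undersample_majority(X, y, ratio=1.0, seed=0):
    """Randomly drop majority-class samples until their number is at most
    ratio * (number of minority-class samples); ratio = 1.0 gives a balanced set."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    major, minor = (idx0, idx1) if len(idx0) > len(idx1) else (idx1, idx0)
    n_keep = min(len(major), int(ratio * len(minor)))
    keep = rng.choice(major, size=n_keep, replace=False)
    sel = np.sort(np.concatenate([keep, minor]))  # preserve temporal order
    return X[sel], y[sel]
```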


Traditional dimensionality reduction approaches are not applied to the dataset, as they often do not allow a back-transformation to the original data space. However, this would be required to achieve objective (2), root-cause identification of the outage. For model training and selection, the data is split into training and test sets. For time-series problems, the data is split in time without changing the temporal order of the events, in order to obtain an unbiased estimate of the prediction error in the presence of underlying system changes. That is, if the data is ordered by time, we split it into a training part before and a test part after a certain timestamp. This ensures that we test whether our models can actually predict the future. If we used a cross-validation split instead, the algorithm would learn from splits across time and validate against splits across time; in this case the estimate of the ability of the models to predict the future would be biased.
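A sketch of such a time-based split, with a hypothetical split date, is shown below; it assumes each input window carries the timestamp of its last time step.

```python
import numpy as np


def temporal_split(X, y, timestamps, split_time):
    """Time-based split: samples before split_time form the training set,
    samples at or after split_time form the test set (no shuffling)."""
    timestamps = np.asarray(timestamps)
    train_mask = timestamps < split_time
    return (X[train_mask], y[train_mask]), (X[~train_mask], y[~train_mask])


# Example with a hypothetical split date: train on data before 2017,
# test on 2017 onwards.
# (X_tr, y_tr), (X_te, y_te) = temporal_split(
#     X, y, window_end_times, np.datetime64("2017-01-01"))
```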

There are several parameters in the problem formulation requiring optimization, such as the time discretisation, the size of the input window, the prediction time and the size of the output window (see figure 10). Except for the prediction time, these parameters have an impact on the computation time as well as on the performance of the network. We optimize them by manually defining a grid of parameter combinations and successively refining it around the combinations which achieve low prediction errors. The predictive models are trained using standard deep learning architectures. We re-used benchmarked algorithms from [F2019], which showed state-of-the-art performance over a range of comparable tasks. To achieve objective (2), root-cause identification, we employ an explainable AI methodology developed for image classification analysis [L2019]. It provides the input features which are most relevant for the image classification. In our context, we can use it to identify and highlight the input alarms which are most relevant for the prediction of future alarms, to assist the root-cause search of failures. We adopted the repository provided in [A2019].

The investigated accelerator systems undergo smaller design changes and replacements of subsystems over the years. Therefore, in the future it is also considered to apply an optimal re-training rate of the models over time to adapt to these changes. As known examples of failure dependencies and the corresponding data are not at hand, we pursue an iterative approach to validate the analysis. In a first step, we generate artificial data of intra-system failure dependencies and apply the proposed machine learning approach to it. Thereby, we validate that both objectives (1) and (2) can be achieved with the chosen learning algorithms. By varying the ratio between events in the artificially created pattern and a background noise of events, we can estimate the maximum level of information dilution that the algorithms can tackle. Introducing artificial patterns into real data is another way to determine how evident a dependency needs to be in order to be detected by our algorithms. As a next step, the approach is applied to a subset of real data. Once predictive patterns and relevant alarm sequences leading to future system alarms have been identified, we will try to validate them with system experts. Each of these steps will probably require further sub-iterations and trigger learning processes in terms of choice of parameters, strengths and weaknesses of predictive models, required input data, etc. A fully developed system could scale across CERN infrastructures to assist operations in an online and offline fashion and provide insights not identifiable solely by human expertise.

Computing Environment

The methodology is built upon a hardware and software stack which is publicly available or provided by CERN. For data handling and manipulation, scripts are written in Python using the pandas, numpy, matplotlib, pyspark and seaborn packages, among others, and executed on local machines, the SWAN interface and the Spark clusters provided by CERN. The SWAN Spark cluster could be particularly useful as it allows loading and processing large amounts of data without the bottleneck of RAM quantity and with fast execution on up to 128 cores.


For the training of predictive models, code was written in Python using the pandas, keras and tensorflow packages, among others, and is executed on local machines, the SWAN interface, the CERN batch computing system and the Spark clusters provided by CERN. In order to parallelize the training of models, we put in place a framework allowing jobs to be run on the CERN HTCondor cluster (190 000 cores). This solution does not speed up an individual training compared to local computing, as for now we only use one core per training, but it allows running trainings in parallel with different hyper-parameters and hypotheses. Using many cores or GPUs for a single training would speed up each training, but this has not yet been successfully addressed.


PRELIMINARY RESULTS

The results are presented in two separate subsections. The first focuses on synthetic data, which helps to understand, test and optimize the developed framework. The second uses LASER data and highlights first results.

1. Synthetic Data Experiments

To evaluate the suitability, performance and limits of applicability of the proposed machine learning approach, experiments with synthetic data were carried out. To this end, a simple data generator was built, as described in figure 15.

Figure 15: Sketch of the synthetic data used. Two kinds of errors are present: the precursor and the consequence. Two occurrences of the precursor (green crosses) cause a consequence (red crosses). The parameters are: the duration of a precursor (light blue arrow), the duration between precursors (purple arrow), the duration between the last precursor and the consequence (dark blue arrow), the duration of the consequence (light green arrow) and the time delay between the end of the consequence and the following precursor (intermediate blue arrow).

To define the system, five durations are needed: the duration of a precursor, the duration of a consequence, the duration between consecutive precursors, the duration between the last precursor and the consequence, and finally the duration between the end of the consequence and the first following precursor. In our artificial data, each of these durations can be deterministic, follow an exponential distribution (a single parameter defines the mean and the standard deviation), or follow a normal distribution (two parameters define the mean and the standard deviation). In this study, the classifier was set to "FCN", as it was the best performing with respect to [F2019]. The number of epochs was set to a maximum of 1000, with an early stop if a series of 100 epochs showed no improvement. Other common parameters are the prediction time, set to 0, and the output window size, set to 1, except for the scan dedicated to these parameters. The Jupyter notebooks used to generate the artificial data, to train the classifier and to study the results are available on GitLab. In order to avoid re-running the simulations, which can be time consuming on SWAN or a local machine, the results of the scans are also available on request in the CERNbox account of the project (ml-for-alarm-system/private/scanFakeData).

To measure the performance of the models, common classification metrics will be used: the prediction accuracy, the precision and the recall; training is based on a cross-entropy loss function.

Table 2: Definitions used to compute precision and recall. Rows give the actual value, columns the value predicted by the algorithm.

Actual \ Predicted | Negative = 0 | Positive = 1
Negative = 0 | True Negative | False Positive
Positive = 1 | False Negative | True Positive

Accuracy = (True Negative + True Positive) / Number of elements

Precision = ½ True Positive / (True Positive + False Positive) + ½ True Negative / (True Negative + False Negative)

Recall = ½ True Positive / (True Positive + False Negative) + ½ True Negative / (True Negative + False Positive)

Given y_i the actual value of an element and y'_i the predicted value of that element, the cross-entropy loss function is

Loss = −(1/N) Σ_{i=1}^{N} [ y_i · log(y'_i) + (1 − y_i) · log(1 − y'_i) ]

It is worth noting that an algorithm which always predicts class 0 would have:
an accuracy equal to the percentage of class 0 elements in the data set,
a precision equal to half the percentage of class 0 elements in the data set,
a recall equal to 50 %.
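The following sketch implements the class-averaged precision and recall defined above (no guard against empty denominators is included in this sketch):

```python
import numpy as np


def classification_metrics(y_true, y_pred):
    """Accuracy and the class-averaged precision and recall defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = 0.5 * tp / (tp + fp) + 0.5 * tn / (tn + fn)
    recall = 0.5 * tp / (tp + fn) + 0.5 * tn / (tn + fp)
    return accuracy, precision, recall
```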

For the following studies, unless specified otherwise, the baseline parameters are the following. The format x ± y is used to express the distribution of a duration as a normal law with mean x and standard deviation y. If a call to the random generator returns a negative value, we simply make a new call, to ensure the positivity of the time delay and thus the causality of the artificial data.

Duration of the time series = 1 month
Number of precursors = 2
Delay between precursors = 600 ± 0 s
Duration of a precursor = 60 ± 0 s
Delay between last precursor and consequence = 300 ± 0 s
Duration of the consequence = 60 ± 0 s
Delay between consequence and first following precursor = 4 ± 0 h
Discretization time step Δt = 1 min
Input window size = 2 h
Output window size = 60 s
Training algorithm: FCN
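As an illustration of the data generator with these baseline parameters, the simplified sketch below produces precursor and consequence events on a continuous time axis; the actual notebooks on GitLab implement the full generator, the function names are illustrative, and the redraw on negative values implements the rule stated above.

```python
import numpy as np

rng = np.random.default_rng(0)


def positive_delay(mean, std):
    """Draw a duration from a normal law; redraw if non-positive to keep
    causality (for std = 0 this reduces to the deterministic value `mean`)."""
    value = rng.normal(mean, std)
    while value <= 0:
        value = rng.normal(mean, std)
    return value


def generate_events(total_duration=30 * 24 * 3600):
    """Generate (start_time, duration, kind) events with the baseline parameters:
    two precursors 600 s apart, a consequence 300 s after the last precursor,
    and a 4 h gap between a consequence and the next precursor."""
    events, t = [], 0.0
    while t < total_duration:
        t += positive_delay(4 * 3600, 0)                         # gap before first precursor
        events.append((t, positive_delay(60, 0), "precursor"))   # first precursor
        t += positive_delay(600, 0)                              # delay between precursors
        events.append((t, positive_delay(60, 0), "precursor"))   # second precursor
        t += positive_delay(300, 0)                              # delay to the consequence
        events.append((t, positive_delay(60, 0), "consequence"))
    return [e for e in events if e[0] < total_duration]
```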

a) Number of class 1 occurrences in training data

Studying the impact of the time-series length with the simplest type of synthetic data (deterministic time delays), it appears that the minimum number of class 1 elements is 4, see figure 16. Having extremely long time series does not improve the performance, as precision and recall reach 1.0 as soon as 4 class 1 elements are present in the data. The computation time scales linearly with the length of the time series.

b) Ratio between class 0 and class 1

In our data set, class 1 elements are scarce while class 0 elements are overabundant. Keeping as many class 1 elements as possible while decreasing the number of class 0 elements to vary the ratio between the classes, it appears that the FCN classifier can learn from data sets with quite unbalanced classes, and it seems better to keep as many class 0 elements as possible rather than to work with a balanced set. For artificial data based on deterministic patterns, no variation in performance has been detected for ratios in the range 10 % to 50 %. For non-deterministic patterns a slight improvement can be seen when the ratio decreases, but this is still under investigation as it might be an artefact.


c) Input window size

Scanning the input window size (at a given discretisation, thus increasing the time span covered by the input window) showed a plateau in computation time for small input windows (n ≤ 30), possibly due to memory cache effects; the computation time then increases as √n for larger values of n. This trend is not due to the early stopping, as it was deactivated for this study, nor is it due to the slight reduction of the number of training samples caused by edge effects when increasing the input window. The impact of the input window size on the performance of the algorithm strongly depends on the parameters used to generate the artificial data. Indeed, if the combination of input window and prediction time cannot capture both a precursor in the input window and the consequence in the output window, because their spans do not allow it, nothing can be predicted. If the computational resources allow it, maximising the input window size is a way to increase performance.

d) Duration between two precursors

Whether the delay between two precursors is deterministic or not does not change the performance of the prediction, as long as the duration between the last precursor and the consequence is constant. Indeed, even if the first precursor is too far from the second precursor for both of them to fit inside the input window, the results are still the same.

e) Duration between last precursor and consequence

To come one step closer to the actual LASER data, we study here the impact of a random delay between the last precursor and the consequence. This delay follows a normal law with a mean of 300 s and a varying standard deviation. The x-axis of figure 17 is in logarithmic scale, so the deterministic case (standard deviation = 0) cannot be plotted, but its performance is almost the same as for a standard deviation of 10 seconds. In this study the performance decreases quite fast: for a standard deviation of 2 min the predictive power almost ceases.

Figure 16: FCN algorithm performance w.r.t. the number of class 1 elements in the training data set.


This can be explained by the discretisation time step of Δt = 1 min, by the topology of class 0 and class 1 elements resulting from the large standard deviation, and by the resampling strategy. Indeed, for time series it is useful to use the class 0 elements directly preceding and following a class 1 element to constrain as well as possible the conditions for being class 1, especially with respect to the time delay between the last precursor and the consequence. In the present case this leads to always selecting the three elements displayed in figure 18.

Figure 18: The size of the output window is set to 1; then only 1 configuration out of the 3 displayed is class 1.

Figure 17: The FCN algorithm performance w.r.t. the standard deviation of the delay between the last precursor and the consequence.


Unfortunately, if the time delay of the consequence has a significant probability of falling into a neighbouring cell, this leads to identical inputs corresponding to both class 0 and class 1, as shown in figure 19. In this particular case, the content of the input window of the top example is different from that of the two following ones, but the second and third are exactly the same while being sorted into two different classes because of the content of the output window. To tackle this problem, one can increase the size of the output window.

Figure 19: The second and third configurations have similar content in the input window while they are classified as class 1 and class 0, respectively, because of the large standard deviation (compared to the discretization time) of the time delay between the last precursor and the consequence.

f) Number of elements in the output window

The study of the output window size is directly linked to findings a) and b), which show that the number of possible class 1 events drives the performance of a classifier. Indeed, increasing the size of the output window increases the number of class 1 elements to train on, as shown in figure 20.

Figure 20: The first and the second configurations are classified as class 1 thanks to the larger output window.


Increasing the output window size also solves the problem described in e), as it avoids having one input pattern which can be classified both as class 0 and as class 1, see figure 21.

Figure 21: The second and the third configurations are classified as class 1 thanks to the larger output window.

In figure 22 the performance of the networks is depicted when varying the size of the output window. Several notable effects are:

1) The ratio of class 0 decreases (the ratio of class 1 increases) due to the effects presented in figures 20 and 21. Having more time frames classified as class 1 helps to balance the ratio between the two classes.

2) For the range Dto = 1 to 16, performance increases strongly: at Dto = 1 the network does not predict anything (precision and recall are below the ratio of class 0 in the data set and the accuracy of the best model is equal to the ratio of class 0 in the data set), while at Dto = 16 precision, recall and accuracy reach 96 %.

3) Finally, for the maximum size of Dto = 32 - a time span of 32 min with respect to the 1 min time discretisation - both precursors and the consequence fit inside the output window, apart from possible outliers generated by the normal distribution. This situation leaves very few class 1 occurrences from which to learn a pattern.

Figure 22: The FCN algorithm performance w.r.t. the output window size.


2. Field Data Experiments

The results of the synthetic data experiments showed that the proposed machine learning approach allows inferring predictive patterns and their root causes despite small available data sets. This was an encouraging result to carry out experiments with real data collected by LASER. The study focused on predicting priority 3 alarms of systems controlled by FGC controllers. The actual data recorded from operations is characterised by relatively few priority 3 alarms. Since the algorithms need a sufficient number of priority 3 alarms to identify predictive patterns, a manual search was performed to find systems with relatively frequent priority 3 alarms. Three power converters were identified with promising sets of data for pattern mining: BR4.DHZ8L1 (model ACAPULCO), BR4.DHZ3L4 (model ACAPULCO), and BR2.XNO9L1 (model CANCUN 50). For each of these converters, LASER data was collected from the operation years 2014 to 2017. Non-operational periods were removed. On these data sets, an exhaustive search for predictive patterns was performed using different algorithmic parameters and data configurations. In terms of algorithmic parameters, the subsampling ratio, the majority class, the assignment of class weights, the train-to-test-data size ratio, the sampling or discretisation time, the input and output window sizes, and the selected learning algorithm (FCN, MLP, and timeCNN [F2019]) were varied. With respect to data configurations, all possible priority 3 alarms of each converter were studied as outputs. As inputs, priority 2 and priority 3 alarms of the considered converter, optionally complemented by external interlocks, were used. Depending on the chosen combination of algorithmic parameters and data configuration, different predictive accuracies of the trained models were achieved. In figure 23, an example of a model with high predictive accuracy for predicting fault code [= y column 3] is shown for converter BR2.XNO9L1, using

a discretization time dt = 6 h,
an input window size of 30 days,
an output window size of 4 days,
a fraction of 79 % of class 0 outputs in the test data set,
a ratio of 30 % of test data (70 % training data), and
the FCN training algorithm.

This model achieved a test accuracy of 94 %. Figure 23 shows an excerpt of the prediction and root-cause analysis results obtained for the test data set. Each line represents an input data item of the test data set. The first three lines show input data which does not lead to a priority 3 alarm in the future. The last line shows input data which does lead to a priority 3 alarm in the future. The text to the left of the first column shows the true ('label') and the predicted ('pred') class of the data item. The first column shows the input matrix as a heat map (size 60 x 120, representing 60 alarm types and 120 input time steps; darker values represent higher alarm activity). The x-axis (horizontal direction) shows the different alarm types and the y-axis (vertical direction) the time (both unlabeled for the sake of clarity), with the upper end being the time step immediately before the predicted output (i.e. the predicted class). The second to fourth columns show the relevance of the inputs towards the predicted class (i.e. whether a priority 3 alarm will appear in the future) and help to guide root-cause investigations. The darker the color of the heat map, the higher the relevance of the alarm at a specific time. The three methods quantifying input relevance, Input * Gradient, LRP-Z and LRP, are described in [SG2017], in [MG2019] (under 'Basic Rule (LRP-0)') and in [MG2019] (under 'Epsilon Rule (LRP-e)'), respectively.


Fig. 23: Graphical illustration of predictive performance and root cause identification of the proposed framework.
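A minimal sketch of how such relevance maps could be obtained for a trained Keras model with the iNNvestigate toolbox [A2019] is given below; the analyzer names correspond to Input * Gradient and the LRP variants, the model and x_test arguments are placeholders for objects produced earlier in the pipeline, and the exact API should be checked against the toolbox documentation.

```python
import innvestigate
import innvestigate.utils as iutils


def compute_relevance_maps(model, x_test):
    """Relevance maps for a trained Keras classifier using Input * Gradient
    and two LRP variants, as provided by the iNNvestigate toolbox."""
    model_wo_softmax = iutils.model_wo_softmax(model)  # strip the output softmax
    methods = ["input_t_gradient", "lrp.z", "lrp.epsilon"]
    # Each relevance map has the same shape as the input windows: high values
    # point to the alarms and time steps that drove the prediction (cf. figure 23).
    return {
        name: innvestigate.create_analyzer(name, model_wo_softmax).analyze(x_test)
        for name in methods
    }
```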

It has to be pointed out that in many cases no predictive patterns could be identified. Since the algorithms demonstrated in the synthetic data experiments that they can identify patterns once these occur a few (>3) times in the training data, this is likely because patterns in the real LASER data either do not exist, are too noisy, or are not repeated a sufficient number of times. In some cases, this can be overcome by collecting more data over longer periods or by adding auxiliary data, which might add the information necessary for patterns to occur regularly in the training data. Therefore, next steps will be to attempt to improve the prediction accuracy of the algorithms by:

- adding auxiliary (non-FGC) data of functionally related systems to the inputs,
- adding beam destination data, which serves as an accelerator state variable for the algorithm,
- employing transfer learning to increase the training data size,
    o in time (e.g. use 2014-2016 data to learn a general model and retrain it solely with 2017 data),
    o across devices (e.g. use all ACAPULCO power converter data to learn a general model and retrain it solely with data from a single ACAPULCO converter),
- testing data augmentation by using statistical up-sampling methods,
- improving the overall performance estimation by more sophisticated splitting into training, validation and test data sets, as proposed for time-series problems [BB2012, BHB2018].

Once the algorithms demonstrate the ability to learn robust predictive patterns and their causal relationships, the results should be validated by confirming their correctness and usefulness with the help of system experts.



SUMMARY, CONCLUSIONS and OUTLOOK

We proposed a scalable machine-learning framework to predict and understand non-trivial system-interdependent alarms and their root causes. It is built upon recently proposed deep learning methods for multivariate time series and explainable AI methods pioneered in image recognition. Tests with synthetic data have shown the feasibility of the approach in small-data scenarios. Applications to field data of the PSB demonstrated that the methodology could prove useful in realistic scenarios of accelerator operation and diagnosis. This might be especially valuable for the operation of increasingly complex infrastructures in which manual failure prognosis and diagnosis become infeasible. The biggest limitations stem from the limited availability of historic data on system behaviour. We plan to mitigate this by using auxiliary inputs on the machine and system states, by transfer learning, and by data-augmentation strategies. This topic is of particular relevance in the context of particle accelerators and will be further addressed by a dedicated PhD thesis in the coming years, dealing with the modelling of complex systems with data limitations and class imbalance (e.g. applied to the analysis of Unidentified Falling Objects). Further steps will include a continuation of the validation of the method from a theoretical perspective with synthetic data, and continued practical validation of the framework's alarm prognosis and diagnosis results for real-world accelerators with system experts.

References

[1] Z. Zaharieva and M. Buttner. "CERN Alarms data management: state and improvements." CERN, Geneva, Switzerland. CERN-ATS-2011-204.

[AS2019] Antonello, Federico, and Ugo Gentile. "Functional analysis by machine learning for criticalities identification in complex interconnected infrastructures." Machine Learning and Artificial Intelligence activities and results in the EN Department (2019). https://indico.cern.ch/event/811475/

[L2019] Lapuschkin, Sebastian, et al. "Unmasking Clever Hans predictors and assessing what machines really learn." Nature Communications 10.1 (2019): 1096.

[HHY2013] He, Haibo, and Yunqian Ma, eds. Imbalanced Learning: Foundations, Algorithms, and Applications. John Wiley & Sons, 2013.

[F2019] Fawaz, Hassan Ismail, et al. "Deep learning for time series classification: a review." Data Mining and Knowledge Discovery 33.4 (2019): 917-963.

[BB2012] Bergmeir, Christoph, and José M. Benítez. "On the use of cross-validation for time series predictor evaluation." Information Sciences 191 (2012): 192-213.

[BHB2018] Bergmeir, Christoph, Rob J. Hyndman, and Bonsoo Koo. "A note on the validity of cross-validation for evaluating autoregressive time series prediction." Computational Statistics & Data Analysis 120 (2018): 70-83.

[A2019] Alber, Maximilian, et al. "iNNvestigate neural networks!" Journal of Machine Learning Research 20.93 (2019): 1-8.

[SG2017] Shrikumar, Avanti, Peyton Greenside, and Anshul Kundaje. "Learning important features through propagating activation differences." Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.

[MG2019] Montavon, Grégoire, et al. "Layer-wise relevance propagation: an overview." Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, Cham, 2019. 193-209.
