TRACKING OF FAULTS AND FOLLOW-UP

Accelerator Fault Tracking project

Jakub Janczyk (TE-MPE-PE / BE-CO-DS), with input from: Andrea Apollonio, Chris Roderick, Rudiger Schmidt, Benjamin Todd, Daniel Wollmann

R2E/Availability Workshop 2

Agenda

• Purpose of fault tracking

• What has been done in the past

• Accelerator Fault Tracking project – plans & status

• Summary

10/14/2014

Purpose of fault tracking

Complete and consistent fault tracking makes it possible to:

• Identify problems as early as possible, allowing timely mitigation

• Identify key issues which will limit the performance of accelerators or equipment in the future (Run2, Run3, HL-LHC)

• Increase availability, in both the short and the long term, by dealing with issues as soon as possible

Track faults in two areas:

1. Faults directly affecting accelerator operation: identify root causes (e.g. R2E effects, glitches in the electrical network, etc.)

2. Equipment (electronic) faults, independently of their immediate impact on accelerator operation

What has been done in the past

• Many different tools for logging faults, used by different teams:

  • eLogbook, Post-Mortem, RadWG page, tools in equipment groups (JIRA, Excel, OneNote, eLogbook)

• A lot of effort was required from individual teams/working groups to gather and exploit fault data

• Nevertheless, difficult to get a consistent picture

[Figure; credit: M. Brugger]

Cardiogram: the "life" of the LHC from an operational point of view

• Graphical analytic tool for combining data from different sources

• Initially created by members of Availability WG: B. Todd, L. Ponce, A. Apollonio

• Tedious work to gather and prepare all the necessary data: several months for the 2010-2012 cardiogram

Cardiogram - example

[Figure: example cardiogram. Labelled elements: Accelerator Mode (Proton Physics, Ion Physics, etc.), Access, Fill Number, Particle Momentum, Beam Intensities, Stable Beams, PM Beam Dump, Beam Dump Classification, Fault, Fault Lines (Systems / Fault Classifications). Credit: AWG]

Cardiogram – data preparation

[Figure; credit: Benjamin Todd]

Accelerator Fault Tracking project

Project launched February 2014 (BE/CO, BE/OP, TE/MPE collaboration)

Based on initial inputs from:

• Evian Workshops

• Availability Working Group

• Workshop on Machine Availability & Dependability for Post-LS1 LHC

• BE/OP

Goals:

• Capture consistent and complete fault data

• Facilitate fault tracking from perspective of all interested parties (OP, equipment groups, working groups)

• Single source of data – easier to complete, clean and analyse.

• Provide consistent / standardized statistics, analyses, reports for different users (8:30 meetings, weekly reports / summaries)

• Interactive overview of faults (cardiogram on demand)

• Proactively identify incomplete data
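
The last goal, proactively identifying incomplete data, could be sketched roughly as follows. This is only an illustration: the `FaultRecord` class and its field names are hypothetical and are not the AFT data model.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class FaultRecord:
    """Hypothetical fault record; field names are illustrative only."""
    fault_id: int
    system: Optional[str]        # e.g. "QPS", "Cryogenics"
    start: Optional[datetime]
    end: Optional[datetime]
    root_cause: Optional[str]    # e.g. "R2E", "electrical network glitch"


def missing_fields(record: FaultRecord) -> List[str]:
    """Return the names of fields that are still empty (None or "")."""
    return [
        f.name
        for f in fields(record)
        if getattr(record, f.name) in (None, "")
    ]


def incomplete_records(records: List[FaultRecord]) -> Dict[int, List[str]]:
    """Map fault_id to its missing field names, for incomplete records only."""
    report = {}
    for record in records:
        missing = missing_fields(record)
        if missing:
            report[record.fault_id] = missing
    return report
```

A report of this kind could be reviewed regularly so that missing end times or root causes are completed while the details are still fresh.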

Plans (as presented by Chris Roderick @ LMC 30-04-2014)

Provide infrastructure to consistently & coherently capture, persist and make available accelerator fault data for further analysis.

Foreseen project stages:

1. Put in place a fault tracking infrastructure to capture LHC fault data from an operational perspective

• Enable data exploitation by others (e.g. AWG and OP) to identify areas to improve accelerator availability for physics

• Ready before LHC beam commissioning

• Infrastructure should already support capture of equipment group fault data, but this is not the primary focus

2. Focus on equipment group fault data capture

3. Explore integration with other CERN data management systems (e.g. Infor EAM)

• Potential to perform deeper analyses of system and equipment availability

• In turn, start predicting and improving dependability

To support data analysis, AFT data extraction infrastructure should also provide data complementary to the actual fault data, such as accelerator operational modes and states.
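
As a rough illustration of combining fault data with complementary data, the sketch below looks up the accelerator mode in effect at a given time (for example, the start of a fault). The `mode_at` helper and the list-of-tuples layout are assumptions made for this example and do not describe the actual AFT extraction interface.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical layout: (time the mode became active, mode name),
# sorted by time.
ModeChange = Tuple[datetime, str]


def mode_at(mode_changes: List[ModeChange], when: datetime) -> Optional[str]:
    """Return the mode active at `when`, or None if before the first entry."""
    times = [t for t, _ in mode_changes]
    # Index of the last mode change that happened at or before `when`.
    idx = bisect_right(times, when) - 1
    return mode_changes[idx][1] if idx >= 0 else None
```

With such a lookup, each fault can be labelled with the mode (Proton Physics, Ion Physics, Access, etc.) that was active when it started.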

Scope:

Initial focus on LHC, but aim to provide a generic infrastructure capable of handling fault data of any CERN accelerator.

[Timeline graphic along a Time axis, with a "We are here..." marker]

Status

• AFT is under development: a Web application, available to different users, with eLogbook integration for LHC operators

• Functionalities available from day 1 will be as planned for the first stage of the project

• AFT test version available

• We’re open to starting discussions with equipment groups

[email protected]

Turnaround Time

Summary

• Consistent and complete tracking of faults is the key to identifying and efficiently mitigating issues

• The AFT will ease the recording of faults and their root causes in a complete and consistent way

• Run2 data will be essential to identify future performance/availability limitations towards HL-LHC

• Quality and completeness of the data require effort from all involved parties

• Open to discussing integration of equipment group data

Questions

Extra Slides

Roles and simplified workflow

[Figure panels: 2010, 2011, 2012]

Multiple failures

• It is easy to see if there are multiple failures at the same time, but it’s not obvious if they are related.

• One of the goals of the AFT project is to capture data that will make it possible to show the relations between faults.
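
One way to picture this distinction is the sketch below, which separates simultaneous faults into related and unrelated pairs. It assumes a hypothetical `caused_by` link recorded on each fault; the `Fault` class and its fields are illustrative only and are not the AFT data model. Only direct links are considered.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault with an optional link to the fault that caused it."""
    fault_id: int
    system: str
    start: datetime
    end: datetime
    caused_by: Optional[int] = None


def overlaps(a: Fault, b: Fault) -> bool:
    """True if the two faults share some time interval."""
    return a.start < b.end and b.start < a.end


def classify_overlaps(
    faults: List[Fault],
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Split simultaneous fault pairs into (related, unrelated) id pairs."""
    related, unrelated = [], []
    for a, b in combinations(faults, 2):
        if not overlaps(a, b):
            continue
        pair = (a.fault_id, b.fault_id)
        # A pair is related only if one fault directly names the other.
        if a.caused_by == b.fault_id or b.caused_by == a.fault_id:
            related.append(pair)
        else:
            unrelated.append(pair)
    return related, unrelated
```

Without the explicit link, the time overlap alone cannot tell a water leak and its consequences apart from independent faults that happen to coincide.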

[Figure annotations]

• Related faults: a water leak and the problems caused by the water leak

• Unrelated faults: QPS failed, and the rest are accesses in its shadow

Access without faults

• In 2012, there were around 40 accesses without any fault

• The reasons for these accesses are not classified, but often something is repaired

• Inconsistent data: the cardiogram makes it possible to spot this
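
The check that the cardiogram makes visible by eye can also be sketched in a few lines: list the access periods that do not overlap any fault. The interval representation and the function name are assumptions made for this example.

```python
from datetime import datetime
from typing import List, Tuple

# Hypothetical representation: (start, end) of an access or a fault.
Interval = Tuple[datetime, datetime]


def accesses_without_fault(
    accesses: List[Interval], faults: List[Interval]
) -> List[Interval]:
    """Return the accesses that do not overlap any fault interval."""
    return [
        (a_start, a_end)
        for a_start, a_end in accesses
        if not any(
            a_start < f_end and f_start < a_end
            for f_start, f_end in faults
        )
    ]
```

Each access returned by such a check is a candidate for follow-up: either a fault was not recorded, or the reason for the access needs its own classification.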

Access without faults - examples

[Figure annotations]

• A few accesses: ATLAS, change of PC, repair of QPS, intervention on the crates of the BPMD

• LHCb: fixing muon detectors

• Accesses in the shadow of a QPS failure: QPS (reset cards), ALICE and CMS, Cryogenics (valve regulation), RF (replacing a broken attenuator)

• ATLAS access