Efficient Monitoring and Root Cause Analysis in Complex Systems · 2019-11-14 · Efficient...

Efficient Monitoring and Root Cause Analysis in

Complex Systems

Witek Bedyk

Agenda

● Benefits of robust monitoring

● Measurements vs. Alarms

● Importance of Alarms Correlation

● Effective Alerting

● Self-healing

Why is Monitoring useful?

● Improve system / application uptime

● Reduce administration burden

● Resource optimization

● Prevent bottlenecks

● Make use of collected data (e.g. billing)

Why is Monitoring useful?

● Improve system / application uptime

● Reduce administration burden

● Resource optimization

● Prevent bottlenecks

● Make use of collected data (e.g. billing)

Use Case

Customer escalation:

“We have cloud outage! Keystone is flapping up and down continuously and many requests get 503 service unavailable error.”

Healthcheck

Simple HTTP endpoint up or down checks on services.

http_status [0, 1]http_response_time

Metrics

● Metrics measure and report on quantifiable data from your system

● cpu, memory, network, filesystem, disk IO● Services

○ MySQL, RabbitMQ, Apache, MemcacheD, etc.

● LibVirt, Open vSwitch● Applications:

○ StatsD, Prometheus

● Custom checks

Dimensions

● Dimensions are a dictionary of key, value pairs used to describe metrics.

● hostname● service● component● url● device

Transaction-level vs. System-level metrics

● Transaction-level: end user perspective○ Is Horizon working correctly?

● System-level: administrator perspective○ Reveals failures of service components

Dependencies

MemcacheDKeystoneApache

Gathered metrics

http_statushttp_response_timeapache.net.hitsapache.performance.idle_worker_countmysql.performance.open_filesmysql.net.connectionsmemcache.curr_connectionsmemcache.get_misses_rateprocess.cpu_percprocess.open_file_descriptors

Dashboards

Alarms

Status of the system or resource meets criteriaindicating an action is required.

Alarm definitions

● Alarm definitions are templates specifying how alarms should be created.

● grouping

● http_status > 0, match_by: ["service", "component", "hostname", "url"]

● filtering

● avg(cpu.idle_perc{service=monitoring}) < 20

Use case (alarms)

Keystone API is down on node A.Keystone API is down on node A.Keystone API is down on node A.Keystone API is down on node A.

Keystone API is up on node A.Keystone API is up on node A.

MemcacheD number of connections is high on node A.

Keystone API is up on node A.Keystone API is up on node A.

MemcacheD hit rate is low on node A.

Alarms correlation

● “80% of the mean time to repair is wasted on trying to locate the issue” Gartner

● Remove noise from the environment● Alerts should be:

○ meaningful○ actionable○ indicate the point of failure

Vitrage

● OpenStack Root Cause Analysis service

● organize alarms○ define relationships between alarms○ represent as an entity graph

● analyze○ represent system health

● find root cause○ graphical visualization

Dependencies

MemcacheDKeystoneApache

Dependencies

Keystone cluster

Keystone instances

MemcacheD

Dependencies

Keystone cluster

Keystone instances

MemcacheD

Dependencies

Keystone cluster

Keystone instances

MemcacheD

Dependencies

Keystone cluster

Keystone instances

MemcacheD

Dependencies

Keystone cluster

Keystone instances

MemcacheD

Dependencies

Keystone cluster

Keystone instances

MemcacheD

Monitor Analyze Plan Execute (MAPE)

Monitor Execute

Sensors Effectors

Analyze

ManagedResource

Monitor Analyze Plan Execute (MAPE)

Monitor Execute

Sensors Effectors

Analyze

ManagedResource

Vitrage Templates

● Vitrage Templates are used to express Condition Action scenarios.→

● if <condition> then raise deduced alarm● if <condition> then set deduced state● if <condition> then add causal relationship (used for RCA capability)● if <condition> then execute Mistral workflow

Self-healing

Keystone cluster

Keystone instances

MemcacheD

Self-healing

Keystone cluster

Keystone instances

MemcacheD

Self-healing

Keystone cluster

Keystone instances

MemcacheD

Self-healing

Keystone cluster

Keystone instances

MemcacheD

OpenStack Healthcheck APIs

● more detailed checks would be useful for most OpenStack services● common middleware should get implemented in Oslo● existing old effort:

○ https://storyboard.openstack.org/#!/story/2001439○ https://review.opendev.org/617924

Summary

● Robust monitoring is essential

● Measurements vs. Alarms

● Importance of Alarms Correlation

● Self-healing

Thank You谢谢

Questions and Answers

Efficient Monitoring and Root Cause Analysis in Complex Systems · 2019-11-14 · Efficient...

Documents

Transcript of Efficient Monitoring and Root Cause Analysis in Complex Systems · 2019-11-14 · Efficient...

Root Cause Analysis - allaboutmetallurgy.comallaboutmetallurgy.com/.../uploads/2017/...for-Root-Cause-Analysis.pdf · ASQ Quality Press Milwaukee, Wisconsin Root Cause Analysis Simplified

Finding Root Cause

The True Meaning of Root Cause and Root Cause Analysis …mnasq.org/wp-content/uploads/presentations/RCA-Jing.pdf · The True Meaning of Root Cause and Root Cause Analysis (RCA) Gary

CMQ220 Root Cause Analysis Lesson 7 -Root Cause Analysis ...cbafaculty.org/DAU/CMQ220 Root Cause Analysis/CMQ220 Root Cause... · CMQ220 Root Cause Analysis Lesson 7 -Root Cause Analysis

Root cause

Root Cause Flyer

Root Cause Analysis - Lean Construction Institute Annual ... Cause Analysis_Fauchier and... · Root Cause Analysis. Dan Fauchier, ... • Background • Current Condition • Root-Cause

Root Cause Analysis_Linkedin

Root Cause Ppt

Root Cause AnalysisRoot Cause Analysis An Introduction

Root Cause Analysis

Root Cause Analysis - ge.com · Define When a Root Cause Analysis Should be Performed 10 Access a Root Cause Analysis 10 Create a Root Cause Analysis 12 Copy a Root Cause Analysis

Root Cause Failure Analysis - EASA · PDF fileRoot Cause Failure Analysis Root Cause Methodology — Section 1 1 Root Cause Methodology Section Outline Page Introduction to failure

Root Cause Analysis1279

Root Cause Analysis - National Council · PDF fileAgenda •What is root cause analysis? •Why use root cause analysis? •How to use root cause analysis?

4D Reporting the Root Cause Embedding Root Cause Analysis ...

Freeleansite.com Root Cause Analysis - Overview Root Cause Analysis and Corrective Action (RCCA)

Root Cause Analysis Training Investigation process and ... - root cause... · Root Cause Analysis Training Investigation process and ... Keep medical terminology to a minimum ...

Root cause presentation

Root Cause 1