Brighttalk what should we be monitoring - final


Transcript of Brighttalk what should we be monitoring - final

Page 1: Brighttalk   what should we be monitoring - final

By U.S. Navy photo by Mass Communication Specialist 1st Class James E. Foehl [Public domain], via Wikimedia Commons

The Age Old Question: What should our APM Solution be monitoring?

Page 2: Brighttalk   what should we be monitoring - final

Follow Us: #ITSMSummit

Mr. White has fifteen years of experience designing and managing the deployment of Systems Monitoring and Event Management software. Prior to joining IBM, Mr. White held various positions including the leader of the Monitoring and Event Management organization of a Fortune 100 company and developing solutions as a consultant for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US Navy Facilities and Engineering Command.

Andrew White
Cloud and Smarter Infrastructure Solution Specialist
IBM Corporation

Page 3: Brighttalk   what should we be monitoring - final

http://weheartit.com/entry/12433848

Page 4: Brighttalk   what should we be monitoring - final

Ground rules for this session…
•  If you can’t tell if I am trying to be funny…
   –  GO AHEAD AND LAUGH!
•  Feel free to text, tweet, yammer, or whatever. Use #ITSMSummit.
•  If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.

Page 5: Brighttalk   what should we be monitoring - final

My name is Andrew White

I have a lot of experience leading Systems and Event Management teams

Page 6: Brighttalk   what should we be monitoring - final

I am here today to share some of what I have learned about Systems Thinking and APM.

Page 7: Brighttalk   what should we be monitoring - final

*Among adults who accessed the internet with a mobile phone in the past 12 months (n=1,001) – Gomez Mobile Web Experience Survey conducted by Equation Research

58% of mobile phone users expect websites to load as quickly, almost as quickly or faster on their mobile phone, compared to the computer they use at home*

http://www.flickr.com/photos/lucianbickerton/3858380291/sizes/l/

Page 8: Brighttalk   what should we be monitoring - final

*Among adults who accessed the internet with a mobile phone in the past 12 months (n=1,001) – Gomez Mobile Web Experience Survey conducted by Equation Research

60% of mobile web users have had a problem in the past year when accessing a website on their phone*

http://www.flickr.com/photos/rickyromero/1357938629/sizes/l/

Page 9: Brighttalk   what should we be monitoring - final

*Among adults who accessed the internet with a mobile phone in the past 12 months (n=602) – Gomez Mobile Web Experience Survey conducted by Equation Research

Slow load time was the number one issue, experienced by almost 75% of them*

http://bighugelabs.com/onblack.php?id=2497744197&size=large

Page 10: Brighttalk   what should we be monitoring - final

Is 5 seconds really bad?

Page 11: Brighttalk   what should we be monitoring - final

90th Percentile: Start… 5.44 seconds…

Observed Maximum: Start… 15.4 seconds…

Page 12: Brighttalk   what should we be monitoring - final

90th Percentile: Start… 5.44 seconds… DONE!

Observed Maximum: Start… 15.4 seconds…

Page 13: Brighttalk   what should we be monitoring - final

90th Percentile: Start… 5.44 seconds… DONE!

Observed Maximum: Start… 15.4 seconds… DONE!

Page 14: Brighttalk   what should we be monitoring - final

If you were the one on the phone with one of those customers…

how would you fill that silence?

Page 15: Brighttalk   what should we be monitoring - final

The rationality of individuals is limited by the information they have. This causes “The Tragedy of the Commons.”

Page 16: Brighttalk   what should we be monitoring - final

What Is a System?

It is a set of interconnected actors that change over time when they are influenced by other elements of the system.

[Diagram: eight interconnected actors]

Page 17: Brighttalk   what should we be monitoring - final

As we have become more aware that things are always happening, our behavior has changed.

Page 18: Brighttalk   what should we be monitoring - final
Page 19: Brighttalk   what should we be monitoring - final
Page 20: Brighttalk   what should we be monitoring - final
Page 21: Brighttalk   what should we be monitoring - final

We are no longer thinking, we are reacting…

Page 22: Brighttalk   what should we be monitoring - final

http://static4.businessinsider.com/image/5176c232ecad04805d000010-505-277/screen%20shot%202013-04-23%20at%201.09.49%20pm.png

April 23, 2013
•  The Twitter account for the Associated Press was hacked
•  The hackers posted a fake notice that the White House was attacked and President Obama was injured
•  The Dow dropped 150 points in less than 5 minutes

Page 23: Brighttalk   what should we be monitoring - final

Systems Are Volatile

This change makes it difficult to control the behavior of the system. The good news is that systems are perfect. They always deliver the optimum result given a specific stimulus.

Page 24: Brighttalk   what should we be monitoring - final

Anatomy of an Outage
P0 - Affecting Multiple Apps

[Diagram components: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, zOS CICS, WAS (two instances), Database (two instances), zOS MQ, DB2]

1.  5:45-ish pm: CICS ABENDs start flooding OMEGAMON but not high enough to ticket
2.  6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics
3.  6:04pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident
4.  6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
5.  10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
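To put numbers on the delays in this timeline, the timestamps can simply be subtracted. The sketch below is a minimal illustration in Python. It assumes every event happened on the same evening, treats the "-ish" times as exact, and uses an arbitrary placeholder date; the dictionary keys are names invented for this example.

```python
from datetime import datetime

# Times come from the slide; the date is a placeholder and "-ish" times are treated as exact.
DAY = "2013-01-01"
TIMELINE = {
    "first_symptom": "17:45",    # CICS ABENDs start flooding OMEGAMON
    "incident_opened": "18:14",  # Ops Center confirms and creates the P0 incident
    "cics_reset": "22:29",       # support teams decide to reset CICS
}

def minutes_between(start_key, end_key, timeline=TIMELINE, day=DAY):
    """Return whole minutes elapsed between two named points on the timeline."""
    fmt = "%Y-%m-%d %H:%M"
    start = datetime.strptime(f"{day} {timeline[start_key]}", fmt)
    end = datetime.strptime(f"{day} {timeline[end_key]}", fmt)
    return int((end - start).total_seconds() // 60)

print(minutes_between("first_symptom", "incident_opened"))  # 29
print(minutes_between("first_symptom", "cics_reset"))       # 284
```

Under those assumptions, roughly half an hour passed between the first symptom and the incident, and more than four and a half hours passed before the reset decision.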

Page 25: Brighttalk   what should we be monitoring - final

Our Problem Statement:

The business needs to reliably reach its customers and users regardless of where they may be located. Latency forces close geographic proximity of the components and limits the quality of service provided to geographically distributed customers.

If the users can’t use it, it doesn’t work.

Page 26: Brighttalk   what should we be monitoring - final

Our Constraints

At the same time, there are a few inescapable facts we face:
1.  Today’s users demand reliable systems to do their work
2.  Our systems mirror the complexity of the businesses they support
3.  Our environments must be massive to scale to handle the workload
4.  There is too much activity for a single person to be totally situationally aware

Page 27: Brighttalk   what should we be monitoring - final

When all of these happen at the same time…

Ug…

Page 28: Brighttalk   what should we be monitoring - final


Question

Is there a better way to figure out what monitoring would help?

Page 29: Brighttalk   what should we be monitoring - final

What Do You Want To Accomplish?

Your monitoring should help you answer:
•  How will we know if the users are getting the experience they are expecting?
•  How much capacity do we need during normal and peak times to ensure user expectations are met?
•  How quickly can the provider we select ramp up to meet our needs if we find that the service is underperforming?
•  How fast do we need to be able to access additional capacity once it is ready for us?

Page 30: Brighttalk   what should we be monitoring - final


When decisions are not made based on information, it’s called gambling.

Page 31: Brighttalk   what should we be monitoring - final

Composite Applications
•  Site Content Search
•  Session Information
•  User Login & Identity Mgmt
•  Content Mgmt System
•  Social Network Widgets
•  Site Tracking & Analytics
•  Banner Ads & Revenue Generators
•  Multimedia & CDN Content

Page 32: Brighttalk   what should we be monitoring - final

The Same Old Problem

[Diagram components: Corporate LANs & VPNs, ISP Connection, DNS & Internet Services, Content Mgmt System, Social Network Widgets, Site Tracking & Analytics, Banner Ads & Revenue Generators, Multimedia & CDN Content, Home Wireless & Broadband, Mobile Broadband, The Cloud, Distributed, Database, Mainframe, Network, Middleware, Storage]

Is It My Data Center?
•  Configuration errors
•  Application design issues
•  Code defects
•  Insufficient infrastructure
•  Oversubscription issues
•  Poor routing optimization
•  Low cache hit rate

Is It a Service Provider Problem?
•  Non-optimized mobile content
•  Bad performance under load
•  Blocking content delivery
•  Incorrect geo-targeted content

Is It an ISP Problem?
•  Peering problems
•  ISP outages

Is It My Code or a Browser Problem?
•  Missing content
•  Poorly performing JavaScript
•  Inconsistent CSS rendering
•  Browser/device incompatibility
•  Page size too big
•  Conflicting HTML tag support
•  Too many objects
•  Content not optimized for device

Page 33: Brighttalk   what should we be monitoring - final

Cognitive Dissonance

[Diagram components: Corporate LANs & VPNs, Distributed, Database, Mainframe, Network, Middleware, Storage, ISP Connection, DNS & Internet Services, Content Mgmt System, Social Network Widgets, Site Tracking & Analytics, Banner Ads & Revenue Generators, Multimedia & CDN Content, Home Wireless & Broadband, Mobile Broadband. The diagram is divided into “The Part You Control” and “The Part They Experience”.]

All our systems look great, SLAs are being met…

…meanwhile the user is NOT happy

You Have More Control Here Than You Think

Page 34: Brighttalk   what should we be monitoring - final

Gaining Perspective Requires Balance

Four Perspectives of User Experience
1.  Client to the Server
2.  Server to the Client
3.  “3rd Party” Vantage Point
4.  Synthetic Transactions

[Diagram labels: Packet Capture, Synthetic Transactions, Client Monitoring, Server Probe]

Page 35: Brighttalk   what should we be monitoring - final

What Does Good Monitoring Look Like?

Elements of Good Monitoring
1.  System Availability
2.  Operating System Performance
3.  Hardware Monitoring
4.  Service/Daemon and Process Availability
5.  Error Logs
6.  Application Resource KPIs
7.  End-to-End Transactions
8.  Point of Failure Transactions
9.  Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”

[Diagram: the ten elements are marked by number on an architecture made up of Corporate LANs & VPNs, Load Balancers, Firewall, Switch, Web Server Farm, Middleware, Database, Data Power, Mainframe]

Page 36: Brighttalk   what should we be monitoring - final

Finding Metrics That Matter

Evaluating the Effectiveness of a Metric
§  Will the metric be used in a report? If so, which one? How is it used in the report?
§  Will the metric be used in a dashboard? If so, which one? How will it be used?
§  What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?
§  How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?
§  Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?
§  Is the metric always associated with a single problem? Could this metric become a false indicator?
§  What is the impact if this goes undetected?
§  What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?

Page 37: Brighttalk   what should we be monitoring - final

Beware of Averages

Sample values: 0.5, 0.7, 0.9, 1.8, 2.5, 2.5, 2.6, 2.9, 3.3, 3.5

[Chart markers: 25th Percentile, 50th Percentile, 75th Percentile, Average]
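One way to see why the average deserves suspicion is to compute it next to the percentiles for the ten values on this slide. The sketch below uses Python's standard `statistics` module. The slide gives no units, and the choice of the "inclusive" quantile method is an assumption made here; other methods give slightly different quartiles.

```python
import statistics

# Sample values read off the slide (the slide gives no units).
SAMPLES = [0.5, 0.7, 0.9, 1.8, 2.5, 2.5, 2.6, 2.9, 3.3, 3.5]

def summarize(samples):
    """Return the mean next to the 25th, 50th, and 75th percentiles."""
    p25, p50, p75 = statistics.quantiles(samples, n=4, method="inclusive")
    return {"mean": statistics.mean(samples), "p25": p25, "p50": p50, "p75": p75}

summary = summarize(SAMPLES)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With these values the mean (about 2.12) falls below the median (2.5), so the average alone says little about what a typical measurement looks like or how wide the spread is.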

Page 38: Brighttalk   what should we be monitoring - final

What Matters Most?

The Goldman Algorithm

Dr. Lee Goldman

Cook County Hospital, Chicago, IL

§  Is the patient feeling unstable angina?
§  Is there fluid in the patient’s lungs?
§  Is the patient’s systolic blood pressure below 100?

[Chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours, on a scale of 0 to 100, comparing Traditional Techniques with the Goldman Algorithm]

By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.

Page 39: Brighttalk   what should we be monitoring - final

The Goldman Algorithm

[Flowchart]

Patient suspected of Acute Cardiac Ischemia

Perform Electrocardiogram (EKG)

ECG Evidence of Acute Myocardial Infarction (MI)? (Yes / No)
ST-Segment Elevation ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age) or Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)

ECG Evidence of Acute Ischemia? (Yes / No)
ST-Segment Depression ≥ 1 mm in ≥ 2 Contiguous Leads (New or Unknown Age) or T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or Left Bundle-Branch Block (New or Unknown Age)

Urgent Factors Present? (asked on both branches)
•  Rales Above Both Lung Bases
•  Systolic Blood Pressure < 100 mm Hg
•  Unstable Ischemic Heart Disease

Branch labels: 0 Factors, 1 Factor, 2 or 3 Factors, 0 or 1 Factors, 2 or 3 Factors

Risk levels: High Risk, Moderate Risk, Low Risk, Very Low Risk

Destinations: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit

Page 40: Brighttalk   what should we be monitoring - final

Driving the Right Action

Application

End User Experience
•  Gainesville: Transaction 1, Transaction 2, Transaction N
•  San Antonio: Transaction 1, Transaction 2, Transaction N
•  Des Moines: Transaction 1, Transaction 2, Transaction N
•  Columbus: Transaction 1, Transaction 2, Transaction N

Infrastructure
•  Network: KPI 1, KPI 2, KPI N
•  Mainframe: KPI 1, KPI 2, KPI N
•  Storage: KPI 1, KPI 2, KPI N
•  Linux: KPI 1, KPI 2, KPI N
•  Middleware: KPI 1, KPI 2, KPI N
•  Database: KPI 1, KPI 2, KPI N


Page 45: Brighttalk   what should we be monitoring - final
Page 46: Brighttalk   what should we be monitoring - final

Our success in any endeavor depends directly on our ability to solve problems

What do we need to do that?

Page 47: Brighttalk   what should we be monitoring - final

You Gotta Have Skillz…

Page 48: Brighttalk   what should we be monitoring - final

Common Problem Types
§  Design Problems
§  Creative Problems
§  Daily Problems
§  People Problems

Rule-Based Approach

Event-Based Approach

Page 49: Brighttalk   what should we be monitoring - final

Event-Based Problem Solving
•  Appreciative Understanding
•  Know What We Are Solving
•  Create A Common Reality
•  Solutions Based on Causes

Page 50: Brighttalk   what should we be monitoring - final

Rules for Causal Relationships

①  Causes are effects, and effects are causes

Logs Not Truncated (Cause) → Drive Full (Cause/Effect) → Database Down (Effect)

Page 51: Brighttalk   what should we be monitoring - final

Rules for Causal Relationships

②  You can keep identifying causes – there is no limit

Beginning of Time (Cause) → Logs Not Truncated (Cause/Effect) → Drive Full (Cause/Effect) → Database Down (Primary Effect) → End of the Universe (Effect)

Page 52: Brighttalk   what should we be monitoring - final

Two Important Questions

Beginning of Time (Cause) → Logs Not Truncated (Cause/Effect) → Drive Full (Cause/Effect) → Database Down (Primary Effect) → End of the Universe (Effect)

Ask “Why?”

Ask “What?”

Page 53: Brighttalk   what should we be monitoring - final

Rules for Causal Relationships

③  An Effect is often the result of multiple causes

SQL Server was not processing queries (Effect)
•  Transaction log was unable to grow
•  T: Drive at 0 Bytes free
•  Logs were not truncated
•  DBA on honeymoon vacation in Fiji
•  Logs are truncated manually
•  Company has only 1 DBA
•  “Backup” DBA was not aware the logs require truncation
•  Space allocations are fixed
•  Lack of Control

[Diagram: the causes are joined at three -AND- gates]

Page 54: Brighttalk   what should we be monitoring - final

Rules for Causal Relationships

④  Causes need to be both necessary and sufficient

SQL Server was not processing queries (Effect)
•  Transaction log was unable to grow (Transitory Cause)
•  T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
•  Logs were not truncated (Transitory Cause & Effect)
•  DBA on honeymoon vacation in Fiji (Transitory Cause)
•  Logs are truncated manually (Non-Transitory Cause)
•  Company has only 1 DBA (Non-Transitory Cause)
•  “Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
•  Space allocations are fixed (Non-Transitory Cause)
•  Lack of Control

[Diagram: the causes are joined at three -AND- gates]

Page 55: Brighttalk   what should we be monitoring - final

How Fire Works

[Diagram: Oxygen, Heat, Fuel, and Match Strike plotted over Time, labeled Transitory and Non-Transitory, leading to Fire]

Fire: Oxygen -AND- Heat -AND- Fuel -AND- Match Strike

•  Transitory Causes act as catalysts to bring about change (think Transition)
•  Non-Transitory Causes are objects, properties/attributes, and status

Page 56: Brighttalk   what should we be monitoring - final

RCA Diagram

•  Customers Complaining
•  Trying to do business on the website (Desired Condition)
•  Web Server returning 500 errors
•  The application server was timing out
•  Only one application server exists (More Information Needed)
•  SQL Server was not processing queries
•  Only one database cluster in use
•  DR SQL Cluster
•  DR Cluster being used for UAT testing (More Information Needed)
•  Transaction log was unable to grow
•  T: Drive at 0 Bytes free
•  Logs were not truncated
•  DBA on honeymoon vacation in Fiji
•  Logs are truncated manually
•  Company has only 1 DBA
•  “Backup” DBA was not aware the logs require truncation
•  Space allocations are fixed
•  Lack of Control

[Diagram: the causes are joined at seven -AND- gates]

Page 57: Brighttalk   what should we be monitoring - final

Add Evidence

[The RCA diagram from the previous slide, with evidence attached to its causes]

Statistical Data

Situational

Observation

Page 58: Brighttalk   what should we be monitoring - final

Failure Modes Analysis

SQL Server Not Available
•  Transaction log is unable to grow
•  T: Drive at 0 Bytes free
•  Logs were not truncated
•  DBA on honeymoon vacation in Fiji
•  Logs are truncated manually
•  Company has only 1 DBA
•  “Backup” DBA was not aware the logs require truncation (Condition Cause)
•  Space allocations are fixed (Condition Cause)
•  Lack of Control
•  SQL is unable to cache query results
•  Available RAM at 0 Bytes Free
•  C: Drive at 0 Bytes free
•  Minidump is configured to write to C: Drive
•  Server was ASRing frequently
•  Software distributions were leaving files in the TEMP folder
•  %TEMP% configured to C:\Temp
•  Kernel able to write to page file

[Diagram: the causes are joined at five -AND- gates and two -OR- gates]

Page 59: Brighttalk   what should we be monitoring - final

Picking Monitors

[The failure modes diagram from the previous slide, with monitoring points marked]

•  Monitor the intersections at the “ORs”
•  At least one point along each branch after the “OR”
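The two rules on this slide can be expressed as a walk over the failure tree. The following is only a sketch under stated assumptions: the `Node` class and `pick_monitors` function are invented for this example, and the grouping of boxes under each gate is a guess, since the transcript does not preserve the diagram's exact structure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: List["Node"] = field(default_factory=list)

def pick_monitors(node: Node) -> List[str]:
    """Collect each OR intersection plus the first point on every branch below it."""
    picks: List[str] = []
    if node.gate == "OR":
        picks.append(node.name)
        picks.extend(child.name for child in node.children)
    for child in node.children:
        for name in pick_monitors(child):
            if name not in picks:
                picks.append(name)
    return picks

# Hypothetical grouping of a few boxes from the slide; the real diagram may differ.
TREE = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: Drive at 0 Bytes free"),
        Node("Logs were not truncated"),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 Bytes Free"),
        Node("C: Drive at 0 Bytes free"),
    ]),
])

print(pick_monitors(TREE))
```

For this assumed tree the function returns the OR intersection itself and the head of each of its two branches, which is the minimum the slide asks for. A real deployment would likely add deeper points on each branch.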

Page 60: Brighttalk   what should we be monitoring - final

FMEA Matrix (Impact Calculation)

Severity
•  Negligible (1-2): no loss in functionality, mostly cosmetic
•  Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time
•  Critical (5-6): the problem will not resolve itself but a workaround exists allowing the problem to be bypassed
•  Serious (7-8): the problem will not resolve itself and no workaround is possible. Functionality is impaired or lost but the system is usable to some extent
•  Catastrophic (9-10): the system is completely unusable

Occurrence
•  Improbable (1-2): less than 1 time per year
•  Remote (3-4): 1 time per year
•  Occasional (5-6): 1 time per month
•  Probable (7-8): 1 time per day
•  Chronic (9-10): 1 or more times per day

Detection
•  Very high (1-2): during the design phase
•  High (3-4): during peer review or unit testing
•  Moderate (5-6): during system testing or acceptance testing
•  Remote (7-8): during or immediately after production deployment
•  Very Remote (9-10): only after heavy usage by users
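FMEA conventionally combines three 1-10 scores like these into a single Risk Priority Number by multiplying them. The slide itself does not state the formula, so the product used below is an assumption, as are the class name and every score shown; the failure-mode names are borrowed from the RCA slides purely for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1-10, from the first scale
    occurrence: int  # 1-10, from the second scale
    detection: int   # 1-10, from the third scale (higher = caught later)

    def rpn(self) -> int:
        """Risk Priority Number: the product of the three scores (assumed formula)."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError(f"score out of range: {score}")
        return self.severity * self.occurrence * self.detection

# Hypothetical scores, chosen only to show the ranking.
MODES = [
    FailureMode("Logs were not truncated", severity=6, occurrence=6, detection=4),
    FailureMode("T: Drive at 0 Bytes free", severity=9, occurrence=5, detection=7),
]

for mode in sorted(MODES, key=FailureMode.rpn, reverse=True):
    print(f"{mode.rpn():4d}  {mode.name}")
```

Sorting by the product puts the failure modes that are severe, frequent, and hard to catch early at the top of the list, which is where monitoring investment would usually go first.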

Page 61: Brighttalk   what should we be monitoring - final


FMEA Matrix (Evidence)

These are the events that help us to RULE IN a failure mode as a possible cause

These are the events that help us RULE OUT the failure mode as not relevant

Page 62: Brighttalk   what should we be monitoring - final

Determining Severity

Six Levels of Severity

Severity: Description
Critical: The component has completely failed
Major: The component is operating but is in a degraded or crippled state
Minor: The component is functioning normally but is at risk of a more serious failure
Informational: The component is functioning normally but is reporting a change in state
Unknown: The component has changed its operating state but the effect is not known
Clear: The component is operating normally or a higher severity event has been resolved

•  The event severity is determined with respect to the component generating the event
•  The event severity does not consider impact or urgency
•  The incident priority is not determined by event severity
•  The event severity helps drive an effective triage when multiple events arrive at approximately the same time
•  Only after the affected components and their relationships to each other have been determined can impact and urgency be determined

[Diagram layers: Logical Server (Virtual Machine 1, Virtual Machine 2); Physical Server (Server 1, Server 2); Logical Volumes (Volume Group 1, Volume Group 2); Physical Volumes (Hard Drive 1, Hard Drive 2, Hard Drive 3)]
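As a rough illustration of severity-driven triage, the sketch below sorts a batch of events that arrive together. The numeric ranking simply follows the order in which the slide lists the six levels, which is an assumption, and the `Event` class, `triage` function, and sample events are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    # Ranking assumed from the order the six levels appear on the slide.
    CLEAR = 0
    UNKNOWN = 1
    INFORMATIONAL = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5

@dataclass(frozen=True)
class Event:
    component: str
    severity: Severity
    summary: str

def triage(events):
    """Order events most severe first; ties keep their arrival order."""
    return sorted(events, key=lambda e: e.severity, reverse=True)

# Hypothetical events arriving at roughly the same time.
EVENTS = [
    Event("Hard Drive 2", Severity.MINOR, "predictive failure warning"),
    Event("Volume Group 1", Severity.CRITICAL, "volume group offline"),
    Event("Virtual Machine 1", Severity.MAJOR, "I/O latency degraded"),
    Event("Server 1", Severity.CLEAR, "fan alarm resolved"),
]

for event in triage(EVENTS):
    print(f"{event.severity.name:<13} {event.component}: {event.summary}")
```

The sort says nothing about incident priority. In keeping with the slide, impact and urgency would be decided separately, once the relationships between the affected components are known.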

Page 63: Brighttalk   what should we be monitoring - final

Monitoring Patterns

Layers of Pre-Defined Monitoring Patterns

•  The OS template is deployed when the server is provisioned

•  As a server is customized to fit its role, additional templates are deployed

•  Templates are stacked on top of each other until no gaps remain

•  This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution
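The stacking idea can be sketched as a merge of per-role templates in the order they are deployed. Everything named below (the template catalog, the metric names, the thresholds, and both functions) is hypothetical, and letting a later template override an earlier threshold is a design assumption rather than something the slide specifies.

```python
# Hypothetical template catalog: metric name -> alert threshold.
TEMPLATES = {
    "os-linux": {"cpu_busy_pct": 90, "disk_used_pct": 85},
    "web-server": {"http_5xx_per_min": 5},
    "database": {"disk_used_pct": 80, "log_used_pct": 75},
}

def stack_templates(names, catalog=TEMPLATES):
    """Merge templates in deployment order; later layers override earlier ones."""
    merged = {}
    for name in names:
        merged.update(catalog[name])
    return merged

def coverage_gaps(required, monitors):
    """Return the required metrics that no deployed template covers yet."""
    return sorted(set(required) - set(monitors))

# The OS template goes on at provisioning; the role template is added afterwards.
monitors = stack_templates(["os-linux", "database"])
print(monitors)
print(coverage_gaps(["cpu_busy_pct", "http_5xx_per_min"], monitors))
```

A non-empty result from `coverage_gaps` is the signal to stack another template or, where no standard template fits, to write a custom monitor for that gap.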

Page 64: Brighttalk   what should we be monitoring - final

Application-Technology Matrix

Maps services, applications and technologies, enabling:
•  Monitoring investment prioritization
•  Monitoring maturity
•  Which templates need to be deployed when new hardware is acquired
•  Whether a service has sufficient monitoring coverage based on its application components
•  This approach allows for anticipating changes to a customer’s monitoring needs

Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
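A matrix like this can be held in two small lookups: one that scores each technology and one that maps each service to the technologies it depends on. The sketch below is an assumption-laden illustration: the service names, the technology scores, and the `coverage` function are hypothetical, and treating a service as only as well covered as its weakest technology is one possible reading of "sufficient monitoring coverage".

```python
# Scores follow the slide: 0 = No Strategy, 1 = Limited Monitoring, 2 = Fully Integrated Strategy.
# Every service name and score below is hypothetical.
TECH_SCORES = {"Linux": 2, "WebSphere": 1, "DB2": 2, "MQ": 0}
SERVICES = {
    "Online Banking": ["Linux", "WebSphere", "DB2"],
    "Payments": ["Linux", "MQ", "DB2"],
}

def coverage(service, services=SERVICES, scores=TECH_SCORES):
    """Return (weakest score, technologies scoring below 2) for one service."""
    techs = services[service]
    weakest = min(scores[t] for t in techs)
    gaps = [t for t in techs if scores[t] < 2]
    return weakest, gaps

for name in SERVICES:
    weakest, gaps = coverage(name)
    print(f"{name}: weakest score {weakest}, gaps {gaps}")
```

Ranking services by their weakest score gives a simple starting point for the investment prioritization the slide mentions.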

Page 65: Brighttalk   what should we be monitoring - final

Event Lifecycle

Legend (Responsible Tool): Element Manager, Distributed Collectors, Object Server Triggers, Impact Policies, ITNM RCA Engine, Gateway Replication, Webtop Event List

Activity (Responsible Tool indicated by the legend):
•  Software-Operating System
•  Data Collection
•  Anomaly Detection
•  Event Generation
•  Integration
•  Event Processing
•  Enrichment
•  Event Suppression
•  Correlation
•  Root Cause Analysis
•  Business Impact Analysis
•  Automation
•  Notification & Escalation
•  Presentation
•  User Interaction Tools
•  Archiving
•  Reporting

Activity (Responsible Tool indicated by the legend):
•  Trigger Ticket Request
•  Create Ticket
•  Update Event with IM#
•  Trigger Courtesy Pages
•  Send Pages

Page 66: Brighttalk   what should we be monitoring - final

[Architecture diagram]

Event Processing:
•  Automated Action
•  Notification and Escalation
•  Business Impact Analysis
•  Root Cause Analysis
•  Correlation and Event Suppression
•  Enrichment

Other components:
•  Meta-Data Integration Bus
•  Common Event Format
•  Distributed Collectors
•  Element Managers
•  LOB Managed Monitoring System
•  Service Provider Monitoring System
•  Vendor Managed Monitoring System
•  Other Enterprise Data
•  Document Sharing
•  Service Desk
•  CMDB
•  Batch Scheduling
•  Knowledge Database
•  Online Run Book
•  PBX/Call Manager
•  Visualization Framework
•  Topology And Relationship Database
•  Automated Action Tools
•  Automated Provisioning System
•  Predictive Analysis
•  Automated Change Reconciliation
•  Security Management
•  Archive and Report
•  Business Telemetry Data
•  Service Center and Enterprise Notification Tool

Page 67: Brighttalk   what should we be monitoring - final

Iterative Development

As you recognize opportunities to capture knowledge, use it to improve your Event Management System.

Page 68: Brighttalk   what should we be monitoring - final

How do we keep it evolving?

Page 69: Brighttalk   what should we be monitoring - final


Sometimes We Miss What’s Going On

Say… what’s a mountain goat doing all the way up here in a cloud bank?

Page 70: Brighttalk   what should we be monitoring - final

The Path to Situational Awareness

Collection → Aggregation → Analytics → Presentation → Situational Awareness

Each phase builds on the previous, helping to establish situational awareness:
•  Data is collected from our IT systems
•  These data are aggregated into a central location
•  Correlations transform the data into information and predictive analytics process them further into knowledge
•  The processed and enriched knowledge is presented to users in a way that helps them make good decisions
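The four phases can be pictured as small functions chained together, each consuming the output of the one before. This is only a toy sketch: the host names, the response-time samples, the 5-second threshold, and all function names are assumptions made for illustration, and a simple mean stands in for the correlation and predictive analytics a real system would apply.

```python
from collections import defaultdict
from statistics import mean

# Collection: hypothetical response-time samples (seconds) as collectors might report them.
COLLECTED = [
    ("web01", 0.8), ("web01", 1.1), ("web02", 4.9), ("web02", 5.6),
]

def aggregate(samples):
    """Aggregation: gather every sample into one central structure keyed by host."""
    central = defaultdict(list)
    for host, seconds in samples:
        central[host].append(seconds)
    return dict(central)

def analyze(central, threshold=5.0):
    """Analytics: turn raw data into information (a mean and a slow/ok verdict)."""
    return {
        host: {"mean": mean(values), "slow": mean(values) > threshold}
        for host, values in central.items()
    }

def present(info):
    """Presentation: one line per host, worst first, ready for a person to act on."""
    ranked = sorted(info.items(), key=lambda item: item[1]["mean"], reverse=True)
    return [f"{host}: {d['mean']:.2f}s {'SLOW' if d['slow'] else 'ok'}" for host, d in ranked]

for line in present(analyze(aggregate(COLLECTED))):
    print(line)
```

Keeping the phases separate means any one of them can be replaced, for example swapping the mean for a percentile, without touching the others.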

Page 71: Brighttalk   what should we be monitoring - final

Cleaning Up the Landscape

[Diagram labels: Silo, Monolithic Framework, Niche, Launch Pad, Information Bus]

Adapted from: Akella, Janaki. “IT Architecture: Cutting costs and complexity.” McKinsey Quarterly 13 Nov 2009 https://www.mckinseyquarterly.com/IT_architecture_Cutting_costs_and_complexity_2391

Page 72: Brighttalk   what should we be monitoring - final

Directed Workflows

[Diagram axis labels: Directed, Non Directed]
•  Launchpad
•  Executive Dashboard
•  Business Area Dashboards
•  Application PAC Dashboards
•  Command Center Dashboards
•  Technology Owner Dashboard
•  Application Owner Dashboard
•  Problem Isolation Workspace
•  Problem Diagnostics Workspace
•  System Detail View
•  Component Detail View

Page 73: Brighttalk   what should we be monitoring - final


Here comes the elevator pitch…

Page 74: Brighttalk   what should we be monitoring - final

The IBM Solution

IBM SmartCloud APM Suite offers essential management capabilities for applications in complex cloud and hybrid environments.

SINGLE POINT OF MANAGEMENT
•  At-a-glance status determination via network topology graphs
•  Proactively identify and respond to compliance issues
•  Monitor the performance of the environment and the tenants living inside of it
•  Understand the current capacity needs and forecast future needs
•  Understand the costs associated with providing the service and enable “showback” and “chargeback” reporting to the application owners

MAXIMIZE SERVICE AVAILABILITY
•  Minimize service and system outages
•  Identify recurring incidents and implement action to remediate problems before they cause impacts
•  Assist troubleshooting by suppressing “noise” events and providing root cause determination

IMPROVED OPERATIONAL EFFICIENCY
•  Reduce the need for manual action or intervention
•  Automate for repeatability and elimination of human error
•  Develop standardized practices for complex business processes
•  Enable the development of APIs to allow for self-service management by the consumers

Page 75: Brighttalk   what should we be monitoring - final

Understand the end-user experience

Follow changing workloads

See steps across the cloud

•  Mobile devices & smart endpoints
•  Private, public & hybrid clouds
•  Highly virtualized applications, storage & networks

Discovery: Visibility into application resources
End User Experience: Transaction performance monitoring to ensure SLA compliance
Transaction Tracking: Rapid problem isolation through transaction path analysis
Diagnostics: Domain-specific operations tools for diagnosis and repair
Predictive Analytics: Proactive approach to reduce outages & improve performance

shared data & common services

VISIBILITY, CONTROL AND AUTOMATION TO INTELLIGENTLY MANAGE CRITICAL APPLICATIONS IN CLOUD AND HYBRID ENVIRONMENTS.

Page 76: Brighttalk   what should we be monitoring - final

IBM Tivoli Monitoring Solution

Tivoli Enterprise Portal
•  Monitor the complete Application and Application Infrastructure
•  Measure, Baseline and Analyze the Service and Transactions

[Diagram components: ITCAM for Applications, ITM for Microsoft Applications, ITM, ITCAM for Transactions, ITCAM for SOA Platform, OMEGAMON XE, Tivoli Enterprise Portal, Tivoli Automation, Tivoli Data Warehouse, Tivoli Common Reporting]

Page 77: Brighttalk   what should we be monitoring - final

Business Value of Adopting APM

Predict: Predictive Outage Avoidance
Ensure availability of applications and services
•  Use learning tools to augment custom best practices
•  Leverage statistical methods to maximize predictive warning
•  Improve problem detection across IT silos

Resolve: Faster Problem Resolution
Find & correct problems faster with tools that determine actions required to resolve issues
•  Identify problems quicker with insight to large unstructured repositories
•  Isolate problems quicker by bringing relevant unstructured data into problem investigations
•  Repair problems quicker with the right details quickly to hand

Perform: Optimized Performance
Track, Optimize, and Predict capacity and performance needs over time
•  Track capacity and performance of applications and services in classic and cloud environments
•  Optimize resource deployment with what-if and best fit planning tools
•  Escalate capacity and performance problems before they cause critical failures

Know: Improved Insight
Enhance visibility into systems resource relationships while increasing customer satisfaction
•  Determine what resources are interdependent to assess impact of failures
•  Gain insight into what is important to your customer
•  Decrease customer churn and acquisition costs while increasing customer retention and satisfaction

Automated Analytics helps lower IT Administration Costs:
•  Performance and Capacity planning tools monitor appropriately and escalate, reducing time consuming report browsing
•  Learning tools reduce customization and best practices investment on initial deployment
•  Log Analysis helps speed problem resolution to be able to do more with less

Page 78: Brighttalk   what should we be monitoring - final

Let’s keep the conversation going…

[email protected]

ReverendDrew

SystemsManagementZen.Wordpress.com

systemsmanagementzen.wordpress.com/feed/

@SystemsMgmtZen

ReverendDrew

[email protected]

614-306-3434

Page 79: Brighttalk   what should we be monitoring - final