Performance Forensics: Uncovering the Mysteries of Performance and Scalability Incidents through Forensic Engineering
Performance Forensics
Uncovering the Mysteries of Performance and Scalability Incidents through Forensic Engineering

Stephen Feldman, Senior Director, Performance Engineering and Architecture
Session Goals

The goals of today's session are:
• Introduce you to the practice of performance forensics.
• Present a methodology for performing forensics.
• Emphasize the importance of session-level analysis.
• Discuss techniques for arriving at root cause analysis.
Session Learning Objectives
At the end of the session you should be able to:
• Write your own problem statements.
• Perform the process of evidence collection and interviewing.
• Apply techniques for using data and analysis to avoid diagnosis bias and value attribution.
• Perform root cause analysis as part of the performance forensics process.
A Practical Definition
• The term forensics means "the science and practice of collection, analysis, and presentation of information relating to a crime in a manner suitable for use in a court of law."
  – This definition is in the context of a crime.
  – It concerns an individual event.
• Forensic engineering is the application of accepted engineering practices and principles for discussion, debate, argument, or legal purposes.
Definition of Performance Forensics
• The practice of collecting evidence, performing interviews, and modeling for the purpose of root cause analysis of a performance or scalability problem.
  – In the context of a performance (response time) problem.
  – Discussing an individual event (a session experience).
• Performance problems can be classified in two main categories:
  – Response Time Latency
  – Queuing Latency
A Case for a Performance Maturity Model
• Level 1: Reactive and Exploratory (emphasis on hardware)
• Level 2: Monitor and Instrument (emphasis on the application)
• Level 3: Performance Optimizing (emphasis on the eco-system)
• Level 4: Business Optimizing (emphasis on process)
• Level 5: Process Optimized (emphasis on people)
Cognition of Response Times
Queuing Model: Visual of a Bottleneck
Performance Forensics Methodology

• Identify the Problem
• Interviewing
• Collecting Evidence
• Data Analysis
• Modeling and Visualizing
• Method-R
• Sampling and Simulating
• Root Cause
Identify the Most Important Operations that Affect Your Business
Turn the Problem Statement into a Diagnosis to Get to Root Cause

• Develop a Problem Statement
• Formulate a Hypothesis
• Establish a Diagnosis
• Perform Session Inspection
Putting Performance Forensics in Context
• Emphasis on the user and the user's actions and experiences.
  – How can this be measured?
• Capture the response time experience and the response time expectations of the user.
  – Put user actions into perspective, in line with the goals of Method-R (what is most important to the business).
• Quantify the response time breakdown: RT = ST + Latency (Waits).
• Identify the contributors to response latency.
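As a sketch, the RT = ST + Latency breakdown above can be computed from per-component timings. The wait categories and every number here are hypothetical, purely to illustrate the decomposition:

```python
# Decompose a measured response time (RT) into service time (ST) plus
# individual wait (latency) contributors. All timings are hypothetical.

def breakdown(service_time, waits):
    """Return total response time and each contributor's share of it."""
    response_time = service_time + sum(waits.values())
    shares = {"service_time": service_time / response_time}
    for name, t in waits.items():
        shares[name] = t / response_time
    return response_time, shares

# Example: 0.4 s of CPU work plus three hypothetical sources of wait latency.
rt, shares = breakdown(0.4, {"db_wait": 1.2, "network": 0.3, "gc_pause": 0.1})
print(round(rt, 2))                   # 2.0 seconds total response time
print(round(shares["db_wait"], 2))    # 0.6 -> the database wait dominates
```

Once each contributor's share is known, the investigation can target the largest source of latency first.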
Developing a Problem Statement
Identifying the Problem
• Problems are not always easily identifiable.
• When the problem is apparent, declare a simple problem statement so that the investigation can commence.
  – Call out symptoms; do not diagnose yet.
• When the problem is not clear, the appropriate course of action is to narrow down the possibilities of what it could be.
• Be willing to leave the problem statement open-ended until a better-formulated statement can be attained.
Steps for Developing a Problem Statement
• Identify Who is doing What in a particular area of the application (Where) at a particular time (When).
  – Avoid Why and How; we are not ready to identify those yet.
• Establish whether the problem is reproducible at all times, conditional, or unexplainable.
• The problem must be associable with a user operation (use case).
• The problem must be quantifiable.
  – Saying "it's slow" is not acceptable. How slow?
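The Who/What/Where/When discipline can be sketched as a small checked structure. The field names, the validation rules, and the example values are all hypothetical illustrations, not part of the original methodology:

```python
# Sketch: a problem statement that refuses to exist without the
# Who/What/Where/When fields and a quantified measurement.
from dataclasses import dataclass

@dataclass
class ProblemStatement:
    who: str                  # affected user or role
    what: str                 # the user operation (use case)
    where: str                # area of the application
    when: str                 # time window of the incident
    measured_seconds: float   # "slow" must be quantified

    def validate(self):
        if not all([self.who, self.what, self.where, self.when]):
            raise ValueError("who/what/where/when must all be stated")
        if self.measured_seconds <= 0:
            raise ValueError("'it's slow' is not acceptable -- how slow?")

ps = ProblemStatement("Instructor Sally", "editing grades",
                      "Grade Center", "09:00-09:30", 18.0)
ps.validate()  # passes: the statement is specific and quantified
```

A statement that fails validation here is exactly the "Blackboard is slow" conversation from the later slide: symptoms with no Who/Where/When and no number.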
Formulate a Hypothesis
Interviewing
• Techniques:
  – The Lassie Question
  – Time Association
  – Component/Feature Specific
  – Operation Specific
  – User Experienced: Data Oriented
  – Geo-location and Connectivity
• Avoid diagnosis bias and value attribution.
• Can a pattern be identified?
Diagnosis Bias
• It is human nature to label people, ideas or things based on our initial opinions of them.
• Not necessarily scientific, but rather a combination of gut feelings, irrational judgment or failure to process enough conclusive data.
• We often diagnose before we can get to root cause analysis based on a hunch or perception.
Value Attribution
• Humans have a tendency to imbue someone or something with certain qualities based on its perceived value rather than objective data.
• Example 1: The problem can’t be my SAN, I spent $250,000 on it.
• Example 2: It can’t be the network, my engineers are the best in the field. They won’t allow a network problem to happen.
Does this Sound Familiar?
• Typical conversation:
  – Sally: "Blackboard is slow."
  – Bob: "How slow?"
  – Sally: "Real slow!"
  – Bob: "Where is it slow?"
  – Sally: "The server."
  – Bob: "I mean where in the application?"
  – Sally: "All over…everywhere…does it really matter? Can't you just reboot the server to make it faster?"
  – Bob: "It's more complicated than that. Can you be more specific about where it's slow, and possibly tell me when it's slow?"
  – Sally: "Bob, can't you just check for yourself? You will see what I am talking about…I'm not as technical as you."
  – Bob: "Sally, I am logged in right now and don't see a problem."
Could the Conversation Go Better?
• Revised conversation:
  – Sally: "Blackboard is slow."
  – Bob: "Tell me a little bit about what you were doing when you experienced this slowness. Try to be as descriptive as possible."
  – Sally: "I was editing grades in one of my course sections."
  – Bob: "Can you show me right now? Walk me through the steps leading up to your problem."
  – Sally: "Sure. Log in as…select the third course in my course list…"
  – Bob: "Do you see this in all of your courses or just this one? Can we test those courses as well?"
  – Sally: "This is strange, Bob…it's not happening now. You must be my good luck charm!"
  – Bob: "Do you happen to know around what time the issue happened? Do you know if it happened to anyone else? Were other actions in the system slow? I know it's tough, but could you quantify in seconds how long certain actions took to complete?"
Evidence
• Multiple types of gathered evidence are used to solve performance problems:
  – Log artifacts
  – Monitoring/measurement tools
  – Instrumentation/sensors
• Interactive evidence gathering through interviews.
• Evidentiary support through discrete simulation.
• Improve future evidentiary capabilities by maturing along the Performance Maturity Model.
Log Artifacts
• Understand what logs are in place and where they can be found.
• Know what they are used for and whether they provide the right information.
• Keep them slim and usable.
• Learn how to associate and correlate:
  – Associate multiple log artifacts.
  – Correlate events to the problem statement.
Example Log Visualization
Putting Collectors/Sensors in Place
• When should this happen?
  – When a problem statement cannot be developed from the data you do have (evidence or interviews) and more data needs to be collected.
• How should you go about this?
  – Minimize disruption to the production environment.
  – Adaptive collection: less intensive to more intensive over time.
    Basic Sampling → Continuous Collection → Profiling
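The adaptive, less-intensive-to-more-intensive progression can be sketched as a simple escalation policy. The three mode names follow the slide; the escalation rule itself is an assumption added for illustration:

```python
# Sketch of adaptive collection: start with cheap sampling and escalate
# to heavier instrumentation only while the problem remains unexplained.
MODES = ["basic_sampling", "continuous_collection", "profiling"]

def next_mode(current, problem_still_unexplained):
    """Escalate one level when the evidence gathered so far is inconclusive."""
    i = MODES.index(current)
    if problem_still_unexplained and i + 1 < len(MODES):
        return MODES[i + 1]
    return current

mode = "basic_sampling"
mode = next_mode(mode, True)   # escalate to "continuous_collection"
mode = next_mode(mode, True)   # escalate to "profiling"
```

Escalating only on inconclusive evidence is what keeps disruption to the production environment at a minimum.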
Monitoring and Measurement
• Third-party components, whether commercial or open source, deployed to measure responsiveness and resource utilization.
• Excellent tools for trending and correlation.
• Tools specialize in solving different types of problems.
• Used in forensics to correlate resource utilization with event occurrences:
  – System data capturing tools
  – Component data capturing tools
  – User data capturing tools
Establish a Diagnosis
Hypothesis versus Diagnosis
• Hypothesis: a prediction or educated guess about a problem, made prior to proving it scientifically or mathematically.
• Diagnosis: a scientific, empirical, or measured conclusion about a problem (requires data analysis).
  – Not necessarily the correct answer, but enough data has been gathered to propose a diagnosis.
  – Requires some form of observation or testing.
• A problem statement needs to be in place for either to exist.
• Both need supporting data.
What is Correlation?
• Correlation is a measure of the statistical relationship between two comparable data series.
  – Time associations are typically made.
  – Correlate to a wait event or occurrence.
  – Correlate to resource demand.
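A minimal sketch of such a correlation, using a hand-rolled Pearson coefficient over hypothetical per-interval data (the metric choice and all values are invented for illustration):

```python
# Sketch: Pearson correlation between per-interval response times and a
# resource-demand metric (here, a hypothetical disk queue depth).
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

response_times = [1.1, 1.3, 4.8, 5.2, 1.2, 6.0]  # seconds per interval
disk_queue     = [2, 3, 14, 16, 2, 19]           # average queue depth

r = pearson(response_times, disk_queue)
print(round(r, 2))  # near 1.0: response time tracks disk queueing
```

A coefficient near 1.0 supports (but does not prove) a hypothesis that the resource is a contributor; causation still has to be established through testing.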
Everyday Correlation
• To be at Level 4 or 5 in the PMM, the data has to be accessible on demand.
  – The level of effort to assemble the data for correlation is substantial.
  – Problems could disappear by the time the data is available.
  – New problems might surface.
• Response time metrics must be captured and presented.
  – Recommended: response time histograms of the top ten transactions.
    • Accepting: 0 to 2 seconds (instantaneous and immediate)
    • Tolerating: 2 to 10 seconds (continuous and captive)
    • Frustrated: greater than 10 seconds (likely to abandon)
  – Collect a frequency chart to correlate with response times.
• Correlate to metrics that are time-oriented and can represent latency.
• It is acceptable to present system metrics, but understand that this is hypothesizing.
• There has to be a better way: session-level data.
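The three response-time zones can be sketched as a bucketing function over measured samples. The thresholds follow the slide; the sample data is hypothetical:

```python
# Sketch: bucket measured response times into the Accepting / Tolerating /
# Frustrated zones and build the recommended frequency histogram.
def zone(seconds):
    """Classify one response time sample into its user-experience zone."""
    if seconds <= 2:
        return "Accepting"
    elif seconds <= 10:
        return "Tolerating"
    return "Frustrated"

samples = [0.8, 1.9, 3.5, 7.2, 12.4, 0.5, 11.0]  # hypothetical, in seconds
histogram = {}
for s in samples:
    histogram[zone(s)] = histogram.get(zone(s), 0) + 1
print(histogram)  # {'Accepting': 3, 'Tolerating': 2, 'Frustrated': 2}
```

Tracking this histogram per transaction, for the top ten transactions, is what turns "it's slow" into a correlatable metric.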
Measuring a User’s Session
First a little story…
Why Measuring the Session is Important
• Remember that we are trying to understand why User X had a response time problem.
• How can system data tell us this? It can't…at best it yields a guess or hypothesis.
• Session data explains step by step where the time was spent.
  – Once you understand where the time went, you are that much closer to addressing each element of latency.
Techniques for Measuring a User’s Session
• Measurement is end-to-end.
  – Start with the HTTP client experience (can be obtained from advanced log analysis, HTTP profilers, or user experience monitoring tools).
    • Will show the client rendering times.
    • Can even show network transport.
  – The database is designed to present session and wait events with little impact on overall performance.
    • Oracle: ASH and the 10046 trace event
    • SQL Server: session-level DMVs (sys.dm_os_wait_stats and sys.dm_exec_sessions)
  – The application container is somewhat limited right now.
    • Can use aggregators (jstat, thread dumps, and -Xloggc).
    • Profilers that can get into the JSR-138 specification.
    • Need lighter-weight session-level wait events (next generation of Java).
Quick Comments About Method-R
• Method-R is a preferred methodology for problem statement development and problem diagnosis.
• While it was created for Oracle performance analysis, it can be applied to all aspects of software performance forensics.
• It identifies the user actions most important to the needs of the business, in order to improve their performance.
Getting to Root Cause Analysis
• Devise a strong problem statement.
  – The foundational steps of Method-R.
• Know where to collect evidence.
• Formulate a data-driven hypothesis.
• Make appropriate use of correlation, modeling, and visualization.
• Prove the hypothesis out (a test-driven approach).
• Establish a diagnosis.
  – Avoid diagnosis bias and value attribution.
• Inspect the session.
  – Measuring at the system level will only get you so far.
Want More?
• Come to my second presentation on Forensics Tools on Thursday 11:30 AM - 12:20 PM in San Polo 3405
• To view my resources and references for this presentation, visit www.scholar.com
• Simply click “Advanced Search” and search by [email protected] and tag: ‘bbworld08’ or ‘forensics’