ROC Retreat 1/14/2003

Connecting the Dots: Using Runtime Paths for Macro Analysis

Mike Chen
mikechen@cs.berkeley.edu
http://pinpoint.stanford.edu

Page 2: Motivation

Divide and conquer, layering, and replication are fundamental design principles
  – e.g. Internet systems, P2P systems, and sensor networks
Execution context is dispersed throughout the system => difficult to monitor and debug
Lots of existing low-level tools help with debugging individual components, but not a collection of them
  – Much of the system is in how the components are put together
Observation: a widening gap between the systems we are building and the tools we have

Page 3: Current Approach

[Figure: three example systems: a peer-to-peer network (Peer nodes), a sensor network (Sensor nodes), and a tiered Internet service (Apache, JavaBean, Database)]

Page 4: Current Approach

Micro analysis tools, like code-level debuggers (e.g. gdb) and application logs, offer details of each individual component
Scenario
  – A user reports that request A1 failed
  – You try the same request, A2, but it works fine
  – What to do next?

[Figure: requests A1 and A2 flowing through Apache, JavaBean, and Database components; gdb shows only local variable values (e.g. X = 1, Y = 2) inside individual components]

Page 5: Macro Analysis

Macro analysis exploits non-local context to improve reliability and performance
  – Performance examples: Scout, ILP, Magpie
Statistical view is essential for large, complex systems
Analogy: micro analysis lets you understand the details of an individual honeybee; macro analysis is needed to understand how the bees interact to keep the beehive functioning

Page 6: Observation

Systems have a single system-wide execution path associated with each request
  – E.g. request/response, one-way messages
Scout, SEDA, Ninja use paths to specify how to service requests
Our philosophy
  – Use only dynamic, observed behavior
  – Application-independent techniques

Page 7: Our Approach

Use runtime paths to connect the dots!
  – dynamically capture the interactions and dependencies between components
  – look across many requests to get the overall system behavior
    • more robust to noise
Components are only partially known (“gray boxes”)

[Figure: a single path, then many paths, overlaid on the Apache, JavaBean, and Database components]

Page 8: Our Approach

Applicable to a wide range of systems.

[Figure: paths overlaid on four kinds of systems: a tiered Internet service (Apache, JavaBean, Database), a sensor network, a peer-to-peer network, and a generic component graph (A to G)]

Page 9: Open Challenges in Systems Today

1. Deducing system structure
  – manual approach is error-prone
  – static analysis doesn’t consider resources
2. Detecting application-level failures
  – often don’t exhibit lower-level symptoms
3. Diagnosing failures
  – failures may manifest far from the actual faults
  – multi-component faults
Goal: reduce time to detection, recovery, diagnosis, and repair

Page 10: Talk Outline

Motivation
Model and architecture
Applying macro analysis
Future directions

Page 11: Runtime Paths

Instrument code to dynamically trace requests through a system at the component level
  – record call path + the runtime properties
  – e.g. components, latency, success/failure, and resources used to service each request
Use statistical analysis to detect and diagnose problems
  – e.g. data mining, machine learning, etc.
Runtime analysis tells you how the system is actually being used, not how it may be used
Complements existing micro analysis tools

Page 12: Architecture

Reusable Analysis Framework

Tracer
  – Tags each request with a unique ID, and carries it with the request throughout the system
  – Reports observations (component name + resource + performance properties) for each component
Aggregator + Repository
  – Reconstructs paths and stores them
Declarative Query Engine
  – Supports statistical queries on paths
  – Data mining and machine learning routines
Visualization

[Figure: a request flows through components A to F, each with a Tracer; the tracers report to the Aggregator, which feeds the Path Repository, Query Engine, and Visualization used by developers/operators]
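The aggregator step (group tracer reports by request ID, then order them into a path) can be sketched as follows. This is a minimal illustration, not the framework's actual code: the `Observation` fields and the `reconstruct_paths` helper are assumed names, and real paths may be trees rather than the linear sequences used here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One tracer report; field names are illustrative assumptions."""
    request_id: int
    seq_num: int
    component: str
    latency_ms: float
    success: bool


def reconstruct_paths(observations):
    """Group observations by request ID and order each group by sequence number."""
    by_request = defaultdict(list)
    for obs in observations:
        by_request[obs.request_id].append(obs)
    return {
        request_id: sorted(group, key=lambda o: o.seq_num)
        for request_id, group in by_request.items()
    }


# Reports from different tracers may arrive interleaved and out of order.
reports = [
    Observation(1, 2, "C", 4.0, True),
    Observation(2, 1, "A", 3.0, True),
    Observation(1, 1, "A", 10.0, True),
    Observation(2, 2, "B", 7.0, False),
]
paths = reconstruct_paths(reports)
path_1 = [o.component for o in paths[1]]
path_2 = [o.component for o in paths[2]]
```

Keeping the per-hop properties (latency, success) on each observation is what lets later queries filter or group paths without going back to the components.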

Page 13: Request Tracing

Challenge: maintaining an ID with each request throughout the system
Tracing is platform-specific but can be application-generic and reusable across applications
2 classes of techniques
  – Intra-thread tracing
    • Use per-thread context to store request ID (e.g. ThreadLocal in Java)
    • ID is preserved if the same thread is used to service the request
  – Inter-thread tracing
    • For extensible protocols like HTTP, inject new headers that will be preserved (e.g. REQ_ID: xx)
    • Modify RPC to pass request ID under the covers
    • Piggyback onto messages
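Both classes of technique can be sketched in a few lines. The example below is an illustration in Python rather than the Java the talk refers to: `threading.local` stands in for Java's ThreadLocal, the `REQ_ID` header name comes from the example above, and the helper names (`begin_request`, `outgoing_headers`) are hypothetical.

```python
import threading
import uuid

_context = threading.local()  # per-thread storage, analogous to Java's ThreadLocal
HEADER = "REQ_ID"             # header name taken from the talk's example


def begin_request(headers):
    """Adopt the ID carried in the incoming headers, or mint a new one."""
    _context.request_id = headers.get(HEADER) or uuid.uuid4().hex
    return _context.request_id


def current_request_id():
    """Return the ID stored for the calling thread, if any."""
    return getattr(_context, "request_id", None)


def outgoing_headers(headers=None):
    """Copy the headers and inject the current request ID for the next hop."""
    out = dict(headers or {})
    rid = current_request_id()
    if rid is not None:
        out[HEADER] = rid
    return out


# Front end: no ID yet, so one is created and kept in thread-local storage.
rid = begin_request({})
downstream = outgoing_headers({"Accept": "text/html"})

# Back end (simulated by another thread): the ID arrives in the header.
seen = []
worker = threading.Thread(target=lambda: seen.append(begin_request(downstream)))
worker.start()
worker.join()
```

The thread-local value survives only while one thread services the request; the injected header is what carries the ID across the thread or process boundary.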

Page 14: Talk Outline

Motivation
Model and architecture
Applying macro analysis
  – Inferring system structure
  – Detecting application-level failures
  – Diagnosing failures
Future directions

Page 15: Inferring System Structure

Key idea: paths directly capture application structure

[Figure: application structure derived from the paths of 2 requests]
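One way to read structure off the observed paths is to count caller/callee pairs. The sketch below assumes each path is a linear sequence of component names; the component names and the `infer_structure` helper are hypothetical.

```python
from collections import Counter


def infer_structure(paths):
    """Count how often each (caller, callee) edge appears across observed paths."""
    edges = Counter()
    for path in paths:
        edges.update(zip(path, path[1:]))
    return edges


observed_paths = [
    ["Apache", "JavaBean1", "Database"],
    ["Apache", "JavaBean2", "Database"],
    ["Apache", "JavaBean1", "Database"],
]
edges = infer_structure(observed_paths)
```

The edge counts double as usage weights, so rarely exercised dependencies stand out from the common ones.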

Page 16: Indirect Coupling of Requests

Key idea: paths associate requests with internal state
Trace requests from web server to database
  – Parse client-side SQL queries to get sharing of db tables
  – Straightforward to extend to more fine-grained state (e.g. rows)

[Figure: graph linking request types to the database tables they use]
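A rough sketch of the table-sharing analysis follows. It assumes each traced path yields a (request type, SQL statement) pair and uses a deliberately naive regular expression instead of a real SQL parser; the request types, table names, and helper names are invented for illustration.

```python
import re
from collections import defaultdict

# Naive pattern: the identifier that follows FROM, JOIN, INTO, or UPDATE.
TABLE_RE = re.compile(
    r"\b(?:from|join|into|update)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE
)


def tables_in(sql):
    """Return the lower-cased table names mentioned in one SQL statement."""
    return {name.lower() for name in TABLE_RE.findall(sql)}


def shared_tables(traced_queries):
    """Map each table touched by more than one request type to those types."""
    users = defaultdict(set)
    for request_type, sql in traced_queries:
        for table in tables_in(sql):
            users[table].add(request_type)
    return {table: types for table, types in users.items() if len(types) > 1}


traced = [
    ("view_cart", "SELECT * FROM cart WHERE user_id = 7"),
    ("checkout", "UPDATE inventory SET qty = qty - 1 WHERE item = 3"),
    ("checkout", "DELETE FROM cart WHERE user_id = 7"),
    ("browse", "SELECT name FROM product JOIN inventory ON product.id = inventory.item"),
]
coupled = shared_tables(traced)
```

Request types that share a table are indirectly coupled: a fault in one can surface in the other through the shared state.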

Page 17: Failure Detection and Diagnosis

Detecting application-level failures
  – Key idea: paths change under failures => detect failures via path changes.
Diagnosing failures
  – Key idea: bad paths touch root cause(s). Find common features.

Page 18: Future Directions

Key idea: violations of macro invariants are signs of buggy implementation or intrusion
Message paths in P2P and sensor networks
  – a general mechanism to provide visibility into the collective behavior of multiple nodes
  – micro or static approaches by themselves don’t work well in dynamic, distributed settings
  – e.g. algorithms have upper bounds on the # of hops
    • Although hop count violation can be detected locally, paths help identify nodes that route messages incorrectly
  – e.g. detecting nodes that are slow or that corrupt msgs

Page 19: Conclusion

Macro analysis fills the need when monitoring and debugging systems where local context is of insufficient use
Runtime path-based approach dynamically traces request paths and statistically infers macro properties
A shared analysis framework that is reusable across many systems
  – Simplifies the construction of effective tools for other systems and the integration with recovery techniques like RR
http://pinpoint.stanford.edu
  – Paper includes a commercial example from Tellme! (thanks to Anthony Accardi and Mark Verber)

Page 20: Backup Slides

Page 22: Current Approach

Micro analysis tools, like code-level debuggers (e.g. gdb) and application logs, offer details of each individual component

[Figure: Apache, JavaBean, and Database components, with gdb showing per-component variable values (e.g. X = 1, Y = 2) for each JavaBean]

Page 23: Related Work

Commercial request tracing systems
  – Announced in 2002, a few months after Pinpoint was developed
  – PerformaSure and AppAssure focus on performance problems.
  – IntegriTea captures and plays back failure conditions.
  – Focus on individual requests rather than overall behavior, and on recreating the failure condition.
Extensive work in event/alarm correlation, mostly in the context of network management (i.e. IP)
  – Don’t directly capture relationship between events
  – Rely on human knowledge or use machine learning to suppress alarms.
Distributed debuggers
  – PDT, P2D2, TotalView, PRISM, pdbx
  – Aggregate views from multiple components, but do not capture relationship and interaction between components
  – Comparative debuggers: Wizard, GUARD
Dependency models
  – Most are statically generated and are likely to be inconsistent.
  – Brown et al. take an active, black box approach but it is invasive. Candea et al. dynamically trace failure propagation.

Page 24: 1. Detecting Failures using Anomaly Detection

Key idea: paths change under failures => detect failures via path changes
Anomalies
  – Unusual paths
  – Changes in distribution
  – Changes in latency/response time
Examples:
  – Error paths are shorter.
  – User behavior changes under failures
    • Users retry a few times, then give up
Implement as long-running queries (i.e. diff)
Challenges:
  – detecting application-level failures
  – comparing sets of paths
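A "diff" between two sets of paths could be sketched like this. The total variation distance used here is one possible way to measure a change in distribution, chosen for illustration and not taken from the talk; paths are assumed to be linear sequences of component names, and the function names are hypothetical.

```python
from collections import Counter


def shape_distribution(paths):
    """Relative frequency of each path shape (the tuple of components visited)."""
    counts = Counter(tuple(p) for p in paths)
    total = sum(counts.values())
    return {shape: n / total for shape, n in counts.items()}


def distribution_shift(baseline, current):
    """Total variation distance between two path-shape distributions (0 to 1)."""
    p, q = shape_distribution(baseline), shape_distribution(current)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in set(p) | set(q))


def unseen_shapes(baseline, current):
    """Path shapes that occur now but never occurred in the baseline window."""
    return {tuple(p) for p in current} - {tuple(p) for p in baseline}


baseline = [["A", "B", "D"]] * 8 + [["A", "C", "D"]] * 2
# Under failure, half of the requests end early on a shorter error path.
current = [["A", "B", "D"]] * 4 + [["A", "C", "D"]] * 1 + [["A", "B"]] * 5
shift = distribution_shift(baseline, current)
new_shapes = unseen_shapes(baseline, current)
```

Run periodically over a sliding window, a shift above some threshold or a burst of unseen shapes would flag a candidate failure for diagnosis.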

Page 25: 2. Root-cause Analysis

Key idea: all bad paths touch root cause, find common features
Challenge: a small set of known bad paths and a large set of maybes
Ideally want to correlate and rank all combinations of feature sets
  – E.g. association rules mining
  – May get false alarms because the root cause may not be one of the features
Automatic generation of dynamic functional and state dependency graphs
  – Helps developers and operators understand inter-component dependency and inter-request dependency
  – Input to recovery algorithms that use dependency graphs
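As a much simpler stand-in for association rules mining, the sketch below ranks single components by how often the paths that use them fail. It considers only one feature at a time, so it would miss multi-component faults; the data and the `rank_components` helper are hypothetical.

```python
def rank_components(paths):
    """Rank components by the fraction of paths using them that failed.

    paths: iterable of (components, failed) pairs.
    """
    used, failed_with = {}, {}
    for components, failed in paths:
        for c in set(components):
            used[c] = used.get(c, 0) + 1
            if failed:
                failed_with[c] = failed_with.get(c, 0) + 1
    scores = {c: failed_with.get(c, 0) / used[c] for c in used}
    # Highest failure rate first; ties broken by name for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


observed = [
    (["A", "B", "D"], False),
    (["A", "C", "D"], True),
    (["A", "C", "E"], True),
    (["A", "B", "E"], False),
]
ranking = rank_components(observed)
```

Component A appears in every bad path but also in every good one, so it ranks below C, which appears only in bad paths.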

Page 26: 3. Verifying Macro Invariants

Key idea: violations of high-level invariants are signs of intrusion or bugs
Example: Peer auditing
  – Problem: a small number of faulty or malicious nodes can bring down the system
  – Corruption should be statistically visible in a node’s behavior
    • look for nodes that delay or corrupt messages or route messages incorrectly
  – Apply root-cause analysis to locate the misbehaving peers
    • Some distributed auditing is necessary
Example: P2P implementation verification
  – Problem: are messages delivered as specified by the algorithms?
  – Detect extra hops, loops, and verify that the paths are correct
  – Can implement as a query:
    • select length from paths where (length > log2(N))
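The hop-bound query can be written directly over a collection of message paths. In this sketch a path is assumed to be the list of nodes a message visited, so its length in hops is one less than the number of nodes; the node names and helper names are made up.

```python
import math


def hop_bound_violations(paths, n_nodes):
    """Equivalent of: select length from paths where (length > log2(N))."""
    bound = math.log2(n_nodes)
    return [path for path in paths if len(path) - 1 > bound]


def has_loop(path):
    """A path loops if any node appears on it more than once."""
    return len(set(path)) < len(path)


message_paths = [
    ["n1", "n4", "n7"],              # 2 hops
    ["n1", "n2", "n3", "n5", "n8"],  # 4 hops
    ["n1", "n2", "n1", "n6"],        # 3 hops, revisits n1
]
too_long = hop_bound_violations(message_paths, n_nodes=8)  # bound: log2(8) = 3 hops
looping = [p for p in message_paths if has_loop(p)]
```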

Page 27: 4. Detecting Single Point of Failure

Key idea: paths converge on a single point of failure
Useful for finding out what to replicate to improve availability
P2P example:
  – Many P2P systems rely on overlay networks, which typically are networks built on top of the IP infrastructure.
  – It’s common for several overlay links to fail together if they depend on a shared physical IP link that failed
Implement as a query:
  – intersect edge.IP_links from paths
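The intersection query might look like this in code. The sketch assumes each path is a list of overlay edges and that each edge is already annotated with the set of physical IP links it depends on; how that annotation is obtained is outside the sketch, and the link names are invented.

```python
def common_links(paths):
    """Equivalent of: intersect edge.IP_links from paths.

    Each path is a list of overlay edges; each edge is the set of IP links it uses.
    """
    link_sets = [set().union(*path) for path in paths]
    return set.intersection(*link_sets) if link_sets else set()


overlay_paths = [
    [{"ip1", "ip2"}, {"ip5"}],
    [{"ip3"}, {"ip5", "ip6"}],
    [{"ip5"}, {"ip2"}],
]
shared = common_links(overlay_paths)
```

Any link left in the result is used by every path, which makes it a candidate single point of failure and a candidate for replication.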

[Figure: example peer, sensor, and component (A to G) graphs]

Page 28: 5. Monitoring of Sensor Networks

An emerging area with primitive tools
Key idea: use paths to reconstruct topology and membership
Example:
  – Membership
    • select unique node from paths
  – Network topology
    • for directed information dissemination
Challenge: limited bandwidth
  – Can record a (random) subset of the nodes for each path, then statistically reconstruct the paths
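The membership query reduces to a set union over the recorded paths. The sketch below assumes each recorded path holds only a sampled subset of the nodes a message visited; the sighting count is an added illustration of how often each node was observed, not something the talk specifies.

```python
from collections import Counter


def membership(sampled_paths):
    """Equivalent of: select unique node from paths."""
    return {node for path in sampled_paths for node in path}


def sighting_counts(sampled_paths):
    """Number of recorded paths in which each node was observed."""
    return Counter(node for path in sampled_paths for node in set(path))


# Each entry is the sampled subset of nodes recorded for one message.
sampled = [["s1", "s4"], ["s2"], ["s1", "s3"], ["s4", "s2"]]
members = membership(sampled)
counts = sighting_counts(sampled)
```

With sampling, a node seen in few paths is not necessarily rarely used, so the counts only become meaningful across many messages.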

Page 29: Macro Analysis

Look across many requests to get the overall system behavior
  – more robust to noise

[Table: components (rows A, B, C) versus requests (columns Request 1 to Request 4), with an X marking each component a request used; A is marked for three requests, B for one, and C for two]

Page 30: Properties of Network Systems

Web services, P2P systems, and sensor networks can have tens of thousands of nodes, each running many application components
Continuous adaptation provides high availability, but also makes it difficult to reproduce and debug errors
Constant evolution of software and hardware

Page 31: Motivation

Difficult to understand and debug network systems
  – e.g. Clustered Internet systems, P2P systems and sensor networks
  – Composed of many components
  – Systems are becoming larger, more dynamic, and more distributed
Workload is unpredictable and impractical to simulate
  – Unit testing is necessary but insufficient. Components break when used together under real workload
Don’t have tools that capture the interactions between components and the overall behavior
  – Existing debugging tools and application-level logs only do micro analysis

Page 32: Macro vs Micro Analysis

Resolution
  – Macro analysis: component. Complements micro analysis tools.
  – Micro analysis: line or variable
Overhead
  – Macro analysis: low. Can use it in actual deployment.
  – Micro analysis: high. Typically not used in deployment other than application logs.

Page 33: What’s a dynamic path?

A dynamic path is the (control flow + runtime properties) of a request
  – Think of it as a stack trace across process/machine boundaries with runtime properties
  – Dynamically constructed by tracing requests through a system
Runtime properties
  – Resources (e.g. host, version)
  – Performance properties (e.g. latency)
  – Arguments (e.g. URL, args, SQL statement)
  – Success/failure

[Figure: a request's path through components A to F; each hop carries a record such as RequestID: 1, Seq Num: 1, Name: A, Host: xx, Latency: 10ms, Success: true, ...]

Page 34: Related Work

Micro debugging tools
  – RootCause provides extensible logging of method calls and arguments.
  – Diduce looks for inconsistencies in variable usage.
  – Complements macro analysis tools.
Languages for monitoring
  – InfoSpect looks for inconsistencies in system state using a logic language
Network flow-based monitoring
  – RTFM and Cisco NetFlow classify and record network flows
Statistical and data mining languages
  – S, DMQL, WebML

Page 35: Visualization Techniques

Tainted paths: mark all flows that have a certain property (e.g. failed or slow) with a distinct color and overlay them on the graph
Detecting performance bottlenecks: look for replicated nodes that have different colors
Detecting anomalies: look for missing edges and unknown paths

Page 36: Pinpoint Framework

[Figure: requests #1, #2, #3 pass through the communications layer (tracing & internal F/D) to components A, B, C; the trace logs (e.g. 1,A  1,C  2,B ...) and the external F/D logs (e.g. 1, success  2, fail  3, ...) feed statistical analysis, which outputs the detected faults]

Page 37: Experimental Setup

Demo app: J2EE Pet Store
  – e-commerce site w/ ~30 components
Load generator
  – replay trace of browsing
  – Approx. TPC-W WIPSo load (~50% ordering)
Fault injection parameters
  – Trigger faults based on combinations of used components
  – Inject exceptions, infinite loops, null calls
55 tests with single-component faults and interaction faults
  – 5-min runs of a single client (J2EE server limitation)

Page 38: Application Observations

# of components used in a dynamic web page request:
  – median 14, min 6, max 23
Large number of tightly coupled components that are always used together

[Figure: cumulative % of total requests (0% to 100%) vs. # of components (1 to 21)]
[Figure: # of tightly coupled components (0 to 12) for cart, productItemList, InventoryEJB, SignOnEJB, ClientControllerEJB, ShoppingCartEJB]

Page 39: Metrics

Precision: C/P
Recall: C/A
Accuracy: whether all actual faults are correctly identified (recall == 100%)
  – boolean measure

[Figure: Venn diagram of Predicted Faults (P) and Actual Faults (A); their overlap is Correctly Identified Faults (C)]
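The three metrics follow directly from set operations on the predicted and actual fault sets. In this sketch the `evaluate` helper and the handling of empty sets (precision 0 with no predictions, recall 1 with no actual faults) are assumptions, not definitions from the talk.

```python
def evaluate(predicted, actual):
    """Precision C/P, recall C/A, and the boolean accuracy measure."""
    correct = predicted & actual
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(actual) if actual else 1.0
    return {"precision": precision, "recall": recall, "accurate": recall == 1.0}


# Three components blamed, two actually faulty: all faults found, one false alarm.
result = evaluate(predicted={"A", "B", "C"}, actual={"A", "B"})
missed = evaluate(predicted={"A"}, actual={"A", "B"})
```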

Page 40: 4 Analysis Techniques

Pinpoint: clusters of components that statistically correlate with failures
Detection: components where Java exceptions were detected
  – union across all failed requests
  – similar to what an event monitoring system outputs
Intersection: intersection of components used in failed requests
Union: union of all components used in failed requests
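The Intersection and Union baselines are simple enough to sketch; the Pinpoint clustering itself is not reproduced here. Each request is assumed to be a (components used, succeeded) pair, and the helper names are hypothetical.

```python
def failed_component_sets(requests):
    """Component sets of the failed requests only."""
    return [set(components) for components, ok in requests if not ok]


def intersection_of_failed(requests):
    """Components used by every failed request."""
    failed = failed_component_sets(requests)
    return set.intersection(*failed) if failed else set()


def union_of_failed(requests):
    """Components used by at least one failed request."""
    return set().union(*failed_component_sets(requests))


requests = [
    (["A", "B", "C"], False),
    (["A", "C", "D"], False),
    (["A", "B"], True),
]
suspects_intersection = intersection_of_failed(requests)
suspects_union = union_of_failed(requests)
```

Union blames every component a failed request touched, so it tends to trade precision for coverage; Intersection blames fewer components but can miss faults that do not appear in every failed request.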

Page 41: Results

Pinpoint has high accuracy with relatively high precision

[Figure: bar chart of average accuracy and average precision (0% to 100%) for Pinpoint, Detection, Intersection, and Union; bar labels recovered from the chart, with their assignment to series not recoverable: 76%, 50%, 83%, 100%, 33%, 77%, 15%, 11%]
[Figure: average accuracy (0% to 100%) vs. # of interacting components (1 to 4) for Pinpoint, Detection, Intersection, and Union]

Page 42: Pinpoint Prototype Limitations

Assumptions
  – client requests provide good coverage over components and combinations
  – requests are autonomous (don’t corrupt state and cause later requests to fail)
Currently can’t detect the following:
  – faults that only degrade performance
  – faults due to pathological inputs
Single-node only

Page 43: Current Status

Simple graph visualization

Page 44: Proposed Research

3 classes of large network systems
  – Clustered Internet systems
    • Tiered architecture, high bandwidth, many replicas
  – Peer-to-peer (P2P) systems, including sensor networks
    • Widely distributed nodes, dynamic membership
  – Sensor networks
    • Limited storage, processing, and bandwidth.

Page 45: P2P Systems: Tracing

Trace messages by piggybacking the current node name onto the messages
Tracing overhead
  – Assume 32 bits per node name and a very conservative log2(N) hops for each msg
  – Data overhead is 40% for a 1500-byte message in a 10^6-node system
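Piggybacking can be sketched as each node appending its name to a trace carried inside the message. The message layout, the `trace` field, and the node names below are assumptions made for illustration; only the 32-bit (4-byte) name size comes from the slide.

```python
BYTES_PER_NODE_NAME = 4  # 32-bit node names, as assumed on the slide


def forward(message, node_name):
    """Return a copy of the message with this node's name appended to its trace."""
    return {**message, "trace": message.get("trace", []) + [node_name]}


msg = {"payload": "lookup key 42"}
for hop in ["n3", "n9", "n17"]:
    msg = forward(msg, hop)

# Extra bytes carried by the message after three hops.
trace_bytes = BYTES_PER_NODE_NAME * len(msg["trace"])
```

The trace grows by one name per hop, so the bytes added to a message are bounded by the hop bound times the name size.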

Page 46: P2P Systems: Implementation Verification

Current debugging techniques: lots of printf()’s on each node, then manually correlating the paths taken by messages
How do you know the messages are delivered as specified by the algorithms?
Use message paths to check for routing invariants
  – detect extra hops, loops, and verify that the paths are correct
Can implement as a query:
  – select length from paths where (length > log2(N))