A Path-based Approach to Managing Failures and Evolution

Mike Chen
Computer Science Division
University of California, Berkeley

Mike Chen - UC Berkeley Slide 2

Need for Fast Recovery

Failures are common and costly
– Daily partial site outages for large sites.
– Downtime: $300K to $6 million/hr.

Challenges:
– Lots of potential sources of faults.
– Multiple independent faults.
– Distributed runtime behavior (e.g. load balancing)

Observation: very short outages are “free”
– Cost of downtime is not linear.

Slide 3

Need for Rapid Evolution

Competition drives demand for new features and bug fixes
– Switching cost is low.
– Single administrative domain lowers upgrade barrier.

Challenges:
– Short release cycles
  • Weekly and bi-weekly for new features at eBay and Tellme, shorter for bug fixes.
– Distributed runtime behavior

Observation: trend towards application server frameworks
– E.g. J2EE, .NET, etc.

Slide 4

Current Approaches to Understand Systems

2 extremes of granularity
– External (end to end)
– “Micro” view, e.g. code-level debuggers

Problems:
– Dispersed execution context
– Local context often insufficient
– “Blackbox” components

[Figure: granularity axis. Labels: eBay; External (end to end); “Micro” view (X = 3, Y = true).]

Slide 5

“Macro” Approach

Captures the relationship between components and their aggregate behavior
– Complements both end-to-end tools and “micro” analysis tools.

[Figure: the “Macro” view placed between the External (end to end) view of eBay and the “Micro” view (e.g. code-level debuggers; X = 3, Y = true). Labels in the macro view: Web Server (WS), App, DB.]

Slide 6

Research Contributions

Developed concept of paths in J2EE app server environment [DSN 02]
– Pinpoint: application-generic failure diagnosis

Implemented a path-based analysis framework and instrumented JBoss (open-source J2EE) [HotOS 03, WIAPP 03]
– Failure detection/diagnosis and dependency collection

Deployed and evaluated path-based analysis at Tellme and eBay [NSDI 04, ICAC 04]
– Adapted machine learning techniques to failure detection/diagnosis
– Change detection
– In progress: auto-recovery

Slide 7

First Step: Path-based Analysis

Paths record runtime properties of requests
– components used (name, version, etc.)
– timestamps

Two principles
1. Use paths as the core abstraction
2. Apply statistical analysis to a large number of paths

Focus on correctness
– In addition to performance (Magpie, WebMon, HP)

[Figure: a request's path through components WebA, WebB, AppA, AppB, AppC, DBA, DBB.]

Example path:
1. Web A, t = 1
2. App A, t = 23
3. App B, t = 30
4. DB B, t = 56
…
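A minimal sketch of a path record, assuming a path is simply a request ID plus an ordered list of observations. The class and field names (`Observation`, `Path`, `request_id`) are illustrative and are not taken from Pinpoint; the component names and timestamps are the example values listed on this slide.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    component: str       # e.g. "WebA"
    timestamp: int       # same arbitrary time unit as the slide (t = 1, 23, ...)
    version: str = ""    # optional component version


@dataclass
class Path:
    request_id: str      # hypothetical identifier tying observations together
    observations: list = field(default_factory=list)

    def record(self, component, timestamp, version=""):
        """Append one observation made while serving this request."""
        self.observations.append(Observation(component, timestamp, version))

    def components(self):
        """Components touched by the request, in recorded order."""
        return [o.component for o in self.observations]

    def latency(self):
        """Elapsed time between the earliest and latest observation."""
        stamps = [o.timestamp for o in self.observations]
        return max(stamps) - min(stamps) if stamps else 0


path = Path("req-1")
for name, t in [("WebA", 1), ("AppA", 23), ("AppB", 30), ("DBB", 56)]:
    path.record(name, t)
```

Keeping the record this small reflects the two principles above: the path is the unit of analysis, and statistics are applied across many such records rather than to any single one.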

Slide 8

Architecture

Observation includes:
– Component/resource names, version, …
– Timestamps

Application-generic tracing
– By instrumenting the application servers
  • E.g. < 1K lines for JBoss, a J2EE app server
– Request-centric
  • Associate system events to user-visible events

[Figure: architecture diagram. Labels: request; Web, App, and DB tiers, each with a Tracer; observation; Aggregator; Path; Storage; Query interface; Analysis Engines (Detection, Diagnosis, Viz); Ops/QA/Dev.]
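One way the aggregation step might look, assuming each tracer emits `(request_id, component, timestamp)` tuples. The tuple layout and the `aggregate` function are assumptions made for illustration; the slide does not specify the observation format or how requests are identified.

```python
from collections import defaultdict


def aggregate(observations):
    """Group interleaved tracer observations into one ordered path per request.

    `observations` is an iterable of (request_id, component, timestamp) tuples
    (a hypothetical format). Observations may arrive out of order, so each
    request's path is ordered by timestamp.
    """
    by_request = defaultdict(list)
    for request_id, component, timestamp in observations:
        by_request[request_id].append((timestamp, component))
    return {
        request_id: [component for _, component in sorted(entries)]
        for request_id, entries in by_request.items()
    }


# Observations from two concurrent requests, interleaved as they might arrive.
stream = [
    ("r1", "WebA", 1), ("r2", "WebB", 2), ("r1", "AppA", 23),
    ("r2", "AppC", 9), ("r1", "DBB", 56), ("r2", "DBA", 15),
]
paths = aggregate(stream)
```

Because grouping depends only on the request ID, the analysis engines downstream need no knowledge of which application produced the observations.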

Slide 9

Talk Outline

Motivation and Approach

Failure Management
– Paths applied to failure management process
  • Failure detection via path anomalies
– Failure diagnosis using machine learning methods

Evolution Management
– Application-generic dependency tracking
– Detecting and diagnosing changes

Conclusions

Slide 10

Failure Management

Goal: minimize impact of failures
– User-visible failures => $$$ lost

78% of recovery time is spent on detection and diagnosis

[Figure: failure timeline. Labels: failure, Detection, Diagnosis, Repair, Recovery, Impact Analysis, Feedback; 78% marked on the timeline.]

Slide 11

Fast Recovery Challenges

Many potential causes of failures
– SW bugs, hardware, configuration, network, DB, …
– Multiple independent failures

Lots of data
– Many small, but tolerable failures
– Real-time detection/diagnosis

Root cause might not be captured in logs
– Tradeoff between logging granularity and overhead

Observation: exact root cause may not be required for many recovery techniques

Slide 12

Failure Detection Concepts

Path collisions
– Incomplete paths interrupted by other requests.

Structural anomalies
– Learn a set of “good” paths, and flag unseen paths.
– Extend to use probabilistic models.

[Figure: requests flowing through WebA, WebB, AppA, AppB, AppC, DBA, DBB.]
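The structural-anomaly idea can be sketched as a set-membership check, assuming a path's "shape" is just its ordered sequence of component names. This is the simplest non-probabilistic form; the function names and training data are illustrative.

```python
def learn_good_paths(training_paths):
    """Remember the shape (ordered component sequence) of every known-good path."""
    return {tuple(path) for path in training_paths}


def is_anomalous(path, good_shapes):
    """Flag any path whose shape was never seen during training."""
    return tuple(path) not in good_shapes


good = learn_good_paths([
    ["WebA", "AppA", "DBA"],
    ["WebB", "AppB", "DBB"],
])

truncated = ["WebA", "AppA"]       # e.g. a request that never reached the DB
seen = ["WebA", "AppA", "DBA"]
```

A pure membership test flags every rare-but-legitimate path, which is presumably one motivation for the probabilistic extension mentioned above.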

Slide 13

Failure Diagnosis Concepts

Idea: all bad paths touch the root cause
– Look for path properties common to failed requests
  • E.g. components used in all failed paths
– Extend to use probabilistic models.

[Figure: failed requests flowing through WebA, WebB, AppA, AppB, AppC, DBA, DBB.]
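A small sketch of "components used in all failed paths", assuming each path is given as a list of component names plus a success flag. The data is made up for illustration; only the component names echo the figure.

```python
def common_to_failed(paths):
    """Return the components that appear in every failed path, sorted by name.

    `paths` is a list of (components, succeeded) pairs.
    """
    failed = [set(components) for components, succeeded in paths if not succeeded]
    if not failed:
        return []
    return sorted(set.intersection(*failed))


paths = [
    (["WebA", "AppA", "AppB", "DBA"], False),   # failed
    (["WebB", "AppB", "DBB"], False),           # failed
    (["WebA", "AppC", "DBA"], True),            # succeeded
]
suspects = common_to_failed(paths)
```

A strict intersection assumes a single fault; with multiple independent faults no component may be common to all failed paths, which is where statistical methods become necessary.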

Slide 14

Failure Diagnosis

Summarize each path into features and a class label (Status):

ID | Type | Name         | Pool | Host | Version | DB                    | Status
1  | URL  | ViewFeedback | Cgi0 | 134  | 1.2.1   | FeedbackDB, UserDB, … | NullPointer
2  | URL  | Bid          | Cgi2 | 231  | 1.0.3   | PriceDB               | Success
3  | XML  | …            | …    | …    | …       | …                     | …

What features of requests correlate with failures (e.g. NullPointerException)?
– Txn type, txn name, pool, host, version, DB, or a combination of these?
– Different causes require different recovery techniques

Slide 15

eBay’s Site

2 physical tiers
– Web server/app server + DB
– Apps in both Java (WebSphere) and C++

SuperCAL (Centralized Application Logging)
– API for app developer to log anything to CAL
– Platform logs common path features: cookie, host, URL, DB table(s), status, etc.

Stats
– 1TB raw logs/day (150GB gzipped), 200Mbps peak
– 2K app servers, 40 SuperCAL machines

How to diagnose accurately and efficiently???

Slide 16

Borrow Statistical Learning Techniques

Cast as feature selection problem in machine learning

Use decision trees because results are easily interpretable
1. Learn the tree from data (with failed paths)
2. The edges that lead to failed nodes are the predicted faults

Type | Name         | Machine | Status
URL  | MyFeedback   | X       | NullPointer
URL  | Login        | X       | Success
…    | …            | …       | …
URL  | Respond      | Y       | Timeout
XML  | ViewFeedback | Y       | Success

[Figure: decision tree that splits on Machine (X, Y) and then on Request Name. Under Machine X: MyFeedback leads to NullPointer, Login to Success. Under Machine Y: Respond leads to Timeout, ViewFeedback to Success.]

Diagnosis: 1) Machine X and MyFeedback 2) Machine Y and Respond
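To make the feature-selection framing concrete, here is a sketch that scores each feature by information gain, the splitting criterion of ID3-style decision trees (C4.5 uses a normalized variant, gain ratio). It is not a full tree learner. The four rows are the ones shown on this slide, with NullPointer and Timeout treated as failures; the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, label="failed"):
    """Reduction in label entropy obtained by splitting `rows` on `feature`."""
    base = entropy([row[label] for row in rows])
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[label] for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


rows = [
    {"type": "URL", "name": "MyFeedback", "machine": "X", "failed": True},
    {"type": "URL", "name": "Login", "machine": "X", "failed": False},
    {"type": "URL", "name": "Respond", "machine": "Y", "failed": True},
    {"type": "XML", "name": "ViewFeedback", "machine": "Y", "failed": False},
]
gains = {f: information_gain(rows, f) for f in ("type", "name", "machine")}
best = max(gains, key=gains.get)
```

With only four rows every request name is unique, so `name` trivially separates failures from successes. On real traces the scores are computed over many paths, and a feature with many distinct values can look better than it is, which is the bias gain ratio is meant to reduce.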

Slide 17

Experimental Setup

Data set: 10 one-minute traces
– 10 most recent failures w/ known causes
– 4 with 2 independent faults (i.e. total of 14 independent faults)
– About 1/8 of the whole site

Methodology: build 1 tree for each trace

Metrics
– Recall: % of true faults identified
  = (# of identified faults) / (# of true faults)
– Precision: 1 – false positive rate
  = (# of identified faults) / (# of predicted faults)

Type | Name | Pool | Machine | Version | Database | Status
10   | 300  | 15   | 260     | 7       | 40       | 8

Host | DB | Host, Host | Host, DB | Host, SW | DB, SW
2    | 4  | 1          | 1        | 1        | 1
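The two metrics translate directly into code. This sketch assumes faults are represented as hashable labels; the labels below are hypothetical and are not data from the eBay traces.

```python
def recall_precision(predicted, true_faults):
    """Return (recall, precision) as defined on the slide.

    recall    = (# of identified faults) / (# of true faults)
    precision = (# of identified faults) / (# of predicted faults)
    where an identified fault is a predicted fault that is also a true fault.
    """
    predicted, true_faults = set(predicted), set(true_faults)
    identified = predicted & true_faults
    recall = len(identified) / len(true_faults) if true_faults else 0.0
    precision = len(identified) / len(predicted) if predicted else 0.0
    return recall, precision


# Hypothetical trace with two true faults and four predicted causes.
true_faults = {"host:134", "db:PriceDB"}
predicted = ["host:134", "pool:Cgi0", "db:PriceDB", "version:1.0.3"]
recall, precision = recall_precision(predicted, true_faults)
```

Here both true faults are found (recall 1.0), but half of the predictions are false positives (precision 0.5).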

Slide 18

Adapting Decision Trees to Failure Analysis

Problem: standard decision trees produce 90% false positives
– Due to 1) noise and 2) dependent features

Feature selection heuristics
1. Ignore leaf nodes with no failed transactions
2. Noise filtering: ignore nodes with < N% failures (e.g. N = 10%)
3. Path trimming: trim each path by eliminating ancestor nodes that are subsumed by children nodes.
4. Ranking: sort the predicted causes by failure count

[Figure: three versions of a tree over Machine (X, Y) and request name (Login, MyFeedback, ViewFdbk, Respond). Original leaves carry failure counts 8, 205, 1, 3554. After “filter”, only the leaves with 205 and 3554 failures remain. After “trim”, only MyFeedback and Respond remain.]

1. Noise
2. Dependent features, e.g. all “Respond” requests run on machine “Y”
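The four heuristics can be sketched as a post-processing pass over the tree's leaves. This is an illustration under stated assumptions, not the deployed implementation: each leaf is a hypothetical `(path, failed, total)` triple, "< N% failures" is read as the failure rate at that leaf, and an ancestor edge counts as subsumed when every request matching a later edge also matches it. The failure counts 8, 205, 1, 3554 come from the figure; the totals and request rows are made up.

```python
def rows_matching(rows, edge):
    """Requests whose feature equals the value named by a tree edge."""
    feature, value = edge
    return [row for row in rows if row[feature] == value]


def trim(path, rows):
    """Drop ancestor edges that a later (child) edge already implies in the data."""
    kept = []
    for i, (feature, value) in enumerate(path):
        subsumed = any(
            all(row[feature] == value for row in rows_matching(rows, child))
            for child in path[i + 1:]
        )
        if not subsumed:
            kept.append((feature, value))
    return kept


def diagnose(leaves, rows, min_failure_rate=0.10):
    causes = []
    for path, failed, total in leaves:
        if failed == 0:                          # 1. no failed transactions
            continue
        if failed / total < min_failure_rate:    # 2. noise filtering
            continue
        causes.append((trim(path, rows), failed))   # 3. path trimming
    return sorted(causes, key=lambda c: c[1], reverse=True)  # 4. ranking


# Hypothetical request population: "Respond" only ever runs on machine Y.
rows = [
    {"machine": "X", "name": "MyFeedback"},
    {"machine": "Z", "name": "MyFeedback"},
    {"machine": "X", "name": "Login"},
    {"machine": "Y", "name": "Respond"},
    {"machine": "Y", "name": "ViewFdbk"},
]
leaves = [
    ([("machine", "X"), ("name", "MyFeedback")], 205, 300),
    ([("machine", "X"), ("name", "Login")], 8, 4000),
    ([("machine", "Y"), ("name", "Respond")], 3554, 3600),
    ([("machine", "Y"), ("name", "ViewFdbk")], 1, 2500),
    ([("machine", "Z"), ("name", "MyFeedback")], 0, 280),
]
causes = diagnose(leaves, rows)
```

In this made-up data the machine Y edge is trimmed because every "Respond" request runs on Y, while machine X is kept because "MyFeedback" also runs elsewhere.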

Slide 19

Adapting Decision Trees to Failure Analysis

[Figure: bar chart of recall and precision (0% to 100%) for C4.5, C4.5 (noise filtering), and C4.5 (noise filtering + path trimming). Annotations: 90% false positive; 14% false positive.]

Problem: standard decision trees produce 90% false positives
– Due to noise and dependent features

Feature selection heuristics
1. Ignore leaf nodes with no failed transactions
2. Noise filtering: ignore nodes with < N% failures (e.g. N = 10%)
3. Path trimming: trim each path by eliminating ancestor nodes that are subsumed by children nodes.
4. Ranking: sort the predicted causes by failure count

Slide 20

Diagnosis Results of Decision Trees

Recall vs precision tradeoff
– by varying the noise filtering threshold

Decision trees
– C4.5 w/ adaptation
  • A standard decision tree algorithm
– MinEntropy
  • A greedy variant of decision tree that finds one leaf with the most failures
  • Actual results from eBay deployment
– Association rules
  • Data mining algorithm that computes the conditional probabilities for all combinations of features

[Figure: Precision vs Recall plot (both axes 0 to 1) for C4.5, Assoc. Rules, and MinEntropy, with the “perfect” point marked.]
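As a rough illustration of the association-rules bullet, the sketch below estimates P(failure | feature combination) for every combination of feature values that occurs in the data. It is a brute-force version written for clarity, not the algorithm evaluated in the plot; the rows reuse the small Type/Name/Machine/Status example from the decision-tree slide.

```python
from collections import Counter
from itertools import combinations


def failure_rules(rows, label="status", ok="Success"):
    """Map each observed feature-value combination to its failure probability."""
    seen, failed = Counter(), Counter()
    for row in rows:
        features = sorted((k, v) for k, v in row.items() if k != label)
        for size in range(1, len(features) + 1):
            for combo in combinations(features, size):
                seen[combo] += 1
                if row[label] != ok:
                    failed[combo] += 1
    return {combo: failed[combo] / seen[combo] for combo in seen}


rows = [
    {"type": "URL", "name": "MyFeedback", "machine": "X", "status": "NullPointer"},
    {"type": "URL", "name": "Login", "machine": "X", "status": "Success"},
    {"type": "URL", "name": "Respond", "machine": "Y", "status": "Timeout"},
    {"type": "XML", "name": "ViewFeedback", "machine": "Y", "status": "Success"},
]
rules = failure_rules(rows)
```

The number of combinations grows exponentially with the number of features, so a practical implementation would prune combinations with low support.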

Slide 21

Talk Outline

Motivation and Approach

Failure Management

Evolution Management
– Application-generic dependency tracking
– Detecting and diagnosing expected and unexpected changes

Conclusions

Slide 22

Tracking Dependency

Current approaches
– Manual approaches are error-prone and slow
– Static analysis captures possible system behavior vs. runtime analysis which captures the actual behavior

Paths directly capture application structure
– Application-generic tracking of actual dependency
  • Zero changes to applications

[Figure: Rubis, a J2EE auction application, hosted on Pinpoint/JBoss]

Slide 23

Automatically Derived State Dependency

Paths associate requests with internal state
– Coupling of requests through shared state
  • Easily extended to track fine-grained (e.g. row-level) state sharing

[Table: requests (rows) against database tables (columns: Product, Signon, Account, Banner, Inventory). R – read, W – write. Access marks per request, left to right:]
ProductDetails: R, R/W
NewAccount: R, R
Checkout: W
Cart: R, R, R/W
VerifySignin: R, R, R
Search: R, R
Category: R
CommitOrder: R, W

PetStore, a J2EE e-commerce application, hosted on Pinpoint/JBoss
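A sketch of how such a request-to-table matrix could be derived from path observations, and how coupling through shared state falls out of it. The `(request, table, mode)` observation format and the sample accesses are hypothetical; they do not reproduce the PetStore table.

```python
from collections import defaultdict


def access_matrix(observations):
    """Build request -> table -> set of modes ("R"/"W") from path observations."""
    matrix = defaultdict(lambda: defaultdict(set))
    for request, table, mode in observations:
        matrix[request][table].add(mode)
    return matrix


def coupled(matrix):
    """Find (writer, reader, table) triples where one request type writes
    a table that a different request type reads."""
    pairs = set()
    for writer, written in matrix.items():
        for reader, read in matrix.items():
            if writer == reader:
                continue
            for table, modes in written.items():
                if "W" in modes and "R" in read.get(table, set()):
                    pairs.add((writer, reader, table))
    return pairs


# Hypothetical observations extracted from paths.
observations = [
    ("Checkout", "Inventory", "W"),
    ("Cart", "Inventory", "R"),
    ("Search", "Product", "R"),
]
matrix = access_matrix(observations)
dependencies = coupled(matrix)
```

Replacing the table name with a (table, row key) pair in the same structure would give the finer-grained, row-level sharing mentioned above.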

Slide 24

Detecting/Diagnosing Changes

Paths provide a flexible mechanism to profile any sub-path
– Take the interval between any two observations
– Drill down to identify problematic sub-paths

Statistical analysis simultaneously examines thousands of sub-paths
– Use non-parametric tests (e.g. Mann-Whitney)
– 10K+ sub-paths tested for every Tellme release

[Figure: a path drawn as a sequence of observations (obs); a sub-path is the interval between any two of them.]
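For readers unfamiliar with the test, here is a compact standard-library sketch of the Mann-Whitney U test applied to one sub-path's latencies from two releases. It uses the normal approximation with no tie or continuity correction, so it is only reasonable for moderately large samples; the latency values are made up.

```python
import math


def mann_whitney_u(a, b):
    """Return (U statistic for sample a, two-sided p-value).

    Ranks are assigned over the pooled samples, with tied values sharing
    their average rank. The p-value uses the normal approximation.
    """
    pooled = sorted((value, index) for index, value in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(pooled)
    pos = 0
    while pos < len(pooled):
        end = pos
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[pos][0]:
            end += 1
        average_rank = (pos + end) / 2 + 1      # mean of 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[pooled[k][1]] = average_rank
        pos = end + 1

    n1, n2 = len(a), len(b)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return u1, p_value


# Hypothetical latencies (ms) of one sub-path in two releases.
v1 = [10, 11, 12, 13, 14, 15, 16, 17]
v2 = [20, 21, 22, 23, 24, 25, 26, 27]
u, p = mann_whitney_u(v1, v2)
```

Because the test works on ranks rather than raw values, it does not assume latencies are normally distributed. When thousands of sub-paths are tested per release, some low p-values will occur by chance, so a correction for multiple comparisons is worth considering.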

Slide 25

Detecting/Diagnosing App-level Changes

Paths enable simultaneous testing of many sub-paths
– drill down to diagnose specific slow sub-paths

[Figure: box plots (lower quartile, median, upper quartile, outliers) for 2 versions of a Tellme application. Annotation: change detected in 1 sub-path in 1 application.]

Slide 26

Detecting/Diagnosing App-level Changes

Paths enable simultaneous testing of many sub-paths
– drilling down to diagnose the specific slow sub-paths

[Figure: box plots (lower quartile, median, upper quartile, outliers) for 2 versions of 2 Tellme applications and 3 sub-paths. Annotation: no changes.]

Slide 27

Detecting/Diagnosing App-level Changes

Paths enable simultaneous testing of many sub-paths
– drilling down to diagnose the specific slow sub-paths

[Figure: 3 versions of 2 Tellme applications and 3 sub-paths. Annotation: app fixed.]

Slide 28

Detecting/Diagnosing Platform Changes

Look for consistent deviation across applications

[Figure: 2 versions of a Tellme platform. Annotation: change detected in 1 sub-path in 1 application.]

Slide 29

Detecting/Diagnosing Platform Changes

Look for consistent deviation across applications

[Figure: 2 versions of a Tellme platform. Annotation: consistent changes across all apps.]

Slide 30

Detecting/Diagnosing Platform Changes

Look for consistent deviation across applications

[Figure: 3 versions of a Tellme platform. Annotation: platform fixed.]

Slide 31

Understanding Outliers

Use survivor plot (1 – CDF)
– Plot on a logarithmic scale

Change from v1 to v2 is statistically significant (Mann-Whitney)

[Figure: survivor plot (1 – CDF) for 2 versions of a Tellme application.]
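The quantity plotted is the empirical survivor function, i.e. the fraction of samples strictly greater than each observed value. A minimal sketch with made-up latencies:

```python
def survivor(samples):
    """Return [(x, fraction of samples strictly greater than x)] for each distinct x."""
    ordered = sorted(samples)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        if i + 1 < n and ordered[i + 1] == x:
            continue                     # emit each distinct value once, at its last occurrence
        points.append((x, (n - (i + 1)) / n))
    return points


# Hypothetical sub-path latencies (ms) with one slow outlier.
latencies = [10, 10, 12, 15, 400]
curve = survivor(latencies)
```

On a logarithmic y-axis the tail of this curve is stretched out, so a handful of slow requests that would vanish in a linear CDF becomes visible. The final point has value 0 and cannot be drawn on a log axis, so plotting code typically drops it.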

Slide 32

Understanding Outliers

Use survivor plot (1 – CDF)
– Plot on a logarithmic scale

[Figure: survivor plot (1 – CDF) for 3 versions of a Tellme application. Annotation: fixed.]

Slide 33

Lessons Learned

Separate the path analysis logic from observation instrumentation
– Improves maintainability and extensibility

Data is cheap
– Allows the use of simple statistical algorithms

Live workload
– Important to support online use of tools

Record “attempts”
– Failed components/resources may not record observations properly

Slide 34

Summary

Macro approach helps us understand aggregate system behavior

Paths + statistical analysis:
– Improves failure detection and diagnosis to support fast recovery.
– Automates dependency tracking and change analysis to support rapid evolution.

Deployed and evaluated on real systems
– Pinpoint, Tellme, and eBay

Slide 35

What’s Next?

Extending paths
– Wide-area systems
  • Multiple administrative domains
  • P2P systems
– Continuous queries

HCI/ML/System Dependability
– Detecting failures via anomalous user behavior
  • E.g. clickstream path analysis
– Automatic recovery

Slide 36

Parting Thoughts…

A failure is not a failure if it’s not detected by the users.

Can we exploit the users to buy us some recovery time?

Maybe…

Slide 37

Try… Distracting the Users

Please read our new privacy agreement

Slide 38

Or… Shift the blame to the users!

Did you forget your password?

Slide 39

Thank You

Acknowledgements
– Professor Eric Brewer and Professor Dave Patterson
– Berkeley/Stanford ROC Research Group
– Anthony Accardi (Tellme) and Jim Lloyd (eBay)
– Professor Michael Jordan and Alice Zheng

For more info:
– http://www.cs.berkeley.edu/~mikechen

Slide 40

My Research

Search Engine Meets HCI
– Cha-Cha Search: exploiting hyperlink structure for results presentation and relevance ranking [USITS 99]

Secure Pervasive Computing Infrastructure
– Post-PC: a composable proxy architecture enabling devices to access secure services [WMCSA 00]

Scalable Internet Services Framework
– Ninja vSpace: an event-driven platform for building robust Internet services [USENIX 02]
– Active Connection Management [NOMS 02]

Path-based approach to system dependability
– Pinpoint: application-generic problem determination [DSN 02]
– Path-based failure and evolution management [HotOS 03, WIAPP 03, NSDI 04]
– A statistical learning approach to failure diagnosis [ICAC 04]