A Path-based Approach to Managing Failures and Evolution
Mike Chen
Computer Science Division
University of California, Berkeley
Mike Chen - UC Berkeley, Slide 2
Need for Fast Recovery
Failures are common and costly
– Daily partial site outages for large sites.
– Downtime: $300K - $6 million/hr.
Challenges:
– Lots of potential sources of faults.
– Multiple independent faults.
– Distributed runtime behavior (e.g. load balancing)
Observation: very short outages are “free”
– Cost of downtime is not linear.

Need for Rapid Evolution
Competition drives demand for new features and bug fixes
– Switching cost is low.
– Single administrative domain lowers upgrade barrier.
Challenges:
– Short release cycles
  • Weekly and bi-weekly for new features at eBay and Tellme, shorter for bug fixes.
– Distributed runtime behavior
Observation: trend towards application server frameworks
– E.g. J2EE, .NET, etc.

Current Approaches to Understand Systems
2 extremes of granularity
Problems:
– Dispersed execution context
– Local context often insufficient
– “Blackbox” components
[Figure: granularity axis. One extreme is the external (end to end) view of eBay; the other is the “Micro” view, e.g. code-level debuggers (X = 3, Y = true).]

“Macro” Approach
Captures the relationship between components and their aggregate behavior
– Complements both end-to-end tools and “micro” analysis tools.
[Figure: the “Macro” view of the eBay site (web servers WS, application servers App, and DB) shown alongside the external (end to end) view and the “Micro” view, e.g. code-level debuggers (X = 3, Y = true).]

Research Contributions
Developed concept of paths in J2EE app server environment [DSN 02]
– Pinpoint: application-generic failure diagnosis
Implemented a path-based analysis framework and instrumented JBoss (open-source J2EE) [HotOS 03, WIAPP 03]
– Failure detection/diagnosis and dependency collection
Deployed and evaluated path-based analysis at Tellme and eBay [NSDI 04, ICAC 04]
– Adapted machine learning techniques to failure detection/diagnosis
– Change detection
– In progress: auto-recovery

First Step: Path-based Analysis
Paths record runtime properties of requests
– components used (name, version, etc.)
– timestamps
Two principles
1. Use paths as the core abstraction
2. Apply statistical analysis to a large number of paths
Focus on correctness
– In addition to performance (Magpie, WebMon, HP)
[Figure: a request flows through Web A/B, App A/B/C, and DB A/B. Its recorded path: 1. Web A, t = 1; 2. App A, t = 23; 3. App B, t = 30; 4. DB B, t = 56; …]
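As a rough illustration of the path abstraction on this slide, here is a minimal Python sketch. The class and field names (`Observation`, `Path`, `request_id`) are assumptions made for illustration and are not taken from Pinpoint; the sample values mirror the example path in the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Observation:
    """One tracer record: which component handled the request, and when."""
    component: str      # e.g. "Web A"
    timestamp: int      # same time units as the slide's example (t = 1, 23, ...)
    version: str = ""   # optional component version


@dataclass
class Path:
    """Ordered observations belonging to a single request (hypothetical model)."""
    request_id: str
    observations: List[Observation] = field(default_factory=list)

    def components(self) -> List[str]:
        # Order by timestamp so the path reflects the request's actual flow.
        ordered = sorted(self.observations, key=lambda o: o.timestamp)
        return [o.component for o in ordered]

    def interval(self, start: str, end: str) -> int:
        """Elapsed time between the first observations of two components."""
        first: Dict[str, int] = {}
        for o in sorted(self.observations, key=lambda o: o.timestamp):
            first.setdefault(o.component, o.timestamp)
        return first[end] - first[start]


example = Path("req-1", [
    Observation("Web A", 1),
    Observation("App A", 23),
    Observation("App B", 30),
    Observation("DB B", 56),
])
```

Keeping timestamps on every observation is what later allows any pair of observations to be treated as a sub-path with its own latency.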

Architecture
Observation includes:
– Component/resource names, version, …
– Timestamps
Application-generic tracing
– By instrumenting the application servers
  • E.g. < 1K lines for JBoss, a J2EE app server
– Request-centric
  • Associate system events to user-visible events
[Figure: architecture diagram. Labels: request; a Tracer in each Web, App, and DB node; observation; Aggregator; Path; Storage; Query interface; Analysis Engines (Detection, Diagnosis, Viz); Ops/QA/Dev.]
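The aggregation step in the diagram can be sketched as follows. This is only an assumption about how observations might be grouped: the tuple layout `(request_id, component, timestamp)` and the function name `aggregate` are hypothetical, and the real tracers may tag requests differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (request_id, component, timestamp): a hypothetical format for tracer output.
Obs = Tuple[str, str, int]


def aggregate(observations: Iterable[Obs]) -> Dict[str, List[Tuple[str, int]]]:
    """Group observations by request ID and order each group by timestamp."""
    paths: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for request_id, component, timestamp in observations:
        paths[request_id].append((component, timestamp))
    return {rid: sorted(obs, key=lambda o: o[1]) for rid, obs in paths.items()}


# Observations from different tiers arrive interleaved and out of order.
stream = [
    ("r1", "Web A", 1), ("r2", "Web B", 2), ("r1", "DB B", 56),
    ("r1", "App A", 23), ("r2", "App C", 9),
]
paths = aggregate(stream)
```

Because the grouping key is the request rather than the host, the result is request-centric: each entry ties system events back to one user-visible request.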

Talk Outline
Motivation and Approach
Failure Management
– Paths applied to failure management process
  • Failure detection via path anomalies
– Failure diagnosis using machine learning methods
Evolution Management
– Application-generic dependency tracking
– Detecting and diagnosing changes
Conclusions

Failure Management
Goal: minimize impact of failures
– User-visible failures => $$$ lost
78% of recovery time is spent on detection and diagnosis
[Figure: timeline of the failure management process. Labels: failure, Detection, Diagnosis, Impact Analysis, Repair, Recovery, Feedback; 78% is marked on the timeline.]

Fast Recovery Challenges
Many potential causes of failures
– SW bugs, hardware, configuration, network, DB, …
– Multiple independent failures
Lots of data
– Many small, but tolerable failures
– Real-time detection/diagnosis
Root cause might not be captured in logs
– Tradeoff between logging granularity and overhead
Observation: exact root cause may not be required for many recovery techniques

Failure Detection Concepts
Path collisions
– Incomplete paths interrupted by other requests.
Structural anomalies
– Learn a set of “good” paths, and flag unseen paths.
– Extend to use probabilistic models.
[Figure: requests flowing through Web A/B, App A/B/C, and DB A/B.]
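The simplest form of structural anomaly detection described above (learn a set of “good” paths, flag unseen ones) might look like the sketch below. It treats a path's shape as its ordered component sequence, which is an assumption; the function names are hypothetical, and the probabilistic extension mentioned on the slide is not shown.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Shape = Tuple[str, ...]


def learn_good_shapes(training_paths: Iterable[Sequence[str]]) -> Set[Shape]:
    """Remember every component sequence seen during known-good operation."""
    return {tuple(p) for p in training_paths}


def flag_anomalies(good: Set[Shape], live_paths: Iterable[Sequence[str]]) -> List[Shape]:
    """Return live paths whose shape never appeared in the training set."""
    return [tuple(p) for p in live_paths if tuple(p) not in good]


good = learn_good_shapes([
    ["Web A", "App A", "DB A"],
    ["Web B", "App B", "DB B"],
])
anomalies = flag_anomalies(good, [
    ["Web A", "App A", "DB A"],   # seen before
    ["Web A", "App C"],           # unseen shape, e.g. a truncated path
])
```

Exact set membership flags every rare but legitimate path as well, which is one reason a probabilistic model is the natural next step.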

Failure Diagnosis Concepts
Idea: all bad paths touch the root cause
– Look for path properties common to failed requests
  • E.g. components used in all failed paths
– Extend to use probabilistic models.
[Figure: requests flowing through Web A/B, App A/B/C, and DB A/B; the failed paths share a component.]
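The “components used in all failed paths” idea reduces to a set intersection. The sketch below assumes each failed path is available as a list of component names; the function name and the sample data are hypothetical.

```python
from typing import Iterable, Set


def common_components(failed_paths: Iterable[Iterable[str]]) -> Set[str]:
    """Components present in every failed path (candidate root causes)."""
    sets = [set(p) for p in failed_paths]
    return set.intersection(*sets) if sets else set()


failed = [
    ["Web A", "App B", "DB A"],
    ["Web B", "App B", "DB B"],
    ["Web A", "App B", "DB B"],
]
suspects = common_components(failed)
```

A strict intersection breaks down when there are several independent faults or noisy failures, since no single component is then shared by every failed path.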

Failure Diagnosis
Summarize each path into:

      |------------------------- Features --------------------------| Class Label
ID    Type  Name          Pool  Host  Version  DB                     Status
1     URL   ViewFeedback  Cgi0  134   1.2.1    FeedbackDB, UserDB, …  NullPointer
2     URL   Bid           Cgi2  231   1.0.3    PriceDB                Success
3     XML   …             …     …     …        …                      …

What features of requests correlate with failures (e.g. NullPointerException)?
– Txn type, txn name, pool, host, version, DB, or a combination of these?
– Different causes require different recovery techniques

eBay’s Site
2 physical tiers
– Web server/app server + DB
– Apps in both Java (WebSphere) and C++
SuperCAL (Centralized Application Logging)
– API for app developer to log anything to CAL
– Platform logs common path features: cookie, host, URL, DB table(s), status, etc.
Stats
– 1TB raw logs/day (150GB gzipped), 200Mbps peak
– 2K app servers, 40 SuperCAL machines
How to diagnose accurately and efficiently?

Borrow Statistical Learning Techniques
Cast as feature selection problem in machine learning
Use decision trees because results are easily interpretable
1. Learn the tree from data (with failed paths)
2. The edges leading to failed nodes are the predicted faults
[Figure: decision tree that splits on Machine (Machine X, Machine Y) and then on Request Name (Login, ViewFeedback, MyFeedback, Respond), with leaves labeled Success, NullPointer, or Time-out.]

Type  Name          Machine  Status
URL   MyFeedback    X        NullPointer
URL   Login         X        Success
…     …             …        …
URL   Respond       Y        Timeout
XML   ViewFeedback  Y        Success

Diagnosis: 1) Machine X and MyFeedback 2) Machine Y and Respond
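To make the two steps concrete, here is a minimal sketch of a plain ID3-style tree learner in Python. It is not C4.5 and not the deployed implementation; the function names are hypothetical, and the four rows are the ones recoverable from the slide's table. On such a tiny table the request name alone separates failures from successes, so this sketch never needs to split on Machine.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Row = Dict[str, str]
Edge = Tuple[str, str]  # (feature, value)


def entropy(rows: List[Row]) -> float:
    """Entropy of the failed / not-failed label within a set of rows."""
    counts = Counter(r["Status"] != "Success" for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def best_split(rows: List[Row], features: List[str]) -> str:
    """Pick the feature whose split leaves the lowest weighted entropy."""
    def remainder(f: str) -> float:
        groups: Dict[str, List[Row]] = {}
        for r in rows:
            groups.setdefault(r[f], []).append(r)
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(features, key=remainder)


def failed_leaves(rows: List[Row], features: List[str],
                  path: Tuple[Edge, ...] = ()) -> List[Tuple[Edge, ...]]:
    """Grow the tree and return the edge sequences that end in failing leaves."""
    if entropy(rows) == 0 or not features:
        has_failure = any(r["Status"] != "Success" for r in rows)
        return [path] if has_failure else []
    f = best_split(rows, features)
    rest = [x for x in features if x != f]
    out: List[Tuple[Edge, ...]] = []
    for value in sorted({r[f] for r in rows}):
        subset = [r for r in rows if r[f] == value]
        out += failed_leaves(subset, rest, path + ((f, value),))
    return out


rows = [
    {"Type": "URL", "Name": "MyFeedback", "Machine": "X", "Status": "NullPointer"},
    {"Type": "URL", "Name": "Login", "Machine": "X", "Status": "Success"},
    {"Type": "URL", "Name": "Respond", "Machine": "Y", "Status": "Timeout"},
    {"Type": "XML", "Name": "ViewFeedback", "Machine": "Y", "Status": "Success"},
]
diagnosis = failed_leaves(rows, ["Type", "Name", "Machine"])
```

Each returned tuple is one predicted fault, expressed as the feature values on the way to a failing leaf.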

Experimental Setup
Data set: 10 one-minute traces
– 10 most recent failures w/ known causes
– 4 with 2 independent faults (i.e. total of 14 independent faults)
– About 1/8 of the whole site
Methodology: build 1 tree for each trace
Metrics
– Recall: % of true faults identified
  = (# of identified faults) / (# of true faults)
– Precision: 1 – false positive rate
  = (# of identified faults) / (# of predicted faults)

Type  Name  Pool  Machine  Version  Database  Status
10    300   15    260      7        40        8

Host  DB  Host, Host  Host, DB  Host, SW  DB, SW
2     4   1           1         1         1
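The two metrics can be computed directly from sets of fault labels, as in the sketch below. The label strings and the example trace are hypothetical; only the two formulas come from the slide.

```python
from typing import Set, Tuple


def recall_precision(predicted: Set[str], true_faults: Set[str]) -> Tuple[float, float]:
    """Recall and precision as defined on the slide."""
    identified = predicted & true_faults          # predicted faults that are real
    recall = len(identified) / len(true_faults) if true_faults else 0.0
    precision = len(identified) / len(predicted) if predicted else 0.0
    return recall, precision


# Hypothetical trace: two true faults, three predictions, one of them wrong.
true_faults = {"host:X", "db:PriceDB"}
predicted = {"host:X", "db:PriceDB", "pool:Cgi2"}
recall, precision = recall_precision(predicted, true_faults)
```

Here both true faults are found (recall 1.0), but one of the three predictions is spurious, so precision is 2/3.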

Adapting Decision Trees to Failure Analysis
Problem: standard decision trees produce 90% false positives
– Due to 1) noise and 2) dependent features
Feature selection heuristics
1. Ignore leaf nodes with no failed transactions
2. Noise filtering: ignore nodes with < N% failures (e.g. N = 10%)
3. Path trimming: trim each path by eliminating ancestor nodes that are subsumed by children nodes.
4. Ranking: sort the predicted causes by failure count
[Figure: example tree over Machine (X, Y) and request name (Login, ViewFdbk, MyFeedback, Respond) with leaf failure counts 8, 205, 1, and 3554. The filter step (1. noise) keeps only the leaves with 205 and 3554 failures. The trim step (2. dependent features, e.g. all “Respond” requests run on machine “Y”) drops the Machine nodes, leaving MyFeedback and Respond.]
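The four heuristics can be sketched as a small post-processing pass over the tree's leaves. Several things here are assumptions: the slide does not say whether “N% failures” means a share of all failures or a failure rate within the node (this sketch uses the share of all failures), the assignment of failure counts to leaves is hypothetical, and so are the request rows used to decide which ancestors are subsumed.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Edge = Tuple[str, str]                 # (feature, value)
Leaf = Tuple[Tuple[Edge, ...], int]    # (path from root, failure count)


def noise_filter(leaves: List[Leaf], n_percent: float) -> List[Leaf]:
    """Heuristics 1 and 2: drop leaves with no failures or < N% of all failures."""
    total = sum(failures for _, failures in leaves)
    if total == 0:
        return []
    return [(p, f) for p, f in leaves if f > 0 and 100 * f / total >= n_percent]


def trim(path: Tuple[Edge, ...], rows: List[Row]) -> Tuple[Edge, ...]:
    """Heuristic 3: drop an ancestor edge when a descendant edge implies it."""
    def matches(row: Row, edge: Edge) -> bool:
        return row.get(edge[0]) == edge[1]
    kept = []
    for i, ancestor in enumerate(path):
        subsumed = any(
            all(matches(r, ancestor) for r in rows if matches(r, child))
            for child in path[i + 1:]
        )
        if not subsumed:
            kept.append(ancestor)
    return tuple(kept)


def diagnose(leaves: List[Leaf], rows: List[Row], n_percent: float) -> List[Leaf]:
    """Filter, trim, then rank by failure count (heuristic 4)."""
    kept = noise_filter(leaves, n_percent)
    trimmed = [(trim(path, rows), failures) for path, failures in kept]
    return sorted(trimmed, key=lambda leaf: leaf[1], reverse=True)


# Hypothetical leaves using the failure counts shown in the figure.
leaves = [
    ((("Machine", "X"), ("Name", "Login")), 8),
    ((("Machine", "X"), ("Name", "MyFeedback")), 205),
    ((("Machine", "Y"), ("Name", "ViewFdbk")), 1),
    ((("Machine", "Y"), ("Name", "Respond")), 3554),
]
# Hypothetical request rows: every request name runs on exactly one machine.
rows = [
    {"Machine": "X", "Name": "Login"},
    {"Machine": "X", "Name": "MyFeedback"},
    {"Machine": "Y", "Name": "Respond"},
    {"Machine": "Y", "Name": "ViewFdbk"},
]
ranked = diagnose(leaves, rows, n_percent=1.0)
```

With these inputs the Machine edges are trimmed because each request name already implies its machine, which mirrors the “all Respond requests run on machine Y” example.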

Adapting Decision Trees to Failure Analysis
[Figure: bar chart of recall and precision (0% to 100%) for C4.5, C4.5 (noise filtering), and C4.5 (noise filtering + path trimming); annotated with “90% false positive” and “14% false positive”.]

Diagnosis Results of Decision Trees
Recall vs precision tradeoff
– by varying the noise filtering threshold
Decision trees
– C4.5 w/ adaptation
  • A standard decision tree algorithm
– MinEntropy
  • A greedy variant of decision tree that finds one leaf with the most failures
  • Actual results from eBay deployment
– Association rules
  • Data mining algorithm that computes the conditional probabilities for all combinations of features
[Figure: Precision vs Recall plot (both axes 0 to 1) comparing C4.5, Assoc. Rules, and MinEntropy; the “perfect” point is marked.]
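The association-rules bullet describes computing conditional probabilities over all feature combinations. A brute-force sketch of that idea is shown below; it is not a full association-rule miner (no support or confidence pruning), the function name is hypothetical, and the rows reuse the small example table from the earlier decision-tree slide.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Row = Dict[str, str]
Rule = Tuple[Tuple[str, str], ...]   # a combination of (feature, value) pairs


def failure_probabilities(rows: List[Row], features: List[str]) -> Dict[Rule, float]:
    """Estimate P(failure | feature combination) for every observed combination."""
    seen: Counter = Counter()
    failed: Counter = Counter()
    for row in rows:
        for k in range(1, len(features) + 1):
            for combo in combinations(features, k):
                rule = tuple((f, row[f]) for f in combo)
                seen[rule] += 1
                if row["Status"] != "Success":
                    failed[rule] += 1
    return {rule: failed[rule] / seen[rule] for rule in seen}


rows = [
    {"Name": "MyFeedback", "Machine": "X", "Status": "NullPointer"},
    {"Name": "Login", "Machine": "X", "Status": "Success"},
    {"Name": "Respond", "Machine": "Y", "Status": "Timeout"},
    {"Name": "ViewFeedback", "Machine": "Y", "Status": "Success"},
]
probs = failure_probabilities(rows, ["Name", "Machine"])
```

The number of combinations grows exponentially with the number of features, which is the practical cost of this approach compared with a greedy tree.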

Talk Outline
Motivation and Approach
Failure Management
Evolution Management
– Application-generic dependency tracking
– Detecting and diagnosing expected and unexpected changes
Conclusions

Tracking Dependency
Current approaches
– Manual approaches are error-prone and slow
– Static analysis captures possible system behavior vs. runtime analysis which captures the actual behavior
Paths directly capture application structure
– Application-generic tracking of actual dependency
  • Zero changes to applications
[Figure: Rubis, a J2EE auction application, hosted on Pinpoint/JBoss]

Automatically Derived State Dependency
Paths associate requests with internal state
– Coupling of requests through shared state
  • Easily extended to track fine-grained (e.g. row-level) state sharing
[Table: requests (rows) against database tables (columns: Product, Signon, Account, Banner, Inventory); R = read, W = write. Marks recovered per request, with the column positions lost in extraction: ProductDetails: R, R/W; NewAccount: R, R; Checkout: W; Cart: R, R, R/W; VerifySignin: R, R, R; Search: R, R; Category: R; CommitOrder: R, W.]
PetStore, a J2EE e-commerce application, hosted on Pinpoint/JBoss
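A request-by-table matrix like the one on this slide could be derived from path observations roughly as follows. The observation format `(request type, table, mode)` and the function names are assumptions, and the sample accesses are hypothetical rather than the PetStore values, since the slide's column positions did not survive extraction.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (request type, database table, "R" or "W"): hypothetical observation fields.
Access = Tuple[str, str, str]


def dependency_matrix(accesses: Iterable[Access]) -> Dict[str, Dict[str, str]]:
    """Collapse observed accesses into a request-by-table matrix of R, W, or R/W."""
    modes: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for request, table, mode in accesses:
        modes[(request, table)].add(mode)
    matrix: Dict[str, Dict[str, str]] = defaultdict(dict)
    for (request, table), seen in modes.items():
        matrix[request][table] = "/".join(sorted(seen))   # "R", "W", or "R/W"
    return {req: dict(cols) for req, cols in matrix.items()}


def coupled_requests(matrix: Dict[str, Dict[str, str]], table: str) -> Set[str]:
    """Requests that share state through the given table."""
    return {req for req, cols in matrix.items() if table in cols}


accesses = [
    ("Cart", "Inventory", "R"), ("Cart", "Inventory", "W"),
    ("Cart", "Product", "R"), ("Checkout", "Inventory", "W"),
    ("Search", "Product", "R"),
]
matrix = dependency_matrix(accesses)
```

Reading a column of the matrix then answers which requests are coupled through a given table, for example before changing that table's schema.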

Detecting/Diagnosing Changes
Paths provide a flexible mechanism to profile any sub-path
– Take the interval between any two observations
– Drill down to identify problematic sub-paths
Statistical analysis simultaneously examines thousands of sub-paths
– Use non-parametric tests (e.g. Mann-Whitney)
– 10K+ sub-paths tested for every Tellme release
[Figure: a path drawn as a sequence of observations (obs); a sub-path is the interval between any two of them.]
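For reference, a Mann-Whitney comparison of one sub-path's latencies across two releases can be sketched in pure Python as below. This uses the normal approximation with tie correction and no continuity correction, which is a simplification; the latency samples are hypothetical, and a production setup would more likely call a statistics library.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for sample a, approximate p-value). Tied values get
    average ranks and the variance is tie-corrected.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n1, n2, n = len(a), len(b), len(a) + len(b)
    rank_sum_a, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2            # mean of 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u, 1.0
    z = (u - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided tail of the standard normal
    return u, p


# Hypothetical latencies (ms) of one sub-path in two releases.
v1 = [10, 11, 12, 13, 14, 15, 16, 17]
v2 = [20, 21, 22, 23, 24, 25, 26, 27]
u_stat, p_value = mann_whitney_u(v1, v2)
```

Because the test is rank-based, it makes no assumption about the shape of the latency distribution, which matters for heavy-tailed timing data. When thousands of sub-paths are tested at once, some correction for multiple comparisons is usually worth considering.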

Detecting/Diagnosing App-level Changes
Paths enable simultaneous testing of many sub-paths
– drill down to diagnose the specific slow sub-paths
[Figure: box plots (lower quartile, median, upper quartile, outliers) for 2 versions of a Tellme application. Change detected in 1 sub-path in 1 application.]
[Figure: box plots for 2 versions of 2 Tellme applications and 3 sub-paths. No changes.]
[Figure: box plots for 3 versions of 2 Tellme applications and 3 sub-paths. App fixed.]

Detecting/Diagnosing Platform Changes
Look for consistent deviation across applications
[Figure: 2 versions of a Tellme platform. Change detected in 1 sub-path in 1 application.]
[Figure: 2 versions of a Tellme platform. Consistent changes across all apps.]
[Figure: 3 versions of a Tellme platform. Platform fixed.]

Understanding Outliers
Use survivor plot (1 – CDF)
– Plot on a logarithmic scale
[Figure: survivor plot (1 - CDF) for 2 versions of a Tellme application. Change from v1 to v2 is statistically significant (Mann-Whitney).]
[Figure: survivor plot (1 - CDF) for 3 versions of a Tellme application. Fixed.]
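The survivor curve is simply one minus the empirical CDF. A small sketch, assuming latency samples for one sub-path are available as a list (the values below are hypothetical):

```python
from typing import List, Sequence, Tuple


def survivor(samples: Sequence[float]) -> List[Tuple[float, float]]:
    """Empirical survivor function: for each distinct x, the fraction of samples > x."""
    ordered = sorted(samples)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        if i + 1 < n and ordered[i + 1] == x:
            continue                      # emit each distinct value once, at its last index
        points.append((x, (n - i - 1) / n))
    return points


latencies_ms = [10, 10, 12, 15, 400]     # hypothetical latencies with one outlier
curve = survivor(latencies_ms)
```

On a logarithmic y-axis the tail fractions stay visible, so outliers such as the 400 ms sample stand out; the final point has value 0 and has to be dropped or clipped before taking logs.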

Lessons Learned
Separate the path analysis logic from observation instrumentation
– Improves maintainability and extensibility
Data is cheap
– Allows the use of simple statistical algorithms
Live workload
– Important to support online use of tools
Record “attempts”
– Failed components/resources may not record observations properly

Summary
Macro approach helps us understand aggregate system behavior
Paths + statistical analysis:
– Improves failure detection and diagnosis to support fast recovery.
– Automates dependency tracking and change analysis to support rapid evolution.
Deployed and evaluated on real systems
– Pinpoint, Tellme, and eBay

What’s Next?
Extending paths
– Wide-area systems
  • Multiple administrative domains
  • P2P systems
– Continuous queries
HCI/ML/System Dependability
– Detecting failures via anomalous user behavior
  • E.g. clickstream path analysis
– Automatic recovery

Parting Thoughts…
A failure is not a failure if it’s not detected by the users.
Can we exploit the users to buy us some recovery time?
Maybe…

Try… Distracting the Users
Please read our new privacy agreement

Or… Shift the blame to the users!
Did you forget your password?

Thank You
Acknowledgements
– Professor Eric Brewer and Professor Dave Patterson
– Berkeley/Stanford ROC Research Group
– Anthony Accardi (Tellme) and Jim Lloyd (eBay)
– Professor Michael Jordan and Alice Zheng
For more info:
– http://www.cs.berkeley.edu/~mikechen

My Research
Search Engine Meets HCI
– Cha-Cha Search: exploiting hyperlink structure for results presentation and relevance ranking [USITS 99]
Secure Pervasive Computing Infrastructure
– Post-PC: a composable proxy architecture enabling devices to access secure services [WMCSA 00]
Scalable Internet Services Framework
– Ninja vSpace: an event-driven platform for building robust Internet services [USENIX 02]
– Active Connection Management [NOMS 02]
Path-based approach to system dependability
– Pinpoint: application-generic problem determination [DSN 02]
– Path-based failure and evolution management [HotOS 03, WIAPP 03, NSDI 04]
– A statistical learning approach to failure diagnosis [ICAC 04]