Using Fault Injection and Modeling to Evaluate the Performability of Cluster-Based Internet Services

Kiran Nagaraja, Xiaoyan Li, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen

Dept. of Computer Science, Rutgers University

USITS’03
Thu D. Nguyen, Rutgers U. The Vivo Project

Motivation

Accumulating evidence that today’s services only achieve ~99-99.9% availability

[Gray 2001], [Patterson et al. 2002]

Compare to public telephone system: close to 99.999%

Unavailability is costly (downtime costs per hour):

Brokerage operations: $6,450,000

Credit card authorization: $2,600,000

Ebay (1 outage 22 hours): $225,000

Amazon.com: $180,000

Source: InternetWeek 4/3/2000

Motivation

Complexity of Internet services

Large design space

Many software and hardware components → numerous fault points and types

Currently used ad-hoc techniques (e.g., unplugging cables) are not sufficient

Need a methodology to systematically quantify availability as well as performance

Availability may conflict with performance

Performability: a metric combining performance and availability

Contributions

Methodology for quantitative evaluation of cluster-based services

Availability and Performability

Mendosus

Cluster-based fault injection and network emulation

Support injection of network faults such as switch failure

Capable of injecting multiple types of faults appropriate to cluster-based environments

Case study of a high-performance cluster-based web server

Effect of faults on overall behavior

Tradeoff of performance against availability

Effects of design and environmental decisions

Methodology: Overview

Phase I – Fault injection experiments

Define set of fault types

Inject each fault (and subsequent recovery) into “live” system

Measure system behavior under each fault type

Case study: throughput under constant load

Phase II - Use analytical model to quantify overall service performability

Inputs:

Measured throughputs from phase I

MTTF and MTTR for each fault type

Environmental parameters: operator response time and server reset time

Outputs: average availability and average throughput

Assumed Platform: Clusters

Phase II: Per-Fault Seven-Stage Model

Phase II: Computing Average Throughput and Average Availability

Assume: faults arrive independently and do not overlap

Performability Metric

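$$\text{Performability} = T \times \frac{\log(A_I)}{\log(A)} \approx T \times \frac{U_I}{U}$$

T: throughput under normal execution (the normal-performance component)

A_I: availability of an “ideal” system (e.g., 0.99999)

A: average availability

U = 1 − A and U_I = 1 − A_I: the corresponding unavailabilities

The ratio log(A_I)/log(A) is the penalty component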
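
As a rough illustration of the metric on this slide, the following Python sketch evaluates T × log(A_I)/log(A). The function name `performability` and the default ideal availability of 0.99999 (taken from the slide's example) are assumptions of this sketch, not part of the original material.

```python
import math


def performability(normal_throughput, avg_availability, ideal_availability=0.99999):
    """Return T * log(A_I) / log(A) for availabilities strictly between 0 and 1."""
    for value in (avg_availability, ideal_availability):
        if not 0.0 < value < 1.0:
            raise ValueError("availability must lie strictly between 0 and 1")
    # Penalty component: ratio of the logs of ideal and measured availability.
    penalty = math.log(ideal_availability) / math.log(avg_availability)
    return normal_throughput * penalty


# A service at 99.9% availability is penalized by roughly U_I / U = 1e-5 / 1e-3.
score = performability(1000.0, 0.999)
```

For small unavailabilities the result is close to T × U_I / U, so a tenfold reduction in unavailability raises the score by about a factor of ten.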

Current Limitations

Does not quantify effect of correlated faults

Insufficient data

Sensitivity analysis in the future?

Explosion of fault-injection experiments?

Does not consider session and data-integrity faults

Restricts the class of cluster-based servers

Only considers averages

Does not capture potential importance of variance in throughput

Does not capture resiliency to sudden changes in load

Case Study: PRESS Web Server

Cluster-based web server

Nodes cooperate to globally manage memory to cache content

Requests distributed based on locality and load balancing

Several versions developed over time for increasing performance

VIA-PRESS: Cooperative caching using VIA

VIA connection break for fault detection

Dynamic reconfiguration to tolerate node and application crashes

ReTCP-PRESS: Cooperative caching using TCP

Heartbeats for fault detection

Dynamic reconfiguration to tolerate node and application crashes

TCP-PRESS: TCP timeouts for fault detection; no dynamic reconfiguration

I-PRESS: Independent servers

Question: did increased performance come at a cost in availability?

Phase I: Single-Fault Experiments

Setup: 4-PC cluster running at 90% utilization

800 MHz, 2 SCSI disks, 1 Gbps network

4 client nodes make HTTP requests

Discussion of scaling to larger clusters in paper

Fault Set

Link down

Switch down

SCSI timeout

Node crash

Application crash

All faults are modeled as fail-stop

Single Faults – Link Down

[Figure: behavior under a link-down fault; labels: Operator, Reset]

Phase II – Model Parameters

Average operator response time: 5 minutes

Average restart time: 5 minutes

Fault               MTTF       MTTR
Link down           6 months   3 minutes
Switch down         1 year     1 hour
SCSI timeout        1 year     1 hour
Node crash          2 weeks    3 minutes
Application crash   1 month    3 minutes

Sources: [Iyer99, Talagala99, Heath02]
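
To get a feel for the magnitudes in this table, the sketch below computes, for a single component of each type, the standard steady-state fraction of time spent under repair, MTTR / (MTTF + MTTR). This is a simplification and not the per-fault stage model used in the talk: it ignores throughput degradation and the operator and restart stages. The 30-day month and 365-day year are assumptions of this sketch.

```python
# All times in minutes. Month and year lengths are assumptions of this sketch.
MINUTE = 1
HOUR = 60 * MINUTE
DAY = 24 * HOUR
WEEK = 7 * DAY
MONTH = 30 * DAY   # assumed length of a month
YEAR = 365 * DAY   # assumed length of a year

# (MTTF, MTTR) pairs copied from the table above.
FAULTS = {
    "Link down": (6 * MONTH, 3 * MINUTE),
    "Switch down": (1 * YEAR, 1 * HOUR),
    "SCSI timeout": (1 * YEAR, 1 * HOUR),
    "Node crash": (2 * WEEK, 3 * MINUTE),
    "Application crash": (1 * MONTH, 3 * MINUTE),
}


def repair_fraction(mttf, mttr):
    """Steady-state fraction of time one component spends being repaired."""
    return mttr / (mttf + mttr)


fractions = {name: repair_fraction(*params) for name, params in FAULTS.items()}
worst = max(fractions, key=fractions.get)
```

Under these assumptions the frequent node crashes and the slow-to-repair switch and SCSI faults contribute fractions of the same order of magnitude.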

Performance

[Figure: Throughput in requests/sec (axis 0 to 7000) for I-PRESS, TCP-PRESS, ReTCP-PRESS and VIA-PRESS; annotated "+21%"]

Unavailability

[Figure: Unavailability by Component. % unavailability (axis 0 to 0.005) for I-PRESS, TCP-PRESS, ReTCP-PRESS and VIA-PRESS, broken down by fault type: application crash, node crash, SCSI timeout, internal switch, internal link; annotated "+58%"]

Performability

[Figure: Performability (axis 0 to 60) for I-PRESS, TCP-PRESS, ReTCP-PRESS and VIA-PRESS]

Performability with More Extensive Fault Model + FME

[Figure: Performability (axis 0 to 140) by PRESS version; recoverable version labels: I-PRESS, TCP-PRESS]

Design Tradeoffs

[Figure: Performability Tradeoffs. Performability (axis 0 to 80) for I-PRESS, TCP-PRESS, ReTCP-PRESS and VIA-PRESS; series: 4-hour operator, normal, RAID]

Discussion

Fault injection uncovered bugs

Modeling allowed quantification and analysis of different design decisions and parameters

Single fault can halt a cooperative service

Problem: cooperation disseminates the effect of faults

Solution: Early detection/exclusion and fault-model enforcement

TCP connection termination not good for fault detection; heartbeats not ideal either

Solution: More extensive infrastructure?

Mismatch between fault model and actual faults

Solution: Extend the PRESS fault model?

Related Work

Our work depends on studies of actual fault types and rates

Large body of work based on stochastic analysis

Our model is much simpler

Easy application vs. more limited domain?

Some similar methodologies and studies of fault-tolerant systems

Concentrated on fault-tolerance of redundant platform

Summary

Proposed a methodology for quantitative evaluation of cluster-based services

Quantify both performance and availability

Fault-injection infrastructure critical

Used Mendosus

Will be available sometime soon

Case study of PRESS

Quantified performability of several versions

Studied performance vs availability tradeoff

Studied effect of operator coverage and RAID

Thank you! Questions?

http://vivo.cs.rutgers.edu

Impact of communication architecture [HPCA03]

Detailed study of TCP and VIA fault models

SW faults: application bugs, memory exhaustion

Fault Model Enforcement (FME) [EASY02]

Techniques for improving availability [Rutgers DCS-TR-517]

Extensive monitoring + FME improve availability 10x

Compiler-directed program-fault coverage [DSN03]

Support for testing of fault-detection and recovery code

Related Work

Empirical measurements of fault rates

Difficult to extrapolate beyond observed behavior

Benchmarking methodologies

Single-node robustness and availability

Difficult to extrapolate to overall availability and performability

Analytical modeling

Stochastic models of availability and performability

Difficult to construct (and solve) models, especially without fault injection

Do not consider penalty for being away from ideal

Availability and performability of cluster-based servers

No work out there. Closest: availability of single-node Apache

Future Work

Study more complex systems

3 tiers: front-end, application, and back-end servers

General class of servers (e.g., Web store)

Possibly more complex dependencies

Validate methodology

Extend our infrastructure/methodology

Data-integrity faults, session-loss faults, etc.

Metrics to capture user satisfaction (e.g., response time)

Future Work

Eliminate limitations of our modeling

Account for concurrent and correlated failures

Improving availability and manageability

Minimize “on-line” operator intervention

Design services for automatic recovery

Validate operator actions when they are necessary

Explore the full benefits of FME

Arbitrary software failures → fail-stop

Recovery procedures are complex, untested (buggy)

Average Availability: Details

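$$AT = \left(1 - \sum_{f \in F} \frac{\sum_{s \in S} D_{fs}}{MTTF_f}\right) T_n + \sum_{f \in F} \frac{\sum_{s \in S} D_{fs}\, T_{fs}}{MTTF_f}$$

$$AA = \frac{AT}{T_n}$$

AT: average throughput; AA: average availability

F: faults; S: stages

D_fs: duration of stage s of fault f

T_fs: throughput during stage s of fault f

T_n: throughput under normal execution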
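
The computation of AT and AA above can be sketched in a few lines of Python. The `Fault` record, the function names, and the numbers in the usage lines are hypothetical and only illustrate the formula; all durations and MTTFs must be expressed in the same time unit.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Fault:
    name: str
    mttf: float                        # MTTF_f
    stages: List[Tuple[float, float]]  # (D_fs, T_fs) for each stage s


def average_throughput(t_normal, faults):
    """AT = (1 - sum_f sum_s D_fs / MTTF_f) * T_n + sum_f sum_s D_fs * T_fs / MTTF_f."""
    degraded_time = 0.0  # fraction of time spent in any fault stage
    degraded_work = 0.0  # throughput contributed while in fault stages
    for fault in faults:
        for duration, throughput in fault.stages:
            degraded_time += duration / fault.mttf
            degraded_work += duration * throughput / fault.mttf
    return (1.0 - degraded_time) * t_normal + degraded_work


def average_availability(t_normal, faults):
    """AA = AT / T_n."""
    return average_throughput(t_normal, faults) / t_normal


# Hypothetical example: one fault type with two stages (time in minutes).
example = [Fault("node crash", mttf=20160.0, stages=[(5.0, 0.0), (3.0, 3000.0)])]
aa = average_availability(5000.0, example)
```

Because faults are assumed independent and non-overlapping, each fault type simply adds its own weighted terms to the two sums.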

Our Study

Evaluate impact of 2 different communication architectures on service performance and availability in presence of faults

TCP vs. VIA

Kernel-level comm. vs. user-level

Mature vs. new technology

Differ in fault-model

Quantify performability (performance + availability)

Study systems under various fault scenarios

Sensitivity to fault rates and fault classes

Case study: High performance cluster-based Web server

Understand relation between high performance and high availability design choices

PRESS Versions Comparison

TCP (general protocol characteristics): assumes very few permanent h/w faults and that transient faults are common; robust to transient faults; OK to lose packets

TCP-PRESS: base version; fault detection: connection based

TCP-PRESS-HB: periodic heartbeats

VIA (general protocol characteristics): assumes faults indicate serious problems; fail-stop model; lost packets are bad

VIA-PRESS-0: base version; fault detection: connection based

VIA-PRESS-3: RDMA for comm.; fault detection: same

VIA-PRESS-5: RDMA and zero-copy (dynamic pinning); fault detection: same

Performance Comparison

VIA-based communication enables higher performance

Low latency, less software overhead

[Figure: Performance Comparison. Throughput in requests/sec (axis 0 to 8000) for TCP, TCP-HB, VIA-0, VIA-3 and VIA-5]

Performability Results

Identical fault load for all versions

Application fault rate 1/month

All versions of VIA do better than TCP

[Figure: Unavailability (axis 0 to 0.004) and performability (axis 0 to 30) for TCP, TCP-HB, VIA-0, VIA-3 and VIA-5. Unavailability is broken down by fault type: internal link, internal switch, node crash, node freeze, os-mem-no-locking, os-sk-buf-no-mem, application crash, application hang, app-nullpointer, app-offbyNpointer, app-offbyNsize]

TCP Vs VIA: Program Robustness

VIA application fault rates 1/day, 1/week, 1/month

Programming complexity

TCP application fault rate 1/month

[Figure: Program Robustness. Performability (axis 0 to 30) for TCP and TCP-HB, and for VIA-0, VIA-3 and VIA-5 repeated in three groups; annotated "Cross over point"]

VIA under Stressful Fault Load

Additional fault load

Transient packet drops 1/month, system failure 1/month

Application faults -> 2/month

TCP-HB performs slightly better than 2 VIA versions

[Figure: Performability (axis 0 to 20) for TCP, TCP-HB, VIA-0, VIA-3 and VIA-5]

Observations – Cluster Communication

Match fault-model of network stack to fabric

Non-fatal behavior on transient faults

TCP is robust to packet drops

Fail-stop behavior on permanent faults

Protocol level fault-avoidance

Preserve message boundaries

Reduce number of copies

Pre-allocate communication resources

Explicit fault reporting by all components in “path”

End-to-End necessary, but may not be sufficient

Reduces detection latency

Allows more accurate recovery actions

Related Work

Impact of faults on systems

Robustness and availability studies

[Lee93, Liu99, Murphy95, Brown00, Asami00]

Protocol performance studies

Congestion avoidance and control

[Jacobson88, Brakmo94, Hoe96]

Back-off based algorithms

Interconnects in cluster environment

SAN context: Packet drops → Serious failures

Evidence of faults [Wilkes92, Seitz94, Boden95]

Fault tolerant interconnects: Myrinet

Summary & Conclusion

Studied impact of communication architecture on service performability

Surprisingly, VIA versions delivered better availability

Comparison under varying fault loads

Evaluated architecture maturity and complexity

Desirable cluster-based protocol characteristics

Messaging, single-copy transfers, pre-allocated resources

Mendosus – Fault Injection

[Figure: Mendosus architecture. Labels: Central Controller; Fast & Reliable SAN; Node A, Node B; Events; Kernel; User-Level; SCSI; Process Ctrl; Daemon; Mlib; Applications (e.g., PRESS); emulation; n/w faults; n/w stack; comLib; glibc; sys_calls; Node/OS]

Communication Architecture

All operations by main thread are non-blocking

Separate send, receive and multiple disk helper threads

Filling up of queues could stall the entire node

Modeling Parameters

5-minute duration for the operator intervention (E) and restart (F) stages

Fault                                      MTTF       MTTR
Link down                                  6 months   3 minutes
Switch down                                1 year     1 hour
Node crash                                 2 weeks    3 minutes
Node freeze                                2 weeks    3 minutes
Application faults:
Process crash                              variable   3 minutes
Process hang                               variable   3 minutes
Bad parameters: off-by-N data pointer      variable   3 minutes
Bad parameters: off-by-N size              variable   3 minutes
Bad parameters: null pointer               variable   3 minutes

Sources: [Chillarege95, Sullivan91, Iyer99, Talagala99, Heath02, Trivedi00]

Pessimistic Fault Load for VIA

Faults due to immature technology

Transient packet drops 1/month, system failure 1/month

Program robustness

Application faults -> 2/month

[Figure: Unavailability by Component. Unavailability (axis 0 to 0.006) for TCP, TCP-HB, VIA-0, VIA-3 and VIA-5, broken down by fault type: bleeding edge complexity, transient n/w errors, app-offbyNsize, app-offbyNpointer, app-nullpointer, application hang, application crash, os-sk-buf-no-mem, os-mem-no-locking, node freeze, node crash, internal switch, internal link]

Results - Performability

Varying application fault rates: 1/day, 1/month

VIA versions do better due to higher performance

[Figure: Performability (axis 0 to 30) for TCP, TCP-HB, VIA-0, VIA-3 and VIA-5]

TCP Vs VIA: Transient Packet Drops

VIA packet drop rates 1/day, 1/week, 1/month

TCP is modeled as no additional losses

[Figure: Transient Packet Drops. Performability (axis 0 to 25) for TCP and TCP-HB, and for VIA-0, VIA-3 and VIA-5 repeated in three groups]

TCP Vs VIA: Immature Technology

VIA complexity failures 1/day, 1/week, 1/month

Modeled as total interconnect failure

TCP is modeled as no additional losses

[Figure: Bleeding Edge Complexity. Performability (axis 0 to 25) for TCP and TCP-HB, and for VIA-0, VIA-3 and VIA-5 repeated in three groups]

Scaling Results

Model can be used to scale results

Extrapolation to 8 nodes leads to same results as measurements, for constant memory

Related Work *

Approaches to Fault Tolerance & HA

Improving Component Robustness

E.g., ECC in disks & memory, RAID

End-to-End Approaches

CRC, TCP and RPC like protocols – RETRY approach

Replication and Failover

Redundancy: TMR, Tandem, Stratus, primary-secondary

N-version programming

Reactive and Proactive Techniques

Recovery oriented computing [ROC01]

Software rejuvenation, Recursive restartability

Improving Availability

High Availability Techniques

Quantifying The Improvement

High Availability Techniques

Front-end and extra capacity (FE-X)

Widely used

Implementation: Linux Virtual Server + extra nodes

Robust Group Membership (MEM)

ReCOOP could not handle switch, link faults

Implementation: Based on TRM [Cristian95]

Service Monitoring (QMON)

Handle “lack of progress” scenarios

Implementation: Queue monitoring

Fault Model Enforcement (FME)

Fault Model Enforcement (FME)

Main idea: map unexpected run-time faults to expected ones in the fault model

E.g., if only node crashes are handled: on a fatal disk fault → crash the node

Designer can focus on smaller fault model

Enforces uniform view of system

In PRESS:

Application hang → process crash (then restart)

Disk failures → node crash (or take off-line)
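
The idea of mapping unexpected faults onto the handled fault model can be sketched as a small lookup. The fault names, the `HANDLED` set, and the `enforce` function are hypothetical names chosen for this illustration; they are not taken from PRESS or Mendosus.

```python
# Faults the service already knows how to recover from (assumed for this sketch).
HANDLED = {"node_crash", "process_crash"}

# Enforcement rules from the slide: each unexpected fault is converted
# into a fault that the service's fault model covers.
ENFORCEMENT = {
    "application_hang": "process_crash",  # kill the hung process, then restart it
    "disk_failure": "node_crash",         # crash the node or take it off-line
}


def enforce(observed_fault):
    """Return the handled fault that the observed fault should be turned into."""
    if observed_fault in HANDLED:
        return observed_fault
    if observed_fault in ENFORCEMENT:
        return ENFORCEMENT[observed_fault]
    raise ValueError(f"no enforcement rule for fault: {observed_fault}")


action = enforce("application_hang")
```

Keeping the rules in one table makes it explicit which faults fall outside the model, since any unmapped fault raises an error instead of being silently ignored.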

Quantifying Availability

Apply techniques to COOP

COOP: “fault-tolerance-free” version

Quantitative analysis after each enhancement

Same fault-load, environment parameters

Model extended for front-end failure

Quantitative Results

88% improvement in availability of FME over COOP

[Figure: High Availability Techniques. Unavailability (axis 0 to 0.0064) for COOP, FE-X, MEM, Q-MON, MQ and FME, broken down by fault type: frontend failure, application hang, application crash, node freeze, node crash, SCSI timeout, internal switch, internal link]

Parameterizable Modeling

Flexible model allows “What if…?” scenarios

Variable fault rates, additional components, operator times

With extended analysis COOP PRESS 0.9997 availability

[Figure: Modeling Other Approaches. Unavailability (axis 0 to 0.0007) for FME, C-MON, X-SW and RAID]

Background – Internet Services

Popular services handle large volumes

Google handles ~30 million requests/day (Computeruser.com article, Jun 2000)

[Diagram labels: Internet growth, Explosive, Commercialization, WWW, On-line services]

Background – Internet Services

Internet growth has been explosive since the 90s

Onset of WWW, browsers, search engines

Commercialization of the Internet: on-line services

Internet Services - Popularity

Servers offer a variety of services

Email, news, search, shopping

Popular ones service large volumes

Google handles ~30 million requests/day (Computeruser.com article, Jun 2000)

Internet Services - Infrastructure

Cluster-based solutions are popular [Brewer01]

Incrementally scalable, cost-effective

Scalability, Performance have been addressed

Availability evaluation: less attention

Approach

Guide designer through evaluation and improvement of availability

Observe system under failures

Measure service-level availability

Quantify “expected” system behavior

Analytical modeling to predict behavior under various fault loads, “what if” scenarios

Improve upon problem areas

Apply well-defined techniques

Front-end and Extra Capacity

Widely used

Masks service failures from client

Fail-over using Linux Virtual Server (LVS)

Front-end distributor

Round Robin, IP Tunneling

Monitor backend nodes using “MON”

Request forwarded to “live” node set

Extra Capacity

Additional “live” nodes to soak up load

Increases number of prospective fault sites!
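
The front-end behavior described here (round-robin dispatch restricted to the "live" node set) can be sketched as below. This is a toy model only: the class `RoundRobinFrontEnd` and its methods are hypothetical and do not use or reproduce the Linux Virtual Server or MON interfaces.

```python
class RoundRobinFrontEnd:
    """Toy front-end that forwards each request to the next live back-end node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.live = set(self.nodes)
        self.next_index = 0

    def mark_down(self, node):
        # Called when monitoring reports that a back-end node has failed.
        self.live.discard(node)

    def mark_up(self, node):
        # Called when a repaired node rejoins the live set.
        if node in self.nodes:
            self.live.add(node)

    def pick(self):
        """Return the next live node in round-robin order."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.nodes)
            if node in self.live:
                return node
        raise RuntimeError("no live back-end nodes")


front_end = RoundRobinFrontEnd(["a", "b", "c"])
first_round = [front_end.pick() for _ in range(4)]
front_end.mark_down("b")
after_failure = [front_end.pick() for _ in range(3)]
```

The front-end itself remains a single component that can fail, which is why the model is later extended with a front-end failure.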


Robust Group Membership (MEM)

Should handle realistic fault loads

Node, link, switch faults and network partition

Heartbeats in ReCOOP were insufficient

Up-to-date list of active nodes

Allow dynamic reconfiguration of list

Enables effective resource sharing

Detection of failures

Fault model: reachability

UNREACHABLE -> DOWN


Membership Implementation

Independent service

Reports membership at advertised memory segment

Based on Three Round Membership [Cristian95]

Additions and removals follow 2-phase commit

No single point of failure

Coordinator for adds and leaves chosen dynamically

Heartbeat protocol for failure detection

Nodes arranged in logical ring, monitor neighbors

Loss of 3 consecutive heartbeats initiates removal
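The ring-based heartbeat detection above can be sketched as follows. This is a minimal illustration under stated assumptions: each node is assumed to watch its ring successor, `RingMonitor` and its methods are hypothetical names, and the 2-phase commit that the slide says governs removals is omitted (the removal is applied directly).

```python
MISS_LIMIT = 3  # from the slide: loss of 3 consecutive heartbeats initiates removal


class RingMonitor:
    """Hypothetical tracker of missed heartbeats in a logical ring."""

    def __init__(self, members):
        self.members = list(members)                # ring order
        self.missed = {m: 0 for m in self.members}  # consecutive misses per node

    def successor(self, node):
        # Assumption: each node monitors the next node in ring order.
        i = self.members.index(node)
        return self.members[(i + 1) % len(self.members)]

    def heartbeat_result(self, watcher, received):
        """Record one heartbeat period; return the removed node, or None."""
        target = self.successor(watcher)
        if received:
            self.missed[target] = 0                 # any heartbeat resets the count
            return None
        self.missed[target] += 1
        if self.missed[target] >= MISS_LIMIT:
            # A real implementation would start the agreement protocol here.
            self.members.remove(target)
            del self.missed[target]
            return target
        return None
```

After a removal, the watcher's successor becomes the next surviving node, which keeps the ring closed.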


Queue Monitoring (QMON)

Service-level monitoring

Application hangs, problems with disk

Queues as basic building blocks [Ninja, SEDA]

Run-time Q-properties indicate progress of associated components

E.g., buildup at communication send queue -> transient/permanent failures at other end

Need to avoid false alarms!


Self-Monitoring Queue

In PRESS: N per-node communication queues

Monitor progress at cooperating nodes

Failure triggers

Queue length or a threshold number of unanswered requests

Node removed from “cooperation set”
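The two failure triggers above (queue length and a threshold of unanswered requests) can be sketched as follows. This is not PRESS code: `SelfMonitoringQueue`, `prune_cooperation_set`, and the default thresholds are hypothetical values chosen only for illustration.

```python
from collections import deque


class SelfMonitoringQueue:
    """Hypothetical send queue toward one peer that also judges the peer's progress."""

    def __init__(self, max_len=4, max_unanswered=3):
        self.pending = deque()      # messages waiting to be sent to the peer
        self.unanswered = 0         # requests sent but not yet answered
        self.max_len = max_len                  # assumed threshold
        self.max_unanswered = max_unanswered    # assumed threshold

    def enqueue(self, msg):
        self.pending.append(msg)

    def sent(self):
        """The oldest message left the queue; it now awaits a reply."""
        if self.pending:
            self.pending.popleft()
            self.unanswered += 1

    def replied(self):
        if self.unanswered > 0:
            self.unanswered -= 1

    def peer_suspected(self):
        # Either trigger from the slide marks the peer as failed.
        return (len(self.pending) > self.max_len
                or self.unanswered >= self.max_unanswered)


def prune_cooperation_set(queues):
    """Return the peers whose queues show no failure trigger."""
    return {peer for peer, q in queues.items() if not q.peer_suspected()}
```

Choosing the thresholds is the hard part: values that are too low raise the false alarms the previous slide warns about, while values that are too high delay detection.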


Interaction – MEM and QMON

Both Membership and QMON monitor node-level activity

Can result in inconsistencies, e.g., application hang


Fault Model Enforcement (FME)

Enforce a reduced fault model at runtime

Allow service to perform correct recovery action to regain full functionality

How to enforce a reduced fault model?

Two ideas so far

Map an unexpected fault to an expected fault

E.g., crash a node if the network link connecting it to the switch fails

Fail outer component if sub-component fails

E.g., crash a node if the disk fails

How is it different from fail-stop?

Allows reasoning about failures at a desired abstraction


FME - Future Directions

How extensive should the fault model be?

Determines programming complexity/effort

How to prevent FME from reducing availability?

Bugs within enforcement?

When to declare a symptom a fault?

FME reduces human intervention

Are humans better at deciding?

8-23% of recovery procedures are botched [Brown 2001]


FME Approach

Define a reduced abstract fault model

Components, faults, symptoms, component behavior during faults

Enforce this fault model at run-time

If an “unexpected” fault occurs, map to one that was planned for in the abstract model

“If the facts don’t fit the theory, change the facts.” - Albert Einstein

Allow designer to concentrate on tolerating a well-defined, yet limited in complexity, set of faults


PRESS with FME

Recovery upon fault model mismatch

Restart 0, 1 or all nodes?

FME approach: reboot the appropriate node after a fault and its recovery have occurred

Link down – reboot unreachable node

Switch down – reboot all nodes

Disk failure – reboot node with faulty disk

Node, application crash – do nothing
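The fault-to-action mapping listed above can be written as a small lookup. This is a sketch only: the function name `nodes_to_reboot` and the fault labels are hypothetical, and only the mapping itself comes from the slide.

```python
def nodes_to_reboot(fault, affected_node, all_nodes):
    """Return the list of nodes to reboot once the fault and its recovery have occurred.

    Fault labels are illustrative strings, not identifiers from PRESS or FME.
    """
    if fault == "link_down":
        return [affected_node]      # reboot the unreachable node
    if fault == "switch_down":
        return list(all_nodes)      # reboot all nodes
    if fault == "disk_failure":
        return [affected_node]      # reboot the node with the faulty disk
    if fault in ("node_crash", "application_crash"):
        return []                   # already within the expected fault model
    raise ValueError(f"fault outside the abstract model: {fault}")
```

Node and application crashes need no action because they are the faults the service was designed to tolerate; every other fault is converted into one of them.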


FME Implementation

FME daemon on each node

Monitors service progress using exported interface

HTTP requests for PRESS

Monitors disk using SCSI Generic Interface

Low-level operations and error probing

Application hang -> Process crash

Upon consecutive failures to service requests

Process restarted

Disk failures -> Node failures

Detection: Service no-progress + Disk Error

Node taken offline for maintenance
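The daemon's decision rule above can be sketched as a pure function. This is an assumption-laden illustration: `fme_action`, its return labels, and the `failure_threshold` default are hypothetical, since the slide does not say how many consecutive failures count as no progress.

```python
def fme_action(consecutive_failures, disk_error, failure_threshold=3):
    """Decide what a per-node FME daemon should do after one probing round.

    consecutive_failures: probe requests in a row that the service failed to answer
    disk_error: whether low-level disk probing reported an error
    failure_threshold: assumed value; not given in the source
    """
    no_progress = consecutive_failures >= failure_threshold
    if no_progress and disk_error:
        # Disk failure is mapped to a node failure: take the node offline.
        return "take_node_offline"
    if no_progress:
        # Application hang is mapped to a process crash: restart the process.
        return "restart_process"
    return "none"
```

A disk error reported while the service is still making progress results in no action in this sketch, because the slide's detection rule requires both symptoms.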


Modeling Results - Unavailability

Unavailability of INDEP ~ 1/10 of COOP

Heartbeat helps, but availability is lower than INDEP

[Figure: Unavailability by Component. Bar chart; x-axis: PRESS versions (INDEP, COOP, ReCOOP); y-axis: unavailability (0 to 0.006); legend: application hang, application crash, node freeze, node crash, scsi timeout, internal switch, internal link.]


Results - Performability

COOP has lower performability than INDEP

ReCOOP offers a glimpse of the possibilities

[Figure: Performability. Bar chart; x-axis: PRESS versions (INDEP, COOP, ReCOOP); y-axis: performability (0 to 45).]


Future Work

Explore applicability to more structured systems

Multi-tiered: front-end, application and back-end

E.g., Web store

Extend our infrastructure/methodology

Metrics to capture users’ satisfaction

Fault model for data-integrity faults

Look at more high-availability components

Collection for a designer to pick from

Extend and validate FME for complex servers


Future Work

Consider the operator in the loop

Gather information about extent of participation: operations, durations, expertise

Extend fault model for human-induced faults


Quantitative Results

88% improvement in availability of FME over COOP

[Figure: Unavailability. Bar chart; x-axis: PRESS versions (COOP, FE-X, MEM, Q-MON, MQ, FME); y-axis: unavailability (0 to 0.0064); legend: frontend failure, application hang, application crash, node freeze, node crash, scsi timeout, internal switch, internal link.]