Tempest: An Architecture for Scalable Time-Critical Services Mahesh Balakrishnan Amar Phanishayee...

Post on 19-Jan-2018

Tempest: An Architecture for Scalable Time-Critical Services

Mahesh Balakrishnan, Amar Phanishayee, Tudor Marian, Professor Ken Birman

Clusters of commodity computers used in mission-critical settings (commercial and military)

• Advantages: cost-effectiveness, incremental scalability and high availability

• Issues: failures, arbitrary load and network losses affect real-time guarantees

Tempest: Goal

• Provide programmers with replicated data storage primitives

• Very fast average performance and good worst-case timing guarantees

• Easy deployment, monitoring and management of time-critical scalable services in a clustered environment

Tempest: Approach

• Clone services for scalability and fault tolerance

• Automate replica placement (service colocation)

• Fine-grained data caching

• Response time monitoring to detect service slowdown

• Redundant querying for faster response

• UI to drag and drop services onto a cluster
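The redundant-querying idea in the approach list can be illustrated with a minimal Python sketch. This is not Tempest code: the replica names, the simulated delays and the helpers `query_replica` and `redundant_query` are all hypothetical, and a sleep stands in for a real remote call. The sketch only shows the general pattern of sending one query to several clones of a service and using whichever reply arrives first.

```python
import concurrent.futures
import time


def query_replica(replica_id, delay_s):
    """Hypothetical stand-in for a remote call: sleeps to simulate service time."""
    time.sleep(delay_s)
    return replica_id, f"result-from-{replica_id}"


def redundant_query(replica_delays):
    """Send the same query to every replica and return the first reply to arrive."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replica_delays))
    futures = [
        pool.submit(query_replica, replica_id, delay_s)
        for replica_id, delay_s in replica_delays.items()
    ]
    try:
        # as_completed yields futures in completion order, so the first one
        # yielded belongs to the fastest replica.
        first_done = next(concurrent.futures.as_completed(futures))
        return first_done.result()
    finally:
        # Do not block on the slower replicas; their replies are simply ignored.
        pool.shutdown(wait=False)


# Assumed delays in seconds; replica-b plays the role of the healthy, fast clone.
replicas = {"replica-a": 0.30, "replica-b": 0.02, "replica-c": 0.15}
winner, result = redundant_query(replicas)
print(winner, result)
```

The response time seen by the caller is that of the fastest replica, so a single slow or overloaded clone does not delay the answer. The cost is extra load on the cluster, since every query is executed more than once.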

Accomplishments

• Ricochet: Low-Latency Multicast for Scalable Time-Critical Services. Submitted to NSDI 2006 (Oct 2005)

• Scalable Services Architecture (SSA). Submitted to ICDCS (Nov 2005)

Ricochet vs SRM

[Figure: SRM Recovery. Average Recovery Delay and Average Discovery Delay in microseconds (axis 0.0E+00 to 1.4E+07) versus number of groups (1 to 128).]

[Figure: Ricochet Recovery. Average Recovery Delay in microseconds (axis 0 to 50000) versus number of groups (2 to 1024).]

• SRM’s discovery delay is the lower bound on recovery

• SRM’s recovery delay scales poorly with # of Groups (delay in seconds!)

• Ricochet scales well with # of Groups (~14 ms in 1 group to 24 ms in 1024 groups)

SRM at 64 Groups: 9 seconds

Ricochet at 64 Groups: 16 ms!

Ricochet vs SRM in 64 groups

[Figure: Histogram of SRM Recoveries (64 Groups). Percentage of recoveries (axis 0.00 to 14.00) versus recovery delay in microseconds (0.0E+00 to 3.7E+07).]

[Figure: Histogram of Ricochet Recoveries (64 Groups). Percentage of recoveries (axis 0.00 to 30.00) versus recovery delay in microseconds (2.8E+03 to 2.7E+05).]

SRM Recovery centered around 9 seconds… Ricochet around 15 milliseconds.

1-2 orders of magnitude! Improvement increases with number of groups
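As a quick check of the size of the gap at 64 groups, the two central values quoted above can be compared directly. The short Python sketch below takes "around 9 seconds" and "around 15 milliseconds" at face value. These are approximate readings from the histograms, not exact measurements, and the variable names are illustrative.

```python
import math

# Approximate central recovery delays quoted on the slide for 64 groups.
srm_recovery_s = 9.0          # "around 9 seconds"
ricochet_recovery_s = 0.015   # "around 15 milliseconds"

# Ratio of the two delays, and the same ratio expressed in orders of magnitude.
ratio = srm_recovery_s / ricochet_recovery_s
orders = math.log10(ratio)
print(f"ratio = {ratio:.0f}x, log10(ratio) = {orders:.2f}")
```

With these two numbers the ratio comes out at roughly 600x, or a little under three orders of magnitude, for the 64-group case.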

Inconsistency Windows

[Figure: Histogram of Ricochet Recoveries (64 Groups). Percentage of recoveries (axis 0.00 to 70.00) versus recovery delay in microseconds (1.3E+03 to 6.3E+04).]

Ricochet Replication: Updates are reflected at all replicas within…

• 65% within 1.25 ms

• 90% within 18 ms

• 99% within 77 ms

• 100% within 125 ms
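A summary of this kind is a cumulative distribution read off at a few thresholds. The sketch below shows one way such figures could be computed from a list of measured update delays. The helper `fraction_within` and the sample delays are hypothetical and chosen only for illustration: they are not the Ricochet measurements and do not reproduce the percentages above. Only the four thresholds are taken from the slide.

```python
def fraction_within(delays_ms, threshold_ms):
    """Fraction of updates whose delay is at most threshold_ms."""
    return sum(1 for d in delays_ms if d <= threshold_ms) / len(delays_ms)


# Hypothetical per-update delays in milliseconds (illustrative only).
samples_ms = [0.4, 0.9, 1.1, 1.2, 3.0, 7.5, 12.0, 17.0, 40.0, 120.0]

# Thresholds taken from the slide: 1.25 ms, 18 ms, 77 ms and 125 ms.
thresholds_ms = (1.25, 18, 77, 125)
summary = {t: fraction_within(samples_ms, t) for t in thresholds_ms}

for t, frac in summary.items():
    print(f"{frac:.0%} of updates within {t} ms")
```

The last threshold at which the fraction reaches 100% bounds the inconsistency window: after that delay, every replica in the sample has seen the update.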