Tempest: An Architecture for Scalable Time-Critical Services
Mahesh Balakrishnan, Amar Phanishayee,
Tudor Marian, Professor Ken Birman
Clusters of commodity computers are used in mission-critical settings (commercial and military).
• Advantages: cost-effectiveness, incremental scalability and high availability
• Issues: failures, arbitrary load and network losses affect real-time guarantees
Tempest: Goal
• Provide programmers with replicated data storage primitives
• Very fast average performance and good worst-case timing guarantees
• Easy deployment, monitoring and management of time-critical scalable services in a clustered environment
Tempest: Approach
• Clone services for scalability and fault tolerance
• Automate replica placement (service colocation)
• Fine-grained data caching
• Response time monitoring to detect service slowdown
• Redundant querying for faster response
• UI to drag and drop services onto a cluster
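The redundant-querying idea above can be illustrated with a minimal Python sketch. It is not Tempest's implementation: `redundant_query` and `make_replica` are hypothetical names, and each replica is modeled as a plain callable with an artificial delay. The sketch sends the same request to every replica and returns whichever successful reply arrives first.

```python
import concurrent.futures as cf
import time


def redundant_query(replicas, request, timeout=1.0):
    """Send `request` to every replica; return the first successful reply.

    `replicas` is a list of callables (an assumption for this sketch).
    Replicas that raise are skipped; if none succeeds, RuntimeError is raised.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica, request) for replica in replicas]
    try:
        # as_completed yields futures in the order they finish.
        for fut in cf.as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                return fut.result()
        raise RuntimeError("all replicas failed")
    finally:
        # Do not block on the slower replicas still running.
        pool.shutdown(wait=False)


def make_replica(name, delay):
    """Build a hypothetical replica that answers after `delay` seconds."""
    def handle(request):
        time.sleep(delay)
        return f"{name}:{request}"
    return handle


replicas = [make_replica("slow", 0.3), make_replica("fast", 0.01)]
answer = redundant_query(replicas, "get x")
```

The caller's latency here is bounded by the fastest healthy replica rather than by any single slow one, at the cost of extra load on the cluster.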
Accomplishments
• Ricochet: Low-Latency Multicast for Scalable Time-Critical Services. Submitted to NSDI 2006 (Oct 2005)
• Scalable Services Architecture (SSA). Submitted to ICDCS (Nov 2005)
Ricochet vs SRM
[Figure: SRM Recovery. Average recovery delay and average discovery delay in microseconds (y-axis, 0 to 1.4E+07) versus number of groups (x-axis, 1 to 128).]
[Figure: Ricochet Recovery. Average recovery delay in microseconds (y-axis, 0 to 50,000) versus number of groups (x-axis, 2 to 1024).]
• SRM’s discovery delay is the lower bound on recovery
• SRM’s recovery delay scales poorly with # of Groups (delay in seconds!)
• Ricochet scales in # of Groups (~14ms in 1 group to 24 ms in 1024 groups)
• SRM at 64 groups: 9 seconds
• Ricochet at 64 groups: 16 ms!
Ricochet vs SRM in 64 groups
[Figure: Histogram of SRM Recoveries (64 Groups). Percentage of recoveries (y-axis, 0 to 14) versus recovery delay in microseconds (x-axis, 0.0E+00 to 3.7E+07).]
[Figure: Histogram of Ricochet Recoveries (64 Groups). Percentage of recoveries (y-axis, 0 to 30) versus recovery delay in microseconds (x-axis, 2.8E+03 to 2.7E+05).]
SRM Recovery centered around 9 seconds… Ricochet around 15 milliseconds.
1-2 orders of magnitude! Improvement increases with the number of groups.
Inconsistency Windows
[Figure: Histogram of Ricochet Recoveries (64 Groups). Percentage (y-axis, 0 to 70) versus microseconds (x-axis, 1.3E+03 to 6.3E+04).]
Ricochet Replication: updates are reflected at all replicas within…
• 65% within 1.25 ms
• 90% within 18 ms
• 99% within 77 ms
• 100% within 125 ms
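Figures of the form "X% within Y ms" are cumulative fractions of measured delays. A minimal sketch of that calculation follows, assuming a list of per-update delays in milliseconds. The function name `fraction_within` is hypothetical and the sample values are made up for illustration; they are not the measurements behind the slide.

```python
def fraction_within(delays_ms, thresholds_ms):
    """Return, for each threshold, the fraction of delays at or below it."""
    n = len(delays_ms)
    return {t: sum(d <= t for d in delays_ms) / n for t in thresholds_ms}


# Made-up sample delays in milliseconds (illustration only).
samples = [0.5, 0.9, 1.2, 15.0, 60.0]
windows = fraction_within(samples, [1.25, 18, 77, 125])
```

With these made-up samples, three of five delays fall within 1.25 ms and four of five within 18 ms, so the result maps 1.25 to 0.6 and 18 to 0.8.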