Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an...
Transcript of Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an...
||Consensus in a Box
Consensus in a BoxInexpensive Coordination in Hardware
Zsolt István, David Sidler, Gustavo Alonso, Marko Vukolic*
Systems Group, Department of Computer Science, ETH Zurich *IBM Research, Zurich 3/18/2016 1
||Consensus in a Box
Consensus is an essential function in datacenters
How can consensus be made inexpensive?
3/18/2016 2
Motivation: Cost of consensus
1000
10000
100000
1000000
10000000
1 10 100 1000
Th
rou
gh
pu
t (c
on
sen
su
s
rou
nd
s/s
)
Consensus latency (us)
General
purpose
systems
This talk:
3μs & 4MRPS
||Consensus in a Box
3/18/2016
3
Related work
Speeding up consensus is an important problem
Related work in networking, systems, HPC, etc.
Specialized hardware can remove traditional limitations
[1] Zhang et al. Smartswitch: Blurring the line between network infrastructure
& cloud applications. In HotCloud’14
[2] Mai et al. NetAgg: Using Middleboxes for Application-Specific On-path
Aggregation in Data Centres. In CoNEXT’14
[3] Dang et al. NetPaxos: Consensus at Network Speed. In SOSR’15
[4] Poke et al. DARE: High-Performance State Machine Replication on
RDMA Networks. In HPDC’15.
||Consensus in a Box
Why? Consensus is expensive, but desired
What? Atomic broadcast – Zookeeper’s ZAB protocol
How? Specialized processor on FPGA Tight integration with 10Gbps networking + deep pipelining
Evaluation? Drop-in replacement for Memcachedwith Zookeeper’s replication
3/18/2016 4
Consensus in a Box
||Consensus in a Box 3/18/2016 5
Zookeeper’s Atomic Broadcast
Ordered, reliable channels
Write
Propose
Propose
Ack.
Ack.
Commit
Commit
Follower
Leader
Follower
||Consensus in a Box
ZookeeperTCP
Atomic
BroadcastReplicated
Data Store
Writes
ReadsClients
3/18/2016 6
Zookeeper from 10000ft
Other Zookeeper
instances
||Consensus in a Box
Inherent parallelism
Tight integration
Very fast on-chip memory
3/18/2016 7
Specialized processor architecture
DDR Memory Controller
Atomic Broadcast (State
Machine)
10Gbps TCP/IP Networking
FPGA
Main Memory Key-value Store
DDR3
||Consensus in a Box
Networking optimizations Low-latency on-chip buffers for RX path
Datacenter and application-specific knowledge
Predictable behavior Fast local caches for common case behavior
Pipelined execution
3/18/2016 8
What makes it go fast…
Networking ConsensusKey-value
store
Latency
Throughput
||Consensus in a Box 3/18/2016 9
Deployment and Evaluation
||Consensus in a Box 3/18/2016 10
Hardware platform
Xilinx VC709 Evaluation Board
SFP+
SFP+
SFP+
SFP+
DRAM (8GB)
FPGA
Networking Atomic
Broadcast
Replicated
key-value
store
Reads
Writes
SW Clients /
Other nodes
Other nodes
Other nodes
TCP
Direct
Direct
||Consensus in a Box 3/18/2016 11
Evaluation setup
X 12
10Gbps Switch
3 FPGA cluster
Clients
• Drop-in replacement for Memcached with Zookeeper’s replication
• Standard tools for benchmarking (libmemcached)• Simulating 100s of clients
Comm. over TCP/IP
Comm. over direct
connections + Leader election
+ Recovery
||Consensus in a Box 3/18/2016 12
Latency of KVS writes (consensus)
Consensus
15-35μs ~10μs
Memaslap
(ixgbe)
TCP / 10Gbps Ethernet
~3μs
Direct connections
||Consensus in a Box
1000
10000
100000
1000000
10000000
1 10 100 1000
Th
rou
gp
ut (c
on
sen
su
s
rou
nd
s/s
)
Consensus latency (us)
FPGA (Direct)
FPGA (TCP)
DARE* (Infiniband)
Libpaxos (TCP)
Etcd (TCP)
Zookeeper (TCP)
Specialized
solutions
3/18/2016 13
The benefit of specialization…
General
purpose
solutions
[1] Dragojevic et al. FaRM: Fast Remote Memory. In NSDI’14.[2] Poke et al. DARE: High-Performance State Machine Replication on RDMA Networks. In HPDC’15.
*=We extrapolated from the 5 node setup for a 3 node setup and removed estimated client overhead.
10-100x
||Consensus in a Box
Software*
3/18/2016 14
Predictable hardware performance
1000
10000
100000
1000000
10000000
1 10 100 1000
Th
rou
gp
ut (c
on
sen
su
s
rou
nd
s/s
)
Consensus latency (us)
FPGA (Direct) + 99th
FPGA (TCP) + 99th
||Consensus in a Box 3/18/2016 15
Throughput overviewM
illio
n re
plic
ate
dK
VS
wri
tes/s
||Consensus in a Box
Consensus is expensive but often necessary Solution: specialization and tight integration with networking
We built high-throughput low-latency consensus in hardware
Specialized hardware opens
up new opportunities for
smarter networks
3/18/2016 16
Conclusion
{zsolt.istvan},{david.sidler}@inf.ethz.ch