Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an...

16
Consensus in a Box Inexpensive Coordination in Hardware Zsolt István, David Sidler, Gustavo Alonso, Marko Vukolic * Systems Group, Department of Computer Science, ETH Zurich * IBM Research, Zurich 3/18/2016 1

Transcript of Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an...

Page 1: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box

Consensus in a BoxInexpensive Coordination in Hardware

Zsolt István, David Sidler, Gustavo Alonso, Marko Vukolic*

Systems Group, Department of Computer Science, ETH Zurich *IBM Research, Zurich 3/18/2016 1

Page 2: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box

Consensus is an essential function in datacenters

How can consensus be made inexpensive?

3/18/2016 2

Motivation: Cost of consensus

1000

10000

100000

1000000

10000000

1 10 100 1000

Th

rou

gh

pu

t (c

on

sen

su

s

rou

nd

s/s

)

Consensus latency (us)

General

purpose

systems

This talk:

3μs & 4MRPS

Page 3: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box

3/18/2016

3

Related work

Speeding up consensus is an important problem

Related work in networking, systems, HPC, etc.

Specialized hardware can remove traditional limitations

[1] Zhang et al. Smartswitch: Blurring the line between network infrastructure

& cloud applications. In HotCloud’14

[2] Mai et al. NetAgg: Using Middleboxes for Application-Specific On-path

Aggregation in Data Centres. In CoNEXT’14

[3] Dang et al. NetPaxos: Consensus at Network Speed. In SOSR’15

[4] Poke et al. DARE: High-Performance State Machine Replication on

RDMA Networks. In HPDC’15.

Page 4: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box

Why? Consensus is expensive, but desired

What? Atomic broadcast – Zookeeper’s ZAB protocol

How? Specialized processor on FPGA Tight integration with 10Gbps networking + deep pipelining

Evaluation? Drop-in replacement for Memcachedwith Zookeeper’s replication

3/18/2016 4

Consensus in a Box

Page 5: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box 3/18/2016 5

Zookeeper’s Atomic Broadcast

Ordered, reliable channels

Write

Propose

Propose

Ack.

Ack.

Commit

Commit

Follower

Leader

Follower

Page 6: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box

ZookeeperTCP

Atomic

BroadcastReplicated

Data Store

Writes

ReadsClients

3/18/2016 6

Zookeeper from 10000ft

Other Zookeeper

instances

Page 7: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box

Inherent parallelism

Tight integration

Very fast on-chip memory

3/18/2016 7

Specialized processor architecture

DDR Memory Controller

Atomic Broadcast (State

Machine)

10Gbps TCP/IP Networking

FPGA

Main Memory Key-value Store

DDR3

Page 8: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box

Networking optimizations Low-latency on-chip buffers for RX path

Datacenter and application-specific knowledge

Predictable behavior Fast local caches for common case behavior

Pipelined execution

3/18/2016 8

What makes it go fast…

Networking ConsensusKey-value

store

Latency

Throughput

Page 9: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box 3/18/2016 9

Deployment and Evaluation

Page 10: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box 3/18/2016 10

Hardware platform

Xilinx VC709 Evaluation Board

SFP+

SFP+

SFP+

SFP+

DRAM (8GB)

FPGA

Networking Atomic

Broadcast

Replicated

key-value

store

Reads

Writes

SW Clients /

Other nodes

Other nodes

Other nodes

TCP

Direct

Direct

Page 11: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box 3/18/2016 11

Evaluation setup

X 12

10Gbps Switch

3 FPGA cluster

Clients

• Drop-in replacement for Memcached with Zookeeper’s replication

• Standard tools for benchmarking (libmemcached)• Simulating 100s of clients

Comm. over TCP/IP

Comm. over direct

connections + Leader election

+ Recovery

Page 12: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box 3/18/2016 12

Latency of KVS writes (consensus)

Consensus

15-35μs ~10μs

Memaslap

(ixgbe)

TCP / 10Gbps Ethernet

~3μs

Direct connections

Page 13: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box

1000

10000

100000

1000000

10000000

1 10 100 1000

Th

rou

gp

ut (c

on

sen

su

s

rou

nd

s/s

)

Consensus latency (us)

FPGA (Direct)

FPGA (TCP)

DARE* (Infiniband)

Libpaxos (TCP)

Etcd (TCP)

Zookeeper (TCP)

Specialized

solutions

3/18/2016 13

The benefit of specialization…

General

purpose

solutions

[1] Dragojevic et al. FaRM: Fast Remote Memory. In NSDI’14.[2] Poke et al. DARE: High-Performance State Machine Replication on RDMA Networks. In HPDC’15.

*=We extrapolated from the 5 node setup for a 3 node setup and removed estimated client overhead.

10-100x

Page 14: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box

Software*

3/18/2016 14

Predictable hardware performance

1000

10000

100000

1000000

10000000

1 10 100 1000

Th

rou

gp

ut (c

on

sen

su

s

rou

nd

s/s

)

Consensus latency (us)

FPGA (Direct) + 99th

FPGA (TCP) + 99th

Page 15: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box 3/18/2016 15

Throughput overviewM

illio

n re

plic

ate

dK

VS

wri

tes/s

Page 16: Consensus in a BoxConsensus in a Box | | 3/18/2016 3 Related work Speeding up consensus is an important problem Related work in networking, systems, HPC, etc. Specialized hardware

||Consensus in a Box

Consensus is expensive but often necessary Solution: specialization and tight integration with networking

We built high-throughput low-latency consensus in hardware

Specialized hardware opens

up new opportunities for

smarter networks

3/18/2016 16

Conclusion

{zsolt.istvan},{david.sidler}@inf.ethz.ch