
Page 1: The Network is The Computer: Running Distributed Services on …conferences.sigcomm.org/sigcomm/2018/files/slides/net... · 2018-09-02 · Running Distributed Services on Programmable

The Network is The Computer: Running Distributed Services on Programmable Switches

Robert Soulé, Università della Svizzera italiana and Barefoot Networks


Conventional Wisdom

The network is “just plumbing”

Teach systems grad students the end-to-end principle [Saltzer, Reed, and Clark, 1981]

Programmable networks are too expensive, too slow, or consume too much power

This Has Changed

A new breed of switch is now available:

They are programmable

No power or cost penalties

They are just as fast as fixed-function devices (6.5 Tbps!)*

* Yes, I work at Barefoot Networks.

If This Trend Continues…

Hardware: CPUs, GPUs, DSPs, TPUs, ASICs

Languages: Java, OpenCL, MatLab, TensorFlow, ?

Programmable ASICs will replace fixed-function chips in data centers

What Functionality Belongs in the Network?

Congestion Control, Load Balancing, Firewall

Tremendous Opportunity

Run important, widely used distributed services in the network:

Fault-tolerance: a 10,000x improvement in throughput [NetPaxos, SOSR ’15; P4xos, CCR ’16]

Key-Value Store: 2 billion queries / second with 50% reduction in latency [NetCache, SOSP ’17]

Stream Processing: process 4 billion events per second [Linear Road, SOSR ’18]

Key Questions

This sounds good on paper, but…

How do we actually program network devices? What are the limitations? What are the abstractions?

What (parts of) applications could or should be in the network? What is the right architecture?

Given that we are asking the network to do so much more work, how can we be sure that it is implemented correctly?

Agenda and Tools

[Figure: three areas, labeled "Programmable network hardware", "Distributed applications", and "Logic and formal methods", with "This talk" marked among them.]

Leverage emerging hardware…

… to accelerate distributed services…

… and prove that the implementations are correct.

Outline of This Talk

Introduction

Programmable Network Hardware

Co-designing Networks and Distributed Systems

Proving Correctness

Outlook

Programmable Network Hardware

What is A Programmable Network?

[Figure: the control plane installs rules in the data plane, e.g., “If ip.dst is 10.0.0.1, forward out port 1”; packets flow through the data plane.]

What is A Programmable Network?

[Figure: both planes are programmed from a source language through a compiler. On the control plane side, a compiler feeds the controller, which produces the rules (e.g., Merlin [CoNext ’14]). On the data plane side, a compiler targets the device that processes packets (e.g., P4FPGA [SOSR ’17]).]

Match Action Table

Main abstraction for data plane programming

Data plane programming specifies:
- fields to read
- possible actions
- size of table

Control plane programming specifies the rules in the table:

Match | Action
10.0.0.1 | Drop
10.0.0.2 | Forward out 1
10.0.0.3 | Forward out 2
10.0.0.4 | Modify header
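The split between the two kinds of programming can be sketched in a few lines of Python. This is only an illustration, not P4 and not any switch API: the class `MatchActionTable` and its methods are hypothetical names. The constructor stands in for what the data plane program fixes (field to read, possible actions, table size), and `add_rule` stands in for the control plane filling in rules at run time.

```python
class MatchActionTable:
    """Toy exact-match table; all names here are illustrative assumptions."""

    def __init__(self, match_field, actions, size, default="drop"):
        # Fixed by the data plane program: field to read, allowed actions, capacity.
        self.match_field = match_field
        self.actions = set(actions)
        self.size = size
        self.default = default
        self.rules = {}

    def add_rule(self, key, action, *params):
        # Called by the control plane at run time to populate the table.
        if action not in self.actions:
            raise ValueError(f"action {action!r} not declared for this table")
        if key not in self.rules and len(self.rules) >= self.size:
            raise MemoryError("table is full")
        self.rules[key] = (action, params)

    def apply(self, packet):
        # Per-packet lookup: return the matching (action, params) or the default.
        return self.rules.get(packet[self.match_field], (self.default, ()))


table = MatchActionTable("ip.dst", {"drop", "forward", "modify_header"}, size=4)
table.add_rule("10.0.0.1", "drop")
table.add_rule("10.0.0.2", "forward", 1)
table.add_rule("10.0.0.3", "forward", 2)
table.add_rule("10.0.0.4", "modify_header")

result = table.apply({"ip.dst": "10.0.0.2"})
```

The table size is enforced at rule-insertion time, which mirrors the point that capacity is a data plane decision the control plane has to live with.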

Match Action Unit

Match
• SRAM for exact match
• TCAM for ternary match

Action
• Stateless ALU
  • Limited instruction set
  • Arithmetic operations
  • Bitwise operations
• Stateful ALU
  • Counters
  • Meters

Massively Parallelized:
• Data Parallelism for performance
• Pipelined stages for data dependencies

Programmable Data Plane

[Figure: Programmable ASIC Architecture. Ingress: Parser, then a series of Match-Action stages. Queues and Crossbar in the middle. Egress: a series of Match-Action stages, then De-Parser.]

P4 Language Concepts

[Figure: the same pipeline (Parser, Match-Action stages, Queues and Crossbar, Match-Action stages, De-Parser), annotated step by step with the concepts below.]

Specify header format and how to parse

Define tables that match on header fields and perform actions (e.g., modify or drop)

Compose lookup tables
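The three concepts can be illustrated with a small Python sketch. It is not P4: the 4-byte header layout, the two tables, and every function name below are assumptions made up for the example. It shows a header format with a parser, two tables that match on header fields and modify or drop, and a fixed composition of those tables followed by a deparser.

```python
import struct

# Hypothetical 4-byte header: 2-byte destination id, 1-byte ttl, 1-byte type.
HEADER = struct.Struct("!HBB")


def parse(raw):
    # "Specify header format and how to parse": bytes -> header fields + payload.
    dst, ttl, kind = HEADER.unpack_from(raw)
    return {"dst": dst, "ttl": ttl, "kind": kind}, raw[HEADER.size:]


def ttl_table(hdr):
    # A table that matches on a header field and modifies the header or drops.
    if hdr["ttl"] <= 1:
        return None  # drop
    return dict(hdr, ttl=hdr["ttl"] - 1)


def forward_table(hdr, routes):
    # A second table: match on dst, attach an egress port, drop on a miss.
    port = routes.get(hdr["dst"])
    return None if port is None else dict(hdr, port=port)


def pipeline(raw, routes):
    # "Compose lookup tables": apply them in a fixed order, then deparse.
    hdr, payload = parse(raw)
    hdr = ttl_table(hdr)
    if hdr is not None:
        hdr = forward_table(hdr, routes)
    if hdr is None:
        return None
    return hdr["port"], HEADER.pack(hdr["dst"], hdr["ttl"], hdr["kind"]) + payload


packet = HEADER.pack(7, 64, 1) + b"hello"
out = pipeline(packet, routes={7: 2})
```

Here `out` is the egress port together with the re-serialized packet, whose ttl field has been decremented.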

Target Constraints

[Figure: the same pipeline, annotated step by step with the constraints below.]

Fixed-length pipeline

Limited memory

Data and control dependencies
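How a fixed-length pipeline and dependencies interact can be sketched as a placement problem. The Python below is a simplified illustration, not how any real P4 compiler allocates resources: `assign_stages` is a hypothetical helper, the table names are invented, and per-stage memory limits are ignored. A table that depends on another must sit in a later stage, so the longest dependency chain must fit in the number of stages.

```python
def assign_stages(dependencies, num_stages):
    """Place each table in the earliest stage after all tables it depends on.

    dependencies maps a table name to the tables whose results it needs.
    Returns {table: stage} or raises if the fixed-length pipeline is too short.
    """
    stage = {}

    def place(table, visiting=()):
        if table in stage:
            return stage[table]
        if table in visiting:
            raise ValueError(f"cyclic dependency involving {table!r}")
        deps = dependencies.get(table, ())
        # Independent tables share stage 0; a dependent table needs a later stage.
        s = 1 + max((place(d, visiting + (table,)) for d in deps), default=-1)
        if s >= num_stages:
            raise ValueError(f"{table!r} needs stage {s}, pipeline has {num_stages}")
        stage[table] = s
        return s

    for table in dependencies:
        place(table)
    return stage


deps = {"acl": [], "route": [], "nexthop": ["route"], "rewrite": ["nexthop", "acl"]}
placement = assign_stages(deps, num_stages=3)
```

With three stages the four tables fit; with two stages the same program is rejected, which is the kind of compile-time failure the constraints above lead to.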


Observations

Architecture is designed for speed and efficiency

Performance doesn’t come for free

Limited degree of programmability

Not Turing complete by design

Language syntax and hardware generations may change, but the basic design is fundamental

Co-Designing Networks and Distributed Systems

What Applications Should We Put in the Network?

Monte Carlo Simulation, Fundamental Building Blocks

Building Blocks For Distributed Systems

Building Block | Description | System
Consensus | Essential for building fault-tolerant, replicated systems | NetPaxos, SOSR ’15; P4xos, CCR ’16
Caching | Maximize utilization of available resources | NetCache, SOSP ’17; NetChain, NSDI ’18
Data Processing | In-network computation and analytics | Linear Road, SOSR ’18
Publish/Subscribe | Semantically meaningful communication | In submission

Consensus Protocols

Get a group of replicas to agree on next application state

Consensus protocols are the foundation for fault-tolerant systems

E.g., OpenReplica, Ceph, Chubby

Many distributed systems problems can be reduced to consensus

E.g., Atomic broadcast, atomic commit

Ways to Improve Consensus Performance

[Figure: Consensus Protocols and Programmable Networks, connected by the two approaches below.]

Push logic into network hardware

Enforce particular network behavior

Consensus / Network Design Space

[Figure: a two-axis design space, built up over several slides. Horizontal axis: Programmability, from Weak (forward packets) to Strong (storage and logic). Vertical axis: Assumptions, from best effort to no message loss and FIFO delivery. The protocols placed in the space are Traditional Paxos, Fast Paxos, NetPaxos, Speculative Paxos / No Paxos, and P4xos (this talk).]

NetPaxos: 99.9% of the time, assumptions held

Promising, but 99.9% correct consensus isn’t practical

Paxos

Of the various consensus protocols, we focus on Paxos because:

One of the most widely used

Often considered the “gold standard”

Proven correct

“There are two kinds of consensus protocols: those that are Paxos, and those that are incorrect” — attributed to Butler Lampson


Paxos In the Network

Key questions:

What parts of Paxos should be accelerated?

How to map the algorithm to stateful forwarding decisions (i.e., Paxos logic as sequence of match/actions)?

How do we map from complex protocol to low-level abstractions?

What are the right interfaces? How do we deploy?


Paxos in a Nutshell

An execution of Paxos is called an instance. Each instance is associated with an ID, called the instance number.

The protocol has two phases. Each phase may contain multiple rounds. There is a round number to identify the round.

Phase 1: “What instance number are we talking about?”

Phase 2: “What is the value for the instance number?”

Observation: Phase 1 does not depend on a particular value. We should accelerate Phase 2.

Paxos In The Switch

Paxos packets carry the union of all Paxos messages:
• type
• instance
• round
• vround
• value

[Figure: a range of instance numbers, labeled n to m.]

Run Phase 1 in a batch, declare the instance numbers to use

When batch fills up, we need to checkpoint

Tradeoff with performance and memory

Access dependencies make it hard to implement ring buffer

Need to use “hacks” to trick the compiler
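A single header that is the union of all Paxos messages can be sketched as one fixed-layout record. The Python below is an illustration only: the slides name the fields but not their widths, so the sizes in `PAXOS_HEADER`, the message-type constants, and the helper names are all assumptions.

```python
import struct
from collections import namedtuple

# Field widths are assumptions for illustration; the slides only name the fields.
# Layout: 1-byte type, 4-byte instance, 2-byte round, 2-byte vround, 32-byte value.
PAXOS_HEADER = struct.Struct("!BIHH32s")
PaxosMsg = namedtuple("PaxosMsg", "type instance round vround value")

PHASE1A, PHASE1B, PHASE2A, PHASE2B = range(4)


def encode(msg):
    # Serialize one header; struct pads the value with zero bytes up to 32 bytes.
    return PAXOS_HEADER.pack(*msg)


def decode(raw):
    # Parse the fixed-size header back into named fields.
    # Note: trailing zero bytes are treated as padding and stripped.
    fields = PAXOS_HEADER.unpack(raw)
    return PaxosMsg(*fields[:4], fields[4].rstrip(b"\0"))


msg = PaxosMsg(PHASE2A, instance=42, round=0, vround=0, value=b"put k v")
wire = encode(msg)
```

Every message type uses the same layout, so a switch parser only has to recognize one header and can branch on the `type` field afterwards.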

Phase 2 Roles and Communication

[Figure: the Proposer sends a Proposal to the Coordinator; the Coordinator sends Phase 2A messages to Acceptors 1, 2, 3, … (up to n); the Acceptors send Phase 2B messages to the Learners.]

Proposers propose a value via the Coordinator (Phase 2).

Acceptors accept value, promise not to accept any more proposals for instance (Phase 2).

Learners require a quorum of messages from Acceptors, “deliver” a value (Phase 2).

Paxos Bottlenecks

[Figure: two charts of CPU utilization (25% to 100%) for the Proposer, Coordinator, Acceptor, and Learner roles; one of them plots utilization against the number of Learners (4 to 20).]

Observation: accelerate agreement: Coordinator and Acceptors

Paxos as Prose

[Lamport, Distributed Computing ’06]

Paxos as Match-Action

Coordinator Algorithm

    void submit(struct paxos_ctx* ctx,
                char* value, int size);

Figure 3: P4xos proposer API.

Algorithm 1 Leader logic.

    Initialize State:
        instance[1] := {0}
    upon receiving pkt(msgtype, inst, rnd, vrnd, swid, value)
        match pkt.msgtype:
            case REQUEST:
                pkt.msgtype ← PHASE2A
                pkt.rnd ← 0
                pkt.inst ← instance[0]
                instance[0] := instance[0] + 1
                multicast pkt
            default:
                drop pkt

Proposer. A P4xos proposer mediates client requests, and encapsulates the request in a Paxos header. Ideally, this logic could be implemented by an operating system kernel network stack, allowing it to add Paxos headers in the same way that transport protocol headers are added today. As a proof-of-concept, we have implemented the proposer as a user-space library that exposes a small API to client applications.

The P4xos proposer library is a drop-in replacement for existing software libraries. The API consists of a single submit function, shown in Figure 3. The submit function is called when the application uses Paxos to send a value. The application simply passes a character buffer containing the value, and the buffer size. The paxos_ctx struct maintains Paxos-related state across invocations (e.g., socket file descriptors).

Leader. A leader brokers requests on behalf of proposers. The leader ensures that only one process submits a message to the protocol for a particular instance (thus ensuring that the protocol terminates), and imposes an ordering of messages. When there is a single leader, a monotonically increasing sequence number can be used to order the messages. This sequence number is written to the inst field of the header.

Algorithm 1 shows the pseudocode for the primary leader implementation. The leader receives REQUEST messages from the proposer. REQUEST messages only contain a value. The leader must perform the following: write the current instance number and an initial round number into the message header; increment the instance number for the next invocation; store the value of the new instance number; and broadcast the packet to acceptors.

P4xos uses a well-known Paxos optimization [14], where each instance is reserved for the primary leader at initialization (i.e., round number zero). Thus, the primary leader does not need to execute Phase 1 before submitting a value (in a REQUEST message) to the acceptors. Since this optimization only works for one leader, the backup leader must reserve an instance before submitting a value to the acceptors. To reserve an instance, the backup leader must send a unique round number in a PHASE1A message to the acceptors. For brevity, we omit the backup leader algorithm since it essentially follows the Paxos protocol.

Acceptor. Acceptors are responsible for choosing a single value for a particular instance. For each instance of consensus, each individual acceptor must “vote” for a value. Acceptors must maintain and access the history of proposals for which they have voted. This history ensures that acceptors never vote for different values for a particular instance, and allows the protocol to tolerate lost or duplicate messages.

Algorithm 2 shows logic for an acceptor. Acceptors can receive either PHASE1A or PHASE2A messages. Phase 1A messages are used during initialization, and Phase 2A messages trigger a vote. The logic for handling both messages, when expressed as stateful routing decisions, involves: (i) reading persistent state, (ii) modifying packet header fields, (iii) updating the persistent state, and (iv) forwarding the modified packets. The logic differs in which header fields are involved.

Learner. Learners are responsible for replicating a value for a given consensus instance. Learners receive votes from the acceptors, and “deliver” a value if a majority of votes are the same (i.e., there is a quorum).

Algorithm 3 shows the pseudocode for the learner logic. Learners should only receive PHASE2B messages. When a message arrives, each learner extracts the instance number, switch id, and value. The learner maintains a mapping from a pair of instance number and switch id to a value. Each time a new value arrives, the learner checks for a majority-quorum of acceptor votes. A majority is equal to f + 1 where f is the number of faulty acceptors that can be tolerated.

The learner provides the interface between the network consensus and the replicated application. The behavior is split between the network, which listens for a quorum of messages, and a library, which is linked to the application. To compute a quorum, the learner counts the number of PHASE2B messages it receives from different acceptors in a round. If there is no quorum of PHASE2B messages in an instance (e.g., because the primary leader fails), the learner may need to recount PHASE2B messages in a quorum (e.g., after the
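The leader logic of Algorithm 1 is small enough to mirror directly in Python. This is a sketch for reading along, not the P4xos implementation: the `Pkt` and `Leader` names are assumptions, the packet carries only the fields the leader touches, and returning a packet stands in for multicasting it to the acceptors.

```python
from dataclasses import dataclass, replace

REQUEST, PHASE2A = "REQUEST", "PHASE2A"


@dataclass(frozen=True)
class Pkt:
    msgtype: str
    value: bytes
    inst: int = 0
    rnd: int = 0


class Leader:
    def __init__(self):
        self.instance = 0  # next unused instance number (register state on a switch)

    def on_packet(self, pkt):
        # Returns the packet to multicast to the acceptors, or None to drop.
        if pkt.msgtype != REQUEST:
            return None
        out = replace(pkt, msgtype=PHASE2A, rnd=0, inst=self.instance)
        self.instance += 1
        return out


leader = Leader()
first = leader.on_packet(Pkt(REQUEST, b"a"))
second = leader.on_packet(Pkt(REQUEST, b"b"))
dropped = leader.on_packet(Pkt(PHASE2A, b"c"))
```

The only state is one counter, and each packet reads and increments it exactly once, which is what makes this role a natural fit for a stateful match-action stage.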

Paxos as Match-Action

Proposer (application): Encode value in a packet header.

Coordinator (network): If match, add sequence number, and forward

Acceptor (network): If match, compare round field in header, update state, and forward

Learner (application): Decode and return value to the application.
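The acceptor and learner steps can be sketched in the same style. The Python below follows standard Paxos Phase 2 handling and is not taken from the P4xos source: the class and method names are assumptions, per-instance state lives in dictionaries rather than switch registers, and the learner counts matching votes from distinct acceptors until it reaches f + 1.

```python
from collections import defaultdict


class Acceptor:
    def __init__(self, swid):
        self.swid = swid
        self.rnd = defaultdict(int)   # highest round seen, per instance
        self.vrnd = {}                # round in which a value was accepted
        self.value = {}               # accepted value, per instance

    def on_phase2a(self, inst, rnd, value):
        # Compare the round in the header with stored state; vote or drop.
        if rnd < self.rnd[inst]:
            return None
        self.rnd[inst] = rnd
        self.vrnd[inst] = rnd
        self.value[inst] = value
        return ("PHASE2B", inst, rnd, self.swid, value)


class Learner:
    def __init__(self, f):
        self.quorum = f + 1            # majority of 2f + 1 acceptors
        self.votes = defaultdict(set)  # (inst, rnd, value) -> acceptor ids
        self.delivered = {}

    def on_phase2b(self, inst, rnd, swid, value):
        # Count votes from distinct acceptors; deliver once per instance.
        if inst in self.delivered:
            return None
        self.votes[(inst, rnd, value)].add(swid)
        if len(self.votes[(inst, rnd, value)]) >= self.quorum:
            self.delivered[inst] = value
            return value
        return None


acceptors = [Acceptor(i) for i in range(3)]
learner = Learner(f=1)
results = []
for a in acceptors:
    _, inst, rnd, swid, value = a.on_phase2a(inst=0, rnd=0, value=b"x")
    results.append(learner.on_phase2b(inst, rnd, swid, value))
```

With three acceptors and f = 1, the value is delivered on the second matching vote and the third vote is ignored.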

Application Interface

API Function Name | Description
submit | Application to network: Send a value
deliver | Network to application: Deliver a value
recover | Application to network: Discover a prior value

C wrapper provides a drop-in replacement for existing Paxos libraries!

P4xos Deployment

[Figure: two paths compared. One: Proposer, ToR, Aggregate, Spine/Coordinator, Aggregate/Acceptor, ToR, Learner/Application. vs. the other: Proposer, Coordinator, Acceptor, Learner/Application.]

Experiments

Focus on two questions:

What is the absolute performance?

What is the end-to-end performance?


Absolute Performance

Measured each role separately on 64x40G ToR switch (Barefoot Tofino) and IXIA XGS12-H as packet sender

Throughput is over 2.5 billion consensus messages / second. This is a 10,000x improvement over software.

Data plane latency is less than 0.1 μs (measured inside the chip)
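Read together, the two throughput figures imply a rough software baseline. The number below is an inference from the slide's own figures, not a measurement reported on it, and both inputs are approximate ("over" 2.5 billion, 10,000x):

```latex
\frac{2.5 \times 10^{9}\ \text{msgs/s}}{10^{4}} \approx 2.5 \times 10^{5}\ \text{msgs/s}
```

That is, a software deployment on the order of a few hundred thousand consensus messages per second.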

End-to-End Performance

Application delivers to RocksDB with read and write commands

4.3x throughput improvement over software implementation

73% reduction in latency

[Figure: 99th percentile latency (µs, 0 to 600) vs. throughput (1000 x msgs/s, 0 to 250) for Libpaxos and Network Paxos.]

Accelerating Execution (Work-in-Progress)

Run multiple Paxi in parallel

Partition application state; Paxos packets gain a partition field:
• type
• instance
• round
• vround
• value
• partition

[Figure: several ranges of instance numbers, each labeled n to m.]
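Running several Paxos sequences in parallel amounts to keeping one instance counter per partition. The Python below is a sketch of that idea under stated assumptions: the slides do not say how requests are mapped to partitions, so hashing the key with `zlib.crc32` is an invented choice, and `PartitionedCoordinator` is a hypothetical name.

```python
import zlib


def partition_of(key, num_partitions):
    # Stable hash so every proposer maps a key to the same partition.
    return zlib.crc32(key) % num_partitions


class PartitionedCoordinator:
    def __init__(self, num_partitions=4):
        # One instance counter per partition instead of a single global one.
        self.num_partitions = num_partitions
        self.next_instance = [0] * num_partitions

    def sequence(self, key, value):
        # Stamp the request with its partition and that partition's next instance.
        p = partition_of(key, self.num_partitions)
        inst = self.next_instance[p]
        self.next_instance[p] += 1
        return {"type": "PHASE2A", "partition": p, "instance": inst,
                "round": 0, "vround": 0, "value": value}


coord = PartitionedCoordinator()
msgs = [coord.sequence(k, b"v") for k in (b"a", b"a", b"b")]
```

Requests for the same key always land in the same partition and are ordered there, while requests in different partitions need no ordering relative to each other.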

Accelerating Execution (Work-in-Progress)

Not yet done: handling “cross partition” requests

Must add barriers to synchronize learners

Fully partitioned workload reaches 500K msgs/sec

[Figure: RocksDB Throughput vs. Checkpoint Interval]

Practical Application: Storage Class Memory

Fast network interconnect allows users to scale storage and compute separately (i.e., disaggregated storage)

Several companies, including Western Digital, have developed new types of non-volatile memory

Persistent, with latency comparable to DRAM

But, wears out over time…

Use in-network consensus to keep replicas consistent

�53

Page 64: The Network is The Computer: Running Distributed Services on …conferences.sigcomm.org/sigcomm/2018/files/slides/net... · 2018-09-02 · Running Distributed Services on Programmable

To Recap

�54

ProgrammabilityWeak Strong

TraditionalPaxos

Assu

mpt

ions

NetPaxos

Best effort

No message loss, FIFO delivery

P4xos

Fast Paxos

Speculative Paxos

No Paxos

Forward packets Storage and logic

“It’s just Paxos!”

Page 65: The Network is The Computer: Running Distributed Services on …conferences.sigcomm.org/sigcomm/2018/files/slides/net... · 2018-09-02 · Running Distributed Services on Programmable

To Recap

�55

ProgrammabilityWeak Strong

TraditionalPaxos

Assu

mpt

ions

NetPaxos

Best effort

No message loss, FIFO delivery

P4xos

Fast Paxos

Speculative Paxos

No Paxos

Forward packets Storage and logic

“It’s just Paxos!”

But, how can we be sure the implementation

is correct?

Proving Correctness (or How Do We Know Our Implementation is Correct?)

An Old Story You’ve Heard Before

We checked the Paxos algorithm with the SPIN model checker. No problems!

We wrote the Paxos code.

We ran it in the network, but didn’t get consensus.

There is a bug in our implementation.

Verification is So Tempting…

To the extent networks are verified, the focus is on forwarding (e.g., no path loops)

If the network is going to take on more work, how can we be sure that is correct?

P4 is so tempting to verify: no loops, no pointers, etc.

Verification Problem

Data Plane: P4

Control Plane: Rules, e.g., “If ip.dst is 10.0.0.1, forward out port 1”

The specific behavior of a P4 program depends on the control plane

We only have half the program!
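
To make the "half the program" point concrete, here is a minimal sketch in Python (not P4) of a match-action table. The `Table` class, its method names, and the default-drop behavior are illustrative assumptions, not part of the talk. The data plane fixes what is matched and which actions exist; the control plane decides which rules are installed.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

DROP = None  # hypothetical marker meaning "the packet is dropped"


@dataclass
class Table:
    """A toy match-action table keyed on ip.dst.

    The data plane fixes the match key and the available actions;
    the control plane fills in the rules at run time.
    """
    rules: Dict[str, int] = field(default_factory=dict)  # ip.dst -> output port

    def install(self, ip_dst: str, port: int) -> None:
        # Control-plane operation: add a forwarding rule.
        self.rules[ip_dst] = port

    def apply(self, ip_dst: str) -> Optional[int]:
        # Data-plane operation: exact match on ip.dst.
        # Assumption: a miss falls through to a default drop action.
        return self.rules.get(ip_dst, DROP)


# Same data-plane "program", two different control planes.
empty = Table()
configured = Table()
configured.install("10.0.0.1", 1)  # "If ip.dst is 10.0.0.1, forward out port 1"
```

With no rules installed, `empty.apply("10.0.0.1")` drops the packet, while `configured.apply("10.0.0.1")` forwards it out port 1. Looking at the table definition alone cannot tell the two apart.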

Hoare Logic

{ P } c { Q }

If P holds and c executes, then Q holds.

Axioms capture relational properties: what is true before and after a command executes.

Standard approach to verification

Use automated theorem-prover to check if there is an initial state that leads to a violation

Generate a counter example via weakest pre-condition
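
The weakest pre-condition idea can be sketched for a toy command language in Python. Everything here is an illustrative assumption: the tuple encoding of commands, the names `wp` and `find_counterexample`, and above all the brute-force enumeration over a small finite domain, which stands in for the automated theorem prover mentioned above.

```python
from itertools import product


def wp(cmd, post):
    """Return the weakest pre-condition of `cmd` for `post`, as a predicate on states."""
    kind = cmd[0]
    if kind == "assign":  # wp(x := e, Q) = Q with e substituted for x
        _, var, expr = cmd
        return lambda s: post({**s, var: expr(s)})
    if kind == "seq":  # wp(c1; c2, Q) = wp(c1, wp(c2, Q))
        _, first, second = cmd
        return wp(first, wp(second, post))
    if kind == "if":  # wp(if b then c1 else c2, Q)
        _, cond, then_cmd, else_cmd = cmd
        wp_then, wp_else = wp(then_cmd, post), wp(else_cmd, post)
        return lambda s: wp_then(s) if cond(s) else wp_else(s)
    raise ValueError(f"unknown command: {kind}")


def find_counterexample(pre, cmd, post, variables, domain):
    """Search for an initial state that satisfies `pre` but not wp(cmd, post)."""
    required = wp(cmd, post)
    for values in product(domain, repeat=len(variables)):
        state = dict(zip(variables, values))
        if pre(state) and not required(state):
            return state
    return None


# c: if x < 0 then y := -x else y := x   (absolute value)
abs_cmd = ("if", lambda s: s["x"] < 0,
           ("assign", "y", lambda s: -s["x"]),
           ("assign", "y", lambda s: s["x"]))

always = lambda s: True
ok = find_counterexample(always, abs_cmd, lambda s: s["y"] >= 0, ["x", "y"], range(-3, 4))
bad = find_counterexample(always, abs_cmd, lambda s: s["y"] > 0, ["x", "y"], range(-3, 4))
```

Here `ok` is `None`, since `{ true } c { y >= 0 }` holds on the sampled domain, while `bad` is an initial state with `x == 0`, which violates the stronger post-condition `y > 0`.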

P4 + Hoare Logic

{ P + “control plane assumptions” } c { Q }

If P plus some assumed knowledge holds and c executes, then Q holds.

Axioms capture relational properties: what is true before and after a command executes.

Allow programmers to express symbolic constraints on the control plane in terms of predicates on data plane state

Combined, the control plane and data plane behave as expected

Verification Challenges

Challenge: P4 does not have a formal semantics
Solution: We had to define one via translation

Challenge: What should the annotations look like?
Solution: Leveraged our domain-specific knowledge to define language

Challenge: How do we make the solver scale?
Solution: Standing on the shoulders of giants, e.g., passivization [Flanagan and Saxe, POPL 2001]

P4v : Basic Approach

Translate P4 to logical formulas

Define a program logic for P4

Annotate to check for properties

Reduce to SMT problem

action forward(p) { … }
table T {
  reads { tcp.dstPort; eth.type; }
  actions { drop; forward; }
}

Desired Property:

“If the tcp.dstPort is 22, then drop the packet.”
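
The port-22 property can be checked against table T in a small Python sketch. This is not how P4v is implemented: it assumes the control plane's choice of action can be modeled as an unconstrained input, it samples only a few header values, and it uses plain enumeration where the real tool reduces to an SMT problem. All names (`check`, `ssh_is_dropped`, and so on) are hypothetical.

```python
from itertools import product

PORTS = [22, 80, 443]          # small sample of tcp.dstPort values (assumption)
ETH_TYPES = [0x0800, 0x86DD]   # small sample of eth.type values (assumption)
ACTIONS = ["drop", "forward"]  # the actions table T may run


def run_table(action):
    """Data plane: run the chosen action and report whether the packet is dropped."""
    return action == "drop"


def desired_property(dst_port, dropped):
    # "If the tcp.dstPort is 22, then drop the packet."
    return dst_port != 22 or dropped


def check(assumption):
    """Return a (dst_port, eth_type, action) that violates the property, or None."""
    for dst_port, eth_type, action in product(PORTS, ETH_TYPES, ACTIONS):
        if not assumption(dst_port, eth_type, action):
            continue  # the control plane is assumed never to install this behavior
        if not desired_property(dst_port, run_table(action)):
            return (dst_port, eth_type, action)
    return None


# No knowledge of the control plane: any action may be chosen for any packet.
no_assumption = lambda dst_port, eth_type, action: True
# Control plane assumption: rules matching port 22 always select drop.
ssh_is_dropped = lambda dst_port, eth_type, action: dst_port != 22 or action == "drop"

unconstrained = check(no_assumption)
constrained = check(ssh_is_dropped)
```

Without any assumption the check returns a violating case (a port-22 packet matched to `forward`); once the control plane assumption is added, no counterexample remains. The property is about the combined system, not the P4 program alone.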

Desired Property:

“If the round number of arriving packet is greater than the stored round number, then drop the packet.”

CCR Paper Bug

@pragma assume valid(paxos) implies local.round <= paxos.rnd
apply(rount_table) {
  if (local.round <= paxos.rnd) {
    apply(acceptor_table)
  }
}
@pragma assert valid(paxos) implies local.set_drop == 0

Action failed to set the “drop flag” when the arriving round number is greater than the stored round number.
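
The assume/assert pair above can be mimicked in a small Python sketch. This is a hypothetical reconstruction, not the actual acceptor code: the default value `set_drop = 1`, the body of `buggy_acceptor` (which clears the flag only when the rounds are equal), and the small range of round numbers are all assumptions chosen to reproduce the kind of bug described. Every modeled packet is taken to have a valid `paxos` header.

```python
from itertools import product

ROUNDS = range(4)  # small range of round numbers, enough to exercise the comparison


def buggy_acceptor(state):
    # Hypothetical bug: the flag is cleared only when the rounds are equal,
    # so a strictly greater arriving round leaves set_drop untouched.
    if state["rnd"] == state["round"]:
        return {**state, "set_drop": 0}
    return state


def fixed_acceptor(state):
    # The acceptor always clears the flag for packets it handles.
    return {**state, "set_drop": 0}


def pipeline(local_round, paxos_rnd, acceptor_action):
    """Mirror the annotated block: the acceptor runs only if local.round <= paxos.rnd."""
    state = {"round": local_round, "rnd": paxos_rnd,
             "set_drop": 1}  # assumed default: drop unless an action clears the flag
    if local_round <= paxos_rnd:
        state = acceptor_action(state)
    return state


def verify(acceptor_action):
    """Return a (local_round, paxos_rnd) pair that violates the assertion, or None."""
    for local_round, paxos_rnd in product(ROUNDS, repeat=2):
        if not local_round <= paxos_rnd:  # @pragma assume
            continue
        final = pipeline(local_round, paxos_rnd, acceptor_action)
        if final["set_drop"] != 0:        # @pragma assert
            return (local_round, paxos_rnd)
    return None
```

Under these assumptions, `verify(buggy_acceptor)` reports the pair `(0, 1)`, a packet whose round number is greater than the stored one, while `verify(fixed_acceptor)` finds no violation.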

Evaluation

Ran our verifier on a diverse collection of 13 P4 programs

Conventional forwarding: Router, NAT, Switch

Source routing: ToR, VPC

In-network processing: Paxos, LinearRoad

Most finished in 10s of ms; switch.p4 finished in 15 seconds.

Only system to verify switch.p4

Outlook

Summarizing

System artifact that can achieve orders-of-magnitude improvements in performance

Identified techniques for programming within fundamental hardware constraints

Novel re-interpretation of the Paxos algorithm

Hopefully add clarity through a different perspective

Mechanized proof of correctness of the implementation

A Few Lessons Learned

What are good candidate applications for network acceleration?

“Squint a little bit, and they look like routing”

Applications with transient state, rather than persistent

Services that are I/O bound

Network acceleration helps latency, but throughput is the big win

What’s Next?

Very exciting time for networking and systems

Network programmability provides an amazing opportunity to revisit the entire stack

Redesign systems using an integrated approach, combining databases, networking, distributed systems, and PL

http://www.inf.usi.ch/faculty/soule/