SwarmKit in Theory and Practice
Transcript of SwarmKit in Theory and Practice
Container Orchestration from Theory to Practice
Laura Frank & Stephen Day, Open Source Summit, Los Angeles, September 2017
Agenda
- Understanding the SwarmKit object model
- Node topology
- Distributed consensus and Raft
SwarmKit
An open source framework for building orchestration systems
https://github.com/docker/swarmkit
SwarmKit Object Model
Laura Frank & Stephen Day, Container Orchestration from Theory to Practice
Orchestration
A control system for your cluster
[Control-loop diagram: the Orchestrator O compares the desired state D against the observed state St and applies operations Δ to the Cluster C]
D = Desired State, O = Orchestrator, C = Cluster, St = State at time t, Δ = Operations to converge S to D
https://en.wikipedia.org/wiki/Control_theory
Convergence
A functional view
D = Desired State, O = Orchestrator, C = Cluster, St = State at time t
f(D, Sn-1, C) → Sn | min(S-D)
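The convergence function above can be read as a reconciliation step applied repeatedly: each step produces a new state closer to the desired one. A minimal sketch of that idea follows; the types and names are hypothetical, not SwarmKit's actual code:

```go
package main

import "fmt"

// State is a toy stand-in for observed cluster state.
type State struct{ Replicas int }

// converge is one application of f(D, S, C) → S': it applies at most
// one operation Δ per step (start or stop a single replica), moving
// the current state toward the desired state.
func converge(desired, current State) State {
	switch {
	case current.Replicas < desired.Replicas:
		current.Replicas++ // Δ: start a replica
	case current.Replicas > desired.Replicas:
		current.Replicas-- // Δ: stop a replica
	}
	return current
}

func main() {
	desired := State{Replicas: 3}
	s := State{Replicas: 0}
	// Iterate until (S - D) is minimized (here: zero).
	for s != desired {
		s = converge(desired, s)
		fmt.Println("replicas:", s.Replicas)
	}
}
```

The loop terminates exactly when observed state equals desired state, which is the fixed point the orchestrator drives toward.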
Observability and Controllability
The Problem
[Spectrum diagram, from low to high observability: Failure → Process State → User Input]
Data Model Requirements
- Represent differences in cluster state
- Maximize observability
- Support convergence
- Do this while being extensible and reliable
Show me your data structures and I’ll show you your orchestration system
Services
- Express desired state of the cluster
- Abstraction to control a set of containers
- Enumerates resources, network availability, placement
- Leave the details of runtime to the container process
- Implement these services by distributing processes across a cluster
[Diagram: service tasks spread across Node 1, Node 2, Node 3]
Declarative

$ docker network create -d overlay backend
vpd5z57ig445ugtcibr11kiiz

$ docker service create -p 6379:6379 --network backend redis
pe5kyzetw2wuynfnt7so1jsyl

$ docker service scale serene_euler=3
serene_euler scaled to 3

$ docker service ls
ID            NAME          REPLICAS  IMAGE  COMMAND
pe5kyzetw2wu  serene_euler  3/3       redis
Reconciliation
Spec → Object
Object = Current State
Spec = Desired State
Demo Time!
Task Model
Prepare: set up resources
Start: start the task
Wait: wait until the task exits
Shutdown: stop the task cleanly
Runtime
Orchestrator
Task Model
The atomic scheduling unit of SwarmKit
Object = Current State, Spec = Desired State
[Diagram: Task0, Task1, …, Taskn feed into the Scheduler]
Data Flow
[Diagram: on the Manager, a Service holds a ServiceSpec, which embeds a TaskSpec; each Task created from the Service carries its own copy of the TaskSpec down to the Worker]
Consistency
Field Ownership
Only one component of the system can write to a field
Task State
[State-machine diagram: Manager-owned states: New → Allocated → Assigned. Worker-owned states: Preparing (pre-run) → Ready → Starting → Running. Terminal states: Complete, Shutdown, Failed, Rejected]
Field Handoff: Task Status

State        Owner
< Assigned   Manager
>= Assigned  Worker
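Because the handoff rule compares the state against Assigned, ordering the states makes ownership a one-line check. A hypothetical sketch of that single-writer rule (names invented for illustration):

```go
package main

import "fmt"

// TaskState values are declared in lifecycle order so ownership can
// be decided by comparing against Assigned, as in the handoff table.
type TaskState int

const (
	New TaskState = iota
	Allocated
	Assigned
	Preparing
	Ready
	Starting
	Running
	Complete
)

// owner reports which component may write the task's status field:
// the Manager before Assigned, the Worker from Assigned onward.
func owner(s TaskState) string {
	if s < Assigned {
		return "manager"
	}
	return "worker"
}

func main() {
	for _, s := range []TaskState{New, Assigned, Running} {
		fmt.Println(s, owner(s))
	}
}
```

Because exactly one component may write each field, concurrent Manager and Worker updates can never conflict on the same data.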
SwarmKit Doesn’t Quit!
Node Topology in SwarmKit
We’ve got a bunch of nodes… now what?
Push Model vs Pull Model
[Diagram. Push model: (1) the Worker registers in a discovery system such as ZooKeeper; (2) the Manager discovers the Worker; (3) the Manager pushes the payload to it. Pull model: the Worker connects directly to the Manager for both registration and payload.]
Push Model
- Pros: better control over the communication rate
  - Managers decide when to contact Workers
- Cons: requires a discovery mechanism
  - More failure scenarios
  - Harder to troubleshoot

Pull Model
- Pros: simpler to operate
  - Workers connect to Managers and don't need to bind
  - Can easily traverse networks
  - Easier to secure
  - Fewer moving parts
- Cons: Workers must maintain a connection to Managers at all times
Push vs Pull
- SwarmKit adopted the Pull model
- Favored operational simplicity
- Engineered solutions to provide rate control in pull mode
Rate Control in a Pull Model
Rate Control: Heartbeats
- Manager dictates the heartbeat rate to Workers
- Rate is configurable (though not by the end user)
- Managers agree on the same rate by consensus via Raft
- Managers add jitter so pings are spread over time (avoiding bursts)
[Diagram: Worker: "Ping?" Manager: "Pong! Ping me back in 5.2 seconds"]
Rate Control: Workloads
- Worker opens a gRPC stream to receive workloads
- Manager can send data whenever it wants to
- Manager sends data in batches
- Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first
- Adds little delay (at most 100 ms) but drastically reduces the amount of communication

[Diagram: Worker asks the Manager "Give me work to do"; the Manager streams batches over time: 100ms - [Batch of 12], 200ms - [Batch of 26], 300ms - [Batch of 32], 340ms - [Batch of 100], 360ms - [Batch of 100], 460ms - [Batch of 42], 560ms - [Batch of 23]]
Replication
Running multiple managers for high availability

[Diagram: three Managers (one Leader, two Followers) and a Worker]
- Worker can connect to any Manager
- Followers will forward traffic to the Leader
Replication
[Diagram: Leader, two Followers, several Workers]
- Followers multiplex all workers to the Leader using a single connection
- Backed by gRPC channels (HTTP/2 streams)
- Reduces the Leader's networking load by spreading connections evenly

Example: On a cluster with 10,000 workers and 5 managers, each manager only has to handle about 2,000 connections. Each follower forwards its 2,000 workers to the leader over a single socket.
Replication
[Diagram: the Leader fails and one of the Followers is elected the new Leader]
- Upon Leader failure, a new one is elected
- All managers start redirecting worker traffic to the new one
- Transparent to workers
Replication
[Diagram: the Worker holds the address list: Manager 1 Addr, Manager 2 Addr, Manager 3 Addr]
- Manager sends the list of all managers' addresses to Workers
- When a new manager joins, all workers are notified
- Upon manager failure, workers reconnect to a random remaining manager
Presence
Scalable presence in a distributed environment
- Leader commits Worker state (Up vs Down) into Raft
  - Propagates to all managers
  - Recoverable in case of leader re-election
- Heartbeat TTLs are kept in Leader memory
  - Storing "last ping time" in Raft would be too expensive: every ping would result in a quorum write
  - Leader keeps worker-to-TTL mappings in a heap (time.AfterFunc)
  - Upon leader failover, workers are given a grace period to reconnect
- Workers are considered Unknown until they reconnect
  - If they do, they move back to Up; if they don't, they move to Down
Distributed Consensus
The Raft Consensus Algorithm
Orchestration systems typically use some kind of service to maintain state in a distributed system:
- etcd
- ZooKeeper
- …
Many of these services are backed by the Raft consensus algorithm.
SwarmKit and Raft
Docker chose to implement the algorithm directly
- Fast
- Don't have to set up a separate service to get started with orchestration
- Differentiator between SwarmKit/Docker and other orchestration systems
demo.consensus.group
secretlivesofdata.com
Consistency
Sequencer
- Every object in the store has a Version field
- Version stores the Raft index when the object was last updated
- Updates must provide a base Version; they are rejected if it is out of date
- Similar to compare-and-swap (CAS)
- Also exposed through API calls that change objects in the store
Versioned Updates (Consistency)
// Read the current object, including its Version (the Raft index).
service := getCurrentService()
spec := service.Spec
spec.Image = "my.serv/myimage:mytag"
// Submit the new spec with the base Version; the store rejects
// the update if the object has changed since it was read.
update(spec, service.Version)
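When an update is rejected because the base Version is stale, a client can re-read and retry. A hypothetical sketch of that loop; the toy store below is invented for illustration and only mimics the sequencer's CAS behavior:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errOutOfDate = errors.New("base version out of date")

// store is a toy versioned object store: an update succeeds only
// when the caller's base version matches the stored one.
type store struct {
	mu      sync.Mutex
	image   string
	version uint64 // stands in for the Raft index
}

func (s *store) get() (string, uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.image, s.version
}

func (s *store) update(image string, base uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if base != s.version {
		return errOutOfDate // CAS failure: object changed since read
	}
	s.image = image
	s.version++ // a real store would record the new Raft index
	return nil
}

// setImage retries the read-modify-write until the CAS succeeds.
func setImage(s *store, image string) {
	for {
		_, v := s.get()
		if err := s.update(image, v); err == nil {
			return
		}
		// Someone else updated the object; re-read and try again.
	}
}

func main() {
	s := &store{image: "registry:2.3.0", version: 189}
	setImage(s, "registry:2.4.0")
	img, v := s.get()
	fmt.Println(img, v) // registry:2.4.0 190
}
```

This read-modify-write-retry pattern is the standard client-side answer to optimistic concurrency control.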
Sequencer

Original object:
Service ABC
  Spec: Replicas = 4, Image = registry:2.3.0, ...
  Version = 189  (the Raft index when it was last updated)
Sequencer

Original object:
Service ABC
  Spec: Replicas = 4, Image = registry:2.3.0, ...
  Version = 189

Update request:
Service ABC
  Spec: Replicas = 4, Image = registry:2.4.0, ...
  Version = 189
Sequencer

Updated object:
Service ABC
  Spec: Replicas = 4, Image = registry:2.4.0, ...
  Version = 190
Sequencer

Updated object:
Service ABC
  Spec: Replicas = 4, Image = registry:2.4.0, ...
  Version = 190

Update request:
Service ABC
  Spec: Replicas = 5, Image = registry:2.3.0, ...
  Version = 189

The second request's base Version (189) no longer matches the stored Version (190), so it is rejected; the client must re-read the object and retry.
github.com/docker/swarmkit
github.com/coreos/etcd (Raft)
THANK YOU