SwarmKit in Theory and Practice
Transcript of SwarmKit in Theory and Practice
Container Orchestration from Theory to Practice
Laura Frank & Stephen Day, Open Source Summit, Los Angeles, September 2017
Agenda
- Understanding the SwarmKit object model
- Node topology
- Distributed consensus and Raft
SwarmKit
An open source framework for building orchestration systems
https://github.com/docker/swarmkit
SwarmKit Object Model
Laura Frank & Stephen Day, Container Orchestration from Theory to Practice
Orchestration
A control system for your cluster
[Control-loop diagram: the Orchestrator O compares the desired state D against the observed state St and applies operations Δ to the Cluster C]
D = Desired State, O = Orchestrator, C = Cluster, St = State at time t, Δ = Operations to converge S to D
https://en.wikipedia.org/wiki/Control_theory
Convergence
A functional view
D = Desired State, O = Orchestrator, C = Cluster, St = State at time t
f(D, Sn-1, C) → Sn | min(S-D)
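The convergence function above can be read as a reconciliation step applied repeatedly: each step produces a new state closer to the desired one. A minimal sketch of that idea follows; the types and names are hypothetical, not SwarmKit's actual code:

```go
package main

import "fmt"

// State is a toy stand-in for observed cluster state.
type State struct{ Replicas int }

// converge is one application of f(D, S, C) → S': it applies at most
// one operation Δ per step (start or stop a single replica), moving
// the current state toward the desired state.
func converge(desired, current State) State {
	switch {
	case current.Replicas < desired.Replicas:
		current.Replicas++ // Δ: start a replica
	case current.Replicas > desired.Replicas:
		current.Replicas-- // Δ: stop a replica
	}
	return current
}

func main() {
	desired := State{Replicas: 3}
	s := State{Replicas: 0}
	// Iterate until (S - D) is minimized (here: zero).
	for s != desired {
		s = converge(desired, s)
		fmt.Println("replicas:", s.Replicas)
	}
}
```

The loop terminates exactly when observed state equals desired state, which is the fixed point the orchestrator drives toward.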
Observability and Controllability
The Problem
[Spectrum diagram, from low to high observability: Failure → Process State → User Input]
Data Model Requirements
- Represent differences in cluster state
- Maximize observability
- Support convergence
- Do this while being extensible and reliable
Show me your data structures and I’ll show you your orchestration system
Services
- Express desired state of the cluster
- Abstraction to control a set of containers
- Enumerates resources, network availability, placement
- Leave the details of runtime to the container process
- Implement these services by distributing processes across a cluster
[Diagram: service tasks spread across Node 1, Node 2, Node 3]
Declarative

$ docker network create -d overlay backend
vpd5z57ig445ugtcibr11kiiz

$ docker service create -p 6379:6379 --network backend redis
pe5kyzetw2wuynfnt7so1jsyl

$ docker service scale serene_euler=3
serene_euler scaled to 3

$ docker service ls
ID            NAME          REPLICAS  IMAGE  COMMAND
pe5kyzetw2wu  serene_euler  3/3       redis
Reconciliation
Spec → Object
Object = Current State
Spec = Desired State
Demo Time!
Task Model
Prepare: set up resources
Start: start the task
Wait: wait until the task exits
Shutdown: stop the task cleanly
Runtime
Orchestrator
Task Model
The atomic scheduling unit of SwarmKit
Object = Current State, Spec = Desired State
[Diagram: Task0, Task1, …, Taskn feed into the Scheduler]
Data Flow
[Diagram: on the Manager, a Service holds a ServiceSpec, which embeds a TaskSpec; each Task created from the Service carries its own copy of the TaskSpec down to the Worker]
Consistency
Field Ownership
Only one component of the system can write to a field
Task State
[State-machine diagram: Manager-owned states: New → Allocated → Assigned. Worker-owned states: Preparing (pre-run) → Ready → Starting → Running. Terminal states: Complete, Shutdown, Failed, Rejected]
Field Handoff: Task Status

State        Owner
< Assigned   Manager
>= Assigned  Worker
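Because the handoff rule compares the state against Assigned, ordering the states makes ownership a one-line check. A hypothetical sketch of that single-writer rule (names invented for illustration):

```go
package main

import "fmt"

// TaskState values are declared in lifecycle order so ownership can
// be decided by comparing against Assigned, as in the handoff table.
type TaskState int

const (
	New TaskState = iota
	Allocated
	Assigned
	Preparing
	Ready
	Starting
	Running
	Complete
)

// owner reports which component may write the task's status field:
// the Manager before Assigned, the Worker from Assigned onward.
func owner(s TaskState) string {
	if s < Assigned {
		return "manager"
	}
	return "worker"
}

func main() {
	for _, s := range []TaskState{New, Assigned, Running} {
		fmt.Println(s, owner(s))
	}
}
```

Because exactly one component may write each field, concurrent Manager and Worker updates can never conflict on the same data.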
SwarmKit Doesn’t Quit!
Node Topology in SwarmKit
We’ve got a bunch of nodes… now what?
Push Model vs Pull Model
[Diagram. Push model: (1) the Worker registers in a discovery system such as ZooKeeper; (2) the Manager discovers the Worker; (3) the Manager pushes the payload to it. Pull model: the Worker connects directly to the Manager for both registration and payload.]
Push Model
- Pros: better control over the communication rate
  - Managers decide when to contact Workers
- Cons: requires a discovery mechanism
  - More failure scenarios
  - Harder to troubleshoot

Pull Model
- Pros: simpler to operate
  - Workers connect to Managers and don't need to bind
  - Can easily traverse networks
  - Easier to secure
  - Fewer moving parts
- Cons: Workers must maintain a connection to Managers at all times
Push vs Pull
- SwarmKit adopted the Pull model
- Favored operational simplicity
- Engineered solutions to provide rate control in pull mode
Rate Control in a Pull Model
Rate Control: Heartbeats
- Manager dictates the heartbeat rate to Workers
- Rate is configurable (though not by the end user)
- Managers agree on the same rate by consensus via Raft
- Managers add jitter so pings are spread over time (avoiding bursts)
[Diagram: Worker: "Ping?" Manager: "Pong! Ping me back in 5.2 seconds"]
Rate Control: Workloads
- Worker opens a gRPC stream to receive workloads
- Manager can send data whenever it wants to
- Manager sends data in batches
- Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first
- Adds little delay (at most 100 ms) but drastically reduces the amount of communication

[Diagram: Worker asks the Manager "Give me work to do"; the Manager streams batches over time: 100ms - [Batch of 12], 200ms - [Batch of 26], 300ms - [Batch of 32], 340ms - [Batch of 100], 360ms - [Batch of 100], 460ms - [Batch of 42], 560ms - [Batch of 23]]
Replication
Running multiple managers for high availability

[Diagram: three Managers (one Leader, two Followers) and a Worker]
- Worker can connect to any Manager
- Followers will forward traffic to the Leader
Replication
[Diagram: Leader, two Followers, several Workers]
- Followers multiplex all workers to the Leader using a single connection
- Backed by gRPC channels (HTTP/2 streams)
- Reduces the Leader's networking load by spreading connections evenly

Example: On a cluster with 10,000 workers and 5 managers, each manager only has to handle about 2,000 connections. Each follower forwards its 2,000 workers to the leader over a single socket.
Replication
[Diagram: the Leader fails and one of the Followers is elected the new Leader]
- Upon Leader failure, a new one is elected
- All managers start redirecting worker traffic to the new one
- Transparent to workers
Replication
[Diagram: the Worker holds the address list: Manager 1 Addr, Manager 2 Addr, Manager 3 Addr]
- Manager sends the list of all managers' addresses to Workers
- When a new manager joins, all workers are notified
- Upon manager failure, workers reconnect to a random remaining manager
Presence
Scalable presence in a distributed environment
- Leader commits Worker state (Up vs Down) into Raft
  - Propagates to all managers
  - Recoverable in case of leader re-election
- Heartbeat TTLs are kept in Leader memory
  - Storing "last ping time" in Raft would be too expensive: every ping would result in a quorum write
  - Leader keeps worker-to-TTL mappings in a heap (time.AfterFunc)
  - Upon leader failover, workers are given a grace period to reconnect
- Workers are considered Unknown until they reconnect
  - If they do, they move back to Up; if they don't, they move to Down
Distributed Consensus
The Raft Consensus Algorithm
Orchestration systems typically use some kind of service to maintain state in a distributed system:
- etcd
- ZooKeeper
- …
Many of these services are backed by the Raft consensus algorithm.
SwarmKit and Raft
Docker chose to implement the algorithm directly
- Fast
- Don't have to set up a separate service to get started with orchestration
- Differentiator between SwarmKit/Docker and other orchestration systems
demo.consensus.group
secretlivesofdata.com
Consistency
Sequencer
- Every object in the store has a Version field
- Version stores the Raft index when the object was last updated
- Updates must provide a base Version; they are rejected if it is out of date
- Similar to compare-and-swap (CAS)
- Also exposed through API calls that change objects in the store
Versioned Updates (Consistency)
// Read the current object, including its Version (the Raft index).
service := getCurrentService()
spec := service.Spec
spec.Image = "my.serv/myimage:mytag"
// Submit the new spec with the base Version; the store rejects
// the update if the object has changed since it was read.
update(spec, service.Version)
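When an update is rejected because the base Version is stale, a client can re-read and retry. A hypothetical sketch of that loop; the toy store below is invented for illustration and only mimics the sequencer's CAS behavior:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errOutOfDate = errors.New("base version out of date")

// store is a toy versioned object store: an update succeeds only
// when the caller's base version matches the stored one.
type store struct {
	mu      sync.Mutex
	image   string
	version uint64 // stands in for the Raft index
}

func (s *store) get() (string, uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.image, s.version
}

func (s *store) update(image string, base uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if base != s.version {
		return errOutOfDate // CAS failure: object changed since read
	}
	s.image = image
	s.version++ // a real store would record the new Raft index
	return nil
}

// setImage retries the read-modify-write until the CAS succeeds.
func setImage(s *store, image string) {
	for {
		_, v := s.get()
		if err := s.update(image, v); err == nil {
			return
		}
		// Someone else updated the object; re-read and try again.
	}
}

func main() {
	s := &store{image: "registry:2.3.0", version: 189}
	setImage(s, "registry:2.4.0")
	img, v := s.get()
	fmt.Println(img, v) // registry:2.4.0 190
}
```

This read-modify-write-retry pattern is the standard client-side answer to optimistic concurrency control.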
Sequencer

Original object:
Service ABC
  Spec: Replicas = 4, Image = registry:2.3.0, ...
  Version = 189  (the Raft index when it was last updated)
Sequencer

Original object:
Service ABC
  Spec: Replicas = 4, Image = registry:2.3.0, ...
  Version = 189

Update request:
Service ABC
  Spec: Replicas = 4, Image = registry:2.4.0, ...
  Version = 189
Sequencer

Updated object:
Service ABC
  Spec: Replicas = 4, Image = registry:2.4.0, ...
  Version = 190
Sequencer

Updated object:
Service ABC
  Spec: Replicas = 4, Image = registry:2.4.0, ...
  Version = 190

Update request:
Service ABC
  Spec: Replicas = 5, Image = registry:2.3.0, ...
  Version = 189

The second request's base Version (189) no longer matches the stored Version (190), so it is rejected; the client must re-read the object and retry.
github.com/docker/swarmkit
github.com/coreos/etcd (Raft)
THANK YOU