Kafka Needs No Keeper
Colin McCabe
Introduction
● Kafka has gotten its mileage out of ZooKeeper
● But it is still a second system
● KIP-500 has been adopted by the community
● This is not a 1:1 replacement
● We have been headed in this direction for years
Evolution of Apache Kafka Clients
Producer: write to topics
Consumer: read from topics, offset fetch/commit, group partition assignment
Admin Tools: topic create/delete
Consumer Group Coordinator
Consumer: read from topics, offset fetch/commit, group partition assignment
Offsets are stored in the __offsets topic
Consumer APIs:
● Fetch
● OffsetCommit
● OffsetFetch
● JoinGroup
● SyncGroup
● Heartbeat
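To ground these APIs, here is a minimal Java consumer sketch (not part of the original deck; the broker address, topic "foo", and group id "my-group" are hypothetical). subscribe() and poll() drive JoinGroup, SyncGroup, Heartbeat, and Fetch under the hood, while commitSync() issues OffsetCommit against the offsets topic.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
        props.put("group.id", "my-group");                  // group managed by the coordinator
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe() triggers JoinGroup/SyncGroup; poll() sends Fetch requests
            // and keeps the group membership alive via Heartbeat
            consumer.subscribe(Collections.singletonList("foo"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
            // commitSync() issues OffsetCommit; the broker stores the offsets
            // in the internal offsets topic
            consumer.commitSync();
        }
    }
}
```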
Consumer and Producer now go through the brokers
Admin Tools: create/delete topics
Kafka Security and the Admin Client
Old admin tools create/delete topics directly in ZooKeeper, bypassing broker-side ACL Enforcement
The AdminClient creates/deletes topics through the brokers, where ACLs are enforced
Admin APIs:
● CreateTopics
● DeleteTopics
● AlterConfigs
● ...
Producer, Consumer, and AdminClient all go through the broker APIs
Client APIs:
● Produce
● Fetch
● Metadata
● CreateTopics
● DeleteTopics
● ...
Benefits:
● Encapsulation
● Security
● Validation
● Compatibility
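As a hedged illustration of the AdminClient path (not from the deck; the topic names "foo" and "bar" and the replication settings are hypothetical), the sketch below sends CreateTopics and DeleteTopics requests to the brokers, where validation and ACL enforcement happen.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class AdminExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // CreateTopics: the broker validates the request and enforces ACLs
            admin.createTopics(Collections.singleton(
                new NewTopic("foo", 3, (short) 3))).all().get();

            // DeleteTopics goes through the same broker-side path
            admin.deleteTopics(Collections.singleton("bar")).all().get();
        }
    }
}
```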
Inter-Broker Communication
Brokers use ZooKeeper for:
● Broker Registration
● ACL Management
● Dynamic Configuration
● ISR Management
● Controller Election
The controller pushes state to the brokers: Leader/ISR push, update metadata, stop/delete replica
Controller APIs:
● LeaderAndIsr
● UpdateMetadata
● StopReplica
● AlterIsr
Benefits:
● Encapsulation
● Compatibility
● Ownership
Broker Liveness
ZK Session
The broker registers an ephemeral znode:
/brokers/1 -> { host: 10.10.10.1:9092, rack: rack-1 }
When the broker's ZK session expires, the znode is removed and a watch fires:
Watch trigger: Broker 1 is offline
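A minimal Java sketch of this liveness mechanism, assuming a local ZooKeeper and a simplified payload (this is illustrative, not Kafka's actual registration code):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class BrokerLivenessSketch {
    public static void main(String[] args) throws Exception {
        // The session timeout determines how quickly a dead broker is noticed
        ZooKeeper zk = new ZooKeeper("localhost:2181", 6000, event -> { });

        // The broker registers itself with an EPHEMERAL znode; it disappears
        // automatically when the ZK session expires.
        byte[] payload = "{ host: 10.10.10.1:9092, rack: rack-1 }".getBytes();
        zk.create("/brokers/1", payload, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // An observer (e.g. the controller) sets a watch; when the znode vanishes,
        // the watch fires and the broker is considered offline.
        zk.exists("/brokers/1", (WatchedEvent event) -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                System.out.println("Watch trigger: Broker 1 is offline");
            }
        });
    }
}
```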
Network Partition Resilience
● Case 1: Total partition
● Case 2: Broker partition
● Case 3: ZK partition
● Case 4: Controller partition
Metadata Inconsistency
Metadata Source of Truth: ZooKeeper
● Controller metadata cache: sync writes, async updates
● Broker metadata caches: async updates
Last resort when broker metadata has diverged:
> rmr /controller
● A new controller is elected
● It loads ALL metadata from ZooKeeper
● It pushes ALL metadata to the brokers
But how do you know the metadata has diverged?
Performance of Controller Initialization
● New controller!
● Load ALL metadata: complexity O(N), where N = number of partitions
● Push ALL metadata: complexity O(N*M), where N = number of partitions and M = number of brokers
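To make the asymptotics concrete with hypothetical numbers (not from the talk): with N = 100,000 partitions and M = 100 brokers, a controller failover loads 100,000 partition entries from ZooKeeper and then pushes on the order of N*M = 10,000,000 partition states out to the brokers.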
Metadata as an Event Log
● Each change becomes a message
● Changes are propagated to all brokers
● Clear ordering
● Can send deltas
● Offset tracks consumer position
● Easy to measure lag
...
924 Create topic "foo"
925 Delete topic "bar"
926 Add node 4 to the cluster
927 Create topic "baz"
928 Alter ISR for "foo-0"
929 Add node 5 to the cluster
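A toy Java sketch of these properties, using the log entries above (illustrative only; this is not the actual KIP-500 metadata record format):

```java
import java.util.List;

public class MetadataLogSketch {
    // Each change becomes a message at a specific offset in the log.
    record MetadataEvent(long offset, String description) { }

    public static void main(String[] args) {
        List<MetadataEvent> log = List.of(
            new MetadataEvent(924, "Create topic \"foo\""),
            new MetadataEvent(925, "Delete topic \"bar\""),
            new MetadataEvent(926, "Add node 4 to the cluster"),
            new MetadataEvent(927, "Create topic \"baz\""),
            new MetadataEvent(928, "Alter ISR for \"foo-0\""),
            new MetadataEvent(929, "Add node 5 to the cluster"));

        long brokerPosition = 926;  // the highest offset this broker has already applied

        // A broker replays only the delta past its current position, in order.
        for (MetadataEvent event : log) {
            if (event.offset() > brokerPosition) {
                System.out.println("apply " + event.offset() + ": " + event.description());
            }
        }

        // Lag is simply the distance from the broker's position to the log end.
        long lag = log.get(log.size() - 1).offset() - brokerPosition;
        System.out.println("lag = " + lag);
    }
}
```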
Consumers track their position in a topic by offset (offset=1, offset=2, offset=3)
Brokers can track their position in the metadata log the same way
Who do the brokers fetch the metadata log from? The Controller
Implementing the Controller Log
Can we use the existing Kafka log replication protocol?
● How do we elect the leader?
We need a self-managed quorum.
Enter Raft. Leader election is by simple majority.
Kafka vs. Raft:
● Writes: single leader (Kafka); single leader (Raft)
● Fencing: monotonically increasing epoch (Kafka); monotonically increasing term (Raft)
● Log reconciliation: offset and epoch (Kafka); term and index (Raft)
● Push/pull: replicas pull (Kafka); leader pushes (Raft)
● Commit semantics: ISR (Kafka); majority (Raft)
● Leader election: from the ISR, through ZooKeeper (Kafka); majority (Raft)
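A small Java sketch of "simple majority" election and commit, assuming a fixed quorum size (illustrative only, not Kafka's Raft implementation):

```java
// Illustrative sketch of Raft-style majority rules; not the actual Kafka Raft code.
public class MajorityQuorumSketch {
    // A candidate wins the election (or an entry is committed) once it is
    // acknowledged by more than half of the quorum.
    static boolean hasMajority(int acks, int quorumSize) {
        return acks > quorumSize / 2;
    }

    public static void main(String[] args) {
        int quorumSize = 5;  // e.g. a 5-node controller quorum
        System.out.println(hasMajority(2, quorumSize));  // false: 2 of 5 is not a majority
        System.out.println(hasMajority(3, quorumSize));  // true: 3 of 5 is a majority
    }
}
```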
The Controller Quorum
The Controller Raft Quorum
● The leader is the active controller
● Controls reads / writes to the log
● Typically 3 or 5 nodes, like ZK
Instant Failover
● Low-latency failover via Raft election
● Standbys contain all data in memory
● Brokers do not need to re-fetch
Metadata Caching
● Brokers can persist metadata to disk (e.g. /mnt/logs/kafka/metadata)
● Only fetch what they need
● Use snapshots if we're too far behind
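A hedged Java sketch of the catch-up decision a broker would make on startup; the interface and method names here are hypothetical, not actual Kafka classes:

```java
// Illustrative sketch of broker-side metadata catch-up; names are hypothetical.
public class MetadataCatchUpSketch {
    interface MetadataLogClient {
        long logStartOffset();                 // earliest offset still retained in the log
        void fetchAndApply(long fromOffset);   // apply only the deltas from fromOffset onward
        void fetchAndApplySnapshot();          // load a full snapshot instead
    }

    static void catchUp(MetadataLogClient log, long persistedOffset) {
        if (persistedOffset < log.logStartOffset()) {
            // Too far behind: the deltas we need are gone, so load a snapshot.
            log.fetchAndApplySnapshot();
        } else {
            // Otherwise only fetch what we are missing.
            log.fetchAndApply(persistedOffset + 1);
        }
    }
}
```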
Broker Registration
● Brokers send heartbeats to the active controller
● The controller uses this to build a map of the cluster: what brokers exist, and how can they be reached?
● The controller also tells brokers if they should be fenced or shut down
Fencing
● Brokers need to be fenced if they're partitioned from the controller, or can't keep up
● Brokers self-fence if they can't talk to the controller
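A hedged sketch of this heartbeat-and-fence behavior in Java; the interface, method names, and timeouts below are hypothetical, not the actual KIP-500 heartbeat RPC:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative sketch of broker heartbeating and self-fencing; names and values are hypothetical.
public class HeartbeatSketch {
    interface ControllerChannel {
        // Sends a heartbeat; returns true if the controller says we may keep serving.
        boolean sendHeartbeat(int brokerId) throws Exception;
    }

    static void heartbeatLoop(ControllerChannel controller, int brokerId) throws InterruptedException {
        Duration sessionTimeout = Duration.ofSeconds(18);  // hypothetical value
        Instant lastSuccess = Instant.now();
        boolean fenced = false;

        while (true) {
            try {
                boolean allowedToServe = controller.sendHeartbeat(brokerId);
                lastSuccess = Instant.now();
                fenced = !allowedToServe;  // the controller can tell us to fence ourselves
            } catch (Exception e) {
                // Could not reach the controller; self-fence once we've been out of
                // contact for longer than the session timeout.
                if (Duration.between(lastSuccess, Instant.now()).compareTo(sessionTimeout) > 0) {
                    fenced = true;
                }
            }
            if (fenced) {
                System.out.println("Broker " + brokerId + " is fenced: rejecting client requests");
            }
            Thread.sleep(2_000);  // hypothetical heartbeat interval
        }
    }
}
```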
Handling Network Partitions
● Case 1: Total partition
● Case 2: Broker partition
● Case 3: Controller partition
Deployment: current vs. KIP-500
● Configuration files: Kafka and ZooKeeper (current); Kafka only (KIP-500)
● Metrics: Kafka and ZK (current); Kafka only (KIP-500)
● Administrative tools: ZK shell, four-letter words, Kafka tools (current); Kafka tools only (KIP-500)
● Security: Kafka and ZK (current); Kafka only (KIP-500)
Shared Controller Nodes
● Fewer resources used
● Single-node clusters (eventually)
Separate Controller Nodes
● Better resource isolation
● Good for big clusters
Roadmap
● Remove client-side ZK dependencies
● Remove broker-side ZK dependencies
● Controller quorum

Remove client-side ZK dependencies: incremental KIP-4 improvements
● Create new APIs
● Deprecate direct ZK access

Remove broker-side ZK dependencies: broker-side fixes
● Remove deprecated direct ZK access for tools
● Create broker-side APIs
● Centralize ZK access in the controller

Controller quorum: first release without ZooKeeper
● Raft
● Controller quorum
Upgrade Issues
● Tools using ZK
● Brokers accessing ZK
● State in ZK

Older Kafka Release → Bridge Release → KIP-500 Release
Bridge Release
● No ZK access from tools or brokers (except the controller)
Upgrading
● Start from the bridge release
● Start new controller nodes (possibly combined); the quorum elects a leader, which claims leadership in ZK
● Roll nodes one by one as usual; the controller continues sending LeaderAndIsr, etc. to old nodes
● When all brokers have been rolled, decommission the ZK nodes
Conclusion
Apache ZooKeeper has served us well
● KIP-500 is not a 1:1 replacement, but a different paradigm
We have already started removing ZK from clients
● Consumer, AdminClient
● Improved encapsulation, security, upgradability
Metadata should be managed as a log
● Deltas, ordering, caching
● Controller failover, fencing
● Improved scalability, robustness, easier deployment
The metadata log must be self-managed
● Raft
● Controller quorum
It will take a few releases to implement KIP-500
● Additional KIPs for APIs, Raft, Metadata, etc.
Rolling upgrades will be supported
● Bridge release
● Post-ZK release
Kafka needs no Keeper