Apache Kafka Topic Design
August 2018
Today’s Agenda
● Key concepts: brokers, producers, consumers, topics, partitions, keys
● Topic design
● Partition design
● Key design
● Design process
Introducing Instaclustr
Brokers, Producers and Consumers

Producers
● Applications using a Kafka producer library
● Generate and send messages to the brokers

Brokers
● The “Kafka Cluster”
● Receive and store messages from producers
● Make messages sequentially available to consumers
● Replicate messages for HA

Consumers
● Applications using a Kafka consumer library to read messages
● Can be grouped to spread work across multiple consumer instances
● Messages can be consumed multiple times
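The producer/broker/consumer relationship above can be sketched with a toy in-memory model (this is not the real Kafka client API; `ToyBroker`, its methods, and the group names are all illustrative). The point it shows is the last bullet: each consumer group keeps its own read position, so the same messages can be consumed multiple times.

```python
# Toy in-memory sketch of the broker / consumer-group model (illustrative
# only, not the Kafka client API): the broker keeps an ordered log, and each
# consumer group tracks its own offset, so every group sees every message.

class ToyBroker:
    def __init__(self):
        self.log = []        # ordered, append-only message log
        self.offsets = {}    # per-consumer-group read position

    def produce(self, message):
        self.log.append(message)

    def consume(self, group):
        """Return the next unread message for this group, or None."""
        pos = self.offsets.get(group, 0)
        if pos >= len(self.log):
            return None
        self.offsets[group] = pos + 1
        return self.log[pos]

broker = ToyBroker()
for m in ["m1", "m2", "m3"]:
    broker.produce(m)

# Two independent groups each see every message, in order.
analytics = [broker.consume("analytics") for _ in range(3)]
billing = [broker.consume("billing") for _ in range(3)]
assert analytics == billing == ["m1", "m2", "m3"]
```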
Topics and Partitions
[Diagram: a topic’s four partitions distributed across multiple brokers, each partition with a master and a replica hosted on different brokers. Legend: Broker, Topic, Partition - Master, Partition - Replica]
● Topic
○ Logical grouping of data
○ Settings such as replication, number of partitions, log retention, compaction, etc. are controllable at the topic level
● Partition
○ Subset of messages in a topic that:
■ Have a single master broker
■ Guarantee ordered delivery within that subset
■ Within consumer groups, 1 consumer is assigned to read from each partition
○ Number of partitions is set on topic creation
○ Messages are mapped to partitions by key
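The key-to-partition mapping in the last bullet can be sketched in a few lines. Kafka's default partitioner hashes the key with murmur2; `zlib.crc32` stands in here purely as an illustrative stable hash, since the principle is the same: partition = hash(key) % number of partitions.

```python
# Sketch of keyed partition assignment (illustrative: Kafka's default
# partitioner uses murmur2, crc32 is a stand-in stable hash).
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

NUM_PARTITIONS = 4

# The same key always maps to the same partition -- this is what gives
# per-key ordered delivery.
p1 = partition_for(b"customer-42", NUM_PARTITIONS)
p2 = partition_for(b"customer-42", NUM_PARTITIONS)
assert p1 == p2
assert 0 <= p1 < NUM_PARTITIONS
```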
Topic Design
● Minimum number of topics is implied by the number of distinct retention, replication, etc. settings you require
○ -> you probably don’t want to mix message types with different scalability or latency requirements
● Maximum number of topics is largely limited by imagination
● In between is a set of design trade-offs:

Fewer Topics                                               | More Topics
Consumers may have to filter messages                      | Consumers can read only from topics they care about
Less processing overhead of managing masters and consumers | Slower restarts, other processing overheads
Less configuration to manage                               | More flexibility in configuration

● In general, pick the minimum number of topics that allows the required replication, retention, etc. settings, separates message types with different scale or latency profiles, and does not result in consumers reading excessive numbers of extra messages
Partition Design
● Partitions are the fundamental enabler of scale in Kafka
○ You can’t have more master brokers for a topic than partitions
○ You can’t have more than one consumer (in a consumer group) reading from a partition
● Too many partitions per broker can lead to long failover/restart times and higher replication latency
○ The number of partitions can be increased over time, but this may be a complex operation
● The “just right” number of partitions is therefore the greater of:
○ total target throughput divided by max throughput per broker, or
○ total target throughput divided by max throughput per consumer
○ !! assumes that data is equally distributed to partitions !!
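The “greater of” rule above can be written out as a short helper. The throughput numbers in the example are illustrative assumptions, not benchmarks:

```python
# The slide's rule of thumb: the "just right" partition count is the greater
# of (total throughput / per-broker max) and (total throughput / per-consumer
# max), assuming messages are evenly distributed across partitions.
import math

def min_partitions(total_mbps: float, broker_mbps: float,
                   consumer_mbps: float) -> int:
    return math.ceil(max(total_mbps / broker_mbps,
                         total_mbps / consumer_mbps))

# Illustrative numbers: 100 MB/s target, brokers sustain 25 MB/s each,
# consumers sustain 10 MB/s each -> consumers are the bottleneck.
assert min_partitions(100, 25, 10) == 10
```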
Key Design
● Messages may optionally have a key
● Where a key is defined it is used to map the message to a particular partition -> messages with the same key have guaranteed ordered delivery (no key = round robin partition assignment)
● Keys are also vital when compaction is used: only the most recent value for a key is retained
● Keys are required for Kafka Streams and some Kafka Connect functions
● Choice of key will be significantly driven by functional design
● However, poor keys can lead to performance issues as partitions receive uneven load
● The number of potential key values (cardinality) will be determined by your problem domain but, in general, more is good
● If using the default partitioner, you want potential key values to number at least 10x the number of partitions
● Ideally, message volume is roughly equal per key value
○ If there is large deviation but a high number of key values, it may be OK
○ If one key value >> average volume, it is likely an issue -> split to a separate topic or use bucketing
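The hot-key problem and the bucketing fix can be illustrated with a toy simulation. This is not Kafka's real partitioner (which uses murmur2): `toy_partition`, the key names, and the message counts are all illustrative. Bucketing appends a rotating suffix to the hot key so its load spreads across partitions, at the cost of losing ordering across the buckets.

```python
# Toy illustration of key skew and bucketing (illustrative only).
from collections import Counter

NUM_PARTITIONS = 4

def toy_partition(key: str) -> int:
    # Stand-in hash for illustration; Kafka's default partitioner uses murmur2.
    return sum(key.encode()) % NUM_PARTITIONS

# 1000 messages for one hot key, 10 each for 30 normal keys.
messages = ["hot"] * 1000 + [f"key-{i}" for i in range(30) for _ in range(10)]

# Without bucketing: every "hot" message lands on a single partition.
skewed = Counter(toy_partition(k) for k in messages)

# With bucketing: the hot key becomes 8 rotating sub-keys ("hot-0".."hot-7").
bucketed = Counter(
    toy_partition(f"{k}-{i % 8}" if k == "hot" else k)
    for i, k in enumerate(messages)
)

# The busiest partition carries far less load after bucketing.
assert max(skewed.values()) >= 1000
assert max(bucketed.values()) < max(skewed.values())
```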
Key distribution
Design Process
● What topics do I need?
○ Are there distinct streams of message types that require different processing?
○ Are there different requirements for message retention or resiliency?
○ Would splitting by topic help to reduce consumer load?

For each topic:

● Do I care about ordering?
○ At what level (key) is ordering important?
○ Are there sufficient keys to distribute across Kafka partitions?
○ Is the message distribution per key relatively consistent?
● How many partitions?
○ What is the max expected throughput per broker and per consumer?
○ Partitions = total target throughput / min(broker throughput, consumer throughput) * buffer factor
○ Buffer factor depends on how evenly your keys distribute data to partitions
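The sizing formula above, as a worked example; the throughput numbers and the 1.5x buffer factor are illustrative assumptions, not benchmarks:

```python
# Partitions = total target throughput / min(broker, consumer throughput)
#              * buffer factor
# The buffer factor compensates for keys distributing data unevenly.
import math

def target_partitions(total_mbps: float, broker_mbps: float,
                      consumer_mbps: float, buffer_factor: float = 1.5) -> int:
    base = total_mbps / min(broker_mbps, consumer_mbps)
    return math.ceil(base * buffer_factor)

# 200 MB/s target, brokers sustain 50 MB/s, consumers 20 MB/s, with a 1.5x
# buffer because keys distribute somewhat unevenly across partitions.
assert target_partitions(200, 50, 20, 1.5) == 15  # ceil(200/20 * 1.5)
```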
The open source-as-a-service company, delivering reliability at scale.
Questions?