When it Absolutely, Positively, Has to be There: Reliability Guarantees in Kafka, Gwen Shapira, Jeff Holoman
Transcript of When it Absolutely, Positively, Has to be There: Reliability Guarantees in Kafka, Gwen Shapira, Jeff Holoman
Page 1
When it absolutely, positively, has to be there
Reliability Guarantees in Apache Kafka
@jeffholoman @gwenshap
Page 2

Kafka
• High Throughput
• Low Latency
• Scalable
• Centralized
• Real-time
Page 3

“If data is the lifeblood of high technology, Apache Kafka is the circulatory system”
-- Todd Palino, Kafka SRE @ LinkedIn
Page 4

If Kafka is a critical piece of our pipeline
• Can we be 100% sure that our data will get there?
• Can we lose messages?
• How do we verify?
• Whose fault is it?
Page 5

Distributed Systems
• Things fail
• Systems are designed to tolerate failure
• We must expect failures and design our code and configure our systems to handle them
Page 6

[Diagram: producer data flow across the network. On the client machine, an application thread hands data to the Kafka client, which sends it asynchronously through the O/S socket buffer and NIC; on the broker machine it passes through the NIC and O/S socket buffer to the broker, into the page cache, and finally to disk. An ack or exception comes back via the async callback. ✗ marks show that a failure is possible at every hop.]
Page 7

[Diagram: consumer data flow. Data travels from the broker's disk or page cache (a cache miss forces a disk read) through the O/S socket buffer and NIC, across the network, and through the client machine's NIC and O/S socket buffer to the Kafka client and application thread. Consumed offsets are stored in ZooKeeper or Kafka. ✗ marks show that a failure is possible at every hop.]
Page 8

Replication is your friend
• Kafka protects against failures by replicating data
• The unit of replication is the partition
• One replica is designated as the Leader
• Follower replicas fetch data from the leader
• The leader holds the list of “in-sync” replicas
Page 9

Replication and ISRs

[Diagram: a producer writing to partitions 0, 1, and 2 of my_topic, each replicated across Brokers 100, 101, and 102]

Topic: my_topic, Partitions: 3, Replicas: 3
• Partition 0 – Leader: 100, ISR: 101,102
• Partition 1 – Leader: 101, ISR: 100,102
• Partition 2 – Leader: 102, ISR: 101,100
Page 10

ISR
• 2 things make a replica in-sync
  – Lag behind leader
    • replica.lag.time.max.ms – replica that didn’t fetch or is behind
    • replica.lag.max.messages – has gone away in 0.9
  – Connection to Zookeeper
Page 11

Terminology
• Acked
  – Producers will not retry sending
  – Depends on producer setting
• Committed
  – Consumers can read
  – Only when the message got to all ISR
• replica.lag.time.max.ms
  – how long can a dead replica prevent consumers from reading?
Page 12

Replication
• Acks = all – only waits for in-sync replicas to reply

[Diagram: Replicas 1, 2, and 3 each hold offset 100]
Page 13

Replication
• Replica 3 stopped replicating for some reason

[Diagram: Replicas 1 and 2 hold offsets 100–101; Replica 3 holds only 100. Offset 100 is acked with acks = all and “committed”; offset 101 is acked with acks = 1 but not “committed”]
Page 14

Replication
• One replica drops out of ISR, or goes offline
• All messages are now acked and committed

[Diagram: Replicas 1 and 2 hold offsets 100–101; Replica 3, out of the ISR, still holds only 100]
Page 15

Replication
• 2nd Replica drops out, or is offline

[Diagram: Replica 1 holds offsets 100–104; Replica 2 holds 100–101; Replica 3 holds 100]
Page 16

Replication
• Now we’re in trouble

[Diagram: Replica 1, the only replica with offsets 100–104, fails (✗); Replicas 2 and 3 are behind]
Page 17

Replication
• If Replica 2 or 3 comes back online before the leader, you will lose data.

[Diagram: Replica 1 held offsets 100–104, all of them “acked” and “committed”; Replicas 2 and 3 are missing 102–104]
Page 18

So what to do
• Disable Unclean Leader Election
  – unclean.leader.election.enable = false
• Set replication factor
  – default.replication.factor = 3
• Set minimum ISRs
  – min.insync.replicas = 2
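The three settings above can be collected into a broker-side server.properties fragment. The values follow the slide; the fragment itself is illustrative, adjust for your cluster:

```properties
# Don't allow an out-of-sync replica to become leader (prevents data loss)
unclean.leader.election.enable=false
# Replicate each new partition to 3 brokers by default
default.replication.factor=3
# With acks=all, refuse writes unless at least 2 replicas are in sync
min.insync.replicas=2
```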
Page 19

Warning
• min.insync.replicas is applied at the topic level
• Must alter the topic configuration manually if the topic was created before the server-level change
• Must manually alter the topic on versions < 0.9.0 (KAFKA-2114)
Page 20

Replication
• Replication = 3
• Min ISR = 2

[Diagram: Replicas 1, 2, and 3 each hold offset 100]
Page 21

Replication
• One replica drops out of ISR, or goes offline

[Diagram: Replicas 1 and 2 hold offsets 100–101; Replica 3 holds 100]
Page 22

Replication
• 2nd Replica fails out, or is out of sync

[Diagram: Replica 1 holds offsets 100–101; with only one in-sync replica and Min ISR = 2, messages 102–104 buffer in the producer instead of being written]
Page 23

Page 24
Producer Internals
• Producer sends batches of messages to a buffer

[Diagram: application threads call send(); messages M0–M3 accumulate into Batch 1–3 in the buffer; a sender drains batches to the broker. On a failed response the batch may be retried; otherwise the response updates the Future and fires the callback with metadata or an exception]
Page 25

Basics
• Durability can be configured with the producer configuration request.required.acks
  – 0: the message is written to the network (buffer)
  – 1: the message is written to the leader
  – all: the producer gets an ack after all ISRs receive the data; the message is committed
• Make sure the producer doesn’t just throw messages away!
  – block.on.buffer.full = true
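A minimal sketch of these durability settings as producer configuration, using the new-producer property names from the slides (acks, block.on.buffer.full, retries); the broker list is a placeholder:

```java
import java.util.Properties;

public class ProducerConfigSketch {
    // Build producer properties tuned for durability over latency
    public static Properties reliableProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder hosts
        // Wait for all in-sync replicas before considering a send successful
        props.put("acks", "all");
        // Block send() instead of dropping messages when the buffer fills up
        props.put("block.on.buffer.full", "true");
        // Retry failed sends rather than giving up immediately
        props.put("retries", String.valueOf(Integer.MAX_VALUE));
        return props;
    }
}
```

These properties would then be passed to the KafkaProducer constructor; nothing here requires a running broker.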
Page 26

“New” Producer
• All calls are non-blocking async
• 2 options for checking for failures:
  – Immediately block for the response: send().get()
  – Do follow-up work in a Callback; close the producer after an error threshold
• Be careful about buffering these failures. Future work? KAFKA-1955
• Don’t forget to close the producer! producer.close() will block until in-flight requests complete
• retries (producer config) defaults to 0
• message.send.max.retries (server config) defaults to 3
• In-flight requests could lead to message re-ordering
Page 27

Page 28
Consumer
• Three choices for Consumer API
  – Simple Consumer
  – High Level Consumer (ZookeeperConsumer)
  – New KafkaConsumer
Page 29

New Consumer – attempt #1

```java
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "10000"); // commit automatically every 10 seconds

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        processAndUpdateDB(record);
    }
}
```

What if we crash after 8 seconds?
Page 30

New Consumer – attempt #2

```java
props.put("enable.auto.commit", "false");

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        processAndUpdateDB(record);
        consumer.commitSync();
    }
}
```

What are you really committing?
Page 31

New Consumer – attempt #3

```java
props.put("enable.auto.commit", "false");

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        processAndUpdateDB(record);
        TopicPartition tp = new TopicPartition(record.topic(), record.partition());
        OffsetAndMetadata oam = new OffsetAndMetadata(record.offset() + 1);
        consumer.commitSync(Collections.singletonMap(tp, oam));
    }
}
```

Is this fast enough?
Page 32

New Consumer – attempt #4

```java
props.put("enable.auto.commit", "false");

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));

int counter = 0;
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(500);
    for (ConsumerRecord<String, String> record : records) {
        counter++;
        processAndUpdateDB(record);
        if (counter % 100 == 0) {
            TopicPartition tp = new TopicPartition(record.topic(), record.partition());
            OffsetAndMetadata oam = new OffsetAndMetadata(record.offset() + 1);
            consumer.commitSync(Collections.singletonMap(tp, oam));
        }
    }
}
```
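One subtlety in attempt #4: each commit covers only the partition of the record that happened to be the 100th. A safer variant tracks the highest processed offset per partition and commits offset + 1 for all of them. A minimal, broker-free sketch of that bookkeeping; the class and its "topic-partition" string keys are illustrative, not a Kafka API:

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetTracker {
    // Highest offset processed so far, keyed by "topic-partition"
    private final Map<String, Long> processed = new HashMap<>();

    // Call after each record is fully processed
    public void record(String topic, int partition, long offset) {
        String key = topic + "-" + partition;
        Long prev = processed.get(key);
        if (prev == null || offset > prev) {
            processed.put(key, offset);
        }
    }

    // Offsets to commit: always the NEXT offset to read, i.e. processed + 1
    public Map<String, Long> offsetsToCommit() {
        Map<String, Long> commit = new HashMap<>();
        for (Map.Entry<String, Long> e : processed.entrySet()) {
            commit.put(e.getKey(), e.getValue() + 1);
        }
        return commit;
    }
}
```

In a real consumer loop you would call record(...) for every processed record and pass the resulting map (converted to TopicPartition/OffsetAndMetadata) to commitSync every N records.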
Page 33
Almost.
Page 34

Consumer Offsets

[Diagram: messages P0–P6 in a partition; the commit marker trails the processing position, so a crash (✗) means everything after the committed offset is consumed again]
Page 35

Consumer Offsets

[Diagram: messages P0–P6 fanned out from a single consumer to Threads 1–4; committing offsets while threads are still processing produces duplicates]
Page 36

Rebalance Listener

```java
public class MyRebalanceListener implements ConsumerRebalanceListener {
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        commitOffsets();
    }
}
```

```java
consumer.subscribe(Arrays.asList("foo", "bar"), new MyRebalanceListener());
```

Careful! This method will need to know the topic, partition and offset of the last record you got.
Page 37

At Least Once Consuming
1. Commit your own offsets – set auto.commit.enable = false
2. Use a Rebalance Listener to limit duplicates
3. Make sure you commit only what you are done processing
4. Note: the new consumer is single-threaded – one consumer per thread
Page 38

Exactly Once Semantics
• At most once is easy
• At least once is not bad either – commit after you are 100% sure the data is safe
• Exactly once is tricky
  – Commit data and offsets in one transaction
  – Idempotent producer
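The "commit data and offsets in one transaction" idea can be sketched without a real database: apply the record's effect and advance the offset in a single atomic step, so a restart always sees a consistent pair. This in-memory class is a stand-in for a real transactional store, purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class TransactionalStore {
    private final List<String> data = new ArrayList<>();
    private long committedOffset = -1;

    // Apply the record and advance the offset atomically:
    // a reader never observes the data without its matching offset
    public synchronized void commit(String record, long offset) {
        data.add(record);
        committedOffset = offset;
    }

    // On restart, resume from the offset after the last committed record,
    // so each record is applied exactly once
    public synchronized long resumeOffset() {
        return committedOffset + 1;
    }

    public synchronized int size() {
        return data.size();
    }
}
```

With a real database, commit(...) would be one SQL transaction writing both the processed result and the offset row.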
Page 39

Using External Store
• Don’t use commitSync()
• Implement your own “commit” that saves both data and offsets to the external store
• Use the RebalanceListener to find the correct offset
Page 40

Seeking right offset

```java
public class SaveOffsetsOnRebalance implements ConsumerRebalanceListener {
    private Consumer<?, ?> consumer;

    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // save the offsets in an external store using some custom code not described here
        for (TopicPartition partition : partitions)
            saveOffsetInExternalStore(consumer.position(partition));
    }

    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // read the offsets from an external store using some custom code not described here
        for (TopicPartition partition : partitions)
            consumer.seek(partition, readOffsetFromExternalStore(partition));
    }
}
```
Page 41

Monitoring for Data Loss
• Monitor for producer errors – watch the retry numbers
• Monitor consumer lag – MaxLag or via offsets
• Standard schema: each message should contain a timestamp and the originating service and host
• Each producer can report message counts and offsets to a special topic
• “Monitoring consumer” reports message counts to another special topic
• “Important consumers” also report message counts
• Reconcile the results
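The reconciliation step can be sketched as comparing per-topic message counts reported by producers and consumers; any shortfall flags possible loss. The class and the topic names in the usage below are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class CountReconciler {
    // Returns topics where consumers saw fewer messages than producers sent,
    // mapped to the number of missing messages
    public static Map<String, Long> findLoss(Map<String, Long> produced,
                                             Map<String, Long> consumed) {
        Map<String, Long> loss = new HashMap<>();
        for (Map.Entry<String, Long> e : produced.entrySet()) {
            long sent = e.getValue();
            long seen = consumed.getOrDefault(e.getKey(), 0L);
            if (seen < sent) {
                loss.put(e.getKey(), sent - seen);
            }
        }
        return loss;
    }
}
```

In practice the two maps would be built by consuming the special monitoring topics; here they are plain inputs so the logic is testable on its own.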
Page 42

Be Safe, Not Sorry
• acks = all
• block.on.buffer.full = true
• retries = MAX_INT
• (max.in.flight.requests.per.connection = 1)
• producer.close()
• replication factor >= 3
• min.insync.replicas = 2
• unclean.leader.election.enable = false
• enable.auto.commit = false
• Commit after processing
• Monitor!