Kafkaesque days at LinkedIn in 2015
Kafkaesque days at LinkedIn in 2015
Joel Koshy, Kafka Summit 2016
Kafkaesque (adjective) Kaf·ka·esque \ˌkäf-kə-ˈesk, ˌkaf-\
: of, relating to, or suggestive of Franz Kafka or his writings; especially : having a nightmarishly complex, bizarre, or illogical quality
Merriam-Webster
Kafka @ LinkedIn
What @bonkoif said:
More clusters
More use-cases
More problems …
Kafka @ LinkedIn
Incidents that we will cover
● Offset rewinds
● Data loss
● Cluster unavailability
● (In)compatibility
● Blackout
Offset rewinds
What are offset rewinds?
[Diagram of a partition's log: invalid offsets (purged messages), then valid offsets, then invalid offsets (yet-to-arrive messages)]
If a consumer gets an OffsetOutOfRangeException:
auto.offset.reset ← earliest: reset to the earliest valid offset
auto.offset.reset ← latest: reset to the latest offset (the log end)
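The reset behavior above can be sketched as follows (illustrative Python, not the Kafka client's actual code; function and parameter names are hypothetical):

```python
def resolve_fetch_offset(consumer_offset, log_start, log_end, auto_offset_reset):
    """Mimic what happens when a consumer's fetch offset is checked.

    Valid offsets lie in [log_start, log_end); anything below log_start
    has been purged, anything at/above log_end has not arrived yet.
    """
    if log_start <= consumer_offset < log_end:
        return consumer_offset  # offset is valid; fetch proceeds normally
    # OffsetOutOfRangeException path: apply the auto.offset.reset policy
    if auto_offset_reset == "earliest":
        return log_start   # rewind to the oldest retained message (duplicates)
    if auto_offset_reset == "latest":
        return log_end     # skip ahead to the log end (potential data loss)
    raise ValueError("offset out of range and no reset policy configured")

# A consumer holding a purged offset:
print(resolve_fetch_offset(100, log_start=500, log_end=900, auto_offset_reset="earliest"))  # 500
print(resolve_fetch_offset(100, log_start=500, log_end=900, auto_offset_reset="latest"))    # 900
```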
What are offset rewinds… and why do they matter?
[Diagram: a pipeline connecting Kafka (PROD), a mirror maker, Kafka (CORP), a Hadoop push job, Stork, and email campaigns]
Real-life incident (courtesy of xkcd): an offset rewind
Offset rewinds: the first incident
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <[email protected]> Fri, Jul 10, 2015 at 8:27 PM
Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy on Wednesday, Jul 8, 2015 at 10:14 AM
(i.e., a duplicate notification for a deployment that had completed two days earlier)
What are offset rewinds… and why do they matter?
[Same pipeline diagram] Good practice to have some filtering logic here (before emails go out)
Offset rewinds: detection
[Charts of offset monitoring: "just use this"]
Offset rewinds: a typical cause
[Diagram: the consumer's position in the log relative to the valid offset range]
Unclean leader election truncates the log… and the consumer's offset goes out of range
But there were no ULEs when this happened… and we set auto.offset.reset to latest
Offset management - a quick overview
[Diagram: a consumer group of consumers fetching from brokers; consumers send periodic OffsetCommitRequests to the OffsetManager broker, and issue an OffsetFetchRequest after a rebalance]
Offset management - a quick overview
The __consumer_offsets topic stores offset commit messages, e.g.:
mirror-maker PageViewEvent-0 → 240
mirror-maker LoginEvent-8 → 456
mirror-maker LoginEvent-8 → 512
mirror-maker PageViewEvent-0 → 321
● New offset commits append to the topic
● Maintain an offset cache (the latest offset per group/partition) to serve offset fetch requests quickly:
mirror-maker PageViewEvent-0 → 321
mirror-maker LoginEvent-8 → 512
● Purge old offsets via log compaction
● When a new broker becomes the leader (i.e., offset manager) it loads offsets into its cache
See this deck for more details
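The cache described above is essentially "last write wins" over the __consumer_offsets log; a minimal sketch of the load (illustrative Python, not broker code):

```python
# Simulate how an offset manager builds its cache from __consumer_offsets.
# Each record is ((group, topic, partition), offset). Log compaction keeps
# only the latest record per key, and the in-memory cache mirrors that.
offsets_topic = [
    (("mirror-maker", "PageViewEvent", 0), 240),
    (("mirror-maker", "LoginEvent", 8), 456),
    (("mirror-maker", "LoginEvent", 8), 512),
    (("mirror-maker", "PageViewEvent", 0), 321),
]

def load_offsets(records):
    """Replay the log in order; later appends override earlier ones."""
    cache = {}
    for key, offset in records:
        cache[key] = offset
    return cache

cache = load_offsets(offsets_topic)
print(cache[("mirror-maker", "PageViewEvent", 0)])  # 321
print(cache[("mirror-maker", "LoginEvent", 8)])     # 512
```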
Back to the incident…
... <rebalance>
2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205
... <rebalance>
2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223
... <rebalance>
2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737
...
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
While debugging offset rewinds, do this first: dump the offsets topic:
./bin/kafka-console-consumer.sh --topic __consumer_offsets --zookeeper <zookeeperConnect> --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter" --consumer.config config/consumer.properties
(must set exclude.internal.topics=false in consumer.properties)
Inside the __consumer_offsets topic:
...
[mirror-maker,metrics_event,1]::OffsetAndMetadata[83511737,NO_METADATA,1433178005711]
[mirror-maker,some-log_event,13]::OffsetAndMetadata[6811737,NO_METADATA,1433178005711]   ← Jun 1 !!
...
[mirror-maker,some-log_event,13]::OffsetAndMetadata[9581223,NO_METADATA,1436495051231]   ← Jul 10 (today)
...
So why did the offset manager return a stale offset? Offset manager logs:
2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63]
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
    at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)
The leader moved, and the new offset manager hit KAFKA-2117 while loading offsets: the load aborted partway through, so the cache stopped at the old entry:
mirror-maker some-log_event,13 → 6811737
… caused a ton of offset resets
2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205
...
2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223
...
2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737
...
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
[Diagram: [some-log_event,13] log, offsets 846232 … 9581225; the stale offset fell in the purged range]
… but why the duplicate email?
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <[email protected]> Fri, Jul 10, 2015 at 8:27 PM
2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464
...
2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464
...
2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539   ← also from Jun 1
...
[crt-event, 12] spans offsets 0 to 11464, so the stale Jun 1 offset 9539 was weeks old… but still valid! The consumer silently rewound and replayed the deployment notification.
Time-based retention does not work well for low-volume topics
Addressed by KIP-32/KIP-33
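Why a stale offset can still be "valid" on a low-volume topic: a purge only invalidates offsets that fall into deleted segments, and on a low-volume topic nothing may have been deleted yet, so a weeks-old offset silently rewinds instead of triggering OffsetOutOfRange. A sketch (illustrative Python; the high-volume log_start below is hypothetical, chosen only to show the contrast):

```python
def classify_committed_offset(committed, log_start, log_end):
    """On a high-volume topic, an old committed offset falls below log_start
    and triggers an out-of-range reset; on a low-volume topic, an equally old
    offset is still in range and causes a silent rewind (duplicates)."""
    if committed < log_start or committed > log_end:
        return "out-of-range (auto.offset.reset kicks in)"
    return "valid but stale (silent rewind -> duplicate consumption)"

# High-volume topic: the Jun 1 offset was long since purged
# (log_start here is hypothetical, for illustration only)
print(classify_committed_offset(6811737, log_start=8462320, log_end=9581225))
# Low-volume topic like [crt-event,12]: the full log [0, 11464] is retained
print(classify_committed_offset(9539, log_start=0, log_end=11464))
```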
Offset rewinds: the second incident
Mirror makers got wedged; when restarted, they sent duplicate emails to a (few) members
Consumer logs
2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)
Broker (offset manager) logs
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
⇒ the log cleaner had failed a while ago… but why did the offset fetch return -1?
Offset management - a quick overview: how are stale offsets (for dead consumers) cleaned up?
Offset cache:
dead-group PageViewEvent-0 → 321 (timestamp older than a week)
active-group LoginEvent-8 → 512 (recent timestamp)
A periodic cleanup task appends tombstones for dead-group to __consumer_offsets and deletes its entry from the offset cache
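The cleanup task above can be sketched as follows (illustrative Python; a tombstone is modeled as a None value, as in compacted Kafka topics):

```python
import time

WEEK = 7 * 24 * 3600

def cleanup_stale_offsets(offset_cache, offsets_topic, now, retention=WEEK):
    """For each group whose last commit is older than the retention window,
    append a tombstone (key with a null value) to __consumer_offsets so log
    compaction can drop the old commits, and delete the cache entry."""
    for key, (offset, ts) in list(offset_cache.items()):
        if now - ts > retention:
            offsets_topic.append((key, None))  # tombstone
            del offset_cache[key]

now = time.time()
cache = {
    ("dead-group", "PageViewEvent", 0): (321, now - 2 * WEEK),
    ("active-group", "LoginEvent", 8): (512, now - 60),
}
topic = []
cleanup_stale_offsets(cache, topic, now)
print(sorted(cache))   # only active-group remains
print(topic)           # tombstone appended for dead-group
```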
Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
Early in the load, the cache held only the old entries from the head of the log:
mirror-maker PageViewEvent-0 → 45 (very old timestamp)
mirror-maker LoginEvent-8 → 12 (very old timestamp)
The cleanup task happened to run during the load and treated those entries as stale.
The load then reached the recent offsets:
mirror-maker PageViewEvent-0 → 321 (recent timestamp)
mirror-maker LoginEvent-8 → 512 (recent timestamp)
… followed by the tombstones the cleanup task had just appended, which wiped them from the cache.
Root cause of this rewind
● Log cleaner had failed (separate bug)
  ○ ⇒ offsets topic grew big
  ○ ⇒ offset load on leader movement took a while
● Cache cleanup ran during the load
  ○ which appended tombstones
  ○ and overrode the most recent offsets
● (Fixed in KAFKA-2163)
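The race fixed in KAFKA-2163, sketched end to end (illustrative Python): the offset load replays the log in order, and if the cleanup task appends tombstones mid-load, those tombstones land at the log tail and are replayed last, wiping the most recent offsets.

```python
def load_offsets(log, cleanup_after_n):
    """Replay (key, value) records; a None value is a tombstone."""
    cache = {}
    log = list(log)  # the log can grow while we load (cleanup appends to it)
    i = 0
    while i < len(log):
        key, value = log[i]
        if value is None:
            cache.pop(key, None)   # tombstone wipes the cached offset
        else:
            cache[key] = value
        i += 1
        if i == cleanup_after_n:
            # Cleanup runs mid-load: everything loaded so far has very old
            # timestamps, so tombstones are appended at the log end.
            for stale_key in list(cache):
                log.append((stale_key, None))
    return cache

# Old commits at the head of the log, recent commits later:
log = [(("mm", "PageViewEvent", 0), 45), (("mm", "LoginEvent", 8), 12),
       (("mm", "PageViewEvent", 0), 321), (("mm", "LoginEvent", 8), 512)]
# Cleanup fires after the 2 old records load: its tombstones are replayed
# AFTER the recent offsets 321/512, leaving an empty cache.
print(load_offsets(log, cleanup_after_n=2))  # {}
```

With no mid-load cleanup, the same log yields the correct recent offsets; the bug only bites when cleanup interleaves with a (slow) load.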
Offset rewinds: wrapping it up
● Monitor log cleaner health
● If you suspect a rewind:
  ○ Check for unclean leader elections
  ○ Check for offset manager movement (i.e., __consumer_offsets partitions had leader changes)
  ○ Take a dump of the offsets topic
  ○ … stare long and hard at the logs (both consumer and offset manager)
● auto.offset.reset ← closest ?
● Better lag monitoring via Burrow
Critical data loss
[Diagram: four data centers: PROD-A, PROD-B, CORP-X, CORP-Y]
Data loss: the first incident
[Diagram: in each of PROD-A, PROD-B, CORP-X, and CORP-Y, producers write to a local Kafka cluster, which is mirrored into aggregate Kafka clusters; Hadoop consumes from the aggregate clusters in CORP]
Audit trail
[Diagram: alongside the data, audit events record per-tier message counts]
Data loss: detection (example 1)
[Diagram: comparing audit counts across the tiers of the multi-colo pipeline]
Data loss: detection (example 2)
[Diagram: comparing audit counts across the tiers of the multi-colo pipeline]
Data loss or audit issue? (The actual incident)
[Diagram: the multi-colo pipeline with audit counts per tier]
Sporadic discrepancies in Kafka-aggregate-CORP-X counts for several topics
However, the Hadoop-X tier is complete
Verified actual data completeness by recounting events in a few low-volume topics
… so definitely an audit-only issue, likely caused by dropped audit events
Possible sources of discrepancy:
● Cluster auditor
● Cluster itself (i.e., data loss in the audit topic)
● Audit front-end
Possible causes
[Diagram: the cluster auditor consumes all topics from the CORP-X Kafka aggregate cluster and emits audit counts]
Cluster auditor:
● Counting incorrectly
  ○ … but the same version of the auditor runs everywhere and only CORP-X has issues
● Not consuming all data for audit, or failing to send all audit events
  ○ … but no errors in auditor logs
● … and auditor bounces did not help
Data loss in the audit topic:
● … but no unclean leader elections
● … and no data loss in sampled topics (counted manually)
Possible causes
[Diagram: the audit front-end consumes the audit topic (including from CORP-Y) and inserts audit events into the audit DB]
Audit front-end fails to insert audit events into the DB:
● … but other tiers (e.g., CORP-Y) are correct
● … and no errors in logs
Attempt to reproduce
[Diagram: a second cluster auditor instance consumes all topics and emits counts to a new "test" tier, alongside the existing tier=CORP-X auditor]
● Emit counts to a new test tier
… fortunately this reproduced the problem:
● test tier counts were also sporadically off
… and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that the audit event was sent
● … but on closer look realized the receiving broker was not the leader for that partition of the audit topic
● So why did it not return NotLeaderForPartition?
That broker was part of another cluster!
[Diagram: the cluster auditor's tier=test audit events were being siphoned off to some other Kafka cluster]
… and we had a VIP misconfiguration
[Diagram: the VIP in front of the CORP-X aggregate cluster contained a stray broker entry belonging to some other Kafka cluster]
● The auditor still uses the old producer
● It periodically refreshes metadata (via the VIP) for the audit topic
● ⇒ sometimes it fetches metadata from the other cluster
● … and leaks audit events to that cluster until at least the next metadata refresh
So audit events leaked into the other cluster
Some takeaways
● Could have been worse if mirror makers to CORP-X had been bounced
  ○ (since mirror makers could have started siphoning actual data to the other cluster)
● Consider using round-robin DNS instead of VIPs
  ○ … which is also necessary for using per-IP connection limits
Data loss: the second incident
Prolonged period of data loss from our Kafka REST proxy
Alerts fire that a broker in tracking cluster had gone offline
NOC engages SYSOPS to investigate
NOC engages Feed SREs and Kafka SREs to investigate drop (not loss) in a subset of page views
On investigation, Kafka SRE finds no problems with Kafka (excluding the down broker), but notes an overall drop in tracking messages starting shortly after the broker failure
NOC engages Traffic SRE to investigate why their tracking events had stopped
Traffic SRE say that they don’t see errors on their side, and add that they use Kafka REST proxy
Kafka SRE finds no immediate errors in Kafka REST logs but bounces the service as a precautionary measure
Tracking events return to normal (expected) counts after the bounce
Reproducing the issue
[Diagram: a producer performance tool sends to Broker A and Broker B; Broker A is then isolated via iptables]
Reproducing the issue
[Diagram: new-producer internals: the Accumulator batches records per partition (partition 1 … partition n); the Sender drains batches into in-flight requests to the leader broker]
● Leadership for partition 1 moves from Broker A (isolated) to Broker B
● The new producer did not implement a request timeout
  ○ ⇒ it keeps awaiting a response from the old leader
  ○ ⇒ unaware of the leader change until the next metadata refresh
● So the client continues to send to partition 1
● Batches pile up in partition 1 and eat up accumulator memory
● Subsequent sends drop/block per the block.on.buffer.full config
Reproducing the issue
● netstat
tcp 0 0 ::ffff:127.0.0.1:35938 ::ffff:127.0.0.1:9092 ESTABLISHED 3704/java
● Producer metrics
  ○ zero retry/error rate
● Thread dump
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, TimeUnit)
org.apache.kafka.clients.producer.internals.BufferPool.allocate(int)
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback)
● Resolved by KAFKA-2120 (KIP-19)
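KIP-19 (KAFKA-2120) added a request timeout so an in-flight request to a dead or partitioned broker eventually fails, freeing accumulator memory and forcing a metadata refresh. The core idea, sketched in illustrative Python (not the actual producer code; class and parameter names are hypothetical):

```python
class InFlightRequest:
    def __init__(self, partition, sent_at):
        self.partition = partition
        self.sent_at = sent_at
        self.completed = False

def expire_in_flight(requests, now, request_timeout_s=30.0):
    """Fail requests that have waited longer than the request timeout.
    Failing them completes their batches exceptionally (freeing accumulator
    memory) and triggers a metadata refresh, so the producer discovers the
    new partition leader instead of waiting forever on an isolated broker."""
    expired = [r for r in requests
               if not r.completed and now - r.sent_at > request_timeout_s]
    for r in expired:
        requests.remove(r)  # complete exceptionally: free the batch
    return expired

reqs = [InFlightRequest("partition-1", sent_at=0.0)]
expired = expire_in_flight(reqs, now=60.0)
print([r.partition for r in expired])  # ['partition-1']
print(reqs)                            # [] (memory freed, refresh triggered)
```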
Cluster unavailability
(This is an abridged version of my earlier talk.)
The incident
Occurred a few days after upgrading to pick up quotas and SSL
[Timeline from x25 (April 5) to x38 (October 13): various quota patches, multi-port support (KAFKA-1809, KAFKA-1928), and SSL (KAFKA-1690) landed along the way (June 3, August 18)]
The incident
A broker (which happened to be the controller) failed in our queuing Kafka cluster
The incident
Multiple applications begin to report “issues”: socket timeouts to the Kafka cluster
Posts search was one such impacted application
The incident
Two brokers report high request and response queue sizes
Two brokers report high request queue size and request latencies
The incident
● Other observations
  ○ High CPU load on those brokers
  ○ Throughput degrades to ~half the normal throughput
  ○ Tons of broken-pipe exceptions in server logs
  ○ Application owners report socket timeouts in their logs
Remediation
Shifted site traffic to another data center
Kafka outage ⇒ member impact; multi-colo is critical!
Remediation
● Controller moves did not help
● Firewall the affected brokers:
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
● The above helped, but the cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
  ○ … but x25 never exhibited this issue
Remediation
Friday night ⇒ roll back to x25 and debug later
… but SREs had to babysit the rollback: for each x38 broker in turn, move leaders off it, firewall it, downgrade it to x25, then move leaders back
Attempts at reproducing the issue
● Test cluster
  ○ Tried killing the controller
  ○ Multiple rolling bounces
  ○ Could not reproduce
● Upgraded the queuing cluster to x38 again
  ○ Could not reproduce
● So nothing… :(
Unraveling queue backups…
Life-cycle of a Kafka request
[Diagram: Network layer: an Acceptor accepts new connections and hands client connections to Processors; each Processor reads requests into a shared Request queue; API handler threads (the API layer) handle each request, parking long-poll requests in Purgatory and holding quota-violating responses in the Quota manager; responses go onto per-Processor Response queues and are written back to the client]
Total time = queue-time + local-time + remote-time + quota-time + response-queue-time + response-send-time
Investigating high request times
● First look for high local time
  ○ then high response send time
    ■ then high remote (purgatory) time → generally a non-issue (but caveats described later)
● High request-queue/response-queue times are effects, not causes
High local times during incident (e.g., fetch)
How are fetch requests handled?
● Get the physical offsets to be read from the local log during the response
● If the fetch is from a follower (i.e., a replica fetch):
  ○ If the follower was out of the ISR and just caught up, expand the ISR (a ZooKeeper write)
  ○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., a consumer fetch):
  ○ Record/update the byte-rate of this client
  ○ Throttle the request on quota violation
Could these cause high local times?
● Get physical offsets from the local log → should be fast
● Replica fetch: expand the ISR on follower catch-up → should be fast
● Replica fetch: satisfy delayed produce requests → not using acks -1
● Consumer fetch: record/update byte-rate of this client → test this…
● Consumer fetch: throttle on quota violation → delayed outside the API thread
Quota metrics
The broker maintains byte-rate metrics on a per-client-id basis. Yet a trivial fetch spent all its time in the request queue:
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request: Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>; totalTime:6589, requestQueueTime:6589, localTime:0, remoteTime:0, responseQueueTime:0, sendTime:0, securityProtocol:PLAINTEXT, principal:ANONYMOUS ??!
Quota metrics - a quick benchmark
for (clientId ← 0 until N) {
  timer.time {
    quotaMetrics.recordAndMaybeThrottle(clientId, 0, DefaultCallBack)
  }
}
[Benchmark charts: time per record climbs as the number of distinct client-ids grows]
Fixed in KAFKA-2664
Meanwhile in our queuing cluster… (due to climbing client-id counts)
● A rolling bounce of the cluster forced the issue to recur on brokers that had high client-id metric counts
  ○ Used jmxterm to check per-client-id metric counts before the experiment
  ○ Hooked up a profiler to verify during the incident
    ■ Generally avoid profiling/heap dumps in production due to interference
  ○ Did not see this in the earlier rolling bounce because there were only a few client-id metrics at the time
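The symptom behind KAFKA-2664 was that recording a quota metric got slower as distinct client-ids accumulated. A toy reproduction of that shape (illustrative Python, not Kafka's metrics code; it contrasts a per-record scan over all sensors with a constant-time lookup):

```python
import timeit

class NaiveQuotaMetrics:
    """Toy model: finds a client's sensor by scanning all sensors, so each
    record costs O(#client-ids). Cost climbs as unique client-ids
    accumulate (the shape of the benchmark on the slide)."""
    def __init__(self):
        self.sensors = []  # list of (client_id, value)
    def record(self, client_id, value):
        for i, (cid, _) in enumerate(self.sensors):
            if cid == client_id:
                self.sensors[i] = (cid, value)
                return
        self.sensors.append((client_id, value))

class FixedQuotaMetrics:
    """Constant-time lookup per record, independent of client-id count."""
    def __init__(self):
        self.sensors = {}
    def record(self, client_id, value):
        self.sensors[client_id] = value

for cls in (NaiveQuotaMetrics, FixedQuotaMetrics):
    m = cls()
    t = timeit.timeit(lambda: [m.record(str(i), 0) for i in range(2000)], number=1)
    print(cls.__name__, round(t, 3), "s for 2000 distinct client-ids")
```

The real fix lived in Kafka's metrics code; this sketch only shows why per-record cost that scales with the number of client-id sensors hurts brokers serving many short-lived client-ids.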
How to fix high local times
● Optimize the request's handling, e.g.:
  ○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901)
  ○ and KAFKA-1356
● Make it asynchronous
  ○ E.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if the response depends on some condition); but be aware of the caveats:
  ○ Higher memory pressure if the request purgatory size grows
  ○ Expired requests are handled in the purgatory expiration thread (which is good)
  ○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests, local time can increase for the satisfying request
Monitor these closely!
● Request queue size
● Response queue sizes
● Request latencies:
  ○ Total time
  ○ Local time
  ○ Response send time
  ○ Remote time
● Request handler pool idle ratio
Breaking compatibility
The first incident: new clients, old clusters
[Diagram: the test and certification clusters emit metric events to the metrics cluster; the test cluster was upgraded to the new version while the others still ran the old version]
The new client then failed to parse responses from old brokers:
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'throttle_time_ms': java.nio.BufferUnderflowException
  at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73)
  at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:397)
  ...
New clients, old clusters: remediation
[Diagram: test and certification clusters upgraded to the new version; metrics cluster still on the old version]
Set acks to zero (with no response to parse, the schema mismatch is avoided)
New clients, old clusters: remediation
[Diagram: metrics cluster also upgraded to the new version]
Reset acks to 1
(BTW this just hit us again with the protocol changes in KIP-31/KIP-32)
KIP-35 would help a ton!
The second incident: new endpoints
ZooKeeper registration, older broker versions:
{"version":1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092}
x14 brokers:
{"version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": ["plaintext://localhost:9092"]}
Old clients ignore the endpoints field; x14 clients use endpoints when the version is 2.
x36 brokers (with SSL enabled):
{"version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": ["plaintext://localhost:9092", "ssl://localhost:9093"]}
x14 clients then choked on the unknown SSL endpoint:
java.lang.IllegalArgumentException: No enum constant org.apache.kafka.common.protocol.SecurityProtocol.SSL
  at java.lang.Enum.valueOf(Enum.java:238)
  at org.apache.kafka.common.protocol.SecurityProtocol.valueOf(SecurityProtocol.java:24)
New endpoints: remediation
x36 brokers register with "version":1 (instead of 2) while still publishing both endpoints:
{"version":1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": ["plaintext://localhost:9092", "ssl://localhost:9093"]}
● x14 clients: v1 ⇒ ignore endpoints
● x36 clients: v1 ⇒ use endpoints if present
● Fix in KAFKA-2584
● Also related: KAFKA-3100
Power outage
Widespread FS corruption after a power outage
● Mount settings at the time:
  ○ type ext4 (rw,noatime,data=writeback,commit=120)
● Restarts were successful, but brokers subsequently hit corruption
● Subsequent restarts also hit corruption in index files
Summary
● Monitoring beyond per-broker/controller metrics
  ○ Validate SLAs
  ○ Continuously test admin functionality (in test clusters)
● Automate release validation
● https://github.com/linkedin/streaming
Kafka monitor
[Diagram: a monitor instance runs a producer and consumer against the Kafka cluster, tracking ackLatencyMs, e2eLatencyMs, duplicateRate, retryRate, failureRate, lossRate, and Availability %; other monitor instances exercise AdminUtils (checkReassign, checkPLE)]
Q&A
We are hiring! LinkedIn Data Infrastructure meetup
Streams infrastructure @ LinkedIn
● Kafka pub-sub ecosystem
● Stream processing platform built on Apache Samza
● Next-gen change capture technology (incubating)
Who: Software developers and Site Reliability Engineers at all levels
Contact: Kartik Paramasivam
Where: LinkedIn campus, 2061 Stierlin Ct., Mountain View, CA
When: May 11 at 6:30 PM
Register: http://bit.ly/1Sv8ach