Kafkaesque days at LinkedIn in 2015
Joel Koshy, Kafka Summit 2016

Transcript of “Kafkaesque days at LinkedIn in 2015”

Page 1: Kafkaesque days at LinkedIn in 2015

Kafkaesque days at LinkedIn in 2015

Joel Koshy, Kafka Summit 2016

Page 2: Kafkaesque days at linked in in 2015

Kafkaesque, adjective. Kaf·ka·esque \ˌkäf-kə-ˈesk, ˌkaf-\

: of, relating to, or suggestive of Franz Kafka or his writings; especially : having a nightmarishly complex, bizarre, or illogical quality

Merriam-Webster

Page 3: Kafkaesque days at linked in in 2015

Kafka @ LinkedIn

Page 4: Kafkaesque days at linked in in 2015

What @bonkoif said:

More clusters

More use-cases

More problems …

Kafka @ LinkedIn

Page 5: Kafkaesque days at linked in in 2015

Incidents that we will cover
● Offset rewinds
● Data loss
● Cluster unavailability
● (In)compatibility
● Blackout

Page 6: Kafkaesque days at linked in in 2015

Offset rewinds

Page 7: Kafkaesque days at linked in in 2015

What are offset rewinds?

invalid offsets (purged messages) | valid offsets | invalid offsets (messages yet to arrive)

Page 8: Kafkaesque days at linked in in 2015

If a consumer gets an OffsetOutOfRangeException:

What are offset rewinds?

invalid offsets | valid offsets | invalid offsets

auto.offset.reset ← earliest auto.offset.reset ← latest
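The two reset policies can be sketched as a tiny function (an illustrative model, not Kafka source; `resolve_fetch_offset` is a made-up name):

```python
# Illustrative sketch of what happens when a consumer's fetch offset falls
# outside the valid range (OffsetOutOfRangeException): the client resets
# according to auto.offset.reset.
def resolve_fetch_offset(requested, log_start, log_end, auto_offset_reset="latest"):
    """Return the offset the consumer actually resumes from."""
    if log_start <= requested <= log_end:
        return requested                      # still valid: no reset needed
    if auto_offset_reset == "earliest":
        return log_start                      # replay everything retained
    if auto_offset_reset == "latest":
        return log_end                        # skip ahead (messages are missed)
    raise LookupError("offset out of range and no reset policy configured")

# A committed offset below the log start (messages purged) resets per policy:
print(resolve_fetch_offset(50, log_start=100, log_end=200, auto_offset_reset="earliest"))  # 100
print(resolve_fetch_offset(50, log_start=100, log_end=200, auto_offset_reset="latest"))    # 200
```

With `latest`, an out-of-range consumer silently jumps over data, which is why rewinds and resets matter for downstream correctness.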

Page 9: Kafkaesque days at linked in in 2015

What are offset rewinds… and why do they matter?

Hadoop → push job → Kafka (CORP) → mirror maker → Kafka (PROD) → Stork → email campaigns

Page 10: Kafkaesque days at linked in in 2015

What are offset rewinds… and why do they matter?

Hadoop → push job → Kafka (CORP) → mirror maker → Kafka (PROD) → Stork → email campaigns

Real-life incident courtesy of xkcd

offset rewind

Page 11: Kafkaesque days at linked in in 2015

Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy

CRT Notifications <[email protected]> Fri, Jul 10, 2015 at 8:27 PM

Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy

Offset rewinds: the first incident

Page 12: Kafkaesque days at linked in in 2015

Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy

CRT Notifications <[email protected]> Fri, Jul 10, 2015 at 8:27 PM

Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy on Wednesday, Jul 8, 2015 at 10:14 AM

Offset rewinds: the first incident

Page 13: Kafkaesque days at linked in in 2015

What are offset rewinds… and why do they matter?

Hadoop → push job → Kafka (CORP) → mirror maker → Kafka (PROD) → Stork → email campaigns

Good practice to have some filtering logic here

Page 14: Kafkaesque days at linked in in 2015

Offset rewinds: detection

Page 15: Kafkaesque days at linked in in 2015

Offset rewinds: detection

Page 16: Kafkaesque days at linked in in 2015

Offset rewinds: detection - just use this

Page 17: Kafkaesque days at linked in in 2015

Offset rewinds: a typical cause

Page 18: Kafkaesque days at linked in in 2015

Offset rewinds: a typical cause

invalid offsets | valid offsets | invalid offsets

consumer position

Page 19: Kafkaesque days at linked in in 2015

Offset rewinds: a typical cause

invalid offsets | valid offsets | invalid offsets

consumer position

Unclean leader election truncates the log

Page 20: Kafkaesque days at linked in in 2015

Offset rewinds: a typical cause

invalid offsets | valid offsets | invalid offsets

consumer position

Unclean leader election truncates the log… and consumer’s offset goes out of range
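A couple of lines make the failure mode concrete (a sketch with made-up offsets, not taken from the incident):

```python
# After an unclean leader election the new leader's log can be shorter than
# the old leader's, so a position committed against the old leader may now
# sit beyond the new log end -- i.e., out of range.
def position_out_of_range(position, log_start, log_end):
    return not (log_start <= position <= log_end)

log_end_after_truncation = 900   # hypothetical: new leader's truncated log end
committed_position = 950         # committed against the old leader
print(position_out_of_range(committed_position, 0, log_end_after_truncation))  # True
```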

Page 21: Kafkaesque days at linked in in 2015

But there were no ULEs when this happened

Page 22: Kafkaesque days at linked in in 2015

But there were no ULEs when this happened… and we set auto.offset.reset to latest

Page 23: Kafkaesque days at linked in in 2015

Offset management - a quick overview

(broker)

Consumer

Consumer group

Consumer Consumer

(broker) (broker)

Consume (fetch requests)

Page 24: Kafkaesque days at linked in in 2015

Offset management - a quick overview

OffsetManager (broker)

Consumer

Consumer group

Consumer Consumer

Periodic OffsetCommitRequest

(broker)(broker)

Page 25: Kafkaesque days at linked in in 2015

Offset management - a quick overview

OffsetManager (broker)

Consumer

Consumer group

Consumer Consumer

OffsetFetchRequest(after rebalance)

(broker) (broker)

Page 26: Kafkaesque days at linked in in 2015

Offset management - a quick overview

[mirror-maker, PageViewEvent-0] → 240
[mirror-maker, LoginEvent-8] → 456
[mirror-maker, LoginEvent-8] → 512
[mirror-maker, PageViewEvent-0] → 321

__consumer_offsets topic

Page 27: Kafkaesque days at linked in in 2015

Offset management - a quick overview

[mirror-maker, PageViewEvent-0] → 240
[mirror-maker, LoginEvent-8] → 456
[mirror-maker, LoginEvent-8] → 512
[mirror-maker, PageViewEvent-0] → 321

__consumer_offsets topic

New offset commits append to the topic

Page 28: Kafkaesque days at linked in in 2015

Offset management - a quick overview

[mirror-maker, PageViewEvent-0] → 240
[mirror-maker, LoginEvent-8] → 456
[mirror-maker, LoginEvent-8] → 512
[mirror-maker, PageViewEvent-0] → 321

__consumer_offsets topic

New offset commits append to the topic

[mirror-maker, PageViewEvent-0] → 321
[mirror-maker, LoginEvent-8] → 512
… …

Maintain offset cache to serve offset fetch requests quickly

Page 29: Kafkaesque days at linked in in 2015

Offset management - a quick overview

[mirror-maker, PageViewEvent-0] → 240
[mirror-maker, LoginEvent-8] → 456
[mirror-maker, LoginEvent-8] → 512
[mirror-maker, PageViewEvent-0] → 321

__consumer_offsets topic

New offset commits append to the topic

[mirror-maker, PageViewEvent-0] → 321
[mirror-maker, LoginEvent-8] → 512
… …

Purge old offsets via log compaction

Maintain offset cache to serve offset fetch requests quickly
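The append-then-compact scheme boils down to “last record per key wins”; a minimal sketch in Python (illustrative names, not Kafka source):

```python
def load_offset_cache(offsets_log):
    """Replay __consumer_offsets-style records; the last record per
    (group, topic-partition) key wins, and a None value is a tombstone."""
    cache = {}
    for key, offset in offsets_log:
        if offset is None:
            cache.pop(key, None)   # tombstone: purge the entry
        else:
            cache[key] = offset    # newer commit overrides older ones
    return cache

log = [
    (("mirror-maker", "PageViewEvent-0"), 240),
    (("mirror-maker", "LoginEvent-8"), 456),
    (("mirror-maker", "LoginEvent-8"), 512),
    (("mirror-maker", "PageViewEvent-0"), 321),
]
print(load_offset_cache(log))
# {('mirror-maker', 'PageViewEvent-0'): 321, ('mirror-maker', 'LoginEvent-8'): 512}
```

Log compaction performs the same "keep only the latest record per key" reduction on disk, which is why the cache and the compacted topic agree.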

Page 30: Kafkaesque days at linked in in 2015

Offset management - a quick overview

[mirror-maker, PageViewEvent-0] → 240
[mirror-maker, LoginEvent-8] → 456
[mirror-maker, LoginEvent-8] → 512
[mirror-maker, PageViewEvent-0] → 321

__consumer_offsets topic

When a new broker becomes the leader (i.e., offset manager) it loads offsets into its cache

Page 31: Kafkaesque days at linked in in 2015

Offset management - a quick overview

[mirror-maker, PageViewEvent-0] → 240
[mirror-maker, LoginEvent-8] → 456
[mirror-maker, LoginEvent-8] → 512
[mirror-maker, PageViewEvent-0] → 321

__consumer_offsets topic

[mirror-maker, PageViewEvent-0] → 321
[mirror-maker, LoginEvent-8] → 512
… …

See this deck for more details

Page 32: Kafkaesque days at linked in in 2015

Back to the incident…

2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287],

Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225

Page 33: Kafkaesque days at linked in in 2015

Back to the incident…

... <rebalance>

2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205

... <rebalance>

2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223

... <rebalance>

2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737

...

2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287],

Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225

Page 34: Kafkaesque days at linked in in 2015

./bin/kafka-console-consumer.sh --topic __consumer_offsets --zookeeper <zookeeperConnect> --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter" --consumer.config config/consumer.properties

(must set exclude.internal.topics=false in consumer.properties)

While debugging offset rewinds, do this first!

Page 35: Kafkaesque days at linked in in 2015

...
[mirror-maker,metrics_event,1]::OffsetAndMetadata[83511737,NO_METADATA,1433178005711]
[mirror-maker,some-log_event,13]::OffsetAndMetadata[6811737,NO_METADATA,1433178005711]
...

...

[mirror-maker,some-log_event,13]::OffsetAndMetadata[9581223,NO_METADATA,1436495051231]

...

Inside the __consumer_offsets topic

(timestamp 1436495051231 → Jul 10, today)

(timestamp 1433178005711 → Jun 1 !!)

Page 36: Kafkaesque days at linked in in 2015

So why did the offset manager return a stale offset?

Offset manager logs:

2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63]
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
        at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)

Page 37: Kafkaesque days at linked in in 2015

So why did the offset manager return a stale offset?

Offset manager logs:

2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63]
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
        at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)

... ...

[mirror-maker, some-log_event, 13] → 6811737

... ...

Leader moved and new offset manager hit KAFKA-2117 while loading offsets

old offsets recent offsets

Page 38: Kafkaesque days at linked in in 2015

… caused a ton of offset resets

2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205

...

2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223

...

2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737

...

2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287],

Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225

[some-log_event, 13]: valid offsets 846232 … 9581225 (older messages purged)

Page 39: Kafkaesque days at linked in in 2015

… but why the duplicate email?

Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy

CRT Notifications <[email protected]> Fri, Jul 10, 2015 at 8:27 PM

Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy

Page 40: Kafkaesque days at linked in in 2015

… but why the duplicate email?

2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464

...

2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464

...

2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539

...

Also from Jun 1

Page 41: Kafkaesque days at linked in in 2015

… but why the duplicate email?

2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464

...

2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464

...

2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539

...

[crt-event, 12]: valid offsets 0 … 11464

… but still valid!

Page 43: Kafkaesque days at linked in in 2015

Offset rewinds: the second incident

Mirror makers got wedged → were restarted → sent duplicate emails to a few members

Page 44: Kafkaesque days at linked in in 2015

Offset rewinds: the second incident

Consumer logs

2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)

Page 45: Kafkaesque days at linked in in 2015

Offset rewinds: the second incident

Consumer logs

2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)

Broker (offset manager) logs

2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)

Page 46: Kafkaesque days at linked in in 2015

Offset rewinds: the second incident

Consumer logs

2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)

Broker (offset manager) logs

2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)

⇒ log cleaner had failed a while ago…but why did offset fetch return -1?

Page 47: Kafkaesque days at linked in in 2015

Offset management - a quick overview

How are stale offsets (for dead consumers) cleaned up?

Offset cache:
[dead-group, PageViewEvent-0] → 321 (timestamp older than a week)
[active-group, LoginEvent-8] → 512 (recent timestamp)
… …

__consumer_offsets topic

cleanup task

Page 48: Kafkaesque days at linked in in 2015

Offset management - a quick overview

How are stale offsets (for dead consumers) cleaned up?

Offset cache:
[dead-group, PageViewEvent-0] → 321 (timestamp older than a week)
[active-group, LoginEvent-8] → 512 (recent timestamp)
… …

__consumer_offsets topic

cleanup task

Append tombstones for dead-group and delete its entry in the offset cache

Page 49: Kafkaesque days at linked in in 2015

Back to the incident...

2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)

[mirror-maker, PageViewEvent-0] → 45 (very old timestamp)
[mirror-maker, LoginEvent-8] → 12 (very old timestamp)
... ... ...

old offsets | recent offsets

load offsets

Page 50: Kafkaesque days at linked in in 2015

Back to the incident...

2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)

[mirror-maker, PageViewEvent-0] → 45 (very old timestamp)
[mirror-maker, LoginEvent-8] → 12 (very old timestamp)
... ... ...

old offsets | recent offsets

load offsets

Cleanup task happened to run during the load

Page 51: Kafkaesque days at linked in in 2015


Page 52: Kafkaesque days at linked in in 2015

Back to the incident...

2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)

[mirror-maker, PageViewEvent-0] → 321 (recent timestamp)
[mirror-maker, LoginEvent-8] → 512 (recent timestamp)
... ... ...

old offsets | recent offsets

load offsets

Page 53: Kafkaesque days at linked in in 2015


Page 54: Kafkaesque days at linked in in 2015

Root cause of this rewind
● Log cleaner had failed (separate bug)
  ○ ⇒ offsets topic grew big
  ○ ⇒ offset load on leader movement took a while
● Cache cleanup ran during the load
  ○ which appended tombstones
  ○ and overrode the most recent offsets
● (Fixed in KAFKA-2163)
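The KAFKA-2163 race can be simulated in a few lines (a sketch of the failure mode only; names, offsets, and timestamps are made up):

```python
def replay(log):
    """Last record per key wins; None is a tombstone."""
    cache = {}
    for key, val in log:
        if val is None:
            cache.pop(key, None)
        else:
            cache[key] = val
    return cache

# The offsets log contains an old commit and a recent one for the same key:
log = [(("mirror-maker", "PageViewEvent-0"), (45, 100)),    # offset 45, old timestamp
       (("mirror-maker", "PageViewEvent-0"), (321, 900))]   # offset 321, recent

# Mid-load, the cache holds only the OLD entry; the cleanup task runs,
# judges the entry expired, and appends a tombstone at the END of the log:
partial_cache = {("mirror-maker", "PageViewEvent-0"): (45, 100)}
now, retention = 1000, 500
for key, (offset, ts) in list(partial_cache.items()):
    if now - ts > retention:
        log.append((key, None))   # tombstone lands after the recent commit
        del partial_cache[key]

print(replay(log))   # {} -> the recent offset 321 is gone; offset fetches return -1
```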

Page 55: Kafkaesque days at linked in in 2015

Offset rewinds: wrapping it up
● Monitor log cleaner health
● If you suspect a rewind:
  ○ Check for unclean leader elections
  ○ Check for offset manager movement (i.e., __consumer_offsets partitions had leader changes)
  ○ Take a dump of the offsets topic
  ○ … stare long and hard at the logs (both consumer and offset manager)
● auto.offset.reset ← closest?
● Better lag monitoring via Burrow
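Short of running Burrow, a basic rewind check is just “committed offsets must not go backwards per (group, partition)”; a sketch with hypothetical names:

```python
def detect_rewinds(commits):
    """commits: iterable of ((group, topic_partition), offset) in commit order.
    Returns entries where a committed offset moved backwards (a rewind)."""
    last, rewinds = {}, []
    for key, offset in commits:
        if key in last and offset < last[key]:
            rewinds.append((key, last[key], offset))
        last[key] = offset
    return rewinds

commits = [(("mirror-maker", "some-log_event-13"), 9581205),
           (("mirror-maker", "some-log_event-13"), 9581223),
           (("mirror-maker", "some-log_event-13"), 6811737)]   # stale offset resurfaces
print(detect_rewinds(commits))
# [(('mirror-maker', 'some-log_event-13'), 9581223, 6811737)]
```

Feeding this a dump of the offsets topic (as shown earlier with the console consumer) surfaces rewinds without waiting for downstream symptoms.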

Page 56: Kafkaesque days at linked in in 2015

Critical data loss

Page 57: Kafkaesque days at linked in in 2015

Data loss: the first incident

[Diagram: in each PROD datacenter (A and B), producers write to a local Kafka cluster, which is mirrored into Kafka aggregate clusters; the aggregate clusters in the CORP datacenters (X and Y) feed Hadoop]

Page 58: Kafkaesque days at linked in in 2015

Audit trail

[Diagram: the same pipeline, with audit events emitted alongside the data at each tier]

Page 59: Kafkaesque days at linked in in 2015

Data loss: detection (example 1)


Page 60: Kafkaesque days at linked in in 2015

Data loss: detection (example 1)


Page 61: Kafkaesque days at linked in in 2015

Data loss: detection (example 2)


Page 62: Kafkaesque days at linked in in 2015

Data loss? (The actual incident)


Page 63: Kafkaesque days at linked in in 2015

Data loss or audit issue? (The actual incident)


Sporadic discrepancies in Kafka-aggregate-CORP-X counts for several topics

However, Hadoop-X tier is complete


Page 64: Kafkaesque days at linked in in 2015

Verified actual data completeness by recounting events in a few low-volume topics

… so definitely an audit-only issue

Likely caused by dropping audit events

Page 65: Kafkaesque days at linked in in 2015

Verified actual data completeness by recounting events in a few low-volume topics

… so definitely an audit-only issue

Possible sources of discrepancy:

● Cluster auditor● Cluster itself (i.e., data loss in audit topic)● Audit front-end

Likely caused by dropping audit events

Page 66: Kafkaesque days at linked in in 2015

Possible causes

[Diagram: the cluster auditor consumes all topics from the Kafka aggregate (CORP-X) cluster, which also feeds Hadoop, and emits audit counts]

Cluster auditor
● Counting incorrectly
  ○ but the same version of the auditor runs everywhere, and only CORP-X has issues
● Not consuming all data for audit, or failing to send all audit events
  ○ but no errors in the auditor logs
● … and auditor bounces did not help

Page 67: Kafkaesque days at linked in in 2015

Possible causes

Data loss in the audit topic
● … but no unclean leader elections
● … and no data loss in sampled topics (counted manually)

[Diagram as on the previous slide]

Page 68: Kafkaesque days at linked in in 2015

Possible causes

Audit front-end fails to insert audit events into the DB
● … but other tiers (e.g., CORP-Y) are correct
● … and no errors in logs

[Diagram: the audit front-end consumes the audit topic from the Kafka aggregate (CORP-X) cluster and inserts into the audit DB; CORP-Y audit events arrive the same way]

Page 69: Kafkaesque days at linked in in 2015

Attempt to reproduce
● Emit counts to a new test tier

[Diagram: a second cluster auditor instance consumes all topics from the Kafka aggregate (CORP-X) cluster and emits its counts under tier = test instead of tier = CORP-X]

Page 70: Kafkaesque days at linked in in 2015

Attempt to reproduce … fortunately worked:
● Emit counts to a new test tier
● test tier counts were also sporadically off

[Diagram as on the previous slide]

Page 71: Kafkaesque days at linked in in 2015

… and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted

[Diagram as on the previous slides]

Page 72: Kafkaesque days at linked in in 2015

… and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that the audit event was sent

[Diagram as on the previous slides]

Page 73: Kafkaesque days at linked in in 2015

… and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that the audit event was sent
● … but on closer look realized it was not the leader for that partition of the audit topic

[Diagram as on the previous slides]

Page 74: Kafkaesque days at linked in in 2015

… and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that the audit event was sent
● … but on closer look realized it was not the leader for that partition of the audit topic
● So why did it not return NotLeaderForPartition?

[Diagram as on the previous slides]

Page 75: Kafkaesque days at linked in in 2015

That broker was part of another cluster!

[Diagram: the tier = test audit events were being siphoned into some other Kafka cluster]

Page 76: Kafkaesque days at linked in in 2015

… and we had a VIP misconfiguration

[Diagram: the VIP fronting the CORP-X Kafka aggregate cluster contained a stray broker entry belonging to some other Kafka cluster]

Page 77: Kafkaesque days at linked in in 2015

So audit events leaked into the other cluster
● Auditor still uses the old producer
● Periodically refreshes metadata (via the VIP) for the audit topic
● ⇒ sometimes fetches metadata from the other cluster

[Diagram: the audit TopicMetadataRequest goes through the VIP, so the metadata response can come from the stray broker in the other cluster]

Page 78: Kafkaesque days at linked in in 2015

So audit events leaked into the other cluster
● Auditor still uses the old producer
● Periodically refreshes metadata (via the VIP) for the audit topic
● ⇒ sometimes fetches metadata from the other cluster
● and leaks audit events to that cluster until at least the next metadata refresh

[Diagram as on the previous slide]

Page 79: Kafkaesque days at linked in in 2015

Some takeaways
● Could have been worse if mirror-makers to CORP-X had been bounced
  ○ (Since mirror makers could have started siphoning actual data to the other cluster)
● Consider using round-robin DNS instead of VIPs
  ○ … which is also necessary for using per-IP connection limits

Page 80: Kafkaesque days at linked in in 2015

Data loss: the second incident

Prolonged period of data loss from our Kafka REST proxy

Page 81: Kafkaesque days at linked in in 2015

Data loss: the second incident

Alerts fire that a broker in tracking cluster had gone offline

NOC engages SYSOPS to investigate

NOC engages Feed SREs and Kafka SREs to investigate drop (not loss) in a subset of page views

On investigation, Kafka SRE finds no problems with Kafka (excluding the down broker), but notes an overall drop in tracking messages starting shortly after the broker failure

NOC engages Traffic SRE to investigate why their tracking events had stopped

Traffic SRE say that they don’t see errors on their side, and add that they use Kafka REST proxy

Kafka SRE finds no immediate errors in Kafka REST logs but bounces the service as a precautionary measure

Tracking events return to normal (expected) counts after the bounce

Prolonged period of data loss from our Kafka REST proxy

Page 82: Kafkaesque days at linked in in 2015

Reproducing the issue

BrokerA

Producer performance

BrokerB

Page 83: Kafkaesque days at linked in in 2015

Reproducing the issue

BrokerA

Producer performance

BrokerB

Isolate the broker (iptables)

Page 84: Kafkaesque days at linked in in 2015

Sender

Accumulator

Reproducing the issue

BrokerA

BrokerB

Partition 1

Partition 2

Partition n

send

Leader forpartition 1

in-flight requests

Page 85: Kafkaesque days at linked in in 2015

Sender

Accumulator

Reproducing the issue

BrokerA

BrokerB

Partition 1

Partition 2

Partition n

send

New leader for partition 1

in-flight requests

Old leader for partition 1

Page 86: Kafkaesque days at linked in in 2015

Sender

Accumulator

Reproducing the issue

BrokerA

BrokerB

Partition 1

Partition 2

Partition n

send

New leader for partition 1

in-flight requests

New producer did not implement a request timeout

Old leader for partition 1

Page 87: Kafkaesque days at linked in in 2015

Sender

Accumulator

Reproducing the issue

BrokerA

BrokerB

Partition 1

Partition 2

Partition n

send

in-flight requests

New producer did not implement a request timeout
⇒ awaiting response
⇒ unaware of leader change until next metadata refresh

New leader for partition 1

Old leader for partition 1

Page 88: Kafkaesque days at linked in in 2015

Sender

Accumulator

Reproducing the issue

BrokerA

BrokerB

Partition 1

Partition 2

Partition n

send

in-flight requests

So client continues to send to partition 1

New leader for partition 1

Old leader for partition 1

Page 89: Kafkaesque days at linked in in 2015

Sender

Accumulator

Reproducing the issue

BrokerA

BrokerB

Partition 2

Partition n

send

batches pile up in partition 1 and eat up accumulator memory

in-flight requests

New leader for partition 1

Old leader for partition 1

Page 90: Kafkaesque days at linked in in 2015

Sender

Accumulator

Reproducing the issue

BrokerB

Partition 2

Partition n

send

in-flight requests

subsequent sends drop or block per the block.on.buffer.full config

New leader for partition 1

Old leader for partition 1

BrokerA

Page 91: Kafkaesque days at linked in in 2015

Reproducing the issue
● netstat

tcp 0 0 ::ffff:127.0.0.1:35938 ::ffff:127.0.0.1:9092 ESTABLISHED 3704/java

● Producer metrics
  ○ zero retry/error rate
● Thread dump

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, TimeUnit)
org.apache.kafka.clients.producer.internals.BufferPool.allocate(int)
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback)

● Resolved by KAFKA-2120 (KIP-19)
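What KIP-19 adds amounts to client-side expiry of in-flight requests; a simplified model (not the Java producer's actual code; field names are made up):

```python
# Without a request timeout, a request in flight to a dead broker is awaited
# forever and its batches pin accumulator memory. KIP-19 (KAFKA-2120) adds a
# client-side expiry pass, roughly like this:
def expire_inflight(in_flight, now_ms, request_timeout_ms):
    """Fail and remove any in-flight request older than the timeout."""
    expired = [r for r in in_flight if now_ms - r["sent_at"] > request_timeout_ms]
    for r in expired:
        in_flight.remove(r)
        r["errors"].append("TimeoutException")  # frees batches; prompts a metadata refresh
    return expired

req = {"sent_at": 0, "errors": []}
in_flight = [req]
expire_inflight(in_flight, now_ms=40000, request_timeout_ms=30000)
print(in_flight, req["errors"])   # [] ['TimeoutException']
```

Once the stuck request errors out, the producer refreshes metadata, discovers the new leader, and the accumulator stops filling up.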

Page 92: Kafkaesque days at linked in in 2015

Cluster unavailability

(This is an abridged version of my earlier talk.)

Page 93: Kafkaesque days at linked in in 2015

The incident

Occurred a few days after upgrading to pick up quotas and SSL

[Timeline from x25 to x38: April 5, June 3, August 18, October 13; changes picked up along the way include multi-port (KAFKA-1809, KAFKA-1928), various quota patches, and SSL (KAFKA-1690)]

Page 94: Kafkaesque days at linked in in 2015

The incident

Broker (which happened to be the controller) failed in our queuing Kafka cluster

Page 95: Kafkaesque days at linked in in 2015

The incident

Multiple applications begin to report “issues”: socket timeouts to the Kafka cluster

Posts search was one such impacted application

Page 96: Kafkaesque days at linked in in 2015

The incident

Two brokers report high request and response queue sizes

Page 97: Kafkaesque days at linked in in 2015

The incident

Two brokers report high request queue size and request latencies

Page 98: Kafkaesque days at linked in in 2015

The incident
● Other observations
  ○ High CPU load on those brokers
  ○ Throughput degrades to ~half the normal throughput
  ○ Tons of broken pipe exceptions in server logs
  ○ Application owners report socket timeouts in their logs

Page 99: Kafkaesque days at linked in in 2015

Remediation

Shifted site traffic to another data center

Kafka outage ⇒ member impact

Multi-colo is critical!

Page 100: Kafkaesque days at linked in in 2015

Remediation
● Controller moves did not help
● Firewalled the affected brokers:

sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP

● The above helped, but the cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
  ○ … but x25 never exhibited this issue

Page 101: Kafkaesque days at linked in in 2015

Remediation

Friday night ⇒ roll back to x25 and debug later

… but SREs had to babysit the rollback

x38 x38 x38 x38

Rolling downgrade

Page 102: Kafkaesque days at linked in in 2015

Remediation

Friday night ⇒ roll back to x25 and debug later

… but SREs had to babysit the rollback

x38 x38 x38 x38

Rolling downgrade

Move leaders

Page 103: Kafkaesque days at linked in in 2015

Remediation

Friday night ⇒ roll back to x25 and debug later

… but SREs had to babysit the rollback

x38 x38 x38 x38

Rolling downgrade

Firewall

Page 104: Kafkaesque days at linked in in 2015

Remediation

Friday night ⇒ roll back to x25 and debug later

… but SREs had to babysit the rollback

x38 x38 x38

Rolling downgrade

Firewall

x25

Page 105: Kafkaesque days at linked in in 2015

Remediation

Friday night ⇒ roll back to x25 and debug later

… but SREs had to babysit the rollback

x38 x38 x38

Rolling downgrade

x25

Move leaders

Page 106: Kafkaesque days at linked in in 2015

Attempts at reproducing the issue
● Test cluster
  ○ Tried killing the controller
  ○ Multiple rolling bounces
  ○ Could not reproduce
● Upgraded the queuing cluster to x38 again
  ○ Could not reproduce
● So nothing…

Page 107: Kafkaesque days at linked in in 2015

Unraveling queue backups…

Page 108: Kafkaesque days at linked in in 2015

Life-cycle of a Kafka request

[Diagram: client connections arrive at the network layer; an Acceptor hands new connections to Processors, which read requests into a shared request queue and write responses from per-processor response queues. API handlers in the API layer pick requests off the request queue and handle them; long-poll requests wait in Purgatory, and the quota manager holds requests when a quota is violated.]

Total time = queue-time + local-time + remote-time + quota-time + response-queue-time + response-send-time
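The total-time formula can be checked directly against the stage times the broker reports; a small helper (hypothetical, with stage names approximating the broker's request-log fields):

```python
def total_time(t):
    """Total request time is the sum of the per-stage times (in ms)."""
    stages = ("requestQueueTime", "localTime", "remoteTime",
              "quotaTime", "responseQueueTime", "responseSendTime")
    return sum(t.get(s, 0) for s in stages)

# Breakdown shaped like a broker "Completed request" log line:
breakdown = {"requestQueueTime": 6589, "localTime": 0, "remoteTime": 0,
             "responseQueueTime": 0, "responseSendTime": 0}
print(total_time(breakdown))   # 6589 -- all of it spent waiting in the request queue
```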

Page 109: Kafkaesque days at linked in in 2015

Investigating high request times
● First look for high local time
  ○ then high response send time
  ○ then high remote (purgatory) time → generally a non-issue (but caveats described later)
● High request queue / response queue times are effects, not causes
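That triage order can be written down directly (a sketch; the threshold and names are arbitrary):

```python
def likely_bottleneck(t, threshold_ms=1000):
    """Check stages in the order the slide suggests: local time first, then
    response-send time, then remote (purgatory) time. Queue times are
    symptoms, so they are deliberately not checked."""
    for stage in ("localTime", "responseSendTime", "remoteTime"):
        if t.get(stage, 0) > threshold_ms:
            return stage
    return "only queue times high: look for an upstream cause"

print(likely_bottleneck({"localTime": 4000, "requestQueueTime": 9000}))  # localTime
```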

Page 110: Kafkaesque days at linked in in 2015

High local times during incident (e.g., fetch)

Page 111: Kafkaesque days at linked in in 2015

How are fetch requests handled?
● Get physical offsets to be read from the local log during the response
● If fetch from follower (i.e., replica fetch):
  ○ If the follower was out of the ISR and just caught up, expand the ISR (ZooKeeper write)
  ○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., consumer fetch):
  ○ Record/update byte-rate of this client
  ○ Throttle the request on quota violation

Page 112: Kafkaesque days at linked in in 2015

Could these cause high local times?
● Get physical offsets to be read from the local log during the response → should be fast
● If fetch from follower (i.e., replica fetch):
  ○ ISR expansion (ZooKeeper write) → should be fast
  ○ Satisfying delayed produce requests → not using acks -1
● Else (i.e., consumer fetch):
  ○ Record/update byte-rate of this client → test this…
  ○ Throttle on quota violation → delayed outside API thread

Page 113: Kafkaesque days at linked in in 2015

Quota metrics
● Maintains byte-rate metrics on a per-client-id basis

2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589,requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,securityProtocol:PLAINTEXT,principal:ANONYMOUS

??! (the entire 6589 ms of total time was spent in the request queue)

Page 114: Kafkaesque days at linked in in 2015

Quota metrics - a quick benchmark

for (clientId ← 0 until N) {
  timer.time {
    quotaMetrics.recordAndMaybeThrottle(clientId, 0, DefaultCallBack)
  }
}

Page 115: Kafkaesque days at linked in in 2015

Quota metrics - a quick benchmark

Page 116: Kafkaesque days at linked in in 2015

Quota metrics - a quick benchmark

Fixed in KAFKA-2664
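A rough way to see why per-call cost climbs with the number of client-ids is a toy model in which every record call touches each registered sensor. This is a hypothetical simplification for illustration, not the actual code path fixed in KAFKA-2664:

```python
# Toy model of the slowdown: suppose each recordAndMaybeThrottle call
# does work proportional to the number of registered per-client-id
# sensors. (Hypothetical model, not Kafka's implementation.)

class NaiveQuotaMetrics:
    def __init__(self):
        self.sensors = {}  # client-id -> accumulated bytes

    def record_and_maybe_throttle(self, client_id, value):
        self.sensors[client_id] = self.sensors.get(client_id, 0) + value
        # Pathological step: touch every registered sensor on each call.
        return len(self.sensors)  # sensors touched by this call

metrics = NaiveQuotaMetrics()
cost = [metrics.record_and_maybe_throttle("client-%d" % i, 0)
        for i in range(5000)]

# Per-call cost grows linearly with the number of distinct client-ids:
print(cost[99], cost[4999])  # 100 5000
```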

Page 117: Kafkaesque days at linked in in 2015

Meanwhile, in our queuing cluster… due to climbing client-id counts

Page 118: Kafkaesque days at linked in in 2015

● Rolling bounce of the cluster forced the issue to recur on brokers that had high client-id metric counts
  ○ Used jmxterm to check per-client-id metric counts before the experiment
  ○ Hooked up a profiler to verify during the incident
    ■ Generally avoid profiling/heap dumps in production due to interference
  ○ Did not see this in the earlier rolling bounce due to only a few client-id metrics at the time

Page 119: Kafkaesque days at linked in in 2015

How to fix high local times
● Optimize the request's handling, e.g.:
  ○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901 and KAFKA-1356)
● Make it asynchronous
  ○ E.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if the response depends on some condition); but be aware of the caveats:
  ○ Higher memory pressure if the request purgatory size grows
  ○ Expired requests are handled in the purgatory expiration thread (which is good)
  ○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests, then local time can increase for the satisfying request
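The last caveat can be demonstrated with a toy delayed operation: completion callbacks run on whichever thread satisfies the operation, so the satisfying request's API thread absorbs the work. This is a hypothetical sketch, not Kafka's DelayedOperation machinery:

```python
import threading

# Toy "purgatory" caveat: a delayed operation completes on the thread
# that satisfies it, so that thread's local time absorbs the work.

class DelayedOp:
    def __init__(self, on_complete):
        self.on_complete = on_complete
        self.done = False

    def try_complete(self):
        # Note: runs on the CALLER's thread, not a dedicated one.
        if not self.done:
            self.done = True
            self.on_complete(threading.current_thread().name)

completed_on = []
delayed_ops = [DelayedOp(completed_on.append) for _ in range(3)]

def satisfying_request():
    # e.g., a produce request whose log append satisfies delayed fetches
    for op in delayed_ops:
        op.try_complete()

api_thread = threading.Thread(target=satisfying_request, name="api-thread-7")
api_thread.start()
api_thread.join()
print(completed_on)  # ['api-thread-7', 'api-thread-7', 'api-thread-7']
```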

Page 120: Kafkaesque days at linked in in 2015

Monitor these closely!
● Request queue size
● Response queue sizes
● Request latencies:
  ○ Total time
  ○ Local time
  ○ Response send time
  ○ Remote time
● Request handler pool idle ratio

Page 121: Kafkaesque days at linked in in 2015

Breaking compatibility

Page 122: Kafkaesque days at linked in in 2015

The first incident: new clients, old clusters

[Diagram: a Test cluster (old version) and a Certification cluster (old version) each send metric events to a Metrics cluster (old version).]

Page 123: Kafkaesque days at linked in in 2015

The first incident: new clients, old clusters

[Diagram: the Test cluster is upgraded to the new version; the Certification and Metrics clusters remain on the old version. The Test cluster's new-version clients sending metric events to the old Metrics cluster fail with:]

org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'throttle_time_ms': java.nio.BufferUnderflowException
	at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73)
	at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:397)
	...
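The underlying failure mode is easy to reproduce in miniature: the new client's response schema expects a throttle_time_ms field that the old broker never wrote, so decoding runs off the end of the buffer. A simplified sketch (not the actual Kafka wire format):

```python
import struct

# A newer client expects an extra int32 'throttle_time_ms' that an
# older broker never writes, so decoding underflows the buffer.
# (Simplified layout for illustration only.)

old_broker_response = struct.pack(">i", 42)  # correlation id only
NEW_SCHEMA = ">ii"                           # correlation id + throttle_time_ms

try:
    struct.unpack(NEW_SCHEMA, old_broker_response)
    raised = False
except struct.error:
    # analogous to the BufferUnderflowException in the stack trace above
    raised = True
print(raised)  # True
```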

Page 124: Kafkaesque days at linked in in 2015

New clients, old clusters: remediation

[Diagram: the Test cluster and Certification cluster are now both on the new version, still sending metric events to the old-version Metrics cluster.]

Set acks to zero
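The workaround works because a producer with acks=0 never receives a produce response, so there is nothing for the new client to mis-parse. Sketched here as a plain config mapping using the same keys the Java producer accepts; the bootstrap address is hypothetical:

```python
# Interim workaround from the slide: with acks=0 the broker sends no
# produce response at all, so a new-version client never has to parse
# the old broker's response format.
producer_config = {
    "bootstrap.servers": "metrics-cluster:9092",  # hypothetical address
    "acks": "0",  # workaround while the downstream cluster is still old
}

# Once the metrics cluster is upgraded, restore the original setting:
upgraded_config = dict(producer_config, **{"acks": "1"})
print(producer_config["acks"], upgraded_config["acks"])  # 0 1
```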

Page 125: Kafkaesque days at linked in in 2015

New clients, old clusters: remediation

[Diagram: the Test, Certification, and Metrics clusters are all on the new version.]

Reset acks to 1

Page 126: Kafkaesque days at linked in in 2015

New clients, old clusters: remediation
(BTW this just hit us again with the protocol changes in KIP-31/KIP-32)

KIP-35 would help a ton!

Page 127: Kafkaesque days at linked in in 2015

The second incident: new endpoints

ZooKeeper registration:

x14 older broker versions:
{"version": 1, "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092}

x14:
{"version": 2, "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092,
 "endpoints": ["plaintext://localhost:9092"]}

old client: ignore endpoints
client: v2 ⇒ use endpoints

Page 128: Kafkaesque days at linked in in 2015

The second incident: new endpoints

ZooKeeper registration:

x14 older broker versions:
{"version": 1, "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092}

x36:
{"version": 2, "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092,
 "endpoints": ["plaintext://localhost:9092"]}

x14:
{"version": 2, "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092,
 "endpoints": ["plaintext://localhost:9092", "ssl://localhost:9093"]}

old client:
java.lang.IllegalArgumentException: No enum constant org.apache.kafka.common.protocol.SecurityProtocol.SSL
	at java.lang.Enum.valueOf(Enum.java:238)
	at org.apache.kafka.common.protocol.SecurityProtocol.valueOf(SecurityProtocol.java:24)

Page 129: Kafkaesque days at linked in in 2015

New endpoints: remediation

x14 older broker versions:
{"version": 1, "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092}

x36:
{"version": 2, "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092,
 "endpoints": ["plaintext://localhost:9092"]}

x14:
{"version": 1 (changed from 2), "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092,
 "endpoints": ["plaintext://localhost:9092", "ssl://localhost:9093"]}

old client: v1 ⇒ ignore endpoints

Page 130: Kafkaesque days at linked in in 2015

New endpoints: remediation

x14 older broker versions:
{"version": 1, "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092}

x36:
{"version": 2, "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092,
 "endpoints": ["plaintext://localhost:9092"]}

x14:
{"version": 1 (changed from 2), "jmx_port": 9999, "timestamp": 2233345666, "host": "localhost", "port": 9092,
 "endpoints": ["plaintext://localhost:9092", "ssl://localhost:9093"]}

old client: v1 ⇒ ignore endpoints
x36 client: v1 ⇒ use endpoints if present

Page 131: Kafkaesque days at linked in in 2015

New endpoints: remediation
● Fix in KAFKA-2584
● Also related: KAFKA-3100
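The version-based negotiation can be sketched as follows; the JSON layout follows the slides, and the helper functions are hypothetical illustrations of the old and patched client behaviors (see KAFKA-2584 for the real fix):

```python
import json

# Sketch of the remediation: brokers re-register with "version": 1 so
# unpatched clients ignore "endpoints", while patched clients use
# endpoints whenever present. (Hypothetical client code, simplified
# registration layout.)

registration = json.loads("""
{"version": 1, "jmx_port": 9999, "timestamp": 2233345666,
 "host": "localhost", "port": 9092,
 "endpoints": ["plaintext://localhost:9092", "ssl://localhost:9093"]}
""")

def old_client_endpoints(reg):
    # unpatched client: v1 means host/port only, endpoints ignored
    if reg["version"] == 1:
        return ["plaintext://%s:%d" % (reg["host"], reg["port"])]
    return reg["endpoints"]

def patched_client_endpoints(reg):
    # patched client: use endpoints if present, regardless of version
    return reg.get("endpoints") or old_client_endpoints(reg)

print(old_client_endpoints(registration))
print(patched_client_endpoints(registration))
```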

Page 132: Kafkaesque days at linked in in 2015

Power outage

Page 133: Kafkaesque days at linked in in 2015

Widespread FS corruption after power outage
● Mount settings at the time:
  ○ type ext4 (rw,noatime,data=writeback,commit=120)
● Restarts were successful, but brokers subsequently hit corruption
● Subsequent restarts also hit corruption in index files
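For reference, the mount options from the slide as an /etc/fstab sketch next to a more conservative alternative. The device and mount point are hypothetical; data=ordered with a short commit interval trades throughput for a smaller window of unjournaled data after power loss, so verify against your own durability needs:

```shell
# Options at the time of the incident (from the slide):
#   ext4  rw,noatime,data=writeback,commit=120
# data=writeback journals metadata only, and commit=120 lets up to two
# minutes of journal data accumulate — a large exposure on power loss.
# A more conservative sketch (hypothetical device/mount point):
/dev/sdb1  /kafka-logs  ext4  rw,noatime,data=ordered,commit=5  0 2
```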

Page 134: Kafkaesque days at linked in in 2015

Summary

Page 135: Kafkaesque days at linked in in 2015

● Monitoring beyond per-broker/controller metrics
  ○ Validate SLAs
  ○ Continuously test admin functionality (in test clusters)
● Automate release validation
● https://github.com/linkedin/streaming

[Diagram: Kafka monitor — a monitor instance runs a producer and a consumer against the Kafka cluster and derives ackLatencyMs, e2eLatencyMs, duplicateRate, retryRate, failureRate, lossRate, and an availability %.]

Page 136: Kafkaesque days at linked in in 2015

● Monitoring beyond per-broker/controller metrics
  ○ Validate SLAs
  ○ Continuously test admin functionality (in test clusters)
● Automate release validation
● https://github.com/linkedin/streaming

[Diagram: Kafka monitor — the same producer/consumer monitor instance as on the previous slide, plus additional monitor instances that exercise AdminUtils: checkReassign and checkPLE.]

Page 137: Kafkaesque days at linked in in 2015

Q&A

Page 138: Kafkaesque days at linked in in 2015

We are hiring! LinkedIn Data Infrastructure meetup

Software developers and Site Reliability Engineers at all levels

Streams infrastructure @ LinkedIn
● Kafka pub-sub ecosystem
● Stream processing platform built on Apache Samza
● Next Gen Change capture technology (incubating)

Contact: Kartik Paramasivam
Where: LinkedIn campus, 2061 Stierlin Ct., Mountain View, CA
When: May 11 at 6:30 PM
Register: http://bit.ly/1Sv8ach