Kafkaesque days at LinkedIn in 2015
Kafkaesque days at LinkedIn in 2015
Joel Koshy, Kafka Summit 2016
Kafkaesque (adjective) Kaf·ka·esque \ˌkäf-kə-ˈesk, ˌkaf-\
: of, relating to, or suggestive of Franz Kafka or his writings; especially : having a nightmarishly complex, bizarre, or illogical quality
Merriam-Webster
Kafka @ LinkedIn
What @bonkoif said:
More clusters
More use-cases
More problems …
Kafka @ LinkedIn
Incidents that we will cover
● Offset rewinds
● Data loss
● Cluster unavailability
● (In)compatibility
● Blackout
Offset rewinds
What are offset rewinds?
[Diagram of a partition's log: invalid offsets (purged messages), then valid offsets, then invalid offsets (yet-to-arrive messages)]
If a consumer gets an OffsetOutOfRangeException:
auto.offset.reset ← earliest: reset to the earliest valid offset
auto.offset.reset ← latest: reset to the latest offset (the log end)
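The reset behavior above can be sketched as follows (illustrative Python, not the Kafka client's actual code; function and parameter names are hypothetical):

```python
def resolve_fetch_offset(consumer_offset, log_start, log_end, auto_offset_reset):
    """Mimic what happens when a consumer's fetch offset is checked.

    Valid offsets lie in [log_start, log_end); anything below log_start
    has been purged, anything at/above log_end has not arrived yet.
    """
    if log_start <= consumer_offset < log_end:
        return consumer_offset  # offset is valid; fetch proceeds normally
    # OffsetOutOfRangeException path: apply the auto.offset.reset policy
    if auto_offset_reset == "earliest":
        return log_start   # rewind to the oldest retained message (duplicates)
    if auto_offset_reset == "latest":
        return log_end     # skip ahead to the log end (potential data loss)
    raise ValueError("offset out of range and no reset policy configured")

# A consumer holding a purged offset:
print(resolve_fetch_offset(100, log_start=500, log_end=900, auto_offset_reset="earliest"))  # 500
print(resolve_fetch_offset(100, log_start=500, log_end=900, auto_offset_reset="latest"))    # 900
```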
What are offset rewinds… and why do they matter?
[Diagram: a pipeline connecting Kafka (PROD), a mirror maker, Kafka (CORP), a Hadoop push job, Stork, and email campaigns]
Real-life incident (courtesy of xkcd): an offset rewind
Offset rewinds: the first incident
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <[email protected]> Fri, Jul 10, 2015 at 8:27 PM
Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy on Wednesday, Jul 8, 2015 at 10:14 AM
(i.e., a duplicate notification for a deployment that had completed two days earlier)
What are offset rewinds… and why do they matter?
[Same pipeline diagram] Good practice to have some filtering logic here (before emails go out)
Offset rewinds: detection
[Charts of offset monitoring: "just use this"]
Offset rewinds: a typical cause
[Diagram: the consumer's position in the log relative to the valid offset range]
Unclean leader election truncates the log… and the consumer's offset goes out of range
But there were no ULEs when this happened… and we set auto.offset.reset to latest
Offset management - a quick overview
[Diagram: a consumer group of consumers fetching from brokers; consumers send periodic OffsetCommitRequests to the OffsetManager broker, and issue an OffsetFetchRequest after a rebalance]
Offset management - a quick overview
The __consumer_offsets topic stores offset commit messages, e.g.:
mirror-maker PageViewEvent-0 → 240
mirror-maker LoginEvent-8 → 456
mirror-maker LoginEvent-8 → 512
mirror-maker PageViewEvent-0 → 321
● New offset commits append to the topic
● Maintain an offset cache (the latest offset per group/partition) to serve offset fetch requests quickly:
mirror-maker PageViewEvent-0 → 321
mirror-maker LoginEvent-8 → 512
● Purge old offsets via log compaction
● When a new broker becomes the leader (i.e., offset manager) it loads offsets into its cache
See this deck for more details
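The cache described above is essentially "last write wins" over the __consumer_offsets log; a minimal sketch of the load (illustrative Python, not broker code):

```python
# Simulate how an offset manager builds its cache from __consumer_offsets.
# Each record is ((group, topic, partition), offset). Log compaction keeps
# only the latest record per key, and the in-memory cache mirrors that.
offsets_topic = [
    (("mirror-maker", "PageViewEvent", 0), 240),
    (("mirror-maker", "LoginEvent", 8), 456),
    (("mirror-maker", "LoginEvent", 8), 512),
    (("mirror-maker", "PageViewEvent", 0), 321),
]

def load_offsets(records):
    """Replay the log in order; later appends override earlier ones."""
    cache = {}
    for key, offset in records:
        cache[key] = offset
    return cache

cache = load_offsets(offsets_topic)
print(cache[("mirror-maker", "PageViewEvent", 0)])  # 321
print(cache[("mirror-maker", "LoginEvent", 8)])     # 512
```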
Back to the incident…
... <rebalance>
2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205
... <rebalance>
2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223
... <rebalance>
2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737
...
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
While debugging offset rewinds, do this first: dump the offsets topic:
./bin/kafka-console-consumer.sh --topic __consumer_offsets --zookeeper <zookeeperConnect> --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter" --consumer.config config/consumer.properties
(must set exclude.internal.topics=false in consumer.properties)
Inside the __consumer_offsets topic:
...
[mirror-maker,metrics_event,1]::OffsetAndMetadata[83511737,NO_METADATA,1433178005711]
[mirror-maker,some-log_event,13]::OffsetAndMetadata[6811737,NO_METADATA,1433178005711]   ← Jun 1 !!
...
[mirror-maker,some-log_event,13]::OffsetAndMetadata[9581223,NO_METADATA,1436495051231]   ← Jul 10 (today)
...
So why did the offset manager return a stale offset? Offset manager logs:
2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63]
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
    at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)
The leader moved, and the new offset manager hit KAFKA-2117 while loading offsets: the load aborted partway through, so the cache stopped at the old entry:
mirror-maker some-log_event,13 → 6811737
… caused a ton of offset resets
2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205
...
2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223
...
2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737
...
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
[Diagram: [some-log_event,13] log, offsets 846232 … 9581225; the stale offset fell in the purged range]
… but why the duplicate email?
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <[email protected]> Fri, Jul 10, 2015 at 8:27 PM
2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464
...
2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464
...
2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539   ← also from Jun 1
...
[crt-event, 12] spans offsets 0 to 11464, so the stale Jun 1 offset 9539 was weeks old… but still valid! The consumer silently rewound and replayed the deployment notification.
Time-based retention does not work well for low-volume topics
Addressed by KIP-32/KIP-33
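Why a stale offset can still be "valid" on a low-volume topic: a purge only invalidates offsets that fall into deleted segments, and on a low-volume topic nothing may have been deleted yet, so a weeks-old offset silently rewinds instead of triggering OffsetOutOfRange. A sketch (illustrative Python; the high-volume log_start below is hypothetical, chosen only to show the contrast):

```python
def classify_committed_offset(committed, log_start, log_end):
    """On a high-volume topic, an old committed offset falls below log_start
    and triggers an out-of-range reset; on a low-volume topic, an equally old
    offset is still in range and causes a silent rewind (duplicates)."""
    if committed < log_start or committed > log_end:
        return "out-of-range (auto.offset.reset kicks in)"
    return "valid but stale (silent rewind -> duplicate consumption)"

# High-volume topic: the Jun 1 offset was long since purged
# (log_start here is hypothetical, for illustration only)
print(classify_committed_offset(6811737, log_start=8462320, log_end=9581225))
# Low-volume topic like [crt-event,12]: the full log [0, 11464] is retained
print(classify_committed_offset(9539, log_start=0, log_end=11464))
```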
Offset rewinds: the second incident
Mirror makers got wedged; when restarted, they sent duplicate emails to a (few) members
Consumer logs
2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)
Broker (offset manager) logs
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
⇒ the log cleaner had failed a while ago… but why did the offset fetch return -1?
Offset management - a quick overview: how are stale offsets (for dead consumers) cleaned up?
Offset cache:
dead-group PageViewEvent-0 → 321 (timestamp older than a week)
active-group LoginEvent-8 → 512 (recent timestamp)
A periodic cleanup task appends tombstones for dead-group to __consumer_offsets and deletes its entry from the offset cache
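The cleanup task above can be sketched as follows (illustrative Python; a tombstone is modeled as a None value, as in compacted Kafka topics):

```python
import time

WEEK = 7 * 24 * 3600

def cleanup_stale_offsets(offset_cache, offsets_topic, now, retention=WEEK):
    """For each group whose last commit is older than the retention window,
    append a tombstone (key with a null value) to __consumer_offsets so log
    compaction can drop the old commits, and delete the cache entry."""
    for key, (offset, ts) in list(offset_cache.items()):
        if now - ts > retention:
            offsets_topic.append((key, None))  # tombstone
            del offset_cache[key]

now = time.time()
cache = {
    ("dead-group", "PageViewEvent", 0): (321, now - 2 * WEEK),
    ("active-group", "LoginEvent", 8): (512, now - 60),
}
topic = []
cleanup_stale_offsets(cache, topic, now)
print(sorted(cache))   # only active-group remains
print(topic)           # tombstone appended for dead-group
```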
Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
Early in the load, the cache held only the old entries from the head of the log:
mirror-maker PageViewEvent-0 → 45 (very old timestamp)
mirror-maker LoginEvent-8 → 12 (very old timestamp)
The cleanup task happened to run during the load and treated those entries as stale.
The load then reached the recent offsets:
mirror-maker PageViewEvent-0 → 321 (recent timestamp)
mirror-maker LoginEvent-8 → 512 (recent timestamp)
… followed by the tombstones the cleanup task had just appended, which wiped them from the cache.
Root cause of this rewind
● Log cleaner had failed (separate bug)
  ○ ⇒ offsets topic grew big
  ○ ⇒ offset load on leader movement took a while
● Cache cleanup ran during the load
  ○ which appended tombstones
  ○ and overrode the most recent offsets
● (Fixed in KAFKA-2163)
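The race fixed in KAFKA-2163, sketched end to end (illustrative Python): the offset load replays the log in order, and if the cleanup task appends tombstones mid-load, those tombstones land at the log tail and are replayed last, wiping the most recent offsets.

```python
def load_offsets(log, cleanup_after_n):
    """Replay (key, value) records; a None value is a tombstone."""
    cache = {}
    log = list(log)  # the log can grow while we load (cleanup appends to it)
    i = 0
    while i < len(log):
        key, value = log[i]
        if value is None:
            cache.pop(key, None)   # tombstone wipes the cached offset
        else:
            cache[key] = value
        i += 1
        if i == cleanup_after_n:
            # Cleanup runs mid-load: everything loaded so far has very old
            # timestamps, so tombstones are appended at the log end.
            for stale_key in list(cache):
                log.append((stale_key, None))
    return cache

# Old commits at the head of the log, recent commits later:
log = [(("mm", "PageViewEvent", 0), 45), (("mm", "LoginEvent", 8), 12),
       (("mm", "PageViewEvent", 0), 321), (("mm", "LoginEvent", 8), 512)]
# Cleanup fires after the 2 old records load: its tombstones are replayed
# AFTER the recent offsets 321/512, leaving an empty cache.
print(load_offsets(log, cleanup_after_n=2))  # {}
```

With no mid-load cleanup, the same log yields the correct recent offsets; the bug only bites when cleanup interleaves with a (slow) load.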
Offset rewinds: wrapping it up
● Monitor log cleaner health
● If you suspect a rewind:
  ○ Check for unclean leader elections
  ○ Check for offset manager movement (i.e., __consumer_offsets partitions had leader changes)
  ○ Take a dump of the offsets topic
  ○ … stare long and hard at the logs (both consumer and offset manager)
● auto.offset.reset ← closest ?
● Better lag monitoring via Burrow
Critical data loss
[Diagram: four data centers: PROD-A, PROD-B, CORP-X, CORP-Y]
Data loss: the first incident
[Diagram: in each of PROD-A, PROD-B, CORP-X, and CORP-Y, producers write to a local Kafka cluster, which is mirrored into aggregate Kafka clusters; Hadoop consumes from the aggregate clusters in CORP]
Audit trail
[Diagram: alongside the data, audit events record per-tier message counts]
Data loss: detection (example 1)
[Diagram: comparing audit counts across the tiers of the multi-colo pipeline]
Data loss: detection (example 2)
[Diagram: comparing audit counts across the tiers of the multi-colo pipeline]
Data loss or audit issue? (The actual incident)
[Diagram: the multi-colo pipeline with audit counts per tier]
Sporadic discrepancies in Kafka-aggregate-CORP-X counts for several topics
However, the Hadoop-X tier is complete
Verified actual data completeness by recounting events in a few low-volume topics
… so definitely an audit-only issue, likely caused by dropped audit events
Possible sources of discrepancy:
● Cluster auditor
● Cluster itself (i.e., data loss in the audit topic)
● Audit front-end
Possible causes
[Diagram: the cluster auditor consumes all topics from the CORP-X Kafka aggregate cluster and emits audit counts]
Cluster auditor:
● Counting incorrectly
  ○ … but the same version of the auditor runs everywhere and only CORP-X has issues
● Not consuming all data for audit, or failing to send all audit events
  ○ … but no errors in auditor logs
● … and auditor bounces did not help
Data loss in the audit topic:
● … but no unclean leader elections
● … and no data loss in sampled topics (counted manually)
Possible causes
[Diagram: the audit front-end consumes the audit topic (including from CORP-Y) and inserts audit events into the audit DB]
Audit front-end fails to insert audit events into the DB:
● … but other tiers (e.g., CORP-Y) are correct
● … and no errors in logs
Attempt to reproduce
[Diagram: a second cluster auditor instance consumes all topics and emits counts to a new "test" tier, alongside the existing tier=CORP-X auditor]
● Emit counts to a new test tier
… fortunately this reproduced the problem:
● test tier counts were also sporadically off
… and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that the audit event was sent
● … but on closer look realized the receiving broker was not the leader for that partition of the audit topic
● So why did it not return NotLeaderForPartition?
That broker was part of another cluster!
[Diagram: the cluster auditor's tier=test audit events were being siphoned off to some other Kafka cluster]
… and we had a VIP misconfiguration
[Diagram: the VIP in front of the CORP-X aggregate cluster contained a stray broker entry belonging to some other Kafka cluster]
● The auditor still uses the old producer
● It periodically refreshes metadata (via the VIP) for the audit topic
● ⇒ sometimes it fetches metadata from the other cluster
● … and leaks audit events to that cluster until at least the next metadata refresh
So audit events leaked into the other cluster
Some takeaways
● Could have been worse if mirror makers to CORP-X had been bounced
  ○ (since mirror makers could have started siphoning actual data to the other cluster)
● Consider using round-robin DNS instead of VIPs
  ○ … which is also necessary for using per-IP connection limits
Data loss: the second incident
Prolonged period of data loss from our Kafka REST proxy
Alerts fire that a broker in tracking cluster had gone offline
NOC engages SYSOPS to investigate
NOC engages Feed SREs and Kafka SREs to investigate drop (not loss) in a subset of page views
On investigation, Kafka SRE finds no problems with Kafka (excluding the down broker), but notes an overall drop in tracking messages starting shortly after the broker failure
NOC engages Traffic SRE to investigate why their tracking events had stopped
Traffic SRE say that they don’t see errors on their side, and add that they use Kafka REST proxy
Kafka SRE finds no immediate errors in Kafka REST logs but bounces the service as a precautionary measure
Tracking events return to normal (expected) counts after the bounce
Reproducing the issue
[Diagram: a producer performance tool sends to Broker A and Broker B; Broker A is then isolated via iptables]
Reproducing the issue
[Diagram: new-producer internals: the Accumulator batches records per partition (partition 1 … partition n); the Sender drains batches into in-flight requests to the leader broker]
● Leadership for partition 1 moves from Broker A (isolated) to Broker B
● The new producer did not implement a request timeout
  ○ ⇒ it keeps awaiting a response from the old leader
  ○ ⇒ unaware of the leader change until the next metadata refresh
● So the client continues to send to partition 1
● Batches pile up in partition 1 and eat up accumulator memory
● Subsequent sends drop/block per the block.on.buffer.full config
Reproducing the issue
● netstat
tcp 0 0 ::ffff:127.0.0.1:35938 ::ffff:127.0.0.1:9092 ESTABLISHED 3704/java
● Producer metrics
  ○ zero retry/error rate
● Thread dump
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, TimeUnit)
org.apache.kafka.clients.producer.internals.BufferPool.allocate(int)
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback)
● Resolved by KAFKA-2120 (KIP-19)
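KIP-19 (KAFKA-2120) added a request timeout so an in-flight request to a dead or partitioned broker eventually fails, freeing accumulator memory and forcing a metadata refresh. The core idea, sketched in illustrative Python (not the actual producer code; class and parameter names are hypothetical):

```python
class InFlightRequest:
    def __init__(self, partition, sent_at):
        self.partition = partition
        self.sent_at = sent_at
        self.completed = False

def expire_in_flight(requests, now, request_timeout_s=30.0):
    """Fail requests that have waited longer than the request timeout.
    Failing them completes their batches exceptionally (freeing accumulator
    memory) and triggers a metadata refresh, so the producer discovers the
    new partition leader instead of waiting forever on an isolated broker."""
    expired = [r for r in requests
               if not r.completed and now - r.sent_at > request_timeout_s]
    for r in expired:
        requests.remove(r)  # complete exceptionally: free the batch
    return expired

reqs = [InFlightRequest("partition-1", sent_at=0.0)]
expired = expire_in_flight(reqs, now=60.0)
print([r.partition for r in expired])  # ['partition-1']
print(reqs)                            # [] (memory freed, refresh triggered)
```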
Cluster unavailability
(This is an abridged version of my earlier talk.)
The incident
Occurred a few days after upgrading to pick up quotas and SSL
[Timeline from x25 (April 5) to x38 (October 13): various quota patches, multi-port support (KAFKA-1809, KAFKA-1928), and SSL (KAFKA-1690) landed along the way (June 3, August 18)]
The incident
A broker (which happened to be the controller) failed in our queuing Kafka cluster
The incident
Multiple applications begin to report “issues”: socket timeouts to the Kafka cluster
Posts search was one such impacted application
The incident
Two brokers report high request and response queue sizes
Two brokers report high request queue size and request latencies
The incident
● Other observations
  ○ High CPU load on those brokers
  ○ Throughput degrades to ~half the normal throughput
  ○ Tons of broken-pipe exceptions in server logs
  ○ Application owners report socket timeouts in their logs
Remediation
Shifted site traffic to another data center
Kafka outage ⇒ member impact; multi-colo is critical!
Remediation
● Controller moves did not help
● Firewall the affected brokers:
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
● The above helped, but the cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
  ○ … but x25 never exhibited this issue
Remediation
Friday night ⇒ roll back to x25 and debug later
… but SREs had to babysit the rollback: for each x38 broker in turn, move leaders off it, firewall it, downgrade it to x25, then move leaders back
Attempts at reproducing the issue
● Test cluster
  ○ Tried killing the controller
  ○ Multiple rolling bounces
  ○ Could not reproduce
● Upgraded the queuing cluster to x38 again
  ○ Could not reproduce
● So nothing… :(
Unraveling queue backups…
Life-cycle of a Kafka request
[Diagram: Network layer: an Acceptor accepts new connections and hands client connections to Processors; each Processor reads requests into a shared Request queue; API handler threads (the API layer) handle each request, parking long-poll requests in Purgatory and holding quota-violating responses in the Quota manager; responses go onto per-Processor Response queues and are written back to the client]
Total time = queue-time + local-time + remote-time + quota-time + response-queue-time + response-send-time
Investigating high request times
● First look for high local time
  ○ then high response send time
    ■ then high remote (purgatory) time → generally a non-issue (but caveats described later)
● High request-queue/response-queue times are effects, not causes
High local times during incident (e.g., fetch)
How are fetch requests handled?
● Get the physical offsets to be read from the local log during the response
● If the fetch is from a follower (i.e., a replica fetch):
  ○ If the follower was out of the ISR and just caught up, expand the ISR (a ZooKeeper write)
  ○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., a consumer fetch):
  ○ Record/update the byte-rate of this client
  ○ Throttle the request on quota violation
Could these cause high local times?
● Get physical offsets from the local log → should be fast
● Replica fetch: expand the ISR on follower catch-up → should be fast
● Replica fetch: satisfy delayed produce requests → not using acks -1
● Consumer fetch: record/update byte-rate of this client → test this…
● Consumer fetch: throttle on quota violation → delayed outside the API thread
Quota metrics
The broker maintains byte-rate metrics on a per-client-id basis. Yet a trivial fetch spent all its time in the request queue:
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request: Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>; totalTime:6589, requestQueueTime:6589, localTime:0, remoteTime:0, responseQueueTime:0, sendTime:0, securityProtocol:PLAINTEXT, principal:ANONYMOUS ??!
Quota metrics - a quick benchmark
for (clientId ← 0 until N) {
  timer.time {
    quotaMetrics.recordAndMaybeThrottle(clientId, 0, DefaultCallBack)
  }
}
[Benchmark charts: time per record climbs as the number of distinct client-ids grows]
Fixed in KAFKA-2664
Meanwhile in our queuing cluster… (due to climbing client-id counts)
● A rolling bounce of the cluster forced the issue to recur on brokers that had high client-id metric counts
  ○ Used jmxterm to check per-client-id metric counts before the experiment
  ○ Hooked up a profiler to verify during the incident
    ■ Generally avoid profiling/heap dumps in production due to interference
  ○ Did not see this in the earlier rolling bounce because there were only a few client-id metrics at the time
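The symptom behind KAFKA-2664 was that recording a quota metric got slower as distinct client-ids accumulated. A toy reproduction of that shape (illustrative Python, not Kafka's metrics code; it contrasts a per-record scan over all sensors with a constant-time lookup):

```python
import timeit

class NaiveQuotaMetrics:
    """Toy model: finds a client's sensor by scanning all sensors, so each
    record costs O(#client-ids). Cost climbs as unique client-ids
    accumulate (the shape of the benchmark on the slide)."""
    def __init__(self):
        self.sensors = []  # list of (client_id, value)
    def record(self, client_id, value):
        for i, (cid, _) in enumerate(self.sensors):
            if cid == client_id:
                self.sensors[i] = (cid, value)
                return
        self.sensors.append((client_id, value))

class FixedQuotaMetrics:
    """Constant-time lookup per record, independent of client-id count."""
    def __init__(self):
        self.sensors = {}
    def record(self, client_id, value):
        self.sensors[client_id] = value

for cls in (NaiveQuotaMetrics, FixedQuotaMetrics):
    m = cls()
    t = timeit.timeit(lambda: [m.record(str(i), 0) for i in range(2000)], number=1)
    print(cls.__name__, round(t, 3), "s for 2000 distinct client-ids")
```

The real fix lived in Kafka's metrics code; this sketch only shows why per-record cost that scales with the number of client-id sensors hurts brokers serving many short-lived client-ids.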
How to fix high local times
● Optimize the request's handling, e.g.:
  ○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901)
  ○ and KAFKA-1356
● Make it asynchronous
  ○ E.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if the response depends on some condition); but be aware of the caveats:
  ○ Higher memory pressure if the request purgatory size grows
  ○ Expired requests are handled in the purgatory expiration thread (which is good)
  ○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests, local time can increase for the satisfying request
Monitor these closely!
● Request queue size
● Response queue sizes
● Request latencies:
  ○ Total time
  ○ Local time
  ○ Response send time
  ○ Remote time
● Request handler pool idle ratio
Breaking compatibility
The first incident: new clients, old clusters
[Diagram: the test and certification clusters emit metric events to the metrics cluster; the test cluster was upgraded to the new version while the others still ran the old version]
The new client then failed to parse responses from old brokers:
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'throttle_time_ms': java.nio.BufferUnderflowException
  at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73)
  at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:397)
  ...
New clients, old clusters: remediation
[Diagram: test and certification clusters upgraded to the new version; metrics cluster still on the old version]
Set acks to zero (with no response to parse, the schema mismatch is avoided)
New clients, old clusters: remediation
[Diagram: metrics cluster also upgraded to the new version]
Reset acks to 1
(BTW this just hit us again with the protocol changes in KIP-31/KIP-32)
KIP-35 would help a ton!
The second incident: new endpoints
ZooKeeper registration, older broker versions:
{"version":1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092}
x14 brokers:
{"version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": ["plaintext://localhost:9092"]}
Old clients ignore the endpoints field; x14 clients use endpoints when the version is 2.
x36 brokers (with SSL enabled):
{"version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": ["plaintext://localhost:9092", "ssl://localhost:9093"]}
x14 clients then choked on the unknown SSL endpoint:
java.lang.IllegalArgumentException: No enum constant org.apache.kafka.common.protocol.SecurityProtocol.SSL
  at java.lang.Enum.valueOf(Enum.java:238)
  at org.apache.kafka.common.protocol.SecurityProtocol.valueOf(SecurityProtocol.java:24)
New endpoints: remediation
x36 brokers register with "version":1 (instead of 2) while still publishing both endpoints:
{"version":1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": ["plaintext://localhost:9092", "ssl://localhost:9093"]}
● x14 clients: v1 ⇒ ignore endpoints
● x36 clients: v1 ⇒ use endpoints if present
● Fix in KAFKA-2584
● Also related: KAFKA-3100
Power outage
Widespread FS corruption after a power outage
● Mount settings at the time:
  ○ type ext4 (rw,noatime,data=writeback,commit=120)
● Restarts were successful, but brokers subsequently hit corruption
● Subsequent restarts also hit corruption in index files
Summary
● Monitoring beyond per-broker/controller metrics
  ○ Validate SLAs
  ○ Continuously test admin functionality (in test clusters)
● Automate release validation
● https://github.com/linkedin/streaming
Kafka monitor
[Diagram: a monitor instance runs a producer and consumer against the Kafka cluster, tracking ackLatencyMs, e2eLatencyMs, duplicateRate, retryRate, failureRate, lossRate, and Availability %; other monitor instances exercise AdminUtils (checkReassign, checkPLE)]
Q&A
We are hiring! LinkedIn Data Infrastructure meetup
Streams infrastructure @ LinkedIn
● Kafka pub-sub ecosystem
● Stream processing platform built on Apache Samza
● Next-gen change capture technology (incubating)
Who: Software developers and Site Reliability Engineers at all levels
Contact: Kartik Paramasivam
Where: LinkedIn campus, 2061 Stierlin Ct., Mountain View, CA
When: May 11 at 6:30 PM
Register: http://bit.ly/1Sv8ach