Kafka aws

Post on 18-Jul-2015



Building a Kafka cluster in AWS that will survive an AZ crash

A little bit about our production

● 2.5 Billion requests per day and growing

● Hosted on AWS

● Microservice architecture

● Kafka is our main message bus

● Most of the code is written in Clojure

● Almost all of the services are consuming and/or producing from/to Kafka

What this lecture includes

● Quick overview of Kafka

● Why did we choose Kafka?

● Decisions to make when building a Kafka cluster

● Planning for fault tolerance

● Setting the defaults

● Automating the fault tolerance

● Reassigning partitions and changing retention on the fly

● Adding metrics

● Testing the cluster

● Demo for managing the cluster

Quick Kafka overview

● Open source message bus developed by LinkedIn

● Designed as a distributed system

● Offers high throughput for both publishing and subscribing

● Persists messages on disk

● Supports multiple subscribers and automatically rebalances the consumers on failure

TERMS

● A stream of messages of a particular type is defined as a topic

● A message is a payload, and a topic is the category to which messages are published

● A producer is anything that publishes messages to a topic

● Published messages are stored on a set of servers called brokers, which together form the cluster

● A consumer can subscribe to one or more topics and consume messages from the brokers

Why Kafka?

● It fits our architecture for streams of events: most of our services consume, run logic, and then act or produce a new event

● We need a resilient solution since it's our main message bus

● It scales out nicely

● The same messages are often consumed by different services, which Kafka supports natively: messages are not deleted when consumed, only after the retention period

● Our large and growing number of messages requires high throughput
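
To make the "not deleted when consumed" point concrete, here is a minimal in-memory sketch. It is a toy model, not how Kafka is implemented: `ToyLog`, its method names and the consumer group names are all made up for illustration. Each group keeps its own read position, and only the retention period removes messages.

```python
import time


class ToyLog:
    """Toy stand-in for one topic partition: an append-only list of messages."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.base_offset = 0   # offset of the oldest message still kept
        self.entries = []      # (timestamp, payload) tuples
        self.positions = {}    # consumer group -> next offset to read

    def produce(self, payload, now=None):
        self.entries.append((time.time() if now is None else now, payload))

    def consume(self, group):
        """Return all unread payloads for `group`; nothing is deleted."""
        start = max(self.positions.get(group, 0), self.base_offset)
        batch = [p for _, p in self.entries[start - self.base_offset:]]
        self.positions[group] = self.base_offset + len(self.entries)
        return batch

    def expire(self, now):
        """Drop messages older than the retention period, consumed or not."""
        while self.entries and now - self.entries[0][0] > self.retention_seconds:
            self.entries.pop(0)
            self.base_offset += 1


log = ToyLog(retention_seconds=60)
log.produce("install", now=0)
log.produce("click", now=50)

billing = log.consume("billing")      # both groups read the same messages
analytics = log.consume("analytics")

log.expire(now=100)                   # "install" is now older than 60 seconds
late = log.consume("late-joiner")     # only "click" is still retained
```

A group that has already read everything gets an empty batch on its next call, while a new group still sees whatever retention has kept.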

Key decisions when building a cluster

● Which instance type to use?

● How many brokers do we need?

● How to spread brokers between AZ?

● What are the right defaults for retention, number of partitions, replication factor, flush intervals, etc.?

● What are the right settings for each topic?

● Log directories split up

● Zookeeper ensemble size
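
As a rough illustration of the "how many brokers" question, the sketch below sizes a cluster for disk only. Only the 2.5 billion per day figure comes from the slides. Everything else is an assumption picked for the arithmetic: one Kafka message per request, the average message size, the retention, the replication factor, the disk per broker and the fill limit. Throughput and spare capacity for failures need their own estimate.

```python
MESSAGES_PER_DAY = 2_500_000_000       # from the slides (requests per day)
AVG_MESSAGE_BYTES = 1_000              # assumption
RETENTION_DAYS = 7                     # assumption (Kafka's default retention is 168 hours)
REPLICATION_FACTOR = 3                 # assumption: one replica in each of three AZs
NUM_AZS = 3                            # assumption
DISK_PER_BROKER_BYTES = 4 * 10**12     # assumption: 4 TB of log disk per broker
MAX_DISK_FILL_PERCENT = 60             # assumption: keep disks at most 60% full


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


# Bytes on disk across the cluster once retention is full.
stored_bytes = (
    MESSAGES_PER_DAY * AVG_MESSAGE_BYTES * RETENTION_DAYS * REPLICATION_FACTOR
)

usable_per_broker = DISK_PER_BROKER_BYTES * MAX_DISK_FILL_PERCENT // 100
min_brokers = ceil_div(stored_bytes, usable_per_broker)

# Round up to a multiple of the AZ count so every AZ gets the same number.
brokers = ceil_div(min_brokers, NUM_AZS) * NUM_AZS

print(f"{stored_bytes / 10**12:.1f} TB stored, {min_brokers} brokers minimum, "
      f"{brokers} when spread evenly over {NUM_AZS} AZs")
```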

Planning fault tolerance

● Launch enough brokers to support failures

● Spread brokers between AZs

● Set the replication factor to at least the number of AZs

● Guarantee that each partition is spread across all configured AZs

● Make sure that Zookeeper instances are spread between AZs

● Add automation to add new brokers fast

● Add alerts for failures
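
The "each partition is spread across all AZs" rule is easy to check mechanically. A minimal sketch, assuming the partition assignment and a broker-to-AZ map are already available as plain dictionaries (the broker ids, AZ names and the function name are hypothetical):

```python
def partitions_missing_an_az(assignment, broker_az):
    """Return {partition: [missing AZs]} for partitions whose replicas skip an AZ."""
    all_azs = set(broker_az.values())
    problems = {}
    for partition, replicas in assignment.items():
        covered = {broker_az[broker] for broker in replicas}
        missing = all_azs - covered
        if missing:
            problems[partition] = sorted(missing)
    return problems


# Hypothetical cluster: six brokers, two per AZ.
broker_az = {1: "az-a", 2: "az-a", 3: "az-b", 4: "az-b", 5: "az-c", 6: "az-c"}

# Partition -> replica broker ids.
assignment = {
    0: [1, 3, 5],   # one replica per AZ
    1: [2, 4, 6],   # one replica per AZ
    2: [1, 2, 3],   # two replicas in az-a, none in az-c
}

problems = partitions_missing_an_az(assignment, broker_az)
print(problems)
```

Here partition 2 has a replication factor of 3 but still would not survive as intended, because two of its replicas share an AZ.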

Automate the fault tolerance

● Automatically calculate a spread of brokers per topic that guarantees at least one broker in each AZ and spreads partitions evenly between the brokers to balance load.

● Generate a JSON file from the data above that is compatible with the Kafka reassignment format.

● Script the above steps per topic, taking the topic and its partitions as parameters.

● Apply the same automation to newly created topics.

● Automate broker addition with Chef and a prepared AWS AMI.

● Enable auto leader rebalance
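
The internal scripts are not included in the transcript. The sketch below shows one possible way to do the calculation, not the authors' actual method: it places one replica per AZ, rotates which AZ comes first so the preferred leaders are spread out, and prints JSON in the layout that `kafka-reassign-partitions.sh` reads. The function name, topic name, broker ids and AZ names are hypothetical.

```python
import json
from collections import defaultdict


def az_aware_assignment(topic, num_partitions, broker_az):
    """Build a reassignment plan with one replica of every partition in each AZ."""
    by_az = defaultdict(list)
    for broker, az in sorted(broker_az.items()):
        by_az[az].append(broker)
    azs = sorted(by_az)

    partitions = []
    for p in range(num_partitions):
        replicas = []
        for i in range(len(azs)):
            # Rotate the AZ order so the first replica (the preferred leader)
            # lands in a different AZ for consecutive partitions.
            az = azs[(p + i) % len(azs)]
            brokers = by_az[az]
            # Move to the next broker inside each AZ after a full AZ rotation.
            replicas.append(brokers[(p // len(azs)) % len(brokers)])
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical cluster: six brokers, two per AZ.
broker_az = {1: "az-a", 2: "az-a", 3: "az-b", 4: "az-b", 5: "az-c", 6: "az-c"}

plan = az_aware_assignment("events", 6, broker_az)
print(json.dumps(plan, indent=2))
```

With six partitions on this cluster every broker leads one partition and holds three replicas. The resulting file can be passed to `kafka-reassign-partitions.sh` through `--reassignment-json-file`. Since the first broker in each `replicas` list is the preferred leader, auto leader rebalance (`auto.leader.rebalance.enable`) moves leadership back to that spread after a failure.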


Demo for managing the cluster

● Show basic commands / scripts / Kafka usage

● Show internal scripts that automate splitting a topic's brokers between AZs

● Show how to reassign partitions and keep fault tolerance

● Show how to change retention

● Show how to run the console consumer for testing

● Show AppsFlyer cluster / Kafawebview / metrics / dashboards
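
The demo itself is not part of the transcript. As a rough idea of the commands involved, the sketch below builds (but does not run) argument lists for the stock Kafka command-line tools. The flags are those of recent Kafka releases, which take `--bootstrap-server`; releases from the time of this talk took `--zookeeper` instead and changed retention through `kafka-topics.sh --alter --config`. The broker address, topic name and file name are placeholders.

```python
BOOTSTRAP = "broker1:9092"   # placeholder broker address


def reassign_cmd(json_file):
    """Apply a partition reassignment plan."""
    return ["kafka-reassign-partitions.sh", "--bootstrap-server", BOOTSTRAP,
            "--reassignment-json-file", json_file, "--execute"]


def retention_cmd(topic, retention_ms):
    """Change a topic's retention on the fly."""
    return ["kafka-configs.sh", "--bootstrap-server", BOOTSTRAP, "--alter",
            "--entity-type", "topics", "--entity-name", topic,
            "--add-config", f"retention.ms={retention_ms}"]


def console_consumer_cmd(topic):
    """Read a topic from the start, for testing."""
    return ["kafka-console-consumer.sh", "--bootstrap-server", BOOTSTRAP,
            "--topic", topic, "--from-beginning"]


commands = [
    reassign_cmd("plan.json"),
    retention_cmd("events", 24 * 60 * 60 * 1000),   # one day in milliseconds
    console_consumer_cmd("events"),
]
for cmd in commands:
    print(" ".join(cmd))
```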

Collecting metrics, building dashboards and alerts

● Metrics are sent to statsd and Graphite via the Airbnb reporter

● Additional application metrics are sent by an internal service that measures lag

● We create a dashboard with all the relevant metrics

● Alerts are set up on the relevant metrics

● A health check is set up for each broker
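
The internal lag service is not shown in the slides. As a sketch of the idea only: consumer lag per partition is the newest offset in the log minus the offset the consumer group has committed, and the result can be formatted as StatsD gauge lines (`name:value|g`). The offsets are assumed to have been fetched already, a missing commit is treated as offset 0 for simplicity, and the metric prefix and function names are hypothetical.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }


def statsd_gauges(prefix, lag_by_partition):
    """Format lags as StatsD gauge lines, one per partition plus a total."""
    lines = [
        f"{prefix}.partition.{partition}.lag:{lag}|g"
        for partition, lag in sorted(lag_by_partition.items())
    ]
    lines.append(f"{prefix}.total_lag:{sum(lag_by_partition.values())}|g")
    return lines


# Hypothetical offsets for one consumer group on a three-partition topic.
log_end = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1500, 1: 1100, 2: 650}

lag = consumer_lag(log_end, committed)
lines = statsd_gauges("kafka.lag.billing.events", lag)
print("\n".join(lines))
```

Each line can then be sent as a UDP datagram to the StatsD daemon, and an alert on the total gauge catches a consumer that has stopped keeping up.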


Testing the cluster

● Build a dashboard to see the effects of the tests

● Stop one or two brokers

● Kill an entire AZ

● Stop one ZK node

● Reassign partitions at runtime

● Change retention at runtime

● Generate additional load and check performance

● Do combinations of all the above
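
Before killing a real AZ it helps to write down what is expected to happen, so the dashboards can be compared against it. A minimal sketch, assuming the partition assignment and broker-to-AZ map are available as dictionaries (all ids and names are hypothetical): it lists which replicas survive a given AZ failure and which partitions would have none left.

```python
def surviving_replicas(assignment, broker_az, failed_azs):
    """Return {partition: replicas that are outside the failed AZs}."""
    failed = set(failed_azs)
    return {
        partition: [b for b in replicas if broker_az[b] not in failed]
        for partition, replicas in assignment.items()
    }


def offline_partitions(assignment, broker_az, failed_azs):
    """Partitions with no replica left once the given AZs are down."""
    alive = surviving_replicas(assignment, broker_az, failed_azs)
    return sorted(p for p, replicas in alive.items() if not replicas)


# Hypothetical cluster: six brokers, two per AZ.
broker_az = {1: "az-a", 2: "az-a", 3: "az-b", 4: "az-b", 5: "az-c", 6: "az-c"}
assignment = {
    0: [1, 3, 5],   # spread over all three AZs
    1: [2, 4, 6],   # spread over all three AZs
    2: [1, 2],      # both replicas in az-a
}

after_az_a = surviving_replicas(assignment, broker_az, ["az-a"])
lost = offline_partitions(assignment, broker_az, ["az-a"])
print(after_az_a, lost)
```

In this example partitions 0 and 1 keep two replicas each when az-a goes down, while partition 2 goes offline, which is exactly the case the AZ-aware spread is meant to prevent.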