Kafka aws
ariel-moskovich
Building a Kafka cluster in AWS that will survive an AZ crash
A little bit about our production
● 2.5 Billion requests per day and growing
● Located at AWS
● Microservice architecture
● Kafka is our main message bus
● Most of the code is written in Clojure
● Almost all of the services are consuming and/or producing from/to Kafka
What this lecture includes
● Quick overview of Kafka
● Why did we choose Kafka?
● Decisions to make when building a Kafka cluster
● Planning for fault tolerance
● Setting the defaults
● Automating the fault tolerance
● Reassigning partitions and changing retention on the fly
● Adding metrics
● Testing the cluster
● Demo of managing the cluster
Quick Kafka overview
● Open-source message bus developed by LinkedIn
● Designed as a distributed system
● Offers high throughput for both publishing and subscribing
● Persists messages on disk
● Supports multi-subscribers and automatically balances the consumers during failure
TERMS
● A stream of messages of a particular type is defined as a Topic
● A Message is defined as a payload, and a Topic is a category to which messages are published
● A Producer is anything that publishes messages to a Topic
● Published messages are stored on a set of servers called Brokers, which together form the cluster
● A Consumer can subscribe to one or more Topics and consume messages from the brokers
Why Kafka?
● It fits our architecture of event streams: most of our services consume, run logic, and then act or produce a new event
● We need a resilient solution since it's our main message bus
● It scales out nicely
● The same messages are often consumed by different services, which Kafka enables natively: messages are not deleted when consumed, only after the retention period
● Our large and growing message volume requires high throughput
Key decisions when building a cluster
● Which instance type to use?
● How many brokers do we need?
● How to spread brokers between AZs?
● What are the right defaults regarding retention, number of partitions, replication factor, flush intervals, etc.?
● What are the right settings for each topic?
● Log directories split up
● Zookeeper ensemble size
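For the "number of partitions" question, one common sizing heuristic (not from the slides; the throughput numbers below are made up) is to pick enough partitions that neither the producers nor the consumers become the bottleneck. A minimal sketch:

```python
import math

def suggest_partition_count(target_mb_s, producer_mb_s, consumer_mb_s):
    """Heuristic: partitions >= target throughput divided by the slower of
    per-partition producer and consumer throughput."""
    return max(math.ceil(target_mb_s / producer_mb_s),
               math.ceil(target_mb_s / consumer_mb_s))

# Hypothetical numbers: 100 MB/s target, 10 MB/s per producer partition,
# 20 MB/s per consumer partition -> consumers are fine, producers dominate
print(suggest_partition_count(100, 10, 20))
```

Remember that partitions are cheap but not free (more files, more leader elections), so measure the per-partition numbers rather than guessing them.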
Planning fault tolerance
● Launch enough brokers to support failures
● Spread brokers between AZs
● Set the replication factor to match at least the number of AZs
● Guarantee that each partition is spread between all configured AZs
● Make sure that Zookeeper instances are spread between AZs
● Add automation to add new brokers fast
● Add alerts for failures
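The "each partition is spread between all configured AZs" guarantee can be expressed as a small check. A sketch in Python — the broker ids and AZ names are hypothetical:

```python
def covers_all_azs(replica_brokers, broker_az, azs):
    """True if a partition's replica set has at least one broker in every AZ.

    replica_brokers: list of broker ids assigned to the partition
    broker_az: dict mapping broker id -> AZ name
    azs: the set of AZs the cluster spans
    """
    return {broker_az[b] for b in replica_brokers} == set(azs)

# Hypothetical 6-broker cluster, two brokers per AZ
broker_az = {0: "us-east-1a", 1: "us-east-1b", 2: "us-east-1c",
             3: "us-east-1a", 4: "us-east-1b", 5: "us-east-1c"}
azs = {"us-east-1a", "us-east-1b", "us-east-1c"}

print(covers_all_azs([0, 1, 2], broker_az, azs))  # one replica per AZ
print(covers_all_azs([0, 3, 1], broker_az, azs))  # two replicas in the same AZ
```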
Automate the fault tolerance
● Auto-calculate the spread of brokers per topic that guarantees at least one broker in each AZ, and evenly spread partitions between the brokers to balance load.
● Generate a JSON file with the data above that is compatible with Kafka's reassignment format.
● Script the above steps per topic, taking the topic and partitions as parameters.
● Apply the automation above to newly created topics as well.
● Automate broker addition with Chef and a prepared AWS AMI.
● Enable auto leader rebalance
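The JSON-generation step can be sketched roughly as follows. This is an illustration, not AppsFlyer's actual script: the output follows the JSON format expected by Kafka's kafka-reassign-partitions.sh tool, while the topic name and broker/AZ layout are hypothetical.

```python
import json
from itertools import cycle

def build_reassignment(topic, num_partitions, brokers_by_az):
    """Build a plan in Kafka's reassign-partitions JSON format:
    one replica per AZ, brokers rotated within each AZ to balance load,
    and the AZ order rotated per partition so the first replica (the
    preferred leader) is spread across AZs."""
    az_cycles = {az: cycle(brokers) for az, brokers in brokers_by_az.items()}
    azs = sorted(brokers_by_az)
    partitions = []
    for p in range(num_partitions):
        order = azs[p % len(azs):] + azs[:p % len(azs)]  # rotate leader AZ
        replicas = [next(az_cycles[az]) for az in order]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}

plan = build_reassignment("events", 4, {"a": [0, 3], "b": [1, 4], "c": [2, 5]})
print(json.dumps(plan, indent=2))
```

The resulting file would be fed to the reassignment tool with --execute; with auto leader rebalance enabled, leadership drifts back to the first replica in each list.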
DEMO
● Show basic commands / scripts / Kafka usage
● Show internal scripts that automate the split of a topic's brokers between AZs
● Show how to reassign partitions and keep fault tolerance
● Show how to change retention
● Show how to run a console consumer for testing
● Show the AppsFlyer cluster / KafkaWebView / metrics / dashboards
Collecting metrics, building dashboards and alerts
● Metrics are sent to statsd and graphite via the Airbnb reporter
● Additional application metrics are sent by an internal service that measures lag
● We created a dashboard with all the relevant metrics
● Alerts are set up on the relevant metrics
● A health check is set up for each broker
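The lag metric that the internal service measures can be illustrated with a minimal sketch (the offset numbers are made up; the real service reads them from Kafka):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition consumer lag: how far a group's committed offset
    trails the partition's log end offset.
    Both arguments map partition -> offset."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

# Hypothetical two-partition topic: partition 0 is 20 messages behind
lag = consumer_lag({0: 1500, 1: 980}, {0: 1480, 1: 980})
print(lag)
```

A steadily growing lag on any partition is exactly the kind of signal worth alerting on.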
Testing the cluster
● Build a dashboard to see the testing effects
● Stopping one or two brokers
● Killing an entire AZ
● Stopping one ZK node
● Reassigning partitions at runtime
● Changing retention at runtime
● Generating additional load and checking performance
● Doing combinations of all the above
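The "kill an entire AZ" scenario can be sanity-checked offline before running it for real: given the current partition assignment, verify that losing every broker in one AZ still leaves a live replica per partition. A sketch, assuming a hypothetical three-broker, three-AZ layout:

```python
def survives_az_loss(assignment, broker_az, lost_az):
    """True if every partition keeps at least one live replica after
    all brokers in lost_az go down.

    assignment: dict partition -> list of replica broker ids
    broker_az: dict broker id -> AZ name
    """
    return all(any(broker_az[b] != lost_az for b in replicas)
               for replicas in assignment.values())

# Hypothetical layout: one broker per AZ, replication factor 3
broker_az = {0: "a", 1: "b", 2: "c"}
assignment = {0: [0, 1, 2], 1: [1, 2, 0]}
print(survives_az_loss(assignment, broker_az, "a"))
```

Surviving on paper is necessary but not sufficient — the live test also exercises leader elections, client failover, and the extra load on the surviving AZs.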