Introduction to Apache Kafka and Real-Time ETL for Oracle DBAs and Data Analysts

About Myself

• Gwen Shapira – System Architect @Confluent

• Committer @ Apache Kafka, Apache Sqoop

• Author of “Hadoop Application Architectures”, “Kafka – The Definitive Guide”

• Previously:

• Software Engineer @ Cloudera

• Oracle ACE Director

• Senior Consultant @ Pythian

• DBA @ Mercury Interactive

• Find me: gwen@confluent.io, @gwenshap

There’s a Book on That!

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log, turned into a stream data platform.

An Optical Illusion

We’ll talk about:

• Redo Logs

• Pub-sub message queues

• So what is Kafka?

• Awesome use-cases for Kafka

• Data streams and real-time ETL

• Where you can learn more

Transaction Log: The most crucial structure for recovery operations … store all changes made to the database as they occur.

Important Point

The redo log is the only reliable source of information about the current state of the database.

Redo Log is used for

• Recover consistent state of a database

• Replicate the database (Dataguard, Streams, GoldenGate…)

If you look far enough into archive logs – you can reconstruct the entire database
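To make the "replay the log, rebuild the database" idea concrete, here is a minimal Python sketch. The record layout `(operation, key, value)` and the function name `replay` are assumptions made for illustration; this is not Oracle's redo format.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical change record: (operation, key, value). Not Oracle's redo format.
Change = Tuple[str, str, Optional[str]]


def replay(log: Iterable[Change]) -> Dict[str, str]:
    """Rebuild table state by applying every logged change in order."""
    state: Dict[str, str] = {}
    for op, key, value in log:
        if op in ("insert", "update"):
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


log = [
    ("insert", "emp:1", "Alice"),
    ("insert", "emp:2", "Bob"),
    ("update", "emp:1", "Alicia"),
    ("delete", "emp:2", None),
]
current = replay(log)  # current == {"emp:1": "Alicia"}
```

Replaying only a prefix of the log gives the state at that earlier point in time, which is roughly what point-in-time recovery relies on.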

Publish-Subscribe Messaging

The Problem

We have more user requests than database connections

What do we do?

1. “Your call is important to us…”

2. Show them static content

3. Distract them with funny cat photos

4. Prioritize them

5. Acknowledge them and handle the request later

6. Jump them to the head of the line

7. Tell them how long the wait is

Solution – Message Queue

[Diagram labels: Application Business Layer; Application Data Layer; DataSource; JDBC Driver Connection; DataSource Interface; Message]

Queues decouple systems: Database is protected from user load
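A small sketch of "acknowledge them and handle the request later", using only Python's standard library. The pool size, the `handled` list standing in for the database, and the names `db_worker` and `run` are all assumptions for illustration.

```python
import queue
import threading

MAX_DB_CONNECTIONS = 2         # assumed size of the connection pool
request_queue = queue.Queue()  # buffers requests the database cannot take yet
handled = []                   # stands in for rows written to the database
handled_lock = threading.Lock()


def db_worker():
    """One worker per database connection; pulls requests at its own pace."""
    while True:
        request = request_queue.get()
        if request is None:    # sentinel: no more work for this worker
            break
        with handled_lock:
            handled.append(request)


def run(num_requests):
    """Queue num_requests requests and return them once the workers are done."""
    handled.clear()
    workers = [threading.Thread(target=db_worker)
               for _ in range(MAX_DB_CONNECTIONS)]
    for worker in workers:
        worker.start()
    # The application can acknowledge each request as soon as it is queued.
    for i in range(num_requests):
        request_queue.put(f"request-{i}")
    for _ in workers:
        request_queue.put(None)
    for worker in workers:
        worker.join()
    return sorted(handled)
```

However many requests arrive, the database only ever sees `MAX_DB_CONNECTIONS` concurrent workers.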

Raise your hand if this sounds familiar

“My next project was to get a working Hadoop setup…

Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy algorithms.”

--Jay Kreps, Kafka PMC

Data pipelines start like this.

Then we reuse them.

Then we add consumers to the existing sources.

Then it starts to look like this.

With maybe some of this.

[Diagrams: Client, Source, Backend and Another Backend boxes, with more connections at each step]

Queues decouple systems: Adding new systems doesn’t require changing existing systems

This is where we are trying to get

Kafka decouples Data Pipelines

[Diagram: Source Systems as Producers; Kafka Brokers in the middle; Hadoop, Security Systems, Real-time monitoring and Data Warehouse as Consumers]

Important notes:

• Producers and Consumers don’t need to know about each other

• Performance issues on Consumers don’t impact Producers

• Consumers are protected from herds of Producers

• Lots of flexibility in handling load

• Messages are available for anyone – lots of new use cases, monitoring, audit, troubleshooting

http://www.slideshare.net/gwenshap/queues-pools-caches

That’s nice, but what is Kafka?

Kafka provides a fast, distributed, highly scalable, highly available publish-subscribe messaging system. In turn, this solves part of a much harder problem: communication and integration between components of large software systems.

The Basics

• Messages are organized into topics

• Producers push messages

• Consumers pull messages

• Kafka runs in a cluster. Nodes are called brokers
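The push/pull split can be sketched with a toy in-memory broker. `ToyBroker`, `push` and `pull` are hypothetical names for this sketch, not the Kafka client API.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only list per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def push(self, topic, message):
        """Producer side: append the message and return its offset."""
        log = self.topics[topic]
        log.append(message)
        return len(log) - 1

    def pull(self, topic, offset, max_messages=10):
        """Consumer side: read from whatever position the consumer asks for."""
        return self.topics[topic][offset:offset + max_messages]


broker = ToyBroker()
broker.push("clicks", "page=/home")
broker.push("clicks", "page=/cart")
recent = broker.pull("clicks", offset=1)  # recent == ["page=/cart"]
```

Because consumers pull, a slow consumer just asks for data later; the broker never has to push faster than a consumer can read.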

Topics, Partitions and Logs

Each partition is a log

Each Broker has many partitions

[Diagram: brokers, each holding Partition 0, Partition 1 and Partition 2]

Producers load balance between partitions

[Diagram: a Client producing to Partition 0, Partition 1 and Partition 2 spread across the brokers]
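One way a producer could spread messages over partitions is sketched below. This is an illustration, not Kafka's own partitioner: `zlib.crc32` stands in for the hash function, and `choose_partition` and `NUM_PARTITIONS` are hypothetical names.

```python
import itertools
import zlib

NUM_PARTITIONS = 3                # assumed partition count for the topic
_round_robin = itertools.count()  # shared counter for messages without a key


def choose_partition(key=None, num_partitions=NUM_PARTITIONS):
    """Pick a partition: hash the key if there is one, else rotate."""
    if key is None:
        return next(_round_robin) % num_partitions
    # crc32 is a stand-in hash chosen for this sketch only.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Hashing the key sends every message for the same key to the same partition, which matters because ordering is only kept within a partition.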

Consumers

[Diagram: a Kafka Cluster with Partition A (File), Partition B (File) and Partition C (File), read by Consumers in Consumer Group X and Consumer Group Y]

Order is retained within a partition, but not across partitions

Offsets are kept per consumer group
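A toy model of per-group offsets: every group reads the same log but keeps its own position. In this sketch the partition object stores the offsets itself to keep things short; `ToyPartition` and `poll` are hypothetical names.

```python
class ToyPartition:
    """One ordered log plus a read position for each consumer group."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # consumer group name -> next offset to read

    def append(self, message):
        self.log.append(message)

    def poll(self, group, max_messages=2):
        """Return the next messages for this group and advance its offset."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


partition = ToyPartition()
for message in ["a", "b", "c"]:
    partition.append(message)

first_x = partition.poll("X")   # first_x == ["a", "b"]
first_y = partition.poll("Y")   # first_y == ["a", "b"]; Y is unaffected by X
second_x = partition.poll("X")  # second_x == ["c"]
```

Reading does not remove anything from the log, so adding Consumer Group Y costs Consumer Group X nothing.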

Kafka “Magic” – Why is it so fast?

• 250M Events per sec on one node at 3ms latency

• Scales to any number of consumers

• Stores data for a set amount of time, without tracking who read what data

• Replicates – but no need to sync to disk

• Zero-copy writes from memory / disk to network
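The retention point can be sketched as follows: data expires by age alone, so the log needs no per-reader bookkeeping. `RetainedLog` is a hypothetical class; the sketch drops individual messages for simplicity, whereas a real broker works on whole log files.

```python
import time


class RetainedLog:
    """Keeps messages for a fixed time, regardless of who has read them."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.entries = []  # (timestamp, message) pairs, oldest first

    def append(self, message, now=None):
        timestamp = time.time() if now is None else now
        self.entries.append((timestamp, message))

    def expire(self, now=None):
        """Drop everything older than the retention window."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        self.entries = [(ts, m) for ts, m in self.entries if ts >= cutoff]

    def messages(self):
        return [m for _, m in self.entries]
```

The `now` parameter exists only so the example can be run with fixed timestamps instead of waiting for real time to pass.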

How do people use Kafka?

• As a message bus

• As a buffer for replication systems (like Advanced Queuing in Oracle Streams)

• As a reliable feed for event processing

• As a buffer for event processing

• Decouple apps from database (both OLTP and DWH)

My Favorite Use Cases

• Shops consume inventory updates

• Clicking around an online shop? Your clicks go to Kafka and recommendations come back.

• Flagging credit card transactions as fraudulent

• Flagging game interactions as abuse

• Least favorite: Surge pricing in Uber

• Huge list of users at kafka.apache.org

Got it! But what about real-time ETL?

Remember This?

[Diagram: Source Systems as Producers; Kafka Brokers in the middle; Hadoop, Security Systems, Real-time monitoring and Data Warehouse as Consumers]

Kafka is smack in the middle of all data pipelines

If data flies into Kafka in real time, why wait 24h before pulling it into a DWH?

Why does Kafka make real-time ETL better?

• Can integrate with any data source: RDBMS, NoSQL, applications, web applications, logs

• Consumers can be real-time, but they do not have to be

• Reading and writing to/from Kafka is cheap, so this is a great place to store intermediate state

• You can fix mistakes by rereading some of the data again: same data in the same order

• Adding more pipelines / aggregations has no impact on source systems = low risk
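A sketch of "fix mistakes by rereading": the raw log stays untouched, so a corrected transform can be rerun over the same data in the same order. The sample records and the names `parse_v1`, `parse_v2` and `run_pipeline` are made up for illustration.

```python
# Raw (key, value) records as they were originally written; never modified.
raw_log = [("order-1", "10.50"), ("order-2", "7,25"), ("order-3", "3.00")]


def parse_v1(value):
    """First attempt: does not handle a comma as the decimal separator."""
    return float(value)


def parse_v2(value):
    """Fixed version: accepts both '.' and ',' as the decimal separator."""
    return float(value.replace(",", "."))


def run_pipeline(log, parse, start_offset=0):
    """Re-read the log from start_offset and apply the given parser."""
    output = []
    for offset in range(start_offset, len(log)):
        key, value = log[offset]
        try:
            output.append((key, parse(value)))
        except ValueError:
            output.append((key, None))  # record could not be parsed
    return output


first_run = run_pipeline(raw_log, parse_v1)  # order-2 comes out as None
fixed_run = run_pipeline(raw_log, parse_v2)  # same input, corrected output
```

The source system is never asked to resend anything; only the consumer side is rerun.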

It is all valuable data

[Diagram: Raw data becomes Clean data, then Aggregated, Enriched and Filtered data, feeding a Dashboard, a Report, a Data scientist and Alerts]

But wait, how do we process the data?

• However you want: you just consume data, modify it, and produce it back

• Built into Kafka: Kprocessor, Kstream

• Popular choices: Storm, Spark Streaming
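The "consume data, modify it, and produce it back" pattern can be sketched as two small stages, each reading one stream and writing another. Plain Python lists stand in for topics here; the event fields and the names `clean` and `aggregate` are assumptions for illustration.

```python
import json


def clean(raw_messages):
    """Raw -> clean: drop malformed events and normalise the fields."""
    cleaned = []
    for raw in raw_messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON: skip it
        if not isinstance(event, dict) or "user" not in event:
            continue  # missing the field we need: skip it
        cleaned.append({
            "user": event["user"].strip().lower(),
            "amount": float(event.get("amount", 0)),
        })
    return cleaned


def aggregate(clean_events):
    """Clean -> aggregated: total amount per user."""
    totals = {}
    for event in clean_events:
        totals[event["user"]] = totals.get(event["user"], 0.0) + event["amount"]
    return totals


raw_topic = [
    '{"user": " Ann ", "amount": "10"}',
    "not json",
    '{"user": "ann", "amount": 5}',
    '{"amount": 3}',
]
clean_topic = clean(raw_topic)
totals = aggregate(clean_topic)  # totals == {"ann": 15.0}
```

Each intermediate result (`clean_topic` here) is itself a stream that other consumers could read.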

Need More Kafka?

• https://kafka.apache.org/documentation.html

• My video tutorial: http://shop.oreilly.com/product/0636920038603.do

• http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/

• Our website: http://confluent.io

• Oracle guide to real-time ETL: http://www.oracle.com/technetwork/middleware/data-integrator/overview/best-practices-for-realtime-data-wa-132882.pdf
