Introduction to Kafka Streams
Transcript of Introduction to Kafka Streams
Kafka Streams: Stream Processing Made Simple with Kafka
1
Guozhang Wang, Hadoop Summit, June 28, 2016
2
What is NOT Stream Processing?
3
Stream Processing isn’t (necessarily)
• Transient, approximate, lossy…
• .. something for which you must keep batch processing as a safety net
4
5
6
7
8
Stream Processing
• A different programming paradigm
• .. that brings computation to unbounded data
• .. with tradeoffs between latency / cost / correctness
9
Why Kafka in Stream Processing?
10
• Persistent Buffering
• Logical Ordering
• Scalable “source-of-truth”
Kafka: Real-time Platforms
11
Stream Processing with Kafka
12
• Option I: Do It Yourself!
Stream Processing with Kafka
13
• Option I: Do It Yourself!
Stream Processing with Kafka
while (isRunning) {
  // read some messages from Kafka
  inputMessages = consumer.poll();
  // do some processing…
  // send output messages back to Kafka
  producer.send(outputMessages);
}
14
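The loop on the slide can be fleshed out as a plain-Java simulation. This is a hedged sketch, not Kafka client code: in-memory queues stand in for Kafka topics, and the class and method names (DiyLoopSketch, run) are illustrative. With a real broker you would use KafkaConsumer.poll() and KafkaProducer.send() instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Minimal stand-in for the DIY consume-process-produce loop on the slide.
// An in-memory queue plays the role of the input topic, a list the output topic.
public class DiyLoopSketch {
    public static List<String> run(Queue<String> inputTopic) {
        List<String> outputTopic = new ArrayList<>();
        boolean isRunning = true;
        while (isRunning) {
            // "poll" a batch of messages
            List<String> inputMessages = new ArrayList<>();
            for (int i = 0; i < 10 && !inputTopic.isEmpty(); i++) {
                inputMessages.add(inputTopic.poll());
            }
            if (inputMessages.isEmpty()) {
                isRunning = false; // a real service would keep polling
                continue;
            }
            // "process": uppercase each message
            for (String msg : inputMessages) {
                outputTopic.add(msg.toUpperCase());
            }
        }
        return outputTopic;
    }

    public static void main(String[] args) {
        Queue<String> in = new ArrayDeque<>(List.of("hello", "kafka"));
        System.out.println(run(in)); // [HELLO, KAFKA]
    }
}
```

Note what the loop does not handle: offset commits, rebalancing, state, retries. That gap is exactly what the next slides enumerate.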
15
• Ordering
• Partitioning & Scalability
• Fault tolerance
DIY Stream Processing is Hard
• State Management
• Time, Window & Out-of-order Data
• Re-processing
16
• Option I: Do It Yourself!
• Option II: full-fledged stream processing system
• Storm, Spark, Flink, Samza, ..
Stream Processing with Kafka
17
MapReduce Heritage?
• Config Management
• Resource Management
• Deployment
• etc..
Can I just use my own?!
20
• Option I: Do It Yourself!
• Option II: full-fledged stream processing system
• Option III: lightweight stream processing library
Stream Processing with Kafka
Kafka Streams
• In Apache Kafka since v0.10, May 2016
• Powerful yet easy-to-use stream processing library
• Event-at-a-time, Stateful
• Windowing with out-of-order handling
• Highly scalable, distributed, fault tolerant
• and more..
21
22
Anywhere, anytime
Ok. Ok. Ok. Ok.
23
Anywhere, anytime
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>0.10.0.0</version>
</dependency>
24
Anywhere, anytime
War File
Rsync
Puppet/Chef
YARN
Mesos
Docker
Kubernetes
Very Uncool … Very Cool
25
Simple is Beautiful
Kafka Streams DSL
26
public static void main(String[] args) {
  KStreamBuilder builder = new KStreamBuilder();
  // specify the processing topology by first reading in a stream from a topic
  KStream<String, String> words = builder.stream("topic1");
  // count the words in this stream as an aggregated table
  KTable<String, Long> counts = words.countByKey("Counts");
  // write the result table to a new topic
  counts.to("topic2");
  // create a stream processing instance and start running it
  KafkaStreams streams = new KafkaStreams(builder, config);
  streams.start();
}
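What countByKey computes can be sketched in a few lines of plain Java. This is an illustrative simulation (the class name CountByKeySketch and the in-memory map are assumptions): the real Streams runtime additionally handles partitioning, the state store, and fault tolerance.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the aggregation behind words.countByKey("Counts"):
// for each incoming record key, increment a running count in a state store.
public class CountByKeySketch {
    public static Map<String, Long> countByKey(List<String> keys) {
        Map<String, Long> store = new HashMap<>(); // stands in for the "Counts" store
        for (String key : keys) {
            store.merge(key, 1L, Long::sum); // the updated (key, count) is emitted downstream
        }
        return store;
    }

    public static void main(String[] args) {
        System.out.println(countByKey(List.of("alice", "bob", "alice")));
    }
}
```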
32
Native Kafka Integration
Properties cfg = new Properties();
cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
cfg.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
cfg.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
cfg.put(KafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "registry:8081");
StreamsConfig config = new StreamsConfig(cfg);
…
KafkaStreams streams = new KafkaStreams(builder, config);
33
34
API, coding
“Full stack” evaluation
Operations, debugging, …
35
Simple is Beautiful
36
Key Idea:
Outsource hard problems to Kafka!
Kafka Concepts: the Log
[Diagram: an append-only log of messages; the producer writes at the tail while Consumer1 reads at offset 7 and Consumer2 reads at offset 10]
Kafka Concepts: the Log
[Diagram: brokers host the partitions of Topic 1 and Topic 2; producers write to them and consumers read from them]
39
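The log picture above can be sketched with a list standing in for one partition: producers append at the tail, and each consumer tracks its own read offset independently. The class LogSketch is illustrative, not a Kafka API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a single partition log: appends go to the tail, and every
// consumer reads at whatever offset it has reached, independently of others.
public class LogSketch {
    private final List<String> log = new ArrayList<>();

    public long append(String message) {   // producer write
        log.add(message);
        return log.size() - 1;             // offset of the new record
    }

    public String read(long offset) {      // consumer read at its own offset
        return log.get((int) offset);
    }

    public static void main(String[] args) {
        LogSketch partition = new LogSketch();
        partition.append("m0");
        partition.append("m1");
        partition.append("m2");
        // Consumer1 at offset 1 and Consumer2 at offset 2 see different records
        System.out.println(partition.read(1) + " " + partition.read(2));
    }
}
```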
Kafka Streams: Key Concepts
Stream and Records
40
[Diagram: a stream is an ordered, unbounded sequence of key-value records]
Processor Topology
41
Stream
Processor Topology
42
Stream Processor
Processor Topology
43
KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");
Processor Topology
Processor Topology
48
Source Processors: KStream<..> stream1 = builder.stream(…); KStream<..> stream2 = builder.stream(…)
Sink Processor: aggregated.to(…)
Processor Topology
49
KStream<..> stream1 = builder.stream("topic1");
KTable<..> stream2 = builder.table("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");
builder.addSource("Source1", "topic1")
       .addSource("Source2", "topic2")
       .addProcessor("Join", MyJoin::new, "Source1", "Source2")
       .addProcessor("Aggregate", MyAggregate::new, "Join")
       .addStateStore(Stores.persistent().build(), "Aggregate")
       .addSink("Sink", "topic3", "Aggregate");
Processor Topology
Processor Topology
53
[Diagram: the processor topology runs inside the Kafka Streams library, reading from and writing to Kafka]
Processor Topology
54
…
sink1.to("topic1");
source1 = builder.table("topic1");
source2 = sink1.through("topic2");
…
Processor Topology
55
Sub-Topology
Processor Topology
59
[Diagram, slides 59–62: the topology, split into sub-topologies at the intermediate topics, executes inside the Kafka Streams library against Kafka]
Stream Partitions and Tasks
63
[Diagram, slides 63–67: partitions P1 and P2 of Kafka Topic A and Kafka Topic B are grouped into stream tasks Task1 and Task2, each running its own copy of the processor topology]
Stream Threads
68
[Diagram, slides 68–77: tasks Task1–Task4 are distributed across application instances MyApp.1, MyApp.2, MyApp.3 and stream threads Thread1, Thread2; tasks migrate automatically as instances are added or removed]
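The assignment rule behind these diagrams is simple: partition i of every input topic is grouped into task i, so the number of tasks equals the maximum partition count, and tasks are the unit of parallelism. A plain-Java sketch of that rule (TaskAssignmentSketch and its naming scheme are illustrative, not the runtime's internal code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of partition-to-task grouping: partition i of each input topic
// is assigned to task i.
public class TaskAssignmentSketch {
    public static Map<Integer, List<String>> assign(Map<String, Integer> topicPartitions) {
        Map<Integer, List<String>> tasks = new HashMap<>();
        for (Map.Entry<String, Integer> e : topicPartitions.entrySet()) {
            for (int p = 0; p < e.getValue(); p++) {
                tasks.computeIfAbsent(p, k -> new ArrayList<>())
                     .add(e.getKey() + "-P" + p);
            }
        }
        return tasks; // task id -> the partitions it consumes
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = new HashMap<>();
        topics.put("TopicA", 2);
        topics.put("TopicB", 2);
        System.out.println(assign(topics)); // two tasks, one per partition pair
    }
}
```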
78
• Ordering
• Partitioning & Scalability
• Fault tolerance
Stream Processing Hard Parts
• State Management
• Time, Window & Out-of-order Data
• Re-processing
States in Stream Processing
79
• Stateless: filter, map
• Stateful: join, aggregate
80
States in Stream Processing
81
KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");
State
82
builder.addSource("Source1", "topic1")
       .addSource("Source2", "topic2")
       .addProcessor("Join", MyJoin::new, "Source1", "Source2")
       .addProcessor("Aggregate", MyAggregate::new, "Join")
       .addStateStore(Stores.persistent().build(), "Aggregate")
       .addSink("Sink", "topic3", "Aggregate");
State
States in Stream Processing
States in Stream Processing
83
[Diagram: Task1 and Task2 each maintain their own local State store while processing partitions of Kafka Topic A and Kafka Topic B]
It’s all about Time
• Event-time (when an event is created)
• Processing-time (when an event is processed)
84
Event-time:      1    2    3    4    5    6    7
Processing-time: 1999 2002 2005 1997 1980 1983 2015
85
[Figure: the Star Wars films as an out-of-order example: story order (The Phantom Menace, Attack of the Clones, Revenge of the Sith, A New Hope, The Empire Strikes Back, Return of the Jedi, The Force Awakens) differs from release order]
Out-of-Order
Timestamp Extractor
86
public long extract(ConsumerRecord<Object, Object> record) {
  return System.currentTimeMillis(); // processing-time
}
public long extract(ConsumerRecord<Object, Object> record) {
  return record.timestamp(); // event-time
}
Timestamp Extractor
89
public long extract(ConsumerRecord<Object, Object> record) {
  return ((JsonNode) record.value()).get("timestamp").longValue(); // event-time embedded in the payload
}
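The three extractor variants differ only in where the timestamp comes from: the wall clock, the record's own timestamp, or a field inside the payload. A plain-Java sketch of that choice (MyRecord is an illustrative stand-in for Kafka's ConsumerRecord):

```java
// Sketch of the three timestamp choices behind a TimestampExtractor:
// processing-time (wall clock), event-time (the record's timestamp),
// or a timestamp embedded in the message payload itself.
public class TimestampSketch {
    public static class MyRecord {
        public final long timestamp;        // set when the event was produced
        public final long payloadTimestamp; // carried inside the message body
        public MyRecord(long ts, long payloadTs) {
            timestamp = ts;
            payloadTimestamp = payloadTs;
        }
    }

    public static long processingTime(MyRecord r) { return System.currentTimeMillis(); }
    public static long eventTime(MyRecord r)      { return r.timestamp; }
    public static long payloadTime(MyRecord r)    { return r.payloadTimestamp; }

    public static void main(String[] args) {
        MyRecord r = new MyRecord(1000L, 900L);
        System.out.println(eventTime(r) + " " + payloadTime(r)); // 1000 900
    }
}
```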
Windowing
90–96
[Diagram: records arriving on a time axis are grouped into windows; an out-of-order record is added to the window its timestamp falls into, not the window currently open]
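The grouping the windowing diagrams illustrate reduces to simple arithmetic. Assuming plain tumbling windows of size w and millisecond timestamps (a simplification of what the library supports), a record with timestamp t belongs to the window starting at t - (t mod w), which is also why a late record can still be placed in its correct window:

```java
// Sketch of tumbling-window assignment: every timestamp maps to exactly
// one window, identified here by the window's start time.
public class WindowSketch {
    public static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    public static void main(String[] args) {
        // with 1-minute windows, 90500 ms falls into the window starting at 60000
        System.out.println(windowStart(90_500L, 60_000L)); // 60000
    }
}
```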
97
• Ordering
• Partitioning & Scalability
• Fault tolerance
Stream Processing Hard Parts
• State Management
• Time, Window & Out-of-order Data
• Re-processing
Stream vs. Table?
98
KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");
State
99
Tables ≈ Streams
100
101
102
The Stream-Table Duality
• A stream is the changelog of a table
• A table is a materialized view of a stream at a point in time
• Example: change data capture (CDC) in databases
103
KStream = data interpreted as a record stream
~ think: append-only
KTable = data interpreted as a changelog stream
~ a continuously updated materialized view
104
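The duality can be made concrete in a few lines of plain Java: replaying a changelog into a map materializes the table, and every update to the map emits exactly one changelog record back out. DualitySketch is an illustrative toy, not the library's state-store implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the stream-table duality: a table is a changelog replayed into
// a map; updating the map emits the changelog back out.
public class DualitySketch {
    private final Map<String, String> table = new HashMap<>();
    private final List<String> changelog = new ArrayList<>();

    public void update(String key, String value) {
        table.put(key, value);            // table: the latest value wins
        changelog.add(key + "=" + value); // stream: every change is appended
    }

    public Map<String, String> materializedView() { return table; }
    public List<String> changelogStream() { return changelog; }

    public static void main(String[] args) {
        DualitySketch d = new DualitySketch();
        d.update("alice", "lnkd");
        d.update("alice", "msft");
        // the table keeps only the latest value; the changelog keeps both updates
        System.out.println(d.materializedView() + " " + d.changelogStream());
    }
}
```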
105
KStream (User purchase history): (alice, eggs) (bob, lettuce) (alice, milk)
KTable (User employment profile): (alice, lnkd) (bob, googl) (alice, msft)
106
time →
KStream (User purchase history): “Alice bought eggs.”
KTable (User employment profile): “Alice is now at LinkedIn.”
107
time →
KStream (User purchase history): “Alice bought eggs and milk.”
KTable (User employment profile): “Alice is now at Microsoft (no longer LinkedIn).”
108
Records over time: (alice, 2) (bob, 10) (alice, 3)
KStream.aggregate(): (key: Alice, value: 2)
KTable.aggregate(): (key: Alice, value: 2)
109
Records over time: (alice, 2) (bob, 10) (alice, 3)
KStream.aggregate(): (key: Alice, value: 2+3)
KTable.aggregate(): (key: Alice, value: 3, replacing 2)
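The difference the slide shows can be simulated directly: aggregating a record stream adds every record, while aggregating a changelog first retracts the previous value for the key. A plain-Java sketch with a per-key sum (AggregateSketch and its long[] record encoding are illustrative assumptions):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the two aggregation semantics: over a record stream every record
// adds to the total; over a changelog, a new record for the same key replaces
// the old one before the total is taken.
public class AggregateSketch {
    // each record is {key, value}
    public static long streamAggregate(List<long[]> records) {
        long sum = 0;
        for (long[] kv : records) sum += kv[1]; // every record counts
        return sum;
    }

    public static long tableAggregate(List<long[]> records) {
        Map<Long, Long> latest = new HashMap<>();
        for (long[] kv : records) latest.put(kv[0], kv[1]); // new value replaces old
        long sum = 0;
        for (long v : latest.values()) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // (alice=0, 2), (bob=1, 10), (alice=0, 3), as on the slide
        List<long[]> records = List.of(new long[]{0, 2}, new long[]{1, 10}, new long[]{0, 3});
        // stream: 2 + 10 + 3 = 15; table: 3 + 10 = 13
        System.out.println(streamAggregate(records) + " vs " + tableAggregate(records));
    }
}
```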
110
KStream: map(), filter(), join(), …
KTable: map(), filter(), join(), …
KStream → KTable: reduce(), aggregate(), …
KTable → KStream: toStream()
111
Updates Propagation in KTable
[Diagram, slides 111–114: a new record flows from KStream stream1 and KStream stream2 through KStream joined into KTable aggregated, updating its State store at each step]
115
• Ordering
• Partitioning & Scalability
• Fault tolerance
Stream Processing Hard Parts
• State Management
• Time, Window & Out-of-order Data
• Re-processing
116
Remember?
117
Fault Tolerance
[Diagram, slides 117–119: each Process step keeps a local State store that is continuously logged to a Kafka changelog topic; when an instance fails, the consumer-group protocol reassigns its tasks and the new instance restores the state from the changelog before resuming]
120
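The recovery step in the diagrams relies on the changelog being a complete record of the store's updates: replaying it in order rebuilds the store exactly. A plain-Java sketch, with a list standing in for the changelog topic (RestoreSketch is illustrative, not the library's restore code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of changelog-based state recovery: every store update was appended
// to a changelog; a restarted task rebuilds the store by replaying it in order.
public class RestoreSketch {
    public static Map<String, Long> restore(List<String[]> changelog) {
        Map<String, Long> store = new HashMap<>();
        for (String[] entry : changelog) {
            store.put(entry[0], Long.parseLong(entry[1])); // later entries win
        }
        return store;
    }

    public static void main(String[] args) {
        List<String[]> changelog = List.of(
            new String[]{"alice", "2"},
            new String[]{"bob", "10"},
            new String[]{"alice", "5"}); // the latest value per key survives
        System.out.println(restore(changelog));
    }
}
```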
121
122
123
124
• Ordering
• Partitioning & Scalability
• Fault tolerance
Stream Processing Hard Parts
• State Management
• Time, Window & Out-of-order Data
• Re-processing
125
Simple is Beautiful
Ongoing Work (0.10+)
• Beyond Java APIs
• SQL support, Python client, etc.
• End-to-End Semantics (exactly-once)
• Queryable States
• … and more
126
Queryable States
127
State
Real-time Analytics
select Count(*), Sum(*)
from "MyAgg"
where windowId > now() - 10;
128
But how do we get data into / out of Kafka?
129
130
131
132
Take-aways
• Stream Processing: a new programming paradigm
• Kafka Streams: stream processing made easy
135
THANKS!
Guozhang Wang | [email protected] | @guozhangwang
Visit Confluent at the Syncsort Booth (#1303), live demos @ 29th
Download Kafka Streams: www.confluent.io/product
136
We are Hiring!