Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
![Page 1: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/1.jpg)
How We Built an Event-Time Merge of Two Kafka-Streams with Spark Streaming
Ralf Sigmund, Sebastian Schröder
Otto GmbH & Co. KG
![Page 2: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/2.jpg)
Agenda
• Who are we?
• Our Setup
• The Problem
• Our Approach
• Requirements
• Advanced Requirements
• Lessons learned
![Page 3: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/3.jpg)
Who are we?
• Ralf
– Technical Designer Team Tracking
– Cycling, Surfing
– Twitter: @sistar_hh
• Sebastian
– Developer Team Tracking
– Taekwondo, Cycling
– Twitter: @Sebasti0n
![Page 4: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/4.jpg)
Who are we working for?
• 2nd largest ecommerce company in Germany
• more than 1 million visitors every day
• our blog: https://dev.otto.de
• our jobs: www.otto.de/jobs
![Page 5: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/5.jpg)
Our Setup
[Diagram: otto.de, Varnish Reverse Proxy, Tracking-Server, Vertical 0 … Vertical n, Service 0 … Service n]
![Page 6: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/6.jpg)
The Problem
[Diagram: otto.de, Varnish Reverse Proxy, Tracking-Server, Vertical 0 … Vertical n, Service 0 … Service n; annotations: "Has a limit on the size of messages" and "Needs to send large tracking messages"]
![Page 7: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/7.jpg)
Our Approach
[Diagram: otto.de, Varnish Reverse Proxy, Tracking-Server, Vertical 0 … Vertical n, S0 … Sn, Spark Service]
![Page 8: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/8.jpg)
Requirements
![Page 9: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/9.jpg)
Messages with the same key are merged
[Diagram: input messages (t: 1, id: 1), (t: 3, id: 2), (t: 2, id: 3), (t: 0, id: 2), (t: 4, id: 0), (t: 2, id: 3); after the merge they are grouped by id: id 1 (t: 1), id 2 (t: 0, t: 3), id 3 (t: 2, t: 2), id 0 (t: 4)]
![Page 10: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/10.jpg)
Messages have to be timed out
[Diagram: input messages (t: 1, id: 1), (t: 3, id: 2), (t: 2, id: 3), (t: 0, id: 2), (t: 4, id: 0), (t: 2, id: 3), grouped by id after the merge; annotations: "max(business event timestamp)" and "time-out after 3 sec"]
![Page 11: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/11.jpg)
Event Streams might get stuck
[Diagram: input messages (t: 1, id: 1), (t: 3, id: 2), (t: 2, id: 3), (t: 0, id: 2), (t: 4, id: 0), (t: 2, id: 3), grouped by id after the merge; each stream has its own clock (clock A, clock B)]
clock: min(clock A, clock B)
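The effect of taking the minimum can be sketched in plain Scala, without Spark. The names `ClockSketch`, `Clocks` and `advance` are hypothetical, and the sketch assumes that each topic's clock is the largest event time seen so far on that topic.

```scala
object ClockSketch {
  // Running clocks of the two source topics
  // (assumption: largest event time seen so far on each topic).
  final case class Clocks(a: Long, b: Long) {
    // The combined clock follows the slower topic.
    def combined: Long = math.min(a, b)

    // Fold the event times of one micro batch into the running clocks.
    // An empty batch leaves that topic's clock where it was.
    def advance(batchA: Seq[Long], batchB: Seq[Long]): Clocks =
      Clocks((a +: batchA).max, (b +: batchB).max)
  }
}
```

If topic B gets stuck, its clock stops moving and so does the combined clock, however far topic A runs ahead. Nothing is timed out on the basis of only one topic's progress.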
![Page 12: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/12.jpg)
There might be no messages
[Diagram: messages (t: 1, id: 1), (t: 4, id: 2), (t: 2, id: 3) and a heart beat with t: 5; annotation: "time-out after 3 sec"]
clock: min(clock A, clock B) : 4
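A minimal sketch of how heart beats could keep the clock moving, in plain Scala. All names are hypothetical. Two things are assumptions: which topic carries which message, and that a message counts as timed out once its age reaches the time-out (inclusive bound).

```scala
object HeartbeatSketch {
  sealed trait Msg { def t: Long }
  final case class Event(t: Long, id: Int) extends Msg
  final case class HeartBeat(t: Long) extends Msg

  // Heart beats carry no payload but still advance the clock of their topic.
  def clock(topic: Seq[Msg]): Option[Long] =
    if (topic.isEmpty) None else Some(topic.map(_.t).max)

  // Ids of events whose age, measured against min(clock A, clock B),
  // has reached the time-out.
  def timedOut(a: Seq[Msg], b: Seq[Msg], timeout: Long): Seq[Int] = {
    val combined = for (ca <- clock(a); cb <- clock(b)) yield math.min(ca, cb)
    combined.toSeq.flatMap { now =>
      (a ++ b).collect { case Event(t, id) if now - t >= timeout => id }
    }
  }
}
```

With the events (t: 1, id: 1), (t: 4, id: 2), (t: 2, id: 3) on one topic and only a heart beat with t: 5 on the other, the combined clock is min(4, 5) = 4, as on the slide. Without the heart beat the second topic would have no clock at all and nothing could be timed out.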
![Page 13: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/13.jpg)
Requirements Summary
• Messages with the same key are merged
• Messages are timed out by the combined event time of both source topics
• Timed out messages are sorted by event time
![Page 14: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/14.jpg)
Solution
![Page 15: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/15.jpg)
Merge messages with the same Id
• updateStateByKey, which is defined on DStreams of key-value pairs
• It applies a function to all new messages and the existing state for a given key
• Returns a StateDStream with the resulting messages
![Page 16: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/16.jpg)
UpdateStateByKey
[Diagram: DStream[InputMessage] with keys 1, 2, 3; History RDD[MergedMessage] with keys 1, 2, 3; resulting DStream[MergedMessage] with keys 1, 2, 3]
UpdateFunction: (Seq[InputMessage], Option[MergedMessage]) => Option[MergedMessage]
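An update function with this signature might look like the following sketch. The slides name the types `InputMessage` and `MergedMessage` but do not show their fields, so the fields here are hypothetical, as is the choice to keep the earliest event time of a merged message.

```scala
object MergeSketch {
  // Hypothetical fields; only the type names come from the slides.
  final case class InputMessage(id: Int, eventTime: Long, payload: String)
  final case class MergedMessage(id: Int, eventTime: Long, payloads: Vector[String])

  // New messages for one key + previous state for that key => new state.
  def update(newMessages: Seq[InputMessage],
             state: Option[MergedMessage]): Option[MergedMessage] =
    newMessages.foldLeft(state) { (acc, msg) =>
      acc match {
        case None =>
          Some(MergedMessage(msg.id, msg.eventTime, Vector(msg.payload)))
        case Some(merged) =>
          // Assumption: the merged message keeps the earliest event time.
          Some(merged.copy(
            eventTime = math.min(merged.eventTime, msg.eventTime),
            payloads = merged.payloads :+ msg.payload))
      }
    }
}
```

In a Spark Streaming job, a function of this shape could be handed to `updateStateByKey` on a `DStream[(Int, InputMessage)]`. Returning `None` would drop the key from the state.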
![Page 17: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/17.jpg)
Merge messages with the same Id
![Page 18: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/18.jpg)
Merge messages with the same Id
![Page 19: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/19.jpg)
Flag And Remove Timed Out Msgs
[Diagram: DStream[InputMessage], History RDD[MergedMessage], UpdateFunction ("Flag newly timed out", "needs global Business Clock"), "Filter (timed out)", "Filter !(timed out)", DStream[MergedMessage]; messages are labelled MM and MMTO]
![Page 20: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/20.jpg)
Messages are timed out by event time
• We need access to all messages from the current micro batch and the history (state) => Custom StateDStream
• Compute the maximum event time of both topics and supply it to the updateStateByKey function
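One way to supply the clock to the update function is to build a fresh function per micro batch that closes over the clock value. The sketch below is plain Scala with hypothetical names (`TimeoutSketch`, `updateWith`, `split`). It assumes that a merged message is judged by its earliest event time and that the time-out bound is inclusive.

```scala
object TimeoutSketch {
  final case class Input(id: Int, eventTime: Long)
  final case class Merged(id: Int, eventTime: Long, timedOut: Boolean)

  // Same shape as the update function used for merging.
  type Update = (Seq[Input], Option[Merged]) => Option[Merged]

  // The business clock is computed before the update runs and is
  // captured in the returned function.
  def updateWith(clock: Long, timeoutSec: Long): Update =
    (incoming, state) => {
      val ids = incoming.map(_.id) ++ state.map(_.id).toSeq
      val times = incoming.map(_.eventTime) ++ state.map(_.eventTime).toSeq
      ids.headOption.map { id =>
        val earliest = times.min
        // Flag the message once its age has reached the time-out.
        Merged(id, earliest, timedOut = clock - earliest >= timeoutSec)
      }
    }

  // Timed out messages leave the state, sorted by event time;
  // the rest stays as history for the next micro batch.
  def split(state: Seq[Merged]): (Seq[Merged], Seq[Merged]) = {
    val (timedOut, pending) = state.partition(_.timedOut)
    (timedOut.sortBy(_.eventTime), pending)
  }
}
```

The closure keeps the update function's signature unchanged, so the time-out logic needs no extra parameter at the call site.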
![Page 21: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/21.jpg)
Messages are timed out by event time
![Page 22: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/22.jpg)
Messages are timed out by event time
![Page 23: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/23.jpg)
Messages are timed out by event time
![Page 24: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/24.jpg)
Messages are timed out by event time
![Page 25: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/25.jpg)
Advanced Requirements
![Page 26: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/26.jpg)
Effect of different request rates in source topics
[Chart: one panel each for topic A and topic B, axis ticks 0, 5K, 10K; annotations: "catching up with 5K per microbatch" and "data read too early"]
![Page 27: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/27.jpg)
Handle different request rates in topics
• Stop reading from one topic if the event time of other topic is too far behind
• Store latest event time of each topic and partition on driver
• Use custom KafkaDStream with clamp method which uses the event time
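The decision behind such a clamp can be sketched in plain Scala. This is not the custom KafkaDStream from the talk. `ClampSketch`, `topicTime`, `mayRead` and `maxLead` are hypothetical names, and the sketch assumes that a topic's event time is that of its slowest partition.

```scala
object ClampSketch {
  // Latest event time seen per (topic, partition), as it might be kept on the driver.
  type EventTimes = Map[(String, Int), Long]

  // Event time a topic has reached (assumption: its slowest partition).
  def topicTime(times: EventTimes, topic: String): Option[Long] = {
    val ts = times.collect { case ((t, _), time) if t == topic => time }
    if (ts.isEmpty) None else Some(ts.min)
  }

  // A topic may be read in the next micro batch unless it is already more
  // than maxLead ahead of the other topic in event time.
  def mayRead(times: EventTimes, topic: String, other: String, maxLead: Long): Boolean =
    (topicTime(times, topic), topicTime(times, other)) match {
      case (Some(mine), Some(theirs)) => mine - theirs <= maxLead
      case _ => true // assumption: with nothing known yet, do not block
    }
}
```

A real implementation would turn this decision into the offset range for the next micro batch, for example by leaving the until-offset of the paused topic unchanged.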
![Page 28: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/28.jpg)
Handle different request rates in topics
![Page 29: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/29.jpg)
Handle different request rates in topics
![Page 30: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/30.jpg)
Handle different request rates in topics
![Page 31: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/31.jpg)
Handle different request rates in topics
![Page 32: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/32.jpg)
Guarantee at least once semantics
[Diagram: Spark Merge Service with its Current State and the Sink Topic; messages are labelled with offset o and topic t: (o: 1, t: a), (o: 7, t: a), (o: 2, t: a), (o: 5, t: b), (o: 9, t: b), (t: 6, t: b); messages in the Sink Topic carry "a-offset: 1, b-offset: 5"; labels: "Timeout" and "On Restart/Recovery: Retrieve last message"]
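The numbers on this slide are consistent with one reading: every message written to the sink topic carries, per source topic, the smallest offset still held in the state, and after a restart the service reads the last sink message and resumes from those offsets. The following plain-Scala sketch rests on that assumption. All type and function names are hypothetical.

```scala
object RecoverySketch {
  // A message still waiting in the state, with its source topic and offset.
  final case class Pending(topic: String, offset: Long)

  // Hypothetical sink record: payload plus the source offsets to resume from.
  final case class SinkRecord(payload: String, offsets: Map[String, Long])

  // Smallest offset per topic that is still needed to rebuild the state.
  def resumeOffsets(state: Seq[Pending]): Map[String, Long] =
    state.groupBy(_.topic).map { case (topic, msgs) =>
      topic -> msgs.map(_.offset).min
    }

  // On restart: take the offsets from the last sink record, if there is one.
  def recover(sink: Seq[SinkRecord], defaults: Map[String, Long]): Map[String, Long] =
    sink.lastOption.map(_.offsets).getOrElse(defaults)
}
```

Re-reading from these offsets can emit some messages a second time, but none is lost, which is what at least once means.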
![Page 33: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/33.jpg)
Guarantee at least once semantics
![Page 34: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/34.jpg)
Guarantee at least once semantics
![Page 35: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/35.jpg)
Everything put together
![Page 36: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/36.jpg)
Lessons learned
• Excellent Performance and Scalability
• Extensible via Custom RDDs and Extension to DirectKafka
• No event time windows in Spark Streaming. See: https://github.com/apache/spark/pull/2633
• Checkpointing cannot be used when deploying new artifact versions
• Driver/executor model is powerful, but also complex
![Page 37: Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund](https://reader031.fdocuments.in/reader031/viewer/2022021422/586f75da1a28ab10258b61f9/html5/thumbnails/37.jpg)
THANK YOU.
Twitter: @sistar_hh @Sebasti0n
Blog: dev.otto.de
Jobs: otto.de/jobs