Reintroducing the Stream Processor: A universal tool for continuous data analysis
Re-introducing the Stream Processor
A Universal Tool for Continuous Data Analysis
Paris Carbone
Committer @ Apache Flink
PhD Candidate @ KTH
Data Stream Processors
A data stream processor can set up any data pipeline for you
http://edge.alluremedia.com.au/m/l/2014/10/CoolingPipes.jpg
Data Stream Processors
Is this really a step forward in data processing?
General idea of the tech:
• Processes pipeline computation in a cluster
• Computation is continuous and parallel (like data)
• Event-processing logic <-> Application state
• It's production-ready and aims to simplify analytics
A growing open-source ecosystem, e.g., Kafka, Flink, Beam, Apex
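The "event-processing logic <-> application state" point can be illustrated with a minimal, single-process sketch. It is not taken from the talk: `count_per_key` and the `clicks` data are hypothetical names, and a real stream processor would partition the keys across a cluster and run continuously rather than over a small list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def count_per_key(events: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Consume (key, value) events one at a time and emit a running sum per key."""
    # Application state, owned by the operator (hypothetical in-memory stand-in).
    state: Dict[str, int] = defaultdict(int)
    for key, value in events:
        state[key] += value      # the event-processing logic updates the state
        yield key, state[key]    # and the state determines what is emitted


# Hypothetical input: one click event per record.
clicks = [("alice", 1), ("bob", 1), ("alice", 1)]
running = list(count_per_key(clicks))
```

Each incoming event both reads and updates the state, so the output for the second "alice" event depends on the first one.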
4 Aspects of Data Processing
[Diagram labels: event logs; production database; rules; complex event processing; fast approximate streaming; ETL; data warehouses + historical data; application state + failover; "microservices"; complex analytics; large-scale processing systems; interactive queries; data science reports. Roles: dev, user, analyst, data engineer.]
1. Speed: Low-Latency Data Processing
Traditionally the sole reason stream processing was used.
CEP semantics etc. are nowadays provided as additional libraries for stream processors.
How do stream processors achieve low latency?
• No intermediate scheduling (you let it run)
• No physical blocking (pre-compute on the go)
• Copy-on-write for state and output
But is this only relevant for live data?
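The "no physical blocking (pre-compute on the go)" idea can be sketched with chained Python generators. This is an illustrative assumption, not the mechanism of any particular engine: `source`, `square`, `running_sum` and `trace` are hypothetical names, and real stream processors run their operators concurrently rather than in one pull-based thread.

```python
from typing import Iterable, Iterator, List

trace: List[str] = []  # records the order in which stages touch each record


def source(records: Iterable[int]) -> Iterator[int]:
    for r in records:
        trace.append(f"read {r}")
        yield r


def square(stream: Iterable[int]) -> Iterator[int]:
    for r in stream:
        trace.append(f"square {r}")
        yield r * r


def running_sum(stream: Iterable[int]) -> Iterator[int]:
    total = 0
    for r in stream:
        total += r  # pre-compute on the go: the aggregate is always up to date
        trace.append(f"emit {total}")
        yield total


# No stage waits for the full input: each record flows through the
# whole pipeline before the next record is read.
results = list(running_sum(square(source([1, 2, 3]))))
```

The `trace` list shows the first record being read, squared and emitted before the second one is touched, which is the opposite of staging a full intermediate result between steps.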
2. History: Offline Data Processing
It is possible, and better, to use stream processing for bulk historical data analysis.
What can stream processors do for historical data?
• Ability to define custom state to build up models
• Large-scale support is a given (inherits cluster computing benefits)
• Separation of notions of time and out-of-order processing
e.g., session windows, event-time windows
But isn't it hard to deal with failures in streaming?
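The separation of event time from arrival order can be sketched with a small session-window merger. This is a simplified illustration under stated assumptions: a single key, integer timestamps, and no watermarks to decide when a session is final. `Session`, `add_event` and `sessionize` are hypothetical names, not the API of any stream processor.

```python
from typing import Iterable, List, Tuple

Session = Tuple[int, int, int]  # (first event time, last event time, event count)


def add_event(sessions: List[Session], t: int, gap: int) -> List[Session]:
    """Insert an event with event time t, merging sessions at most `gap` apart."""
    start, end, count = t, t, 1
    kept: List[Session] = []
    for s, e, c in sessions:
        if s <= end + gap and start <= e + gap:  # close enough: merge
            start, end, count = min(start, s), max(end, e), count + c
        else:
            kept.append((s, e, c))
    kept.append((start, end, count))
    return sorted(kept)


def sessionize(event_times: Iterable[int], gap: int) -> List[Session]:
    """Fold events into sessions in whatever order they arrive."""
    sessions: List[Session] = []
    for t in event_times:
        sessions = add_event(sessions, t, gap)
    return sessions


in_order = sessionize([1, 5, 12, 30, 35], gap=10)
out_of_order = sessionize([1, 30, 5, 12, 35], gap=10)
```

Because sessions are defined on the timestamps carried by the events, replaying historical data in a different arrival order produces the same windows.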
3. Durability: Exactly-Once Data Processing
Traditionally, streaming ~ lossy, approximate processing. This is no longer true. Forget the 'lambda architecture'.
• Input records are durably stored and indexed in logs (e.g., Kafka)
• Systems handle state snapshotting & transactions with external stores transparently
• Idempotent and transactional writes to external stores
e.g., on Flink each stream computation (part 1, part 2, part 3, part 4) either completes or repeats
[Figure: rollback of input streams and application states in the stream processor]
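The combination of a replayable input log, state snapshots, rollback and idempotent writes can be sketched as a toy word counter. This is a deliberately simplified, single-process illustration and not how Flink or any other system implements its snapshots. `CountingJob`, `sink` and the sample `log` are hypothetical names.

```python
import copy
from typing import Dict, List, Tuple


class CountingJob:
    """Toy job: counts words read from an indexed, replayable input log."""

    def __init__(self, log: List[str], sink: Dict[str, int]) -> None:
        self.log = log            # stand-in for a durable log such as a Kafka partition
        self.sink = sink          # stand-in for an external key-value store
        self.offset = 0           # position in the input log
        self.counts: Dict[str, int] = {}
        self.snapshot: Tuple[int, Dict[str, int]] = (0, {})

    def process(self, n: int) -> None:
        """Process up to n records starting at the current offset."""
        for word in self.log[self.offset:self.offset + n]:
            self.counts[word] = self.counts.get(word, 0) + 1
            self.sink[word] = self.counts[word]  # keyed upsert: idempotent on replay
            self.offset += 1

    def checkpoint(self) -> None:
        """Snapshot the input position together with the application state."""
        self.snapshot = (self.offset, copy.deepcopy(self.counts))

    def rollback(self) -> None:
        """After a failure, restore both the offset and the state."""
        self.offset, counts = self.snapshot
        self.counts = copy.deepcopy(counts)


log = ["a", "b", "a", "c", "a"]
sink: Dict[str, int] = {}
job = CountingJob(log, sink)
job.process(2)
job.checkpoint()   # snapshot at offset 2
job.process(2)     # work that is lost in the simulated failure
job.rollback()     # back to offset 2 with the matching state
job.process(3)     # replay the remaining records from the log
```

Because the offset and the counts are restored together and the sink writes are keyed upserts, the final result equals that of a failure-free run even though two records were processed twice.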
4. Interactivity: Querying Data Processing State
Stream Processor ~ Inverse DBMS
Application state holds fresh knowledge we want to query:
• In some systems (e.g., Kafka Streams) we can use the changelog
• In other systems (e.g., Flink) we can query the state externally
…or stream queries on a custom query processor on top of them*
[Figure: Alice asks "Bob?" and receives "Bob=…"]
*https://techblog.king.com/rbea-scalable-real-time-analytics-king/
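The changelog route to interactive queries can be sketched as replaying a stream of state updates into a local view and then looking up a key. This is an assumption-laden illustration: `Change`, `materialize` and the sample `changelog` are hypothetical, and no real Kafka Streams or Flink API is used.

```python
from typing import Dict, Iterable, Optional, Tuple

Change = Tuple[str, Optional[int]]  # (key, new value); None marks a deletion


def materialize(changelog: Iterable[Change]) -> Dict[str, int]:
    """Replay state updates in order to rebuild the latest value per key."""
    view: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            view.pop(key, None)  # deletion marker removes the key
        else:
            view[key] = value    # later updates overwrite earlier ones
    return view


# Hypothetical changelog emitted by a stateful operator.
changelog = [("alice", 1), ("bob", 7), ("alice", 2), ("carol", 4), ("carol", None)]
view = materialize(changelog)
bob = view.get("bob")  # a point query against the materialized state
```

A client that tails the changelog keeps such a view fresh without touching the running job, which is one way to read the "inverse DBMS" analogy.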
4 Aspects of Data Processing
1. Speed
• no physical blocking/staging
• no rescheduling
• efficient pipelining
• copy-on-write data structures
2. History
• different notions of time
• flexible stateful processing
• high throughput
3. Durability
• durable input logging is a standard
• automated state management
• exactly-once processing
• output commit & idempotency
4. Interactivity
• external access to state/changelogs
• ability to 'stream queries' over state
@SenorCarbone
Try out Stream Processing
https://flink.apache.org/
https://kafka.apache.org/
https://beam.apache.org/