Building Distributed Data Streaming System
Ashish Tadose
Lead Software Engineer, Big Data Analytics
Agenda
• What is stream processing
• Streaming architecture
• Scalable Data Ingestion
• Real-time stream processing system
What is Stream Processing?
• Reactive Programming
• Streaming
• Server-Sent Events
• Change Data Capture
• Event Sourcing
• Complex Event Processing
In simple words, Streaming is…
Processing events in the order they occur
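As a rough illustration of processing events in the order they occur, the sketch below consumes events one at a time and emits an updated result after each one, rather than waiting for a complete dataset. The event fields (`user`, `amount`) and the function name `running_totals` are made up for this example.

```python
from collections import defaultdict

def running_totals(events):
    """Yield (user, running total) after each event, in arrival order."""
    totals = defaultdict(int)
    for event in events:  # events may be an unbounded iterator
        totals[event["user"]] += event["amount"]
        yield event["user"], totals[event["user"]]

# Hypothetical sample events, listed in the order they occurred.
sample_events = [
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 2},
    {"user": "a", "amount": 3},
]

results = list(running_totals(sample_events))
```

Because `running_totals` is a generator, a result is available as soon as each event has been handled, which is the property that separates this style from a batch job.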
Batch & Stream Processing
Batch processing: Data Generator → Ingestion → Distributed File System → Processing → Data Store
Stream data processing: Data Generator → Ingestion → Message Queue → Processing → Data Store
Lambda Architecture: Velocity & Volume
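One common reading of the Lambda Architecture is that a batch layer handles volume, a speed layer handles velocity, and a serving step merges the two at query time. The sketch below is a toy, in-memory version of that idea. The page-view events and the function names (`batch_view`, `speed_view`, `query`) are assumptions for illustration, not part of the original deck.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute page counts from the full stored dataset."""
    return Counter(event["page"] for event in master_dataset)

def speed_view(recent_events):
    """Speed layer: count only events the batch layer has not absorbed yet."""
    view = Counter()
    for event in recent_events:
        view[event["page"]] += 1
    return view

def query(page, batch, speed):
    """Serving step: merge both views to answer a query."""
    return batch[page] + speed[page]

# Hypothetical data: events already in the file system vs. events still in flight.
master_dataset = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
recent_events = [{"page": "/home"}, {"page": "/checkout"}]

batch = batch_view(master_dataset)
speed = speed_view(recent_events)
home_views = query("/home", batch, speed)
```

In a real deployment the batch view would be rebuilt periodically and the speed view discarded for the period the new batch view covers. That bookkeeping is left out here.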
Streaming Ingestion Technologies
Ingestion Ecosystem
• Sources
  • Machine data
  • External stream & syslogs
• Data Collection
  • Flume
  • Kafka
  • Kinesis
  • Confluent
Flume
• Easier to set up
• Rich set of built-in tools
• No inherent support for data replication
• Nodes work in isolation
• Memory channel vs. file channel
Kinesis
Kafka
• http://kafka.apache.org/
• Originated at LinkedIn, open sourced in early 2011
• Implemented in Scala, with some Java
• 9 core committers, plus ~20 contributors
Why is Kafka so fast?
• Fast writes:
  • While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.
• Fast reads:
  • Very efficient to transfer data from the page cache to a network socket
  • Linux: sendfile() system call
• Combination of the two = fast Kafka!
  • Example (Operations): On a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as they will be serving data entirely from cache.
http://kafka.apache.org/documentation.html#persistence
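To give a feel for the read path described above, here is a minimal sketch that hands a file to a socket with Python's `socket.sendfile()`, which uses `os.sendfile()` where the platform provides it and falls back to plain `send()` otherwise. This is not Kafka's own code. The file contents, the `send_log_segment` name, and the use of a local `socketpair()` in place of a real consumer connection are all assumptions for the demo.

```python
import os
import socket
import tempfile

def send_log_segment(path, conn):
    """Send a whole file over a connected socket; return the bytes sent."""
    with open(path, "rb") as segment:
        # The kernel can copy from the page cache to the socket directly,
        # without the data passing through a user-space buffer.
        return conn.sendfile(segment)

def demo():
    payload = b"event-1\nevent-2\nevent-3\n"
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        path = f.name
    # socketpair() stands in for a broker-to-consumer TCP connection.
    broker_side, consumer_side = socket.socketpair()
    try:
        sent = send_log_segment(path, broker_side)
        broker_side.close()  # signal end of stream to the reader
        received = b""
        while True:
            chunk = consumer_side.recv(4096)
            if not chunk:
                break
            received += chunk
        return sent, received
    finally:
        broker_side.close()
        consumer_side.close()
        os.unlink(path)

sent, received = demo()
```

The file was just written, so its contents are normally still in the page cache and the transfer does not need a disk read. This is the same effect the operations example on the slide describes for caught-up consumers.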
Flafka – Flume meets Kafka
Confluent - Centralized Ingestion with Kafka Pipeline
Stream Processing
Real-Time Stream Processing
• Processing system
  • Apache Storm
  • Apache Samza
  • Apache Spark (Streaming)
  • Project Apex - DataTorrent
• Storage
  • Hive, HDFS
  • HBase
  • MySQL
  • Custom
• Access
  • Depends on the data storage
  • Scalable query interface - Kafka
Streaming Design Patterns
• Micro batching
• Unpredictable incoming data
• Creating multiple streams
• Out of sequence events
• Stream joins
• Top N metrics
• External lookup
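Two of the patterns above, micro batching and Top N metrics, can be sketched together. The example below groups a stream into small fixed-size batches and reports the most frequent keys after each batch. Production systems typically cut batches by time interval rather than by count; count-based batching is a simplification here. The click data and the function names are hypothetical.

```python
from collections import Counter
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an event iterator into lists of at most batch_size events."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def top_n_per_batch(stream, batch_size, n):
    """Yield the n most frequent keys seen so far after each micro-batch."""
    counts = Counter()
    for batch in micro_batches(stream, batch_size):
        counts.update(batch)  # one state update per batch, not per event
        yield counts.most_common(n)

# Hypothetical click stream, in arrival order.
clicks = ["home", "cart", "home", "search", "home", "cart", "home"]
snapshots = list(top_n_per_batch(clicks, batch_size=3, n=2))
```

Updating state once per batch trades a little latency for fewer, larger updates, which is the usual motivation for micro batching.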
Thank You