Building Distributed Data Streaming System

Ashish Tadose

Lead Software Engineer, Big Data Analytics

Agenda

• What is stream processing

• Streaming architecture

• Scalable Data Ingestion

• Real-time stream processing system

What is Stream Processing?

• Reactive Programming
• Streaming
• Server-Sent Events
• Change Data Capture
• Event Sourcing
• Complex Event Processing

In simple words, Streaming is…

Processing events in the order they occur

Batch & Streaming processing

Batch processing: Data Generator → Ingestion → Distributed File System → Processing → Data Store

Stream data processing: Data Generator → Ingestion → Message Queue → Processing → Data Store
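
As a rough illustration of the difference between the two pipelines, the sketch below computes the same per-user count two ways: once over a complete, stored dataset (batch) and once incrementally as each event arrives (stream). The event shape and the function names are hypothetical, chosen only for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape for this sketch: (event_time, user)
Event = Tuple[int, str]


def batch_counts(stored_events: Iterable[Event]) -> Dict[str, int]:
    """Batch style: the full dataset is already at rest, so scan it once."""
    return dict(Counter(user for _, user in stored_events))


def stream_counts(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Stream style: handle events one by one, in the order they occur,
    and emit an updated result after every event."""
    counts: Counter = Counter()
    for _, user in events:
        counts[user] += 1
        yield dict(counts)


events = [(1, "alice"), (2, "bob"), (3, "alice")]
batch_result = batch_counts(events)
stream_results = list(stream_counts(events))
```

The batch function produces one answer after all data is available; the streaming function produces a fresh answer per event, and its last answer matches the batch result.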

Data Generator → Ingestion, which feeds two paths:

• Message Queue → Processing → Data Store

• Distributed File System → Processing → Data Store (batch processing)

Ashish Tadose

Stream data processing is inherently costlier than batch processing, so make sure that only the required fields are sent to the stream processing platform.

Ashish Tadose

Data ingestion should be able to fork a second, smaller stream (only the required fields) off the existing data flow and route that stream to a different destination (a message buffer/queue).
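
The forking idea in the notes above can be sketched as follows. This is a minimal, in-process illustration, not a real ingestion tool: a plain list stands in for the batch destination, `queue.Queue` stands in for the message buffer, and the field names in `STREAM_FIELDS` are hypothetical.

```python
from queue import Queue
from typing import Any, Dict, Iterable, List

# Hypothetical list of the only fields the streaming job needs.
STREAM_FIELDS = ("user_id", "event_type", "timestamp")


def project(event: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Keep only the requested fields that are present in the event."""
    return {name: event[name] for name in fields if name in event}


def ingest(
    events: Iterable[Dict[str, Any]],
    batch_sink: List[Dict[str, Any]],
    stream_queue: Queue,
) -> None:
    """Send the full record down the batch path and a slimmed-down
    copy to the message buffer used by stream processing."""
    for event in events:
        batch_sink.append(event)                          # full record
        stream_queue.put(project(event, STREAM_FIELDS))   # required fields only


raw_events = [
    {
        "user_id": 7,
        "event_type": "click",
        "timestamp": 1000,
        "user_agent": "Mozilla/5.0",
        "payload": "x" * 100,
    }
]
batch_sink: List[Dict[str, Any]] = []
stream_queue: Queue = Queue()
ingest(raw_events, batch_sink, stream_queue)
slim_event = stream_queue.get_nowait()
```

The batch path still sees every field, while the streaming path carries a much smaller record.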

Lambda Architecture: Velocity & Volume

(Same pipeline diagram: ingestion feeds both the message queue for stream processing and the distributed file system for batch processing.)
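
One common reading of the Lambda Architecture is that a query merges a batch view, which is complete but stale, with a real-time view covering only the events that arrived since the last batch run. The following is a minimal sketch of that merge step, assuming simple per-key counts; the view contents and the `merged_count` function are hypothetical and not taken from the slides.

```python
from typing import Dict


def merged_count(
    key: str,
    batch_view: Dict[str, int],
    realtime_view: Dict[str, int],
) -> int:
    """Answer a query by adding the batch total (volume) to the
    increment accumulated by the streaming path (velocity)."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Batch view: recomputed periodically from the distributed file system.
batch_view = {"page_a": 1000, "page_b": 250}
# Real-time view: counts for events seen since the last batch run.
realtime_view = {"page_a": 12, "page_c": 3}

page_a_total = merged_count("page_a", batch_view, realtime_view)
```

When the next batch run completes, the real-time view for the covered period would be discarded, since the batch view then includes those events.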

Streaming Ingestion Technologies

Ingestion Ecosystem

• Sources
  • Machine data
  • External streams & syslogs

• Data Collection
  • Flume
  • Kafka
  • Kinesis
  • Confluent

Flume

• Easier to set up
• Rich set of built-in tools
• No inherent support for data replication
• Nodes work in isolation
• Memory channel vs. file channel

Kinesis

Kafka

• Originated at LinkedIn, open sourced in early 2011
• Implemented in Scala, some Java
• 9 core committers, plus ~20 contributors

http://kafka.apache.org/

Why is Kafka so fast?

• Fast writes:
  • While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.
• Fast reads:
  • Very efficient to transfer data from page cache to a network socket
  • Linux: sendfile() system call
• Combination of the two = fast Kafka!
  • Example (Operations): On a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks as they will be serving data entirely from cache.

http://kafka.apache.org/documentation.html#persistence
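
To make the fast-read path more concrete, here is a small sketch of the same idea in Python. It is not Kafka code: it only shows a file being handed to a socket through `socket.sendfile()`, which uses the operating system's `sendfile()` call where available and falls back to ordinary `send()` elsewhere. The temporary file stands in for a log segment and the local socket pair stands in for a broker-to-consumer connection; both are assumptions made for the demo.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send a whole file over a connected stream socket.

    Where the platform supports os.sendfile(), the kernel moves bytes
    from the page cache to the socket without copying them through
    user space. Returns the number of bytes sent.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


# Stand-in for a log segment on disk.
payload = b"log-segment-bytes" * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    segment_path = tmp.name

# Stand-in for a broker/consumer connection.
sender, receiver = socket.socketpair()
try:
    sent = send_file_zero_copy(segment_path, sender)
    sender.close()  # signal end of stream to the receiver
    received = b""
    while chunk := receiver.recv(4096):
        received += chunk
finally:
    sender.close()
    receiver.close()
    os.unlink(segment_path)
```

If the file was written or read recently, its contents are likely still in the page cache, so a transfer like this may not touch the disk at all, which is the effect described in the operations example above.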


Flafka – Flume meets Kafka

Confluent - Centralized Ingestion with Kafka Pipeline

Stream Processing

Real-Time Stream Processing

• Processing system
  • Apache Storm
  • Apache Samza
  • Apache Spark (Streaming)
  • Project Apex - DataTorrent

• Storage
  • Hive / HDFS
  • HBase
  • MySQL
  • Custom

• Access
  • Depends on data storage
  • Scalable query interface - Kafka

Streaming Design Patterns

• Micro batching
• Unpredictable incoming data
• Creating multiple streams
• Out of sequence events
• Stream joins
• Top N metrics
• External lookup
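
Two of the patterns listed above, micro batching and Top N metrics, can be combined in a short sketch. It is a simplification under stated assumptions: batches are cut by event count rather than by a time interval (as streaming engines typically do), everything runs in one process, and the names `micro_batches` and `running_top_n` are hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an unbounded event stream into small fixed-size batches."""
    iterator = iter(events)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def running_top_n(
    events: Iterable[str], batch_size: int, n: int
) -> Iterator[List[Tuple[str, int]]]:
    """After each micro batch, emit the N most frequent keys seen so far."""
    totals: Counter = Counter()
    for batch in micro_batches(events, batch_size):
        totals.update(batch)
        yield totals.most_common(n)


clicks = ["a", "b", "a", "c", "a", "b", "c", "c", "c"]
snapshots = list(running_top_n(clicks, batch_size=3, n=2))
```

Each snapshot reflects all events up to the end of its batch, so the ranking can change from one batch to the next as new keys overtake old ones.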

Thank You
