Building Distributed Data Streaming System

Ashish Tadose, Lead Software Engineer, Big Data Analytics

Transcript of Building Distributed Data Streaming System

Page 1: Building Distributed Data Streaming System

Proprietary and Confidential

Building Distributed Data Streaming System

Ashish Tadose

Lead Software Engineer, Big Data Analytics

Page 2: Building Distributed Data Streaming System

Agenda

• What is stream processing

• Streaming architecture

• Scalable Data Ingestion

• Real-time stream processing system

Page 3: Building Distributed Data Streaming System

What is Stream Processing?

• Reactive Programming

• Streaming

• Server-Sent Events

• Change Data Capture

• Event Sourcing

• Complex Event Processing

Page 4: Building Distributed Data Streaming System

In simple words, Streaming is…

Processing events in the order they occur
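
A minimal sketch of that idea, assuming events arrive as plain strings and that the result of interest is a running count per event type (the `running_counts` name and the sample events are illustrative, not from the talk). Each event is handled as soon as it arrives, in arrival order, and an updated result is emitted immediately instead of waiting for the full data set:

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def running_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Consume events one at a time, in arrival order.

    After each event, yield the event together with the number of times
    it has been seen so far. `events` may be an unbounded iterator.
    """
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        yield event, counts[event]


if __name__ == "__main__":
    for event, seen in running_counts(["click", "view", "click"]):
        print(event, seen)
```

Because the function is a generator, it never needs the whole input in memory, which is the property that separates it from a batch job over a stored file.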

Page 5: Building Distributed Data Streaming System

Batch & Streaming processing

Batch processing: Data Generator → Ingestion → Distributed File System → Processing → Data Store

Stream data processing: Data Generator → Ingestion → Message Queue → Processing → Data Store

Page 6: Building Distributed Data Streaming System

Batch & Streaming processing

Stream data processing: Data Generator → Ingestion → Message Queue → Processing → Data Store

Batch processing: Distributed File System → Processing → Data Store

Note (Ashish Tadose): Stream data processing is inherently costlier than batch processing, so make sure that only the required fields reach the stream processing platform.

Note (Ashish Tadose): Data ingestion should be able to fork a second, smaller data stream (only the required fields) off the existing data flow and pass that stream to a different destination (a message buffer/queue).
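
The forking described in these notes could be sketched as follows. This is only an illustration under stated assumptions: records are Python dicts, the message buffer is an in-process `queue.Queue` standing in for a real message queue, and `fork_stream` and the field names are hypothetical.

```python
import queue
from typing import Any, Dict, Iterable, Iterator, Sequence

Record = Dict[str, Any]


def fork_stream(
    records: Iterable[Record],
    required_fields: Sequence[str],
    side_queue: "queue.Queue[Record]",
) -> Iterator[Record]:
    """Pass every full record downstream unchanged (the batch path) and
    publish a slimmed-down copy to `side_queue` (the streaming path)."""
    for record in records:
        # Keep only the fields the stream processor actually needs.
        slim = {f: record[f] for f in required_fields if f in record}
        side_queue.put(slim)
        yield record


if __name__ == "__main__":
    buffer: "queue.Queue[Record]" = queue.Queue()
    source = [{"user": "a", "url": "/x", "payload": "large blob"}]
    for full in fork_stream(source, ["user", "url"], buffer):
        print("batch path:", full)
    print("stream path:", buffer.get_nowait())
```

The full record still reaches the batch path, so nothing is lost; only the copy headed for the costlier streaming path is trimmed.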
Page 7: Building Distributed Data Streaming System

Batch & Streaming processing

Stream data processing: Data Generator → Ingestion → Message Queue → Processing → Data Store

Batch processing: Distributed File System → Processing → Data Store

Page 8: Building Distributed Data Streaming System

Lambda Architecture: Velocity & Volume
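
The slide's diagram is not part of this transcript. As a rough sketch of the pattern as it is commonly described (not necessarily what the slide showed), a batch layer precomputes a view over the full historical volume, a speed layer keeps a view of recent high-velocity events, and queries merge the two. The count-based views and the `merged_view` name are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Mapping


def merged_view(
    batch_view: Mapping[str, int], speed_view: Mapping[str, int]
) -> Dict[str, int]:
    """Combine per-key counts from the batch layer (complete but stale)
    with counts from the speed layer (recent events only)."""
    merged: Counter = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts key by key
    return dict(merged)


if __name__ == "__main__":
    batch = {"page_a": 10, "page_b": 5}   # computed by the last batch run
    speed = {"page_a": 2, "page_c": 1}    # events since that run
    print(merged_view(batch, speed))
```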

Page 9: Building Distributed Data Streaming System

Streaming Ingestion Technologies

Page 10: Building Distributed Data Streaming System

Ingestion Ecosystem

• Sources
  • Machine data
  • External stream & syslogs

• Data Collection
  • Flume
  • Kafka
  • Kinesis
  • Confluent

Page 11: Building Distributed Data Streaming System

Flume

• Easier to set up
• Rich set of built-in tools
• No inherent support for data replication
• Nodes work in isolation
• Memory channel vs. file channel

Page 12: Building Distributed Data Streaming System

Kinesis

Page 13: Building Distributed Data Streaming System

Kafka

• http://kafka.apache.org/
• Originated at LinkedIn, open sourced in early 2011
• Implemented in Scala, some Java
• 9 core committers, plus ~20 contributors

Page 14: Building Distributed Data Streaming System

Why is Kafka so fast?

• Fast writes:
  • While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.

• Fast reads:
  • Very efficient to transfer data from page cache to a network socket
  • Linux: sendfile() system call

• Combination of the two = fast Kafka!
  • Example (Operations): On a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks as they will be serving data entirely from cache.

http://kafka.apache.org/documentation.html#persistence
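
To make the read path concrete, here is a small sketch of sending a file to a socket through the operating system's sendfile mechanism. It is not Kafka code: it uses Python's standard `socket.sendfile()`, which relies on `os.sendfile` where the platform provides it and falls back to ordinary `send()` otherwise. The `send_file_zero_copy` and `demo` names, and the local socket pair standing in for a consumer connection, are assumptions made for illustration.

```python
import os
import socket
import tempfile
from typing import Tuple


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send the whole file at `path` over `sock` and return bytes sent.

    With os.sendfile available, the kernel moves data from the page cache
    to the socket without copying it through this process's buffers.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo() -> Tuple[int, bytes]:
    """Write a small file, send it across a local socket pair, read it back."""
    sender, receiver = socket.socketpair()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello kafka")
        path = tmp.name
    try:
        sent = send_file_zero_copy(path, sender)
        sender.close()
        received = receiver.recv(1024)
    finally:
        receiver.close()
        os.remove(path)
    return sent, received


if __name__ == "__main__":
    print(demo())
```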

Page 15: Building Distributed Data Streaming System

Flafka – Flume meets Kafka

Page 16: Building Distributed Data Streaming System

Confluent - Centralized Ingestion with Kafka Pipeline

Page 17: Building Distributed Data Streaming System

Stream Processing

Page 18: Building Distributed Data Streaming System

Real-Time Stream Processing

• Processing system
  • Apache Storm
  • Apache Samza
  • Apache Spark (Streaming)
  • Project Apex - DataTorrent

• Storage
  • Hive, HDFS
  • HBase
  • MySQL
  • Custom

• Access
  • Depends on data storage
  • Scalable query interface - Kafka

Page 19: Building Distributed Data Streaming System

Streaming Design Patterns

• Micro batching
• Unpredictable incoming data
• Creating multiple streams
• Out of sequence events
• Stream joins
• Top N metrics
• External Lookup
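
Two of the patterns listed above, micro batching and Top N metrics, can be sketched together. This is a simplified illustration, not taken from the talk: batches are cut by event count, whereas micro-batching engines typically cut them by time interval, and the `micro_batches` and `top_n_per_batch` names are hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an (possibly unbounded) event stream into small lists of at
    most `batch_size` events, preserving arrival order."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def top_n_per_batch(
    events: Iterable[str], batch_size: int, n: int
) -> Iterator[List[Tuple[str, int]]]:
    """For each micro-batch, yield its `n` most frequent events with counts."""
    for batch in micro_batches(events, batch_size):
        yield Counter(batch).most_common(n)


if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "c", "c"]
    for top in top_n_per_batch(stream, batch_size=3, n=1):
        print(top)
```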

Page 20: Building Distributed Data Streaming System

Thank You
