Streaming, Database & Distributed Systems Bridging the Divide

Ben Stopford (@benstopford)

Codemesh 2016

Event Driven Systems

Most stateful systems have to pull from these three worlds

Today we have 2 goals

1.  Understand Stateful Stream Processing (now & near future)

2.  Make the case for SSP as a general framework for building data-centric systems.

Data systems come in different forms

•  Database (OLTP)

•  Analytics Database (OLAP/Hadoop)

•  Messaging

•  Distributed log

•  Stream Processing

•  Stateful Stream Processing

Database (OLTP)

Focuses on providing a consistent view that supports updates and queries on individual tuples.

Analytics Database (OLAP/Hadoop)

1.  Focuses on aggregations via table scans.

2.  Executes as a distributed system

Messaging

Focuses on asynchronous information transfer with limited state

Distributed Log

1.  Similar to messaging, but data can be retained

2.  Executes as a distributed system (scale + fault tolerance)

Stream Processing

Manipulate concurrent streams of events

Comes from CEP background (ephemeral)

Stateful Stream Processing

Moves stream processing to be a more general framework for building data-centric systems.

What is stream processing?

Data, Index, Query Engine

Database: finite source

Stream Processor: infinite source

Infinite streams need windows

How many items will we bring into the machine at one time?

Windows bound a computation

Buffering allows us to handle late events

Some query, over some time window, emitting at some frequency

Continually executing query

Stream(s)

Stream Processing Engine

Derived Stream

Avg(p.time – o.time) From orders, payment Group by payment.region over 1 day window emitting every second
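As a rough illustration of the query above, here is a minimal Python sketch. The function name, the tuple layouts and the use of plain seconds for time are assumptions made for this example; a real engine would evaluate the query incrementally rather than rescanning its inputs. "Emitting every second" would correspond to calling the function once per second with the current time.

```python
from collections import defaultdict

DAY_SECONDS = 24 * 60 * 60


def completion_time_by_region(orders, payments, now, window=DAY_SECONDS):
    """Average (payment time - order time) per region, for payments in the window.

    orders:   dict of order_id -> order time in seconds (hypothetical layout)
    payments: list of (order_id, region, payment time) tuples (hypothetical layout)
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for order_id, region, paid_at in payments:
        in_window = now - window < paid_at <= now
        # Join payment to order on order_id; skip payments with no known order.
        if in_window and order_id in orders:
            totals[region] += paid_at - orders[order_id]
            counts[region] += 1
    return {region: totals[region] / counts[region] for region in totals}
```

Shrinking `window` bounds how much data the computation has to hold at one time, which is the point of windowing an infinite stream.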

Stream Processing

orders

payments

Completion time, by region

Avg(p.time – o.time) From orders, payment Group by payment.region over 1 day window emitting every second

Materialised View (DB)

orders

payments

Completion time, by region

Avg(p.time – o.time) From orders, payment, user Group by user.region over 1 day window emitting every second

Stateful Stream Processing

Streams

Stream Processing Engine

Derived Stream

Derived “Table”

“View” is output as table or stream

Table == Stream + Window (0 -> N)

Table is a stream with an infinite window (i.e. buffer from 0 -> now)
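One way to picture this: folding every event from offset 0 into a dictionary yields the table. The sketch below is illustrative only; `materialise` and the `(key, value)` event layout are names assumed for this example.

```python
def materialise(stream, start=0):
    """Fold a stream of (key, value) events into a table of latest values.

    With start=0 the window covers the whole stream (0 -> now), so the
    result is the full table. A later start gives only a partial view.
    """
    table = {}
    for key, value in stream[start:]:
        table[key] = value  # later events overwrite earlier ones for the same key
    return table
```

For example, `materialise([("a", 1), ("b", 2), ("a", 3)])` gives `{"a": 3, "b": 2}`.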

window

SSP is about creating materialised views.

Materialised as a table, or materialised as a stream

Features: similar to database query engine

Join, Filter, Aggregate, View, Windowed Streams

Can distribute over many machines in two dimensions

Scale Out, Scale Forward

Stateful Stream Processing engines typically use Kafka (a distributed commit log)

Kafka (a distributed log)

A log is a very simple idea

Messages are added at the end of the log

Just think of the log as a file

Old New

Readers have a position & scan

Sally is here

George is here

Fred is here

Old New

Scan Scan

Can “Rewind & Replay” the log
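The log and its readers can be sketched in a few lines of Python. This is a toy in-memory model, not Kafka's API; the `Log` and `Reader` classes are hypothetical names used only for illustration.

```python
class Log:
    """Append-only list of messages; a message's offset is its index."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset):
        return self._messages[offset:]


class Reader:
    """Each reader keeps its own position and scans forward from it."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self):
        batch = self.log.read(self.position)
        self.position += len(batch)
        return batch

    def rewind(self, offset=0):
        # Replay is just moving the position back; the log itself is unchanged.
        self.position = offset
```

Because readers only hold a position, many of them (Sally, George, Fred) can scan the same log independently, and any of them can rewind without affecting the others.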


Compacted Log (Tabular View)

STREAM (all versions)

COMPACTED STREAM (latest version of each key only)
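A minimal sketch of the idea behind compaction, assuming each record is a `(key, value)` pair. The `compact` function is a hypothetical helper for illustration; it does not describe how Kafka performs compaction internally.

```python
def compact(stream):
    """Keep only the most recent record for each key, preserving log order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(stream):
        latest_offset[key] = offset  # remember the last offset seen per key
    return [
        record
        for offset, record in enumerate(stream)
        if latest_offset[record[0]] == offset
    ]
```

The full stream still holds every version; the compacted stream is the tabular view with one row per key.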

The log is a Distributed System

For scalability and fault tolerance

Shard on the way in

Producers

Consumers

Each shard is a queue

Many consumers share partitions in one topic

Consumers share consumption of a single topic

The Log reassigns data on failure
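The sharding and reassignment described above can be sketched as follows. Both functions are illustrative assumptions: the CRC32 hash is a stand-in rather than Kafka's actual partitioner, and the round-robin `assign` is only one possible assignment strategy.

```python
import zlib


def partition_for(key, num_partitions):
    """Shard on the way in: a stable hash of the key picks the partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, consumers):
    """Spread partitions round-robin over the consumers currently alive."""
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment
```

On failure, reassignment is simply `assign` run again with the surviving consumers, so the dead consumer's partitions move to the ones that remain.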


Kafka supplies two levels of leader election

Replicas in Kafka have an elected leader

Consumers in Kafka have an elected leader

The log is important for SSP

Maintains History: Acts like a “push based” distributed file system

The log is important: Two Primitives

Stream

Compacted Stream (‘table’)

The Log is, to a streaming engine, what HDFS is to Hadoop

But it’s a bit more than an HDFS replacement: Processors inherit the idea of “membership” from the log

So stateful Stream Processors use the Log

Kafka (Distributed Log)

They also use local storage

(1) Kafka

(2) Local KV Store

Local KV store has a few uses

(1)  It caches streams on disk

(2)  It caches “tables” on disk

This makes join operations fast as they’re entirely local

Streams just cache recent messages to help with joins

Tables are fully “realised” locally

Kafka: stream (stream data), compacted stream (stream-tabular data)

Kafka Streams: infinite stream, locally cached table (disk resident)

e.g. Useful for Enrichment

Kafka: Orders (stream), Customers (compacted stream)

Kafka Streams: Local DB
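Enrichment can be sketched as a stream-table join against a locally held table. The code below is a simplified illustration, not the Kafka Streams API: `enrich_orders`, the dictionary-shaped orders and the `(customer_id, details)` changelog layout are all assumptions for this example.

```python
def enrich_orders(orders, customer_changelog):
    """Join each order with the latest known details of its customer."""
    # Realise the compacted "table" locally: latest value per customer id.
    customers = {}
    for customer_id, details in customer_changelog:
        customers[customer_id] = details

    # The join is a local lookup per order, with no remote call.
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is not None:  # inner join: drop orders with unknown customers
            yield {**order, "customer": customer}
```

Because the table lives next to the processor, each lookup is local, which is what makes the join fast.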

Aggregates need intermediary state

Kafka: Orders (stream), Customers (compacted stream)

Kafka Streams: Sum(orders) group by region

Persist current value, in case we fail

State store inherits durability from the log

State store flushes back to the log
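A sketch of how a state store might inherit durability from the log, under the assumption that every local write is also appended to a changelog. `StateStore` and `sum_orders_by_region` are hypothetical names, and a Python list stands in for a changelog topic.

```python
class StateStore:
    """Local key-value state whose writes are also appended to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # a list standing in for a log topic
        self.local = {}
        # Restore: replay the changelog to rebuild local state after a failure.
        for key, value in changelog:
            self.local[key] = value

    def put(self, key, value):
        self.local[key] = value
        self.changelog.append((key, value))  # flush back to the log

    def get(self, key, default=0):
        return self.local.get(key, default)


def sum_orders_by_region(store, orders):
    """Sum(orders) group by region, persisting the running value per region."""
    for region, amount in orders:
        store.put(region, store.get(region) + amount)
```

If the processor dies, a replacement built from the same changelog starts with the same running totals.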

Separate Data, Processing & View

Orders, Payments, View

Storage Layer (Kafka)

Processing & View

You can query the views from anywhere

So what happens on failure?

Clustering Reroutes Data to surviving node

Ownership of partitions is re-routed from dead node

But what about state?

“Cold” replica of state takes over

Primitives for sharding & replication

Orders Payments Stock

Redundant copies are cached on other nodes

Sharding spreads data over processors

So processors inherit much from the log

Clustering comes from the log

You just write the functional bit

General framework for distributed, realtime data computation

Protection from broker failure

Protection from engine failure

Join tables & streams (in process)

Event Driven

Create views which can be queried

But stream processing has a problem

Correctness Guarantees in multi-layer topologies

Duplicates are a side effect of all at-least-once delivery mechanisms

Data is rerouted, on failure, which can cause duplicates

Idempotence isn’t enough
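To see where duplicates come from, consider a processor that updates its state before committing its offset. The sketch below is a simplified model under that assumption; `run_processor` and its `crash_at` parameter are hypothetical and exist only to simulate a failure between processing and committing.

```python
def run_processor(log, committed_offset, totals, crash_at=None):
    """Process (region, amount) records from the committed offset onwards.

    Returns the last committed offset. If crash_at is set, the processor
    "dies" after applying that record but before committing its offset.
    """
    offset = committed_offset
    while offset < len(log):
        region, amount = log[offset]
        totals[region] = totals.get(region, 0) + amount  # side effect first
        if offset == crash_at:
            return committed_offset  # crash: this record's offset is never committed
        offset += 1
        committed_offset = offset  # commit only after processing
    return committed_offset
```

Restarting from the committed offset redelivers the record that was in flight, so a running sum counts it twice. That is at-least-once delivery in action.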

Filter

Distributed Snapshots* (transactions)

Transaction markers: [Begin], [Prepare], [Commit], [Abort]

Buffer
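One way to read the marker idea is that a reader buffers messages between a begin marker and its matching commit, and discards them on abort. The sketch below illustrates that reading only. It is not Kafka's protocol (which the slides describe as in development), and it treats [Prepare] as a no-op for simplicity.

```python
BEGIN, PREPARE, COMMIT, ABORT = "[Begin]", "[Prepare]", "[Commit]", "[Abort]"


def read_committed(log):
    """Return only messages that are outside a transaction or were committed."""
    visible, buffer, in_txn = [], [], False
    for entry in log:
        if entry == BEGIN:
            buffer, in_txn = [], True
        elif entry == PREPARE:
            continue  # assumption: no reader-side action is needed on prepare
        elif entry == COMMIT:
            visible.extend(buffer)  # release the buffered messages together
            buffer, in_txn = [], False
        elif entry == ABORT:
            buffer, in_txn = [], False  # discard everything in the transaction
        elif in_txn:
            buffer.append(entry)
        else:
            visible.append(entry)
    return visible
```

Messages from a transaction that is still open stay in the buffer and are not exposed downstream.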

Chandy, Lamport - Distributed Snapshots: Determining Global States of Distributed Systems

*In development in Kafka

So why use these tools?

(1) Streaming is a superset of batch

Databases look backwards

Batch == Streaming from offset 0
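A small sketch of this equivalence, assuming a log of numbers and a running total as the computation. `running_totals` is a hypothetical name used for illustration.

```python
def running_totals(log, offset=0, total=0):
    """Yield the running total after each record, starting at the given offset."""
    for value in log[offset:]:
        total += value
        yield total
```

Consuming from offset 0 to the end gives the batch answer as the final value, while resuming from a later offset with the carried-over total continues the same computation as new records arrive.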

MPP Batch System on a Distributed File System (HDFS)

MPP Streaming System on a Distributed Log (Kafka)

Streaming is the superset of batch

Streaming

Database

Global, Linearisable consistency model

(2) Separates store & view

“Engine” part is lightweight but stateful

Storage

Just a Java process which uses a library

Log handles fault tolerance of both layers

Separates Concerns of Model & View – Think MVC

Storage View & Controller

Physically Separates Read & Write – Think CQRS

Storage View & Controller

Database vs SSP

Database: Data, Index, Query Engine

Stateful Stream Processor: Index, Data

(3) Decentralised approaches are more general

Rather than pushing processing into an “appliance”

(code -> data)

Centralised Processing

Data Decentric Architecture

Distributed Log

Decentralised Processing over many user-specific views

This is more general than just analytics use cases

It’s more than taking a database and adding push notifications

Whether you’re building a hulking, multistage, analytic platform

Final View

Intermediary View (2)

Intermediary View (1)

Or a simple microservice that needs to run hot-hot & scale

Business Logic Manage local

Join various streams

Hot secondary instance

Composable Primitives

Declarative Function

Traditional DB

Work Distribution

Replication

Sharding

Query Engine

Distributed DB Distributed Systems

Membership

Global Consistency

General framework for distributed, event-driven data computation

Protection from broker failure

Protection from engine failure

Join tables & streams (in process)

Event Driven

Create views which can be queried

A framework for building streaming data systems, just for you

Find out more:

•  http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/

•  https://martin.kleppmann.com/2015/02/11/database-inside-out-at-salesforce.html

•  http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf

•  https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cidr07p42.pdf

•  http://highscalability.com/blog/2015/5/4/elements-of-scale-composing-and-scaling-data-platforms.html

•  https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams

•  https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html

•  https://www.madewithtea.com/processing-tweets-with-kafka-streams.html

•  http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/

•  http://www.slideshare.net/zacharycox/updating-materialized-views-and-caches-using-kafka

The end

@benstopford http://benstopford.com
