Realtime Distributed Analysis of Datastreams

Philipp Nolte – University of Passau – January 2014

description

A talk by Philipp Nolte from the advanced seminar "Personalisierung mit großen Daten" (Personalization with Big Data).

Transcript of Realtime Distributed Analysis of Datastreams

Page 1: Realtime Distributed Analysis of Datastreams

Realtime Distributed Analysis of Datastreams

Philipp Nolte – University of Passau – January 2014


Page 2: Realtime Distributed Analysis of Datastreams

Learn

Why we need fancy Big Data frameworks.

What the lambda architecture looks like.

How Twitter used to do real-time analytics.

Why Twitter created Storm.

How Storm works.


Page 3: Realtime Distributed Analysis of Datastreams

Limits

Imagine a traditional web analytics software:

Every page view increments the URL's database row.


Page 4: Realtime Distributed Analysis of Datastreams

First Aid

Queue your writes and write in batches.

Shard your data: Partition horizontally.
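A minimal sketch of both first-aid measures, assuming an in-memory stand-in for the database: page views are queued and written in aggregated batches, and each URL is assigned to a shard by a stable hash. The names `BatchedPageViewWriter` and `shard_for`, and the batch size, are hypothetical and chosen only for illustration.

```python
import zlib
from collections import Counter


def shard_for(url: str, num_shards: int) -> int:
    """Pick a shard with a stable hash (the built-in hash() is salted per process)."""
    return zlib.crc32(url.encode("utf-8")) % num_shards


class BatchedPageViewWriter:
    """Queues page views in memory and writes them to shards in batches."""

    def __init__(self, num_shards: int = 4, batch_size: int = 3) -> None:
        # Each Counter stands in for one database shard (hypothetical storage).
        self.shards = [Counter() for _ in range(num_shards)]
        self.batch_size = batch_size
        self.pending = []  # the write queue

    def record(self, url: str) -> None:
        self.pending.append(url)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # One aggregated increment per URL instead of one write per page view.
        for url, count in Counter(self.pending).items():
            self.shards[shard_for(url, len(self.shards))][url] += count
        self.pending.clear()

    def views(self, url: str) -> int:
        return self.shards[shard_for(url, len(self.shards))][url]
```

Views still sitting in `pending` are lost if the process crashes before `flush()`, which is one face of the fault-tolerance problem.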


Page 5: Realtime Distributed Analysis of Datastreams

Chronic Issues

Fault-tolerance is hard.

Applications become more and more complex.

You have to do all the work.


Page 6: Realtime Distributed Analysis of Datastreams

New Tools

Large scale computation systems such as Hadoop.

Scalable databases such as Cassandra and Riak.

Easy to use frameworks such as Storm and Dempsy.


Page 7: Realtime Distributed Analysis of Datastreams

Lambda Architecture

Speed Layer

Serving Layer

Batch Layer

Theoretical, abstract architecture for working with big data.


Page 8: Realtime Distributed Analysis of Datastreams

Goal

Compute arbitrary functions on arbitrary data.

query = function ( all data )
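Read literally, the equation says that a query is answered by running some function over the complete dataset. A small sketch of that idea, assuming page-view records with hypothetical `url` and `timestamp` fields:

```python
from collections import Counter
from typing import Callable, Iterable, NamedTuple


class PageView(NamedTuple):
    url: str
    timestamp: int  # hypothetical field, seconds since epoch


def run_query(function: Callable[[Iterable[PageView]], object],
              all_data: Iterable[PageView]) -> object:
    """query = function(all data): compute the answer from the full dataset."""
    return function(all_data)


def views_per_url(data: Iterable[PageView]) -> Counter:
    """One possible query function: count page views per URL."""
    return Counter(view.url for view in data)


master_dataset = [PageView("/home", 1), PageView("/about", 2), PageView("/home", 3)]
result = run_query(views_per_url, master_dataset)
```

Running every query over all data is too slow at scale, which is why the architecture precomputes views instead.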


Page 9: Realtime Distributed Analysis of Datastreams

Properties

Robust and fault-tolerant.

Low latency reads and updates.

Scalable.

Minimal maintenance.


Page 10: Realtime Distributed Analysis of Datastreams

Batch Layer

Stores the immutable master dataset.

Precomputes arbitrary batch views.

Home of batch processing and MapReduce systems such as Hadoop.
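The batch layer's two jobs can be sketched in a few lines, assuming an append-only in-memory list in place of a distributed file system and a single-process map and reduce step in place of Hadoop. All names here (`MasterDataset`, `precompute_batch_view`) are hypothetical.

```python
from collections import Counter


class MasterDataset:
    """Append-only store: records are added, never updated or deleted."""

    def __init__(self) -> None:
        self._records = []

    def append(self, record: tuple) -> None:
        self._records.append(record)

    def scan(self) -> tuple:
        # Hand out an immutable snapshot of everything stored so far.
        return tuple(self._records)


def map_phase(record: tuple):
    """Emit (key, value) pairs for one (url, timestamp) record."""
    url, _timestamp = record
    yield url, 1


def reduce_phase(pairs) -> dict:
    """Sum the values for each key."""
    view = Counter()
    for key, value in pairs:
        view[key] += value
    return dict(view)


def precompute_batch_view(dataset: MasterDataset) -> dict:
    """Recompute the whole view from scratch over all records."""
    pairs = (pair for record in dataset.scan() for pair in map_phase(record))
    return reduce_phase(pairs)
```

Because the view is always rebuilt from the immutable records, a buggy view can be thrown away and recomputed.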

[Diagram: Speed Layer, Serving Layer, and Batch Layer stack]


Page 11: Realtime Distributed Analysis of Datastreams

Serving Layer

Read-only random-access to batch views.

Updated by batch layer.

Indexes batch views.

Home of real-time query systems such as Cloudera Impala for Hadoop.
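A minimal sketch of these properties, assuming the batch view is a plain dictionary: the serving layer replaces its view wholesale whenever the batch layer delivers a new one, and readers get keyed random access but no way to write. The class name `ServingLayer` and its methods are hypothetical.

```python
from types import MappingProxyType


class ServingLayer:
    """Read-only, keyed access to the most recent batch view."""

    def __init__(self) -> None:
        self._view = MappingProxyType({})

    def load(self, batch_view: dict) -> None:
        # Called by the batch layer: swap in a complete new view.
        # Copying means later changes to batch_view cannot leak in.
        self._view = MappingProxyType(dict(batch_view))

    @property
    def view(self) -> MappingProxyType:
        return self._view

    def get(self, key: str, default: int = 0) -> int:
        # Random access by key; the dictionary acts as the index.
        return self._view.get(key, default)
```

Since there are no random writes, the hard problems of a mutable database (write conflicts, compaction) do not arise in this layer.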

[Diagram: Speed Layer, Serving Layer, and Batch Layer stack]


Page 12: Realtime Distributed Analysis of Datastreams

Speed Layer

Compensates for high-latency batch views.

Fast, incremental algorithms.

More complex because of random writes.

Home of Apache HBase or Storm.
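To illustrate what "incremental" means here, the sketch below contrasts recomputing a count from all events with updating a view in place as each event arrives. It is a single-process illustration with hypothetical names, not how HBase or Storm implement it.

```python
from collections import Counter


def recompute(events: list) -> Counter:
    """Batch style: rebuild the whole view from every event."""
    return Counter(events)


class IncrementalCounter:
    """Speed-layer style: mutate the view with one small write per event."""

    def __init__(self) -> None:
        self.view = Counter()

    def update(self, url: str) -> None:
        # A random write: any key may be touched at any time.
        self.view[url] += 1


events = ["/home", "/about", "/home"]
speed = IncrementalCounter()
for url in events:
    speed.update(url)
```

Both approaches end with the same counts; the incremental one answers sooner but has to manage mutable state.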

[Diagram: Speed Layer, Serving Layer, and Batch Layer stack]


Page 13: Realtime Distributed Analysis of Datastreams

Lambda Architecture

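[Diagram: incoming Data feeds both the Batch Layer and the Speed Layer. The Batch Layer produces Batch Views, which the Serving Layer exposes; the Speed Layer produces Realtime Views. A Query reads both Batch Views and Realtime Views.]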


Page 14: Realtime Distributed Analysis of Datastreams

Available Data

[Diagram: along the Time axis, the available data is covered by a Batch View for older data and a Realtime View for the most recent data.]

Discard a realtime view as soon as its data is represented in the batch view.
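One way to picture the hand-over between the two views is the sketch below. It assumes events are grouped into numbered time buckets and that the batch layer reports the newest bucket it has absorbed; both the bucketing scheme and the names (`RealtimeView`, `discard_up_to`, `query`) are assumptions made for illustration.

```python
from collections import Counter


class RealtimeView:
    """Counts per time bucket, so buckets covered by the batch view can be dropped."""

    def __init__(self) -> None:
        self.buckets = {}  # bucket number -> Counter of URLs

    def add(self, bucket: int, url: str) -> None:
        self.buckets.setdefault(bucket, Counter())[url] += 1

    def discard_up_to(self, batch_horizon: int) -> None:
        """Drop every bucket that the batch view already represents."""
        for bucket in [b for b in self.buckets if b <= batch_horizon]:
            del self.buckets[bucket]

    def count(self, url: str) -> int:
        return sum(counter[url] for counter in self.buckets.values())


def query(url: str, batch_view: dict, realtime_view: RealtimeView) -> int:
    """Answer from both views: old data from batch, recent data from realtime."""
    return batch_view.get(url, 0) + realtime_view.count(url)
```

The answer stays the same before and after a batch run, as long as the discarded buckets are exactly the ones the new batch view includes.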


Page 15: Realtime Distributed Analysis of Datastreams

Twitter’s Early Days

[Diagram: a hand-built pipeline in which Tweets pass through alternating layers of Queues and Workers; other labels: Map, URLs, Hadoop, Cassandra.]


Page 16: Realtime Distributed Analysis of Datastreams

Storm

Guaranteed message processing without message brokers.

Horizontal scalability.

Fault-tolerance.

High level of abstraction.

Just works.


Page 17: Realtime Distributed Analysis of Datastreams

Storm Topologies

[Diagram: two Spouts and four Bolts connected by Streams.]
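The shape of a topology can be imitated with plain Python generators: a spout is a source of tuples, a bolt consumes one stream and emits another. This is only a single-process analogy and does not use Storm's API; the function names and the URL-counting task are assumptions chosen to echo the tweet pipeline from the earlier slide.

```python
import re
from collections import Counter
from typing import Iterable, Iterator


def tweet_spout(tweets: Iterable[str]) -> Iterator[tuple]:
    """Spout: the source of a stream, emitting one tuple per tweet."""
    for text in tweets:
        yield (text,)


def extract_urls_bolt(stream: Iterable[tuple]) -> Iterator[tuple]:
    """Bolt: consumes tweet tuples and emits one tuple per URL found."""
    for (text,) in stream:
        for url in re.findall(r"https?://\S+", text):
            yield (url,)


def count_bolt(stream: Iterable[tuple]) -> Iterator[tuple]:
    """Bolt: keeps a running count and emits (url, count) after each update."""
    counts = Counter()
    for (url,) in stream:
        counts[url] += 1
        yield (url, counts[url])


def run_topology(tweets: Iterable[str]) -> dict:
    """Wire spout -> bolt -> bolt and collect the latest count per URL."""
    latest = {}
    for url, count in count_bolt(extract_urls_bolt(tweet_spout(tweets))):
        latest[url] = count
    return latest
```

In a real topology the components run continuously on different machines; here the stream simply ends when the input list does.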


Page 18: Realtime Distributed Analysis of Datastreams

Parallel Tasks

[Diagram: the same topology of two Spouts and four Bolts connected by Streams; each component runs as several parallel Tasks (T).]
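When a counting bolt runs as several tasks, tuples with the same key have to reach the same task, otherwise each task would hold only part of a count. The sketch below assumes hash-based routing on the URL to achieve that; the slide does not say how tuples are routed, so this is one plausible scheme with hypothetical names, and the tasks run one after another here rather than in parallel.

```python
import zlib
from collections import Counter


def task_for(key: str, num_tasks: int) -> int:
    """Route a key to a task with a stable hash, so equal keys always meet."""
    return zlib.crc32(key.encode("utf-8")) % num_tasks


class CountTask:
    """One task of a counting bolt, holding its own private state."""

    def __init__(self) -> None:
        self.counts = Counter()

    def execute(self, url: str) -> None:
        self.counts[url] += 1


def run_parallel_bolt(urls: list, num_tasks: int = 4) -> list:
    """Distribute tuples over the tasks of one bolt and return the tasks."""
    tasks = [CountTask() for _ in range(num_tasks)]
    for url in urls:
        tasks[task_for(url, num_tasks)].execute(url)
    return tasks
```

Because no two tasks share a key, the tasks need no locks or coordination, which is what lets the bolt scale horizontally.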


Page 19: Realtime Distributed Analysis of Datastreams

Demo

Storm in action


Page 20: Realtime Distributed Analysis of Datastreams

Know

Why we need fancy Big Data frameworks.

What the lambda architecture looks like.

How Twitter used to do real-time analytics.

Why Twitter created Storm.

How Storm works.


Page 21: Realtime Distributed Analysis of Datastreams

The End.

Questions?
