Insight DE project

YeezyScore A comparison of stream processing software By: Kat Chuang @katychuang


Page 1: Insight DE project

YeezyScore: A comparison of stream processing software

By: Kat Chuang

@katychuang

Page 2: Insight DE project

10 mins

Page 3: Insight DE project

High level overview

Batch

Streaming

Microbatching

                     Storm Trident     Spark Streaming
Released             2011              2010
Delivery semantics   Exactly once      Exactly once
State management     Yes               Yes
Latency              Seconds           Seconds
Output               MapState          Resilient Distributed Dataset (RDD)
Throughput           10k/node/sec?     400k/node/sec?

Page 4: Insight DE project

Test Cases Metrics

1. Does every message pass through the pipeline?

2. How long does each message take to process?

Data

1. Timestamps

Page 5: Insight DE project

Pipelines

[Diagram: in each pipeline, a message enters carrying Timestamp1 and leaves as the pair (Timestamp1, Timestamp2).]
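The timestamp pairs on this slide can be illustrated with a minimal sketch. It assumes each message is stamped once when it is produced (Timestamp1) and once when it leaves the pipeline (Timestamp2). The helper names and the tuple layout are hypothetical and are not taken from the project; the `process` argument stands in for whatever work a real Storm or Spark topology would do.

```python
import time

def stamp_on_entry(payload):
    """Attach Timestamp1 when the message is produced (hypothetical helper)."""
    return (payload, time.time())

def stamp_on_exit(message):
    """Attach Timestamp2 when the message leaves the pipeline."""
    payload, timestamp1 = message
    return (payload, timestamp1, time.time())

def run_pipeline(payloads, process=lambda p: p):
    """Stand-in for a real pipeline: stamp, process, then stamp again."""
    results = []
    for payload in payloads:
        payload, timestamp1 = stamp_on_entry(payload)
        processed = (process(payload), timestamp1)
        results.append(stamp_on_exit(processed))
    return results
```

Each result is a (payload, Timestamp1, Timestamp2) tuple, so both test cases can be evaluated from the output alone.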

Page 6: Insight DE project

1. Does every message pass through the pipeline?

This is a scatterplot
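One way to answer the first test case is to compare the IDs that were sent with the IDs that came out of the pipeline. The sketch below assumes every message carries a unique ID, which the slides do not state; the function name and the report fields are hypothetical. Under exactly-once delivery, both the missing list and the duplicated list should be empty.

```python
from collections import Counter

def delivery_report(sent_ids, received_ids):
    """Compare sent and received message IDs (hypothetical helper)."""
    counts = Counter(received_ids)      # how many times each ID arrived
    sent = set(sent_ids)
    received = set(counts)
    return {
        "sent": len(sent),
        "delivered": len(sent & received),
        "missing": sorted(sent - received),                         # lost messages
        "duplicated": sorted(i for i, n in counts.items() if n > 1),  # arrived more than once
    }
```

For example, `delivery_report([1, 2, 3, 4], [1, 2, 2, 4])` reports message 3 as missing and message 2 as duplicated.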

Page 7: Insight DE project

2. How long does each message take to process?

This is a scatterplot
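The second test case reduces to the difference Timestamp2 - Timestamp1 for each message. The sketch below summarizes those differences; it assumes the pairs are in seconds and uses a nearest-rank 95th percentile, which is a choice made here for illustration and not a statistic named in the slides. The function name is hypothetical.

```python
import math
import statistics

def latency_summary(pairs):
    """Summarize per-message latency from (timestamp1, timestamp2) pairs."""
    latencies = sorted(t2 - t1 for t1, t2 in pairs)
    if not latencies:
        raise ValueError("no timestamp pairs to summarize")
    # Nearest-rank percentile: smallest value with at least 95% of data at or below it.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "count": len(latencies),
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "p95": latencies[p95_index],
        "max": latencies[-1],
    }
```

Plotting each latency against Timestamp1 would give a scatterplot of the kind shown on this slide.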

Page 8: Insight DE project

Storm Trident vs. Spark Streaming

Storm Trident

Stream processing framework that also does micro-batching.

Great for transforming or computing as data flows in.

Complex event processing (CEP), continuous computation.

Task-parallel computations, e.g. reading Twitter streams.

Spark Streaming

Batch processing framework that also does micro-batching.

Great for combining with historical data.

ML algos included. Requires HDFS-backed data source.

Data-parallel computations, e.g. offering recommendations.
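Both frameworks are described above as doing micro-batching. As a rough illustration of the idea only, the sketch below groups timestamped events into fixed-length windows and hands each window over as one small batch. It is plain Python and does not use or reproduce the Storm Trident or Spark Streaming APIs; the function name and the (timestamp, value) layout are assumptions.

```python
from collections import defaultdict

def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-length windows.

    Returns a list of (window_start, values) sorted by window_start.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event in [window_start, window_start + interval) shares a batch.
        window_start = (timestamp // interval) * interval
        batches[window_start].append(value)
    return sorted(batches.items())
```

With a one-second interval, events at 0.2 s and 0.9 s land in the same batch while an event at 1.1 s starts the next one. Because an event is only processed once its window is handed over, the batch interval puts a floor under per-message latency.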

Page 9: Insight DE project

Kat Chuang

Data Engineering Fellow

#DE-2015c

[email protected]: katychuang

Twitter: katychuang

IG: katychuang.nyc