Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014
Transcript:
THE ABSTRACTION THAT POWERS BIG DATA
RAÚL CASTRO FERNÁNDEZ, COMPUTER SCIENCE PHD STUDENT, IMPERIAL COLLEGE
Dataflows: The Abstraction that Powers Big Data
Raul Castro Fernandez Imperial College London
[email protected] @raulcfernandez
“Big Data needs Democratization”

Democratization of Data

Developers and DBAs are no longer the only ones generating, processing and analyzing data: decision makers, domain scientists, application users, journalists, crowd workers, and everyday consumers, sales, marketing…

+ Everyone has data
+ Many have interesting questions
- Not everyone knows how to analyze it
Bob and the Local Expert

- Barrier of human communication
- Barrier of professional relations

“The limits of my language mean the limits of my world.”
Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)
First step to democratize Big Data: offer a familiar programming interface
Outline

• Motivation
• SDG: Stateful Dataflow Graphs
• Handling distributed state in SDGs
• Translating Java programs to SDGs
• Checkpoint-based fault tolerance for SDGs
• Experimental evaluation
Mutable State in a Recommender System

Matrix userItem = new Matrix();
Matrix coOcc = new Matrix();

User-Item matrix (UI):
          Item-A  Item-B
  User-A    4       5
  User-B    0       5

Co-Occurrence matrix (CO):
          Item-A  Item-B
  Item-A    1       1
  Item-B    1       2
Update with new ratings:

void addRating(int user, int item, int rating) {
    userItem.setElement(user, item, rating);
    updateCoOccurrence(coOcc, userItem);
}
Multiply for recommendation:

Vector getRec(int user) {
    Vector userRow = userItem.getRow(user);
    Vector userRec = coOcc.multiply(userRow);
    return userRec;
}
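The talk never shows updateCoOccurrence. As a rough sketch of what it could do (not the speaker's code: the map-backed Matrix, the NUM_ITEMS constant and the extra user parameter are all assumptions), counting a co-occurrence whenever one user has rated both items:

```java
import java.util.HashMap;
import java.util.Map;

// A stand-in Matrix with just the operations the slides use.
class Matrix {
    private final Map<Long, Double> cells = new HashMap<>();

    private long key(int row, int col) {
        return ((long) row << 32) | (col & 0xffffffffL);
    }

    void setElement(int row, int col, double v) { cells.put(key(row, col), v); }

    double getElement(int row, int col) { return cells.getOrDefault(key(row, col), 0.0); }
}

public class Recommender {
    static Matrix userItem = new Matrix();
    static Matrix coOcc = new Matrix();
    static final int NUM_ITEMS = 2;

    // Hypothetical co-occurrence update: two items co-occur when the same
    // user has rated both. This naive version recounts the rating user's
    // items on every new rating; a real algorithm would be incremental.
    static void updateCoOccurrence(Matrix co, Matrix ui, int user) {
        for (int a = 0; a < NUM_ITEMS; a++) {
            if (ui.getElement(user, a) == 0.0) continue;
            for (int b = 0; b < NUM_ITEMS; b++) {
                if (ui.getElement(user, b) == 0.0) continue;
                co.setElement(a, b, co.getElement(a, b) + 1);
            }
        }
    }

    static void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem, user);
    }
}
```

The point of the example is the coupling the speaker highlights: every call to addRating mutates both matrices, which is exactly the shared mutable state that complicates parallelism.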
18
Challenges When Executing with Big Data

Big Data problem: the matrices become large.

> Mutable state leads to concise algorithms but complicates parallelism and fault tolerance
> Cannot lose state after failure
> Need to manage state to support data-parallelism
19
Using Current Distributed Dataflow Frameworks

Input data → Output data

> No mutable state simplifies fault tolerance
> MapReduce: Map and Reduce tasks
> Storm: no support for state
> Spark: immutable RDDs
20
Imperative Big Data Processing

> Programming distributed dataflow graphs requires learning new programming models

Our goal: run Java programs with mutable state, but with the performance and fault tolerance of distributed dataflow systems
22
Stateful Dataflow Graphs: From Imperative Programs to Distributed Dataflows

Program.java → SDG

> @Annotations help with the translation from Java to SDGs
> Mutable distributed state in dataflow graphs
> Checkpoint-based fault tolerance recovers mutable state after failure
Outline (SDG: Stateful Dataflow Graphs)
SDG: Data, State and Computation

> SDGs separate data and state to allow data and pipeline parallelism
> Task Elements (TEs) process data
> State Elements (SEs) represent state
> Dataflows represent data
> Task Elements have local access to State Elements

Distributed Mutable State

State Elements support two abstractions for distributed mutable state:
– Partitioned SEs: task elements always access state by key
– Partial SEs: task elements can access complete state
26
Distributed Mutable State: Partitioned SEs

> Partitioned SEs split into disjoint partitions
> Dataflow routed according to a hash function: hash(msg.id)
> State partitioned according to the partitioning key; TEs access it by key
> Example: the User-Item matrix (UI), key space [0-N] split into [0-k] and [(k+1)-N]
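The routing the slide describes with hash(msg.id) can be sketched in a few lines; the class and method names here are illustrative, not SEEP's API:

```java
// Sketch of hash-based routing for Partitioned SEs: every key maps to exactly
// one partition, so a Task Element only ever touches its local state shard.
// Names are illustrative, not SEEP's API.
public class PartitionRouter {
    private final int numPartitions;

    public PartitionRouter(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // floorMod keeps the result non-negative even for negative hash codes.
    public int partitionFor(int key) {
        return Math.floorMod(Integer.hashCode(key), numPartitions);
    }
}
```

Because the mapping is deterministic and the partitions are disjoint, updates for a given key always land on the same node, which is what lets partitioned state scale without coordination.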
27
Distributed Mutable State: Partial SEs

> Partial SEs give nodes local state instances
> Partial SE access by TEs can be local or global
> Local access: data sent to one node. Global access: data sent to all nodes
28
Merging Distributed Mutable State

> Reading all partial SE instances results in a set of partial values
> Collecting the multiple partial values requires application-specific merge logic
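For the recommender, a natural application-specific merge is an element-wise sum of the per-node partial vectors; this sketch uses plain arrays instead of the talk's Vector type:

```java
// Sketch of application-specific merge logic for Partial SEs: a global read
// yields one partial result per node; here they are merged by element-wise sum.
public class PartialMerge {
    public static double[] merge(double[][] partials) {
        double[] out = new double[partials[0].length];
        for (double[] partial : partials) {
            for (int i = 0; i < out.length; i++) {
                out[i] += partial[i];
            }
        }
        return out;
    }
}
```

Other applications would merge differently (max, union, average), which is why the system leaves this logic to the programmer.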
Outline (Translating Java programs to SDGs: @Annotations)
32
From Imperative Code to Execution

Annotated Program.java → SEEP

> SEEP: a data-parallel processing platform
• Translation occurs in two stages:
  – Static code analysis: from Java to SDG
  – Bytecode rewriting: from SDG to SEEP [SIGMOD’13]
33
Translation Process

Program.java → Extract TEs, SEs and accesses → Live variable analysis (SOOT Framework) → TE and SE access code assembly (Javassist) → SEEP runnable

> Extract state and state access patterns through static code analysis
> Generate runnable code using the TE and SE connections
35
Partitioned State Annotation

> The @Partitioned field annotation indicates partitioned state, routed by hash(msg.id)

@Partitioned Matrix userItem = new SeepMatrix();
Matrix coOcc = new Matrix();

void addRating(int user, int item, int rating) {
    userItem.setElement(user, item, rating);
    updateCoOccurrence(coOcc, userItem);
}

Vector getRec(int user) {
    Vector userRow = userItem.getRow(user);
    Vector userRec = coOcc.multiply(userRow);
    return userRec;
}
36
Partial State and Global Annotations

> The @Partial field annotation indicates partial state
> @Global annotates a variable to indicate access to all partial instances

@Partitioned Matrix userItem = new SeepMatrix();
@Partial Matrix coOcc = new SeepMatrix();

void addRating(int user, int item, int rating) {
    userItem.setElement(user, item, rating);
    updateCoOccurrence(@Global coOcc, userItem);
}
37
Partial and Collection Annotations

> The @Collection annotation indicates the merge logic

@Partitioned Matrix userItem = new SeepMatrix();
@Partial Matrix coOcc = new SeepMatrix();

Vector getRec(int user) {
    Vector userRow = userItem.getRow(user);
    @Partial Vector puRec = @Global coOcc.multiply(userRow);
    Vector userRec = merge(puRec);
    return userRec;
}

Vector merge(@Collection Vector[] v) { /*…*/ }
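As plain Java, the field- and parameter-level annotations could be declared roughly as below (a sketch; the real definitions live in the SEEP code base and may differ). @Global, which annotates an expression at its use site, has no counterpart in standard Java annotation targets and is handled by the source-level analysis instead:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Sketch: possible declarations for the SDG annotations used in the talk.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Partitioned {}

@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.LOCAL_VARIABLE})
@interface Partial {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.PARAMETER)
@interface Collection {}
```

Declaring them with runtime retention would let the translation stage discover annotated state fields reflectively or via bytecode inspection.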
38
Outline (Checkpoint-based fault tolerance for SDGs)
39
Challenges of Making SDGs Fault Tolerant

Physical deployment of an SDG: task elements access local in-memory state (RAM) on physical nodes, so node failures may lead to state loss.

Checkpointing state:
• No updates allowed while state is being checkpointed
• Checkpointing state should not impact the data processing path

State backup:
• Backups are large and cannot be stored in memory
• Large writes to disk through the network have a high cost
Checkpoint Mechanism for Fault Tolerance

Asynchronous, lock-free checkpointing:
1. Freeze mutable state for checkpointing
2. Dirty state supports updates concurrently
3. Reconcile the dirty state
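The three steps (freeze, concurrent dirty updates, reconcile) can be sketched with a dirty-delta map. This is illustrative only; SEEP's actual mechanism is more involved:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the freeze / dirty-update / reconcile cycle for checkpointing.
// Illustrative only; names and structure are assumptions, not SEEP's code.
class CheckpointableState {
    private Map<String, Integer> state = new HashMap<>();
    private Map<String, Integer> dirty = null;   // non-null while checkpointing

    void put(String k, int v) {
        if (dirty != null) dirty.put(k, v);      // 2. updates go to dirty state
        else state.put(k, v);
    }

    Integer get(String k) {
        if (dirty != null && dirty.containsKey(k)) return dirty.get(k);
        return state.get(k);
    }

    Map<String, Integer> beginCheckpoint() {     // 1. freeze mutable state
        dirty = new HashMap<>();
        return new HashMap<>(state);             // snapshot handed to the backup writer
    }

    void endCheckpoint() {                       // 3. reconcile dirty state
        state.putAll(dirty);
        dirty = null;
    }
}
```

Because writes land in the dirty map while the snapshot is being copied out, processing never blocks on the checkpoint, which is the "asynchronous, lock-free" property the slide claims.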
44
Distributed M to N Checkpoint Backup

> M to N distributed backup and parallel recovery
Evaluation of SDG Performance

How does mutable state impact performance? How efficient are translated SDGs? What is the throughput/latency trade-off?

Experimental set-up:
– Amazon EC2 (c1 and m1 xlarge instances)
– Private cluster (4-core 3.4 GHz Intel Xeon servers with 8 GB RAM)
– Sun Java 7, Ubuntu 12.04, Linux kernel 3.10
Processing with Large Mutable State

[Figure: throughput (1,000 requests/s) and latency (ms) vs. workload (state read/write ratio, 1:5 to 5:1)]

> addRating and getRec functions from the recommender algorithm, while changing the read/write ratio
Combines batch and online processing to serve fresh results over large mutable state
Efficiency of Translated SDG

[Figure: throughput (GB/s) vs. number of nodes (25 to 100), SDG vs. Spark]

> Batch-oriented, iterative logistic regression
The translated SDG achieves performance similar to a non-mutable dataflow
Latency/Throughput Tradeoff

[Figure: throughput (1,000 requests/s) vs. window size (ms, 10 to 10,000) for SDG, Naiad-LowLatency, Naiad-HighThroughput and Streaming Spark]

> Streaming word count query, reporting counts over windows
SDGs achieve high throughput while maintaining low latency
Summary

Running Java programs with the performance of current distributed dataflow frameworks

SDG: Stateful Dataflow Graphs
– Abstractions for distributed mutable state
– Annotations to disambiguate types of distributed state and state access
– Checkpoint-based fault tolerance mechanism
Thank you! Any questions?

@raulcfernandez [email protected]
https://github.com/lsds/Seep/
https://github.com/raulcf/SEEPng/
BACKUP SLIDES
61
62
Scalability on State Size and Throughput

[Figure: throughput (million requests/s) and latency (ms) vs. aggregated memory (GB, 50 to 200)]

> Increase state size in a mutable KV store
Support large state without compromising throughput or latency while staying fault tolerant
63
Iteration in SDG

> Local iteration supported within one node
> Iteration across TEs requires a cycle in the dataflow

Types of Annotations

• State annotations: Partitioned, Partial
• Access annotations: Global, Partial
• Merge annotation: Collection
• Data annotations: Batch, Stream
Overhead of SDG Fault Tolerance

[Figures: latency (ms) vs. state size (GB, No FT and 1 to 5), and latency (ms) vs. checkpoint frequency (s, 2 to 10 and No FT)]

The fault tolerance mechanism's impact on performance and latency is small; state size and checkpointing frequency do not affect performance.
66
Fault Tolerance Overhead

[Figure: throughput (10,000 requests/s) and latency (ms) vs. aggregated memory (MB, 10 to 2000) for SDG, Naiad-NoDisk and Naiad-Disk]
Recovery Times

[Figure: recovery time (s) vs. state size (GB, 1 to 4) for 1-to-1, 2-to-1, 1-to-2 and 2-to-2 recovery]
68
Stragglers

[Figure: throughput (1,000 requests/s) and number of nodes vs. time (s, 0 to 60)]
69
Fault Tolerance: Sync vs. Async

[Figure: throughput (1,000 requests/s) and latency (s) vs. state size (GB, 1 to 4) for synchronous and asynchronous checkpointing]
Comparison to State-of-the-Art

System      Large State   Mutable State   Low Latency   Iteration
MapReduce   n/a           n/a             No            No
Spark       n/a           n/a             No            Yes
Storm       n/a           n/a             Yes           No
Naiad       No            Yes             Yes           Yes
SDG         Yes           Yes             Yes           Yes
SDGs are the first stateful, fault-tolerant dataflow model, enabling execution of imperative code with explicit state.
71
Characteristics of SDGs

> Runtime data parallelism (elasticity): adaptation to varying workloads and a mechanism against stragglers
> Support for cyclic graphs: efficiently represent iterative algorithms
> Low latency: pipelining tasks decreases latency
72
Bob and the Local Expert

Bob: Hi, I have a query to run on “Big Data”.
Expert: Ok, cool, tell me about it.
Bob: I want to know sales per employee on Saturdays.
Expert: … well … ok, come back in 3 days.
Bob: Well, this is actually pretty urgent…
Expert: … 2 days, I’m pretty busy.

2 days later:

Bob: Hi! Do you have the results?
Expert: Yes, here you have your sales last Saturday.
Bob: My sales? I meant all employee sales, and not only last Saturday.
Expert: Oops, sorry for that, give me 2 days…
17th–18th Nov 2014, Madrid (Spain)