Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

74
THE ABSTRACTION THAT POWERS THE BIG DATA RAÚL CASTRO FERNÁNDEZ COMPUTER SCIENCE PHD STUDENT IMPERIAL COLLEGE

Transcript of Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

Page 1: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

THE ABSTRACTION THAT POWERS THE BIG DATA

RAÚL CASTRO FERNÁNDEZCOMPUTER SCIENCE PHD STUDENT IMPERIAL COLLEGE

Page 2: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

Data!ows: The Abstraction that Powers Big Data

Raul  Castro  Fernandez  Imperial  College  London  

[email protected]  @raulcfernandez  

Page 3: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

“Big  Data  needs  Democra:za:on”  

Page 4: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

3  

Developers  and  DBAs  are  no  longer  the  only  ones  genera:ng,  processing  and  analyzing  data.  

 

Democratization of Data

Page 5: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

4  

Decision  makers,  domain  scien:sts,  applica:on  users,  journalists,  crowd  workers,  and  everyday  consumers,  sales,  

marke:ng…  

Democratization of Data

Developers  and  DBAs  are  no  longer  the  only  ones  genera:ng,  processing  and  analyzing  data.  

 

Page 6: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

5  

+  Everyone  has  data  

Page 7: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

6  

+  Everyone  has  data  

+  Many  have  interes:ng  ques:ons  

Page 8: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

7  

+  Everyone  has  data  

+  Many  have  interes:ng  ques:ons  

-­‐  Not  everyone  knows  how  to  analyze  it  

Page 9: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

8  

+  Everyone  has  data  

+  Many  have  interes:ng  ques:ons  

-­‐  Not  everyone  knows  how  to  analyze  it  

Page 10: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

9  

Bob  Local  Expert  

Page 11: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

10  

Bob  Local  Expert  

Page 12: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

11  

Bob  Local  Expert  

-­‐  Barrier  of  human  communica:on  -­‐  Barrier  of  professional  rela:ons  

Page 13: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

12  

Bob  Local  Expert  

-­‐  Barrier  of  human  communica:on  -­‐  Barrier  of  professional  rela:ons  

The  limits  of  my  language  mean  the  limits  of  my  world.  

Ludwig  WiWgenstein  “Tractatus  Logico-­‐Philosophicus  1922”  

Page 14: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

13  

First  step  to  democra:ze  Big  Data:  to  offer  a  familiar  programming  interface  

Page 15: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

•  Mo>va>on  •  SDG:  Stateful  Dataflow  Graphs  •  Handling  distributed  state  in  SDGs  •  Transla:ng  Java  programs  to  SDGs  •  Checkpoint-­‐based  fault  tolerance  for  SDGs  •  Experimental  evalua:on  

14  

Outline

?  ?  

Page 16: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

Mutable State in a Recommender System

15  

Matrix  userItem  =  new  Matrix();  Matrix  coOcc  =  new  Matrix();   Item-­‐A   Item-­‐B  

User-­‐A   4   5  

User-­‐B   0   5  

Item-­‐A   Item-­‐B  

Item-­‐A   1   1  

Item-­‐B   1   2  

User-­‐Item  matrix  (UI)  

Co-­‐Occurrence  matrix  (CO)  

Page 17: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

Mutable State in a Recommender System

16  

Matrix  userItem  =  new  Matrix();  Matrix  coOcc  =  new  Matrix();  

void  addRa>ng(int  user,  int  item,  int  ra>ng)  {            userItem.setElement(user,  item,  ra:ng);          updateCoOccurrence(coOcc,  userItem);  }  

Item-­‐A   Item-­‐B  

User-­‐A   4   5  

User-­‐B   0   5  

Item-­‐A   Item-­‐B  

Item-­‐A   1   1  

Item-­‐B   1   2  

User-­‐Item  matrix  (UI)  

Co-­‐Occurrence  matrix  (CO)  

Update  with  new  ra:ngs  

Page 18: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

Mutable State in a Recommender System

17  

Matrix  userItem  =  new  Matrix();  Matrix  coOcc  =  new  Matrix();  

void  addRa>ng(int  user,  int  item,  int  ra>ng)  {            userItem.setElement(user,  item,  ra:ng);          updateCoOccurrence(coOcc,  userItem);  }  

Vector  getRec(int  user)  {          Vector  userRow  =  userItem.getRow(user);          Vector  userRec  =  coOcc.mul:ply(userRow);            return  userRec;  }  

Item-­‐A   Item-­‐B  

User-­‐A   4   5  

User-­‐B   0   5  

Item-­‐A   Item-­‐B  

Item-­‐A   1   1  

Item-­‐B   1   2  

User-­‐Item  matrix  (UI)  

Co-­‐Occurrence  matrix  (CO)  

Update  with  new  ra:ngs  

Mul:ply  for  recommenda:on  

User-­‐B   1   2   x  

Page 19: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

18  

Challenges When Executing with Big Data

Big  Data  Problem:  Matrices  

become  large  

>  Mutable  state  leads  to  concise  algorithms  but  complicates  parallelism  and  fault  tolerance  

Matrix  userItem  =  new  Matrix();  Matrix  coOcc  =  new  Matrix();  

>  Cannot  lose  state  aRer  failure  

>  Need  to  manage  state  to  support  data-­‐parallelism  

Page 20: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

19  

Using Current Distributed Data"ow Frameworks

Input  data  

Output  data  

>  No  mutable  state  simplifies  fault  tolerance  

>  MapReduce:  Map  and  Reduce  tasks  >  Storm:  No  support  for  state  >  Spark:  Immutable  RDDs  

Page 21: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

20  

>  Programming  distributed  dataflow  graphs  requires  learning  new  programming  models  

Imperative Big Data Processing

Page 22: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

21  

Our  Goal:  Run  Java  programs  with  mutable  state  but  with    

performance  and  fault  tolerance  of    distributed  dataflow  systems  

>  Programming  distributed  dataflow  graphs  requires  learning  new  programming  models  

Imperative Big Data Processing

Page 23: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

22  

>  @Annota>ons  help  with  transla>on  from  Java  to  SDGs  >  Mutable  distributed  state  in  dataflow  graphs  

Stateful Data"ow Graphs: From Imperative Programs to Distributed Data"ows

Program.java  

SDGs:  Stateful  Dataflow  Graphs  

>  Checkpoint-­‐based  fault  tolerance  recovers  mutable  state    aRer  failure  

Page 24: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

•  Mo:va:on  •  SDG:  Stateful  Dataflow  Graphs  •  Handling  distributed  state  in  SDGs  •  Transla:ng  Java  programs  to  SDGs  •  Checkpoint-­‐based  fault  tolerance  for  SDGs  •  Experimental  evalua:on  

23  

Outline

Program.java  

Page 25: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

SDG: Data, State and Computation

>  SDGs  separate  data  and  state  to  allow  data  and  pipeline  parallelism  

24  

Task  Elements  (TEs)  process  data  

State  Elements  (SEs)  represent  state  

Dataflows  represent    

data  

>  Task  Elements  have  local  access  to  State  Elements  

Page 26: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

State  Elements  support  two  abstrac:ons  for  distributed  mutable  state  –  Par>>oned  SEs:  task  elements  always  access  

state  by  key  –  Par>al  SEs:  task  elements  can  access    

complete  state  

25  

Distributed Mutable State

Page 27: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

26  

Distributed Mutable State: Partitioned SEs

Dataflow  routed  according  to    hash  func:on  

Item-­‐A   Item-­‐B  

User-­‐A   4   5  

User-­‐B   0   5  Access  by  key  

State  par::oned  according  to  par>>oning  key  

>  Par>>oned  SEs  split  into  disjoint  par::ons  

User-­‐Item  matrix  (UI)  

hash(msg.id)  

Key  space:  [0-­‐N]  [0-­‐k]  

[(k+1)-­‐N]  

Page 28: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

27  

Distributed Mutable State: Partial SEs

Local  access:  Data  sent  to  one  

Global  access:  Data  sent  to  all  

>  Par>al  SE  gives  nodes  local  state  instances  

>  Par>al  SE  access  by  TEs  can  be  local  or  global  

Page 29: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

28  

Merging Distributed Mutable State

Merge  logic  

>  Requires  applica:on-­‐specific  merge  logic  

>  Reading  all  par:al  SE  instances  results  in  set  of  par>al  values  

Page 30: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

29  

Merging Distributed Mutable State

Mul:ple  par:al  values  

Merge  logic  

>  Requires  applica:on-­‐specific  merge  logic  

>  Reading  all  par:al  SE  instances  results  in  set  of  par>al  values  

Page 31: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

30  

Merging Distributed Mutable State

Mul:ple  par:al  values  

Collect  par:al  values  

Merge  logic  

>  Requires  applica:on-­‐specific  merge  logic  

>  Reading  all  par:al  SE  instances  results  in  set  of  par>al  values  

Page 32: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

31  

Outline

>  @Annota>ons  

•  Mo:va:on  •  SDG:  Stateful  Dataflow  Graphs  •  Handling  distributed  state  in  SDGs  •  Transla>ng  Java  programs  to  SDGs  •  Checkpoint-­‐based  fault  tolerance  for  SDGs  •  Experimental  evalua:on  

Program.java  

Page 33: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

32  

From Imperative Code to Execution

SEEP  

Annotated    program  

>  SEEP:  data-­‐parallel  processing  plaborm  

•  Transla:on  occurs  in  two  stages:  –  Sta<c  code  analysis:  From  Java  to  SDG  –  Bytecode  rewri<ng:  From  SDG  to  SEEP  [SIGMOD’13]  

Program.java  

Page 34: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

Program.java  

33  

Extract  TEs,  SEs  and  accesses  

Live  variable  analysis  

TE  and  SE  access  code  assembly  

SEEP  runnable  

SOOT  Framework  

Javassist  

>  Extract  state  and  state  access  paderns  through  sta:c  code  analysis  

>  Genera:on  of  runnable  code  using  TE  and  SE  connec:ons  

Translation Process

Page 35: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

Program.java  

34  

Extract  TEs,  SEs  and  accesses  

Live  variable  analysis  

TE  and  SE  access  code  assembly  

SEEP  runnable  

SOOT  Framework  

Javassist  

>  Extract  state  and  state  access  paderns  through  sta:c  code  analysis  

>  Genera:on  of  runnable  code  using  TE  and  SE  connec:ons  

Translation Process

Annotated    Program.java  

Page 36: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

35  

@Par>>oned  Matrix  userItem  =  new  SeepMatrix();  Matrix  coOcc  =  new  Matrix();    void  addRa:ng(int  user,  int  item,  int  ra:ng)  {        userItem.setElement(user,  item,  ra:ng);      updateCoOccurrence(coOcc,  userItem);  }    Vector  getRec(int  user)  {      Vector  userRow  =  userItem.getRow(user);      Vector  userRec  =  coOcc.mul:ply(userRow);        return  userRec;  }    

Partitioned State Annotation

>  @Par>>on  field  annota>on  indicates  par<<oned  state  

hash(msg.id)  

Page 37: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

36  

@Par::oned  Matrix  userItem  =  new  SeepMatrix();  @Par>al  Matrix  coOcc  =  new  SeepMatrix();    void  addRa:ng(int  user,  int  item,  int  ra:ng)  {        userItem.setElement(user,  item,  ra:ng);      updateCoOccurrence(@Global  coOcc,  userItem);  }  

Partial State and Global Annotations

>  @Global  annotates  variable  to  indicate  access  to  all  par:al  instances  

>  @Par>al  field  annota>on  indicates  par<al  state  

Page 38: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

37  

@Par::oned  Matrix  userItem  =  new  SeepMatrix();  @Par>al  Matrix  coOcc  =  new  SeepMatrix();    Vector  getRec(int  user)  {      Vector  userRow  =  userItem.getRow(user);      @Par>al  Vector  puRec  =  @Global  coOcc.mul:ply(userRow);        Vector  userRec  =  merge(puRec);      return  userRec;  }    Vector  merge(@Collec>on  Vector[]  v){      /*…*/  }    

Partial and Collection Annotations

>  @Collec>on  annota:on  indicates  merge  logic  

Page 39: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

38  

Outline

>  Failures  

•  Mo:va:on  •  SDG:  Stateful  Dataflow  Graphs  •  Handling  distributed  state  in  SDGs  •  Transla:ng  Java  programs  to  SDGs  •  Checkpoint-­‐Based  fault  tolerance  for  SDGs  •  Experimental  evalua:on  

Program.java  

Page 40: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

39  

Challenges of Making SDGs Fault Tolerant

Physical  deployment  of  SDG   >  Node  failures  may    lead  to  state  loss  

>  Task  elements  access  local  in-­‐memory  state  

Page 41: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

40  

Challenges of Making SDGs Fault Tolerant

RAM   RAM  

Physical  deployment  of  SDG   >  Node  failures  may    lead  to  state  loss  

>  Task  elements  access  local  in-­‐memory  state  

Physical  nodes  

Page 42: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

41  

RAM   RAM  

Physical  deployment  of  SDG   >  Node  failures  may    lead  to  state  loss  

Checkpoin>ng  State  •  No  updates  allowed  while  state  

is  being  checkpointed  •  Checkpoin:ng  state  should  not  

impact  data  processing  path  

>  Task  elements  access  local  in-­‐memory  state  

Physical  nodes  

Challenges of Making SDGs Fault Tolerant

Page 43: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

42  

RAM   RAM  

Physical  deployment  of  SDG  

•  Backups  large  and  cannot  be  stored  in  memory  

•  Large  writes  to  disk  through  network  have  high  cost  

State  Backup  

>  Node  failures  may    lead  to  state  loss  

Checkpoin>ng  State  •  No  updates  allowed  while  state  

is  being  checkpointed  •  Checkpoin:ng  state  should  not  

impact  data  processing  path  

>  Task  elements  access  local  in-­‐memory  state  

Physical  nodes  

Challenges of Making SDGs Fault Tolerant

Page 44: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

43  

Checkpoint Mechanism for Fault Tolerance

1.  Freeze  mutable  state  for  checkpoin:ng  2.  Dirty  state  supports  updates  concurrently  3.  Reconcile  dirty  state  

Asynchronous,  lock-­‐free  checkpoin>ng    

Dirty  state  

Page 45: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

44  

Distributed M to N Checkpoint Backup

M  to  N  distributed  backup  and  parallel  recovery  

Page 46: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

45  

Distributed M to N Checkpoint Backup

M  to  N  distributed  backup  and  parallel  recovery  

Page 47: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

46  

M  to  N  distributed  backup  and  parallel  recovery  

Distributed M to N Checkpoint Backup

Page 48: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

47  

M  to  N  distributed  backup  and  parallel  recovery  

Distributed M to N Checkpoint Backup

Page 49: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

48  

M  to  N  distributed  backup  and  parallel  recovery  

Distributed M to N Checkpoint Backup

Page 50: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

49  

M  to  N  distributed  backup  and  parallel  recovery  

Distributed M to N Checkpoint Backup

Page 51: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

50  

M  to  N  distributed  backup  and  parallel  recovery  

Distributed M to N Checkpoint Backup

Page 52: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

51  

M  to  N  distributed  backup  and  parallel  recovery  

Distributed M to N Checkpoint Backup

Page 53: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

52  

M  to  N  distributed  backup  and  parallel  recovery  

Distributed M to N Checkpoint Backup

Page 54: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

How  does  mutable  state  impact  performance?  How  efficient  are  translated  SDGs?  What  is  the  throughput/latency  trade-­‐off?  

Experimental  set-­‐up:  –  Amazon  EC2  (c1  and  m1  xlarge  instances)  –  Private  cluster  (4-­‐core  3.4  GHz  Intel  Xeon  servers  with  8  GB  RAM  )  –  Sun  Java  7,  Ubuntu  12.04,  Linux  kernel  3.10  

53  

Evaluation of SDG Performance

Page 55: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

54  

0

5

10

15

20

1:5 1:2 1:1 2:1 5:1

100

1000

Thro

ughp

ut(1

000

requ

ests

/s)

Late

ncy

(ms)

Workload (state read/write ratio)

ThroughputLatency

Combines  batch  and  online  processing  to  serve  fresh  results  over  large  mutable  state  

Processing with Large Mutable State

>  addRa:ng  and  getRec  func:ons  from  recommender  algorithm,  while  changing  read/write  ra:o  

Page 56: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

55  

0

10

20

30

40

50

60

25 50 75 100

Th

rou

gh

pu

t (G

B/s

)

Number of nodes

SDGSpark

Translated  SDG  achieves  performance    similar  to  non-­‐mutable  dataflow  

>  Batch-­‐oriented,  itera:ve  logis:c  regression  

E#ciency of Translated SDG

Page 57: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

56  

SDGs  achieve  high  throughput  while  main>ng  low  latency  

Latency/Throughput Tradeo$

>  Streaming  word  count  query,  repor:ng  counts  over  windows  

0

50

100

150

200

250

10 100 1000 10000Thro

ughput (1

000 r

equest

s/s)

Window size (ms)

SDGNaiad-LowLatency

Page 58: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

57  

SDGs  achieve  high  throughput  while  main>ng  low  latency  

Latency/Throughput Tradeo$

>  Streaming  word  count  query,  repor:ng  counts  over  windows  

0

50

100

150

200

250

10 100 1000 10000Thro

ughput (1

000 r

equest

s/s)

Window size (ms)

SDGNaiad-LowLatency

0

50

100

150

200

250

10 100 1000 10000Thro

ughput (1

000 r

equest

s/s)

Window size (ms)

Naiad-HighThroughputSDG

Streaming Spark

Page 59: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

58  

SDGs  achieve  high  throughput  while  main>ng  low  latency  

Latency/Throughput Tradeo$

>  Streaming  word  count  query,  repor:ng  counts  over  windows  

0

50

100

150

200

250

10 100 1000 10000Thro

ughput (1

000 r

equest

s/s)

Window size (ms)

SDGNaiad-LowLatency

0

50

100

150

200

250

10 100 1000 10000Thro

ughput (1

000 r

equest

s/s)

Window size (ms)

Naiad-HighThroughputSDG

Streaming Spark0

50

100

150

200

250

10 100 1000 10000Th

rou

gh

pu

t (1

00

0 r

eq

ue

sts/

s)

Window size (ms)

Naiad-HighThroughputSDG

Streaming SparkNaiad-LowLatency

Page 60: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

Running  Java  programs  with  the  performance  of  current  distributed  dataflow  frameworks  

SDG:  Stateful  Dataflow  Graphs  – Abstrac:ons  for  distributed  mutable  state  – Annota>ons  to  disambiguate  types  of  distributed  state  and  state  access  

– Checkpoint-­‐based  fault  tolerance  mechanism  

59  

Summary

Page 61: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

Running  Java  programs  with  the  performance  of  current  distributed  dataflow  frameworks  

SDG:  Stateful  Dataflow  Graphs  – Abstrac:ons  for  distributed  mutable  state  – Annota>ons  to  disambiguate  types  of  distributed  state  and  state  access  

– Checkpoint-­‐based  fault  tolerance  mechanism  

60  

Summary

Thank  you!  Any  Ques>ons?  

@raulcfernandez  [email protected]  

hEps://github.com/lsds/Seep/  hEps://github.com/raulcf/SEEPng/  

Page 62: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

BACKUP  SLIDES  

61  

Page 63: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

62  

0

0.5

1

1.5

2

50 100 150 200 1

10

100

1000

Th

rou

gh

pu

t (m

illio

n r

eq

ue

sts/

s)

La

ten

cy (

ms)

Aggregated memory (GB)

ThroughputLatency

Support  large  state  without  compromising  throughput  or  latency  while  staying  fault  tolerant  

Scalability  on  State  Size  and  Throughput  

>  Increase  state  size  in  a  mutated  KV  store  

Page 64: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

63  

Itera:on  in  SDG  

>  Local  itera>on  supported  by  one  node  

>  Itera>on  across  TEs  requires  cycle  in  the  dataflow  

Page 65: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

•  Par::on  •  Par:al  •  Global  •  Par:al  •  Collec:on  •  Data  annota:ons  – Batch  – Stream  

64  

Types  of  Annota:ons  

Page 66: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

Overhead  of  SDG  Fault  Tolerance  

65  

1

10

100

1000

10000

No FT 1 2 3 4 5

Late

ncy

(ms)

State size (GB)

1

10

100

1000

2 4 6 8 10 No FT

Late

ncy

(ms)

Checkpoint frequency (s)

Fault  Tolerance  mechanism  impact  on  performance  and  

latency  is  small.  

State  size  and  checkpoin>ng  Frequency  do  not  affect  the  performance  

Page 67: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

66  

0

2

4

6

8

10

10 100 1000 2000 0

20

40

60

80

100

Thro

ughput (1

0,0

00 r

equest

s/s)

Late

ncy

(m

s)

Aggregated memory (MB)

SDGNaiad-NoDiskNaiad-DiskSDG (latency)Naiad-NoDisk (latency)

Fault  Tolerance  Overhead  

Page 68: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

0

5

10

15

20

25

30

35

40

1 2 4

Reco

very

tim

e (

s)

State size (GB)

1-to-1 recovery2-to-1 recovery1-to-2 recovery2-to-2 recovery

67  

Recovery  Times  

Page 69: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

68  

0

5

10

15

20

25

30

0 10 20 30 40 50 60 0

1

2

3

4

5

Th

rou

gh

pu

t (1

00

0 r

eq

ue

st/s

)

Nu

mb

er

of

no

de

s

Time (s)

ThroughputNodes

Stragglers  

Page 70: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

69  

0

50

100

150

200

250

1 2 3 40.001

0.01

0.1

1

10

Thro

ughp

ut(1

000

requ

ests

/s)

Late

ncy

(s)

State size (GB)

T'put (Sync)Latency (Sync)T'put (Async)

Fault  Tolerance  Sync.  Vs.  Async.  

Page 71: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

System   Large  State   Mutable  State   Low  Latency   Itera>on  

MapReduce   n/a   n/a   No   No  

Spark   n/a   n/a   No   Yes  

Storm   n/a   n/a   Yes   No  

Naiad   No   Yes   Yes   Yes  

SDG   Yes   Yes   Yes   Yes  

70  

Comparison  to  State-­‐of-­‐the-­‐Art  

SDGs  are  first  stateful  fault  tolerant  model;  enabling  execu:on  of  impera:ve  code  with  explicit  state  

Page 72: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

71  

Characteris:cs  of  SDGs  

>  Run>me  Data  Parallelism  (elas>city)  

>  Support  for  Cyclic  Graphs  

>  Low  Latency  

Adapta:on  to  varying  workloads    and  mechanism  against  stragglers  

Efficiently  represent  itera:ve    algorithms  

Pipelining  tasks  decreases    latency  

Page 73: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

72  

Bob  Local  Expert  

Hi,  I  have  a  query  to  run  on  “Big  Data”  

Ok,  cool,  tell  me  about  it  

I  want  to  know  sales  per  employee  on  Saturdays  

…  well  …  ok,  come  in  3  days  

Well,  this  is  actually  preWy  urgent…  

…  2  days,  I’m  preWy  busy  

2  Days  Ayer  

Hi!  You  have  the  results?  

Yes,  here  you  have  your  sales  last  Saturday  

My  sales?  I  meant  all  employee  sales,  and  not  only  last  Saturday  

ups,  sorry  for  that,  give  me  2  days…  

Page 74: Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at Big Data Spain 2014

17TH ~ 18th NOV 2014MADRID (SPAIN)