Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and...

25
Riza Suminto, Agung Laksono * , Anang Satria * , Thanh Do , Haryadi Gunawi *

Transcript of Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and...

Page 1: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

Riza Suminto, Agung Laksono*, Anang Satria*, ���Thanh Do†, Haryadi Gunawi

† *

Page 2: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

2SPV @ HotCloud ’15

Page 3: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

q Users demand high dependability, reliability, and performance stability

q Amazon found that every 100ms of latency cost them 1% in sales

q Google found an extra 0.5 second in search page generation time dropped traffic by 20%

3SPV @ HotCloud ’15

Page 4: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, SOCC’14

4

22%

SPV @ HotCloud ’15

Page 5: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

Performance ���Bug

System Performance ���Verifier

SPV @ HotCloud ’15 5

Page 6: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

6

q Jobs take multiple times than usual to finish§  Improper speculative execution

JCH1 & TPL1 & FPL2 & FTY1

§  Unnecessary repeated recoveryTPL1 & TPL4 & FTY4 & TOP1

SPV @ HotCloud ’15

Page 7: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

7

Map read locally

Mappers and reducers ���in different nodes

All-to-All

Fault at map node

Slow NIC

DLCA

TPLA

FPLA

FTYA

JCHA

M1

M2

M3

Mappers Reducers

All reducers slow!

DLCA & TPLA & JCHA & FPLA & FTYA

No straggler = No SpecExec

SPV @ HotCloud ’15

slow!

Page 8: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

q  DLCA & TPLA & JCHA & FPLA & FTYA

8

M1

M2

M3

MappersDN

DLCB = read remote

Straggler!

SPV @ HotCloud ’15

q  DLCA & TPLA & JCHA & FPLA & FTYA

M1

M2

M3

Mappers Reducers

Page 9: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

q  DLCA & TPLA & JCHA & FPLA & FTYA

9

M1

M2

M3

Mappers Reducers

FPLBslow reducer =

Straggler!

SPV @ HotCloud ’15

q  DLCA & TPLA & JCHA & FPLA & FTYA

M1

M2

M3

Mappers Reducers

Page 10: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

10

Mappers and Reducers ���in different nodes

Mappers and Reducers ���in different racks

Large number of nodes per rack

Slow inter-rack switch

M

M

M

M

R

Rack 1 Rack 2

M

TPLA

TPLB

TOPA

FTYB

TPLA & TPLB & TOPA & FTYB

SPV @ HotCloud ’15

slow!

Page 11: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

q  Untriggered Speculative Execution§  MR-70001 = JCH1 & TPL1 & FPL2 & FTY1

§  MR-70002 = DSR1 & DLC1 & FPL1 & FTY1

§  MR-5533 = FTY2 & FPL3 & TPL3

§  …

q  O(n) Recovery§  MR-5251 = FTY3 & FPL3 & FTM1 §  MR-5060 = TPL1 & TPL3 & FTY1 & FPL2 §  MR-1800 = TPL1 & TPL4 & FTY4 & TOP1 §  …

q  Long lock contention§  MR-9191 = FTY3 & FPL3 & FTM1 §  MR-9292 = TPL1 & TPL3 & FTY1 & FPL2 §  MR-9393 = TPL1 & TPL4 & FTY4 & TOP1 §  …

11

Scenario Type Possible Condition

DLC: Data Locality (1) Read from remote disk, (2) read from local disk, ...

DSR: Data Source (1) Some tasks read from same datanode, (2) all tasks read from different datanodes, …

JCH: Job Characteristic

Map-reduce is (1) many-to-all, (2) all-to-many, (3) large fan-in, (4) large fan-out, ...

JSZ: Job Size (1)1GBjarfile,(2)1MBjarfile,...

LSZ: Load Size (1) Thousands of tasks, (2) small number of tasks, …

FTY: Fault Type (1) Slow node/NIC, (2) Node disconnect/packet drop, (3) Disk error/out of space, (4) Rack switch, …

FPL: Fault Placement Slowdown fault injection at the (1) source datanode, (2) mapper, (3) reducer, …

FGR: Fault Ganularity (1) Single disk/NIC, (2) single node (deadnode), (3) en- tire rack (network switch), …

FTM: Fault Timing (1) During shuffling, (2) during 95% of task completion, …

TOP: Topology (1) 30 nodes per rack, (2) 3 nodes per rack, …

TPL: Task Placement (1) Mappers and reducers are in different nodes, (2) AM and reducers in different nodes, (3) Mappers are in the same node, (4) Most of reducers placed in the same rack, ...

SPV @ HotCloud ’15

Page 12: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

Performance ���Bug

System Performance ���Verifier

SPV @ HotCloud ’15 12

Page 13: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

SPV @ HotCloud ’15 13

q Benchmarking

q Hundreds benchmark for every scenario

q Injecting slowdowns and failures

q Take days to weeks!!

Page 14: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

14

q Four goals in performance verification§  Fast§  Covers many deployment scenario§  Runs in pre-deployment§  Directly checks implementation code

SPV @ HotCloud ’15

Formal modeling tools!

Page 15: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

15SPV @ HotCloud ’15

@Datapublic class JobInProgress { JobID jobId; TaskInProgress maps[]; ...}@IOpublic HeartbeatResponse heartbeat (HeartbeatData hd){ ...}

Target system (e.g., Hadoop code)

SPV Compiler

Auto-generated model (in Colored Petri Net)

Performance Verification

20X larger than hand model

Hand model

Page 16: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

16

Tasks

NodeTask to

Run

(“T1”,map)  

A  @0   (A,“T1”,map)  @10  

input(node,task);  output(assignment);  action      let  val  (id,type)  =  task  in      (node,id,type)  end;  

@+10  

node   assignment  

task  

Schedule Task

SPV @ HotCloud ’15

Page 17: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

17

CPN Java

SPV @ HotCloud ’15

Page 18: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

q Java è SysJava§  Data flattening§  Code modularization§  Annotation tagging

q SysJava è Model compiler

18SPV @ HotCloud ’15

Page 19: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

q Java system states = ArrayList, Map, Tree,…

q CPN states = multisets

19

List<JobInProgress> runningJobs;

public class JobInProgress { JobID jobId; TaskInProgress maps[]; ...}

class TaskInProgress { TaskID id; double progress; ...}

Job In Progress

Task In Progress

Job Task Mapping

[(1)]  

[(1,a),(1,b)]  

[(a,10%),(b,15%)]  

SPV @ HotCloud ’15

Page 20: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

20

private boolean processHeartbeat( TaskTrackerStatus trackerStats) {

synchronized (taskTrackers) { ... }

for (TaskStatus ts: trackerStats) { tasks.get(ts.id).updateStatus(ts); }

...}

@ProcessStateprivate void initCheck() { synchronized (taskTrackers) { ... }}

@ForEachprivate void updateStatuses( TaskTrackerStatus trackerStats) { for (TaskStatus ts: trackerStats) { ... }}

@GetStateprivate TaskInProgress getTask(TaskID id) { tasks.get(ts.id);}@UpdateStateprivate void tipUpdate(TaskInProgress tip, TaskStatus ts) { tip.updateStatus(ts);}

Modular function

Control Flow logic

CRUD Logic

SPV @ HotCloud ’15

Page 21: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

q Assist compiler

q Annotation Category:§  Data Structure§  I/O§  CRUD & Process§  Miscellaneous

21

public HeartbeatResponse heartbeat (HeartbeatData hd) { ...}

public class JobInProgress { JobID jobId; TaskInProgress maps[]; ...}

SPV @ HotCloud ’15

@Data

@IO

Page 22: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

q  SPV Compiler è Executable XML

q  Define configurations, assertions, and specifications

q  Explore every non-deterministic choices§  Task to node mapping

22

Tasks

Node

Task to Run

(“T1”,map)  

B    

(A,“T1”,map)  

Tasks

Node

Task to Run

(“T1”,map)  

B    

T1 on A T1 on B

A    

Schedule Task

A    

(B,“T1”,map)  

Schedule Task

SPV @ HotCloud ’15

Page 23: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

q  5305 lines of code on top of WALA & Access/CPN

q  Hadoop MapReduce 1.2.1, with 1067 lines code change

q  20x larger than hand-made model

q  34 scenario, 30 assertion violation, 4 performance bug

q  1.5 hour model checking

23

Configuration ValueWorker Node Node A, B

Data Node Node A, B, C

Tasks 2 Task

Fault Type Slow Data Node

Fault Placement Node C

SPV @ HotCloud ’15

Page 24: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

24

http://ucare.cs.uchicago.edu

SPV @ HotCloud ’15

Page 25: Riza Suminto, Agung Laksono , Anang Satria Thanh Do ... demand high dependability, reliability, and performance stability! Amazon found that every 100ms of latency cost them 1% in

25

q Is it time for pre-deployment detection of performance bugs?

q Bridging system code and formal methods

q Future of data-centric languages

q Beyond Hadoop

q Root cause anatomy of performance bugs

q Beyond performance bugs

SPV @ HotCloud ’15