
Performance Evaluation Between Checkpoint Services in Multi-tier Stateful Applications

Demis Gomes

Advisor: Glauco Gonçalves
Co-Advisor: Patricia Endo

2

INTRODUCTION

3

Introduction

• Platform-as-a-Service (PaaS)

[Diagram: the Developer deploys an Application to the PaaS; the User accesses it through the PaaS Provider]

4

Introduction

• Multi-tier stateful applications

5

Introduction

• It is important to keep an application in a PaaS running as long as possible

• Downtime causes significant financial losses

6

Introduction

• The average cost of a critical application failure per hour is $500,000 to $1 million.

Source: https://devops.com/2015/02/11/real-cost-downtime/. Last accessed: 11 Oct. 2016

Checkpoint Services!

7

Introduction

[Diagram: the Checkpoint Service sits between Developers, Users, and PaaS Providers]

8

Background

• A checkpoint service is divided into three mechanisms:
– Checkpoint saving
– Failure detection
– Failover
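Failure detection is not detailed on these slides; the sketch below is a minimal illustration, assuming the agent periodically polls the monitored application over HTTP and raises an alert after a few missed replies (endpoint, interval, and miss threshold are hypothetical):

```python
import time
import requests  # third-party HTTP client

def detect_failure(app_url, interval=1.0, max_misses=3):
    """Poll the application; report a failure after max_misses missed replies."""
    misses = 0
    while True:
        try:
            resp = requests.get(app_url, timeout=interval)
            misses = 0 if resp.status_code == 200 else misses + 1
        except requests.RequestException:
            misses += 1
        if misses >= max_misses:
            return True  # hand over to the failover mechanism
        time.sleep(interval)
```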

9

Background

• Checkpoint Service

[Diagram: the Checkpoint Service saves the App state from the active instance (checkpoint saving), detects failures (failure detection), and restores the App on the standby instance (failover)]

10

Background

• Service Availability Forum (SAF)

• Three different implementations:
– Non-collocated
– Collocated warm
– Collocated hot

11

Background

12

Checkpoint Services

[Diagram: the application-level CS uses an Agent with a state-aware application; the system-level CS uses an Agent with an HA-agnostic application inside a container; both are coordinated by a Checkpoint Manager]

13

Motivation

• Previous works presented either application-level [1] or system-level [2] checkpoint services

• Lack of consistent comparison between these services

• No implementation in accordance with the SAF standard

14

Motivation

• Carry out a performance evaluation between system-level and application-level checkpoint services, both following the SAF standard, and evaluate the impact of different recovery modes on time and resource consumption

15

Answer three questions

• System-level ~= App-level?
• Impact of changing from non-collocated to collocated?
• Bottlenecks of the system-level and application-level?

16

CHECKPOINT SERVICES

17

Application

• State-aware application
• A multi-tier stateful chat
– Frontend: provides the interface and saves user data
– Backend: saves room messages
– Database: stores information related to rooms and users

[Diagram: the Agent polls the App with GET /state and receives 200 OK]

18

Application

• State provided via JSON (backend)
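A minimal sketch of how an agent could fetch that JSON state from the backend over GET /state; the field names shown in the comment are illustrative, since the actual schema is defined by the chat application:

```python
import requests

def fetch_backend_state(app_ip, app_port):
    """Ask the state-aware backend for its current state as JSON."""
    resp = requests.get(f"http://{app_ip}:{app_port}/state", timeout=5)
    resp.raise_for_status()  # 200 OK expected
    return resp.json()

# Illustrative shape of the returned state (hypothetical field names):
# {
#   "rooms": {
#     "room-1": {"messages": ["hi", "hello"], "users": ["alice", "bob"]}
#   },
#   "last_update": "2017-04-15T12:00:00Z"
# }
```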

19

CS System-level

• We used well-known tools:
– LXC as the container technology
– NFS as the shared file system
– rsync to transfer files between instances
– CRIU to checkpoint and restore containers

CS: Checkpoint Service! :D
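A minimal sketch of how these tools could be driven from Python, assuming lxc-checkpoint as the LXC front-end to CRIU; the container name, checkpoint directory, and standby host below are hypothetical and do not reproduce the actual implementation:

```python
import subprocess

CONTAINER = "app-container"   # hypothetical container name
CKPT_DIR = "/var/ckpt/app"    # on NFS (non-collocated) or local disk (collocated)
STANDBY = "standby-host"      # hypothetical standby instance

def checkpoint_container():
    """Dump the container state with CRIU through lxc-checkpoint."""
    subprocess.run(["lxc-checkpoint", "-n", CONTAINER, "-D", CKPT_DIR], check=True)

def sync_to_standby():
    """Collocated warm: copy the checkpoint images to the standby via rsync."""
    subprocess.run(["rsync", "-az", CKPT_DIR + "/", f"{STANDBY}:{CKPT_DIR}/"], check=True)

def restore_container():
    """Failover: restore the container from the checkpoint images."""
    subprocess.run(["lxc-checkpoint", "-r", "-n", CONTAINER, "-D", CKPT_DIR], check=True)
```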

20

CS System-level

• We did not implement collocated hot because CRIU does not allow restoring into a running instance

21

CS System-level

• Checkpoint in non-collocated

[Diagram: checkpoint in non-collocated mode at the system level; Checkpoint Manager, Agents, and App containers on the Active and Standby Instances]

22

CS System-level

• Checkpoint in collocated warm

[Diagram: checkpoint in collocated warm mode at the system level; the checkpoint is transferred to the Standby Instance via rsync]

23

CS System-level

• Failover in non-collocated

[Diagram: failover in non-collocated mode at the system level]

24

CS System-level

• Failover in collocated warm

[Diagram: failover in collocated warm mode at the system level, with rsync between instances]

25

CS App-level

• The CS at the application level was developed from scratch for this work

• REST resources

Remember, CS: Checkpoint Service! :D

GET http://{manager_ip}:{manager_port}/config

RESPONSE: 200 OK, Content-Type: application/json
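A minimal sketch of an agent calling the /config resource shown above; the fields returned by the manager are not reproduced here, so the comment about the payload is only illustrative:

```python
import requests

def get_config(manager_ip, manager_port):
    """Retrieve the checkpoint-service configuration from the manager."""
    url = f"http://{manager_ip}:{manager_port}/config"
    resp = requests.get(url, timeout=5)
    resp.raise_for_status()  # expect 200 OK
    assert resp.headers.get("Content-Type", "").startswith("application/json")
    return resp.json()  # e.g. recovery mode, standby addresses (illustrative fields)
```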

26

CS App-level

• Checkpoint at the application level

[Diagram: application-level checkpoint; the Checkpoint Manager and Agents handle the state of the state-aware App on the Active and Standby Instances, in non-collocated, collocated warm, and collocated hot modes]

27

CS App-level

• Failover in non-collocated

[Diagram: failover in non-collocated mode at the application level]

28

CS App-level

• Failover in collocated warm

[Diagram: failover in collocated warm mode at the application level]

29

CS App-level

• Failover in collocated hot

[Diagram: failover in collocated hot mode at the application level]

30

EVALUATION

31

Evaluation

• Two evaluations were conducted:
– Evaluation I: failover time comparison
– Evaluation II: checkpoint time and resource consumption comparison

32

Evaluation

Physical Machines: 16 GB RAM, 8 cores, Gigabit Interface

33

Evaluation I

• Methodology
– Backend with state sizes of 1, 5, 10, 15, 20, and 25 MB
– The Experiment Manager starts the experiment and generates a failure alert
– The failover process is executed
– The failover time is collected

34

Failover time – Non-collocated

Application-level has a greater failover time

The growth is linear

35

Failover time – Non-collocated

We estimated the failover time with the state size increasing up to 100 MB

Application-level would be 66% faster
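A minimal sketch of how the extrapolation up to 100 MB could be computed, assuming the measured (state size, failover time) pairs from Evaluation I are available; the commented values are placeholders, not the measured data:

```python
import numpy as np

def extrapolate_failover(sizes_mb, times_s, target_mb=100):
    """Fit a first-degree polynomial and predict the failover time at target_mb."""
    slope, intercept = np.polyfit(sizes_mb, times_s, 1)
    return slope * target_mb + intercept

# sizes_mb = [1, 5, 10, 15, 20, 25]   # state sizes used in the experiment
# times_s  = [...]                    # measured failover times (not shown here)
# print(extrapolate_failover(sizes_mb, times_s))
```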

36

Failover time – Collocated

Application-level collocated warm is greatly impacted by the increase in state size

The values of application-level collocated hot and system-level collocated warm are very similar

37

Failover time – Collocated

Linear regression shows:

A steep increase for application-level collocated warm

A slight increase for system-level collocated warm

Constant values for collocated hot

38

Evaluation II

• Methodology
– Similarly to the previous experiment, states are saved with the same state sizes
– The Experiment Manager triggers a checkpoint process
– The checkpoint time is collected
– Resource consumption is evaluated

39

Evaluation II

• Methodology
– Resource consumption metrics:

Metric                  | Measured in
Checkpoint Time         | s
CPU Load                | %
Memory Occupation       | %
Network I/O Throughput  | Mbps
Disk I/O Throughput     | b/s
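A minimal sketch of how these metrics could be sampled on each instance, assuming the psutil library (not mentioned on the slides); throughput values in Mbps and b/s would come from differencing consecutive samples:

```python
import psutil

def sample_metrics(interval=1.0):
    """Return one sample of CPU, memory, network, and disk counters."""
    cpu = psutil.cpu_percent(interval=interval)   # CPU load in %
    mem = psutil.virtual_memory().percent         # memory occupation in %
    net = psutil.net_io_counters()                # cumulative bytes sent/received
    disk = psutil.disk_io_counters()              # cumulative bytes read/written
    return {
        "cpu_percent": cpu,
        "mem_percent": mem,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
        "disk_write_bytes": disk.write_bytes,
    }
```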

40

Evaluation II – Checkpoint times

41

Evaluation II – Active Instance

At 25 MB                | CPU   | Memory | Network (I/O) | Disk (W)
Sys-lvl collocated warm | 6.8%  | 9.4%   | 0/59.8 Mbps   | 1300 b/s
App-lvl collocated warm | 2.7%  | 9.1%   | 0/8.8 Mbps    | 9220 b/s
App-lvl collocated hot  | 2.53% | 9.5%   | 0/8.64 Mbps   | 8340 b/s

At 25 MB                | CPU | Memory | Network (I/O) | Disk (W)
Sys-lvl non-collocated  | 6%  | 9.1%   | 0/81 Mbps     | 1780 b/s
App-lvl non-collocated  | 2%  | 8.92%  | 0/11.6 Mbps   | 2410 b/s

42

Evaluation II – Standby Instance

At 25 MB                | CPU  | Memory | Network (I/O)  | Disk (W)
Sys-lvl collocated warm | 1.8% | 10.3%  | 5.1/0 Mbps     | 12500 b/s
App-lvl collocated warm | 2.5% | 11.9%  | 8.5/8.5 Mbps   | 7280 b/s
App-lvl collocated hot  | 4.1% | 12.4%  | 8.35/8.35 Mbps | 6900 b/s

At 25 MB                | CPU   | Memory | Network (I/O) | Disk (W)
Sys-lvl non-collocated  | 0.16% | 9.8%   | 0/0 Mbps      | 800 b/s
App-lvl non-collocated  | 0.2%  | 11.4%  | 0/0 Mbps      | 2600 b/s

43

Discussion

• Availability analysis over a year
• Mean Time To Recovery (MTTR) taken as the failover time
• Mean Time To Failure (MTTF) taken from an Apache Server (788.4 h/year) [3]
• Assuming that the failover time is 50 times greater
• High Availability (HA) = 99.999% (five nines)
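The availability figures in the tables below are consistent with the standard steady-state formula A = MTTF / (MTTF + MTTR); a small sketch reproducing the system-level collocated warm entry at 25 MB:

```python
def availability(mttf_s, mttr_s):
    """Steady-state availability as a percentage."""
    return 100 * mttf_s / (mttf_s + mttr_s)

mttf = 788.4 * 3600   # Apache MTTF from [3]: 788.4 h/year = 2,838,240 s
mttr = 0.38636 * 50   # failover time at 25 MB scaled by factor 50 = 19.318 s
print(round(availability(mttf, mttr), 4))  # -> 99.9993, matching the table
```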

44

Discussion

                                  | MTTR at 25 MB (s) | MTTR at 25 MB with factor 50 (s) | MTTF (s) | Availability with factor 50 (%)
System-level collocated warm      | 0.38636 | 19.318  | 2838240 | 99.9993
Application-level collocated warm | 1.27823 | 63.9115 | 2838240 | 99.997
Application-level collocated hot  | 0.25802 | 12.901  | 2838240 | 99.9995
System-level non-collocated       | 3.5441  | 177.205 | 2838240 | 99.9937
Application-level non-collocated  | 1.38795 | 69.3975 | 2838240 | 99.997

Availability analysis (25 MB)

45

Discussion

                                  | MTTR at 100 MB (s) | MTTR at 100 MB with factor 50 (s) | MTTF (s) | Availability with factor 50 (%)
System-level collocated warm      | 0.5902 | 29.51   | 2838240 | 99.9989
Application-level collocated warm | 3.8621 | 193.1   | 2838240 | 99.993
Application-level collocated hot  | 0.2677 | 13.385  | 2838240 | 99.9995
System-level non-collocated       | 9.7999 | 498.995 | 2838240 | 99.9824
Application-level non-collocated  | 4.321  | 216.05  | 2838240 | 99.9923

Availability analysis (prediction up to 100 MB)

46

CONCLUSIONS AND FUTURE WORK

47

Conclusions

Answering the questions

• System-level ~= App-level?

Yes! In collocated warm

48

Conclusions

• Impact of changing from non-collocated to collocated?
– Failover: large decrease
– Checkpoint: large increase
– Resource consumption: similar, except for CPU and disk (greater in collocated)

49

Conclusions

• Bottlenecks of the system-level and application-level?

– Application-level: disk, CPU on the standby (hot), and development time

– System-level: CPU, network, and NFS

50

Conclusions

• CS Application-level
– Private PaaS
– Applications with a large state size and a high rate of checkpoints (massive online applications)

51

Conclusions

• CS System-level
– PaaS with legacy applications
– Applications with a smaller state size and longer checkpoint intervals

52

Conclusions

• PaaS business model
– Non-collocated: free plans
– Collocated: premium plans

53

Contributions

• Short paper accepted with the results of Experiment I, entitled:

“Failover Time Evaluation Between Checkpoint Services in Multi-tier Stateful Applications”

IM-2017, Exp. Session (Qualis B1)

54

Future Work

As future work, we will study:
• Scalability of the services
• Resource consumption on the Experiment Instance

55

Acknowledgments

• Thanks!

#CatãoEterno

56

THANKS!

Demis Gomes
demismg72@gmail.com
demis.gomes@ufrpe.br

57

References

[1] KANSO, Ali; LEMIEUX, Yves. Achieving High Availability at the Application Level in the Cloud. In: 2013 IEEE Sixth International Conference on Cloud Computing. IEEE, 2013. p. 778-785.

[2] LI, Wubin; KANSO, Ali; GHERBI, Abdelouahed. Leveraging Linux containers to achieve high availability for cloud services. In: Cloud Engineering (IC2E), 2015 IEEE International Conference on. IEEE, 2015. p. 76-83.

[3] MELO, R. M. D. et al. Redundant VoD streaming service in a private cloud: availability modeling and sensitivity analysis. Mathematical Problems in Engineering, Hindawi Publishing Corporation, v. 2014, 2014.

58

BACKUP

59

Agenda

• Introduction
• Checkpoint Services
• Evaluation
– Experiment I
– Experiment II
• Conclusion and Future Work
• Acknowledgments

60

Introduction

• PaaS faces several challenges, one of which is the availability of its services

• Multi-tier stateful applications

61

Introduction

• Many PaaS do not have a mechanism that handles application failures

• Some offer a backup, but it is not transparent

62

Introduction

Tsuru only restarts the application; it does not save its last state

63

VM x Container

• VMs • Containerization

64

Objectives

• General
– Carry out a consistent comparison between checkpoint services at the system and application levels

• Specific
– Develop the two modes following the SAF standard
– Compare the services using the following metrics:
  • Failover time
  • Checkpoint time
  • Load generated on the application

65

Application

• The application generates new base states if:
– a threshold defined by the developer has been reached
– a time limit has been reached

App: 20 new messages!

App: 120 seconds without updates!
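A minimal sketch of that trigger logic, using the two example conditions from the slide (20 new messages or 120 seconds without updates); the variable and function names are illustrative:

```python
import time

MESSAGE_THRESHOLD = 20   # developer-defined threshold
TIME_LIMIT = 120         # seconds without updates

def should_checkpoint(new_messages, last_checkpoint_ts):
    """Generate a new base state when either condition holds."""
    return (new_messages >= MESSAGE_THRESHOLD
            or time.time() - last_checkpoint_ts >= TIME_LIMIT)
```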

66

CS System-level

67

CS System-level

• Checkpoint/Restore In Userspace (CRIU)

• Saves the memory context
• Freezes processes while reading their memory
• Restores processes on machines with the same filesystem

68

CS System-level

• Phoenix!

69

Checkpoint Services Implementation

• URLs implemented by the chat

70

Checkpoint Services

• CS Application-level

[Diagram: the Checkpoint Manager and Agents handle the state-aware App on the Active and Standby Instances (VM/Container), in non-collocated, collocated warm, and collocated hot modes]

71

Checkpoint Services

• CS System-level

[Diagram: the Checkpoint Manager and Agents handle an HA-agnostic App inside a VM/Container on the Active and Standby Instances, in non-collocated, collocated warm, and collocated hot modes]

72

CS System-level

• LXC must be configured to allow CRIU to checkpoint and restore containers

73

Evaluation II

• Methodology
– Checkpoint time is presented as means with a 95% Confidence Interval (CI)
– Resource consumption values are means with a 95% CI, for the active and standby instances

74

CS System-level

• The checkpoint process in non-collocated:
– save the container via CRIU and store its memory context in a file system shared between Manager and Agent
• In collocated:
– save the container via CRIU and send the state via rsync to all standby instances

75

CS System-level

76

CS System-level

• Failover process (non-collocated)

77

CS System-level

• Failover process (collocated warm)

78

CS App-level

79

CS App-level

• Failover process (non-collocated)

80

CS App-level

• Failover process (collocated warm)

81

CS App-level

• Failover process (collocated hot)

82

Evaluation I

• T-test between application-level collocated hot and system-level collocated warm

83

Evaluation II – Network received (collocated modes)

84

Evaluation II – Network received (non-collocated)

85

Evaluation II – CPU Load (collocated modes)

86

Evaluation II – CPU Load (non-collocated)

87

Evaluation II – Memory occupation (collocated modes)

88

Evaluation II – Memory occupation (non-collocated)

89

Evaluation II – Network sent (collocated modes)

90

Evaluation II – Network sent (non-collocated)

91

Evaluation II – Disk written (collocated modes)

92

Evaluation II – Disk written (non-collocated)

93

Acknowledgments

• Family
• Friends
• Creators
• UFRPE
• Advisors (the best)
• CNPq and FACEPE