
Performance Evaluation Between Checkpoint Services in Multi-tier Stateful Applications

Demis Gomes

Advisor: Glauco Gonçalves
Co-Advisor: Patricia Endo

2

INTRODUCTION

3

Introduction

• Platform-as-a-Service (PaaS)

[Diagram: the Developer deploys an Application to the PaaS; the User accesses it through the PaaS Provider]

4

Introduction

• Multi-tier stateful applications

5

Introduction

• It is important to keep an application in a PaaS running as long as possible

• Downtime causes significant financial losses

6

Introduction

• The average cost of a critical application failure per hour is $500,000 to $1 million.

Source: https://devops.com/2015/02/11/real-cost-downtime/. Last accessed: 11 Oct. 2016

Checkpoint Services!

7

Introduction

[Diagram: the Checkpoint Service sits between Developers, Users, and PaaS Providers]

8

Background

• A checkpoint service is divided into three mechanisms:
– Checkpoint saving
– Failure detection
– Failover
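Failure detection is not detailed on these slides; the sketch below is a minimal illustration, assuming the agent periodically polls the monitored application over HTTP and raises an alert after a few missed replies (endpoint, interval, and miss threshold are hypothetical):

```python
import time
import requests  # third-party HTTP client

def detect_failure(app_url, interval=1.0, max_misses=3):
    """Poll the application; report a failure after max_misses missed replies."""
    misses = 0
    while True:
        try:
            resp = requests.get(app_url, timeout=interval)
            misses = 0 if resp.status_code == 200 else misses + 1
        except requests.RequestException:
            misses += 1
        if misses >= max_misses:
            return True  # hand over to the failover mechanism
        time.sleep(interval)
```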

9

Background

• Checkpoint Service

[Diagram: the Checkpoint Service saves the App state from the active instance (checkpoint saving), detects failures (failure detection), and restores the App on the standby instance (failover)]

10

Background

• Service Availability Forum (SAF)

• Three different implementations:
– Non-collocated
– Collocated warm
– Collocated hot

11

Background

12

Checkpoint Services

[Diagram: the application-level CS uses an Agent with a state-aware application; the system-level CS uses an Agent with an HA-agnostic application inside a container; both are coordinated by a Checkpoint Manager]

13

Motivation

• Previous works presented either application-level [1] or system-level [2] checkpoint services

• Lack of consistent comparison between these services

• No implementation in accordance with the SAF standard

14

Motivation

• Carry out a performance evaluation between system-level and application-level checkpoint services, both following the SAF standard, and evaluate the impact of different recovery modes on time and resource consumption

15

Answer three questions

• System-level ~= App-level?
• Impact of changing from non-collocated to collocated?
• Bottlenecks of the system-level and application-level?

16

CHECKPOINT SERVICES

17

Application

• State-aware application
• A multi-tier stateful chat
– Frontend: provides the interface and saves user data
– Backend: saves room messages
– Database: stores information related to rooms and users

[Diagram: the Agent polls the App with GET /state and receives 200 OK]

18

Application

• State provided via JSON (backend)
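A minimal sketch of how an agent could fetch that JSON state from the backend over GET /state; the field names shown in the comment are illustrative, since the actual schema is defined by the chat application:

```python
import requests

def fetch_backend_state(app_ip, app_port):
    """Ask the state-aware backend for its current state as JSON."""
    resp = requests.get(f"http://{app_ip}:{app_port}/state", timeout=5)
    resp.raise_for_status()  # 200 OK expected
    return resp.json()

# Illustrative shape of the returned state (hypothetical field names):
# {
#   "rooms": {
#     "room-1": {"messages": ["hi", "hello"], "users": ["alice", "bob"]}
#   },
#   "last_update": "2017-04-15T12:00:00Z"
# }
```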

19

CS System-level

• We used well-known tools:
– LXC as the container technology
– NFS as the shared file system
– rsync to transfer files between instances
– CRIU to checkpoint and restore containers

CS: Checkpoint Service! :D
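A minimal sketch of how these tools could be driven from Python, assuming lxc-checkpoint as the LXC front-end to CRIU; the container name, checkpoint directory, and standby host below are hypothetical and do not reproduce the actual implementation:

```python
import subprocess

CONTAINER = "app-container"   # hypothetical container name
CKPT_DIR = "/var/ckpt/app"    # on NFS (non-collocated) or local disk (collocated)
STANDBY = "standby-host"      # hypothetical standby instance

def checkpoint_container():
    """Dump the container state with CRIU through lxc-checkpoint."""
    subprocess.run(["lxc-checkpoint", "-n", CONTAINER, "-D", CKPT_DIR], check=True)

def sync_to_standby():
    """Collocated warm: copy the checkpoint images to the standby via rsync."""
    subprocess.run(["rsync", "-az", CKPT_DIR + "/", f"{STANDBY}:{CKPT_DIR}/"], check=True)

def restore_container():
    """Failover: restore the container from the checkpoint images."""
    subprocess.run(["lxc-checkpoint", "-r", "-n", CONTAINER, "-D", CKPT_DIR], check=True)
```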

20

CS System-level

• We did not implement collocated hot because CRIU does not allow restoring into a running instance

21

CS System-level

• Checkpoint in non-collocated

[Diagram: checkpoint in non-collocated mode at the system level; Checkpoint Manager, Agents, and App containers on the Active and Standby Instances]

22

CS System-level

• Checkpoint in collocated warm

[Diagram: checkpoint in collocated warm mode at the system level; the checkpoint is transferred to the Standby Instance via rsync]

23

CS System-level

• Failover in non-collocated

[Diagram: failover in non-collocated mode at the system level]

24

CS System-level

• Failover in collocated warm

[Diagram: failover in collocated warm mode at the system level, with rsync between instances]

25

CS App-level

• The CS at the application level was developed from scratch for this work

• REST resources

Remember, CS: Checkpoint Service! :D

GET http://{manager_ip}:{manager_port}/config

RESPONSE: 200 OK, Content-Type: application/json
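A minimal sketch of an agent calling the /config resource shown above; the fields returned by the manager are not reproduced here, so the comment about the payload is only illustrative:

```python
import requests

def get_config(manager_ip, manager_port):
    """Retrieve the checkpoint-service configuration from the manager."""
    url = f"http://{manager_ip}:{manager_port}/config"
    resp = requests.get(url, timeout=5)
    resp.raise_for_status()  # expect 200 OK
    assert resp.headers.get("Content-Type", "").startswith("application/json")
    return resp.json()  # e.g. recovery mode, standby addresses (illustrative fields)
```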

26

CS App-level

• Checkpoint at the application level

[Diagram: application-level checkpoint; the Checkpoint Manager and Agents handle the state of the state-aware App on the Active and Standby Instances, in non-collocated, collocated warm, and collocated hot modes]

27

CS App-level

• Failover in non-collocated

[Diagram: failover in non-collocated mode at the application level]

28

CS App-level

• Failover in collocated warm

[Diagram: failover in collocated warm mode at the application level]

29

CS App-level

• Failover in collocated hot

[Diagram: failover in collocated hot mode at the application level]

30

EVALUATION

31

Evaluation

• Two evaluations were conducted:
– Evaluation I: failover time comparison
– Evaluation II: checkpoint time and resource consumption comparison

32

Evaluation

Physical Machines: 16 GB RAM, 8 cores, Gigabit Interface

33

Evaluation I

• Methodology
– Backend with state sizes of 1, 5, 10, 15, 20, and 25 MB
– The Experiment Manager starts the experiment and generates a failure alert
– The failover process is executed
– The failover time is collected

34

Failover time – Non-collocated

Application-level has a greater failover time

The growth is linear

35

Failover time – Non-collocated

We estimated the failover time with the state size increasing up to 100 MB

Application-level would be 66% faster
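A minimal sketch of how the extrapolation up to 100 MB could be computed, assuming the measured (state size, failover time) pairs from Evaluation I are available; the commented values are placeholders, not the measured data:

```python
import numpy as np

def extrapolate_failover(sizes_mb, times_s, target_mb=100):
    """Fit a first-degree polynomial and predict the failover time at target_mb."""
    slope, intercept = np.polyfit(sizes_mb, times_s, 1)
    return slope * target_mb + intercept

# sizes_mb = [1, 5, 10, 15, 20, 25]   # state sizes used in the experiment
# times_s  = [...]                    # measured failover times (not shown here)
# print(extrapolate_failover(sizes_mb, times_s))
```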

36

Failover time – Collocated

Application-level collocated warm is greatly impacted by the increase in state size

The values of application-level collocated hot and system-level collocated warm are very similar

37

Failover time – Collocated

Linear regression shows:

A steep increase for application-level collocated warm

A slight increase for system-level collocated warm

Constant values for collocated hot

38

Evaluation II

• Methodology
– Similarly to the previous experiment, states are saved with the same state sizes
– The Experiment Manager triggers a checkpoint process
– The checkpoint time is collected
– Resource consumption is evaluated

39

Evaluation II

• Methodology
– Resource consumption metrics:

Metric                  | Measured in
Checkpoint Time         | s
CPU Load                | %
Memory Occupation       | %
Network I/O Throughput  | Mbps
Disk I/O Throughput     | b/s
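A minimal sketch of how these metrics could be sampled on each instance, assuming the psutil library (not mentioned on the slides); throughput values in Mbps and b/s would come from differencing consecutive samples:

```python
import psutil

def sample_metrics(interval=1.0):
    """Return one sample of CPU, memory, network, and disk counters."""
    cpu = psutil.cpu_percent(interval=interval)   # CPU load in %
    mem = psutil.virtual_memory().percent         # memory occupation in %
    net = psutil.net_io_counters()                # cumulative bytes sent/received
    disk = psutil.disk_io_counters()              # cumulative bytes read/written
    return {
        "cpu_percent": cpu,
        "mem_percent": mem,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
        "disk_write_bytes": disk.write_bytes,
    }
```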

40

Evaluation II – Checkpoint times

41

Evaluation II – Active Instance

At 25 MB                | CPU   | Memory | Network (I/O) | Disk (W)
Sys-lvl collocated warm | 6.8%  | 9.4%   | 0/59.8 Mbps   | 1300 b/s
App-lvl collocated warm | 2.7%  | 9.1%   | 0/8.8 Mbps    | 9220 b/s
App-lvl collocated hot  | 2.53% | 9.5%   | 0/8.64 Mbps   | 8340 b/s

At 25 MB                | CPU | Memory | Network (I/O) | Disk (W)
Sys-lvl non-collocated  | 6%  | 9.1%   | 0/81 Mbps     | 1780 b/s
App-lvl non-collocated  | 2%  | 8.92%  | 0/11.6 Mbps   | 2410 b/s

42

Evaluation II – Standby Instance

At 25 MB                | CPU  | Memory | Network (I/O)  | Disk (W)
Sys-lvl collocated warm | 1.8% | 10.3%  | 5.1/0 Mbps     | 12500 b/s
App-lvl collocated warm | 2.5% | 11.9%  | 8.5/8.5 Mbps   | 7280 b/s
App-lvl collocated hot  | 4.1% | 12.4%  | 8.35/8.35 Mbps | 6900 b/s

At 25 MB                | CPU   | Memory | Network (I/O) | Disk (W)
Sys-lvl non-collocated  | 0.16% | 9.8%   | 0/0 Mbps      | 800 b/s
App-lvl non-collocated  | 0.2%  | 11.4%  | 0/0 Mbps      | 2600 b/s

43

Discussion

• Availability analysis over a year
• Mean Time To Recovery (MTTR) taken as the failover time
• Mean Time To Failure (MTTF) taken from an Apache Server (788.4 h/year) [3]
• Assuming that the failover time is 50 times greater
• High Availability (HA) = 99.999% (five nines)
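The availability figures in the tables below are consistent with the standard steady-state formula A = MTTF / (MTTF + MTTR); a small sketch reproducing the system-level collocated warm entry at 25 MB:

```python
def availability(mttf_s, mttr_s):
    """Steady-state availability as a percentage."""
    return 100 * mttf_s / (mttf_s + mttr_s)

mttf = 788.4 * 3600   # Apache MTTF from [3]: 788.4 h/year = 2,838,240 s
mttr = 0.38636 * 50   # failover time at 25 MB scaled by factor 50 = 19.318 s
print(round(availability(mttf, mttr), 4))  # -> 99.9993, matching the table
```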

44

Discussion

                                  | MTTR at 25 MB (s) | MTTR at 25 MB with factor 50 (s) | MTTF (s) | Availability with factor 50 (%)
System-level collocated warm      | 0.38636 | 19.318  | 2838240 | 99.9993
Application-level collocated warm | 1.27823 | 63.9115 | 2838240 | 99.997
Application-level collocated hot  | 0.25802 | 12.901  | 2838240 | 99.9995
System-level non-collocated       | 3.5441  | 177.205 | 2838240 | 99.9937
Application-level non-collocated  | 1.38795 | 69.3975 | 2838240 | 99.997

Availability analysis (25 MB)

45

Discussion

                                  | MTTR at 100 MB (s) | MTTR at 100 MB with factor 50 (s) | MTTF (s) | Availability with factor 50 (%)
System-level collocated warm      | 0.5902 | 29.51   | 2838240 | 99.9989
Application-level collocated warm | 3.8621 | 193.1   | 2838240 | 99.993
Application-level collocated hot  | 0.2677 | 13.385  | 2838240 | 99.9995
System-level non-collocated       | 9.7999 | 498.995 | 2838240 | 99.9824
Application-level non-collocated  | 4.321  | 216.05  | 2838240 | 99.9923

Availability analysis (prediction up to 100 MB)

46

CONCLUSIONS AND FUTURE WORK

47

Conclusions

Answering the questions

• System-level ~= App-level?

Yes! In collocated warm

48

Conclusions

• Impact of changing from non-collocated to collocated?
– Failover: large decrease
– Checkpoint: large increase
– Resource consumption: similar, except for CPU and disk (greater in collocated)

49

Conclusions

• Bottlenecks of the system-level and application-level?

– Application-level: disk, CPU on the standby (hot), and development time

– System-level: CPU, network, and NFS

50

Conclusions

• CS Application-level
– Private PaaS
– Applications with a large state size and a high rate of checkpoints (massive online applications)

51

Conclusions

• CS System-level
– PaaS with legacy applications
– Applications with a smaller state size and longer checkpoint intervals

52

Conclusions

• PaaS business model
– Non-collocated: free plans
– Collocated: premium plans

53

Contributions

• Short paper accepted with the results of Experiment I, entitled:

“Failover Time Evaluation Between Checkpoint Services in Multi-tier Stateful Applications”

IM-2017, Exp. Session (Qualis B1)

54

Future Work

As future work, we will study:
• Scalability of the services
• Resource consumption on the Experiment Instance

55

Acknowledgments

• Thanks!

#CatãoEterno

56

THANKS!

Demis Gomes
demismg72@gmail.com
demis.gomes@ufrpe.br

57

References

[1] KANSO, Ali; LEMIEUX, Yves. Achieving High Availability at the Application Level in the Cloud. In: 2013 IEEE Sixth International Conference on Cloud Computing. IEEE, 2013. p. 778-785.

[2] LI, Wubin; KANSO, Ali; GHERBI, Abdelouahed. Leveraging Linux containers to achieve high availability for cloud services. In: Cloud Engineering (IC2E), 2015 IEEE International Conference on. IEEE, 2015. p. 76-83.

[3] MELO, R. M. D. et al. Redundant VoD streaming service in a private cloud: availability modeling and sensitivity analysis. Mathematical Problems in Engineering, Hindawi Publishing Corporation, v. 2014, 2014.

58

BACKUP

59

Agenda

• Introduction
• Checkpoint Services
• Evaluation
– Experiment I
– Experiment II
• Conclusion and Future Work
• Acknowledgments

60

Introduction

• PaaS faces several challenges, one of which is the availability of its services

• Multi-tier stateful applications

61

Introduction

• Many PaaS do not have a mechanism that handles application failures

• Some offer a backup, but it is not transparent

62

Introduction

Tsuru only restarts the application; it does not save its last state

63

VM x Container

• VMs • Containerization

64

Objectives

• General
– Carry out a consistent comparison between checkpoint services at the system and application levels

• Specific
– Develop the two modes following the SAF standard
– Compare the services using the following metrics:
  • Failover time
  • Checkpoint time
  • Load generated on the application

65

Application

• The application generates new base states if:
– a threshold defined by the developer has been reached
– a time limit has been reached

App: 20 new messages!

App: 120 seconds without updates!
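A minimal sketch of that trigger logic, using the two example conditions from the slide (20 new messages or 120 seconds without updates); the variable and function names are illustrative:

```python
import time

MESSAGE_THRESHOLD = 20   # developer-defined threshold
TIME_LIMIT = 120         # seconds without updates

def should_checkpoint(new_messages, last_checkpoint_ts):
    """Generate a new base state when either condition holds."""
    return (new_messages >= MESSAGE_THRESHOLD
            or time.time() - last_checkpoint_ts >= TIME_LIMIT)
```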

66

CS System-level

67

CS System-level

• Checkpoint/Restore In Userspace (CRIU)

• Saves the memory context
• Freezes processes while reading their memory
• Restores processes on machines with the same filesystem

68

CS System-level

• Phoenix!

69

Checkpoint Services Implementation

• URLs implemented by the chat

70

Checkpoint Services

• CS Application-level

[Diagram: the Checkpoint Manager and Agents handle the state-aware App on the Active and Standby Instances (VM/Container), in non-collocated, collocated warm, and collocated hot modes]

71

Checkpoint Services

• CS System-level

[Diagram: the Checkpoint Manager and Agents handle an HA-agnostic App inside a VM/Container on the Active and Standby Instances, in non-collocated, collocated warm, and collocated hot modes]

72

CS System-level

• LXC must be configured to allow CRIU to checkpoint and restore containers

73

Evaluation II

• Methodology
– Checkpoint time is presented as means with a 95% Confidence Interval (CI)
– Resource consumption values are means with a 95% CI, for the active and standby instances

74

CS System-level

• The checkpoint process in non-collocated:
– save the container via CRIU and store its memory context in a file system shared between Manager and Agent
• In collocated:
– save the container via CRIU and send the state via rsync to all standby instances

75

CS System-level

76

CS System-level

• Failover process (non-collocated)

77

CS System-level

• Failover process (collocated warm)

78

CS App-level

79

CS App-level

• Failover process (non-collocated)

80

CS App-level

• Failover process (collocated warm)

81

CS App-level

• Failover process (collocated hot)

82

Evaluation I

• T-test between application-level collocated hot and system-level collocated warm

83

Evaluation II – Network received (collocated modes)

84

Evaluation II – Network received (non-collocated)

85

Evaluation II – CPU Load (collocated modes)

86

Evaluation II – CPU Load (non-collocated)

87

Evaluation II – Memory occupation (collocated modes)

88

Evaluation II – Memory occupation (non-collocated)

89

Evaluation II – Network sent (collocated modes)

90

Evaluation II – Network sent (non-collocated)

91

Evaluation II – Disk written (collocated modes)

92

Evaluation II – Disk written (non-collocated)

93

Acknowledgments

• Family
• Friends
• Creators
• UFRPE
• Advisors (the best)
• CNPq and FACEPE