Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous...

19
Continuous availability: from the shift paradigm to unmanned operation. Pietro Tiberi 17 January 2018 – TIPS Contact Group

Transcript of Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous...

Page 1: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

Continuous availability: from the shift paradigm

to unmanned operation.

Pietro Tiberi

17 January 2018 – TIPS Contact Group

Page 2: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

2

Agenda

Continuous availability: from the shift paradigm to unmanned operation

1

Introduction

2

Continuous

Availability

3

Results

4

Conclusions and perspective

Page 3: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

3

Introduction TIPS Non functional requirements - Reliability / Availability

(RPO=0)

(RTO=15 minutes)

Transactions Lost

Downtime

99.9%

Continuous availability: from the shift paradigm to unmanned operation

Page 4: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

4

Introduction Datacenter Operations

Continuous availability: from the shift paradigm to unmanned operation

Human based

(on shifts) Unmanned

Page 5: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

5

CONTINUOUS OPERATION

Continuous availability: from the shift paradigm to unmanned operation

Page 6: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

6

Continuous Availability From high availability to continuous availability

Continuous availability: from the shift paradigm to unmanned operation

o Redundancy

o Fault Tolerance

o Clustering

o Active Active configuration

o Proactive

monitoring

o Continuous

delivery

o Automatic

remediation

o Dynamic capacity

management

Page 7: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

7

Continuous Availability Proactive Monitoring

Continuous availability: from the shift paradigm to unmanned operation

o Infrastructure monitoring

o Application monitoring o Detect events

before failures

o Trigger automatic

actions

o Analyze the event

Page 8: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

8

Continuous Availability IT Automation

Continuous availability: from the shift paradigm to unmanned operation

Page 9: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

9

Continuous Availability From Agile to Devops

Continuous availability: from the shift paradigm to unmanned operation

Page 10: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

10

Continuous Availability DevOps - Everything as Code

Continuous availability: from the shift paradigm to unmanned operation

Code

Virtual Infrastructure

Page 11: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

11

Continuous Availability Dynamic Capacity Management

Continuous availability: from the shift paradigm to unmanned operation

o Consumption

trend analysis

o Resource utilization

rate optimization o What if scenarios

o Predict future

requirements and

trends

Page 12: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

12 Continuous availability: from the shift paradigm to unmanned operation

Page 13: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

13

Test Plant Architecture

Continuous availability: from the shift paradigm to unmanned operation

Message Layer

Database Layer

User A User B

Message Router

Message Processor

Message Router

Kafka Broker

Aerospike Database

write

store store

write

write read

put

get

get

put

Application Layer

Page 14: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

14

Results Test Architecture

Specific tests to verify the relevant

domain functions.

Common simulation layer to

reproduce real operational

environment.

executed on

Continuous availability: from the shift paradigm to unmanned operation

Page 15: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

15

Results Simulation – continous delivery (1)

Normal traffic condition (500 msg/s), timeout = 10.000 ms

Kafka cluster rolling update

0 messages lost

0 timeout expired

Continuous availability: from the shift paradigm to unmanned operation

SIMUL.APP.01 : message latency (1 sec average)

Page 16: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

16

Results

Continuous availability: from the shift paradigm to unmanned operation 07 November 2017 – CMG Impact 2017

SIMUL.APP.02 : message latency (1 sec average)

Simulation – continous delivery (2)

Heavy traffic condition (2000 msg/s), timeout = 10.000 ms

Kafka cluster rolling update

0 messages lost

some timeout expired

Page 17: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

17

Results Simulation – proactive monitoring

Continuous availability: from the shift paradigm to unmanned operation

Normal traffic condition (500 msg/s)

average E2E processing time = 45 ms

High vCPU load added to Message Processor nodes.

T0-T1 below threshold

T2-T3 exceed threshold

Page 18: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

18

Conclusions and perspective

Phased

Approach Bi-modal

Data Center

Tool

Continuous availability: from the shift paradigm to unmanned operation

Page 19: Continuous availability: from the shift paradigm to ... · 1/17/2018  · Simulation – continous delivery (1) Normal traffic condition (500 msg/s), timeout = 10.000 ms Kafka cluster

Continuous availability: from the shift paradigm

to unmanned operation.

Pietro Tiberi ([email protected])

Thanks for your attention