Resilience from Theory to Practice

Resilience From Theory to Practice, by Efim Dimenstein (Chief Architect) and Ori Cohen (Lead Resilience Engineer), Jan 2016

Transcript of Resilience from Theory to Practice

Page 1: Resilience from Theory to Practice

Resilience
From Theory to Practice

by:
Efim Dimenstein - Chief Architect
Ori Cohen - Lead Resilience Engineer

Jan 2016

Page 2: Resilience from Theory to Practice

What is Liveperson

Liveperson transforms the connection between brands and consumers.

Page 3: Resilience from Theory to Practice

Our Scale

● 1.5M concurrent visits
● 3BN visits/month
● 200BN API calls/month
● 2 PB data

Page 4: Resilience from Theory to Practice

Our Production

● 99.97% Uptime
● 6 Data Centers
● 1000+ physical servers
● 6000+ VMs
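To put the 99.97% uptime figure in perspective, the arithmetic below converts an availability percentage into a downtime budget. It is only a worked calculation; the function name `downtime_minutes` and the 30-day and 365-day periods are assumptions chosen for illustration, not figures from the talk. At 99.97%, the budget comes to roughly 13 minutes per 30-day month, or about 2.6 hours per year.

```python
# Downtime permitted by an availability target (illustrative arithmetic only).
# The function name and the chosen periods are assumptions, not from the talk.


def downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime allowed over period_days."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * 24 * 60


if __name__ == "__main__":
    per_month = downtime_minutes(99.97, 30)
    per_year_hours = downtime_minutes(99.97, 365) / 60
    print(f"{per_month:.1f} minutes per 30-day month")
    print(f"{per_year_hours:.1f} hours per year")
```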

Page 5: Resilience from Theory to Practice

Our Engineering

● Fast release cycle
● ~250 people in R&D
● Constant Innovation
● Multiple Technologies

Page 6: Resilience from Theory to Practice

33 interruptions per month on average :)

Page 7: Resilience from Theory to Practice

The Past

Page 8: Resilience from Theory to Practice

The Past

Page 9: Resilience from Theory to Practice

The Present

Page 10: Resilience from Theory to Practice

LiveEngage Platform

Composable

● ~100 services
● We keep splitting
● Much easier to scale

Page 11: Resilience from Theory to Practice

LiveEngage Platform

● Services are grouped into types
● The platform is divided into layers

Page 12: Resilience from Theory to Practice

LiveEngage Platform

Page 13: Resilience from Theory to Practice

Everything That Can Go Wrong Will Go Wrong

Page 14: Resilience from Theory to Practice
Page 15: Resilience from Theory to Practice

Resilience Pyramid

● DC
● HW
● SERVICE
● COMPONENT
● CODE

Page 16: Resilience from Theory to Practice

DC Resilience - Global

Page 17: Resilience from Theory to Practice

DC Resilience

● Primary
● Secondary

Page 18: Resilience from Theory to Practice

Service

Service X: Node 1, Node 2, Node 3, ..., Node N

Page 19: Resilience from Theory to Practice

Service

Service X: Node 1, Node 2, Node 3, ..., Node N

HA Functionality
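One way to read the diagram is that a caller of Service X should survive the loss of any single node. The sketch below shows a minimal client-side failover loop under that assumption. The slide does not say how HA is actually implemented, so `call_with_failover`, `AllNodesFailed`, and the node names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllNodesFailed(Exception):
    """Raised when every node of the service rejected the request."""


def call_with_failover(nodes: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each node in order and return the first successful response."""
    errors: List[Tuple[str, Exception]] = []
    for node in nodes:
        try:
            return request(node)
        except Exception as exc:  # any failure moves us on to the next node
            errors.append((node, exc))
    raise AllNodesFailed(errors)


def demo_request(node: str) -> str:
    """Hypothetical request in which node-1 is down."""
    if node == "node-1":
        raise ConnectionError("node-1 unreachable")
    return f"ok from {node}"


if __name__ == "__main__":
    print(call_with_failover(["node-1", "node-2", "node-3"], demo_request))
```

In practice this logic often lives in a load balancer or service-discovery layer rather than in every caller; the loop above only illustrates the behaviour.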

Page 20: Resilience from Theory to Practice

Service Grouping

● Administration & Configuration
● Real Time
● Near Real Time
● Offline

Page 21: Resilience from Theory to Practice
Page 22: Resilience from Theory to Practice

Components

Solve once - reuse

● The Glue
● Level of abstraction
● Isolates common problems

Page 23: Resilience from Theory to Practice

Components - Guidelines

● Retries
● Fallback
● Cache
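The three guidelines can be combined in one shared component, in the "solve once - reuse" spirit. The following is a minimal sketch, not the component used at Liveperson: the class name `ResilientCall`, its parameters, and the exponential backoff policy are all assumptions made for illustration. It retries a failing call, caches the last good result, and falls back to that cached value (or a caller-supplied default) when all retries fail.

```python
import time
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")


class ResilientCall:
    """Hypothetical helper combining retries, a last-good-value cache, and fallback."""

    def __init__(self, retries: int = 3, base_delay: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.retries = retries          # total number of attempts
        self.base_delay = base_delay    # first backoff delay, in seconds
        self.sleep = sleep              # injectable so tests need not wait
        self.cache: Dict[str, object] = {}

    def call(self, key: str, fn: Callable[[], T],
             default: Optional[T] = None) -> Optional[T]:
        for attempt in range(self.retries):
            try:
                result = fn()
                self.cache[key] = result  # Cache: remember the last good value
                return result
            except Exception:
                # Retries: back off exponentially between attempts
                if attempt < self.retries - 1:
                    self.sleep(self.base_delay * 2 ** attempt)
        # Fallback: serve the cached value if present, else the default
        return self.cache.get(key, default)  # type: ignore[return-value]
```

A caller would wrap each remote call, for example `rc.call("account-config", fetch_config, default={})`, where `fetch_config` is a hypothetical function. Serving a stale cached value is only appropriate for data that tolerates staleness, so that choice remains with each service.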

Page 24: Resilience from Theory to Practice
Page 25: Resilience from Theory to Practice

@ ground level

Page 26: Resilience from Theory to Practice

trust company

Page 27: Resilience from Theory to Practice

trust engineers

Page 28: Resilience from Theory to Practice

and still evaluate

Page 29: Resilience from Theory to Practice

knowledge is power

Page 30: Resilience from Theory to Practice

tooling

Page 31: Resilience from Theory to Practice

testing

Page 32: Resilience from Theory to Practice

deployment

Page 33: Resilience from Theory to Practice

metrics

Page 34: Resilience from Theory to Practice

logs

Page 35: Resilience from Theory to Practice

E2E

Page 36: Resilience from Theory to Practice

ALERTING

Page 37: Resilience from Theory to Practice

untested == unreliable

Page 38: Resilience from Theory to Practice

but… ?

Page 39: Resilience from Theory to Practice

cost effective

Page 40: Resilience from Theory to Practice

visibility

Page 41: Resilience from Theory to Practice

incident injection testing
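As a rough illustration of the idea, the sketch below wraps a dependency call so that failures can be injected deliberately, then checks that the caller degrades gracefully. It is a toy, in-process example with hypothetical names (`FaultInjector`, `InjectedFault`, `fetch_greeting`). The talk does not describe its tooling, and real incident injection typically targets running infrastructure rather than a function wrapper.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an injection test."""


class FaultInjector:
    """Hypothetical injector; it does nothing unless explicitly enabled."""

    def __init__(self, failure_rate: float, enabled: bool = False,
                 rng: Optional[random.Random] = None) -> None:
        self.failure_rate = failure_rate  # probability in [0.0, 1.0]
        self.enabled = enabled
        self.rng = rng or random.Random()

    def wrap(self, fn: Callable[..., T]) -> Callable[..., T]:
        def wrapper(*args, **kwargs):
            # Fail the call with the configured probability when enabled
            if self.enabled and self.rng.random() < self.failure_rate:
                raise InjectedFault(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper


def fetch_greeting(user: str) -> str:
    """Stand-in for a call to a remote dependency."""
    return f"hello {user}"


def greeting_or_default(fetch: Callable[[str], str], user: str) -> str:
    """Caller under test: it must still answer when the dependency fails."""
    try:
        return fetch(user)
    except Exception:
        return "hello guest"


if __name__ == "__main__":
    injector = FaultInjector(failure_rate=1.0, enabled=True)
    print(greeting_or_default(injector.wrap(fetch_greeting), "ori"))
```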

Page 42: Resilience from Theory to Practice

process

Page 43: Resilience from Theory to Practice

opt-in

Page 44: Resilience from Theory to Practice

resilience @ scale

● multi layered solution
● requires monitoring and testing
● ingrained in the company culture
● keep things simple
● trust and empower your engineers
● break stuff

Page 45: Resilience from Theory to Practice

Thank you!

Page 46: Resilience from Theory to Practice

Q&A