Design (Cloud systems) for Failures

Design for Failures (and for Availability)

IV Jornadas de Cloud Computing & Big Data

Rodolfo Kohn, Cloud Architect

Intel

Original Agenda

Remembering “Distributed System Design” and availability

Introduction to Design for Failures

• Failure modes

• Redundancy (process and data)

• Failure detection

• Failure recovery

• Cascade failures and recovery

Redundancy and high availability in AWS

Eventual consistency problems

Performance and scalability problems

Operations monitoring

• Techniques to avoid false positives

Logs and counters

Design software for failures

Testing availability

Measuring availability

Education

7/10/2016

Agenda

Remembering “Distributed System Design” and availability

Introduction to Design for Failures

• Redundancy – Process

– Data: Replication (multi-master, master-slave)

– Flat groups and hierarchical groups

• Synchronization Model

• Stateful vs Stateless

• Eventual consistency

• CAP Theorem

• Failure detection

• Failure recovery

• Cascade failures and recovery

(Cloud or Distributed) Applications are Complex

[Diagram: DNS (DNS Server, .com, Root) in front of Datacenter-1 and Datacenter-2, each with a GLB, Auth, Service and Cache components, plus dependencies on DNS, Disk, Network, SMTP, CDN, NoSQL, SQL, Monitoring, Logs and Configuration Management]

Multiple Opportunities for Unexpected Failures

Brittle Systems shall not Survive

Load bursts & Response time deterioration

Micro-services dependencies

In distributed systems, and cloud systems, there are complex dependencies between systems such that failure of one component can bring down the whole system

What is Availability?

Distributed Systems: Principles and Paradigms (2nd Edition), Andrew Tanenbaum, Maarten Van Steen

“Availability is defined as the property that a system is ready to be used immediately. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is one that will most likely be working at a given instant in time.”

3/4/5 9’s of Availability: see Wikipedia :)
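As a rough illustration of what the "nines" mean, the sketch below converts an availability level into a yearly downtime budget. The function name `allowed_downtime_minutes` and the 365-day year are assumptions made for this example, not something taken from the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability with `nines` nines.

    Three nines means 99.9% availability, so the system may be
    unavailable for a fraction 10**-3 of the year.
    """
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes/year")
```

Under these assumptions, three, four and five nines correspond to about 8.8 hours, 53 minutes and 5 minutes of downtime per year.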

The system is always running correctly

When users access it, it is there

Systems fail …

http://techcrunch.com/2012/10/22/aws-ec2-issues-in-north-virginia-affect-heroku-reddit-and-others-heroku-still-down/

What started as a small issue affecting some instances of Amazon’s Elastic Cloud Compute (EC2) in North Virginia became a full-blown outage of AWS in North Virginia. Major services, such as Reddit, Foursquare, Minecraft and Heroku, are down. GitHub, imgur, Pocket, HipChat, Coursera and others are affected …

And DOWNTIME COMES …

Consequences of Unavailability

http://blog.smartbear.com/news/motorolas-site-collapses-under-cyber-monday-traffic/

Talk about failures

We don’t avoid failures, we live with them

Design for Failures is about focusing on the Error Path

PAINFUL AND TIME CONSUMING

Failures affecting Availability

Different types of failures

• Infrastructure failures

• Software failures

• Operations failures

• Deployment failures

System updates or upgrades may affect availability if they require downtime

Bad response time affects availability

• Unacceptable response time = system unavailable

• Bad scalability eventually affects response time

– Vulnerability to load peaks

Manual Path to Production affects availability

Neglected business/process situations affect availability

Valid for all businesses

As core business moves to the Internet, downtime means money

More possibilities of failure:

• (Cloud) systems are becoming increasingly complex

• Software operates under stringent conditions

• There is a demand for excellent user experience

• In the cloud, applications run on commodity hardware

It’s about the whole big machinery

Product/Service Requirements

Development

Deployment and Operation

Path to production

PDM and CXD must think about alternative paths on error conditions

Architects design for Availability(Software and Infrastructure)

Agile teams

Distributed Systems Skills

Availability, Scalability, Performance mindset

Fast, automated, error free

DevOps, Monitoring, Operations Automation

From Architecture to Development

Architecture: redundancy model and management, dependency management, state model, synchronization model, failure detection, recovery, scalability model, administration/configuration management

Design: logging design, monitoring design, dependency handling, state management design (stateful and stateless), consistency, fallback actions on failures per operation…

Development: consistency handling, retries, error analysis, logging, error path (if ... else …), …
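The development items above (retries, error path, fallback actions) can be sketched as a small helper. This is a minimal illustration, not a prescribed implementation: the name `call_with_retries`, the choice of `ConnectionError` as the retryable error, and the default delays are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on ConnectionError with exponential backoff.

    If every attempt fails, take the error path and return `fallback()`
    instead of propagating the failure to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Wait longer after each failed attempt, except after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback()
```

A caller would pass the remote call as `operation` and a degraded answer (a cached value, a default) as `fallback`, so that a failing dependency does not become a failure of the whole request.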

Topics

Redundancy (process and data)

Flat (P2P) Groups vs Hierarchical groups

State: stateless vs stateful

Replication

Synchronization: asynchronous vs. synchronous

Eventual Consistency

CAP

Failure detection

Recovery actions

Cascade Failures

Client recovery in client/server

Redundancy

It is about provisioning in excess, replicating hardware or software components or data

It allows masking failures as a mechanism of fault tolerance

Additional hardware equipment or software processes are provided

When a component fails another one in the group takes over its work

Data replication, associated with a component replication, keeps data safe in face of a component failure

Redundancy and groups

Process redundancy implies the creation of groups of replicated processes

The group is seen by other processes as a single process

• Replication is abstracted to be seen as one entity

• The same happens with hardware

Two types of groups

[Diagram: a flat (peer-to-peer) group next to a hierarchical group made of a coordinator and workers]

Design Considerations

Group creation and destruction

• Group bootstrapping

Group membership

• Processes can join and leave a group

Decision making

• Task distribution, synchronization, consistency, etc.

Different challenges

Hierarchical group

• The coordinator, primary, or master knows and controls all workers

• Simpler control and management

• If the coordinator fails, the group crashes

Flat group or peer-to-peer

• There is a need for agreement or consensus algorithms

– For coordinator election

– For consistency

– For synchronization

– For faulty process detection

– For membership change detection

• Data distribution

• If any member crashes, the group continues working; it just shrinks
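To make the coordinator election item concrete, here is a deliberately simplified sketch. It only computes the outcome that a bully-style election converges to (the live member with the highest identifier wins) and leaves out the message exchange, timeouts and agreement that a real flat group needs. The function name and the integer identifiers are assumptions for illustration.

```python
from typing import Set


def elect_coordinator(members: Set[int], alive: Set[int]) -> int:
    """Return the highest-numbered member that is still alive.

    `members` is the group membership and `alive` is the set of
    processes currently believed to be responsive.
    """
    candidates = members & alive
    if not candidates:
        raise RuntimeError("no live member to elect")
    return max(candidates)
```

If member 3 was the coordinator and crashes, `elect_coordinator({1, 2, 3}, {1, 2})` picks member 2, and the group keeps working with one member fewer.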

Hierarchical group: Pool of servers controlled by a Load Balancer

Load balancer detects unresponsive server and removes it

A new server is added to the pool, manually or automatically.

All other processes/applications/systems sending requests to this group see it as just one process

The LB distributes work and controls workers

Faulty process and server detection

The load balancer sends health checks to the servers in the pool to detect failing servers

• It can monitor at different stack layers

– In the case of AWS ELB: TCP, SSL, HTTP, HTTPS

– F5 can also test at different stack layers

• Failing servers can be automatically de-registered

• New healthy servers can be added to the pool
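The health-check behaviour described above can be sketched as a small pool manager. This is not the AWS ELB or F5 API: `ServerPool`, the `probe` callback and the `unhealthy_threshold` parameter are hypothetical names. Counting consecutive failures before de-registering is one common way to avoid removing a server because of a single lost probe.

```python
from typing import Callable, Dict, List


class ServerPool:
    """Toy load-balancer pool that de-registers servers failing health checks."""

    def __init__(self, probe: Callable[[str], bool], unhealthy_threshold: int = 2):
        self.probe = probe  # returns True when the server answers the check
        self.unhealthy_threshold = unhealthy_threshold
        self.failures: Dict[str, int] = {}  # server -> consecutive failed checks

    def register(self, server: str) -> None:
        self.failures[server] = 0

    def run_health_checks(self) -> List[str]:
        """Probe every server once and return the servers removed in this round."""
        removed = []
        for server in list(self.failures):
            if self.probe(server):
                self.failures[server] = 0
            else:
                self.failures[server] += 1
                if self.failures[server] >= self.unhealthy_threshold:
                    del self.failures[server]
                    removed.append(server)
        return removed

    def servers_in_pool(self) -> List[str]:
        return sorted(self.failures)
```

In a real deployment the probe would be a TCP connect or an HTTP request against a health endpoint, and `run_health_checks` would run on a timer.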

Flat group: Cassandra

A cluster of Cassandra nodes

• Information is transmitted with a gossip protocol

• If a node detects a new node or a faulty node, it transmits that information through the gossip protocol

• Nodes exchange heartbeats and use Phi Accrual Failure Detectors to detect faulty nodes
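The idea behind an accrual failure detector can be sketched as follows. Instead of a yes/no answer it outputs a suspicion level, phi, that grows the longer a heartbeat is overdue relative to the observed heartbeat intervals. The sketch assumes exponentially distributed intervals, which is a simplification chosen for brevity and is not presented as Cassandra's implementation; the class and method names are hypothetical.

```python
import math
from collections import deque
from typing import Deque, Optional


class PhiAccrualDetector:
    """Simplified accrual failure detector over a window of heartbeat intervals."""

    def __init__(self, window: int = 100):
        self.intervals: Deque[float] = deque(maxlen=window)
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (in seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: -log10 of the probability of a gap this long."""
        if not self.intervals or self.last_heartbeat is None:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Assumption: P(gap > elapsed) = exp(-elapsed / mean),
        # so phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))
```

A node would be treated as faulty once `phi` exceeds a configured threshold, so the threshold trades detection speed against false positives.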

Flat group: OSPF

I would say OSPF routers form a flat group

• Routers use a link-state routing protocol to transmit connectivity information

• Routers can detect neighbor failures through the Hello protocol and transmit the data as link states

Data Redundancy

Data stores may be replicated for high availability

• Database replication

• Disk replication

Data redundancy is also found at other levels

• RAID disks

• In communications: error-correcting codes, such as Hamming codes, allow recovery from transmission errors

We focus on higher level failures that affect operations: a database, SAN, whole platform, datacenter

Data Redundancy

SQL and NoSQL Databases allow different replication models

• Master-Master

– All replicas can be read and written

• Master-Slave

– All replicas can be read; only the master can be written

– In case of master failure, a slave must take over
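One way to picture the takeover step is the sketch below. It promotes the slave that has replicated the most and reports how many writes would be lost if replication was lagging when the master failed. The function name and the integer "replication position" are assumptions for illustration and do not correspond to any specific database's failover mechanism.

```python
from typing import Dict, Tuple


def choose_new_master(
    slave_positions: Dict[str, int], last_master_position: int
) -> Tuple[str, int]:
    """Pick the most up-to-date slave and count the writes it never received.

    `slave_positions` maps each slave name to the position of the last
    write it applied; `last_master_position` is the last write the failed
    master acknowledged.
    """
    if not slave_positions:
        raise RuntimeError("no slave available to take over")
    new_master = max(slave_positions, key=slave_positions.get)
    lost_writes = last_master_position - slave_positions[new_master]
    return new_master, lost_writes
```

For example, with slaves at positions 90 and 97 and a master that had reached 100, the second slave is promoted and three writes are missing from the new master.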

Database Replication (1)

Replication: Data is replicated in all instances

Partitioning: Data is partitioned across different instances

• This is not replication

[Diagram: replicated instances each holding the data, and partitioned instances serving clients from America and clients from Europe]


Replication Master-Slave: Write in one instance, Read from all instances

[Diagram: three data instances with WRITE, READ and REPLICATION arrows]


Replication Master-Master or Multi-master or peer-to-peer: Write in all instances, Read from all instances

Possibility of conflicts in asynchronous mode:

• Same row updated in different replicas

• Two inserts in different replicas

• Delete and insert/update
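A common way to detect that two replicas changed the same row concurrently is to attach a version vector to each copy. The sketch below shows only the comparison step; it is an illustration under assumed names (`compare_versions`, replica identifiers as strings) and does not describe how any particular database resolves the conflict once it is detected.

```python
from typing import Dict


def compare_versions(a: Dict[str, int], b: Dict[str, int]) -> str:
    """Compare two version vectors (replica id -> number of updates seen).

    Returns "a_newer", "b_newer", "equal", or "conflict" when each side
    has seen an update the other side has not.
    """
    replicas = set(a) | set(b)
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in replicas)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "conflict"
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"
```

When the result is "conflict", the application still has to decide what to do: merge the values, keep the last writer, or raise an alarm for a repair script.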

[Diagram: three data instances, each accepting WRITE and READ, with REPLICATION between them]

Synchronous vs. Asynchronous Replication

Synchronous replication assures a write will occur in all instances at the same time

• Either multi-master or master-slave

In asynchronous replication, a write is sent to one node and then replicated to the other nodes

• Either multi-master or master-slave

• There is a lag in write replication

• At a point in time data might not be the same in all nodes (eventual consistency)

Synchronous Replication

Synchronous replication assures a write will occur in all instances at the same time

• All servers (both masters and slaves) have up-to-date data (A and C in ACID)

• Provides ACID capabilities

• High availability

• Simpler for developers

• Implementation through two-phase commit or distributed locks, which may make the system slow

• No write scalability

• Performance might be affected

• Possibility of deadlocks
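The two-phase commit mentioned above can be sketched in a few lines. This is a minimal in-process illustration with hypothetical class and function names; it leaves out everything that makes the protocol hard in practice (timeouts, durable logs, coordinator failure).

```python
from typing import List


class Participant:
    """A replica that votes in phase 1 and applies the decision in phase 2."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the write can be applied locally.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Commit on all participants only if every one of them voted yes."""
    votes = [p.prepare() for p in participants]  # phase 1: collect all votes
    if all(votes):
        for p in participants:  # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:  # phase 2: abort everywhere
        p.abort()
    return False
```

The sketch also hints at the cost: every write waits for a round trip to all replicas, and a single replica that cannot commit blocks the write for everyone.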

Galera cluster for MySQL

http://galeracluster.com/

Galera Replication is a synchronous multi-master replication plug-in for InnoDB

Asynchronous Replication

Write occurs in one node and then it replicates to other nodes

• Less complex (no two-phase commit or distributed locks)

• High availability across datacenters

• Better write scalability

• Eventual consistency

• Write conflicts among masters

• Loss of synchronization is a problem to solve

• More difficult for developers (eventual consistency, write conflicts)

This type of replication is the basic one offered by MySQL, PostgreSQL and MariaDB (and SQL Server?)

Multi-master with Cassandra

Source: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Asynchronous replication

Tunable consistency
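Tunable consistency is usually explained with a quorum rule: with N replicas, a read that waits for R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The helper below only encodes that arithmetic; the function name is an assumption for illustration.

```python
def read_sees_latest_write(replicas: int, write_acks: int, read_acks: int) -> bool:
    """True when every read quorum overlaps every write quorum (R + W > N)."""
    if not (1 <= write_acks <= replicas and 1 <= read_acks <= replicas):
        raise ValueError("quorum sizes must be between 1 and the replica count")
    return read_acks + write_acks > replicas
```

With three replicas, writing and reading at a quorum of two overlaps (2 + 2 > 3), while writing and reading at one replica does not (1 + 1 <= 3), which is the eventually consistent configuration.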

P2P Database Solutions

• Dynamo

– http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

• Cassandra

– https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf

• Netflix’s Dynomite (Redis and Memcached)

– http://techblog.netflix.com/2014/11/introducing-dynomite.html

– https://github.com/Netflix/dynomite

Consistency and Design for Failures

When working with asynchronous replication you need to deal with eventual consistency

• With asynchronous processes in general, a process may try to read something that should be there and find it is not there yet

Replication could take milliseconds or many seconds

Under heavy load it gets worse

Write conflicts are another issue you need to deal with

• Need to have alarm and repair scripts if an automated solution is not possible

Asynchronous, Fire and forget, Future, Let it be …

Eventual consistency

[Diagram: applications behind a load balancer and two data instances, with numbered steps 1 to 4; step 1 is a WRITE, and replication to the second instance happens after some time]

• Eventually both DB instances have the same data

Eventual consistency problem

[Diagram: applications behind a load balancer and two data instances, with numbered steps 1 to 7; step 1 is a WRITE and step 4 is a READ, while replication happens after some time]

• Read-after-write problem

• Specific solution for each case

• You cannot assume replication will have completed within a given time
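The read-after-write problem can be reproduced with a toy store in which replication is an explicit step. The sketch also shows one possible mitigation, reading with a minimum version and falling back to the primary when the replica is behind. It is only one of several case-specific solutions, and every name in it (`ReplicatedStore`, `min_version`) is hypothetical.

```python
from typing import Any, Dict, Tuple


class ReplicatedStore:
    """Primary plus one asynchronously updated replica, both held in memory."""

    def __init__(self) -> None:
        self.primary: Dict[str, Tuple[int, Any]] = {}  # key -> (version, value)
        self.replica: Dict[str, Tuple[int, Any]] = {}
        self.version = 0

    def write(self, key: str, value: Any) -> int:
        """Write to the primary only and return the version of the write."""
        self.version += 1
        self.primary[key] = (self.version, value)
        return self.version

    def replicate(self) -> None:
        """Simulate replication catching up after some time."""
        self.replica = dict(self.primary)

    def read(self, key: str, min_version: int = 0) -> Any:
        """Read from the replica unless it is older than `min_version`."""
        version, value = self.replica.get(key, (0, None))
        if version >= min_version:
            return value
        # The replica has not received the caller's write yet: use the primary.
        return self.primary.get(key, (0, None))[1]
```

A plain `read` issued right after `write` returns `None` until `replicate` runs, which is the problem in the diagram; passing the version returned by `write` as `min_version` makes the read see the caller's own write.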

From Architecture to Development

Designers and developers must understand the consequences of each architecture

Typical questions/comments that predict issues in distributed systems (100% certainty)

• By comparing operations’ time I can determine order

• How long does it take to replicate data?

• We tested it and it is replicating very fast, no problems

• It’s fast. It’s just fire and forget (asynchronous): check if there is a subsequent read associated
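The first comment in the list, ordering operations by comparing timestamps, fails because clocks on different machines are not synchronized. A standard alternative, not mentioned in the slides and shown here only as an illustration, is a Lamport logical clock, which orders events by causality instead of wall-clock time.

```python
class LamportClock:
    """Logical clock: a counter that advances on local events and on messages."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Merge the sender's timestamp so the receive is ordered after the send."""
        self.time = max(self.time, message_time) + 1
        return self.time
```

If process A sends a message stamped with `a.send()` and process B calls `b.receive(...)` with that stamp, B's resulting time is always greater than the stamp, whatever the two machines' wall clocks say.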

Asynchronous Replication in Active-Active

Network partitioning

[Diagram: DNS (DNS Server, .com, Root) in front of Datacenter-1 and Datacenter-2, each with a GLB, Auth, Service, Cache and Disk components]

Why the hassle of P2P/flat?

Best solution for high availability

Self-managed system

Best horizontal and dynamic scalability

Usually, writes are still possible after a network partition

Brewer’s Conjecture and CAP Theorem

• Consistency, Availability, and Partition Tolerance are all desired features of database systems.

• However it is not possible to have all of them: pick only two.

[Diagram: triangle with vertices A (Availability), C (Consistency) and P (Partition Tolerance)]

Availability: Each client can always read and write

Consistency: All clients always have the same view of the data

Partition Tolerance: System works well despite physical network partitions

CA: RDBMS

AP: Dynamo, Cassandra

CP: MongoDB, BigTable

MongoDB

Source: https://docs.mongodb.com/manual/core/replica-set-elections/
