Resilient Cloud Applications

Mark Simms (@mabsimms)
Principal Program Manager
Windows Azure Customer Advisory Team

Session Objectives

Designing resilient large-scale services requires careful design and architecture choices

This session will explore key patterns & practices for highly available cloud services, illustrated with customer examples

Interactivity rocks -> please ask questions throughout!

Setting the Stage

Setting the stage

Scalability

Availability

Insight

Setting the stage

Maximize service availability for consumers
• Ensure customers (and client devices) can access and use the service

Minimize impact of failure on consumers
• Degrade gracefully, isolate faults, fall back to alternate delivery paths

Maximize performance and capacity
• Services that are “live” but cannot handle desired/required demand are not available

Musings on application design

• Traditional web service design (N-tier)
• Make “everything stateless”
• Separate logic from data (state)
• Leverage specialized external state services: cache, load balancer, relational database, document database, key/value store, etc.

[Diagram: Load Balancer, Web Servers, App Servers, Database, Distributed Cache, Doc Store, ...]

Musings on application design

• No service is an island
• Dependencies on other internal and external services
• Trading time-to-market and agility for control

[Diagram: Load Balancer, Web Servers, App Servers, Database, Distributed Cache, Doc Store, ..., External Services (SendGrid, Twitter, Facebook, etc.)]

What’s in a workload?

#1: without the relational database the application cannot fulfill any workloads

#2: the relational database is an external service, subject to partial availability

Designing for Failure

Decompose by Workload

Applications are composed of one or more workloads
• Products like SharePoint and Windows Server are designed with this principle in mind

Each workload has different profiles, requirements and boundaries
• Management, Availability, Operational, Cost, Health, Security, Capacity, etc.

Decomposition allows for workload-specific optimization
• Technology selections, scalability and availability approaches, etc.

What are the “9”s

Availability %            Downtime per year   Downtime per month*   Downtime per week
90% ("one nine")          36.5 days           72 hours              16.8 hours
99% ("two nines")         3.65 days           7.20 hours            1.68 hours
99.9% ("three nines")     8.76 hours          43.2 minutes          10.1 minutes
99.99% ("four nines")     52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")    5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")    31.5 seconds        2.59 seconds          0.605 seconds


Study Windows Azure Platform SLAs:
• Compute External Connectivity: 99.95% (2 or more instances)
• Compute Instance Availability: 99.9% (2 or more instances)
• Storage Availability: 99.9%
• SQL Azure Availability: 99.9%

The Truth About 9s

SLA = *
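One common way to read a composite SLA is that, when a request needs several services in series, their availabilities multiply. The sketch below applies that rule to figures from the SLA list above. Which services actually sit on a given request path is an assumption made only for illustration.

```csharp
using System;
using System.Linq;

// Availability figures copied from the SLA list above, as fractions.
// Treating these three as serial dependencies of one request is an assumption.
var dependencies = new (string Name, double Availability)[]
{
    ("Compute external connectivity", 0.9995),
    ("Storage", 0.999),
    ("SQL Azure", 0.999),
};

// A request succeeds only if every serial dependency is up,
// so the individual availabilities are multiplied together.
double composite = dependencies.Aggregate(1.0, (acc, d) => acc * d.Availability);

// Expected downtime over a 30-day month at that composite availability.
TimeSpan month = TimeSpan.FromDays(30);
TimeSpan downtime = TimeSpan.FromTicks((long)(month.Ticks * (1.0 - composite)));

Console.WriteLine($"Composite availability: {composite:P3}");
Console.WriteLine($"Downtime per 30-day month: {downtime.TotalMinutes:F1} minutes");
```

With these inputs the composite works out to roughly 99.75%, or about 108 minutes per 30-day month, which is lower than any single component's SLA.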

Define Your SLAs

Design for Failure

Given enough scale, time and pressure, all components or services will fail

Your application will experience 1..N failures. How will your application behave?
• Gracefully handle failure modes, continue to deliver value
• Not so gracefully …

Fault types:
• Transient: temporary service interruptions, self-healing
• Enduring: require intervention

Failure Scope

Individual nodes may fail
• Connectivity issues (transient failures), hardware failures

Entire services may fail
• Service dependencies (internal and external), configuration and code issues

Regions may become unavailable
• Connectivity issues, acts of nature

[Diagram: failure scopes: Region, Service, Node]

Handling Transient and Enduring Failures

• Use fault-handling frameworks that recognize transient errors
• Make it part of the background “noise”
• Appropriate retry and backoff policies
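A minimal sketch of what such a fault-handling wrapper might look like: it retries only errors classified as transient and backs off exponentially between attempts. The helper names `IsTransient` and `RetryAsync` are hypothetical, and treating only `TimeoutException` as transient is an assumption. Real frameworks ship their own detection strategies.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical classification rule: only timeouts count as transient here.
bool IsTransient(Exception ex) => ex is TimeoutException;

// Runs the operation, retrying transient failures up to maxRetries times.
// Enduring failures (anything not transient) propagate immediately.
async Task<T> RetryAsync<T>(Func<Task<T>> operation, int maxRetries, TimeSpan baseDelay)
{
    for (int attempt = 0; ; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (Exception ex) when (IsTransient(ex) && attempt < maxRetries)
        {
            // Exponential backoff: baseDelay, 2x, 4x, ...
            var delay = TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds * Math.Pow(2, attempt));
            Console.WriteLine($"Transient failure ({ex.Message}); retry {attempt + 1} in {delay.TotalMilliseconds} ms");
            await Task.Delay(delay);
        }
    }
}

// Usage: a simulated dependency that times out twice, then succeeds.
int calls = 0;
int result = await RetryAsync(() =>
{
    calls++;
    if (calls < 3) throw new TimeoutException("simulated timeout");
    return Task.FromResult(42);
}, maxRetries: 3, baseDelay: TimeSpan.FromMilliseconds(50));

Console.WriteLine($"Succeeded after {calls} calls with result {result}");
```

Keeping the classification in one place is what lets transient faults become background noise: callers see either a result or an enduring failure.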

Handling Transient and Enduring Failures

[Chart: Web Request Response Latency; series: Avg Latency, Response latency]

• At some point, your request is blocking the line
• Fail gracefully, and get out of the queue!
• Anti-patterns:
  • Too much trust in downstream services and client proxies
  • Not bounding non-deterministic calls
  • Blocking synchronous operations
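One way to bound a non-deterministic call is to give it a deadline and return a fallback when the deadline passes. The sketch below assumes the downstream call honors a `CancellationToken`. `WithTimeoutAsync` and `SlowServiceAsync` are hypothetical names, and the 200 ms budget is illustrative.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Runs the call with a deadline; on timeout, returns the fallback value
// instead of leaving the caller waiting in line.
async Task<T> WithTimeoutAsync<T>(Func<CancellationToken, Task<T>> call, TimeSpan timeout, T fallback)
{
    using var cts = new CancellationTokenSource(timeout);
    try
    {
        return await call(cts.Token);
    }
    catch (OperationCanceledException)
    {
        // The deadline elapsed before the dependency answered.
        return fallback;
    }
}

// Hypothetical slow dependency that takes 5 seconds unless cancelled.
async Task<string> SlowServiceAsync(CancellationToken token)
{
    await Task.Delay(TimeSpan.FromSeconds(5), token);
    return "fresh data";
}

string value = await WithTimeoutAsync(SlowServiceAsync, TimeSpan.FromMilliseconds(200), "cached data");
Console.WriteLine(value); // prints "cached data"
```

The call is asynchronous end to end, so no thread is blocked while waiting, and the request leaves the queue after at most the configured budget.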

Sample Retry Policies

Platform       Context                                   Sample target e2e latency max   “Fast First”   Retry count   Delay    Backoff
SQL Database   Synchronous (e.g. render web page)        200 ms                          Yes            3             50 ms    Linear
SQL Database   Asynchronous (e.g. process queue item)    60 seconds                      No             4             5 s      Exponential
Azure Cache    Synchronous (e.g. render web page)        100 ms                          Yes            3             10 ms    Linear
Azure Cache    Asynchronous (e.g. process queue item)    500 ms                          Yes            3             100 ms   Exponential
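To make the table concrete, the sketch below turns two of its rows into data and prints the delay schedule each would produce. The table does not define how the delays grow, so the formulas here are assumptions: linear is taken as delay x 1, x 2, ..., exponential as delay x 1, x 2, x 4, ..., and "fast first" as one immediate retry. Dropping retries once the cumulative delay would exceed the latency target is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Yields the wait before each retry under the given policy.
IEnumerable<TimeSpan> Delays(RetryPolicy p)
{
    TimeSpan total = TimeSpan.Zero;
    int step = 0; // counts the retries that actually wait
    for (int i = 0; i < p.RetryCount; i++)
    {
        TimeSpan delay;
        if (p.FastFirst && i == 0)
        {
            delay = TimeSpan.Zero; // "fast first": retry once immediately
        }
        else
        {
            step++;
            double factor = p.Backoff == BackoffKind.Linear ? step : Math.Pow(2, step - 1);
            delay = TimeSpan.FromMilliseconds(p.Delay.TotalMilliseconds * factor);
        }
        total += delay;
        if (total > p.MaxLatency) yield break; // stay inside the latency target
        yield return delay;
    }
}

// Two rows copied from the table above.
var policies = new[]
{
    new RetryPolicy("SQL Database, synchronous", TimeSpan.FromMilliseconds(200), true, 3,
                    TimeSpan.FromMilliseconds(50), BackoffKind.Linear),
    new RetryPolicy("SQL Database, asynchronous", TimeSpan.FromSeconds(60), false, 4,
                    TimeSpan.FromSeconds(5), BackoffKind.Exponential),
};

foreach (var p in policies)
{
    string schedule = string.Join(", ", Delays(p).Select(d => $"{d.TotalMilliseconds} ms"));
    Console.WriteLine($"{p.Name}: {schedule}");
}

enum BackoffKind { Linear, Exponential }

record RetryPolicy(string Name, TimeSpan MaxLatency, bool FastFirst, int RetryCount,
                   TimeSpan Delay, BackoffKind Backoff);
```

Expressing policies as data rather than code keeps synchronous and asynchronous contexts on different budgets without duplicating the retry loop.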

Circuit Breaker at Netflix

• A request to a remote service times out
• Thread pool and bounded task queue used to interact with a service dependency are at 100%
• Client library used to interact with a service dependency throws an exception

[Diagram: circuit On / Off; labels: Error Rate, Threshold, Criteria]

Circuit Breaker at Netflix - Fallbacks
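A minimal circuit breaker with a fallback might look like the sketch below. It is a simplification and not Netflix's implementation: it trips on a count of consecutive failures instead of an error rate, it is not thread-safe, and the class name, threshold and open duration are all assumptions chosen for illustration.

```csharp
using System;

var breaker = new CircuitBreaker(failureThreshold: 3, openDuration: TimeSpan.FromSeconds(30));

// Hypothetical dependency that always times out.
string CallRecommendations() => throw new TimeoutException("remote service timed out");

for (int i = 1; i <= 5; i++)
{
    string result = breaker.Execute(CallRecommendations, fallback: () => "default recommendations");
    Console.WriteLine($"call {i}: {result} (circuit open: {breaker.IsOpen})");
}

class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _consecutiveFailures;
    private DateTime _openedAtUtc = DateTime.MinValue;

    public CircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    // Open means calls are short-circuited to the fallback. After the open
    // duration elapses, the next call is allowed through as a trial.
    public bool IsOpen =>
        _consecutiveFailures >= _failureThreshold &&
        DateTime.UtcNow - _openedAtUtc < _openDuration;

    public T Execute<T>(Func<T> action, Func<T> fallback)
    {
        if (IsOpen) return fallback(); // do not touch the failing dependency

        try
        {
            T result = action();
            _consecutiveFailures = 0; // a success closes the circuit again
            return result;
        }
        catch (Exception)
        {
            _consecutiveFailures++;
            if (_consecutiveFailures >= _failureThreshold) _openedAtUtc = DateTime.UtcNow;
            return fallback();
        }
    }
}
```

In this run the first three calls reach the dependency and fail, the third failure opens the circuit, and calls four and five are answered by the fallback without calling the dependency at all.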

Deployment Redundancy

Failure Points

Definition: design elements that can cause an outage.

Focus on identifying design elements that are subject to external change. For example:
• Database connection
• Website connection
• Configuration file
• Registry key

Categories of common failure points:
• ACLs, database access, external web site/service access, transactions, configuration, capacity, network

Failure Modes

Definition: a predictable root cause of the outage that occurs at a Failure Point.

Examples of failure modes:
• Configuration file is not in correct location
• Too much traffic overusing resources
• Database reaches maximum capacity

The following would not be considered a failure mode:
• Product bugs
• Symptoms of problems
• Informational occurrences

Failure Mode Example


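    public int GetBusinessData(string[] parameters)
    {
        try
        {
            // Failure point: configuration file
            var config = Config.Open(_configPath);
            // Failure point: database server
            var conn = ConnectToDB(config.ConnectString);
            // Failure point: database table / stored procedure
            var data = conn.GetData(_sproc, parameters);
            return data;
        }
        catch (Exception e)
        {
            // Every failure mode is logged as the same event, then rethrown
            WriteEventLogEvent(100, E_ExceptionInDal);
            throw;
        }
    }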

Potential Failure Points:
• Database server
• Database table
• Configuration file

Potential Failure Modes:
• DB server not responding
• DB offline
• DB access denied
• Sproc execute denied
• DB doesn’t exist
• DB timeout on connect
• Index corrupt
• Database corrupt
• Table doesn’t exist
• Table corrupt
• Config file missing or invalid

Design for operations

Running a Live Site Service

Running without Insight / Telemetry

Capturing Insight

Log all internal/external “transactions” (database, web services, etc.):
• Application context (module/component)
• Host context (server/role/instance/process)
• Timing information (start/stop/duration)
• Activity identifier

Consolidate logs to a central system / dashboard for health monitoring and troubleshooting


Capturing Insight

• Capture timing and context information through helper delegates (background noise)
• Capture contextual errors (inner exceptions, etc.) on error
• Logging library is asynchronous (fire-and-forget) to avoid blocking
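The three points above could be combined roughly as follows. This is a sketch under assumptions: the helper name `Timed`, the log line format, and the in-memory queue standing in for a real logging library are all hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

// Stand-in for an asynchronous logging library: callers enqueue and move on,
// while a background task drains the queue.
var logQueue = new BlockingCollection<string>();
var writer = Task.Run(() =>
{
    foreach (var line in logQueue.GetConsumingEnumerable())
        Console.WriteLine(line);
});

// Helper delegate wrapper: records host, component, operation, activity id and
// duration for every call, plus exception details when the call fails.
T Timed<T>(string component, string operation, Guid activityId, Func<T> action)
{
    var sw = Stopwatch.StartNew();
    try
    {
        return action();
    }
    catch (Exception ex)
    {
        logQueue.Add($"{activityId} {component}.{operation} ERROR {ex.GetType().Name}: {ex.Message} inner={ex.InnerException?.Message}");
        throw;
    }
    finally
    {
        logQueue.Add($"{activityId} {Environment.MachineName} {component}.{operation} duration={sw.ElapsedMilliseconds}ms");
    }
}

// Usage: wrap a (simulated) database call.
var activity = Guid.NewGuid();
int rows = Timed("OrderDal", "GetOrders", activity, () => 42);
Console.WriteLine($"rows: {rows}");

// Flush the queue before the process exits.
logQueue.CompleteAdding();
await writer;
```

Because the wrapper only enqueues a string, the instrumented call pays almost nothing for logging, and the same activity identifier can be passed along to correlate entries across components.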

Many Options

Windows Azure Diagnostics

Designing for Insight

Instrument for production logging
• If you didn’t capture it, it didn’t happen

Implement inter-service monitoring and alerting
• Capture and quantify inter-service behavior and activity

Run-time configurable logging
• Enable activation (capture or delivery) of additional channels at run-time

Define ALM

[Diagram: ALM flow]
Environments: Dev Fabric, Dev on Azure (CI), QA/Pre-release on Azure, Production Release on Azure
Activities: Code, Unit Test, Run, Check In, Build, Automated Test, Test, Stage, Deploy, Monitor, Log Defect, Defect/Feature Triage, Plan Fixes/Updates, Plan, Design, Scope

Updating Configuration

For a production service, configuration == code

Need a rigorous ALM process for rolling out (and rolling back) updates to both.

Updating Services

“We want global, simultaneous production rollouts of our new code”
Are you sure about that?

Production rollouts:
• Running N, N+1 concurrently
• Rolling load over to N+1, ability to fall back

What is a health model?

Managed Entity
• Logical piece of an application
• A component that makes sense to an operator
• Each entity has a health state
• Entities can be external or internal
• Multiple instances of an entity may exist

Aspect
• Break down health state by functional team
• Must be mutually exclusive
• Group by organizational responsibility, e.g. security, performance, backup
• May be specific or non-technology, e.g. orders shipped

Operational Condition
• Defines level of operation currently available
• Normal state is fully functional
• Well designed applications may support partial operation, e.g. read only
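One way to picture the three terms together is as a small data structure: a managed entity holds one operational condition per aspect, and its overall health is the worst of them. The type names follow the slide, but the `Partial` and `Unavailable` values and the "worst aspect wins" roll-up rule are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var database = new ManagedEntity("Orders database", isExternal: true);
database.Report("Availability", OperationalCondition.Normal);
database.Report("Performance", OperationalCondition.Partial);
database.Report("Backup", OperationalCondition.Normal);

Console.WriteLine($"{database.Name}: {database.Overall}"); // prints "Orders database: Partial"

// Ordered from best to worst so that Max() yields the worst condition.
enum OperationalCondition { Normal = 0, Partial = 1, Unavailable = 2 }

class ManagedEntity
{
    // One condition per aspect (e.g. security, performance, backup).
    private readonly Dictionary<string, OperationalCondition> _aspects = new();

    public string Name { get; }
    public bool IsExternal { get; }

    public ManagedEntity(string name, bool isExternal)
    {
        Name = name;
        IsExternal = isExternal;
    }

    public void Report(string aspect, OperationalCondition condition) => _aspects[aspect] = condition;

    // Assumed roll-up rule: the entity is only as healthy as its worst aspect.
    public OperationalCondition Overall =>
        _aspects.Count == 0 ? OperationalCondition.Normal : _aspects.Values.Max();
}
```

Keying conditions by aspect keeps the aspects mutually exclusive, so each functional team reports on its own slice of the entity's health.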

Troubleshooting Workflow

• Detection: Is there a problem?
• Classification: What’s not working, how bad is it?
• Diagnosis: Why is there a problem?
• Recovery: What needs to be done to fix it?
• Verification: Is the problem really gone?

Resources

• Failsafe: Guidance for Resilient Cloud Architectures
  http://msdn.microsoft.com/en-us/library/jj853352.aspx
• Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services
  http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx
• Designing and Deploying Internet Scale Services
  https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf






Design for Scale

Scale

[Diagram: Resources, Demands]

Unit of Scale

[Diagram: Workloads; Scale by Units; Workload 1, Workload 2; Bottom, Ramp, Peak]


Data Partitioning

Understanding the 3Vs


Understanding Queryability


Horizontal Partitioning


Vertical Partitioning


Hybrid Partitioning

Data – to cache or not to cache….

Push vs. Pull

Load Balanced Push
• Synchronous and good for sequential processing
• Dependent on downstream services
• Throttling vs. performance

Managed Pull/Throughput
• Asynchronous and event-driven processing
• Easy parallelisation and pipelining
• Extending logic is easy

Logic based: priority, date, amount, etc.
Time based: ASAP, gradually, periodically, on-demand
Volume based: single, in batches

Data on the inside – Data on the outside

http://msdn.microsoft.com/en-us/library/ms954587.aspx

Reference Data
• Immutable (versions)
• Requires open schema for interop

Activity Data
• Low concurrency updates (e.g. shopping basket)

Resource (shared) Data
• Highly concurrent update (e.g. inventory)
• Should live in worker role

“Query Ready” Cache

Query patterns
• Push the data close to where it is queried (example: Bing Maps)

Process, structure, produce, format etc. data and cache “query ready” data
• Light/cheap data production is OK
• Pure and idempotent operations are usually good candidates

Duplication is OK
• Same data in a different format
• Same data in multiple places

This requires processing data before it is queried, NOT at query time

All data can be cached
Some data can be cached:
• Frequently used
• Process-heavy, expensive data
• Build as you go
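The idea of doing the work before the query arrives can be sketched as below. The order data, the `Publish` helper and the summary format are hypothetical, and an in-process dictionary stands in for whatever cache the application really uses.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Raw data as the application stores it (illustrative values).
var orders = new List<(string Customer, int Amount)>
{
    ("alice", 30), ("bob", 15), ("alice", 12),
};

// The "query ready" cache holds results already shaped for the reader.
var cache = new ConcurrentDictionary<string, string>();

// Pure and idempotent: the same orders always produce the same entry,
// so publishing twice is harmless.
void Publish(string customer)
{
    var mine = orders.Where(o => o.Customer == customer).ToList();
    cache[customer] = $"{customer}: {mine.Count} orders, total {mine.Sum(o => o.Amount)}";
}

// Processing happens when data changes, not when it is queried.
foreach (var customer in orders.Select(o => o.Customer).Distinct())
    Publish(customer);

// Query time is a lookup only.
Console.WriteLine(cache["alice"]); // prints "alice: 2 orders, total 42"
```

The summary duplicates data that already exists in the order list, which is the trade the slide accepts: extra copies in exchange for cheap reads.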

Distributed Caching

Simple to administer
• No need to manage and host a distributed cache yourself.

Integrates easily into existing applications
• ASP.NET session state and output cache providers enable no-code integration.

Same managed interfaces as Windows Server AppFabric Cache

[Diagram: On-Premises App (Core Logic, AppFabric Cache APIs, Windows Server AppFabric Cache) and Windows Azure App (Core Logic, AppFabric Cache APIs, Windows Azure AppFabric Caching)]


Data Resiliency


Backup and Restore


Backing Up Table and Blob Storage

[Diagram: Source, Replica, Log, Log Replica]


Managing Backed Up Data


CDN

[Diagram: Content Delivery Network; labels: Blob Service, three Edge Locations, pic1.jpg]
