Resilient Cloud Applications
Mark Simms (@mabsimms)
Principal Program Manager, Windows Azure Customer Advisory Team
Session Objectives
Designing resilient large-scale services requires careful design and architecture choices.
This session explores key patterns and practices for highly available cloud services, illustrated with customer examples.
Interactivity rocks: please ask questions throughout!
Setting the Stage
• Scalability
• Availability
• Insight
Setting the stage
• Maximize service availability for consumers: ensure customers (and client devices) can access and use the service.
• Minimize impact of failure on consumers: degrade gracefully, isolate faults, fall back to alternate delivery paths.
• Maximize performance and capacity: services that are “live” but cannot handle the desired or required demand are not available.
Musings on application design
Traditional web service design (N-tier):
• Make “everything stateless”
• Separate logic from data (state)
• Leverage specialized external state services: cache, load balancer, relational database, document database, key/value store, etc.
[Diagram: Load Balancer, Web Servers, App Servers, then Database, Distributed Cache, Doc Store, ...]
Musings on application design
No service is an island:
• Dependencies on other internal and external services
• Trading time-to-market and agility for control
[Diagram: Load Balancer, Web Servers, App Servers, then Database, Distributed Cache, Doc Store, ..., and External Services (SendGrid, Twitter, Facebook, etc.)]
What’s in a workload?
#1: Without the relational database, the application cannot fulfill any workloads.
#2: The relational database is an external service, subject to partial availability.
Designing for Failure
Decompose by Workload
Applications are composed of one or more workloads
• Products like SharePoint and Windows Server are designed with this principle in mind
Each with different profiles, requirements and boundaries
• Management, Availability, Operational, Cost, Health, Security, Capacity, etc.
Decomposition allows for workload-specific optimization
• Technology selections, scalability and availability approaches, etc.
What are the “9”s
Availability               Downtime per year   Downtime per month*   Downtime per week
90% ("one nine")           36.5 days           72 hours              16.8 hours
99% ("two nines")          3.65 days           7.20 hours            1.68 hours
99.9% ("three nines")      8.76 hours          43.2 minutes          10.1 minutes
99.99% ("four nines")      52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")     5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")     31.5 seconds        2.59 seconds          0.605 seconds
* Assuming a 30-day month.
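The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year and a 30-day month (the function and constant names are illustrative, not from the deck):

```python
HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month
HOURS_PER_WEEK = 7 * 24


def downtime_hours(availability_pct, period_hours):
    """Allowed downtime (in hours) for a given availability over a period."""
    return period_hours * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        print(pct, round(downtime_hours(pct, HOURS_PER_YEAR), 4), "hours/year")
```

For example, three nines over a year is 0.1% of 8,760 hours, which is 8.76 hours.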
Study Windows Azure Platform SLAs:
• Compute External Connectivity: 99.95% (2 or more instances)
• Compute Instance Availability: 99.9% (2 or more instances)
• Storage Availability: 99.9%
• SQL Azure Availability: 99.9%
The Truth About 9s
SLA = *
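Assuming the "SLA = *" shorthand refers to multiplying the availabilities of the services a request depends on in series (an interpretation, since the slide graphic did not survive), the composite figure can be sketched as follows. The function name is hypothetical.

```python
from math import prod


def composite_availability(*availabilities_pct):
    """Availability (in percent) of a request path that needs every listed service."""
    return 100.0 * prod(a / 100.0 for a in availabilities_pct)


if __name__ == "__main__":
    # Compute connectivity, storage and SQL Azure figures from the SLA list.
    print(round(composite_availability(99.95, 99.9, 99.9), 4))
```

Under that assumption, a workload that needs compute connectivity, storage and SQL Azure together lands at roughly 99.75%, lower than any single component.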
Define Your SLAs
Design for Failure
Given enough scale, time and pressure, all components or services will fail.
Your application will experience 1..N failures. How will your application behave?
• Gracefully handle failure modes and continue to deliver value
• Not so gracefully …
Fault types:
• Transient: temporary service interruptions, self-healing
• Enduring: require intervention
Failure Scope
• Node: individual nodes may fail. Connectivity issues (transient failures), hardware failures.
• Service: entire services may fail. Service dependencies (internal and external), configuration and code issues.
• Region: regions may become unavailable. Connectivity issues, acts of nature.
Handling Transient and Enduring Failures
• Use fault-handling frameworks that recognize transient errors
• Make it part of the background “noise”
• Appropriate retry and backoff policies
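A minimal sketch of the idea, not a specific framework: transient errors are retried with exponential backoff plus jitter, while any other (enduring) error propagates immediately. `TransientError` and `call_with_retry` are hypothetical names; a real client library would supply its own way to classify errors.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures expected to clear on their own."""


def call_with_retry(operation, retries=3, base_delay=0.05, sleep=time.sleep):
    """Run operation(); retry only on TransientError, backing off between tries."""
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= retries:
                raise  # retries exhausted: surface the failure to the caller
            delay = base_delay * (2 ** attempt)
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, base_delay))
            attempt += 1
```

Keeping the retry logic in one wrapper is what lets transient faults become background noise: callers see either a result or an error that deserves attention.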
Handling Transient and Enduring Failures
[Chart: Web Request Response Latency; series: Avg Latency, Response latency]
• At some point, your request is blocking the line
• Fail gracefully, and get out of the queue!
• Anti-patterns:
  • Too much trust in downstream services and client proxies
  • Not bounding non-deterministic calls
  • Blocking synchronous operations
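One way to bound a non-deterministic call is to give it a deadline and return a fallback when the deadline passes. The sketch below uses Python's standard `concurrent.futures`; `call_with_deadline` and the pool size are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Bounded pool: a slow dependency can tie up at most this many threads.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn, timeout_s, fallback):
    """Run fn() but wait no longer than timeout_s; return fallback on timeout."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a call that is already running cannot be interrupted.
        future.cancel()
        return fallback
```

The caller gets out of the queue on time even though the underlying call may still be finishing in the background.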
Sample Retry Policies
Platform       Context                                  Sample target e2e latency (max)   “Fast First”   Retry count   Delay    Backoff
SQL Database   Synchronous (e.g. render web page)       200 ms                            Yes            3             50 ms    Linear
SQL Database   Asynchronous (e.g. process queue item)   60 seconds                        No             4             5 s      Exponential
Azure Cache    Synchronous (e.g. render web page)       100 ms                            Yes            3             10 ms    Linear
Azure Cache    Asynchronous (e.g. process queue item)   500 ms                            Yes            3             100 ms   Exponential
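The policies can be expressed as data and turned into a delay schedule. This sketch rests on assumptions the deck does not spell out: "linear" is read as delay × step, "exponential" as delay doubling each step, and "fast first" as an immediate first retry. `RetryPolicy` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    retries: int
    delay_ms: int
    backoff: str       # "linear" or "exponential" (assumed meanings, see above)
    fast_first: bool   # assumed: the first retry happens with no delay

    def delays_ms(self):
        """Delay before each retry, in milliseconds."""
        delays = []
        for i in range(self.retries):
            step = i if self.fast_first else i + 1
            if step == 0:
                delays.append(0)
            elif self.backoff == "linear":
                delays.append(self.delay_ms * step)
            else:
                delays.append(self.delay_ms * 2 ** (step - 1))
        return delays


# Rows taken from the sample policy table.
SQL_SYNC = RetryPolicy(retries=3, delay_ms=50, backoff="linear", fast_first=True)
SQL_ASYNC = RetryPolicy(retries=4, delay_ms=5000, backoff="exponential", fast_first=False)
CACHE_ASYNC = RetryPolicy(retries=3, delay_ms=100, backoff="exponential", fast_first=True)
```

Comparing the sum of a schedule with the target end-to-end latency is a quick check that a policy can actually fit inside its budget.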
Circuit Breaker at Netflix
Criteria counted against the error rate threshold:
• A request to a remote service times out
• Thread pool and bounded task queue used to interact with a service dependency are at 100%
• Client library used to interact with a service dependency throws an exception
[Diagram: Criteria, Error Rate Threshold, circuit On / Off]
Circuit Breaker at Netflix - Fallbacks
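A minimal circuit breaker with a fallback, as a sketch only. It is not Netflix's implementation: it trips on a count of consecutive failures rather than an error rate, and the class name, parameters and defaults are assumptions.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()      # open: skip the dependency entirely
            self.opened_at = None      # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0              # success closes the circuit again
        return result
```

The fallback might return cached data, a default value, or a degraded response, so the consumer still gets something useful while the dependency recovers.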
Deployment Redundancy
Failure Points
Definition: design elements that can cause an outage.
Focus on identifying design elements that are subject to external change. For example:
• Database connection
• Website connection
• Configuration file
• Registry key
Categories of common failure points: ACLs, database access, external web site/service access, transactions, configuration, capacity, network.
Failure Modes
Definition: a predictable root cause of the outage that occurs at a Failure Point.
Examples of failure modes:
• Configuration file is not in correct location
• Too much traffic overusing resources
• Database reaches maximum capacity
The following would not be considered failure modes:
• Product bugs
• Symptoms of problems
• Informational occurrences
Failure Mode Example
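public int GetBusinessData(string[] parameters)
{
    try
    {
        // Failure point: configuration file
        var config = Config.Open(_configPath);
        // Failure point: database server
        var conn = ConnectToDB(config.ConnectString);
        // Failure point: stored procedure and the tables behind it
        var data = conn.GetData(_sproc, parameters);
        return data;
    }
    catch (Exception)
    {
        // A single generic event hides which failure mode occurred.
        WriteEventLogEvent(100, E_ExceptionInDal);
        throw;
    }
}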
Potential Failure Points:
• Database Server
• Database Table
• Configuration File
Potential Failure Modes:
• DB Server not responding
• DB offline
• DB access denied
• Sproc execute denied
• DB doesn’t exist
• DB timeout on connect
• Index corrupt
• Database corrupt
• Table doesn’t exist
• Table corrupt
• Config file missing or invalid
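One way to make failure modes visible is to report each one with its own event rather than a single catch-all. The sketch below is in Python rather than the C# of the example, and covers only the configuration file failure point; `FailureMode`, `load_config` and the event IDs 101 and 102 are hypothetical.

```python
import json


class FailureMode(Exception):
    """A predictable root cause, tagged with the failure point where it occurred."""

    def __init__(self, event_id, failure_point, mode):
        super().__init__(f"[{event_id}] {failure_point}: {mode}")
        self.event_id = event_id
        self.failure_point = failure_point
        self.mode = mode


def load_config(path):
    """Read a JSON config file, mapping each known failure mode to its own event."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError as e:
        raise FailureMode(101, "Configuration File", "config file missing") from e
    except json.JSONDecodeError as e:
        raise FailureMode(102, "Configuration File", "config file invalid") from e
```

An operator who sees event 101 knows where to look without reading a stack trace; the same treatment would apply to the database failure modes listed above.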
Design for operations
Running a Live Site Service
Running without Insight / Telemetry
Capturing Insight
Log all internal/external “transactions” (database, web services, etc.):
• Application context (module/component)
• Host context (server/role/instance/process)
• Timing information (start/stop/duration)
• Activity identifier
Consolidate logs to a central system / dashboard for health monitoring and troubleshooting.
MICROSOFT CONFIDENTIAL – INTERNAL ONLY
Capturing Insight
• Capture timing and context information through helper delegates (background noise)
• Capture contextual errors (inner exceptions, etc.) on error
• Logging library is asynchronous (fire-and-forget) to avoid blocking
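A sketch of these three points using Python's standard `logging` module: a helper wraps each "transaction", records context and timing, and hands the record to a queue so the caller never waits on log I/O. The `tracked` helper, the logger name and the message format are assumptions.

```python
import logging
import logging.handlers
import queue
import socket
import time
import uuid
from contextlib import contextmanager

# Fire-and-forget: callers only enqueue; a background thread does the writing.
log_queue = queue.Queue()
logger = logging.getLogger("telemetry")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))
listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
listener.start()


@contextmanager
def tracked(component, operation, activity_id=None):
    """Log host, component, operation, outcome and duration for one transaction."""
    activity_id = activity_id or str(uuid.uuid4())
    start = time.perf_counter()
    status = "ok"
    try:
        yield activity_id
    except Exception as exc:
        status = f"error:{type(exc).__name__}"
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "activity=%s host=%s component=%s op=%s status=%s duration_ms=%.1f",
            activity_id, socket.gethostname(), component, operation, status, duration_ms,
        )
```

Usage would look like `with tracked("orders", "db.query"): run_query()`, where `run_query` stands in for any database or web service call.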
Many Options
Windows Azure Diagnostics
Designing for Insight
• Instrument for production logging: if you didn’t capture it, it didn’t happen.
• Implement inter-service monitoring and alerting: capture and quantify inter-service behavior and activity.
• Run-time configurable logging: enable activation (capture or delivery) of additional channels at run-time.
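Run-time configurable logging can be as simple as re-applying log levels from a configuration source without restarting the process. A minimal sketch, assuming the configuration arrives as a mapping of logger name to level name (how it is delivered is out of scope here, and `apply_log_config` is a hypothetical name):

```python
import logging


def apply_log_config(config):
    """Set the level of each named logger, e.g. {"svc.payments": "debug"}."""
    for name, level in config.items():
        # setLevel accepts standard level names and raises ValueError otherwise.
        logging.getLogger(name).setLevel(level.upper())
```

Turning one component up to debug level during an incident, then back down afterwards, avoids paying for verbose logging all the time.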
Define ALM
[Diagram: application lifecycle. Labels: Dev Fabric (Code, Unit Test, Run, Check In); Dev on Azure / CI (Build, Automated Test, Run, Test, Deploy); QA/Pre-release on Azure / Stage (Deploy, Test, Monitor); Production Release on Azure; Log Defect; Defect/Feature Triage; Plan Fixes/Updates; Plan, Design, Scope]
Updating Configuration
For a production service, configuration == code.
Need a rigorous ALM process for rolling out (and rolling back) updates to both.
Updating Services
“We want global, simultaneous production rollouts of our new code.” Are you sure about that?
Production rollouts:
• Running N and N+1 concurrently
• Rolling load over to N+1, with the ability to fall back
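The rolling approach can be sketched as a loop that shifts traffic to N+1 in stages and falls back to N when a health check fails. The step sizes and the `is_healthy` callback are assumptions; a real rollout would drive a load balancer or traffic manager.

```python
def staged_rollout(steps, is_healthy):
    """Return the history of N+1's traffic share; 0.0 is appended on fallback."""
    history = []
    for share in steps:
        history.append(share)          # route this fraction of load to N+1
        if not is_healthy(share):
            history.append(0.0)        # fall back: all traffic returns to N
            break
    return history
```

For instance, `staged_rollout([0.05, 0.25, 1.0], check)` moves 5%, then 25%, then all traffic, stopping at the first stage where `check` reports a problem.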
What is a health model?
Managed Entity
• Logical piece of an application
• A component that makes sense to an operator
• Each entity has a health state
• Entities can be external or internal
• Multiple instances of an entity may exist
Aspect
• Break down health state by functional team
• Must be mutually exclusive
• Group by organizational responsibility, e.g. security, performance, backup
• May be specific or non-technology, e.g. orders shipped
Operational Condition
• Defines level of operation currently available
• Normal state is fully functional
• Well-designed applications may support partial operation, e.g. read-only
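The three concepts map naturally onto a small data structure. In this sketch the condition names, and the rule that an entity's health is its worst aspect, are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Condition(IntEnum):
    FULL = 0      # normal state: fully functional
    PARTIAL = 1   # e.g. read-only
    DOWN = 2      # hypothetical: no operation available


@dataclass
class ManagedEntity:
    name: str
    aspects: dict = field(default_factory=dict)  # aspect name -> Condition

    def report(self, aspect, condition):
        self.aspects[aspect] = condition

    def health(self):
        """Assumed roll-up rule: the worst aspect determines entity health."""
        return max(self.aspects.values(), default=Condition.FULL)
```

An entity such as an orders database could then report `performance` as `FULL` and `backup` as `PARTIAL`, and roll up to `PARTIAL` on an operator dashboard.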
Troubleshooting Workflow
• Detection: Is there a problem?
• Classification: What’s not working, and how bad is it?
• Diagnosis: Why is there a problem?
• Recovery: What needs to be done to fix it?
• Verification: Is the problem really gone?
Resources
• Failsafe: Guidance for Resilient Cloud Architectures (http://msdn.microsoft.com/en-us/library/jj853352.aspx)
• Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services (http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx)
• Designing and Deploying Internet Scale Services (https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf)
Design for Scale
Scale
[Diagram: Resources vs. Demands]
Unit of Scale
[Diagram: Workloads; Scale by Units; Workload 1, Workload 2; Bottom, Ramp, Peak]
Data Partitioning
Understanding the 3Vs
Understanding Queryability
Horizontal Partitioning
Vertical Partitioning
Hybrid Partitioning
Data – to cache or not to cache….
Push vs. Pull
Load Balanced Push
• Sync and good for sequential processing
• Dependent on downstream services
• Throttling vs. Performance
Managed Pull/Throughput
• Asynchronous and event-driven processing
• Easy parallelisation and pipelining
• Extending logic is easy
Logic based: Priority, Date, Amount, etc.
Time based: ASAP, Gradually, Periodically, On-Demand
Volume based: Single, In Batches
Data on the inside – Data on the outside
http://msdn.microsoft.com/en-us/library/ms954587.aspx
Reference Data
• Immutable (versions)
• Requires open schema for interop
Activity Data
• Low concurrency updates (e.g. shopping basket)
Resource (shared) Data
• Highly concurrent update (e.g. inventory)
• Should live in worker role
“Query Ready” Cache
Query patterns
• Push the data close to where it is queried (example: Bing Maps)
Process, structure, produce, format etc. data and cache “query ready” data
• Light/cheap data production is OK
Pure and idempotent operations are usually good candidates
Duplication is OK
• Same data in a different format
• Same data in multiple places
This requires processing data before it is queried, NOT at query time
• All data can be cached
• Some data can be cached: frequently used; process-heavy, expensive data; build as you go
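A small sketch of the "query ready" idea: the expensive work (aggregation and serialization) happens when data arrives, so a query is a plain lookup. The order records, field names and `QueryReadyCache` class are hypothetical.

```python
import json


class QueryReadyCache:
    def __init__(self):
        self._by_city = {}

    def ingest(self, orders):
        """Aggregate and pre-serialize at write time, not at query time."""
        totals = {}
        for order in orders:
            totals[order["city"]] = totals.get(order["city"], 0) + order["amount"]
        # Store the exact response body so reads need no further processing.
        self._by_city = {
            city: json.dumps({"city": city, "total": total})
            for city, total in totals.items()
        }

    def query(self, city):
        """Return the pre-built response for a city, or None if unknown."""
        return self._by_city.get(city)
```

Because `ingest` is pure with respect to its input, running it again on the same data gives the same cached result, which is what makes it a good candidate for this treatment.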
Distributed Caching
• Simple to administer: no need to manage and host a distributed cache yourself.
• Integrates easily into existing applications: ASP.NET session state and output cache providers enable no-code integration.
• Same managed interfaces as Windows Server AppFabric Cache
[Diagram: On-Premises App (Core Logic, AppFabric Cache APIs) uses Windows Server AppFabric Cache; Windows Azure App (Core Logic, AppFabric Cache APIs) uses Windows Azure AppFabric Caching]
Data Resiliency
Backup and Restore
Backing Up Table and Blob Storage
[Diagram: Source Replica, Log, Log Replica]
Managing Backed Up Data
CDN
[Diagram: Content Delivery Network; pic1.jpg in the Blob Service and at three Edge Locations]